
Tuesday, September 24, 2019

How many protein-coding genes in the human genome? (2)

It's difficult to know how many protein-coding genes there are in the human genome because there are several different ways of counting and the counts depend on what criteria are used to identify a gene. Last year I commented on a review by Abascal et al. (2018) that concluded there were somewhere between 19,000 and 20,000 protein-coding genes. Those authors discussed the problems with annotation and pointed out that the major databases don't agree on the number of genes [How many protein-coding genes in the human genome?].

Abascal et al. also said that before publication of the human genome most researchers were expecting about 25,000 - 40,000 genes so the actual number of protein-coding genes is pretty close to those estimates. (Keep in mind that there are several thousand noncoding genes.) This helps to debunk the standard myth that scientists were expecting 100,000 or more genes [False History and the Number of Genes 2010] [Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome].

Now there's a new review that continues this discussion (Hatje et al., 2019). One of the best things in this latest review is a new figure showing a much better history of gene number estimates. Readers might recall that back in 2010, Pertea and Salzberg published some false information on this subject [see False History and the Number of Genes 2010]. A modified version of their figure (right) was published just last year in Nature (Willyard, 2018).

Here's the new figure from Hatje et al. (below). It's much better but it still gives too much credence to high estimates of gene number that were not supported by reliable data or logic (e.g. US Human Genome Project (1990), CpG Islands, EST data, and GeneSweep). However, the new figure is a far more accurate history than the one published by Pertea and Salzberg—don't you agree?


This really should put an end to the ridiculous myth that experts were "shocked" and "surprised" at the low number of genes in the human genome back in 2001. Let's hope that we never have to hear that canard again, especially in the scientific literature.

There's another figure in the Hatje et al. paper that nicely illustrates the differences between various ways of calculating the number of protein-coding genes. The estimates by Gencode, Ensembl, and RefSeq have drifted downward so that they now cluster around 20,000 genes. Unfortunately, these three databases do not agree on the core number of genes—only about 19,000 are common to all three databases. Those estimates are based mostly on computer models of potential genes with help from human annotators.


CCDS, neXtProt, and PeptideAtlas are databases that require independent evidence that a potential gene is functional. Usually this means identifying the protein product [see How many proteins in the human proteome?, How many different proteins are made in a typical human cell?, How many proteins do humans make?]. These values are increasing over the years so that we can be confident that there are at least 19,000 protein-coding genes but probably not more than 20,000.


Thanks to Martin Kollmar for alerting me to this paper from his lab.

Hatje, K., Mühlhausen, S., Simm, D., and Kollmar, M. (2019) The Protein-Coding Human Genome: Annotating High-Hanging Fruits. BioEssays 1900066. [doi: 10.1002/bies.201900066]

Pertea, M., and Salzberg, S. (2010) Between a chicken and a grape: estimating the number of human genes. Genome Biology 11:206. [doi: 10.1186/gb-2010-11-5-206]

11 comments :

Joe Felsenstein said...

I wish there were some data in the big 24-year gap from 1966 to 1990 in the Hatje et al. plot (the second graph that you show here). Just representing it by a zigzaggy thingie on the horizontal axis gives too much visual impression of continuity.

Presumably the 1965 figure in that plot from "genome length / gene length" does not take stretches of junk into account, thus counting them as containing genes. If the gene lengths were computed from protein (and RNA sizes), they would be far too small and the gene count would be much higher. So the 1965 figure must somehow have taken intron lengths into account, even though they hadn't been discovered yet.

Joe Felsenstein said...

typo: ... from protein (and RNA) sizes ...

Larry Moran said...

One of the ways of estimating gene number was to assume that there were 5,000 genes in the fruit fly genome based on the idea that there was one gene per band in polytene chromosomes. Given the size of a gene from that assumption, Vogel estimated that there would be 60,000 genes in the human genome.

However, saturation mapping of parts of the Drosophila genome indicated that there were likely TWO genes per band, on average. By the late 1960s it was known that more than half the human genome was repetitive DNA and it was assumed that genes were confined to unique sequence DNA. Thus, the original estimate of 60,000 genes was still pretty good.

Lots of people, including me, thought that humans were unlikely to have an order of magnitude more genes than fruit flies so the idea that humans had a few tens of thousands of genes was pretty popular among my friends in the 1970s.

Larry Moran said...

Estimates of gene number based on RNA association kinetics (Rot curves) were quite popular in the early 1970s - not 1990 as the chart implies. These estimates came in at around 30,000 genes, which agreed with the genetic load estimates.

Larry Moran said...

These three methods: (1) genetic load, (2) RNA association kinetics, and (3) dividing genome size by estimates of gene size, were the only methods available in the early 1970s. The numbers appeared to be consistent with sequence and genetic data throughout the 1980s so there was no reason to change the earlier estimates.

The first large-scale sequence of a portion of the human genome was published in 1992. It covered 4,000 kb of the MHC locus and there were 100 genes in this region for an average of one gene every 40,000 bp. If you extrapolate to the entire genome there should be 80,000 genes but everyone knew that the MHC locus was unusually rich in genes and gene families. Thus it seemed likely that there were a lot fewer than 80,000 genes in the human genome.
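
The extrapolation in the comment above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the genome size of about 3.2 billion bp is an assumption that is not stated in the comment, and the variable names are made up for the example.

```python
# Figures taken from the comment above.
region_bp = 4_000 * 1_000       # 4,000 kb of the MHC locus, in base pairs
genes_in_region = 100           # genes reported in that region

# Assumption (not in the comment): a haploid human genome of ~3.2 billion bp.
genome_bp = 3_200_000_000

# Average spacing of genes in the sequenced region.
bp_per_gene = region_bp // genes_in_region

# Naive extrapolation: assume the whole genome has the same gene density.
extrapolated_genes = genome_bp // bp_per_gene

print(bp_per_gene, extrapolated_genes)
```

Because the MHC locus is unusually gene-rich, a figure obtained this way is best read as an upper bound rather than an estimate.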

Joe Felsenstein said...

Thanks, Larry for the explanations.

Rafal Grochala said...

Nice summary. Straight to the bookmarks.

Unknown said...

Hi Joe,
introns were not known at that time, but it was known from the work by Monod that genes can be far longer than their coding sequence. At that time the extra length was thought to be "regulatory sequence". We added that to the Box 2 text, where we provide some more details on gene number estimates than in the main text.
Best, Martin

Unknown said...

Hi Larry,
thanks for commenting on our review. I tried hard to find any publication about gene number estimates between 1965 and 1990, but failed. I know all the methods you mention in your comments above, and I know these were used to estimate gene size and other aspects. But I couldn't find a single study with a sentence like "based on our method we propose/extrapolate the number of human genes to xx". All the numbers mentioned in any of the papers from those long 25 years refer to previous publications. The citation practice seemed to be different at that time. All these papers cited reviews and textbooks from the late 1960s, and those again cite earlier reviews (mainly from Muller), and only when you dig through all these reviews do you end up with the original work from the 1940s. That's maybe also the reason that all these early papers have just a handful of citations each. Everybody seemed to believe in these numbers and they were in textbooks, so there was no reason to present new numbers (total human gene numbers). Researchers seemed to concentrate on more important questions (junk, introns, etc.), so who cared about a slightly different total number? Or do you know a specific publication from the 1970s or 1980s giving an estimate of "total human genes"?
Best wishes, Martin

Larry Moran said...

The Rot data comes mostly from the early 1970s and it represents a new estimate of gene number that doesn't depend on the genetic load argument. Benjamin Lewin wrote several reviews of the data and his 1974 book (Gene Expression) is the best summary of what was known at that time. I think it's fair to say that we had new independent data on gene number in the early 1970s.

The genetic load argument was constantly being revised and updated as we learned more and more about genes and genomes. The C-value debate played an important role in our understanding that humans could have less than 50,000 genes in spite of the fact that we had a large genome. This argument didn't become prominent until the 1970s.

I realize that the original estimates of gene number based on genetic load didn't change much when Ohno published his 1972 paper (~30,000 genes) but, because of more recent data, that estimate was more reliable in 1972 than it was in 1948.

The relevance of the Drosophila data was constantly being re-evaluated in the 1970s, especially after Judd published his saturation experiments in 1972 and 1973. I think it's fair to point out that estimates of gene number in humans, based on analogy with Drosophila, were stronger in the 70s than they were in the 60s.

For these reasons, I think your graph would have been better if it had included 1970s estimates of the number of genes in humans. It would have shown that a strong consensus was developing that humans had fewer than 50,000 genes (most likely only 30,000).

Unknown said...

I absolutely agree with your comment. But this would put two different kinds of information in one plot: a) papers with gene number estimates, and b) data/methods supporting one or the other estimate. In the 1970s there are many publications, as you point out, that nicely present new methods and data, but in the end "just" say, "well, our data/method supports the earlier estimate of 30,000 genes" and cite a review from the late 1960s. The 1974 book is great, but cites "only" reviews from the late 1960s as well, and those point to just the few publications shown in the plot. Reviews/papers after 1974 point to this book or newer editions.

Drawing a plot based on "supporting data/methods" is not that easy. The methods used in the 1990s also "support" the high numbers, and the software used has thousands of citations. Just as researchers in the 1960s couldn't understand why genes are far longer than their coding sequence, researchers in the 1990s didn't imagine how complex and diverse genes are (e.g. exon length from 1 bp to 4000 bp, gene length from 50 aa to thousands of aa, alternative splicing etc.). Thus the data and methods from the 1990s are ok, but the conclusions are very wrong. How do you put that into a plot?

I agree with Joe's comment above that we also should have made a plot with a linear x-axis, at least as a small inset, to demonstrate how long the 30,000 gene number was the agreed number compared to the very short time when high numbers were hyped.