
Tuesday, September 24, 2019

How many protein-coding genes in the human genome? (2)

It's difficult to know how many protein-coding genes there are in the human genome because there are several different ways of counting and the counts depend on what criteria are used to identify a gene. Last year I commented on a review by Abascal et al. (2018) that concluded there were somewhere between 19,000 and 20,000 protein-coding genes. Those authors discussed the problems with annotation and pointed out that the major databases don't agree on the number of genes [How many protein-coding genes in the human genome?].

Abascal et al. also said that before publication of the human genome most researchers were expecting about 25,000 - 40,000 genes so the actual number of protein-coding genes is pretty close to those estimates. (Keep in mind that there are several thousand noncoding genes.) This helps to debunk the standard myth that scientists were expecting 100,000 or more genes [False History and the Number of Genes 2010] [Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome].

Now there's a new review that continues this discussion (Hatje et al., 2019). One of the best things in this latest review is a new figure showing a much better history of gene number estimates. Readers might recall that back in 2010, Pertea and Salzberg published some false information on this subject [see False History and the Number of Genes 2010]. A modified version of their figure (right) was published just last year in Nature (Willyard, 2018).

Here's the new figure from Hatje et al. (below). It's much better but it still gives too much credence to high estimates of gene number that were not supported by reliable data or logic (e.g. US Human Genome Project (1990), CpG Islands, EST data, and GeneSweep). However, the new figure is a far more accurate history than the one published by Pertea and Salzberg—don't you agree?

This really should put an end to the ridiculous myth that experts were "shocked" and "surprised" at the low number of genes in the human genome back in 2001. Let's hope that we never have to hear that canard again, especially in the scientific literature.

There's another figure in the Hatje et al. paper that nicely illustrates the differences between various ways of calculating the number of protein-coding genes. The estimates by Gencode, Ensembl, and RefSeq have drifted downward so that they now cluster around 20,000 genes. Unfortunately, these three databases do not agree on the core number of genes—only about 19,000 are common to all three databases. Those estimates are based mostly on computer models of potential genes with help from human annotators.

CCDS, neXtProt, and PeptideAtlas are databases that require independent evidence that a potential gene is functional. Usually this means identifying the protein product [see How many proteins in the human proteome?, How many different proteins are made in a typical human cell?, How many proteins do humans make?]. These values are increasing over the years so that we can be confident that there are at least 19,000 protein-coding genes but probably not more than 20,000.

Thanks to Martin Kollmar for alerting me to this paper from his lab.

Hatje, K., Mühlhausen, S., Simm, D., and Kollmar, M. (2019) The Protein-Coding Human Genome: Annotating High-Hanging Fruits. BioEssays 1900066. [doi: 10.1002/bies.201900066]

Pertea, M., and Salzberg, S. (2010) Between a chicken and a grape: estimating the number of human genes. Genome Biology, 11:206. [doi: 10.1186/gb-2010-11-5-206]


  1. I wish there were some data in the big 24-year gap from 1966 to 1990 in the Hatje et al. plot (the second graph that you show here). Just representing it by a zigzaggy thingie on the horizontal axis gives too much visual impression of continuity.

    Presumably the 1965 figure in that plot from "genome length / gene length" does not take stretches of junk into account, thus counting them as containing genes. If the gene lengths were computed from protein (and RNA) sizes, they would be far too small and the gene count would be much higher. So the 1965 figure must somehow have taken intron lengths into account, even though they hadn't been discovered yet.

  2. One of the ways of estimating gene number was to assume that there were 5,000 genes in the fruit fly genome based on the idea that there was one gene per band in polytene chromosomes. Given the size of a gene from that assumption, Vogel estimated that there would be 60,000 genes in the human genome.

    However, saturation mapping of parts of the Drosophila genome indicated that there were likely TWO genes per band, on average. By the late 1960s it was known that more than half the human genome was repetitive DNA and it was assumed that genes were confined to unique sequence DNA. Thus, the original estimate of 60,000 genes was still pretty good.

    Lots of people, including me, thought that humans were unlikely to have an order of magnitude more genes than fruit flies so the idea that humans had a few tens of thousands of genes was pretty popular among my friends in the 1970s.

  3. Estimates of gene number based on RNA association kinetics (Rot curves) were quite popular in the early 1970s - not 1990 as the chart implies. These estimates came in at around 30,000 genes which agreed with the genetic load estimates.

  4. These three methods: (1) genetic load, (2) RNA association kinetics, and (3) dividing genome size by estimates of gene size, were the only methods available in the early 1970s. The numbers appeared to be consistent with sequence and genetic data throughout the 1980s so there was no reason to change the earlier estimates.

    The first large-scale sequence of a portion of the human genome was published in 1992. It covered 4,000 kb of the MHC locus and there were 100 genes in this region for an average of one gene every 40,000 bp. If you extrapolate to the entire genome there should be 80,000 genes but everyone knew that the MHC locus was unusually rich in genes and gene families. Thus it seemed likely that there were a lot fewer than 80,000 genes in the human genome.
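
    The MHC extrapolation above is simple enough to check by hand, but here is a minimal sketch of the arithmetic. The region size and gene count are the ones quoted in the comment; the genome size of about 3.2 billion bp is an assumption added here (the comment doesn't state it), and the function and variable names are purely illustrative.

```python
def extrapolate_gene_count(region_bp, genes_in_region, genome_bp):
    """Scale a regional gene density up to a whole genome.

    Assumes genes are spread evenly across the genome, which is exactly
    the assumption that fails for a gene-rich region like the MHC locus.
    """
    bp_per_gene = region_bp // genes_in_region
    return genome_bp // bp_per_gene


# Figures quoted in the comment above.
MHC_REGION_BP = 4_000 * 1_000   # 4,000 kb
MHC_GENES = 100

# Assumption (not stated in the comment): haploid human genome of ~3.2 billion bp.
HUMAN_GENOME_BP = 3_200_000_000

bp_per_gene = MHC_REGION_BP // MHC_GENES
estimate = extrapolate_gene_count(MHC_REGION_BP, MHC_GENES, HUMAN_GENOME_BP)

print(f"One gene every {bp_per_gene:,} bp")
print(f"Extrapolated genome-wide count: {estimate:,} genes")
```

    Because the MHC locus is denser in genes than the genome as a whole, the result is best read as an upper bound, which is the point the comment makes.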