Monday, March 19, 2007

Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome

In April 2005 Gil Ast published an article in Scientific American (Ast, 2005). The title of the article was “The Alternative Genome” and its main point was how alternative splicing in humans could increase the number of different proteins that we produce. He explains why he thinks the proteome is so much larger than the number of genes. (Ast claims that there are 90,000 proteins and only 25,000 genes.)

Ast begins his argument with the quotation below.
Spring of 2000 found molecular biologists placing dollar bets, trying to predict the number of genes that would be found in the human genome when the sequence of its DNA nucleotides was completed. Estimates at the time ranged as high as 153,000. ... given our complexity we ought to have a bigger genetic assortment than the 1000-cell roundworm, Caenorhabditis elegans, which has a 19,500-gene complement, or corn, with its 40,000 genes.

When a first draft of the human sequence was published the following summer, some observers were therefore shocked by the sequencing team’s calculation of 30,000 to 35,000 protein-coding genes. The low number seemed almost embarrassing.
Ast's remarks illustrate two points that I want to address. The first point is the surprise factor. Ast, and some other scientists, were surprised (and embarrassed) by the low gene count. They imply that most genome experts were also shocked when the genome sequence was published. That’s not quite correct, as I will show below.

The second point will have to be put off for another time but it’s important enough to mention here. Ast thinks that humans need to make many times more proteins than worms and corn because we are so much more complex. There are two problems with such a point of view—are we, in fact, 2-3 times more complex than corn? And, does it take thousands of new proteins to generate the structures that make us unique?

I think some people exaggerate our complexity and the place of humans relative to other species. This incorrect perspective can cause some scientists to put their faith in weakly supported hypotheses that claim to explain why humans really are complex and important in spite of the fact that we don’t have a lot of genes.

But let's put that discussion aside for a few days in order to discuss the historic estimates of the number of genes in the human genome. The statement by Gil Ast is typical of those who are embarrassed. They exaggerate the estimates of the total number of genes in order to make it look like everyone—not just them—thought there would be far more genes than the 25,000 that have been found. Just this month (March 2007) this myth was repeated by Taft et al. (2007).
Predictions of the estimated number of protein-coding genes in the human genome prior to genome sequencing ranged from as low as 50,000 to as high as 140,000, whereas the latest estimates from genome analysis indicate that humans have approximately 20,000 protein-coding genes.
The graphic above was taken from the Genesweep lottery. This is the betting that Ast refers to. It shows the range of gene number estimates by scientists who were involved in genome sequencing projects. Note that there are many estimates in the 40,000-50,000 range and a fair number below 40,000. The point is obvious: lots of experts anticipated fewer than 50,000 genes in the human genome (see The nature of the number. Nature Genetics 25:127 (2000)).

The earliest estimates of gene number are based on genetic load arguments (see King & Jukes, 1969). Since approximate mutation rates were known by 1960, it was possible to estimate the maximum number of genes that could be mutated without presenting an impossible genetic load. In other words, how many genes could we have before the number of lethal mutations per generation became intolerable? This number was less than 40,000 genes, an estimate that has never been refuted or discredited. Many experts were well aware of this upper bound up until the time the genome sequence was published.
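The arithmetic behind the genetic load argument can be sketched in a few lines of Python. This is only an illustration, not the calculation the original authors performed. It assumes a deleterious mutation rate of 1 in 100,000 per locus per generation, which is the figure in the Ohno passage quoted in the comments below, and it assumes that new mutations are Poisson distributed. The function names are made up for this example.

```python
import math

# Assumption: 1e-5 deleterious mutations per locus per generation
# (the figure in the Ohno quotation; not a measured value from this post).
PER_LOCUS_RATE = 1e-5

def genomic_deleterious_rate(n_genes, per_locus_rate=PER_LOCUS_RATE):
    """Expected number of new deleterious mutations per genome per generation."""
    return n_genes * per_locus_rate

def mutation_free_fraction(n_genes, per_locus_rate=PER_LOCUS_RATE):
    """Fraction of offspring with no new deleterious mutation (Poisson assumption)."""
    return math.exp(-genomic_deleterious_rate(n_genes, per_locus_rate))

for n in (10_000, 30_000, 40_000, 100_000):
    u = genomic_deleterious_rate(n)
    free = mutation_free_fraction(n)
    print(f"{n:>7,} genes: U = {u:.2f}, mutation-free offspring = {free:.2f}")
```

With these assumptions the genome-wide rate reaches one new deleterious mutation per generation at 100,000 loci, which is the point Ohno describes as an unbearably heavy load. The lower published limits come from further considerations, such as dominance, that this sketch leaves out.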

By the 1970s there were good estimates of the total number of unique Drosophila melanogaster genes that could be mutated to lethality. The range was about 5,000-10,000 genes and this correlated well with the genetic map and the organization of polytene chromosomes. It was known that the Drosophila genome was much larger than the total size of the estimated number of genes but studies from a number of labs confirmed that a great deal of genomic DNA was repetitive junk DNA.

As we learned more and more about how genes controlled development, it became clear that huge differences in morphology and "complexity" could be due to very small changes in either the number of regulatory genes or when they were expressed. Most of the people who assimilated the advances in developmental biology began to appreciate that mammals do not need to have many more genes than fruit flies.

By 1980, the amount of unique sequence DNA in mammalian genomes was known to be capable of encoding fewer than 20,000 genes if the average size of a gene was 10,000 bp (including introns). We now know that much of the intron sequence is not unique sequence DNA but that wasn't known back then. This estimate of gene number was consistent with detailed analysis of the amount of DNA that could be protected by mRNA or by Rot analysis (kinetics of hybridization of RNA to DNA). Mouse embryos (gastrula) appeared to express about 20,000 average-sized mRNAs. Some of these were present at very low abundance, leading to the idea that this value may represent most of the mouse genes in the genome (summarized in Lewin, 1980). Certainly it was known that mammals expressed about 10,000 housekeeping genes in most cells and tissues. The general consensus was that the total number of regulatory genes was unlikely to be more than twice this number (probably less) for a total of 30,000 genes at most.
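The coding-capacity estimate is a simple division, and a short Python sketch shows how sensitive it is to the assumed gene size. The 200,000,000 bp input is not a measurement reported here. It is just the amount of unique sequence implied by the two figures in the paragraph above (20,000 genes at 10,000 bp each), and the function name is hypothetical.

```python
def max_gene_count(unique_sequence_bp: int, average_gene_bp: int) -> int:
    """Upper bound on gene number if all unique-sequence DNA belonged to genes."""
    return unique_sequence_bp // average_gene_bp

# Assumption: back-calculated from 20,000 genes x 10,000 bp, for illustration only.
IMPLIED_UNIQUE_BP = 200_000_000

for gene_bp in (5_000, 10_000, 20_000):
    limit = max_gene_count(IMPLIED_UNIQUE_BP, gene_bp)
    print(f"average gene {gene_bp:>6,} bp -> at most {limit:,} genes")
```

Halving or doubling the assumed average gene size doubles or halves the bound, so an estimate of this kind is only as good as the gene-size figure that goes into it.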

It was about this time that Walter Gilbert made his famous back-of-the-envelope calculation of 100,000 genes in the human genome. This was the estimate that became widely quoted when the human genome project was first proposed. It's interesting to note that Gilbert's estimate was not based on any experimental evidence; indeed, it conflicted with most of the available evidence suggesting far fewer genes. The larger number seemed less threatening to scientists who were worried that we might not have more genes than a fruit fly.

By the late 1990s we had estimates of the total number of human genes from the sequences of chromosomes 21 and 22 and from the sequence of a large contiguous region of the MHC locus. The results suggested fewer than 45,000 genes total, and even fewer if these sequenced regions turned out to be gene rich, as was widely suspected. Thus, the number of genes was coming out to be well below 50,000 and this was in line with the data from RNA hybridization studies and genetic load. It also fit with the concept that the number of genes in mammals was probably not more than twice the number in insects.

In contrast to these results, the estimates from expressed sequence tags (ESTs) were often much higher. Expressed sequence tags are short copies of RNA isolated directly from cells. The idea was that these represented bits of mRNA so each one revealed the presence of a protein-coding gene. As more and more ESTs were deposited in the sequence libraries, it became possible to estimate when the library would be complete and the totals were often more than 100,000 distinct mRNAs. For example, just before the human genome sequence was published, Liang et al. (2000) estimated that there were 120,000 genes based on the analysis of 2 million ESTs.

Not everyone believed in the validity of the EST data. There were some who thought that most ESTs were artifacts. They turned out to be correct although this is not widely appreciated. By using the sequences of chromosomes 21 and 22 as controls, Ewing and Green (2000) were able to estimate 35,000 genes based on the EST libraries.

Thus, by the time the draft sequence was published in 2001 there were many scientists who anticipated that the number of genes would be less than 40,000 and that's why there are so many bets in that range in the Genesweep lottery. When the number of genes was announced to be about 30,000 there were many of us who were not the least bit surprised. The only ones who were surprised were those who ignored most of the data and clung to the idea that humans had to have far more genes than the so-called "lower" species.

It is simply not true that all the experts were surprised at the low number of genes. Some experts were, but many were not. The interesting thing is that those who wanted there to be more genes have not given up the fight. They continue to publish rationalizations and just-so stories in an attempt to justify why they were wrong.

UPDATE: The latest estimates indicate that the human genome contains about 20,500 protein-encoding genes [Humans Have Only 20,500 Protein-Encoding Genes]. There are probably about 1500 genes for the known stable RNAs for a total of 22,000.

Ast,G. (2005) The alternative genome. Sci. Am. 292; 40-47.

Ewing,B. and Green,P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25; 232-234.

King,J.L. and Jukes,T.H. (1969) Non-Darwinian evolution. Science 164; 788-798.

Lewin,B. (1980) Gene Expression-2 2nd ed. Chapter 24: Complexity of mRNA Populations.

Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L., and Quackenbush,J. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25; 239-240.


  1. I recently read Carl Sagan's The Dragons of Eden (1977) wherein he argued that genome size is directly correlated with 'bits of information within the brain' (assuming, of course, that most of the genome was functional). It should be noted that he also estimated the human genome to be somewhere in the range of ~115 gigabases. It's nice to know that people were making rational estimates about total gene number far before that. This history lesson was very enlightening to a relative newcomer like myself, thank you.

  2. Very interesting! A friend and I argued about the whole superior complexity of humans issue, so I can't wait to see what you think about that. (I said we're no more complex than many other organisms.)

  3. I took an upper level genetics course as an undergraduate in about 1978 or 79, from Larry Sandler at the University of Washington. At that time, he estimated that there were 10-15,000 genes in Drosophila, and he guessed that there were probably twice that number in humans.

    Sagan was a smart guy, but his biology wasn't that good. We knew very well in 1977 that genome size was all over the map, and that correlating it with brain size made absolutely no sense -- or salamanders and ferns would have been the super geniuses of the natural world.

  4. In his 1972 Junk DNA paper Ohno gives a number of 30000 genes that other authors calculated from the mutation rate.

  5. Martin says,

    In his 1972 Junk DNA paper Ohno gives a number of 30000 genes that other authors calculated from the mutation rate.

    Thanks, I forgot that the estimate of total gene number was a key part of the original conclusion about junk DNA.

    Here's what Ohno said,

    In fact, there seems to be a strict upper limit for the number of genes which we can afford to keep in our genome. Consequently, only a fraction of our DNA appears to function as genes. The observations on a number of structural gene loci of man, mice, and other organisms revealed that each locus has a [1/100,000] per generation probability of sustaining a deleterious mutation. It then follows that the moment we acquire [100,000] gene loci, the overall deleterious mutation rate per generation becomes 1.0 which appears to represent an unbearably heavy genetic load. Taking into consideration the fact that deleterious mutations can be dominant or recessive, the total number of gene loci of man has been estimated to be about [30,000] (Muller, 1967; Crow and Kimura, 1970).

    My point, as I'm sure you know, is that there were many people who predicted fewer than 40,000 genes in the human genome. Many of us believed in those predictions and we weren't surprised by the number of genes when the sequence of the human genome was published.

    Nevertheless, an urban legend has grown up claiming that "scientists" were expecting 100,000 genes and they were surprised at the low number. My goal here is to debunk that legend before it gets elevated to the level of fact.

    There's a method to my madness. As we'll see, some recent papers purport to "explain" this "anomaly" by resorting to bizarre just-so stories about alternative splicing, non-coding RNAs, and huge regulatory sequences. One of the problems with these stories is that they are "solving" a problem that never existed in the first place.

  6. anonymous asks,

    Do you have a comment on this paper?

    My first impression is that it's mostly garbage and should never have been published.

    They're saying that 75% of non-coding DNA in Drosophila melanogaster has a function. I'm going to need a lot more evidence before I believe that.

  7. It appears that we're now down to 18,000, by the way. More here.

  8. The review I wrote in 2004 at least turned out to be the right side of wrong! (25,000: low, but not low enough.) "Has the Yo-yo stopped? a human gene number update" (2004) Proteomics 6 1712-26. PMID: 15174140