Ast begins his argument with the quotation below.
"Spring of 2000 found molecular biologists placing dollar bets, trying to predict the number of genes that would be found in the human genome when the sequence of its DNA nucleotides was completed. Estimates at the time ranged as high as 153,000. ... given our complexity we ought to have a bigger genetic assortment than the 1000-cell roundworm, Caenorhabditis elegans, which has a 19,500-gene complement, or corn, with its 40,000 genes. When a first draft of the human sequence was published the following summer, some observers were therefore shocked by the sequencing team’s calculation of 30,000 to 35,000 protein-coding genes. The low number seemed almost embarrassing."
Ast's remarks illustrate two points that I want to address. The first point is the surprise factor. Ast, and some other scientists, were surprised (and embarrassed) by the low gene count. They imply that most genome experts were also shocked when the genome sequence was published. That’s not quite correct, as I will show below.
The second point will have to be put off for another time but it’s important enough to mention here. Ast thinks that humans need to make many times more proteins than worms and corn because we are so much more complex. There are two problems with this point of view. Are we, in fact, two to three times more complex than corn? And does it take thousands of new proteins to generate the structures that make us unique?
I think some people exaggerate our complexity and the place of humans relative to other species. This incorrect perspective can cause some scientists to put their faith in weakly supported hypotheses that claim to explain why humans really are complex and important in spite of the fact that we don’t have a lot of genes.
But let's put that discussion aside for a few days in order to discuss the historical estimates of the number of genes in the human genome. The statement by Gil Ast is typical of those who are embarrassed. They exaggerate the estimates of the total number of genes in order to make it look like everyone, not just them, thought there would be far more genes than the 25,000 that have been found. Just this month (March 2007) this myth was repeated by Taft et al. (2007).
"Predictions of the estimated number of protein-coding genes in the human genome prior to genome sequencing ranged from as low as 50,000 to as high as 140,000, whereas the latest estimates from genome analysis indicate that humans have approximately 20,000 protein-coding genes."
The graphic above was taken from the Genesweep lottery. This is the betting that Ast refers to. It shows the range of gene number estimates by scientists who were involved in genome sequencing projects. Note that there are many estimates in the 40,000-50,000 range and a fair number below 40,000. The point is obvious: lots of experts anticipated fewer than 50,000 genes in the human genome (see The nature of the number. Nature Genetics 25:127 (2000)).
The earliest estimates of gene number were based on genetic load arguments (see King & Jukes, 1969). Since approximate mutation rates were known by 1960, it was possible to estimate the maximum number of genes that could be mutated without presenting an impossible genetic load. In other words, how many genes could we have before the number of lethal mutations per generation became intolerable? The answer was fewer than 40,000 genes, an estimate that has never been refuted or discredited. Many experts were well aware of this upper bound right up until the genome sequence was published.
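The shape of the genetic load argument can be sketched numerically. This is a minimal illustration, not the calculation from King & Jukes (1969): the per-gene rate of 1e-5 harmful mutations per generation and the Poisson model are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

# ASSUMPTION (not from the text): harmful mutations per gene per generation.
ASSUMED_PER_GENE_RATE = 1e-5


def new_mutations_per_zygote(n_genes, per_gene_rate=ASSUMED_PER_GENE_RATE):
    """Expected new harmful mutations in a diploid zygote (two gametes)."""
    return 2 * n_genes * per_gene_rate


def mutation_free_fraction(expected_mutations):
    """Poisson probability that a zygote carries no new harmful mutation."""
    return math.exp(-expected_mutations)


if __name__ == "__main__":
    # Compare the 40,000-gene bound from the text with a 100,000-gene genome.
    for n_genes in (40_000, 100_000):
        expected = new_mutations_per_zygote(n_genes)
        fraction = mutation_free_fraction(expected)
        print(f"{n_genes:>7} genes: {expected:.2f} new mutations per zygote, "
              f"{fraction:.1%} of zygotes mutation-free")
```

Under these assumed inputs the expected number of new harmful mutations grows linearly with gene number, while the fraction of unaffected offspring falls exponentially. That is why the argument yields an upper bound on gene number rather than a point estimate.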
By the 1970s there were good estimates of the total number of unique Drosophila melanogaster genes that could be mutated to lethality. The range was about 5,000-10,000 genes and this correlated well with the genetic map and the organization of polytene chromosomes. It was known that the Drosophila genome was much larger than the combined size of the estimated number of genes, but studies from a number of labs confirmed that a great deal of genomic DNA was repetitive junk DNA.
As we learned more and more about how genes control development, it became clear that huge differences in morphology and "complexity" could be due to very small changes in either the number of regulatory genes or when they were expressed. Most of the people who assimilated the advances in developmental biology began to appreciate that mammals do not need to have many more genes than fruit flies.
By 1980, the amount of unique sequence DNA in mammalian genomes was known to be capable of encoding fewer than 20,000 genes if the average size of a gene was 10,000 bp (including introns). We now know that much intron sequence is not unique sequence DNA but that wasn't known back then. This estimate of gene number was consistent with detailed analysis of the amount of DNA that could be protected by mRNA or by Rot analysis (kinetics of hybridization of RNA to DNA). Mouse embryos (gastrula) appeared to express about 20,000 average-sized mRNAs. Some of these were present at very low abundance, leading to the idea that this value might represent most of the genes in the mouse genome (summarized in Lewin, 1980). Certainly it was known that mammals expressed about 10,000 housekeeping genes in most cells and tissues. The general consensus was that the total number of regulatory genes was unlikely to be more than twice this number (probably less) for a total of 30,000 genes at most.
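The arithmetic behind these bounds is simple enough to write out. This sketch uses only the figures quoted in the paragraph above. The function names are hypothetical, and the "implied" amount of unique sequence DNA is just the product of two quoted figures, not a measurement reported here.

```python
# Figures quoted in the paragraph above.
MAX_GENES_FROM_UNIQUE_DNA = 20_000   # upper bound on genes from unique sequence DNA
AVERAGE_GENE_SIZE_BP = 10_000        # average gene size, including introns
HOUSEKEEPING_GENES = 10_000          # genes expressed in most cells and tissues
MAX_REGULATORY_MULTIPLE = 2          # regulatory genes: at most twice housekeeping


def implied_unique_dna_bp(max_genes, gene_size_bp):
    """Amount of DNA needed to hold max_genes genes of the given size."""
    return max_genes * gene_size_bp


def max_total_genes(housekeeping, regulatory_multiple):
    """Housekeeping genes plus the largest allowed number of regulatory genes."""
    return housekeeping + regulatory_multiple * housekeeping


if __name__ == "__main__":
    dna = implied_unique_dna_bp(MAX_GENES_FROM_UNIQUE_DNA, AVERAGE_GENE_SIZE_BP)
    total = max_total_genes(HOUSEKEEPING_GENES, MAX_REGULATORY_MULTIPLE)
    print(f"Implied unique sequence DNA: {dna:,} bp")
    print(f"Upper bound on total genes: {total:,}")
```

The second function reproduces the 30,000 ceiling: 10,000 housekeeping genes plus at most 20,000 regulatory genes.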
It was about this time that Walter Gilbert made his famous back-of-the-envelope calculation of 100,000 genes in the human genome. This was the estimate that became widely quoted when the human genome project was first proposed. It's interesting to note that Gilbert's estimate was not based on any experimental evidence; indeed, it conflicted with most of the available evidence suggesting far fewer genes. The larger number seemed less threatening to scientists who were worried that we might not have more genes than a fruit fly.
By the late 1990s we had estimates of the total number of human genes from the sequences of chromosomes 21 and 22 and from the sequence of a large contiguous region of the MHC locus. The results suggested fewer than 45,000 genes in total, and even fewer if these sequenced regions turned out to be gene rich, as was widely suspected. Thus, the number of genes was coming out to be well below 50,000 and this was in line with the data from RNA hybridization studies and genetic load. It also fit with the concept that the number of genes in mammals was probably not more than twice the number in insects.
In contrast to these results, the estimates from expressed sequence tags (ESTs) were often much higher. Expressed sequence tags are short copies of RNA isolated directly from cells. The idea was that these represented bits of mRNA, so each one revealed the presence of a protein-coding gene. As more and more ESTs were deposited in the sequence libraries, it became possible to estimate when the library would be complete, and the totals were often more than 100,000 distinct mRNAs. For example, just before the human genome sequence was published, Liang et al. (2000) estimated that there were 120,000 genes based on the analysis of 2 million ESTs.
Not everyone believed in the validity of the EST data. There were some who thought that most ESTs were artifacts. They turned out to be correct, although this is not widely appreciated. By using the sequences of chromosomes 21 and 22 as controls, Ewing and Green (2000) were able to estimate 35,000 genes based on the EST libraries.
Thus, by the time the draft sequence was published in 2001 there were many scientists who anticipated that the number of genes would be less than 40,000 and that's why there are so many bets in that range in the Genesweep lottery. When the number of genes was announced to be about 30,000 there were many of us who were not the least bit surprised. The only ones who were surprised were those who ignored most of the data and clung to the idea that humans had to have far more genes than the so-called "lower" species.
It is simply not true that all the experts were surprised at the low number of genes. Some experts were, but many were not. The interesting thing is that those who wanted there to be more genes have not given up the fight. They continue to publish rationalizations and just-so stories in an attempt to justify why they were wrong.
UPDATE: The latest estimates indicate that the human genome contains about 20,500 protein-encoding genes [Humans Have Only 20,500 Protein-Encoding Genes]. There are probably about 1500 genes for the known stable RNAs for a total of 22,000.
Ast, G. (2005) The alternative genome. Sci. Am. 292:40-47.
Ewing, B. and Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25:232-234.
King, J.L. and Jukes, T.H. (1969) Non-Darwinian evolution. Science 164:788-798.
Lewin, B. (1980) Gene Expression-2, 2nd ed. Chapter 24: Complexity of mRNA Populations.
Liang, F., Holt, I., Pertea, G., Karamycheva, S., Salzberg, S.L., and Quackenbush, J. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25:239-240.