Sandwalk: Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome

Monday, March 19, 2007

Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome

In April 2005 Gil Ast published an article in Scientific American (Ast, 2005). The title of the article was “The Alternative Genome” and its main point was how alternative splicing in humans could increase the number of different proteins that we produce. He explains why he thinks the proteome is so much larger than the number of genes. (Ast claims that there are 90,000 proteins and only 25,000 genes.)

Ast begins his argument with the quotation below.

Spring of 2000 found molecular biologists placing dollar bets, trying to predict the number of genes that would be found in the human genome when the sequence of its DNA nucleotides was completed. Estimates at the time ranged as high as 153,000. ... given our complexity we ought to have a bigger genetic assortment than the 1000-cell roundworm, Caenorhabditis elegans, which has a 19,500-gene complement, or corn, with its 40,000 genes.

When a first draft of the human sequence was published the following summer, some observers were therefore shocked by the sequencing team’s calculation of 30,000 to 35,000 protein-coding genes. The low number seemed almost embarrassing.

Ast's remarks illustrate two points that I want to address. The first point is the surprise factor. Ast, and some other scientists, were surprised (and embarrassed) by the low gene count. They imply that most genome experts were also shocked when the genome sequence was published. That’s not quite correct, as I will show below.

The second point will have to be put off for another time but it’s important enough to mention here. Ast thinks that humans need to make many times more proteins than worms and corn because we are so much more complex. There are two problems with such a point of view. Are we, in fact, 2-3 times more complex than corn? And does it take thousands of new proteins to generate the structures that make us unique?

I think some people exaggerate our complexity and the place of humans relative to other species. This incorrect perspective can cause some scientists to put their faith in weakly supported hypotheses that claim to explain why humans really are complex and important in spite of the fact that we don’t have a lot of genes.

But let's put that discussion aside for a few days in order to discuss the historical estimates of the number of genes in the human genome. The statement by Gil Ast is typical of those who are embarrassed. They exaggerate the estimates of the total number of genes in order to make it look like everyone, not just them, thought there would be far more genes than the 25,000 that have been found. Just this month (March 2007) this myth was repeated by Taft et al. (2007).

Predictions of the estimated number of protein-coding genes in the human genome prior to genome sequencing ranged from as low as 50,000 to as high as 140,000, whereas the latest estimates from genome analysis indicate that humans have approximately 20,000 protein-coding genes.

The graphic above was taken from the Genesweep lottery. This is the betting that Ast refers to. It shows the range of gene number estimates by scientists who were involved in genome sequencing projects. Note that there are many estimates in the 40,000-50,000 range and a fair number below 40,000. The point is obvious: lots of experts anticipated fewer than 50,000 genes in the human genome (see The nature of the number. Nature Genetics 25:127 (2000)).

The earliest estimates of gene number were based on genetic load arguments (see King & Jukes, 1969). Since approximate mutation rates were known by 1960, it was possible to estimate the maximum number of genes that could be mutated without presenting an impossible genetic load. In other words, how many genes could we have before the number of lethal mutations per generation became intolerable? This number was less than 40,000 genes, an estimate that has never been refuted or discredited. Many experts were well aware of this upper bound up until the time the genome sequence was published.
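The arithmetic behind a genetic load argument is simple enough to sketch in a few lines of Python. This is only an illustration, not the calculation King and Jukes actually performed. The per-locus rate of 1/100,000 deleterious mutations per generation is the figure Ohno used (quoted in the comments below), and the function names are made up for this example.

```python
# Rough sketch of a genetic load calculation.
# Assumption: each gene locus has a 1/100,000 chance per generation of
# sustaining a deleterious mutation (the figure quoted from Ohno, 1972).
RATE_PER_LOCUS = 1e-5


def expected_deleterious_mutations(n_genes: int, rate: float = RATE_PER_LOCUS) -> float:
    """Expected number of new deleterious mutations per genome per generation."""
    return n_genes * rate


def max_genes_for_load(max_load: float, rate: float = RATE_PER_LOCUS) -> float:
    """Largest gene count that keeps the expected load at or below max_load."""
    return max_load / rate


if __name__ == "__main__":
    # Compare a few gene counts mentioned in this post.
    for n in (10_000, 30_000, 40_000, 100_000):
        load = expected_deleterious_mutations(n)
        print(f"{n:>7,} genes: {load:.2f} deleterious mutations per generation")
```

With this rate, 100,000 genes gives about one new deleterious mutation per generation, which is the "unbearably heavy" load Ohno describes. The published estimate of about 30,000 genes also took dominant and recessive mutations into account, which this sketch doesn't try to model.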

By the 1970s there were good estimates of the total number of unique Drosophila melanogaster genes that could be mutated to lethality. The range was about 5,000-10,000 genes and this correlated well with the genetic map and the organization of polytene chromosomes. It was known that the Drosophila genome was much larger than the total size of the estimated number of genes but studies from a number of labs confirmed that a great deal of genomic DNA was repetitive junk DNA.

As we learned more and more about how genes controlled development, it became clear that huge differences in morphology and "complexity" could be due to very small changes in either the number of regulatory genes or when they were expressed. Most of the people who assimilated the advances in developmental biology began to appreciate that mammals do not need to have many more genes than fruit flies.

By 1980, the amount of unique sequence DNA in mammalian genomes was known to be capable of encoding fewer than 20,000 genes if the average size of a gene was 10,000 bp (including introns). We now know that much of the intron sequence is not unique sequence DNA but that wasn't known back then. This estimate of gene number was consistent with detailed analysis of the amount of DNA that could be protected by mRNA or by Rot analysis (kinetics of hybridization of RNA to DNA). Mouse embryos (gastrula) appeared to express about 20,000 average-sized mRNAs. Some of these were present at very low abundance, leading to the idea that this value might represent most of the genes in the mouse genome (summarized in Lewin, 1980). Certainly it was known that mammals expressed about 10,000 housekeeping genes in most cells and tissues. The general consensus was that the total number of regulatory genes was unlikely to be more than twice this number (probably less) for a total of 30,000 genes at most.

It was about this time that Walter Gilbert made his famous back-of-the-envelope calculation of 100,000 genes in the human genome. This was the estimate that became widely quoted when the human genome project was first proposed. It's interesting to note that Gilbert's estimate was not based on any experimental evidence; indeed, it conflicted with most of the available evidence suggesting far fewer genes. The larger number seemed less threatening to scientists who were worried that we might not have more genes than a fruit fly.

By the late 1990s we had estimates of the total number of human genes from the sequences of chromosomes 21 and 22 and from the sequence of a large contiguous region of the MHC locus. The results suggested fewer than 45,000 genes total, and even fewer if these sequenced regions turned out to be gene rich, as was widely suspected. Thus, the number of genes was coming out to be well below 50,000 and this was in line with the data from RNA hybridization studies and genetic load. It also fit with the concept that the number of genes in mammals was probably not more than twice the number in insects.

In contrast to these results, the estimates from expressed sequence tags (ESTs) were often much higher. Expressed sequence tags are short copies of RNA isolated directly from cells. The idea was that these represented bits of mRNA so each one revealed the presence of a protein-coding gene. As more and more ESTs were deposited in the sequence libraries, it became possible to estimate when the library would be complete and the totals were often more than 100,000 distinct mRNAs. For example, just before the human genome sequence was published, Liang et al. (2000) estimated that there were 120,000 genes based on the analysis of 2 million ESTs.

Not everyone believed in the validity of the EST data. There were some who thought that most ESTs were artifacts. They turned out to be correct, although this is not widely appreciated. By using the sequences of chromosomes 21 and 22 as controls, Ewing and Green (2000) were able to estimate 35,000 genes based on the EST libraries.

Thus, by the time the draft sequence was published in 2001 there were many scientists who anticipated that the number of genes would be less than 40,000 and that's why there are so many bets in that range in the Genesweep lottery. When the number of genes was announced to be about 30,000 there were many of us who were not the least bit surprised. The only ones who were surprised were those who ignored most of the data and clung to the idea that humans had to have far more genes than the so-called "lower" species.

It is simply not true that all the experts were surprised at the low number of genes. Some experts were, but many were not. The interesting thing is that those who wanted there to be more genes have not given up the fight. They continue to publish rationalizations and just-so stories in an attempt to justify why they were wrong.

UPDATE: The latest estimates indicate that the human genome contains about 20,500 protein-encoding genes [Humans Have Only 20,500 Protein-Encoding Genes]. There are probably about 1500 genes for the known stable RNAs for a total of 22,000.

Ast, G. (2005) The alternative genome. Sci. Am. 292:40-47.

Ewing, B. and Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25:232-234.

King, J.L. and Jukes, T.H. (1969) Non-Darwinian evolution. Science 164:788-798.

Lewin, B. (1980) Gene Expression-2, 2nd ed. Chapter 24: Complexity of mRNA Populations.

Liang, F., Holt, I., Pertea, G., Karamycheva, S., Salzberg, S.L., and Quackenbush, J. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25:239-240.

8 comments:

Carlo said...: I recently read Carl Sagan's The Dragons of Eden (1977) wherein he argued that genome size is directly correlated with 'bits of information within the brain' (assuming, of course, that most of the genome was functional). It should be noted that he also estimated the human genome to be somewhere in the range of ~115 gigabases. It's nice to know that people were making rational estimates about total gene number far before that. This history lesson was very enlightening to a relative newcomer like myself, thank you.; Monday, March 19, 2007 7:03:00 PM
Anonymous said...: Very interesting! A friend and I argued about the whole superior complexity of humans issue, so I can't wait to see what you think about that. (I said we're no more complex than many other organisms); Monday, March 19, 2007 8:49:00 PM
SPARC said...: In his 1972 Junk DNA paper Ohno gives a number of 30000 genes that other authors calculated from the mutation rate.; Tuesday, March 20, 2007 12:02:00 AM
Larry Moran said...: Martin says,

In his 1972 Junk DNA paper Ohno gives a number of 30000 genes that other authors calculated from the mutation rate.

Thanks, I forgot that the estimate of total gene number was a key part of the original conclusion about junk DNA.

Here's what Ohno said,

In fact, there seems to be a strict upper limit for the number of genes which we can afford to keep in our genome. Consequently, only a fraction of our DNA appears to function as genes. The observations on a number of structural gene loci of man, mice, and other organisms revealed that each locus has a [1/100,000] per generation probability of sustaining a deleterious mutation. It then follows that the moment we acquire [100,000] gene loci, the overall deleterious mutation rate per generation becomes 1.0 which appears to represent an unbearably heavy genetic load. Taking into consideration the fact that deleterious mutations can be dominant or recessive, the total number of gene loci of man has been estimated to be about [30,000] (Muller, 1967; Crow and Kimura, 1970).

My point, as I'm sure you know, is that there were many people who predicted fewer than 40,000 genes in the human genome. Many of us believed in those predictions and we weren't surprised by the number of genes when the sequence of the human genome was published.

Nevertheless, an urban legend has grown up claiming that "scientists" were expecting 100,000 genes and they were surprised at the low number. My goal here is to debunk that legend before it gets elevated to the level of fact.

There's a method to my madness. As we'll see, some recent papers purport to "explain" this "anomaly" by resorting to bizarre just-so stories about alternative splicing, non-coding RNAs, and huge regulatory sequences. One of the problems with these stories is that they are "solving" a problem that never existed in the first place.; Tuesday, March 20, 2007 10:19:00 AM
Anonymous said...: Do you have a comment on this paper?; Wednesday, March 21, 2007 2:32:00 PM
Larry Moran said...: anonymous asks,

Do you have a comment on this paper?

My first impression is that it's mostly garbage and should never have been published.

They're saying that 75% of non-coding DNA in Drosophila melanogaster has a function. I'm going to need a lot more evidence before I believe that.; Wednesday, March 21, 2007 3:38:00 PM
Carl said...: It appears that we're now down to 18,000, by the way. More here.; Thursday, March 22, 2007 11:23:00 AM
Anonymous said...: The review I wrote in 2004 at least turned out to be the right side of wrong! (25,000, low but not low enough; "Has the Yo-yo stopped? A human gene number update" (2004) Proteomics 6 1712-26. PMID: 15174140); Saturday, March 01, 2008 11:42:00 AM
