Sandwalk: How many protein-coding genes in the human genome? (2)

Tuesday, September 24, 2019

How many protein-coding genes in the human genome? (2)

It's difficult to know how many protein-coding genes there are in the human genome because there are several different ways of counting and the counts depend on what criteria are used to identify a gene. Last year I commented on a review by Abascal et al. (2018) that concluded there were somewhere between 19,000 and 20,000 protein-coding genes. Those authors discussed the problems with annotation and pointed out that the major databases don't agree on the number of genes [How many protein-coding genes in the human genome?].

Abascal et al. also said that before publication of the human genome most researchers were expecting about 25,000 - 40,000 genes so the actual number of protein-coding genes is pretty close to those estimates. (Keep in mind that there are several thousand noncoding genes.) This helps to debunk the standard myth that scientists were expecting 100,000 or more genes [False History and the Number of Genes 2010] [Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome].

Now there's a new review that continues this discussion (Hatje et al., 2019). One of the best things in this latest review is a new figure showing a much better history of gene number estimates. Readers might recall that back in 2010, Pertea and Salzberg published some false information on this subject [see False History and the Number of Genes 2010]. A modified version of their figure (right) was published just last year in Nature (Willyard, 2018).

Here's the new figure from Hatje et al. (below). It's much better but it still gives too much credence to high estimates of gene number that were not supported by reliable data or logic (e.g. US Human Genome Project (1990), CpG Islands, EST data, and GeneSweep). However, the new figure is a far more accurate history than the one published by Pertea and Salzberg—don't you agree?

This really should put an end to the ridiculous myth that experts were "shocked" and "surprised" at the low number of genes in the human genome back in 2001. Let's hope that we never have to hear that canard again, especially in the scientific literature.

There's another figure in the Hatje et al. paper that nicely illustrates the differences between various ways of calculating the number of protein-coding genes. The estimates by Gencode, Ensembl, and RefSeq have drifted downward so that they now cluster around 20,000 genes. Unfortunately, these three databases do not agree on the core number of genes—only about 19,000 are common to all three databases. Those estimates are based mostly on computer models of potential genes with help from human annotators.
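The "common to all three databases" comparison is a set intersection. Here is a minimal sketch of the idea in Python. The gene identifiers and the `core_and_union` helper are hypothetical placeholders for illustration only; they are not taken from Gencode, Ensembl, or RefSeq, and matching entries across real databases takes more than comparing names.

```python
def core_and_union(*catalogs):
    """Return (genes present in every catalog, genes present in any catalog)."""
    core = set.intersection(*catalogs)
    union = set.union(*catalogs)
    return core, union


# Toy catalogs with made-up identifiers (assumption: illustration only).
gencode = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
ensembl = {"GENE_A", "GENE_B", "GENE_C", "GENE_E"}
refseq = {"GENE_A", "GENE_B", "GENE_D", "GENE_F"}

core, union = core_and_union(gencode, ensembl, refseq)

# Each catalog lists 4 genes, but only 2 are shared by all three.
print(sorted(core))   # genes common to all three catalogs
print(len(union))     # genes listed by at least one catalog
```

Each toy catalog reports the same total, yet the shared core is smaller than any one of them. That is the pattern described above, where the three databases each report about 20,000 genes but agree on only about 19,000.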

CCDS, neXtProt, and PeptideAtlas are databases that require independent evidence that a potential gene is functional. Usually this means identifying the protein product [see How many proteins in the human proteome?, How many different proteins are made in a typical human cell?, How many proteins do humans make?]. These values are increasing over the years so that we can be confident that there are at least 19,000 protein-coding genes but probably not more than 20,000.

Thanks to Martin Kollmar for alerting me to this paper from his lab.

Hatje, K., Mühlhausen, S., Simm, D., and Kollmar, M. (2019) The Protein-Coding Human Genome: Annotating High-Hanging Fruits. BioEssays 1900066. [doi: 10.1002/bies.201900066]

Pertea, M., and Salzberg, S. (2010) Between a chicken and a grape: estimating the number of human genes. Genome Biology 11:206. [doi: 10.1186/gb-2010-11-5-206]

11 comments :

Joe Felsenstein said...: I wish there were some data in the big 24-year gap from 1966 to 1990 in the Hatje et al. plot (the second graph that you show here). Just representing it by a zigzaggy thingie on the horizontal axis gives too much visual impression of continuity.

Presumably the 1965 figure in that plot from "genome length / gene length" does not take stretches of junk into account, thus counting them as containing genes. If the gene lengths were computed from protein (and RNA sizes), they would be far too small and the gene count would be much higher. So the 1965 figure must somehow have taken intron lengths into account, even though they hadn't been discovered yet.; Sunday, September 29, 2019 5:07:00 PM
Joe Felsenstein said...: typo: ... from protein (and RNA) sizes ...; Sunday, September 29, 2019 5:08:00 PM
Larry Moran said...: One of the ways of estimating gene number was to assume that there were 5,000 genes in the fruit fly genome based on the idea that there was one gene per band in polytene chromosomes. Given the size of a gene from that assumption, Vogel estimated that there would be 60,000 genes in the human genome.

However, saturation mapping of parts of the Drosophila genome indicated that there were likely TWO genes per band, on average. By the late 1960s it was known that more than half the human genome was repetitive DNA and it was assumed that genes were confined to unique sequence DNA. Thus, the original estimate of 60,000 genes was still pretty good.

Lots of people, including me, thought that humans were unlikely to have an order of magnitude more genes than fruit flies so the idea that humans had a few tens of thousands of genes was pretty popular among my friends in the 1970s.; Monday, September 30, 2019 9:01:00 AM
Larry Moran said...: Estimates of gene number based on RNA association kinetics (Rot curves) were quite popular in the early 1970s - not 1990 as the chart implies. These estimates came in at around 30,000 genes which agreed with the genetic load estimates.; Monday, September 30, 2019 9:07:00 AM
Larry Moran said...: These three methods: (1) genetic load, (2) RNA association kinetics, and (3) dividing genome size by estimates of gene size, were the only methods available in the early 1970s. The numbers appeared to be consistent with sequence and genetic data throughout the 1980s so there was no reason to change the earlier estimates.

The first large-scale sequence of a portion of the human genome was published in 1992. It covered 4,000 kb of the MHC locus and there were 100 genes in this region for an average of one gene every 40,000 bp. If you extrapolate to the entire genome there should be 80,000 genes but everyone knew that the MHC locus was unusually rich in genes and gene families. Thus it seemed likely that there were a lot fewer than 80,000 genes in the human genome.; Monday, September 30, 2019 9:22:00 AM
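The MHC extrapolation in the comment above is simple arithmetic, sketched here in Python. The region size (4,000 kb) and gene count (100) come from the comment. The total genome size of 3.2 billion bp is an assumption, chosen because it is the value that reproduces the quoted 80,000 figure; the function name is also just illustrative.

```python
def extrapolate_gene_count(region_bp, genes_in_region, genome_bp):
    """Scale the gene density of one sequenced region up to the whole genome."""
    return genome_bp * genes_in_region // region_bp


region_bp = 4_000 * 1_000        # 4,000 kb of the MHC locus (from the comment)
genes_in_region = 100            # genes found in that region (from the comment)
genome_bp = 3_200_000_000        # ASSUMED haploid genome size, not stated in the comment

bp_per_gene = region_bp // genes_in_region
estimate = extrapolate_gene_count(region_bp, genes_in_region, genome_bp)

print(bp_per_gene)   # 40000 bp per gene in the MHC region
print(estimate)      # 80000 genes if the whole genome had MHC-like density
```

The estimate scales directly with the assumed gene density. Because the MHC locus was known to be unusually gene-rich, the result is best read as an upper bound, not a prediction.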
Joe Felsenstein said...: Thanks, Larry for the explanations.; Wednesday, October 02, 2019 2:32:00 AM
Rafal Grochala said...: Nice summary. Straight to the bookmarks.; Saturday, October 12, 2019 6:51:00 AM
Unknown said...: Hi Joe,
introns were not known at that time, but it was known from the work by Monod that genes can be far longer than their coding sequence. At that time the extra length was thought to be "regulatory sequence". We added that to the Box 2 text, where we provide some more details on gene number estimates than in the main text.
Best, Martin; Tuesday, October 22, 2019 3:15:00 PM
Unknown said...: Hi Larry,
thanks for commenting on our review. I tried hard to find any publication about gene number estimates between 1965 and 1990, but failed. I know all the methods you mention in your comments above, and I know these were used to estimate gene size and other aspects. But I couldn't find a single study with a sentence like "based on our method we propose/extrapolate the number of human genes to xx". All the numbers mentioned in any of the papers from that long 25 years refer to previous publications. The citation practice seemed to be different at that time. All these papers cited reviews and textbooks from the late 1960s, and those again cite earlier reviews (mainly from Muller), and only when you dig through all these reviews do you end up with the original work from the 1940s. That's maybe also the reason that all these early papers have just a handful of citations each. Everybody seemed to believe in these numbers, they were in textbooks, thus there was no reason to present new numbers (total human gene numbers). Researchers seemed to concentrate on more important questions (junk, introns, etc.); who cared about a slightly different total number? Or do you know a specific publication from the 1970s or 1980s giving an estimate on "total human genes"?
Best wishes, Martin; Tuesday, October 22, 2019 3:23:00 PM
Larry Moran said...: The Rot data comes mostly from the early 1970s and it represents a new estimate of gene number that doesn't depend on the genetic load argument. Benjamin Lewin wrote several reviews of the data and his 1974 book (Gene Expression) is the best summary of what was known at that time. I think it's fair to say that we had new independent data on gene number in the early 1970s.

The genetic load argument was constantly being revised and updated as we learned more and more about genes and genomes. The C-value debate played an important role in our understanding that humans could have less than 50,000 genes in spite of the fact that we had a large genome. This argument didn't become prominent until the 1970s.

I realize that the original estimates of gene number based on genetic load didn't change much when Ohno published his 1972 paper (~30,000 genes) but, because of more recent data, that estimate was more reliable in 1972 than it was in 1948.

The relevance of the Drosophila data was constantly being re-evaluated in the 1970s, especially after Judd published his saturation experiments in 1972 and 1973. I think it's fair to point out that estimates of gene number in humans, based on analogy with Drosophila, were stronger in the 70s than they were in the 60s.

For these reasons, I think your graph would have been better if it had included 1970s estimates of the number of genes in humans. It would have shown that a strong consensus was developing that humans had fewer than 50,000 genes (most likely only 30,000).; Friday, October 25, 2019 1:56:00 PM
Unknown said...: I absolutely agree with your comment. But this would be two different kinds of information in one plot: a) papers with gene number estimates, and b) data/methods supporting one or the other estimate. In the 1970s there are many publications, as you point out, that nicely present new methods and data, but in the end "just" say, "well, our data/method supports the earlier estimate of 30,000 genes" and cite a review from the late 1960s. The 1974 book is great, but cites "only" reviews from the late 1960s as well, and those point to just the few publications shown in the plot. Reviews/papers after 1974 point to this book or newer editions.

Drawing a plot based on "supporting data/methods" is not that easy. The methods used in the 1990s also "support" the high numbers, and the software used has thousands of citations. Just as researchers in the 1960s couldn't understand why genes are far longer than their coding sequence, researchers in the 1990s didn't imagine how complex and diverse genes are (e.g. exon length from 1 bp to 4000 bp, gene length from 50 aa to thousands of aa, alternative splicing etc.). Thus the data and methods from the 1990s are ok, but the conclusions are very wrong. How to put that into a plot?

I agree with Joe's comment above that we also should have made a plot with a linear x-axis, at least as a small inset, to demonstrate how long the 30,000 gene number was the agreed number compared to the very short time when high numbers were hyped.; Sunday, November 03, 2019 4:00:00 AM
