Tuesday, October 10, 2023

How many genes in the human genome (2023)?

The latest summary of the number of genes in the human genome gets the number of protein-coding genes correct but their estimate of the number of known non-coding genes is far too high.

In order to have a meaningful discussion about molecular genes, we have to agree on the definition of a molecular gene. I support the following definition (see What Is a Gene?).

A gene is a DNA sequence that is transcribed to produce a functional product.

This is a working definition, not an exclusive definition, because we know there are exceptions (see Definition of a gene (again)). The important parts of the definition are that it includes two types of genes, protein-coding and non-coding, and that it emphasizes function. It's important to note that just because a DNA sequence is transcribed doesn't mean that it's a gene. The transcript could be spurious junk RNA (most are).

I discuss the number of protein-coding genes in my book on pages 139-147 and conclude that there are between 19,000 and 20,000 of them. This is the same conclusion reached by the authors of a recent paper in Nature (Oct. 5, 2023).

Amaral, P., Carbonell-Sala, S., De La Vega, F.M., Faial, T., Frankish, A., Gingeras, T., Guigo, R., Harrow, J.L., Hatzigeorgiou, A.G., Johnson, R. et al. (2023) The status of the human gene catalogue. Nature 622:41-47. [doi: 10.1038/s41586-023-06490-x]

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.

It's more of a challenge to estimate the number of non-coding genes. I cover this in my book where I conclude that there are no more than 5,000 non-coding genes. (I suspect the actual number is closer to 1,000.) Amaral et al. recognize the difficulty in determining whether a non-coding RNA is functional but they lean toward a much higher estimate. Table 1 in their paper lists various estimates of lncRNA "genes" with numbers ranging from 18,000 to 96,000. (They don't cover the other non-coding genes.)

It seems clear to me that the authors are conflicted about non-coding genes. They recognize that function is important because they say, "... our definition will call them genes only if they have a discernable function at the cellular or organismal level." However, they seem inclined to believe that the genome contains tens of thousands of non-coding genes even though there's no evidence of function for the vast majority of those genes. Their failure to critically discuss this apparent conflict is a great weakness. I would not have recommended publication if I had been asked to review the manuscript.

There's one other example of a lack of critical thinking and that's in their discussion of historical estimates of the number of genes. They fail to mention any of the experts who were predicting 30,000 genes back in 1970 and, instead, focus on the outlandish guess of Walter Gilbert who predicted 100,000 genes in 1990. I've covered this false history several times on Sandwalk (see How many protein-coding genes in the human genome? (2)) and it's addressed in my book.

Here's the figure from the Amaral et al. paper.

This ignores the accurate predictions from before 1990 and gives a very misleading impression of historical estimates. It also gives the impression that nobody knew about non-coding genes before 2000. That's ridiculous because even in 1970 we knew that there were hundreds of genes for ribosomal RNAs and tRNAs and throughout the 1980s and 90s we learned a lot about other non-coding genes.


SPARC said...

Examples of other non-coding genes that do not express rRNAs or tRNAs are H19 and Xist, which have been known since the early 90s.

Michael Tress said...

Definitely a fair amount of work going on in this paper. My favourite is the attempt to convince us that the Pertea and Salzberg paper that predicted 22,000 coding genes was part of a general decrease in predicted gene numbers, when Clamp et al. had predicted 20,500 three years earlier in PNAS.