Friday, July 13, 2018

How many protein-coding genes in the human genome?

The three main human databases (GENCODE/Ensembl, RefSeq, UniProtKB) contain a total of 22,210 protein-coding genes, but only 19,446 of these genes are found in all three databases. That leaves 2,764 potential genes that may or may not be real. A recent publication suggests that most of them are not real genes (Abascal et al., 2018). This is the same problem that I discussed in recent posts [Disappearing genes: a paper is refuted before it is even published] [Nature falls (again) for gene hype].
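The arithmetic behind those numbers can be checked with a short Python sketch. It assumes that "a total of 22,210" means the combined (union) set across the three databases and that 19,446 is the set shared by all three. The variable names are illustrative only and do not come from the paper.

```python
# Counts quoted in the post above.
# Assumption: the "total" is the union of the three databases,
# and the shared count is their intersection.
total_in_any_database = 22_210   # GENCODE/Ensembl, RefSeq, UniProtKB combined
shared_by_all_three = 19_446     # genes annotated in all three databases

# Genes that are missing from at least one of the three databases.
disputed = total_in_any_database - shared_by_all_three
disputed_fraction = disputed / total_in_any_database

print(f"Genes not found in all three databases: {disputed:,}")
print(f"Share of the combined set: {disputed_fraction:.1%}")
```

Under that reading, roughly one in eight of the annotated protein-coding genes is in dispute.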

Sunday, July 08, 2018

Nature falls (again) for gene hype

Nature is arguably the most prestigious science journal. Articles published in Nature are widely perceived to be correct, unbiased, and factual. This perception certainly extends to articles that appear in the News section of the journal, since these articles are presumably written by expert science writers who have evaluated the new study and decided that it's worth reporting.

Sandwalk readers know that this perception is false (fake news). It turns out that science writers who publish in Nature are not much better than science writers in general, and that's not good.

I recently published a post about an extraordinary claim concerning the number of human genes [Disappearing genes: a paper is refuted before it is even published]. It concerns a paper posted on an archive site claiming to have found 4,998 new genes, of which 1,178 are new protein-coding genes (Pertea et al., 2018). About five weeks later, another paper was posted that effectively refuted the claim of new protein-coding genes (Jungreis et al., 2018). In between the posting of those two papers, a freelance science writer, Cassandra Willyard, wrote an article for Nature News that covered the original claim of 4,998 new genes [New human gene tally reignites debate].

Let's see how she handled the controversy.

Disappearing genes: a paper is refuted before it is even published

Several readers alerted me to a paper that was posted on bioRxiv a few weeks ago (May 28, 2018). The paper claimed that the human genome contains 43,162 genes, consisting of 21,306 protein-coding genes and 21,856 noncoding genes. The authors reported that they had discovered 3,819 new noncoding genes and 1,178 new protein-coding genes. In addition, they claimed to have discovered 97,511 new splice variants, raising the total number of splice variants to 12.5 per protein-coding gene, although they seem to suggest that almost one-third of these splice variants are non-functional splicing errors. The most striking result, according to the authors, is that 95% of all transcripts are just transcriptional noise.

Here's the paper ...