Sunday, July 08, 2018

Disappearing genes: a paper is refuted before it is even published

Several readers alerted me to a paper that was posted on bioRxiv a few weeks ago (May 28, 2018). The paper claimed that the human genome contains 43,162 genes, consisting of 21,306 protein-coding genes and 21,856 noncoding genes. The authors reported that they had discovered 3,819 new noncoding genes and 1,178 new protein-coding genes. In addition, they claimed to have discovered 97,511 new splice variants, raising the total number of splice variants to 12.5 per protein-coding gene, although they seem to suggest that almost one-third of these splice variants are non-functional splicing errors. The most striking result, according to the authors, is that 95% of all transcripts are just transcriptional noise.

Here's the paper ...
Pertea, M., Shumate, A., Pertea, G., Varabyou, A., Chang, Y.-C., Madugundu, A.K., Pandey, A., and Salzberg, S. (2018) Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise. bioRxiv (May 29, 2018). [doi: 10.1101/332825]

We assembled the sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the GTEx project, to create a new, comprehensive catalog of human genes and transcripts. The new human gene database contains 43,162 genes, of which 21,306 are protein-coding and 21,856 are noncoding, and a total of 323,824 transcripts, for an average of 7.5 transcripts per gene. Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs. We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.
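The figures in the abstract can be checked with a few lines of arithmetic. This is only a sanity check on the numbers as quoted above, not a reanalysis of the data; the variable names are mine. Note that the two novel-gene subtotals, as quoted, sum to one fewer than the stated total.

```python
# Figures copied from the abstract as quoted above.
coding, noncoding = 21_306, 21_856
total_genes = 43_162
transcripts = 323_824
novel_coding, novel_noncoding = 1_178, 3_819
novel_total_reported = 4_998

# Coding plus noncoding genes should equal the stated total.
genes_sum = coding + noncoding

# Average number of transcripts per gene (the abstract says 7.5).
transcripts_per_gene = transcripts / total_genes

# Novel coding plus novel noncoding genes, compared with the stated total.
novel_sum = novel_coding + novel_noncoding

print(f"coding + noncoding = {genes_sum:,} (stated: {total_genes:,})")
print(f"transcripts per gene = {transcripts_per_gene:.1f}")
print(f"novel coding + novel noncoding = {novel_sum:,} (stated: {novel_total_reported:,})")
```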
While I appreciate the evidence that most transcripts are just noise (contra ENCODE), I'm skeptical of some of the other claims, especially the number of genes. There have been a lot of papers on the number of human protein-coding genes, and the consensus number is less than 20,000, as the authors themselves note [How many proteins in the human proteome?].

The number of noncoding genes is very controversial—it's very unlikely that there are as many as 22,000 as Pertea et al. claim [How many lncRNAs are functional?].

Finally, alternative splicing is mostly artifact, in my opinion, so I'm skeptical of claims that every gene has multiple functional variants [Are splice variants functional or noise?].

I decided not to blog about that paper because it is a preprint posted on bioRxiv and hasn't been peer-reviewed. It's not that I'm a big fan of peer review but I think it's wise to allow the paper to undergo some kind of review before commenting.

Let me re-emphasize that there's a lot of good stuff in this paper in spite of the fact that I'm skeptical about some of the main conclusions.

My stance on blogging changed when I was told about another paper that's just appeared on bioRxiv ...
Jungreis, I., Tress, M. L., Mudge, J., Sisu, C., Hunt, T., Johnson, R., Uszczynska-Ratajczak, B., Lagarde, J., Wright, J., Muir, P., Gerstein, M., Guigo, R., Kellis, M., Frankish, A., and Flicek, P. (2018) Nearly all new protein-coding predictions in the CHESS database are not protein-coding. bioRxiv (July 2, 2018). [doi: 10.1101/360602]

In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.
The abstract says it all. This is also a non-peer-reviewed preprint but it's remarkable because it was posted about one month after the Pertea et al. paper and it challenges one of the main conclusions of that paper. The authors, Jungreis et al., represent the GENCODE Consortium and they are responding because the first paper claimed that their new database of human genes is superior to that of GENCODE. (There are two other reliable, annotated databases called RefSeq and UniProtKB (formerly Swiss-Prot).)

Jungreis et al. have included some sensible recommendations that are worth repeating ...
Recommendations for future studies
A plethora of recent papers proclaim the discovery of hundreds or thousands of new human protein-coding ORFs (see Introduction), so it seems likely that there will be further studies of this type. For that reason we would like to make several recommendations that authors and reviewers alike might like to bear in mind with the aim of achieving higher confidence protein-coding predictions.

First, data should be filtered for the complete list of transposons. Second, ORFs predicted to be protein-coding based on homology [1] should extend the full length of the coding homolog, unless there is independent evidence of functional translation, to avoid inclusion of pseudogenes. Third, any homology must be to manually-curated genes, not to predicted genes. Fourth, expression at the transcript level is not protein-coding evidence; even ribosome profiling data is not in itself proof of translation into a functional protein. Fifth, be conservative when attributing protein evidence from proteomics experiments; most novel protein coding genes will be hard to detect in standard proteomics experiments because they are likely to be expressed only in low quantities or in limited tissues, but using less stringent thresholds to compensate for that is likely to result in many false positives. Sixth, conservation among related species should be tested against a null model defined by noncoding regions in order to detect purifying selection. Finally it is important that all novel predictions are manually inspected, and not just a select few. Authors should not implicitly trust their own predictions. For example, manual inspection of the CHESS novel protein-coding predictions has quickly revealed that most are based on homology to the same few annotations, many of which are low quality predictions.
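To make the recommendations concrete, here is a minimal sketch of how some of them could be expressed as a checklist applied to every candidate. It is not the pipeline used by Jungreis et al. or by CHESS; the `Candidate` record, its field names, and the `failed_checks` function are all hypothetical, and each flag stands in for an analysis that would have to be done separately. The cautions about transcript-level expression and proteomics thresholds are left out because they concern how evidence is weighed rather than a yes/no test.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical record for one predicted protein-coding ORF."""
    name: str
    overlaps_transposon: bool               # first recommendation
    spans_full_homolog: bool                # second recommendation
    independent_translation_evidence: bool  # exception in the second recommendation
    homolog_manually_curated: bool          # third recommendation
    conserved_vs_noncoding_null: bool       # sixth recommendation
    manually_inspected: bool                # final recommendation


def failed_checks(c: Candidate) -> List[str]:
    """Return the list of checks this candidate fails (empty if none)."""
    failures = []
    if c.overlaps_transposon:
        failures.append("overlaps a transposon")
    if not (c.spans_full_homolog or c.independent_translation_evidence):
        failures.append("partial match to homolog, possible pseudogene")
    if not c.homolog_manually_curated:
        failures.append("homolog is itself only a predicted gene")
    if not c.conserved_vs_noncoding_null:
        failures.append("conservation no better than noncoding null model")
    if not c.manually_inspected:
        failures.append("not manually inspected")
    return failures


# Made-up example: a candidate that passes everything except the transposon check.
example = Candidate("orf_A", True, True, False, True, True, True)
print(example.name, failed_checks(example))
```

The point of returning every failure, rather than stopping at the first, is that it makes the reason for rejecting each prediction visible to whoever inspects the list.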
The recommendations are justified but I have to admit to a deep sense of irony when I read them. Several of the authors on this paper are ENCODE leaders (Mark Gerstein, Roderic Guigo, Manolis Kellis) whose names appeared on the original ENCODE paper claiming that most of the genome is functional. They are also on the "retraction" paper that was published in 2014 (Kellis et al., 2014). As far as I know, they have not applied such rigorous standards to their claims for noncoding genes and they have never admitted that their original ENCODE propaganda was misleading. Think about that when you read this sentence from the current paper ...
The discovery of hundreds of novel human protein-coding genes is an extraordinary claim that must be backed up by strong experimental or evolutionary evidence.
I agree that extraordinary claims about new protein-coding genes must be strongly supported by evidence. I also think that the extraordinary claim that most of the human genome is functional must be backed up by strong experimental or evolutionary evidence but I'm not sure Jungreis et al. would agree.

The important point here is that Pertea et al. did not do their homework. They published an extraordinary claim (1,178 new protein-coding genes) that challenged the experts and the experts shot it down within five weeks. Almost everyone who had read the literature was skeptical of the original claim as soon as they saw it. It will be interesting to see if the Pertea et al. paper ever appears in the peer-reviewed scientific literature.

1. I think they are misusing the word "homology." They should use the word "similarity" as in "... predictions of function based on similarity to sequences in other species."

Kellis, M., Wold, B., Snyder, M.P., Bernstein, B.E., Kundaje, A., Marinov, G.K., Ward, L.D., Birney, E., Crawford, G.E., Dekker, J., Dunham, I., Elnitski, L., Farnham, P.J., Feingold, E.A., Gerstein, M., Giddings, M.C., Gilbert, D.M., Gingeras, T.R., Green, E.D., Guigo, R., Hubbard, T., Kent, J., Lieb, J.D., Myers, R.M., Pazin, M.J., Ren, B., Stamatoyannopoulos, J.A., Weng, Z., White, K.P., and Hardison, R.C. (2014) Defining functional DNA elements in the human genome. Proceedings of the National Academy of Sciences, 111:6131-6138. [doi: 10.1073/pnas.1318948111]


  1. Even many protein-coding genes from GENCODE, RefSeq and UniProt will disappear too. There's considerable disagreement between G, R and U, and genes in disagreement do not show proper protein-coding features. Sorry for self-advertising but thought this might be of interest to the debate.

    Loose ends: almost one in five human genes still have unresolved coding status.
    F Abascal, D Juan, I Jungreis, L Martinez, M Rigau, JM Rodriguez, ...
    Nucleic Acids Research

    1. I'm in the midst of preparing a blog post on that paper. It's very interesting. I have questions that I hope you can answer.

  2. Thanks, Larry. Sure, you can ask us anything you like

  3. hi Larry, I just stumbled upon your blog here. Our paper - unlike the ENCODE papers and the GENCODE database - is based on fully open data (the GTEx data) and transparent methods. We describe exactly how we narrowed down the set of assembled RNA transcripts from over 30 million to just a few hundred thousand, and we describe how we determined that some of them are protein-coding. Our methods essentially recapitulated the entire Gencode and RefSeq protein sets, plus a few more. (1178 is only about 5% more.) We did not say that our database was "superior" to anyone else's, despite what you wrote here. However, we believe that our methods do a remarkably good job of capturing nearly all protein-coding genes, based as they are upon an enormous, high-quality RNA-seq resource.

    Like you, we are also skeptical of some of our own genes (and note that we have already thrown out the vast majority of assembled RNA-seq transcripts). I will not be surprised if many of our 1178 newly-reported genes turn out to be nonfunctional and/or noncoding. This paper describes our second iteration (2.0) of a new gene database, CHESS, which we are presenting as a more open alternative to the closed (i.e., curated by a self-selected group that does not allow others to modify anything) databases. As you note, the Gencode people already wrote a lengthy rebuttal, which was a complete surprise to us, but which we view to be more about protecting turf than about science.

    I would also note that if you go back just a couple of years, Gencode had well over 1,000 protein-coding genes that they have subsequently deleted. (The number is even larger if you go back further.) I've never written a paper saying "Gencode has over 1,000 protein-coding genes that are not protein-coding," although clearly the Gencode authors would agree with that statement as applied to previous releases of their own database. We are trying to present an alternative, and we encourage input, especially constructive input, from the community.

    1. Steven,

      Thanks for your comments.

      You said, "We did not say that our database was "superior" to anyone else's, despite what you wrote here." I agree that you did not use the word "superior" but there are many phrases in your paper suggesting that you have created a better database than others. For example, these are the opening sentences of your discussion ...

      The new human gene catalog described here, CHESS, contains a comprehensive set of genes based on nearly 10,000 RNA sequencing experiments. As such, it provides a reference with substantially greater experimental support than previous human gene catalogs. Although it represents only a modest increase in the number of protein-coding genes (1178, or 5.5% out of 21,306 total), it more than doubles the number of splice variants and other isoforms of these genes, to 267,476. This more-comprehensive catalog of genes and splice variants should provide a better foundation for RNA-seq experiments, exome sequencing experiments, genome-wide association studies, and many other studies that rely on human gene annotation as the basis for their analysis.

      I interpreted that to mean that you thought your CHESS database was better than (= superior to) the other databases. If my interpretation was incorrect I'd appreciate it if you could clarify what you meant when you wrote that.

    2. @Steven Salzberg,

      You claim to have identified 267,476 splice variants of protein-coding genes. This represents 97,511 new splice variants that are not present in the current version of RefSeq. This brings the total number of isoforms to 12.5 per protein-coding gene.

      How many of these isoforms do you think are biologically relevant and how many are just noise? Do you think that every protein-coding gene produces several different polypeptides due to alternative splicing?

  4. Larry, I tried to reply but your blogsite keeps getting stuck and my reply didn't post. Trying again, more briefly: in answer to one of your questions, the more-comprehensive CHESS set is better for RNA-seq, we believe, because it gives transcriptome assemblers like StringTie and Cufflinks more possibilities, which the assemblers are pretty good at choosing from based on the data itself. The effect will be even greater for those who use Salmon or Kallisto, because those programs simply quantify the genes/transcripts provided in the gtf file - they don't do any assembly or alignment, so anything missing from the annotation will simply be ignored. All of these programs should be able to tolerate 'extra' (false positive) transcripts pretty well. For exome studies, the exome capture kits are made by companies that rely entirely on RefSeq+Gencode, and if an exon isn't in the kit, it's ignored by any exome sequencing study. Adding more exons to the kits can't hurt and might help, and we hope that CHESS will spur the exome kit manufacturers to do that. So this is part of the basis for our statement that CHESS provides "a better foundation" for such studies.

    1. So, if I understand you correctly, you think your CHESS database is better because it has more false positives. Is that correct?

  5. @Steven Salzberg,

    I'm still interested in your best guess about how many functional alternative transcripts there are for each protein-coding gene. Do you think most of them are capable of coding for several different functional polypeptides?