Sandwalk: Disappearing genes: a paper is refuted before it is even published

Sunday, July 08, 2018

Disappearing genes: a paper is refuted before it is even published

Several readers alerted me to a paper that was posted on bioRxiv a few weeks ago (May 28, 2018). The paper claimed that the human genome contains 43,162 genes consisting of 21,306 protein-coding genes and 21,856 noncoding genes. The authors reported that they had discovered 3,819 new noncoding genes and 1,178 new protein-coding genes. In addition, they claimed to have discovered 97,511 new splice variants, raising the total number of splice variants to 12.5 per protein-coding gene, although they seem to suggest that almost one-third of these splice variants are non-functional splicing errors. The most striking result, according to the authors, is that 95% of all transcripts are just transcriptional noise.

Here's the paper ...

Pertea, M., Shumate, A., Pertea, G., Varabyou, A., Chang, Y.-C., Madugundu, A.K., Pandey, A., and Salzberg, S. (2018) Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise. bioRxiv (May 29, 2018). [doi: 10.1101/332825]

Abstract
We assembled the sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the GTEx project, to create a new, comprehensive catalog of human genes and transcripts. The new human gene database contains 43,162 genes, of which 21,306 are protein-coding and 21,856 are noncoding, and a total of 323,824 transcripts, for an average of 7.5 transcripts per gene. Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs. We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.
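As a quick sanity check on the arithmetic in that abstract, here is a minimal Python sketch. The variable names are my own, and the 267,476 isoform figure is not in the abstract: it comes from the paper's Discussion, which is quoted in the comments below. This only recomputes the published totals and says nothing about whether the underlying gene calls are correct.

```python
# Figures as quoted from Pertea et al. (2018); variable names are illustrative.
protein_coding = 21_306
noncoding = 21_856
total_genes = 43_162
total_transcripts = 323_824
novel_coding = 1_178
novel_noncoding = 3_819
novel_total_stated = 4_998
coding_isoforms = 267_476  # from the paper's Discussion, not the abstract

# Do the parts add up to the stated totals?
gene_sum = protein_coding + noncoding
novel_sum = novel_coding + novel_noncoding

# Averages and shares implied by the stated counts.
transcripts_per_gene = total_transcripts / total_genes
isoforms_per_coding_gene = coding_isoforms / protein_coding
novel_coding_share = novel_coding / protein_coding

print("gene total matches:", gene_sum == total_genes)
print("novel genes, summed vs stated:", novel_sum, novel_total_stated)
print("transcripts per gene:", round(transcripts_per_gene, 2))
print("isoforms per protein-coding gene:", round(isoforms_per_coding_gene, 2))
print("novel coding genes as % of coding total:", round(100 * novel_coding_share, 1))
```

The gene total is consistent, and the averages land close to the 7.5 and roughly 12.5 figures quoted. The one discrepancy is small: the stated novel-gene total (4,998) is one more than the sum of its two stated parts (4,997).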

While I appreciate the evidence that most transcripts are just noise (contra ENCODE), I'm skeptical of some of the other claims, especially the number of genes. There have been a lot of papers on the number of human protein-coding genes and the consensus number is less than 20,000, as the authors themselves note [How many proteins in the human proteome?].

The number of noncoding genes is very controversial—it's very unlikely that there are as many as 22,000 as Pertea et al. claim [How many lncRNAs are functional?].

Finally, alternative splicing is mostly artifact, in my opinion, so I'm skeptical of claims that every gene has multiple functional variants [Are splice variants functional or noise?].

I decided not to blog about that paper because it is a preprint posted on bioRxiv and hasn't been peer-reviewed. It's not that I'm a big fan of peer review but I think it's wise to allow the paper to undergo some kind of review before commenting.

Let me re-emphasize that there's a lot of good stuff in this paper in spite of the fact that I'm skeptical about some of the main conclusions.

My stance on blogging changed when I was told about another paper that's just appeared on bioRxiv ...

Jungreis, I., Tress, M. L., Mudge, J., Sisu, C., Hunt, T., Johnson, R., Uszczynska-Ratajczak, B., Lagarde, J., Wright, J., Muir, P., Gerstein, M., Guigo, R., Kellis, M., Frankish, A., and Flicek, P. (2018) Nearly all new protein-coding predictions in the CHESS database are not protein-coding. bioRxiv (July 2, 2018). [doi: 10.1101/360602]

Abstract
In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.

The abstract says it all. This is also a non-peer-reviewed preprint but it's remarkable because it was posted about one month after the Pertea et al. paper and it challenges one of the main conclusions of that paper. The authors, Jungreis et al., represent the GENCODE Consortium and they are responding because the first paper claimed that their new database of human genes is superior to that of GENCODE. (There are two other reliable, annotated databases called RefSeq and UniProtKB (formerly Swiss-Prot).)

Jungreis et al. have included some sensible recommendations that are worth repeating ...

Recommendations for future studies
A plethora of recent papers proclaim the discovery of hundreds or thousands of new human protein-coding ORFs (see Introduction), so it seems likely that there will be further studies of this type. For that reason we would like to make several recommendations that authors and reviewers alike might like to bear in mind with the aim of achieving higher confidence protein-coding predictions.

First, data should be filtered for the complete list of transposons. Second, ORFs predicted to be protein-coding based on homology¹ should extend the full length of the coding homolog, unless there is independent evidence of functional translation, to avoid inclusion of pseudogenes. Third, any homology must be to manually-curated genes, not to predicted genes. Fourth, expression at the transcript level is not protein-coding evidence; even ribosome profiling data is not in itself proof of translation into a functional protein. Fifth, be conservative when attributing protein evidence from proteomics experiments; most novel protein coding genes will be hard to detect in standard proteomics experiments because they are likely to be expressed only in low quantities or in limited tissues, but using less stringent thresholds to compensate for that is likely to result in many false positives. Sixth, conservation among related species should be tested against a null model defined by noncoding regions in order to detect purifying selection. Finally it is important that all novel predictions are manually inspected, and not just a select few. Authors should not implicitly trust their own predictions. For example, manual inspection of the CHESS novel protein-coding predictions has quickly revealed that most are based on homology to the same few annotations, many of which are low quality predictions.
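To make those recommendations concrete, here is a minimal Python sketch that treats the first six as pass/fail checks on a candidate ORF, plus the final call for manual inspection. It is not the Jungreis et al. pipeline: the `Candidate` record, its field names, and the `failed_checks` helper are all hypothetical, and each boolean stands in for an analysis (RepeatMasker overlap, PhyloCSF-style conservation testing, and so on) that would have to be run separately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical summary of the evidence for one predicted coding ORF."""
    name: str
    overlaps_transposon: bool               # e.g. overlaps a RepeatMasker annotation
    spans_full_homolog: bool                # ORF extends the full length of the homolog
    independent_translation_evidence: bool  # independent evidence of functional translation
    homolog_manually_curated: bool          # homolog is curated, not itself a prediction
    evidence_beyond_transcription: bool     # something more than RNA-level expression
    peptide_evidence_relaxed_only: bool     # proteomics support only at relaxed thresholds
    beats_noncoding_null: bool              # conservation exceeds a noncoding null model
    manually_inspected: bool


def failed_checks(c: Candidate) -> List[str]:
    """Return the recommendations this candidate fails, in the order given above."""
    failures = []
    if c.overlaps_transposon:
        failures.append("1: overlaps a transposon")
    if not (c.spans_full_homolog or c.independent_translation_evidence):
        failures.append("2: partial match to homolog (possible pseudogene)")
    if not c.homolog_manually_curated:
        failures.append("3: homolog is only a predicted gene")
    if not c.evidence_beyond_transcription:
        failures.append("4: transcription alone is not coding evidence")
    if c.peptide_evidence_relaxed_only:
        failures.append("5: proteomics support relies on relaxed thresholds")
    if not c.beats_noncoding_null:
        failures.append("6: conservation no better than noncoding null")
    if not c.manually_inspected:
        failures.append("7: not manually inspected")
    return failures


# Two made-up candidates: one resembling the Alu-derived cases, one well supported.
alu_like = Candidate("hypothetical_orf_1", True, False, False, False,
                     False, False, False, True)
well_supported = Candidate("hypothetical_orf_2", False, True, False, True,
                           True, False, True, True)

for cand in (alu_like, well_supported):
    print(cand.name, failed_checks(cand))
```

Returning the list of failed checks rather than a single yes/no keeps the reason for each rejection visible, which is the spirit of the last recommendation: someone still has to look at every prediction.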

The recommendations are justified but I have to admit to a deep sense of irony when I read them. Several of the authors on this paper are ENCODE leaders (Mark Gerstein, Roderic Guigo, Manolis Kellis) whose names appeared on the original ENCODE paper claiming that most of the genome is functional. They are also on the "retraction" paper that was published in 2014 (Kellis et al., 2014). As far as I know, they have not applied such rigorous standards to their claims for noncoding genes and they have never admitted that their original ENCODE propaganda was misleading. Think about that when you read this sentence from the current paper ...

The discovery of hundreds of novel human protein-coding genes is an extraordinary claim that must be backed up by strong experimental or evolutionary evidence.

I agree that extraordinary claims about new protein-coding genes must be strongly supported by evidence. I also think that the extraordinary claim that most of the human genome is functional must be backed up by strong experimental or evolutionary evidence but I'm not sure Jungreis et al. would agree.

The important point here is that Pertea et al. did not do their homework. They published an extraordinary claim (1,178 new protein-coding genes) that challenged the experts and the experts shot it down within five weeks. Almost everyone who had read the literature was skeptical of the original claim as soon as they saw it. It will be interesting to see if the Pertea et al. paper ever appears in the peer-reviewed scientific literature.

1. I think they are misusing the word "homology." They should use the word "similarity" as in "... predictions of function based on similarity to sequences in other species."

Kellis, M., Wold, B., Snyder, M.P., Bernstein, B.E., Kundaje, A., Marinov, G.K., Ward, L.D., Birney, E., Crawford, G.E., Dekker, J., Dunham, I., Elnitski, L., Farnham, E.A., Gerstein, M., Giddings, M.C., Gilbert, D.M., Gingeras, T.R., Green, E.D., Guigo, R., Hubbard, T., Kent, J., Lieb, J.D., Myers, R.M., Pazin, M.J., Ren, B., Stamatoyannopoulos, J.A., Weng, Z., White, K.P., and Hardison, R.C. (2014) Defining functional DNA elements in the human genome. Proceedings of the National Academy of Sciences, 111:6131-6138. [doi: 10.1073/pnas.1318948111]

9 comments :

Federico Abascal said...: Even many protein-coding genes from GENCODE, RefSeq and UniProt will disappear too. There's considerable disagreement between G, R and U, and genes in disagreement do not show proper protein-coding features. Sorry for self-advertising but thought this might be of interest to the debate.

Loose ends: almost one in five human genes still have unresolved coding status.
F Abascal, D Juan, I Jungreis, L Martinez, M Rigau, JM Rodriguez, ...
Nucleic Acids Research; Thursday, July 12, 2018 5:54:00 AM
Larry Moran said...: I'm in the midst of preparing a blog post on that paper. It's very interesting. I have questions that I hope you can answer.; Thursday, July 12, 2018 1:50:00 PM
Federico Abascal said...: Thanks, Larry. Sure, you can ask us anything you like; Thursday, July 12, 2018 5:44:00 PM
Steven Salzberg said...: hi Larry, I just stumbled upon your blog here. Our paper - unlike the ENCODE papers and the GENCODE database - is based on fully open data (the GTEx data) and transparent methods. We describe exactly how we narrowed down the set of assembled RNA transcripts from over 30 million to just a few hundred thousand, and we describe how we determined that some of them are protein-coding. Our methods essentially recapitulated the entire Gencode and RefSeq protein sets, plus a few more. (1178 is only about 5% more.) We did not say that our database was "superior" to anyone else's, despite what you wrote here. However, we believe that our methods do a remarkably good job of capturing nearly all protein-coding genes, based as they are upon an enormous, high-quality RNA-seq resource.

Like you, we are also skeptical of some of our own genes (and note that we have already thrown out the vast majority of assembled RNA-seq transcripts). I will not be surprised if many of our 1178 newly-reported genes turn out to be nonfunctional and/or noncoding. This paper describes our second iteration (2.0) of a new gene database, CHESS, which we are presenting as a more open alternative to the closed (i.e., curated by a self-selected group that does not allow others to modify anything) databases. As you note, the Gencode people already wrote a lengthy rebuttal, which was a complete surprise to us, but which we view to be more about protecting turf than about science.

I would also note that if you go back just a couple of years, Gencode had well over 1,000 protein-coding genes that they have subsequently deleted. (The number is even larger if you go back further.) I've never written a paper saying "Gencode has over 1,000 protein-coding genes that are not protein-coding," although clearly the Gencode authors would agree with that statement as applied to previous releases of their own database. We are trying to present an alternative, and we encourage input, especially constructive input, from the community.; Monday, July 16, 2018 2:22:00 PM
Larry Moran said...: Steven,

Thanks for your comments.

You said, "We did not say that our database was "superior" to anyone else's, despite what you wrote here." I agree that you did not use the word "superior" but there are many phrases in your paper suggesting that you have created a better database than others. For example, these are the opening sentences of your discussion ...

The new human gene catalog described here, CHESS, contains a comprehensive set of genes based on nearly 10,000 RNA sequencing experiments. As such, it provides a reference with substantially greater experimental support than previous human gene catalogs. Although it represents only a modest increase in the number of protein-coding genes (1178, or 5.5% out of 21,306 total), it more than doubles the number of splice variants and other isoforms of these genes, to 267,476. This more-comprehensive catalog of genes and splice variants should provide a better foundation for RNA-seq experiments, exome sequencing experiments, genome-wide association studies, and many other studies that rely on human gene annotation as the basis for their analysis.

I interpreted that to mean that you thought your CHESS database was better than (= superior to) the other databases. If my interpretation was incorrect I'd appreciate it if you could clarify what you meant when you wrote that.; Monday, July 16, 2018 4:58:00 PM
Larry Moran said...: @Steven Salzberg,

You claim to have identified 267,476 splice variants of protein-coding genes. This represents 97,511 new splice variants that are not present in the current version of RefSeq. This brings the total number of isoforms to 12.5 per protein-coding gene.

How many of these isoforms do you think are biologically relevant and how many are just noise? Do you think that every protein-coding gene produces several different polypeptides due to alternative splicing?; Monday, July 16, 2018 5:13:00 PM
Steven Salzberg said...: Larry, I tried to reply but your blogsite keeps getting stuck and my reply didn't post. Trying again, more briefly: in answer to one of your questions, the more-comprehensive CHESS set is better for RNA-seq, we believe, because it gives transcriptome assemblers like StringTie and Cufflinks more possibilities, which the assemblers are pretty good at choosing from based on the data itself. The effect will be even greater for those who use Salmon or Kallisto, because those programs simply quantify the genes/transcripts provided in the gtf file - they don't do any assembly or alignment, so anything missing from the annotation will simply be ignored. All of these programs should be able to tolerate 'extra' (false positive) transcripts pretty well. For exome studies, the exome capture kits are made by companies that rely entirely on RefSeq+Gencode, and if an exon isn't in the kit, it's ignored by any exome sequencing study. Adding more exons to the kits can't hurt and might help, and we hope that CHESS will spur the exome kit manufacturers to do that. So this is part of the basis for our statement that CHESS provides "a better foundation" for such studies.; Wednesday, July 18, 2018 2:00:00 PM
Larry Moran said...: So, if I understand you correctly, you think your CHESS database is better because it has more false positives. Is that correct?; Wednesday, July 18, 2018 5:40:00 PM
Larry Moran said...: @Steven Salzberg,

I'm still interested in your best guess about how many functional alternative transcripts there are for each protein-coding gene. Do you think most of them are capable of coding for several different functional polypeptides?; Wednesday, July 18, 2018 5:43:00 PM
