Sandwalk: Splicing errors or alternative splicing?

Sunday, August 21, 2022

Splicing errors or alternative splicing?

The most important issue in alternative splicing, in my opinion, is whether splice variants are due to splicing errors (= junk RNA) or whether they reflect real biologically relevant alternative splicing.

Unfortunately, this view is not shared by the majority of scientists who work in this field. They are convinced that the vast majority of splice variant transcripts represent real examples of regulation and the main task is to document the extent of alternative splicing and characterize the various mechanisms.

I've written a lot about this topic over the years (see the list of posts at the bottom of this page). The two most important issues are: (1) the frequency of splicing errors and whether it can account for the splice variants and (2) the number of well-established, genuine, examples of biologically relevant alternative splicing and whether that's consistent with the claims.

I managed to post a summary of the data on the accuracy of splicing on the Intron article on Wikipedia and I urge you to take a look at it before it disappears. The bottom line is that splicing is not terribly accurate so we expect to detect a fairly high level of incorrectly spliced transcripts whenever we look at a collection of RNAs from a particular cell line. The expected number of mispliced transcripts is well within the concentrations of 'alternatively spliced' transcripts reported in most studies.

Several groups have attempted to confirm whether splice variants are producing different isoforms of the same protein as alternative splicing would predict. The results have been disappointing—it looks like fewer than 5% of all genes exibit genuine alternative splicing and, even then, there are only two variants. One of the best papers is from a former graduate student at the University of British Columbia, Shamsuddin Bhuiyan (Bhuiyan et al. 2018). He scoured the literature for genuine, well-documented examples of alternative splicing and came up with just a handful that pass muster. His thesis also addressed whether new long-read sequencing techniques would be better than the old short-read techniques at detecting real alternative splicing. The answer is no.

In light of these facts, you might think that most workers in the field would recognize that splicing errors are a serious problem. They don't. They persist in equating their splice variants with real examples of alternative splicing and promoting the idea that humans are capable of making many more proteins than you would predict by looking at the number of protein-coding genes.

This idea is connected the The Deflated Ego Problem, which features prominently in my upcoming book. Abundant alternative splicing is one way of explaining why humans can be so much more complex than nematodes even though they have the same number of protein-coding genes.

The latest contribution to the field is a paper that was published in the August 11, 2022 issue of Nature. It's from the labs of Tuuli Lappalainen at Columbia University (New York, NY USA) and Beryl Cummings at the Broad Institute (MIT, Boston MA USA).

Glinos, D. A., Garborcauskas, G., Hoffman, P., Ehsan, N., Jiang, L., Gokden, A., et al. (2022) Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608:353-360. [doi: 10.1038/s41586-022-05035-y]

Regulation of transcript structure generates transcript diversity and plays an important role in human disease. The advent of long-read sequencing technologies offers the opportunity to study the role of genetic variation in transcript structure. In this Article, we present a large human long-read RNA-seq dataset using the Oxford Nanopore Technologies platform from 88 samples from Genotype-Tissue Expression (GTEx) tissues and cell lines, complementing the GTEx resource. We identified just over 70,000 novel transcripts for annotated genes, and validated the protein expression of 10% of novel transcripts. We developed a new computational package, LORALS, to analyse the genetic effects of rare and common variants on the transcriptome by allele-specific analysis of long reads. We characterized allele-specific expression and transcript structure events, providing new insights into the specific transcript alterations caused by common and rare genetic variants and highlighting the resolution gained from long-read data. We were able to perturb the transcript structure upon knockdown of PTBP1, an RNA binding protein that mediates splicing, thereby finding genetic regulatory effects that are modified by the cellular environment. Finally, we used this dataset to enhance variant interpretation and study rare variants leading to aberrant splicing patterns.

They are taking advantage of the new long-read sequencing technology to add to the list of human splice variants that have been deposited in various databases. They discovered 71,735 novel transcripts that had not been previously reported.

Unlike many other papers, the authors attempted to determine if these splice variants produced a novel protein by looking for predicted peptides in the mass spec data. They restricted their search to only 20,948 of the novel transcripts because they used a cutoff concentration of greater or equal to five transcripts per million. They found 2,575 hits, which corresponds to 12% of the candidates and 3.6% of all novel transcripts.

We don't know how many of the hits in the mass spec data correspond to genuine protein isoforms.

My way of looking at this data confirms the view that a large percentage of splice variants is due to nonfunctional errors in splicing. The authors of the paper do not mention that possibility; instead, they focus on the number of novel variants they detected.

In this study, we present a large dataset of long-read RNA-seq, using material derived from cell lines and human tissues collected by the GTEx project. We identified 71,735 novel transcripts, which is high compared to other long-read studies probably because of our large sample size and tissue diversity, consistent with the high number of tissue-specific novel transcripts discovered. Supported by a high validation rate of the novel transcripts in high-throughput mass spectrometry proteome data, our data make an important contribution to human transcript annotations. Expanding long-read studies to further tissues and cell types, coupled with more extensive validation efforts, will enable a better understanding of the regulatory mechanisms of the different types of transcript changes, the functionally distinct protein isoforms that different transcripts can give rise to and the improved variant annotation, as demonstrated by our analysis.

It's difficult for me to see why discovering more and more examples of splicing errors would make "an important contribution to human transcript annotations." The most important annotation is whether the transcript is due to splicing error or genuine alternative splicing and they can't do that for 96.4% of their novel transcripts.

Blog posts on alternative splicing

Bhuiyan, S. A., Ly, S., Phan, M., Huntington, B., Hogan, E., Liu, C. C., . . . Pavlidis, P. (2018) Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics 19:637. [doi: 10.1186/s12864-018-5013-2]

12 comments :

SPARC said...: What always surprises me is that nobody is wondering where all these alternative transcripts kept hiding in the old pre-PCR/RNAseq/mass spec days. How can these people seriously teach about metabolic and signaling pathways while at the same time claiming that complete layers of complexity necessary for understanding life have never been addressed properly?; Monday, August 22, 2022 2:56:00 AM
Larry Moran said...: @SPARC

I bet you were surprised to learn that the genes for every enzyme in the glycolysis pathway and the citric acid cycle produce several different variants by alternative splicing. Not only that, the labs that studied those enzymes for decades never discovered those variants and never appreciated the complexity that made us human.

Isn’t genomics amazing!; Monday, August 22, 2022 4:46:00 AM
SPARC said...: According to the source cited in the supplements they isolated RNA from whole cells. Thus, by default their sample will contain many RNAs that will never make it into the cytoplasm to get translated.
In addition, doesn't their statement "For 608 genes we validated more than one transcript (1,304 total), with 823 transcripts being novel, often detecting tissue-specific protein transcript validation (Supplementary Table 5 and Extended Data Fig. 5e)." exactly mean "it looks like fewer than 5% of all genes exibit genuine alternative splicing and, even then, there are only two variants." as you've stated in your post.; Monday, August 22, 2022 5:04:00 AM
Michael Tress said...: Long read sequencing is a reality that will be reflected in the genome annotations before long. Genes are going to be annotated with a lot more splice variants. To separate transcripts that produce functionally important proteins from those that produce less functionally important proteins (or none at all) is why we developed APPRIS and TRIFID.

I personally don’t believe that there are anywhere near 2,000 novel functional proteins in the human genome that have only just showed up in transcriptomics analysis, but the proteomics analysis was carried out to validate the transcriptomics analysis, not to show that these transcripts are really coding (yes, I know that sounds strange!). Also, there are probably some real proteins in there, because this analysis was carried out with GENCODE v26 and we are now on v41.

By the way, just a point on glycolysis genes. PFKP, PKM and ENO1 all produce at least one functional alternative splice isoform. But these alternative isoforms are not recent innovations. They are generated from mutually exclusively spliced tandem duplicated exons, and in each case the alternative exons pre-dated jawed vertebrates.

@SPARC – no, it doesn’t mean that, unfortunately. While the authors don’t say how many genes they validated with peptides, it will be a lot less than 20,000, so the denominator is not the whole genome. I also don’t think that they searched against the whole human gene set in their proteomics analysis.; Monday, August 22, 2022 7:29:00 AM
Larry Moran said...: @Michael Tress
Why do you think this work was published in Nature? Do you think it makes a significant contribution to our understanding of the transcriptome? Do you know why the authors don’t ever mention the possibility of splicing errors and splice variants that have no biological significance? Is it because they have good evidence that most of their splice variants are important?

Up until now, genome annotators have rejected most transcript variants, although they have still included many that don’t appear to be significant. I fear that there will be pressure to include many more on the grounds that long-read sequencing is more revealing. This could be a disaster for all kinds of studies that rely on genome data. Even now it’s often difficult to figure out which version of a gene is the correct one. You can’t even get a reliable count of the amount of coding DNA.; Monday, August 22, 2022 8:49:00 AM
Michael Tress said...: Novel transcripts with long read sequence support will be annotated, and in large numbers. Annotators are aware of this, and there are proposals to rank the data in some way (TRIFID already does this).

We have shown that almost all genes have a single main protein isoform, and you are right that it is important to know which it is. We have already shown that APPRIS works well here, and Ensembl/GENCODE and RefSeq have developed MANE Select for the human genome. We have two papers coming out that show that when APPRIS and MANE agree (something like 94% of coding genes) they are a rock solid prediction of the main protein isoform. They capture all but 1.8% of peptides, and all but 0.04% of pathogenic mutations that affect coding exons. So there are means of finding reference transcripts/isoforms.

As to why this was published in Nature? I guess because it was a large-scale study carried out on a novel theme, and because there's a lot of interesting data in there. Even though you and I may not agree with the proteomics section.; Monday, August 22, 2022 9:31:00 AM
Larry Moran said...: @Michael Tress

I find it difficult to know whether the data is interesting or not unless I know whether they are dealing with splicing errors or real variants. Am I missing something?

Do you have an estimate of the total amount of coding DNA based on identifying the main isoforms?

Are Ensembl/GENCODE and RefSeq going to include annotations that identify the most common isoform for each protein-coding gene? Is there a way of doing this for noncoding genes?; Monday, August 22, 2022 11:13:00 AM
Michael Tress said...: From what I remember the last time I checked approximately 10% of coding bases were exclusive to alternative coding CDS, so 90% maps to the main transcripts (using APPRIS to select the main transcript).

In the Ensembl web pages both the APPRIS principal isoform and MANE Select transcript are marked for each gene in the human reference set. RefSeq only references MANE, but one can download the list of MANE/APPRIS agreements for both RefSeq and Ensembl from the APPRIS web site. APPRIS is also available for model species other than human.

The thinking behind MANE Select is not necessarily the most expressed variant (though it often is), it is somehow a hybrid of expression, conservation, length and clinical information.

There are no plans to repeat this for non-coding genes.; Monday, August 22, 2022 1:22:00 PM
Anonymous said...: Why is splicing inaccurate? What is the cost of more accurate splicing vs the cost of splicing errors vs the benefits. Are other processes as inaccurate as splicing?; Friday, August 26, 2022 10:29:00 AM
Mikkel Rumraket Rasmussen said...: @Anonymous "Why is splicing inaccurate?"

Lynch M, Hagner K. Evolutionary meandering of intermolecular interactions along the drift barrier. Proc Natl Acad Sci U S A. 2015;112(1):E30-E38. doi:10.1073/pnas.1421641112; Friday, August 26, 2022 2:37:00 PM
Graham Jones said...: This preprint may be of interest.

Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans
Florian Benitiere, Anamaria Necsulea, Laurent Duret
https://www.biorxiv.org/content/10.1101/2022.12.09.519597v2; Monday, April 10, 2023 8:06:00 AM
Larry Moran said...: @Graham Jones

The preprint was posted four months ago. I usually try and wait until the paper appears in a peer-reviewed journal before commenting about it, especially in cases where the conclusions and/or the methodology is controversial.

In this case, the senior author has an excellent tract record so I expect that the paper will appear very soon.; Monday, April 10, 2023 10:25:00 AM

Quotations

The old argument of design in nature, as given by Paley, which formerly seemed to me to be so conclusive, fails, now that the law of natural selection has been discovered. We can no longer argue that, for instance, the beautiful hinge of a bivalve shell must have been made by an intelligent being, like the hinge of a door by man. There seems to be no more design in the variability of organic beings and in the action of natural selection, than in the course which the wind blows.Charles Darwin (c1880)

Although I am fully convinced of the truth of the views given in this volume, I by no means expect to convince experienced naturalists whose minds are stocked with a multitude of facts all viewed, during a long course of years, from a point of view directly opposite to mine. It is so easy to hide our ignorance under such expressions as "plan of creation," "unity of design," etc., and to think that we give an explanation when we only restate a fact. Any one whose disposition leads him to attach more weight to unexplained difficulties than to the explanation of a certain number of facts will certainly reject the theory.

Charles Darwin (1859)

Science reveals where religion conceals. Where religion purports to explain, it actually resorts to tautology. To assert that "God did it" is no more than an admission of ignorance dressed deceitfully as an explanation...

Peter Atkins

Quotations

The world is not inhabited exclusively by fools, and when a subject arouses intense interest, as this one has, something other than semantics is usually at stake. Stephen Jay Gould (1982)
I have championed contingency, and will continue to do so, because its large realm and legitimate claims have been so poorly attended by evolutionary scientists who cannot discern the beat of this different drummer while their brains and ears remain tuned to only the sounds of general theory. Stephen Jay Gould (2002) p.1339
The essence of Darwinism lies in its claim that natural selection creates the fit. Variation is ubiquitous and random in direction. It supplies raw material only. Natural selection directs the course of evolutionary change. Stephen Jay Gould (1977)
Rudyard Kipling asked how the leopard got its spots, the rhino its wrinkled skin. He called his answers "just-so stories." When evolutionists try to explain form and behavior, they also tell just-so stories—and the agent is natural selection. Virtuosity in invention replaces testability as the criterion for acceptance. Stephen Jay Gould (1980)
Since 'change of gene frequencies in populations' is the 'official' definition of evolution, randomness has transgressed Darwin's border and asserted itself as an agent of evolutionary change. Stephen Jay Gould (1983) p.335
The first commandment for all versions of NOMA might be summarized by stating: "Thou shalt not mix the magisteria by claiming that God directly ordains important events in the history of nature by special interference knowable only through revelation and not accessible to science." In common parlance, we refer to such special interference as "miracle"—operationally defined as a unique and temporary suspension of natural law to reorder the facts of nature by divine fiat. Stephen Jay Gould (1999) p.84

Quotations

My own view is that conclusions about the evolution of human behavior should be based on research at least as rigorous as that used in studying nonhuman animals. And if you read the animal behavior journals, you'll see that this requirement sets the bar pretty high, so that many assertions about evolutionary psychology sink without a trace.

Jerry Coyne
Why Evolution Is True

I once made the remark that two things disappeared in 1990: one was communism, the other was biochemistry and that only one of them should be allowed to come back.

Sydney Brenner
TIBS Dec. 2000

It is naïve to think that if a species' environment changes the species must adapt or else become extinct.... Just as a changed environment need not set in motion selection for new adaptations, new adaptations may evolve in an unchanging environment if new mutations arise that are superior to any pre-existing variations

Douglas Futuyma

One of the most frightening things in the Western world, and in this country in particular, is the number of people who believe in things that are scientifically false. If someone tells me that the earth is less than 10,000 years old, in my opinion he should see a psychiatrist.

Francis Crick

There will be no difficulty in computers being adapted to biology. There will be luddites. But they will be buried.

Sydney Brenner

An atheist before Darwin could have said, following Hume: 'I have no explanation for complex biological design. All I know is that God isn't a good explanation, so we must wait and hope that somebody comes up with a better one.' I can't help feeling that such a position, though logically sound, would have left one feeling pretty unsatisfied, and that although atheism might have been logically tenable before Darwin, Darwin made it possible to be an intellectually fulfilled atheist

Richard Dawkins

Another curious aspect of the theory of evolution is that everybody thinks he understand it. I mean philosophers, social scientists, and so on. While in fact very few people understand it, actually as it stands, even as it stood when Darwin expressed it, and even less as we now may be able to understand it in biology.

Jacques Monod

The false view of evolution as a process of global optimizing has been applied literally by engineers who, taken in by a mistaken metaphor, have attempted to find globally optimal solutions to design problems by writing programs that model evolution by natural selection.

Richard Lewontin

More Recent Comments

Sunday, August 21, 2022

Splicing errors or alternative splicing?

12 comments :