More Recent Comments

Sunday, August 21, 2022

Splicing errors or alternative splicing?

The most important issue in alternative splicing, in my opinion, is whether splice variants are due to splicing errors (= junk RNA) or whether they reflect real biologically relevant alternative splicing.

Unfortunately, this view is not shared by the majority of scientists who work in this field. They are convinced that the vast majority of splice variant transcripts represent real examples of regulation and the main task is to document the extent of alternative splicing and characterize the various mechanisms.

I've written a lot about this topic over the years (see the list of posts at the bottom of this page). The two most important issues are: (1) the frequency of splicing errors and whether it can account for the splice variants and (2) the number of well-established, genuine, examples of biologically relevant alternative splicing and whether that's consistent with the claims.

I managed to post a summary of the data on the accuracy of splicing on the Intron article on Wikipedia and I urge you to take a look at it before it disappears. The bottom line is that splicing is not terribly accurate so we expect to detect a fairly high level of incorrectly spliced transcripts whenever we look at a collection of RNAs from a particular cell line. The expected number of mispliced transcripts is well within the concentrations of 'alternatively spliced' transcripts reported in most studies.

Several groups have attempted to confirm whether splice variants are producing different isoforms of the same protein as alternative splicing would predict. The results have been disappointing—it looks like fewer than 5% of all genes exibit genuine alternative splicing and, even then, there are only two variants. One of the best papers is from a former graduate student at the University of British Columbia, Shamsuddin Bhuiyan (Bhuiyan et al. 2018). He scoured the literature for genuine, well-documented examples of alternative splicing and came up with just a handful that pass muster. His thesis also addressed whether new long-read sequencing techniques would be better than the old short-read techniques at detecting real alternative splicing. The answer is no.

In light of these facts, you might think that most workers in the field would recognize that splicing errors are a serious problem. They don't. They persist in equating their splice variants with real examples of alternative splicing and promoting the idea that humans are capable of making many more proteins than you would predict by looking at the number of protein-coding genes.

This idea is connected the The Deflated Ego Problem, which features prominently in my upcoming book. Abundant alternative splicing is one way of explaining why humans can be so much more complex than nematodes even though they have the same number of protein-coding genes.

The latest contribution to the field is a paper that was published in the August 11, 2022 issue of Nature. It's from the labs of Tuuli Lappalainen at Columbia University (New York, NY USA) and Beryl Cummings at the Broad Institute (MIT, Boston MA USA).

Glinos, D. A., Garborcauskas, G., Hoffman, P., Ehsan, N., Jiang, L., Gokden, A., et al. (2022) Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608:353-360. [doi: 10.1038/s41586-022-05035-y]

Regulation of transcript structure generates transcript diversity and plays an important role in human disease. The advent of long-read sequencing technologies offers the opportunity to study the role of genetic variation in transcript structure. In this Article, we present a large human long-read RNA-seq dataset using the Oxford Nanopore Technologies platform from 88 samples from Genotype-Tissue Expression (GTEx) tissues and cell lines, complementing the GTEx resource. We identified just over 70,000 novel transcripts for annotated genes, and validated the protein expression of 10% of novel transcripts. We developed a new computational package, LORALS, to analyse the genetic effects of rare and common variants on the transcriptome by allele-specific analysis of long reads. We characterized allele-specific expression and transcript structure events, providing new insights into the specific transcript alterations caused by common and rare genetic variants and highlighting the resolution gained from long-read data. We were able to perturb the transcript structure upon knockdown of PTBP1, an RNA binding protein that mediates splicing, thereby finding genetic regulatory effects that are modified by the cellular environment. Finally, we used this dataset to enhance variant interpretation and study rare variants leading to aberrant splicing patterns.

They are taking advantage of the new long-read sequencing technology to add to the list of human splice variants that have been deposited in various databases. They discovered 71,735 novel transcripts that had not been previously reported.

Unlike many other papers, the authors attempted to determine if these splice variants produced a novel protein by looking for predicted peptides in the mass spec data. They restricted their search to only 20,948 of the novel transcripts because they used a cutoff concentration of greater or equal to five transcripts per million. They found 2,575 hits, which corresponds to 12% of the candidates and 3.6% of all novel transcripts.

We don't know how many of the hits in the mass spec data correspond to genuine protein isoforms.

My way of looking at this data confirms the view that a large percentage of splice variants is due to nonfunctional errors in splicing. The authors of the paper do not mention that possibility; instead, they focus on the number of novel variants they detected.

In this study, we present a large dataset of long-read RNA-seq, using material derived from cell lines and human tissues collected by the GTEx project. We identified 71,735 novel transcripts, which is high compared to other long-read studies probably because of our large sample size and tissue diversity, consistent with the high number of tissue-specific novel transcripts discovered. Supported by a high validation rate of the novel transcripts in high-throughput mass spectrometry proteome data, our data make an important contribution to human transcript annotations. Expanding long-read studies to further tissues and cell types, coupled with more extensive validation efforts, will enable a better understanding of the regulatory mechanisms of the different types of transcript changes, the functionally distinct protein isoforms that different transcripts can give rise to and the improved variant annotation, as demonstrated by our analysis.

It's difficult for me to see why discovering more and more examples of splicing errors would make "an important contribution to human transcript annotations." The most important annotation is whether the transcript is due to splicing error or genuine alternative splicing and they can't do that for 96.4% of their novel transcripts.

Blog posts on alternative splicing


Bhuiyan, S. A., Ly, S., Phan, M., Huntington, B., Hogan, E., Liu, C. C., . . . Pavlidis, P. (2018) Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics 19:637. [doi: 10.1186/s12864-018-5013-2]

12 comments :

SPARC said...

What always surprises me is that nobody is wondering where all these alternative transcripts kept hiding in the old pre-PCR/RNAseq/mass spec days. How can these people seriously teach about metabolic and signaling pathways while at the same time claiming that complete layers of complexity necessary for understanding life have never been addressed properly?

Larry Moran said...

@SPARC

I bet you were surprised to learn that the genes for every enzyme in the glycolysis pathway and the citric acid cycle produce several different variants by alternative splicing. Not only that, the labs that studied those enzymes for decades never discovered those variants and never appreciated the complexity that made us human.

Isn’t genomics amazing!

SPARC said...

According to the source cited in the supplements they isolated RNA from whole cells. Thus, by default their sample will contain many RNAs that will never make it into the cytoplasm to get translated.
In addition, doesn't their statement "For 608 genes we validated more than one transcript (1,304 total), with 823 transcripts being novel, often detecting tissue-specific protein transcript validation (Supplementary Table 5 and Extended Data Fig. 5e)." exactly mean "it looks like fewer than 5% of all genes exibit genuine alternative splicing and, even then, there are only two variants." as you've stated in your post.

Michael Tress said...

Long read sequencing is a reality that will be reflected in the genome annotations before long. Genes are going to be annotated with a lot more splice variants. To separate transcripts that produce functionally important proteins from those that produce less functionally important proteins (or none at all) is why we developed APPRIS and TRIFID.

I personally don’t believe that there are anywhere near 2,000 novel functional proteins in the human genome that have only just showed up in transcriptomics analysis, but the proteomics analysis was carried out to validate the transcriptomics analysis, not to show that these transcripts are really coding (yes, I know that sounds strange!). Also, there are probably some real proteins in there, because this analysis was carried out with GENCODE v26 and we are now on v41.

By the way, just a point on glycolysis genes. PFKP, PKM and ENO1 all produce at least one functional alternative splice isoform. But these alternative isoforms are not recent innovations. They are generated from mutually exclusively spliced tandem duplicated exons, and in each case the alternative exons pre-dated jawed vertebrates.

@SPARC – no, it doesn’t mean that, unfortunately. While the authors don’t say how many genes they validated with peptides, it will be a lot less than 20,000, so the denominator is not the whole genome. I also don’t think that they searched against the whole human gene set in their proteomics analysis.

Larry Moran said...

@Michael Tress
Why do you think this work was published in Nature? Do you think it makes a significant contribution to our understanding of the transcriptome? Do you know why the authors don’t ever mention the possibility of splicing errors and splice variants that have no biological significance? Is it because they have good evidence that most of their splice variants are important?

Up until now, genome annotators have rejected most transcript variants, although they have still included many that don’t appear to be significant. I fear that there will be pressure to include many more on the grounds that long-read sequencing is more revealing. This could be a disaster for all kinds of studies that rely on genome data. Even now it’s often difficult to figure out which version of a gene is the correct one. You can’t even get a reliable count of the amount of coding DNA.

Michael Tress said...

Novel transcripts with long read sequence support will be annotated, and in large numbers. Annotators are aware of this, and there are proposals to rank the data in some way (TRIFID already does this).

We have shown that almost all genes have a single main protein isoform, and you are right that it is important to know which it is. We have already shown that APPRIS works well here, and Ensembl/GENCODE and RefSeq have developed MANE Select for the human genome. We have two papers coming out that show that when APPRIS and MANE agree (something like 94% of coding genes) they are a rock solid prediction of the main protein isoform. They capture all but 1.8% of peptides, and all but 0.04% of pathogenic mutations that affect coding exons. So there are means of finding reference transcripts/isoforms.

As to why this was published in Nature? I guess because it was a large-scale study carried out on a novel theme, and because there's a lot of interesting data in there. Even though you and I may not agree with the proteomics section.

Larry Moran said...

@Michael Tress

I find it difficult to know whether the data is interesting or not unless I know whether they are dealing with splicing errors or real variants. Am I missing something?

Do you have an estimate of the total amount of coding DNA based on identifying the main isoforms?

Are Ensembl/GENCODE and RefSeq going to include annotations that identify the most common isoform for each protein-coding gene? Is there a way of doing this for noncoding genes?

Michael Tress said...

From what I remember the last time I checked approximately 10% of coding bases were exclusive to alternative coding CDS, so 90% maps to the main transcripts (using APPRIS to select the main transcript).

In the Ensembl web pages both the APPRIS principal isoform and MANE Select transcript are marked for each gene in the human reference set. RefSeq only references MANE, but one can download the list of MANE/APPRIS agreements for both RefSeq and Ensembl from the APPRIS web site. APPRIS is also available for model species other than human.

The thinking behind MANE Select is not necessarily the most expressed variant (though it often is), it is somehow a hybrid of expression, conservation, length and clinical information.

There are no plans to repeat this for non-coding genes.

Anonymous said...

Why is splicing inaccurate? What is the cost of more accurate splicing vs the cost of splicing errors vs the benefits. Are other processes as inaccurate as splicing?

Mikkel Rumraket Rasmussen said...

@Anonymous "Why is splicing inaccurate?"


Lynch M, Hagner K. Evolutionary meandering of intermolecular interactions along the drift barrier. Proc Natl Acad Sci U S A. 2015;112(1):E30-E38. doi:10.1073/pnas.1421641112

Graham Jones said...

This preprint may be of interest.

Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans
Florian Benitiere, Anamaria Necsulea, Laurent Duret
https://www.biorxiv.org/content/10.1101/2022.12.09.519597v2

Larry Moran said...

@Graham Jones

The preprint was posted four months ago. I usually try and wait until the paper appears in a peer-reviewed journal before commenting about it, especially in cases where the conclusions and/or the methodology is controversial.

In this case, the senior author has an excellent tract record so I expect that the paper will appear very soon.