Monday, April 01, 2019

The frequency of splicing errors reflects the balance between selection and drift

Splice variants are very common in eukaryotes. We know that it's possible to detect dozens of different splice variants for each gene with multiple introns. In the past, these variants were thought to be examples of differential regulation by alternative splicing but we now know that most of them are due to splicing errors. Most of the variants have been removed from the sequence databases but many remain and they are annotated as examples of alternative splicing, which implies that they have a biological function.

I have blogged about splice variants many times, noting that alternative splicing is a very real phenomenon but it's probably restricted to just a small percentage of genes. Most of the splice variants that remain in the databases are probably due to splicing errors. They are junk RNA [The persistent myth of alternative splicing].

The ongoing controversy over the origin of splice variants is beginning to attract attention in the scientific literature although it's fair to say that most scientists are still unaware of the controversy. They continue to believe that abundant alternative splicing is a real phenomenon and they don't realize that the data is more compatible with abundant splicing errors.

Some molecular evolution labs have become interested in the controversy and have devised tests of the two possibilities. I draw your attention to a paper that was published 18 months ago.
Saudemont, B., Popa, A., Parmley, J. L., Rocher, V., Blugeon, C., Necsulea, A., Meyer, E., and Duret, L. (2017) The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome biology, 18:208. [doi: 10.1186/s13059-017-1344-6]

Most eukaryotic genes are subject to alternative splicing (AS), which may contribute to the production of protein variants or to the regulation of gene expression via nonsense-mediated messenger RNA (mRNA) decay (NMD). However, a fraction of splice variants might correspond to spurious transcripts and the question of the relative proportion of splicing errors to functional splice variants remains highly debated.

We propose a test to quantify the fraction of AS events corresponding to errors. This test is based on the fact that the fitness cost of splicing errors increases with the number of introns in a gene and with expression level. We analyzed the transcriptome of the intron-rich eukaryote Paramecium tetraurelia. We show that in both normal and in NMD-deficient cells, AS rates strongly decrease with increasing expression level and with increasing number of introns. This relationship is observed for AS events that are detectable by NMD as well as for those that are not, which invalidates the hypothesis of a link with the regulation of gene expression. Our results show that in genes with a median expression level, 92–98% of observed splice variants correspond to errors. We observed the same patterns in human transcriptomes and we further show that AS rates correlate with the fitness cost of splicing errors.

These observations indicate that genes under weaker selective pressure accumulate more maladaptive substitutions and are more prone to splicing errors. Thus, to a large extent, patterns of gene expression variants simply reflect the balance between selection, mutation, and drift.
This is another example of a well-written paper that explains the controversy and the two competing explanations; namely, functional alternative splicing and splicing errors. The authors suggest a test that might help distinguish between these two possibilities.
We propose here a test to quantify the fraction of splice variants corresponding to errors, i.e. having a negative impact on the fitness of organisms. The basis of this test is that the strength of splice signals is expected to reflect a balance between selection (which favors alleles that are optimal for splicing efficiency) and mutation and random genetic drift (which can lead to the fixation of non-optimal alleles). This selection-mutation-drift equilibrium therefore predicts a higher splicing accuracy at introns where errors are more deleterious for the fitness of organisms. Hence, if [splice variants] predominantly correspond to splicing errors, one should expect a negative correlation between the rate of [splice variant] events and their cost in terms of resource allocation (metabolic cost, mobilization of cellular machineries). The noisy splicing model therefore makes several specific predictions regarding the [splice variant] rate according to whether splice variants are detectable by NMD and according to the expression level, length, and number of introns of genes.1
They carry out their main test using genes in Paramecium tetraurelia because this organism has short introns (20-35 bp) that can be covered in single RNA-seq reads. Then they apply the same test to human genes and conclude ...
For a given error rate, errors are expected to be more costly (in terms of metabolic resources and mobilization of cellular machineries) in highly expressed genes. Hence the fitness cost of mis-splicing is expected to increase with increasing expression level. Indeed, this is precisely what we observed in humans: the strength of selection against deleterious mutations at splice sites is strongly correlated to gene expression level (Fig. 6b). Since the risk of producing erroneous transcripts increases with the number of introns, this implies that all else being equal, there should be a stronger selective pressure against mis-splicing in intron-rich genes. The mutation-selection-drift theory therefore predicts that introns from weakly expressed/intron-poor genes should accumulate more non-optimal substitutions in their splice signals and therefore should show a higher splicing error rate. The relationships that we observe between [splice variant] rate, expression level, and intron number are perfectly consistent with these predictions, both in human (Fig. 5) and in paramecia (Fig. 3).
I'm not going to argue that this is a definitive answer to the problem but I'm pleased that more and more groups are promoting the idea that splicing errors are a viable explanation of the data. I'm also pleased that more attention is being paid to the fact that slightly deleterious events can persist in the population because they are effectively invisible to selection. This counters the prevailing narrative that everything we observe must be adaptive and functional.
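
A back-of-the-envelope way to see what "effectively invisible to selection" means is Kimura's diffusion approximation for the fixation probability of a new mutation. The sketch below is only an illustration: the effective population size of 10,000 and the selection coefficients are assumed values, not numbers from Saudemont et al., and the function name is made up for this example.

```python
import math

def fixation_probability(s, n_e):
    """Kimura's diffusion approximation for a new additive mutation.

    s   : selection coefficient (negative = deleterious)
    n_e : effective population size (diploid, assumed equal to census size)
    """
    if s == 0:
        return 1.0 / (2 * n_e)      # neutral expectation
    exponent = -4.0 * n_e * s
    if exponent > 700:              # exp() would overflow; probability is ~0
        return 0.0
    # (1 - e^{-2s}) / (1 - e^{-4 N_e s}), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)

N_E = 10_000  # assumed effective population size, for illustration only
neutral = fixation_probability(0.0, N_E)
for s in (-1e-7, -1e-6, -1e-5, -1e-4, -1e-3):
    relative = fixation_probability(s, N_E) / neutral
    print(f"s = {s:.0e}  4*Ne*s = {4 * N_E * s:.3g}  "
          f"fixation relative to neutral = {relative:.3g}")
```

With these assumed numbers, a mutation with s = -1e-6 fixes at about 98% of the neutral rate, while one with s = -1e-3 essentially never fixes. A splice-site mutation whose cost falls in the first range behaves as if it were neutral, which is the sense in which weakly expressed or intron-poor genes can accumulate non-optimal splice signals.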

Note: Saudemont et al. (2017) review the literature on the rate of splicing errors and note that it can be as high as 3%. My own review of the literature suggests that an error rate of this magnitude is rare but splicing is still error-prone. I estimate that a typical splice site is only 99.9% effective and, in addition, inappropriate splice sites are activated about 0.1% of the time in a typical human gene. Saudemont et al. alerted me to a paper by Stepankiw et al. (2015) that I hadn't read before. Those authors presented evidence that 1% of all transcripts are incorrectly spliced due to errors in the spliceosome reaction.
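
To see how these per-site estimates add up across a gene, here is a minimal sketch. It assumes that each intron fails independently with probability 0.001 (one reading of "99.9% effective") and that an inappropriate site is activated with probability 0.001 per transcript. The independence assumption and the function name are illustrative choices, not something taken from the papers cited here.

```python
def fraction_misspliced(n_introns, intron_failure=0.001, cryptic_activation=0.001):
    """Fraction of transcripts carrying at least one splicing error.

    Assumes each intron fails independently with probability intron_failure
    and that an inappropriate (cryptic) site is activated once per transcript
    with probability cryptic_activation. Both defaults are the rough
    estimates quoted in the note above.
    """
    correct = (1.0 - intron_failure) ** n_introns * (1.0 - cryptic_activation)
    return 1.0 - correct

for n in (1, 5, 10, 20, 50):
    print(f"{n:>2} introns: {100 * fraction_misspliced(n):.2f}% of transcripts mis-spliced")
```

Under these assumptions a gene with 10 introns already produces about 1.1% mis-spliced transcripts, so a small per-site error rate is enough to generate a steady background of splice variants in intron-rich genes.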

1. The authors refer to all splice variants as examples of alternative splicing (AS). I think this is confusing since the term "alternative splicing" has been used for decades to refer to real examples of differential splicing with a biological function. I think we should reserve that term for biologically meaningful examples of splice variants as opposed to variants due to splicing errors.

Stepankiw, N., Raghavan, M., Fogarty, E. A., Grimson, A., and Pleiss, J.A. (2015) Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic acids research, 43:8488-8501. [doi: 10.1093/nar/gkv763]


  1. Great read!

    Where is this 3% splicing error rate stated: "Saudermont et al. (2019) review the literature on the rate of splicing errors and note that it can be as high as 3%"?

    I can't find a Saudemont paper from 2019, nor can I find the figure in their 2017 Paramecium paper. I have seen estimates between 0.1% and 1%, so I'm curious where a 3% splicing error rate is stated.

  2. Hi Professor Moran,
    What do you think about this paper?

    In the human genome, more than 4.5 million sequences can be readily identified as derived from transposable elements (TEs), accounting for at least 50% of its DNA content.

    Long discarded as junk DNA, TEs are increasingly recognized as major motors of genome evolution.

  3. Sort of an offshoot, but I'm curious what you think about Wright's shifting balance theory of evolution here. It seems like it has a lot of potential to explain both macroevolutionary processes and population-genetic-level processes.

  4. This is excellent! I'm glad to hear of this controversy, it reminds me of hidden dogmas that I probably take for granted.