Wednesday, April 08, 2020

Alternative splicing: function vs noise

This post is about a recent review of alternative splicing published by my colleague Ben Blencowe in the Dept. of Medical Genetics at the University of Toronto (Toronto, Ontario, Canada). The other author is Jernej Ule of The Francis Crick Institute in London (UK). They are strong supporters of the idea that alternative splicing is a common feature of most human genes.

I am a strong supporter of the idea that most splice variants are due to splicing errors and only a few percent of human genes undergo true alternative splicing.

This is a disagreement about the definition of "function." Is the mere existence of multiple splice variants evidence that they are biologically relevant (functional) or should we demand evidence of function—such as conservation—before accepting such a claim?

Background: what are splice variants?

Let me begin by defining some terms. Modern techniques are capable of detecting specific RNA molecules that may be present at less than one copy per cell. By scanning many different tissues, workers have compiled extensive lists of transcripts that are complementary to various parts of the genome. This gives rise to the idea of pervasive transcription and that was one of the reasons why ENCODE researchers claimed that most of our genome is functional.

Most knowledgeable scientists now agree that many of those transcripts are spurious transcripts produced by accidental transcription. Many of those transcripts overlap with known genes and the primary transcript will be processed by splicing if it overlaps a splice site. This gives rise to transcripts that are characterized as splice variants and those transcripts are not so easily dismissed as mistakes by workers in the field of alternative splicing. That's because alternative splicing is a real phenomenon that has been well-studied in a few genes since the early 1980s.

I restrict the term "alternative splicing" to those situations where the alternate transcripts are known to be biologically relevant, or when we have a strong reason to suspect true alternative splicing. In situations where the transcript variants don't have the characteristics of true alternative splicing, and where there's no evidence of biological relevance, I will refer to those transcripts as "transcript variants" or "splice variants." This differs from standard usage in the field where all the splice variants are automatically assumed to be examples of true alternative splicing.1

It's hard to find a modern up-to-date database that lists all the variants for an individual gene but it seems from scanning old databases that there may be dozens of splice variants for most genes. One of the most widely quoted papers in the field is the Pan et al. (2008) paper from the Blencowe/Frey labs. This is the paper where they claim that 95% of human multiexon protein-coding genes are alternatively spliced and that there are, on average, "at least seven alternative splicing events" per gene.

I reject this terminology. I would say there are at least seven splice variants per gene and it remains to be seen whether they are examples of splicing errors or true alternative splicing. Nevertheless, in spite of the lack of supporting evidence (other than the mere existence of splice variants), this paper is widely quoted as evidence of pervasive alternative splicing.

An example of splice variants

The top figure below shows some of the splice variants for the human triose phosphate isomerase gene (TPI1) from the Ensembl: human database. I think these are only a small subset of the variants that have been reported for this gene but even in this small subset you can see predictions of eight different proteins plus two variants that don't encode proteins.

The bottom figure is the same data for the mouse gene [Ensembl: mouse]. There are only three variants of the mouse TPI1 gene in the Ensembl database and only one of them is predicted to make a different protein, one that's missing the C-terminal half of the protein. Note that the patterns of transcript variants of the mouse and human genes are not the same. Production of these variants is not conserved in mammals.

Triose phosphate isomerase is an important metabolic enzyme found in all species, including bacteria. The enzyme catalyzes an important reaction in gluconeogenesis/glycolysis. The structure of the protein is well known and its function is well understood. It seems very unlikely that humans would make seven functional variants of this protein, especially since none of them are found in other mammals.

(Note: There seems to be an increasing reluctance to publish examples of transcript variants for specific genes. I can't recall when I've last seen any images like the ones I posted above. I wonder if this is because the proponents of alternative splicing are embarrassed to show representations of the data or whether they don't look at it themselves. I suspect the latter explanation. It seems as though workers in the field are increasingly relying on bioinformatic analysis of transcript variant databases without ever actually looking at specific genes to see if the databases make sense. It's time to re-issue my Challenge to Fans of Alternative Splicing.)

The Deflated Ego Problem

The controversy over the frequency of alternative splicing is related to something I call The Deflated Ego Problem. The "problem" is based on the view that humans are extraordinarily complex compared to other species and that this complexity should be reflected in the number of genes. Many scientists were "shocked" to discover that humans don't have very many more genes than the nematode Caenorhabditis elegans and even fewer genes than some flowering plants.

In order to preserve their view of human exceptionalism, these shocked scientists have been forced to come up with an explanation for this "anomaly." I listed seven of these explanations in the Deflated Ego post but the one I want to draw your attention to is alternative splicing. The idea is that while humans may not have a lot more genes than nematodes, they make much better use of those genes by producing multiple proteins from each gene. Thus, the complexity of humans is explained by alternative splicing and not by an increase in the number of genes.

The lack of genes is often referred to as the G-value paradox (see Deflated egos and the G-value paradox). It's only a problem if you haven't been following the work of developmental biologists over the past forty years. They have established that complexity and species differences are usually explained by changes in how genes are regulated and not by large increases in the evolution of new genes [Revisiting the deflated ego problem]. There is no "problem" and scientists should not have been shocked.2

Here's an explicit explanation of the imaginary problem as expressed by Gil Ast in a 2005 Scientific American article (Ast, 2005).
The Alternative Genome

The old axiom "one gene, one protein" no longer holds true. The more complex an organism, the more likely it became that way by extracting multiple protein meanings from individual genes

When a first draft of the human sequence was published the following summer, some observers were therefore shocked by the sequencing team's calculation of 30,000 to 35,000 protein-coding genes. The low number seemed almost embarrassing. In the years since, the human genome map has been finished and the gene estimate has been revised downward still further, to fewer than 25,000. During the same period, however, geneticists have come to understand that our low count might actually be viewed as a mark of our sophistication because humans make such incredibly versatile use of so few genes.

Through a mechanism called alternative splicing, the information stored in the genes of complex organisms can be edited in a number of ways, making it possible for a single gene to specify two or more distinct proteins. As scientists compare the human genome to those of other organisms, they are realizing the extent to which alternative splicing accounts for much of the diversity among organisms with relatively similar gene sets ....

Indeed, the prevalence of alternative splicing appears to increase with an organism's complexity—as many as three quarters of all human genes are subject to alternative splicing. The mechanism itself probably contributed to the evolution of that complexity and could drive our further evolution.
This view has become standard dogma in the alternative splicing world so that almost every new paper begins with a reference to it as though it were established theory. It seems to be widely accepted that multiple versions of metabolic enzymes such as triose phosphate isomerase will explain human complexity.3

But it is not a fact that most genes exhibit some form of alternative splicing; it's merely speculation designed to assuage deflated egos. Furthermore, the explanation relies on the assumption that less complex animals must make fewer proteins from a similar set of genes. Recent experiments have shown that this assumption is false so the whole argument falls apart [Alternative splicing in the nematode C. elegans].

Explain these facts

Here's a modified list of things that need explaining if you think that alternative splicing is widespread in humans. The original list was posted more than a year ago [The persistent myth of alternative splicing].
  • Splicing is associated with a known error rate that's consistent with the production of frequent spurious splice variants. Explain why this fact is ignored.
  • The unusual transcript variants are usually present at less than one copy per cell. Explain how thousands of such rare transcripts could have a function.
  • The unusual transcript variants are rapidly degraded and usually don't leave the nucleus. What is their function?
  • The transcripts are not conserved, as expected if they are splicing errors. Give a rational evolutionary explanation for why we should ignore the lack of sequence conservation.
  • In the vast majority of cases, the predicted protein products of these transcripts have never been detected. Explain that.
  • The number of different unusual transcripts produced from each gene makes it extremely unlikely that they could all be biologically relevant. Explain how such strange transcripts, and even stranger protein variants, could have evolved.
  • The number of detectable transcripts correlates with the length of the gene and the number of introns, which is consistent with splicing errors. Explain how this is consistent with biologically relevant alternative splicing.
  • Gene annotators who have looked closely at the data have determined that >90% of them are spurious junk RNA or noise and they have not been included in the standard reference database. Why do genome annotators dismiss most splice variants?
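
The first and seventh points can be tied together with a bit of arithmetic. This is only a rough sketch, not a model taken from any of the papers cited here: it assumes each intron is mis-spliced independently with the same probability, and the error rate used below is a made-up placeholder, not a measured value. Under those assumptions, the fraction of transcripts carrying at least one splicing error is 1 - (1 - p)^n for a gene with n introns, so genes with more introns should yield more spurious variants even if nothing is being regulated.

```python
def fraction_misspliced(per_intron_error_rate: float, n_introns: int) -> float:
    """Fraction of transcripts with at least one splicing error.

    Assumes errors at different introns are independent and equally likely,
    which is a simplification made only for illustration.
    """
    if not 0.0 <= per_intron_error_rate <= 1.0:
        raise ValueError("per_intron_error_rate must be between 0 and 1")
    if n_introns < 0:
        raise ValueError("n_introns must be non-negative")
    return 1.0 - (1.0 - per_intron_error_rate) ** n_introns


# Hypothetical per-intron error rate, chosen only to show the trend.
ASSUMED_ERROR_RATE = 0.01

for n in (1, 5, 10, 50):
    share = fraction_misspliced(ASSUMED_ERROR_RATE, n)
    print(f"{n:>3} introns: {share:.1%} of transcripts with at least one error")
```

Whatever error rate is plugged in, the expected share of aberrant transcripts rises with the number of introns, which is the pattern a splicing-error explanation predicts.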

The Ule and Blencowe paper of 2019

This brings me, finally, to the paper I want to discuss. It was published last October (2019).
Ule, J., and Blencowe, B.J. (2019) Alternative splicing regulatory networks: Functions, mechanisms, and evolution. Molecular Cell, 76:329-345. [doi: 10.1016/j.molcel.2019.09.017]
This review article begins with the statement that "Transcripts from nearly all human protein-coding genes undergo one or more forms of alternative splicing ...." This statement is misleading, at best. I could easily make the case that nearly all genes produce multiple transcript variants but most of them are due to splicing errors. The interesting question is how many of them might, instead, be due to biologically relevant alternative splicing. The burden of proof is on those who claim functionality and, in the absence of evidence of function, the default assumption is junk RNA.

Most of the review article deals with the variety of RNA-binding and DNA-binding proteins that give rise to splice variants. I don't find this very interesting since it's not clear whether these are spurious binding events that give rise to errors in splicing or whether they are biologically relevant.

The authors clearly believe that alternative splicing "... accounts for the vast range of biological complexity and phenotypic attributes across metazoan species." They conclude that, "... it is becoming clear that alternative splicing has been particularly important for enriching proteomic complexity in animals in ways that have provided an expanded toolkit for evolution."

It's important to note that the authors are aware of the fact that the pattern of production of splice variants is not conserved between species. In fact, they explicitly mention this point in support of their claim that "... alternative splice patterns have diverged rapidly among species." They believe that the lack of conservation can be explained away by postulating rapid selection such that the patterns of thousands of genes are different, even between closely related species. This is a common rationale (rapid selection for divergence) used to dismiss the lack of sequence conservation.

The other interpretation, of course, is that most of the splice variants are due to splicing errors and that's why they are not conserved (see Using conservation to determine whether splice variants are functional for an extended discussion of this issue).

The most interesting part of the review paper, in my opinion, is the section called "Function versus Noise or Evolutionary Fodder." This is the part of the paper that deals with the controversy and it's good to see it finally addressed since most papers on alternative splicing ignore it. Here's how Ule and Blencowe begin this section ....
As the number of alternative splicing events detected in large-scale sequencing studies continues to rise, it has been argued that only a minor fraction of splice variants are regulated or translated or are of functional importance (Tress et al., 2017).
The paper they reference (Tress et al., 2017a) was covered in an earlier post on this blog [Debating alternative splicing (part II)]. What Tress et al. did was to use mass spectrometry to look for the protein variants predicted by alternative splicing. The authors analyzed the results of eight large-scale experiments and reached the following conclusions ...
Alternative splicing is well documented at the transcript level, and microarray and RNA-seq experiments routinely detect evidence for many thousands of splice variants. However, large-scale proteomics experiments identify few alternative isoforms. The gap between the numbers of alternative variants detected in large-scale transcriptomics experiments and proteomics analyses is real and is difficult to explain away as a purely technical phenomenon. While alternative splicing clearly does contribute to the cellular proteome, the proteomics evidence indicates that it is not as widespread a phenomenon as suggested by transcript data. In particular, the popular view that alternative splicing can somehow compensate for the perceived lack of complexity in the human proteome is manifestly wrong. [my emphasis LAM]

... The results from large-scale proteomics experiments are in line with evidence from cross-species conservation, human population variation studies, and investigations into the relative effect of gene expression and alternative splicing. Gene expression levels, not alternative splicing, seem to be the key to tissue specificity. While a small number of alternative isoforms are conserved across species, have strong tissue dependence, and are translated in detectable quantities, most have variable tissue specificities and appear to be evolving neutrally. This suggests that most annotated alternative variants are unlikely to have a functional cellular role as proteins. [my emphasis, LAM]
As you might have guessed, Ben Blencowe was unhappy with this result so he responded with a critical letter published in the same journal a few months later (Blencowe, 2017) [see Debating alternative splicing (Part IV)]. In that letter, he made the same points that he makes in the Ule and Blencowe review; namely that the mass spec experiments are flawed for technical reasons—they are not detecting protein variants that should be there. However, the authors do concede that, "... alternative splicing events lie on an evolving spectrum of regulation and functionality; therefore, it is very challenging to draw a line between those that are functional or non-functional."

Tress et al. responded to Blencowe's letter back in 2017 (Tress et al., 2017b). As experts in proteomics they were probably aware of all of the objections that Blencowe raised, and many more. After considering Blencowe's criticisms, they write, "We believe our conclusions are well substantiated and invite readers to judge for themselves in the article and related papers."

Resolving the controversy

I don't think it's possible to state conclusively that almost all human protein-coding genes produce protein variants by biologically-relevant alternative splicing. Scientists who make such claims are wrong because there's nothing to support such a claim other than wishful thinking. On the other hand, it's not possible to conclude that most splice variants are noise, although I firmly believe that the evidence tilts in the direction of noise. The appropriate null hypothesis is that the transcripts do not have a function and the burden of proof is on those who make the claim for function.

The main problems I have with the alternative splicing literature are: (1) that proponents of widespread alternative splicing are using questionable evolutionary arguments to rationalize their claim, and (2) that they are mostly ignoring any objections to their claims and refusing to acknowledge that they could be mistaken.

It's interesting that Ule and Blencowe do not address any of the other criticisms of alternative splicing. They only respond to one paper. Here's a short list of other papers they might have considered.
Bhuiyan, S.A., Ly, S., Phan, M., Huntington, B., Hogan, E., Liu, C.C., Liu, J., and Pavlidis, P. (2018) Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics, 19:637. [doi: 10.1186/s12864-018-5013-2]

Bitton, D.A., Atkinson, S. R., Rallis, C., Smith, G.C., Ellis, D.A., Chen, Y.Y., Malecki, M., Codlin, S., Lemay, J.-F., and Cotobal, C. (2015) Widespread exon skipping triggers degradation by nuclear RNA surveillance in fission yeast. Genome Research. [doi: 10.1101/gr.185371.114]

Hsu, S.-N., and Hertel, K.J. (2009) Spliceosomes walk the line: splicing errors and their impact on cellular function. RNA biology, 6:526-530. [doi: 10.4161/rna.6.5.986]

Melamud, E., and Moult, J. (2009a) Stochastic noise in splicing machinery. Nucleic acids research, gkp471. [doi: 10.1093/nar/gkp471]

Melamud, E., and Moult, J. (2009b) Structural implication of splicing stochastics. Nucleic acids research, gkp444. [doi: 10.1093/nar/gkp444]

Mudge, J.M., and Harrow, J. (2016) The state of play in higher eukaryote gene annotation. Nature Reviews Genetics, 17:758-772. [doi: 10.1038/nrg.2016.119]

Pickrell, J.K., Pai, A.A., Gilad, Y., and Pritchard, J.K. (2010) Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet, 6:e1001236. [doi: 10.1371/journal.pgen.1001236]

Saudemont, B., Popa, A., Parmley, J.L., Rocher, V., Blugeon, C., Necsulea, A., Meyer, E., and Duret, L. (2017) The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome biology, 18:208. [doi: 10.1186/s13059-017-1344-6]

Stepankiw, N., Raghavan, M., Fogarty, E.A., Grimson, A., and Pleiss, J.A. (2015) Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic acids research, 43:8488-8501. [doi: 10.1093/nar/gkv763]

Tress, M. L., Martelli, P. L., Frankish, A., Reeves, G. A., Wesselink, J. J., Yeats, C., Ísólfur Ólason, P., Albrecht, M., Hegyi, H., Giorgetti, A. et al. (2007) The implications of alternative splicing in the ENCODE protein complement. Proceedings of the National Academy of Sciences, 104:5495-5500. [doi: 10.1073/pnas.0700800104]

Zhang, Z., Xin, D., Wang, P., Zhou, L., Hu, L., Kong, X., and Hurst, L. D. (2009) Noisy splicing, more than expression regulation, explains why some exons are subject to nonsense-mediated mRNA decay. BMC biology, 7:23. [doi: 10.1186/1741-7007-7-23]

Debating alternative splicing (part I)
Debating alternative splicing (part II)
Debating alternative splicing (Part III)
Debating alternative splicing (Part IV)

1. Some authors recognize this problem but they solve it by distinguishing between functional alternative splicing and spurious alternative splicing. I don't think this is helpful.

2. They should not have been shocked for other reasons, as well.

3. I'm well aware of the fact that other types of genes could be alternatively spliced, especially genes involved in regulating gene expression. However, the proponents of alternative splicing do not single out specific types of genes; instead they claim that 90% of all genes are alternatively spliced. This must include thousands of conserved genes required for normal metabolic events. I focus attention on those genes to illustrate the absurdity of the claim.

Ast, G. (2005) The alternative genome. Scientific American, 292:58-65. [doi: 10.1038/scientificamerican0405-58]

Tress, M.L., Abascal, F., and Valencia, A. (2017a) Alternative splicing may not be the key to proteome complexity. Trends in biochemical sciences, 42:98-110. [doi: 10.1016/j.tibs.2016.08.008]

Tress, M.L., Abascal, F., and Valencia, A. (2017b) Most Alternative Isoforms Are Not Functionally Important. Trends in biochemical sciences, 42:408-410. [doi: 10.1016/j.tibs.2017.04.002]


Greg said...

Reposted with edits:

Thanks for an interesting blog post. It seems like this disagreement has hallmarks of an endless, unsolvable argument, similar to the neutralist-selectionist debate. Just like in that debate it’s not clear how it could ever be resolved (let’s say we determine beyond a shadow of doubt that 25% of all substitutions are adaptive — who was right then, Kimura or Gillespie?). Here, one could endlessly evaluate examples of genes where isoforms are or aren’t important.

You gave an example of a gene where you claim isoforms are not likely to be functional. In return, I could give examples of well documented cases where alternative isoforms are functionally important: RORgt (RORg isoform which specifies a sub-type of T-helper cells) or constitutive androstane receptor where alternative isoforms have been shown to have affinity for different ligands; there are many others. Where does it lead us? Even if just 10% of annotated isoforms are “really” functional, that would still make alternative splicing an important mechanism.

Additionally, you cannot rely on the fact that isoforms are not annotated in other species to infer lack of conservation. The quality of annotation in species other than human and mouse is much lower so it is not at all surprising that they wouldn’t be annotated in other species.

Mark Sturtevant said...

A while ago I had a conversation about this topic with a colleague at work. Like me, he had no real dog in this fight other than an interest in keeping abreast of biology and in understanding things accurately. But when I explained the viewpoint that the evidence for pervasive alternative splicing does not in fact exist, he looked at me like I was crazy. It was surprisingly difficult to get him to see that the presence of some functional alternative splicing does not mean that all splicing variants are functional.

Larry Moran said...

This is not an endless unsolvable argument. The most important point seems to be the one you are missing. In the absence of evidence for function we should not just assume that most splice variants represent real biologically relevant examples of alternative splicing.

That does not mean that there are no examples of alternative splicing. I've put these well-known examples in my textbooks beginning in 1987. Pointing out that there are real examples does not contribute to the discussion.

The reason why I continue to mention examples where it's extremely unlikely that most (any?) of the splice variants are functional is to counter the widely-held (and unsubstantiated) belief that all those splice variants mean something.

You can easily resolve the debate by showing that most of the splice variants are functional in a large number of genes. Until you do that, the only reasonable conclusion is that they are non-functional because that's what is consistent with the available data (e.g. lack of conservation) and that's how you form a null hypothesis in science.

Larry Moran said...

BTW, the neutralist-selectionist debate is over. The neutralists won. About 90% of our genome is junk and it's evolving at the neutral rate.

Larry Moran said...

It's strange that this logical fallacy is so common among scientists who should know better. I agree with others that it's related to pan-adaptationism - the view that just about everything is an adaptation. When looking at the abundance of splice variants, adaptationists just assume that they are functional because they are unaware of any other possibility. When they see proof that one or two genes have genuine alternative splicing this becomes confirmation that their original assumption was correct (confirmation bias).

Greg said...

It seems a bit disingenuous to attack me for mentioning specific examples as not contributing to the debate after you wrote "There seems to be an increasing reluctance to publish examples of transcript variants for specific genes. (...) It seems as though workers in the field are increasingly relying on bioinformatic analysis of transcript variant databases without ever actually looking at specific genes to see if the databases make sense."

If all you wanted to claim is "not *all* annotated alternative isoforms are functional" that would obviously be true. It would also not be very interesting, or in opposition to what anyone else is claiming. What you really seem to be implying is that 'hardly any' alternative isoforms are functional. This is where it becomes an ill-defined controversy. How many of the annotated isoforms have to be functional for one side to win? Is 50% the magic number and so if less than half of the isoforms are correctly annotated then we should automatically assume they're all unimportant? What if the annotation eventually improves and >50% is correct, does that suddenly make alternative splicing an important biological phenomenon?

Should we automatically assume all annotation is true? Of course not. Should we be skeptical of any individual annotated alternative isoform in absence of functional data? Yes. Should we seek independent validation that they're functional? Absolutely.

Nevertheless, none of this should lead us to conclude that the mechanism itself is unimportant. For comparison, consider the early attempts to annotate genes after human genome was first sequenced. Initially, there were far too many genes annotated, and many annotations were wrong. Following your logic, should we conclude that genes are not important for organismal complexity?

Since you mentioned the supposed lack of conservation again, I will reiterate: the reason why you think alternative isoforms are not conserved is because most species are understudied compared to human and mouse. Indeed, if alternative isoforms were all noise as you imply, you would expect the annotations to contain similar numbers of non-conserved isoforms in different species. In fact, human and mouse have many more annotated isoforms compared to other mammals but that's just due to study bias, nothing to do with conservation.

As for your second comment, I'm beginning to regret mentioning the neutralist-selectionist debate as I can see that it caused some confusion: it's not about what fraction of the genome is evolving neutrally -- if you spend any amount of time reading about it, you will see people argue mostly about what happens in protein-coding regions.

Larry Moran said...

Do you agree with the widely reported claim that "nearly all human protein-coding genes undergo one or more forms of alternative splicing" as the opening sentence of the Ule & Blencowe paper says? If so, what is the evidence that you rely on to make such a claim? If not, how would you express your personal view of alternative splicing?

Larry Moran said...

I think you are missing the point about annotation. Lots of scientists make claims about the human genome. Some of those claims are confirmed when expert annotators examine the evidence closely and some claims are rejected. In the case of the number of genes, you point out that the initial estimates of more than 30,000 protein-coding genes were incorrect and annotators rightly rejected about 10,000 candidates.

The point is NOT that there are still genes in the human genome; the point is that expert annotators found lots of mistakes. Those same annotators have looked closely at the reported splice variants for each gene and concluded that most of them are probably splicing errors and not true examples of alternative splicing.

I think that's significant, do you? Or do you think that all those splice variants of the TPI gene that were rejected are biologically significant and the annotators made a serious mistake in rejecting them?

Larry Moran said...

You don't seem to be familiar with the data on the conservation of splice variants. The Blencowe lab (among others) has looked at production of splice variants in many species and concluded that the patterns are not conserved. For example, they discovered that "approximately half of alternatively spliced exons among species separated by ~6 million years are different" (see ref below). They're talking about RNA-Seq data from nine different organs in humans and chimpanzees.

Even the strongest proponents of alternative splicing concede that the production of variants in most genes is not conserved to any great extent. On the other hand, the production of splice variants in the well-studied examples of alternative splicing IS conserved across species separated by tens of millions of years.

So we are left with an interesting observation. When function has been sufficiently demonstrated, we see conservation. When there's no evidence of function, we see no conservation.

This observation is significant to skeptics but not to true believers. The AS proponents are quick to come up with any number of excuses to explain away the lack of conservation. You repeated one of them; namely that alternative splicing is probably conserved but we just don't have enough data to see it ("study bias").

Greg said...

Fine, let’s look at annotations and data on isoform conservation. I examined the TPI1 gene you used to demonstrate lack of conservation between alternative isoforms. I have no idea if you chose it deliberately or at random but this example doesn’t support your claims.

You wrote that there are 8 human isoforms and 3 mouse isoforms with a complete lack of conservation between them. Case closed? Not quite.

First of all, human and mouse TPI1 genes are encoded on opposite strands. Your figure doesn’t take that into account, making the isoform structures look more different than they are in reality.

If you look carefully at transcript annotations, three of the human isoforms have “3’ CDS incomplete” and among the remaining five, two isoform pairs have identical amino-acid sequences: there are actually three unique, complete protein products for the human gene: 286aa, 167aa and 249aa long. In mouse, there are two unique isoforms: 299aa and 167aa.

I aligned them and the human ‘long’ (286aa) isoform aligns very well against the mouse ‘long’ (299aa) isoform. The same is true of the short (167aa) isoforms. In other words, these two isoform pairs are in fact conserved.

Now, what about the remaining, apparently non-conserved human isoform? Here is the curious thing: it turns out that this is the canonical isoform according to annotations (MANE and APPRIS), and indeed the one for which a human structure has been solved recently (PDB: 6nlh). Hm.

Joe Felsenstein said...

BTW, the neutralist-selectionist debate is over. The neutralists won. About 90% of our genome is junk and it's evolving at the neutral rate.

Really? The original argument was about genetic variation in coding sequences, and about substitutions in coding sequences. The amount of junk DNA outside of coding sequences would seem to be irrelevant to that.

John Harshman said...

To be precise, the original argument was about starch gel isozymes, was it not?

João said...

Larry, can you help me with this?

You said "About 90% of our genome is junk and it's evolving at the neutral rate."

I was looking for some papers and I found this:

"In an ambitious undertaking, Pouyet et al. – who are based at the University of Bern, the Swiss Institute of Bioinformatics and the University of Zurich – discovered how much of the human genome can really be used for this style of demographic analysis. Their results showed that only 5% of the genome is truly evolving neutrally, with the remaining 95% being affected by some kind of natural selection"

Harris, 2018

Does this affect your statement that 90% of our genome is junk and is evolving at a neutral rate?

I would really appreciate it if you could answer me. Btw, I'm just trying to understand, not implying that I think you are wrong.

João said...

A lot of typos, sorry.

Joe Felsenstein said...

The neutral theory incorporates not only neutral substitution but purifying selection as well, so Harris et al. need to show that much of their 95% which is selected is not just purifying selection.

João said...

Thank you, Joe. But what I am struggling to understand is why non-functional elements of our genome are under purifying selection. Shouldn't they be "invisible" to selection?

Does it have something to do with the transition from junk to garbage DNA that Dan Graur talks about?

João said...

I should say what the paper Harris refers to claims. Here is the abstract.

Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.

Pouyet et al., 2018.

Joe Felsenstein said...

@John: Yes, but the restriction to mobility of bands on starch gels is hardly relevant now. The corresponding question would be nonsynonymous changes in coding sequences.

Larry Moran said...

@Joe and @Greg

I see your point. There appear to be many scientists who restrict the "neutralist-selectionist" debate to a discussion over the effect of mutations in amino acid codons. I guess the selectionists concede that the vast majority of alleles outside of coding regions are (nearly-)neutral but they still want to argue that most changes within coding regions will affect fitness.

Larry Moran said...


Joe is in a better position to answer your questions but the key sentence in the "digest" is, "This suggests that while most of our genetic material is formed of non-functional sequences, the vast majority of it evolves indirectly under some type of selection."

If I understand the paper correctly, what they are showing is that lots of junk DNA is linked to sites that are under selection so they are sometimes dragged along with those sites. They also looked at biased gene conversion following recombination. This favors substitution of G or C at mismatches like A-C or G-T. GC-biased gene conversion can lead to weak selection for GC base pairs in junk DNA.

João said...

Thank you, Larry. Think I got it. Do you think it is possible that background selection leads to some sequence conservation of DNA?

João said...

*junk DNA

Joe Felsenstein said...

I think that "purifying selection on linked sites" does not, in the long run, cause sequence to be conserved. Thus even if it is widespread in the genome, it cannot explain conservation of a sequence that has no function of its own but is just near a functional sequence.

Joe Felsenstein said...

@Larry: They might also add control regions outside of the exons.

Larry Moran said...


I'm not sure I understand your comment about control regions. In your opinion is the neutralist-selectionist debate mostly about whether changes in codons are neutral or not, or is it about what parts of the entire genome are under selection or not? If it's the latter then ALL functional regions are important, not just coding DNA and regulatory sequences. This includes coding regions (~1%), and regulatory sequences (<0.2%) but also origins of replication (~0.3%), centromeres (~1%), SARs (~0.3%), telomeres (~0.1%), noncoding genes (~0.6%), and functional regions of introns. In addition, there seems to be about 4% more of the genome that's conserved but where the function isn't clear.
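For readers who want to check the arithmetic, here is a minimal sketch that simply adds up the approximate percentages quoted in the paragraph above. The variable names are made up for illustration, "<0.2%" is treated as its upper bound of 0.2, and functional regions of introns are left out because no figure is given for them, so the totals are rough.

```python
# Approximate percentages of the human genome as quoted in the comment.
# Assumptions: "<0.2%" is taken as 0.2 (an upper bound), and functional
# regions of introns are omitted because no number is given for them.
functional_pct = {
    "coding regions": 1.0,
    "regulatory sequences": 0.2,
    "origins of replication": 0.3,
    "centromeres": 1.0,
    "SARs": 0.3,
    "telomeres": 0.1,
    "noncoding genes": 0.6,
}
conserved_unknown_pct = 4.0  # conserved, but function not clear

# Sum of the itemized functional categories.
itemized_total = sum(functional_pct.values())
# Add the conserved-but-unexplained fraction.
total_with_conserved = itemized_total + conserved_unknown_pct
# Whatever is left over after both are accounted for.
remainder = 100.0 - total_with_conserved

print(f"itemized functional regions: ~{itemized_total:.1f}%")
print(f"plus conserved, function unclear: ~{total_with_conserved:.1f}%")
print(f"remainder: ~{remainder:.1f}%")
```

Under those assumptions the itemized categories come to roughly 3.5% of the genome and roughly 7.5% once the conserved fraction is added, which is in line with the "about 90% junk" figure discussed earlier in the thread.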

If all of this is included in the neutralist-selectionist debate then it's clearly more than just a debate about codons, right?

Do you agree with Greg when he said the following?

As for your second comment, I'm beginning to regret mentioning the neutralist-selectionist debate as I can see that it caused some confusion: it's not about what fraction of the genome is evolving neutrally -- if you spend any amount of time reading about it, you will see people argue mostly about what happens in protein-coding regions.

Mikkel Rumraket Rasmussen said...

But even nonfunctional DNA will still be under some level of purifying selection not to cause problems for other cellular processes. So if mutations in nonfunctional DNA cause some disease state that negatively affects reproductive fitness, they can be selected against.

And too high levels of expression of nonfunctional sites, even if it results in relatively benign or inactive RNA or proteins, still carries some metabolic cost. If this metabolic cost gets high enough it can affect reproductive fitness. So ultimately, mutations that cause upregulation of nonfunctional DNA to those levels of expression will be selected against too.

Those are two examples of purifying selection that you'd expect to operate on nonfunctional junk-DNA.

Michael Tress said...


Just to clear up the information on TPI1. Greg is right that there are only 3 sequence-distinct isoforms annotated at the moment in Ensembl/GENCODE. The main isoform is clearly the one with 248 amino acids (it is conserved in yeast). The other two transcripts differ in their ATG. The upstream ATG may be functional. We find a peptide and it is conserved in mammals. Note that it is NOT the same as the 299 amino acid transcript in mouse; mouse is not (yet) annotated with this isoform. The 167 amino acid isoform is annotated in mouse, as Greg said, but curiously it doesn't seem to be annotated anywhere else (yet). It is unlikely that this isoform is functional: apart from the lack of conservation, it would break the structure and remove functional residues.

Larry Moran said...

@Michael Tress

There seems to be some confusion (probably my fault) about the meaning of "annotation." I showed the splice variants that Ensembl puts on their entry for TPI1.


The data has ten different splice variants and eight of them show a protein-coding region. That's why I said there were eight variants with "predicted" protein variants. I'm aware of the fact that GenBank and other versions of Ensembl only show three actual isoforms of the protein but this only reinforces my point that annotators have rejected a lot of the splice variants. I was actually thinking of the many more splice variants reported in other databases that don't even make it to Ensembl.

I agree that the 248 (249 if you count the methionine) aa is the correct version. Do I understand you correctly that you have found a peptide corresponding to the upstream 37 aa's predicted for isoform 2 and that you also find it in other mammals? I have searched the literature for any report of a TPI enzyme with an extra 37 amino acids at the N-terminus and failed to find anything. The amino acid sequence of this region doesn't look very "normal" to me. Is the peptide abundant? Is it found in a variety of tissues? Do you know of anyone who has actually detected a larger version of TPI?

Michael Tress said...

You are quite right, the main isoform is 249 residues, not 248. My bad. The 286 residue version (let's keep calling it "isoform 2" even though UniProt have it as the main isoform, P60174) is a curious case. There are several peptides to support the 37 extra residues. PeptideAtlas is full of them and some of them have good peptide-spectrum matches (PSM). So I am convinced that it is translated (at least occasionally). There is much less evidence for this region, though. While there are peptides with more than 50,000 detected PSM for the main protein, the peptide with most observations corresponding to isoform 2 has only 131 PSM. Also I just checked and there is no peptide evidence for the equivalent of isoform 2 in mouse. But that may only be because there are fewer mouse proteomics experiments. Where is it detected? Testis, ovary and CD8 cells appear, but mostly testis.
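To put the two PSM counts above side by side, here is a small back-of-the-envelope sketch. It uses only the two numbers quoted in the comment; the variable names are illustrative, and because the main-protein figure is "more than 50,000" the computed fold difference is a lower bound, not a measured ratio.

```python
# PSM counts as quoted in the comment. The main-protein figure is a lower
# bound ("more than 50,000"), so the fold difference is also a lower bound.
main_psm_lower_bound = 50_000  # best-observed peptides of the main protein
isoform2_best_psm = 131        # most-observed peptide specific to isoform 2

# How many times more often the main-protein peptides are observed.
fold_difference = main_psm_lower_bound / isoform2_best_psm
# Isoform 2 observations as a percentage of main-protein observations.
isoform2_as_pct_of_main = 100 * isoform2_best_psm / main_psm_lower_bound

print(f"main protein peptides seen at least {fold_difference:.0f}x more often")
print(f"isoform 2 peptide is at most {isoform2_as_pct_of_main:.2f}% of that")
```

On these numbers the isoform 2 peptide is seen at least a few hundred times less often than peptides from the main protein. PSM counts are only a crude proxy for abundance, so this says nothing precise about protein levels.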

The upstream 37 residues aren't human specific. They are annotated throughout eutheria. You can see them in this alignment (note: this alignment will only be available for a week). However, the first 37 residues are clearly less conserved than the main portion of TPI1.

At the transcript level the evidence for the upstream exon extension that generates isoform 2 is marginal. There is also no evidence of tissue-specific splicing. This isn't surprising since we will shortly show that tissue specificity in AS is highly correlated with cross-species conservation, and isoform 2 can't be detected beyond mammals.

It is hard to imagine a function for these extra 37 residues. They are highly unlikely to fold along with the TPI1 structure, for example. However, it may be noisy translation. We find incontrovertible peptide evidence for the incorporation of more than 30 Alu exons (transposable elements introduced in the primate lineage, obviously) into principal and alternative isoforms. However, we didn't find any evidence at all to suggest that these Alu exons were functional. This might be similar. It is possible that the presence of a start codon so close to the start codon of the principal isoform allows it to be translated sufficiently to be detected at the protein level occasionally.

Federico Abascal said...

I would add that we've been using proteomic evidence as a much better proxy of function than transcription evidence. But I am sure the protein level tolerates some degree of noise too, especially in unstructured regions. I mean, some isoforms seen at the protein level may not be functional - although proteomics and evolutionary conservation mostly agree.

Joe Felsenstein said...

Let me clarify matters after some statements of mine that might be unclear. Kimura and co.'s Neutral Theory allowed for both neutral sites and those experiencing purifying selection. The bottom line was that most of the polymorphism was neutral, and most of the nucleotide substitution was neutral. I once asked Kimura whether, if 50% of the polymorphism was proven neutral, he would feel vindicated. He said yes. Interestingly enough, I asked the same question of Bryan Clarke, a leading panselectionist. He also said yes. (Further in my next comment)

Joe Felsenstein said...

Kimura was concentrating on coding sequences and sequences of known function. He did not concede that it was enough for the noncoding DNA to mostly be Junk DNA -- he argued that the substitution and polymorphism in the coding sequences was mostly neutral too.

Joe Felsenstein said...

I have, since this thread started, contacted Brian Charlesworth, who has worked in this area a lot, and my colleague Kelley Harris, who wrote the Harris commentary referred to in the comments. Both agreed that in saying that background selection made changes in much of the Junk DNA "nonneutral", they were referring to a change of substitution rates. But they agreed that substitutions would then look neutral in that they would not favor some base substitutions over others, and the sequences at those sites would not be conserved.

Joe Felsenstein said...

One last correction: with background selection there can also be no change in substitution rates, but there would be change in the amount of polymorphism.

Larry Moran said...

If you are going to make the case that widespread alternative splicing generates protein diversity then it seems pretty obvious that detecting the presence of the alternative proteins is important. Ben isn't very worried about the lack of evidence for these proteins because he thinks you have just failed to detect them for some reason. I would like to point out that we're talking about tens of thousands of predicted proteins that seem to be missing. What do you think of Ben's argument?

(He's also not very worried about the fact that the presumed alternatively spliced transcripts are also missing, in the sense that very few of them are present in concentrations sufficient to be functional.)

Federico Abascal said...

There are always excuses, as if "everything is functional" were the null hypothesis. When alternative splicing was found to not be conserved between species (e.g. between mouse and human), rather than thinking it may not be functional, proponents of massive AS interpreted that AS was key in species diversification/innovation... it made us human. Using population genetic variation in humans we showed that wasn't the case: most alternative exons are evolving neutrally, they are not human-specific innovations. This doesn't seem to matter. The same applies to all other strands of evidence; there are always excuses to ignore them. Like with proteomics. There are of course limitations in detection sensitivity. Proteomics is much more limited than transcriptomics. We (Michael) have done a lot of work to show that detection sensitivity limitations do not explain the paucity of alternative protein isoforms. But that doesn't seem to matter either.

Federico Abascal said...

Just wanted to add something: that we try to fight this battle does not mean we don't think alternative splicing is a wonderful, amazing, real phenomenon. We have contributed a few papers on real AS cases: homologous exons are very interesting and highly conserved, certain transposable elements have been co-opted through AS, etc.

Michael Tress said...

To add to what Fede wrote … there’s a lot I could write on the issue of MS proteomics evidence and identification of alternative splice isoforms (or the non-coding feature du jour where researchers are just as naive about its use). I will try to be brief.

We find very few peptides for alternative isoforms. Ben’s explanation (and that of many others) for this is that there must be technical limitations on the coverage of MS. Without getting into details, it is true that there are technical limitations, but even after taking these into account the number of AS variants in standard MS experiments is orders of magnitude below what would be expected. There may be many biological reasons for this, of course. The fact that we don’t find many alternative peptides in MS experiments does not mean that the transcripts are not translated in some form/quantity.

At the same time many research groups have found considerable peptide evidence for alternative isoforms (and non-coding regions). Unfortunately, it is possible to find evidence for any coding feature if you aren’t sufficiently careful. A recent paper, for example, detected peptide evidence for more than 1,000 mouse alternative splice isoforms. However, the same data showed that they also “identified” peptides that mapped to 597 olfactory receptors (without investigating nasal tissues), which does rather suggest that they might have done something wrong. Sadly getting MS proteomics evidence massively wrong often pays. The less care one takes with MS proteomics experiments and the more exciting the claims, the easier it is to publish in important journals (cough /cough).