Thursday, October 31, 2024

Philip Ball's view of alternative splicing

Genomics is a powerful tool that allows you to collect massive amounts of data that can point the way to new understanding. But it can also be abused when the results are overinterpreted. We saw an extraordinary example of this in 2012 when ENCODE made unsubstantiated claims that were quickly challenged.

I'm reminded of the caution from Sydney Brenner who warned us about genomics (Brenner, 2000) and the warning in Dan Graur's harsh critique of the 2012 ENCODE claims (Graur et al., 2013) where they said ...

The Editor-in-Chief of Science, [Bruce Alberts,] has recently expressed concern about the future of "small science," given that ENCODE-style Big Science grabs the headlines that decision makers so dearly love. Actually the main function of Big Science is to generate massive amounts of easily accessible data. The road from data to wisdom is quite long and convoluted. Insight, understanding, and scientific progress are generally achieved by "small science." ...

High-throughput genomics and the centralization of science funding have enabled Big Science to generate "high-impact false positives" by the truckload. Those involved in Big Science will do well to remember the depressingly true popular maxim: "If it is too good to be true, it is too good to be true."

We conclude that the ENCODE Consortium has, so far, failed to provide a compelling reason to abandon the prevailing understanding among evolutionary biologists according to which most of the human genome is devoid of function. The ENCODE results were predicted by one of its lead authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass media hype, and public relations may well have to be rewritten.

Philip Ball is one of many science writers who don't distinguish between the existence of a feature and whether it is meaningful. For example, he takes ENCODE workers at their word when they claim that the human genome contains far more non-coding genes than protein-coding genes, even though that claim is based on the mere existence of a transcript without any evidence that it is functional and not just junk RNA [Philip Ball says RNA may rule our genome]. Now, it may eventually turn out to be true that our genome is chock full of regulatory RNA genes but for now that speculation lacks evidence and runs contrary to the evidence of a sloppy genome that's 90% junk. It is not fair to mislead the general public by not fully presenting both sides of the controversy.1

All researchers need to realize that the best scientific practice is produced when, like Darwin, they persistently search for flaws in their arguments.

Bruce Alberts et al. (2015)
"Self-correction in science at work"
Science 348: 1420

Graur is not alone in pointing out the difference between collecting data and the established methods of forming and testing a hypothesis. He is not the only one who reminds us that extraordinary claims require extraordinary evidence—and let's be clear that saying there are more non-coding genes than coding genes IS an extraordinary claim. All the critics of ENCODE have made it clear that it is scientific malpractice to make outrageous claims without mentioning all the evidence against your claim.

Large-scale genomics experiments have looked at all the RNAs in various tissues and mapped them to the genome. Many of them appear to come from protein-coding genes and that's not a surprise since these genes cover about 40% of the genome. Most of the transcripts are derived from introns and the surprise was that many of them seem to be splice variants that do not correspond to the canonical mRNA sequence that was previously characterized or predicted. The researchers who discovered these variants jumped to the conclusion that they must be due to alternative splicing—a conclusion that was based on the mere existence of a splice variant and not on any evidence that the variants were biologically significant.

One of the groups was based in a sister department to mine at the University of Toronto. They published a paper in 2008 claiming that 95% of multiexon genes undergo alternative splicing (Pan et al., 2008). This paper is widely quoted in the popular press and in the scientific literature. That's an extraordinary claim that flies in the face of common sense.

Let's see how Philip Ball handles alternative splicing. I'm quoting from his recent book How Life Works: A User's Guide to the New Biology (pp. 170-171). Judge for yourselves whether you think this science writer is doing a good job of presenting both sides of a major controversy.

Around 90% of our genes give rise to more than one mRNA by alternative splicing. It is particularly common in the brain, for reasons not fully understood. Proteins called neurexins, which control the adhesion between cells and are an essential component of the formation of synapses (the junctions of neurons), are alternatively spliced into vast numbers of different forms. A gene called Dscam1 encodes a protein that enables neurons to recognize each other so that they can avoid fusing amid the tangle of long filamentary axons ... It's thought that various "isoforms" of the Dscam1 protein are produced at random by alternative splicing, and that they act as arbitrary cell-surface labels that distinguish one axon from another. In the fruit fly, almost twenty thousand different alternatively spliced variants of the Dscam1 protein have been observed—all from a single gene. (It's not clear how many of them actually have a biological function, though.)

It is in this way that, from around twenty thousand genes, our cells can make between eighty thousand and four hundred thousand different proteins. That the number is still so uncertain testifies to how much we still don't understand about the human proteome. Alternative splicing and polyadenylation (p. 121) of mRNA shows that there is at least as much regulation going on after transcription of a gene has begun—that is, en route from RNA to protein—as there is before transcription happens, when it may be turned up or down with the involvement of regulatory sites on DNA itself.

Alternatively spliced proteins are essential components of our molecular toolkit, being mainly involved in the processes most central to the operation of complex organisms: signalling, cell communication, and regulation of development. Thanks to regulatory mechanisms, splicing is tissue-specific. The different cell types don't just have a different repertoire of genes turned on and off, but a different array of proteins made from them. It's no surprise then, to find that alternative splicing is common in multicellular eukaryotes with many tissue types, but much less so in single-celled eukaryotes (let alone prokaryotes).

Philip Ball is reporting on a widespread belief in abundant alternative splicing. This idea permeates the scientific literature, making it difficult to find skeptical viewpoints. The majority of molecular biologists are firmly convinced that almost all human protein-coding genes produce several different functional isoforms as a result of alternative splicing.

The truth is more complicated. True alternative splicing has been well-documented and the best examples have been reported in the textbooks since the 1980s. Nobody questions those examples because the protein isoforms have been detected and their functions have been demonstrated. However, most of the speculation about widespread alternative splicing is based solely on the detection of splice variants in large-scale genomics experiments. It's possible that these splice variants are just splicing errors that are readily detected in sensitive transcriptome studies.

It is wrong to just blindly assume that all those splice variants are biologically relevant and declare that 90% of our genes give rise to multiple protein isoforms. It's the same problem that we see with transcripts and transcription factor binding sites. The difference is that with transcripts we've come to recognize that many transcripts are junk RNA so nobody believes the original ENCODE claims that 80% of our genome is functional. Similarly, nobody believes the original ENCODE claim that every TF binding site represents a genuine regulatory sequence. We now know that large genomes full of junk DNA will give rise to millions of spurious transcription factor binding sites and that's why even the ENCODE researchers now refer to them as "candidate" cis-regulatory elements (cCREs).

This skepticism about transcripts and TF binding sites has not permeated the alternative splicing literature in spite of the fact that the same rules apply. The null hypothesis has to be splicing error when you detect a new splice variant. You cannot claim that this is evidence of alternative splicing without doing the hard work required to prove the case.

I write about this in my book in Chapter 6: How Many Genes? How Many Proteins? (pp. 154-169) and I've written dozens of blog posts on alternative splicing (see the list below).

The skeptical view of alternative splicing goes like this.

Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can—if you know anything at all wrong, or possibly wrong—to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it as well as those that agree with it.

Richard Feynman (1985)
"Cargo Cult Science"
in "Surely You're Joking, Mr. Feynman!"

Respect the null hypothesis. Don't assume that every splice variant is biologically relevant. You need evidence to support such a positive claim. In the absence of such evidence the default assumption is splicing error. (Don't use the term "alternative splicing" unless it refers to a biologically relevant process.)

Understand splicing error rates. There are lots of papers on the error rate of splicing. It can be as high as 0.1% meaning that incorrectly spliced transcripts will be present in all cells. Some of these errors are due to inappropriate binding of tissue-specific splice factors so many splicing mistakes will be confined to certain cell types.

Understand the importance of concentration. Most splice variants are present at less than one copy per cell. That's a good indication that they are splicing errors and not true alternative splicing.

Can the predicted protein isoforms be detected? The answer is "no" in the vast majority of cases. Most protein-coding genes produce a single protein that's similar to the homologous protein in all other species.

Understand evolution and the importance of conservation. Are the observed splice variants present in other closely-related species? Is there evidence that they are preserved by natural selection? If the answer is "no" then splicing error is a better explanation than biological function.

Use common sense. Does it make sense that alternative splicing would produce multiple functional protein isoforms for each of the 10,000 or so housekeeping genes? Why would humans need to have multiple versions of RNA polymerase subunits or the subunits of ATP synthase? Why would we need multiple versions of every enzyme involved in amino acid metabolism or lipid biosynthesis?

"Ball is one of the most meticulous, precise science writers out there. He is the antithesis of hypey, "dumb-it-down" reporting. He is MUCH more credible than you are, Laurence."

John Horgan, July 2024

Understand junk DNA. 90% of our genome is junk. Much of that resides in introns whose size correlates well with the size of the genome. The more junk DNA you have the bigger the introns and the bigger the introns the greater the chance of splicing error.

Be wary of arguments from medical relevance. Some people argue that alternative splicing must be important because there are many genetic diseases that are caused by mistakes in splicing due to loss-of-function mutations. But many of the ones that have been studied carefully show that the genetic defect is due to an intron mutation that creates a spurious splice site. These are gain-of-function errors in junk DNA and they support the idea that splicing errors are significant.

Be concerned about bias, especially your own. Many scientists are upset about the fact that humans have "only" 20,000 protein-coding genes. They were surprised when the sequence of the human genome was published because they had not kept up with the literature on the number of genes. (See Revisiting the deflated ego problem.) If you are one of those people, you need to be careful about accepting unsubstantiated just-so stories supporting your view that humans must be a lot more complicated at the molecular level than nematodes and fruit flies. Alternative splicing is one of those stories; it seems to justify the "surprising" fact that we only have 20,000 genes by postulating that each one produces multiple proteins. In order to make sense, the argument requires that simpler species must have a lot less alternative splicing but, unfortunately for the logic, it turns out that when you look as closely for transcripts in other species (e.g. nematodes, plants) they have lots and lots of low-level splice variants just like we do. (See ad hoc rescue.)

Watch out for cherry-picking. Cherry-picking is a form of fallacy where touting the existence of individual cases is used as justification for an unwarranted extrapolation. It's commonly seen in the scientific literature where, for example, the discovery of a small bit of new functional DNA sequence is used as evidence that all junk DNA must be functional. Or, the existence of many confirmed regulatory RNAs must mean that all transcripts must be regulatory RNAs. Or, the fact that some transposons have secondarily acquired a function must mean that all transposon sequences must have some (unknown) function. Same with pseudogenes. There are genuine examples of alternative splicing. Highlighting them is evidence that the phenomenon exists but it is not evidence that all splice variants are due to alternative splicing.
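To make the points above about error rates and concentration a bit more concrete, here's a rough back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions, not a model taken from any of the papers cited here. I'm assuming that the 0.1% figure applies to each intron separately and that introns are spliced independently of one another. The intron counts and the figure of 100 transcripts per cell are made-up illustrative numbers, and the function names are my own.

```python
def fraction_misspliced(per_intron_error_rate: float, n_introns: int) -> float:
    """Fraction of transcripts expected to carry at least one splicing error.

    Assumes (for illustration only) that every intron is spliced
    independently and with the same error rate.
    """
    return 1.0 - (1.0 - per_intron_error_rate) ** n_introns


def expected_error_copies(copies_per_cell: float,
                          per_intron_error_rate: float,
                          n_introns: int) -> float:
    """Expected number of mis-spliced transcripts per cell for one gene."""
    return copies_per_cell * fraction_misspliced(per_intron_error_rate, n_introns)


if __name__ == "__main__":
    rate = 0.001            # 0.1%, the upper figure mentioned above (per intron: an assumption)
    copies_per_cell = 100   # hypothetical transcript abundance for one gene

    for n_introns in (1, 8, 50):  # hypothetical intron counts
        frac = fraction_misspliced(rate, n_introns)
        copies = expected_error_copies(copies_per_cell, rate, n_introns)
        print(f"{n_introns:>2} introns: {frac:.4%} of transcripts mis-spliced, "
              f"about {copies:.2f} aberrant copies per cell")
```

Under these assumptions, a gene with a handful of introns that is expressed at 100 transcripts per cell would be expected to yield less than one mis-spliced copy per cell on average. That's the same low abundance at which most splice variants are detected, so their mere presence in a sensitive transcriptome study can't distinguish them from noise.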

Blog posts on alternative splicing


1. Ball has responded to my criticism by claiming that he has, in fact, presented views that conflict with his main message. It's true that you can comb through his book and his latest essays and find references to contrary points of view but these are never presented in a coherent manner that challenges his main message.

Brenner, S. (2000) Biochemistry strikes back. Trends in Biochemical Sciences 25:584. [doi: 10.1016/S0968-0004(00)01722-9]

Graur, D., Zheng, Y., Price, N., Azevedo, R.B., Zufall, R.A. and Elhaik, E. (2013) On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE. Genome Biology and Evolution 5:578-590. [doi: 10.1093/gbe/evt028]

Pan, Q., Shai, O., Lee, L.J., Frey, B.J. and Blencowe, B.J. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics 40:1413-1415. [doi: 10.1038/ng.259]