Wednesday, December 05, 2018

The textbook view of alternative splicing

As most of you know, I'm interested in the problem of alternative splicing. I believe that the number of splice variants that have been detected is perfectly consistent with the known rate of splicing errors and that there's no significant evidence to support the claim that alternative splicing leading to the production of biologically relevant protein variants is widespread. In fact, there's plenty of evidence for the opposite view, namely that most variants are splicing errors (lack of conservation, low abundance, improbable protein predictions, inability to detect the predicted proteins).

My preferred explanation is definitely the minority view. What puzzles me is not the fact that the majority is wrong but the fact that they completely ignore any other explanation of the data and consider the case for abundant alternative splicing to be settled.

This bizarre situation is reflected in the textbooks, which means we are training an entire generation of undergraduates to ignore critical thinking and simply adopt some strange ideas that have permeated the scientific literature. I recently purchased a textbook that illustrates my point. It's Genomes 4 (2017), the fourth edition of a book by T.A. Brown that was first published in 1999. The relevant section is in chapter 7 in a section titled "Gene numbers can be misleading."

The section begins by saying that scientists expected humans to have a lot of genes because humans are "the most sophisticated species on the planet." It points out that while we have more genes than yeast, fruit flies, and chickens, we have fewer genes than most plants and only about the same number as the nematode Caenorhabditis elegans. The author then offers an explanation,
These gene numbers lead us into an important aspect of genome biology. Before the human genome was sequenced it was anticipated that there would be 80,000-100,000 protein-coding genes, this number remaining in vogue up to a few months before the draft sequence was completed in 2000. This early estimate was high because it was based on the supposition that, in most cases, a single gene specifies a single mRNA and single protein. According to this model, the number of genes in the human genome should be similar to the number of proteins in human cells, leading to the estimates of 80,000-100,000. The discovery that the actual number of protein-coding genes is much lower indicates that it is possible for an individual gene to specify more than one protein. This is the case for many of the discontinuous genes in the human genome. When introns were first discovered, it was thought that a discontinuous gene would have just one splicing pathway, in which all the exons are joined together to give a single mRNA. We now know that many discontinuous genes have alternative splicing pathways, which means that their pre-mRNAs can be processed in a variety of ways, to give a series of mRNAs made up of different combinations of exons. Each of these genes can therefore direct synthesis of related but different proteins. ... Alternative splicing is relatively common in vertebrates, with 75% of all human protein-coding genes, representing 95% of those with two or more introns, undergoing alternative splicing, giving rise to an average of four different spliced mRNAs per gene. Alternative splicing also occurs in lower eukaryotes, but it is less prevalent. In C. elegans for example, only about 25% of the protein coding genes have alternative splicing pathways, with an average of 2.2 variants per gene.

Because of alternative splicing, the question "How many genes are there?" has no real biological significance, as the number of genes does not indicate the number of proteins that can be synthesized and hence is not a measure of the coding capacity of a genome. A better measure of the biological complexity of an organism is provided by categorizing the genes, including the splice variants according to function.
This is the consensus view of textbook authors, science journalists, and probably most of the researchers in the field of gene expression, although some of the details may differ. In this case, for example, the author claims that the high gene count was based on thinking that humans have 80,000-100,000 different proteins. That's not a rationale for the false expectation of gene number that I've seen before [1]. I don't know where it comes from but I do know that current estimates of proteome complexity do not support such a claim. I direct your attention to a recent paper by Tress et al. (2017), who looked at the mass spec data and concluded,
Alternative splicing is well documented at the transcript level, and microarray and RNA-seq experiments routinely detect evidence for many thousands of splice variants. However, large-scale proteomics experiments identify few alternative isoforms. The gap between the numbers of alternative variants detected in large-scale transcriptomics experiments and proteomics analyses is real and is difficult to explain away as a purely technical phenomenon. While alternative splicing clearly does contribute to the cellular proteome, the proteomics evidence indicates that it is not as widespread a phenomenon as suggested by transcript data. In particular, the popular view that alternative splicing can somehow compensate for the perceived lack of complexity in the human proteome is manifestly wrong.
This is not something that was only discovered recently so the textbook author (T.A. Brown) cannot be excused on the grounds that the data came out after his book was in press.

Putting that aside, let's think about what we might be teaching undergraduates. The underlying assumption is that primitive eukaryotes produced only one or two different protein isoforms per gene but over time there was selection for more complexity by evolving alternative splicing to create more isoforms, culminating in humans where an average of four different protein isoforms are made per gene. Think about what that means by considering typical housekeeping genes like those involved in glycolysis, the citric acid cycle, amino acid biosynthesis and a host of other metabolic pathways. Think about the genes for all the subunits of RNA polymerase, the subunits of the electron transport complexes in mitochondria, and the 80 or so proteins of the ribosome. The genes for all these proteins have multiple splice variants in the databases so it follows that there must have been selection for each of them to produce, on average, three protein variants in addition to the standard conserved protein seen in bacteria. It's very difficult to imagine why humans would need three additional versions of glyceraldehyde 3-phosphate dehydrogenase or citrate synthase where each new variant is missing a block of internal amino acid residues or has an extra stretch of amino acids inserted into the middle of the structure. It's even more difficult to imagine why such altered versions of the subunits of complex structures could provide a selective advantage. Nevertheless, this is exactly what we are teaching these days in our undergraduate courses.

I don't get it.

Let's assume that alternative splicing is restricted to just a small number of genes as I claim. How would scientists respond to that if they still think there's a problem reconciling the number of genes with the idea that humans are "the most sophisticated species on the planet"? Would it create such a serious problem for them that they are reluctant to consider the possibility? Is this an idea that undergraduates can't handle so we have to protect them from even thinking about it?

1. Here's the explanation given in the original publication of the draft human genome sequence by Lander et al. (2001, p. 898).
Previous estimates of human gene number. Although direct enumeration of human genes is only now becoming possible with the advent of the draft genome sequence, there have been many attempts in the past quarter of a century to estimate the number of genes indirectly. Early estimates based on reassociation kinetics estimated the mRNA complexity of typical vertebrate tissues to be 10,000-20,000, and were extrapolated to suggest around 40,000 for the entire genome. In the mid-1980s, Gilbert suggested that there might be about 100,000 genes, based on the approximate ratio of the size of a typical gene (~3 × 10^4 bp) to the size of the genome (3 × 10^9 bp). Although this was intended only as a back-of-the-envelope estimate, the pleasing roundness of the figure seems to have led to it being widely quoted and adopted in many textbooks (W. Gilbert, personal communication).
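
The back-of-the-envelope figure attributed to Gilbert is easy to reproduce. Here is a minimal sketch in Python that uses only the two round numbers from the quotation (a typical gene of about 3 × 10^4 bp and a genome of 3 × 10^9 bp). The function and variable names are illustrative and do not come from the paper.

```python
# Back-of-the-envelope estimate described in the quotation above:
# number of genes ~ genome size / typical gene size.
# Both inputs are the round figures quoted from Lander et al. (2001).
TYPICAL_GENE_BP = 3 * 10**4   # about 30 kb per gene
GENOME_BP = 3 * 10**9         # about 3 Gb genome


def estimate_gene_count(genome_bp: int, gene_bp: int) -> int:
    """Naive estimate: divide the genome length by a typical gene length."""
    return genome_bp // gene_bp


print(estimate_gene_count(GENOME_BP, TYPICAL_GENE_BP))  # 100000
```

The ratio treats the entire genome as if it were tiled end to end with typical genes, which is presumably why it was offered only as a rough figure.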

Lander, E., Linton, L., Birren, B., Nusbaum, C., Zody, M., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409:860-921. [doi: 10.1038/35057062]

Tress, M.L., Abascal, F., and Valencia, A. (2017) Alternative splicing may not be the key to proteome complexity. Trends in Biochemical Sciences, 42:98-110. [doi: 10.1016/j.tibs.2016.08.008]


Mike S said...

Great post, thanks.

Joe Felsenstein said...

I hope that you get a chance to comment on this in your book, even though it is a bit off your main topic. Someone, somewhere has to make the points you are making, visibly enough to bring them to the attention of genomicists. I do wonder how they can defend the view that humans have all sorts of extra complexity owing to alternative splicing -- but other animals that have equal numbers of genes somehow don't have that extra complexity.

Arno Wouters said...

Thanks Larry for this post! I think you identified a very serious problem with the textbooks.

I myself recently stumbled upon a similar problem. As a philosopher of biology I need to keep my knowledge of biology up to date. One of my strategies is to look into textbooks for new information and if I find something that interests me I track the references.

I recently bought the 11th edition of Mason, Losos and Singer *Biology* (McGraw-Hill, 2017), an undergraduate introduction to biology claiming to present "cutting edge science" (p. vi), "preparing students for the future" and "developing critical thinking" with the help of Connect (p. ix).

On arrival, I discovered to my astonishment that the book does not provide literature references. There is no list of references, there are no 'For further reading' sections and the scarce references to the literature are of the form 'A recent study showed ....' without any further indication of when, where and by whom that study was published!

Shocked, I glanced through the book. On p. 280 my eye fell on the central dogma of molecular biology, first described by Francis Crick: Information passes in one direction from the gene (DNA) to an RNA copy of the gene. Uh??? Three days earlier I had explained to a colleague, who claimed that the central dogma has long been superseded, why he was wrong! Yet this is exactly what the book says (okay, there is a difference: the colleague thought it was refuted by genomics, the book says the original formulation was modified due to the discovery of retroviruses).

I then moved to the section on evolution, which presents a 1950s picture of evolution. Mutation is "the ultimate source of genetic variation." However, "a typical gene mutates about once per 100,000 cell divisions. Because this rate is so low, other evolutionary processes are more important in determining how allele frequencies change" (p. 404). No word about the neutral theory. Genetic drift is the result of "a drastic reduction" of the size of a population.

However, on p. 465 it is said that "in some cases branching events can be timed." "One widely used but controversial method is the molecular clock, which states that the rate of evolution of a molecule is constant over time." This method "appears to hold true in some cases, in many others the data indicate that rates of evolution have not been constant through time". "Recently, methods have been developed to date evolutionary events without assuming that molecular evolution has been clocklike. These methods hold great promise for providing more reliable estimates of evolutionary timing". No explanation of why one would or would not expect molecular clocks, no reference to a review, no hint of what the alternative methods are, why they are more reliable or where I can read more about it.

(to be continued)

Arno Wouters said...


The explanation of evolutionary theory also suffers from the lack of a clear exposition of the relation between genotypic and phenotypic change. The essence of evolution? "Through time species accumulate differences; as a result, descendents differ from their ancestors. In this way new species arise from existing ones" (p. 400). There is no explanation of what 'species accumulating differences' would mean (an increase of the number of variants? an increase of the variance? a change of the mean or average of the population?) but this characterization seems to apply both to phenotypic changes (whether or not they have a genetic base) and genotypic changes. However, without any further explanation, the next section talks about "a change in the genetic composition," and the glossary defines 'evolution' as 'genetic change in a population of organisms.' It is not clear, though, whether 'genetic variation' refers only to gene variants or to any variation in DNA. The remark that "DNA testing shows that natural populations generally have substantial variation" in the next section suggests the latter view, but the discussion of the Hardy-Weinberg ratio in terms of gene frequencies in the section thereafter suggests that evolution is seen as a change in gene frequencies, and the remark about mutation and the lack of attention to the neutral theory suggest that the relevant notion of evolution is a gene-based change of the phenotypic characteristics of a population.

Needless to say, I no longer trust this textbook and deeply regret the EUR 75 I spent on it.

When I look at discussions about extended syntheses, I often wonder why both parties have such an outdated view of evolution. Now I know: it is the view presented in general biology books.

However, what should people do when they need to keep their knowledge of biology up to date, but do not have the time (and perhaps the knowledge) to follow the journals of that discipline, if they can't trust the textbooks?

Arno Wouters said...

Now, let's see what Mason et al. (2017) say about 'alternative splicing'.

From Chapter 15 "Genes and How They Work":

"One estimate, based on analysis of human and mouse transcripts, is that 95% of mammalian genes produce alternative transcripts, with an average of 4 transcripts per gene. Other estimates have produced both higher and lower numbers. Another recent study found similar overall levels of alternative splicing with as many as 25 different transcripts from individual genes. However, this same analysis indicated that most genes had a single predominant transcript. This leaves us with an incomplete picture, but further work should clarify this" (p. 291)

It is not explained how the estimates are made and how they distinguish between alternative splicing and splicing errors. I guess that 'levels of alternative splicing' refers to an estimate of the number of genes that produce alternative products, rather than to levels of alternatively spliced products.

"The possible functions of the protein products of these splice variants have been investigated for only a small fraction of the potentially spliced genes. These data, however, are part of the explanation for how approximately 20,000 human genes can encode the more than 100,000 different proteins estimated to exist in human cells" (p. 291)

The discussion of alternative splicing is taken up again in chapter 16 "Control of Gene Expression."

On p. 320 two "well-characterized examples" are discussed in three paragraphs, preceded by a one-paragraph introduction in which it is repeated that the frequency of alternative splicing as discussed in the previous chapter emphasizes its importance, but that the functional significance is not clear.

It is not explained how the frequency of a process whose function is not clear would show that that process is important, but let's not cavil.

The first example: it is found that sex differentiation in Drosophila is the result of a complex series of alternative splicing events that differ in males and females. No hint of an explanation of how this is shown, what is spliced and how these splicing events result in sex differences. And, of course, there is no literature reference.

The other example is slightly more detailed. The thyroid gland produces calcitonin, the hypothalamus CGRP. Calcitonin and CGRP are produced from the same transcript and this is illustrated with a picture. In the next paragraph it is said that which of the two is produced depends on the presence of tissue-specific factors.

Larry Moran said...

It's covered extensively in Chapter 3 of my manuscript.

What's in Your Genome? Chapter 3: What Is a Gene?

Larry Moran said...

I know a little bit about writing textbooks from personal experience, from knowing several other textbook authors, and from serving on the editorial board of an education journal.

It's not an easy job. You never have time to delve into every subject to make sure you have the latest information so you rely on reviews and on what appears to be the consensus view as expressed in the introductions to scientific papers.

You hope that any serious errors will be picked up by reviewers but those reviewers tend to be educators who often don't know any more than you do.

This is why popular misconceptions are so prevalent in the major textbooks. It's why the ENCODE publicity stunt was much more damaging than most people realize. It's going to take a generation or more to remove the false information spread by ENCODE in collaboration with Nature and Science.

There's also a dirty little secret that's rarely mentioned in public. Publishers don't want their textbook to get too far out in front of the intended audience (= university lecturers). That's because lecturers have to choose between competing textbooks and the competition is fierce. If your book is the only one that differs from the popular view then most lecturers are going to assume you are wrong and reject your book. Even if they suspect you are correct they don't want next year's students to see that you've been teaching the subject incorrectly in all the previous year's classes. Furthermore, they don't want to upgrade their lecture notes and exam questions unless it's absolutely necessary.

Joe Felsenstein said...

I'd forgotten that. You do indeed seem to be out there slaying numerous dragons. Ones that need the attention.

Tomi Aalto said...

Alternative splicing is one of the most sophisticated mechanisms in cells. The number of DNA strands used for protein encoding in the human genome is only ~19,000 but the number of different protein isoforms in a human body is up to six million according to the latest research. Based on one strand of DNA the cell is able to produce thousands of different proteins. Often these DNA strands are read from different locations of the genome, in both the 3' and 5' directions and even from different chromosomes. So-called DNA genes are overlapped, embedded and so optimally organized that it's obvious that they don't tolerate mutations. That's why 561,119 genetic mutations in the human genome are associated with more than 15,000 genetic diseases. (DisGeNet)

Alternative splicing is highly regulated by epigenetic factors and mechanisms. These are DNA methylation profiles (enhancers and promoters), transcription factors, non coding RNA molecules and especially histone epigenetic markers that function as a biological database.

The alternative splicing mechanism shows how passive a role DNA has in cellular processes. DNA doesn't determine organismal characteristics. Think about butterfly metamorphosis; the DNA is the same in those four different stages of development. Everything is driven by epigenetic mechanisms and factors.

Larry Moran said...

This is a perfect illustration of the words written by Alexander Pope over 300 years ago, "A little learning is a dangerous thing."

Unknown said...

I do see your point, but is there any other proposed mechanism (that obviously they fail to mention) for the 20,000 or so genes to produce five times as many proteins?

Larry Moran said...

There are only about 22,000 distinct proteins made in human cells. This includes an estimate of approximately two different proteins made from about 2,000 genes by alternative splicing.

Rosie Redfield said...

I have the same problem getting other researchers to think clearly about selection on genetic exchange processes and on so-called quorum sensing. Here's my favourite Tolstoy quote about this problem:

"I know that most men, including those at ease with problems of the greatest complexity, can seldom accept even the simplest and most obvious truth if it be such as would oblige them to admit the falsity of conclusions which they have delighted in explaining to colleagues, which they have proudly taught to others, and which they have woven, thread by thread, into the fabric of their lives."

Larry Moran said...

This is so true. A few months ago I had a discussion with a colleague about the validity of abundant alternative splicing. This person had been teaching the standard dogma to undergraduates for many years and it was clear that my colleague was reluctant to admit that splice variants could be due to splicing errors. I had the feeling that admitting to flawed teaching was a major impediment to changing their mind about alternative splicing.

John Harshman said...

Just curious. Has anyone ever assayed, say, C. elegans looking for alternative transcripts?

Jmac said...

Would that change your mind if someone has?

John Harshman said...

More importantly, would that change your mind?