Last week I bumped into a colleague who teaches in our third-year molecular biology course. I was lamenting the sad state of science these days and we got to talking about alternative splicing. I repeated my complaint that many of the predicted alternative splice variants are artifacts. It makes no sense that conserved genes would be producing alternative protein variants that are species-specific. I am convinced that the EST databases are full of artifacts and that most predicted splice variants do not exist.
My colleague was shocked. He is firmly convinced that most human genes express a number of different protein products that are produced as the result of alternatively spliced mRNA precursors. I asked him if he had ever looked at his favorite genes to see if the predicted variants make any sense. The ones that I've looked at certainly don't. (Join in the fun: see the challenge below.)
My colleague is very knowledgeable about the genes for the major subunits of eukaryotic RNA polymerase since it was his lab that cloned the first one. I suggested that he look at the predicted alternative splice variants of the two human genes and let me know if he is still convinced that these variants make biological sense. I'm not sure he will do it so let's take a look ourselves.
Eukaryotic RNA polymerase is a complex protein machine consisting of ten different subunits. Two of the subunits, Rpb1 and Rpb2, are more commonly known as A and B. In the human genome they are encoded by the genes POLR2A and POLR2B respectively [RNA Polymerase Genes in the Human Genome].
If you click on the Entrez Gene URLs you will end up at a page that summarizes what is known about the gene. Down the right-hand side of the page there are links to several other webpages, including a link to AceView, a database of alternative splice variants. Before following this link to the POLR2A variants, let's note that on the annotated Entrez Gene website there are no alternative splice variants listed. Apparently someone has decided that the predicted variants are probably artifacts.
Go to the AceView page for POLR2A. The first thing you see is a short explanation.
RefSeq annotates one representative transcript (NM included in AceView variant.a), but Homo sapiens cDNA sequences in GenBank, filtered against clone rearrangements, coaligned on the genome and clustered in a minimal non-redundant way by the manually supervised AceView program, support at least 11 spliced variants.
Here's the figure showing the various predicted alternatively spliced transcripts and the various different proteins.
Note that this locus is complex: it appears to produce several proteins with no sequence overlap.
Expression: According to AceView, this gene is expressed at very high level, 4.8 times the average gene in this release. The sequence of this gene is defined by 537 GenBank accessions from 518 cDNA clones, some from breast (seen 40 times), marrow (29), head neck (19), brain (18), eye (18), leukopheresis (18), lung tumor (18) and 132 other tissues. We annotate structural defects or features in 13 cDNA clones.
Alternative mRNA variants and regulation: The gene contains 29 different introns (28 gt-ag, 1 gc-ag). Transcription produces 13 different mRNAs, 11 alternatively spliced variants and 2 unspliced forms. There are 7 probable alternative promotors and 5 non overlapping alternative last exons (see the diagram). The mRNAs appear to differ by truncation of the 5' end, truncation of the 3' end, overlapping exons with different boundaries, alternative splicing or retention of 4 introns. 337 bp of this gene are antisense to spliced gene pluvu, raising the possibility of regulated alternate expression.
Protein coding potential: 10 spliced and the unspliced mRNAs putatively encode good proteins, altogether 11 different isoforms (3 complete, 4 COOH complete, 4 partial), some containing domains RNA polymerase Rpb1, domain 1, RNA polymerase, alpha subunit, RNA polymerase Rpb1, domain 3, RNA polymerase Rpb1, domain 4, RNA polymerase Rpb1, domain 5, RNA polymerase Rpb1, domain 6, RNA polymerase Rpb1, domain 7, Eukaryotic RNA polymerase II heptapeptide repeat [Pfam]. The remaining 2 mRNA variants (1 spliced, 1 unspliced) appear not to encode good proteins.
It's really difficult to imagine that any of these are biologically relevant. How could a small bit of the large RNA polymerase subunit ever be part of the RNA polymerase protein complex? It's not a surprise that the Entrez Gene annotators have ignored these predictions.
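As a quick sanity check, the counts in the AceView summary quoted above can be tallied to see how few of the predicted isoforms are full-length. This is only a sketch: the numbers are copied from the quoted text, the helper names are made up for illustration, and treating "COOH complete" and "partial" isoforms as not full-length is my reading of the AceView labels, not something AceView states.

```python
# Counts copied from the AceView POLR2A summary quoted above.
introns = {"gt-ag": 28, "gc-ag": 1}
mrnas = {"spliced": 11, "unspliced": 2}
isoforms = {"complete": 3, "COOH complete": 4, "partial": 4}


def tally(counts):
    """Sum the values of a dict of category counts."""
    return sum(counts.values())


def fraction_not_full_length(isoform_counts):
    """Fraction of predicted isoforms not labeled 'complete'.

    Assumption: 'COOH complete' and 'partial' both mean the predicted
    protein is missing part of the full-length sequence.
    """
    total = tally(isoform_counts)
    return (total - isoform_counts["complete"]) / total


if __name__ == "__main__":
    print("introns:", tally(introns))
    print("mRNAs:", tally(mrnas))
    print("isoforms:", tally(isoforms))
    print("not full-length: {:.0%}".format(fraction_not_full_length(isoforms)))
```

The totals agree with the quoted summary (29 introns, 13 mRNAs, 11 isoforms), and under the stated assumption 8 of the 11 predicted isoforms are fragments of the protein.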
If, as I believe, most of the small ESTs on which these predictions are based are artifacts, then the overall pattern makes sense. What you see are examples of splicing errors where an intron has not been correctly removed. These extremely rare splicing errors are copied into cDNA during construction of EST libraries and specifically selected by screening out all the correctly spliced mRNAs. (That's how you make most EST libraries.)
Here's what AceView says about the gene for the other large subunit [AceView: POLR2B].
RefSeq annotates one representative transcript (NM included in AceView variant.a), but Homo sapiens cDNA sequences in GenBank, filtered against clone rearrangements, coaligned on the genome and clustered in a minimal non-redundant way by the manually supervised AceView program, support at least 9 spliced variants.
Once again, AceView notes that the annotated human genome has ignored the predicted alternative splice variants but maintains that there are at least nine of them.
Here's the figure. Decide for yourself whether this is credible.
There are several well-known examples of human genes producing different protein variants due to alternative splicing. The ones I can think of off the top of my head are the genes for class I antigens, α-tropomyosin, and calcitonin. I'm sure there are half a dozen others.
Here's the challenge. See if you can find a human gene for a well-studied protein where the structure of the protein is known and there are multiple protein variants derived by alternative splicing. I bet that readers of Sandwalk can't find very many where the predicted variants make any sense and are likely to be biologically significant.
What does this mean? Whenever you look at your favorite well-studied gene you see that the predictions of alternative splicing are silly. So why should we believe the genome-wide analyses? Is it just a coincidence that the more we learn about a given gene the more we become willing to reject the ESTs as artifacts? Or is it possible that alternative splicing is mostly confined to those genes that have not been well studied?