Triose phosphate isomerase (TIM) is one of the enzymes in the gluconeogenesis pathway leading to the synthesis of glucose from simple precursors. It also plays a role in the degradation of glucose (glycolysis). The enzyme catalyzes the following reaction ....
To the best of my knowledge, no significant variants of this enzyme due to alternative promoters, alternative splicing, or proteolytic cleavage are known.1 The enzyme has been actively studied in biochemistry laboratories for at least eighty years.
The official name of the human TPI gene is TPI1. The official database entry for this gene is maintained in a database at the American National Center for Biotechnology Information (NCBI) [Gene = 7167]. That record lists three variants produced by alternative splicing and differential use of promoters. Each of these three variants is supported by RefSeq entries. (The RefSeq entries begin with "NP" for amino acid sequences and "NM" for nucleotide sequences.)
The human version of this gene has seven exons and six introns.
The middle version of this group (isoform 2) encodes a protein of 249 amino acid residues beginning with MAPSRKFFV .... The predicted molecular weight is 26,669 daltons. The size and sequence of this version of the protein correspond to all of the entries in the structural databases and all the data in the biochemical literature. There's no doubt that this version is a biologically relevant enzyme found in humans. This version also corresponds to the homologues in all other species.
The top version of this gene (isoform 1) corresponds to a transcript beginning from an upstream promoter. Translation of the mature mRNA is predicted to begin at a presumed start codon upstream of the normal one. This is predicted to give a protein with an extra 37 amino acid residues at the N-terminal end of the protein. This is the "canonical" sequence shown in the UniProt database [UniProt = P60174]. All other sequences are listed as variants relative to that sequence. Thus, the correct version is a "variant" missing residues 1-37.
The bottom version shown in the UniGene figure is an alternatively spliced isoform (isoform 4) missing residues 1-119. It makes no sense to suggest that this isoform has any biological significance. Isoform 3 was a version truncated at the C-terminus of the protein. It has been eliminated by curation. (Twelve other incorrect versions were eliminated many years ago.)
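The residue counts quoted above can be checked with a little arithmetic. This is only a sketch: it assumes that "missing residues 1-119" is numbered relative to the longer UniProt "canonical" sequence (isoform 1), which the text implies but does not state outright. The variable names are made up for illustration.

```python
# Residue counts taken from the text above.
ISOFORM_2_LENGTH = 249       # the predominant, well-characterized enzyme
N_TERMINAL_EXTENSION = 37    # extra residues predicted for isoform 1
ISOFORM_4_MISSING = 119      # isoform 4 is described as missing residues 1-119

# Isoform 1 (the UniProt "canonical" sequence) is isoform 2 plus the extension.
isoform_1_length = ISOFORM_2_LENGTH + N_TERMINAL_EXTENSION

# ASSUMPTION: residues 1-119 are counted in the isoform 1 numbering.
isoform_4_length = isoform_1_length - ISOFORM_4_MISSING

# Under that assumption, this many residues of the real 249-residue enzyme
# would be absent from isoform 4.
missing_from_isoform_2 = ISOFORM_4_MISSING - N_TERMINAL_EXTENSION

print("isoform 1 length:", isoform_1_length)
print("isoform 4 length:", isoform_4_length)
print("residues of isoform 2 absent from isoform 4:", missing_from_isoform_2)
```

If the numbering assumption holds, isoform 4 would lack roughly the first third of the known enzyme, which is why it is hard to see how it could fold and assemble into an active dimer.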
What about isoform 1, the version with the extra 37 amino acid residues? Is there any evidence that this protein actually exists in a biologically relevant form? The short answer is, "no." Nevertheless, this is the version you get in all the downloadable human genome sequence databases.
If you go to the RefSeq entry for the predicted mRNA sequence of isoform 1 [NM_001159287] this is what you see ....
CCDS Note: This CCDS represents a TPI variant that uses an upstream promoter compared to the CCDS8566.1 representation. Data in PMIDs 4022011, 2925688, 2243103 and 10575546 support the presence of the internal promoter used by the CCDS8566.1 variant. The use of the upstream promoter is supported by human transcript data, including M10036.1, AL517115.3 and DB444195.1, as well as homologous transcripts. This longer transcript also uses an upstream start codon, resulting in an isoform that is 37 aa longer at the N-terminus compared to the CCDS8566.1 isoform. The sequence encoding the longer N-terminus is conserved in most mammalian species.
What this means is that curators have included isoform 1 in the databases because RNAs from an upstream promoter have been sequenced. The longer version of the protein is predicted on the grounds that the first start codon (methionine codon) in a mature mRNA will be utilized. As far as I can tell, the only evidence for the biological relevance of this prediction is the fact that the sequence of the 37 predicted N-terminal amino acids is similar in most (but not all?) mammals.
If there were evidence that this protein is biologically relevant it would be mentioned here. Contrast this with the entry for isoform 2, the common version that we know is relevant [NM_000365.5].
Transcript Variant: This variant encodes the predominant isoform.
CCDS Note: This CCDS represents a TPI variant that uses an internal promoter compared to the CCDS53740.1 representation. Data in PMIDs 4022011, 2925688, 2243103 and 10575546 support the presence of the internal promoter used by this variant. The resulting isoform is 37aa shorter at the N-terminus compared to the CCDS53740.1 isoform. N-terminal sequencing in PMIDs 9150946 and 9150948 supports the existence of the shorter isoform in vivo.
So, we're left with a weird situation where the "canonical" protein sequence of the human triose phosphate isomerase gene is not the common sequence and may not even be expressed! Anyone looking for the size and sequence of the human triose phosphate isomerase enzyme will be misled into thinking that the longer, larger version is the normal protein.
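For readers who want to check the two RefSeq records for themselves, here is a minimal sketch that builds request URLs for NCBI's E-utilities `efetch` service. The helper name `refseq_flatfile_url` is my own invention, and the script only constructs the URLs; it does not download anything. I'm assuming the curator notes quoted here appear in the COMMENT section of the GenBank-format flat file that comes back.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def refseq_flatfile_url(accession: str) -> str:
    """Build an efetch URL for the GenBank flat file of a nucleotide record.

    Hypothetical helper for illustration; it is not part of any NCBI library.
    """
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # RefSeq accession, with or without a version
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH}?{urlencode(params)}"


# The two transcript records discussed in the text.
for accession in ("NM_001159287", "NM_000365.5"):
    print(refseq_flatfile_url(accession))
```

Pasting either printed URL into a browser should return the record as plain text, which makes it easy to compare the annotation for the longer isoform with the one for the predominant isoform.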
This is a common problem with sequence databases because the curators and annotators don't have time to thoroughly search all the biochemical literature. They have to make decisions that are largely based on the quality of sequence data in the databases and on common sense. This has worked pretty well in eliminating most transcript variants but some of them slip through the cracks. I have no idea why isoform 3 is still thought to be important enough to reference in RefSeq.
This result, the curation of so-called "alternative transcripts," is neither widely known nor appreciated. We still see scientists promoting the idea that most human protein-coding genes will produce multiple protein isoforms. For example,
Alternative pre-mRNA splicing (AS) allows a single gene to generate more than one mature mRNA species through non-uniform utilization of exonic and intronic sequences. Many multiexon transcripts in higher eukaryotes undergo AS. It is currently thought that this form of regulation effectively quadruples the number of protein isoforms compared with the number of their encoding genes in mammalian genomes.
This is from a review published just a few days ago (Aug. 15, 2016) in a respectable journal (Yap and Makeyev, 2016). It is extremely misleading. The idea that the average protein-coding gene could produce four different biologically functional isoforms is ridiculous. I've chosen just one example, the triose phosphate isomerase gene, to illustrate this point but I could have picked any one of hundreds of other well-studied genes in fundamental metabolic pathways. When the structure and function of the protein product are known, the predicted versions from so-called "alternative" splice variants make no sense.
The Frequency of Alternative Splicing
Two Examples of "Alternative Splicing"
Making Sense in Biology
A Challenge to Fans of Alternative Splicing
Let's look at the so-called "alternative splicing"2 databases to see examples of transcripts that have been rejected by the RefSeq curators. The point I'm making is that the majority of these extra transcripts are so nonsensical that RefSeq curators have decided to ignore them. They are undoubtedly due to splicing errors or sequencing errors and that's why they have been eliminated from RefSeq. As you look at those variants, try to imagine a situation where one of the most conserved enzymes in a fundamental pathway would need to have weird variants that couldn't possibly assemble into an active dimer.
Here are the predictions of "alternative splicing" in the human TPI1 gene ....
The Human-transcriptome DataBase for Alternative Splicing (H-DBAS) includes four different transcripts [HIX0010385]. The top one in the figure (HIT000033038) is the correct mRNA and the three below it whose names begin with "HIT" are the variants included in this database. The next three, whose names begin with "NM" or "NR" are the RefSeq transcripts. This is as close to being the "official" version as you can get. Note that NM_000365 is the correct transcript and it's identical to HIT000033038 in the H-DBAS database. The bottom three variants are from the Ensembl database. The bottom one is the correct version.
Counting the correct version, there are seven different processed transcripts and no two databases agree on all entries. As you can see below, there are dozens of other variants that have been identified but most of them have been rejected or ignored by the annotated databases such as RefSeq and Ensembl. I assume this is because they are presumed to be artifacts or splicing errors. That makes a lot of sense. However, as I said above, even the ones that are included in RefSeq and Ensembl look to me like variants that are not biologically relevant.
The Alternative Splicing Gallery (asg) shows 52 transcript variants [ENSG00000111669].
According to ECgene, the human TPI1 gene produces 34 different transcript variants encoding 13 distinct proteins [EC gene search for TPI1].
This is what the raw data looks like. Based on "evidence" like this, many scientists conclude that the average human gene produces multiple biologically relevant protein isoforms.
1. I found one report of different variants in mouse sperm. This included detection of different proteins of higher molecular weight than the normal version.
2. Just because a splice variant can be detected does not mean that it is functional. It could be a splicing error. "Alternative splicing" should be restricted to a phenomenon that's known to be biologically relevant in producing different functional products. A simple collection of all transcripts, including mistakes, is just a collection of transcripts and processed transcripts. It may or may not include genuine examples of alternative splicing.
Image Credit: The first two figures are from: Moran, L.A., Horton, H.R., Scrimgeour, K.G., and Perry, M.D. (2012) Principles of Biochemistry 5th ed., Pearson Education Inc. page 175 [Pearson: Principles of Biochemistry 5/E]
Yap, K., and Makeyev, E. V. (2016) Functional impact of splice isoform diversity in individual cells. Biochemical Society Transactions, 44:1079-1085. [doi: 10.1042/BST20160103]