Most eukaryotic protein-encoding genes are interrupted. The coding regions are divided into numerous blocks called "exons" and the exons are separated by "introns."
An example is shown below. The triose phosphate isomerase (TPI) gene from maize is composed of 9 exons and 8 introns. (Triose phosphate isomerase is one of the enzymes in the glycolysis/gluconeogenesis pathway.)
The top line is a cartoon representation of the TPI gene with each exon in a different color. The thick gray lines between them represent the introns. The gene is transcribed from left (5′) to right (3′) beginning at the promoter (P). The long primary RNA transcript contains both intron and exon sequences. Subsequent processing of this primary transcript results in modification of the 5′ end by addition of a 7-methylguanosine (m7G) cap and modification of the 3′ end by addition of adenylate (A) residues to form the poly A tail. More importantly, the introns are spliced out and the exon sequences are fused to form the mature mRNA. This mRNA is then transported to the cytoplasm where it is translated into protein.
Note that all the coding regions in the exons (hatched) are contiguous in the mature mRNA. The relationship between the exons and the structure of the protein is shown on the right where the color of each segment of the protein corresponds to the color of the exons in the upper figure. There is no correlation between the exons and any protein domains or motifs. (It used to be thought that exons corresponded to domains in the protein.)
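The processing steps described above can be mimicked with a few lines of Python. This is only a toy sketch: the sequence, the exon coordinates, the `m7G-` text marker for the cap, and the function names `splice` and `process` are all made up for illustration and have nothing to do with the real maize TPI gene.

```python
def splice(primary_transcript, exons):
    """Join the exon segments in order, dropping the introns between them.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(primary_transcript[start:end] for start, end in exons)


def process(primary_transcript, exons, poly_a_length=20):
    """Return a toy mature mRNA: cap marker + joined exons + poly A tail."""
    return "m7G-" + splice(primary_transcript, exons) + "A" * poly_a_length


# Hypothetical primary transcript: exons in upper case, introns in lower case.
primary = "AUGGCC" + "guaaguuuucag" + "GAUUAC" + "guaugucccuag" + "UGGUAA"
exons = [(0, 6), (18, 24), (36, 42)]

mature = process(primary, exons, poly_a_length=5)
print(mature)
```

The point of the sketch is simply that the exon sequences end up contiguous in the mature mRNA even though they were separated in the primary transcript.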
The splicing reaction is complicated. The cell must cleave the primary transcript at each end of the intron while holding on to the flanking exons so the chopped RNA transcript does not come apart. Then the two exons have to be joined together. For protein-encoding genes the splicing reactions are catalyzed by an RNA/protein complex called a spliceosome. In some cases, the introns can be thousands of nucleotides long—much longer than the exons.
Let's look at a simplified version of this reaction. The various components of the spliceosome have to assemble at the 5′ (left) end of an intron and at the 3′ end. There's a third site in the middle called the branch site. All three sites are identified by specific short sequences in the primary transcript as shown below.
These are the consensus sequences for vertebrates, including us. The splice site and branch site sequences in other species are similar but not identical.
In the first step of the splicing reaction, the various components of the spliceosome bind to the 5′ splice site, the 3′ splice site, and the branch site. Then the three complexes interact with each other to draw together the ends of the intron and position them near the branch site. This forms the spliceosome.
The first reaction involves an attack of the 2′-OH group of the branch point adenylate residue on the 5′ splice site. This forms an intermediate where the branch site A residue is linked to three nucleotides: its two neighbors in the intron and, through a 2′-5′ bond, the nucleotide at the 5′ end of the intron. The structure resembles a lariat or lasso. This is the structure depicted in Monday's Molecule #31.
Meanwhile, the 5′ end of the transcript is still bound to the spliceosome. This is important because it's about to be joined to the next exon and the reaction wouldn't work if the 5′ end were released following the first cleavage reaction.
In the next step, the spliceosome catalyzes the attack of the 3′-OH group at the end of the 5′ exon on the 3′ splice site. This results in cleavage of the 3′ intron/exon junction and joining of the 5′ exon to the 3′ exon. The intron sequence (dark brown) is released as a lariat (looped) structure.
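For readers who like to see the bookkeeping, here is a minimal sketch of the two steps as string operations. It ignores all the chemistry and the spliceosome itself. The toy sequence, the position chosen for the branch A, and the function name `splice_one_intron` are assumptions made for this example only.

```python
def splice_one_intron(transcript, five_ss, branch, three_ss):
    """Toy model of the two splicing steps for a single intron.

    five_ss  : index of the first intron nucleotide
    branch   : index of the branch point A inside the intron
    three_ss : index of the first nucleotide of the 3' exon
    """
    if not five_ss < branch < three_ss <= len(transcript):
        raise ValueError("expected five_ss < branch < three_ss")
    if transcript[branch].upper() != "A":
        raise ValueError("branch point must be an A")

    # Step 1: the branch A attacks the 5' splice site. The 5' exon is cut
    # free (but stays bound) and the intron's 5' end is tied to the branch A.
    exon1 = transcript[:five_ss]
    lariat_intermediate = transcript[five_ss:]  # intron + 3' exon

    # Step 2: the end of the 5' exon attacks the 3' splice site. The exons
    # are joined and the intron is released as a lariat.
    intron_length = three_ss - five_ss
    intron = lariat_intermediate[:intron_length]
    exon2 = lariat_intermediate[intron_length:]

    loop_end = branch - five_ss + 1
    return {
        "spliced": exon1 + exon2,
        "lariat_loop": intron[:loop_end],  # 5' end of intron through branch A
        "lariat_tail": intron[loop_end:],  # from after branch A to the 3' end
    }


# Hypothetical single-intron transcript (exons upper case, intron lower case).
exon1, intron, exon2 = "AUGCAG", "guaaguccuacuaacuuuuucag", "GUCUAA"
five_ss = len(exon1)
branch = five_ss + intron.index("uacuaac") + 5  # assumed branch A position
three_ss = five_ss + len(intron)

result = splice_one_intron(exon1 + intron + exon2, five_ss, branch, three_ss)
print(result)
```

The "loop" part of the result is the stretch of intron closed by the bond to the branch A, and the "tail" is what dangles off the end of the lasso.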
The two reactions are known as transesterification reactions because each one breaks one phosphodiester linkage in the RNA and forms a new one. The details are not very important. What's important is to recognize that splicing depends on the correct interaction between the components of the spliceosome and the 5′ and 3′ splice site sequences (and the branch site).
These interactions are mediated by small RNAs that are bound to the spliceosome proteins. These RNAs are called small nuclear RNAs (snRNAs) and they're one example of a host of small RNAs produced by non-protein-encoding genes. The snRNA/protein complexes are called small nuclear ribonucleoproteins or snRNPs (snurps).
The snRNAs are complementary to the splice sites and branch sites and that's how the various snRNPs recognize them. This interaction is very weak since it depends on only three or four base pairs. It can be even fewer since there are many splice sites that are not perfect matches to the consensus sequences shown above. The relative lack of significant sequence similarity makes splicing a very error-prone reaction.
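A bit of arithmetic shows why such short recognition sequences are a problem. The sketch below assumes a random sequence with all four bases equally likely, which real transcripts are not, so treat the numbers as rough. The function names and the example motif are made up for illustration.

```python
def expected_chance_matches(motif_length, sequence_length):
    """Expected number of exact matches to one fixed motif in a random
    sequence, assuming the four bases are equally likely (an assumption)."""
    positions = max(sequence_length - motif_length + 1, 0)
    return positions / 4 ** motif_length


def near_matches(sequence, motif, max_mismatches):
    """Start positions where `sequence` matches `motif` with at most
    `max_mismatches` mismatched bases."""
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


# Chance matches to a short motif inside a hypothetical 5000-nucleotide intron.
for k in (3, 4, 8):
    print(k, expected_chance_matches(k, 5000))
```

A four-base motif is expected to turn up by chance about once every 256 nucleotides, so a long intron will contain many sequences that look like a splice site. Allowing mismatches with `near_matches` only increases the number of look-alikes.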
U1 snRNP recognizes 5′ splice sites, U2 snRNP binds to the branch site, and U5 snRNP binds to the 3′ splice site. A more detailed description of the formation of the spliceosome is shown below.