Saturday, February 09, 2008

Junk in Your Genome: Intron Size and Distribution

In the comments to Junk in Your Genome: Protein-Encoding Genes martinc asks,
Larry, if the amount of necessary sequences within introns are as small as you suggest wouldn't this allow us to make a prediction. Couldn't we predict that due to drift there should be very little similarity in intron lengths between different species. If, by any chance, there is similarity then what would your explanation be?
There have been quite a few studies of average intron size in various species. I selected a number for the average size of introns from Hong et al. (2006). The average intron size, according to them, is 3,479 bp in coding regions. This value is a little deceptive since a small number of huge introns make the average quite large. The median value is 1,334 bp, less than half the average.
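To see how a handful of huge introns can pull the mean far above the median, here is a minimal Python sketch. The intron lengths are made-up illustrative values, not data from Hong et al. (2006), and the `summarize` helper is hypothetical.

```python
from statistics import mean, median

# Hypothetical intron lengths in bp (illustrative only, NOT data from Hong et al.).
# Most are short; two are very large, mimicking a long-tailed distribution.
intron_lengths = [90, 110, 150, 220, 400, 1300, 2500, 4000, 12000, 60000]


def summarize(lengths):
    """Return (mean, median) of a list of intron lengths in bp."""
    return mean(lengths), median(lengths)


avg, med = summarize(intron_lengths)
print(f"mean = {avg} bp, median = {med} bp")
```

With a long right tail like this, the mean is dominated by the few largest values, which is why the median is the more representative summary of a "typical" intron.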

I suggested that much of the intron sequences were junk. Martinc's question is quite reasonable but in order to get an answer we need to look more closely at the distribution of introns.

The figure shows the distribution of intron sizes in four species: the flowering plant Arabidopsis thaliana; the fruit fly Drosophila melanogaster; human, and mouse. The data is from Hong et al. (2006, Fig.1).

Note that the distribution in Arabidopsis and Drosophila is very tight. Both of these species have relatively compact genomes compared to mammals. The data strongly suggests that the minimum intron size is about 80 bp.

The distributions in the human and mouse genomes are very different. There is a strong peak at 100 bp—this is similar to the peaks in other species. But unlike other species, mammalian introns can be extremely large, giving rise to a long tail of the distribution extending to 10,000 bp or more. The key question is whether this distribution of long introns is noise or an artifact of gene prediction algorithms, or whether it represents a real phenomenon.

Returning to martinc's question: if we look at well-conserved genes in different species, what we find is some variation in intron length but only around a mean of about 100-400 bp. In other words, in genes that have been closely examined, where the protein product is known, the distribution of intron sizes looks a lot more like the distribution in Arabidopsis and Drosophila.

Let's look at the hsp90 genes. These are the genes that encode Hsp90, the protein that SciPhu was blogging about [Hsp90 and Evolution].

I've picked the zebrafish gene and four mammalian genes to illustrate the variation in intron length. (Blue exons are 5′ and 3′ UTRs.) Most of the introns are between 80 and 400 bp in size but there are a few exceptions. In this case the human gene is the exception; it has two huge introns at the 5′ end of the gene.

What we see is a narrow distribution of intron lengths in most cases and a few huge introns. It isn't surprising that the lengths of introns in different species are quite similar.

Let's look at my favorite gene. HSPA8 is the cytoplasmic version of the chaperone HSP70 multigene family.

We see a similar pattern. Most intron lengths are very similar in different species, suggesting selection for introns in the 100-400 bp range. There are exceptions, as we see in the chimpanzee, monkey and dog genes. All three have large introns at either the 5′ or 3′ ends. The large monkey introns are 10,253 bp and 1,007 bp. The large chimpanzee intron is 13,257 bp in length. This is typical. I think it's very likely that the large introns in noncoding exons are artifacts.
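As a rough illustration of this pattern, the sketch below flags introns that fall above the 100-400 bp range. Only the three large values (10,253, 1,007 and 13,257 bp) come from the text above; every other length, the ordering of introns, and the `flag_large_introns` helper are hypothetical placeholders.

```python
# Upper bound of the typical range discussed in the post (bp).
TYPICAL_UPPER_BP = 400


def flag_large_introns(lengths, upper=TYPICAL_UPPER_BP):
    """Return the intron lengths (bp) that exceed the typical upper bound."""
    return [n for n in lengths if n > upper]


# Only 10253, 1007 and 13257 are quoted in the post; the remaining
# lengths and the intron order are made-up placeholders for illustration.
hspa8_introns = {
    "monkey": [10253, 120, 95, 210, 1007],
    "chimpanzee": [13257, 110, 180, 300],
}

for species, lengths in hspa8_introns.items():
    print(species, flag_large_introns(lengths))
```

A simple threshold like this only picks out candidates; it cannot say whether a flagged intron is a real expansion or an annotation artifact.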

So here's the complete answer to the question posed at the top of the page. I think there's selection to maintain intron sizes within a fairly narrow range of 100-400 bp. Because of this, we expect to see similar intron sizes in different species. On occasion we discover a huge intron that is peculiar to one species. This intron could be a transient expansion that hasn't been reduced yet, or it could be an artifact.

Incidentally, while retrieving these sequences from Entrez Gene I noticed that the annotators have eliminated all splice variants for HSP90 and HSPA8 genes with a few exceptions.

The dog sequences all have many splice variants for every gene and some of the variants have been retained in the Entrez Gene entry for dog HSPA8. Look carefully at the two predicted variants in the second and third lines. These alternative splice variants are supposed to produce Hsc70 proteins that are missing several highly conserved regions encoded by exons 7 and 8. Recall that this is the most highly conserved protein in biology.

These cannot be biologically relevant protein variants that are only produced in dogs. The annotators are right to remove similar artifacts from the other genomes and they should remove these as well. Alternative splice variants are mostly artifacts, in my opinion, but that's a fight for another day.


Hong X, Scofield DG, Lynch M (2006) Intron size, abundance, and distribution within untranslated regions of genes. Mol. Biol. Evol. 23:2392-404. [PubMed]

8 comments:

  1. "In this case the human gene is the exception; it has two huge introns at the 3′ end of the gene." Does this mean that the 3' ends of the genes are on the left side of the graph, or should it read "5' end" instead?

    In addition, I have the impression that the different genes are not depicted at the same scale.

  2. LM: These alternative splice variants are supposed to produce Hsc70 proteins that are missing several highly conserved regions encoded by exons 7 and 8.

    The processed transcripts aren't supposed to do anything at all. Under some circumstances, processed mRNAs may be exported to the cytoplasm and translated into proteins that again aren't supposed to do anything, but might anyway. It's one thing to say that the alternatively spliced transcripts are not cellularly useful for producing (what we consider to be) functional proteins, but it's another thing to deny their existence altogether. (This philosophical point is also a good take-home message from the ENCODE project.)

  3. sparc, I meant 5′ end. Thanks.

    The scales are very different.

  4. anonymous says,

    It's one thing to say that the alternatively spliced transcripts are not cellularly useful for producing (what we consider to be) functional proteins, but it's another thing to deny their existence altogether.

    Good point. I should have made that clear.

    It's ridiculous to pretend that those alternatively spliced variants are producing functional Hsc70 proteins so I conclude that they are not useful.

    I also conclude that the "alternatively spliced" variants don't actually exist. I agree with the genome annotators. The "variants" are artifacts of a corrupt EST database.

    I should have made it clear that I was questioning the very existence of the variants and not just their ability to make functional protein.

  5. A question about "artifacts" - do you think that alternative splicing indicated by RT-PCR experiments is just as questionable as that indicated by ESTs?

    How about full-length cDNAs? Exon arrays?

    I'm trying to get a feel for what would constitute reasonable as well as solid evidence of alternative splicing, according to you.

    As you must know, high-end estimates talk of up to 80% of multi-exonic human genes being alternatively spliced. I'm guessing your estimate would be near the lower end - 10%?

    Sorry if those are too many questions at once.

  6. Poor Larry - didn't you know?

    Software engineers KNOW that the genome has no junk in it...

  7. Oops - here's the link:

    http://randystimpson.blogspot.com/2008/02/most-dna-is-not-junk.html

  8. Larry, if the amount of necessary sequences within introns are as small as you suggest wouldn't this allow us to make a prediction. Couldn't we predict that due to drift there should be very little similarity in intron lengths between different species. If, by any chance, there is similarity then what would your explanation be?

    Just because the amount of necessary intron sequence is small, doesn't mean that intron lengths are not significant. At the very least, long introns would make transcription more costly. So I'm not sure I agree with the premise here - selection could favor longer or shorter introns for particular genes, even if the content is junk.
