Tuesday, June 19, 2007

What is a gene, post-ENCODE?

Back in January we had a discussion about the definition of a gene [What is a gene?]. At that time I presented my personal preference for the best definition of a gene.
A gene is a DNA sequence that is transcribed to produce a functional product.
This is a definition that's widely shared among biochemists and molecular biologists but there are competing definitions.

Now, there's a new kid on the block. The recent publication of a slew of papers from the ENCODE project has prompted many of the people involved to proclaim that a revolution is under way. Part of the revolution includes redefining a gene. I'd like to discuss the paper by Mark Gerstein et al. (2007) [What is a gene, post-ENCODE? History and updated definition] to see what this revolution is all about.

The ENCODE project is a large scale attempt to analyze and annotate the human genome. The first results focus on about 1% of the genome spread out over 44 segments. These results have been summarized in an extraordinarily complex Nature paper with massive amounts of supplementary material (The Encode Project Consortium, 2007). The Nature paper is supported by dozens of other papers in various journals. Ryan Gregory has a list of blog references to these papers at ENCODE links.

I haven't yet digested the published results. I suspect that like most bloggers there's just too much there to comment on without investing a great deal of time and effort. I'm going to give it a try but it will require a lot of introductory material, beginning with the concept of alternative splicing, which is this week's theme.

The most widely publicized result is that most of the human genome is transcribed. It might be more correct to say that the ENCODE Project detected RNA's that are either complimentary to much of the human genome or lead to the inference that much of it is transcribed.

This is not news. We've known about this kind of data for 15 years and it's one of the reasons why many scientists over-estimated the number of humans genes in the decade leading up to the publication of the human genome sequence. The importance of the ENCODE project is that a significant fraction of the human genome has been analyzed in detail (1%) and that the group made some serious attempts to find out whether the transcripts really represent functional RNAs.

My initial impression is that they have failed to demonstrate that the rare transcripts of junk DNA are anything other than artifacts or accidents. It's still an open question as far as I'm concerned.

It's not an open question as far as the members of the ENCODE Project are concerned and that brings us to the new definition of a gene. Here's how Gerstein et al. (2007) define the problem.
The ENCODE consortium recently completed its characterization of 1% of the human genome by various high-throughput experimental and computational techniques designed to characterize functional elements (The ENCODE Project Consortium 2007). This project represents a major milestone in the characterization of the human genome, and the current findings show a striking picture of complex molecular activity. While the landmark human genome sequencing surprised many with the small number (relative to simpler organisms) of protein-coding genes that sequence annotators could identify (~21,000, according to the latest estimate [see www.ensembl.org]), ENCODE highlighted the number and complexity of the RNA transcripts that the genome produces. In this regard, ENCODE has changed our view of "what is a gene" considerably more than the sequencing of the Haemophilus influenza and human genomes did (Fleischmann et al. 1995; Lander et al. 2001; Venter et al. 2001). The discrepancy between our previous protein-centric view of the gene and one that is revealed by the extensive transcriptional activity of the genome prompts us to reconsider now what a gene is.
Keep in mind that I personally reject the premise and I don't think I'm alone. As far as I'm concerned, the "extensive transcriptional activity" could be artifact and I haven't had a "protein-centric" view of a gene since I learned about tRNA and ribosomal RNA genes as an undergraduate in 1967. Even if the ENCODE results are correct my preferred definition of a gene is not threatened. So, what's the fuss all about?

Regulatory Sequences
Gerstein et al. are worried because many definitions of a gene include regulatory sequences. Their results suggest that many genes have multiple large regions that control transcription and these may be located at some distance from the transcription start site. This isn't a problem if regulatory sequences are not part of the gene, as in the definition quoted above (a gene is a transcribed region). As a mater of fact, the fuzziness of control regions is one reason why most modern definitions of a gene don't include them.
Overlapping Genes
According to Gerstein et al.
As genes, mRNAs, and eventually complete genomes were sequenced, the simple operon model turned out to be applicable only to genes of prokaryotes and their phages. Eukaryotes were different in many respects, including genetic organization and information flow. The model of genes as hereditary units that are nonoverlapping and continuous was shown to be incorrect by the precise mapping of the coding sequences of genes. In fact, some genes have been found to overlap one another, sharing the same DNA sequence in a different reading frame or on the opposite strand. The discontinuous structure of genes potentially allows one gene to be completely contained inside another one’s intron, or one gene to overlap with another on the same strand without sharing any exons or regulatory elements.
We've known about overlapping genes ever since the sequences of the first bacterial operons and the first phage genomes were published. We've known about all the other problems for 20 years. There's nothing new here. No definition of a gene is perfect—all of them have exceptions that are difficult to squeeze into a one-size-fits-all definition of a gene. The problem with the ENCODE data is not that they've just discovered overlapping genes, it's that their data suggests that overlapping genes in the human genome are more the rule than the exception. We need more information before accepting this conclusion and redefining the concept of a gene based on analysis of the human genome.
Splicing was discovered in 1977 (Berget et al. 1977; Chow et al. 1977; Gelinas and Roberts 1977). It soon became clear that the gene was not a simple unit of heredity or function, but rather a series of exons, coding for, in some cases, discrete protein domains, and separated by long noncoding stretches called introns. With alternative splicing, one genetic locus could code for multiple different mRNA transcripts. This discovery complicated the concept of the gene radically.
Perhaps back in 1978 the discovery of splicing prompted a re-evaluation of the concept of a gene. That was almost 30 years ago and we've moved on. Now, many of us think of a gene as a region of DNA that's transcribed and this includes exons and introns. In fact, the modern definition doesn't have anything to do with proteins.

Alternative splicing does present a problem if you want a rigorous definition with no fuzziness. But biology isn't like that. It's messy and you can't get rid of fuzziness. I think of a gene as the region of DNA that includes the longest transcript. Genes can produce multiple protein products by alternative splicing. (The fact that the definition above says "a" functional product shouldn't mislead anyone. That was not meant to exclude multiple products.)

The real problem here is that the ENCODE project predicts that alternative splicing is abundant and complex. They claim to have discovered many examples of splice variants that include exons from adjacent genes as shown in the figure from their paper. Each of the lines below the genome represents a different kind of transcript. You can see that there are many transcripts that include exons from "gene 1" and "gene 2" and another that include exons from "gene 1" and "gene 4." The combinations and permutations are extraordinarily complex.

If this represents the true picture of gene expression in the human genome, then it would require a radical rethinking of what we know about molecular biology and evolution. On the other hand, if it's mostly artifact then there's no revolution under way. The issue has been fought out in the scientific literature over the past 20 years and it hasn't been resolved to anyone's satisfaction. As far as I'm concerned the data overwhelmingly suggests that very little of that complexity is real. Alternative splicing exists but not the kind of alternative splicing shown in the figure. In my opinion, that kind of complexity is mostly an artifact due to spurious transcription and splicing errors.
Trans-splicing refers to a phenomenon where the transcript from one part of the genome is attached to the transcript from another part of the genome. The phenomenon has been known for over 20 years—it's especially common in C. elegans. It's another exception to the rule. No simple definition of a gene can handle it.
Parasitic and mobile genes
This refers mostly to transposons. Gerstein et al say, "Transposons have altered our view of the gene by demonstrating that a gene is not fixed in its location." This isn't true. Nobody has claimed that the location of genes is fixed.
The large amount of "junk DNA" under selection
If a large amount of what we now think of as junk DNA turns out to be transcribed to produce functional RNA (or proteins) then that will be a genuine surprise to some of us. It won't change the definition of a gene as far as I can see.
The paper goes on for many more pages but the essential points are covered above. What's the bottom line? The new definition of an ENCODE gene is:
There are three aspects to the definition that we will list below, before providing the succinct definition:
  1. A gene is a genomic sequence (DNA or RNA) directly encoding functional product molecules, either RNA or protein.
  2. In the case that there are several functional products sharing overlapping regions, one takes the union of all overlapping genomic sequences coding for them.
  3. This union must be coherent—i.e., done separately for final protein and RNA products—but does not require that all products necessarily share a common subsequence.
This can be concisely summarized as:
The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.
On the surface this doesn't seem to be much different from the definition of a gene as a transcribed region but there are subtle differences. The authors describe how their new definition works using a hypothetical example.

How the proposed definition of the gene can be applied to a sample case. A genomic region produces three primary transcripts. After alternative splicing, products of two of these encode five protein products, while the third encodes for a noncoding RNA (ncRNA) product. The protein products are encoded by three clusters of DNA sequence segments (A, B, and C; D; and E). In the case of the three-segment cluster (A, B, C), each DNA sequence segment is shared by at least two of the products. Two primary transcripts share a 5' untranslated region, but their translated regions D and E do not overlap. There is also one noncoding RNA product, and because its sequence is of RNA, not protein, the fact that it shares its genomic sequences (X and Y) with the protein-coding genomic segments A and E does not make it a co-product of these protein-coding genes. In summary, there are four genes in this region, and they are the sets of sequences shown inside the orange dashed lines: Gene 1 consists of the sequence segments A, B, and C; gene 2 consists of D; gene 3 of E; and gene 4 of X and Y. In the diagram, for clarity, the exonic and protein sequences A and E have been lined up vertically, so the dashed lines for the spliced transcripts and functional products indicate connectivity between the proteins sequences (ovals) and RNA sequences (boxes). (Solid boxes on transcripts) Untranslated sequences, (open boxes) translated sequences.
This isn't much different from my preferred definition except that I would have called the region containing exons C and D a single gene with two different protein products. Gerstein et al (2007) split it into two different genes.

The bottom line is that in spite of all the rhetoric the "new" definition of a gene isn't much different from the old one that some of us have been using for a couple of decades. It's different from some old definitions that other scientists still prefer but this isn't revolutionary. That discussion has already been going on since 1980.

Let me close by making one further point. The "data" produced by the ENCODE consortium is intriguing but it would be a big mistake to conclude that everything they say is a proven fact. Skepticism about the relevance of those extra transcripts is quite justified as is skepticism about the frequency of alternative splicing.

Gerstein, M.B., Bruce, C., Rozowsky, J.S., Zheng, D., Du, J., Korbel, J.O., Emanuelsson, O., Zhang, Z.D., Weissman, S. and Snyder, M. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res. 17:669-681.

The ENCODE Project Consortium (2007) Nature 447:799-816. [PDF]

[Hat Tip: Michael White at Adaptive Complexity]


  1. I suspect that like most bloggers there's just too much there to comment on without investing a great deal of time and effort.

    Agreed, although this only applies if you want to actually read the papers before proclaiming that everything has been shown to be functional by the ENCODE study -- apparently many commenters are not so constrained.

    I have been meaning to blog about this paper (which I actually like, being a history and terminology aficionado), but I haven't had time. Glad to see you've posted your thoughts.

    I think I will just wait until the hype dies down a bit before blogging more about the ENCODE findings. There's just too much in the way of misconceptions about what they actually showed for me to engage it all efficiently. There are some very interesting and surprising data there, but many of the media/blogger discussions I have seen have been strongly overstated.

  2. The article estimates "a similar fraction of constrained elements (40% in terms of bases) is located in protein-coding regions as unannotated noncoding regions (100% – 40% coding – 20% regulatory regions)," suggesting to me that even if there's a whole lotta transcription going on, there may not be a great deal of it that is significant in terms of selection.

  3. A bit of grade-school math (hopefully someone will be kind enough to tell me if the conclusions I draw here are dead wrong):

    ~5% of the genome is under selection.

    According to the ENCODE article:

    40%, or 2/5 of this (2% of the genome) is in coding regions, so unsurprisingly most of the coding regions are under selection.

    1/5 of this (1% of the genome) is in regions identified as regulatory, so much of the genome identified as regulatory is also under selection.

    2/5 of this (2% of the genome) is in the 90%+ of the genome previously identified as "junk," so it is a small fraction of these regions that is under selection.

  4. Indeed, in agreement with what tr gregory and jud point out, I don't see any hype in that review, and the authors can't
    really be held responsible for incautious comments by others. They seem quite clear both about the relatively small amount of "junk" that appears to be under purifying selection (as opposed to the total amount that's transcribed), and about the need for experimental data to demonstrate the function of any given transcript. If a considerable portion of that 2% of the genome that seems to be conserved, transcribed "junk" turns out to be functional, that is unlikely to shock anybody given the current volume of results coming out on small regulatory RNA's, nor do I see the review authors suggesting that it should shock anybody.

  5. Five years later, this is still a great and relevant post.

    Is this a small typo? I think you meant to write "exons D and E" in this sentence toward the end of the post: "This isn't much different from my preferred definition except that I would have called the region containing exons C and D a single gene with two different protein products."

    Thanks for the discussion.

  6. The time honoured ontology of the genome is not just failing, it is obsolete. We have known for too long there's too much heterogeneity to try and press-gang into the single concept of the "gene" (even if it is restricted to only exons and their ilk). A solution resides in an innovative proposal by Brosius & Gould (1992):
    On "genomenclature": a comprehensive (and respectful) taxonomy for pseudogenes and "junk" DNA. Proc. nat. Acad. Sci. USA 89: 10706-10710.

    Also see:

    Brosius, J. (2003) Contribution of RNAs and retroposition to evolutionary novelties. Genetica 118: 99-116.

    Brosius, J. (2005) Disparity, adaptation, exaptation, bookkeeping, and contingency at the genome level. Paleobiology (Suppl.) 31: 1-16.

    Brosius, J. (2009) The fragmented gene. Ann. N.Y. Acad. Sci. 1178: 186-193.