Tuesday, April 21, 2009

How to Evaluate Genome Level Transcription Papers

It's often very difficult to evaluate the results of large-scale genome studies. Part of the problem is that the technology is complicated and the controls are not obvious. Part of the problem is that the results depend a great deal on the software used to analyze the data and the limitations of the software are often not described.

But those aren't the only problems. We also have to take into consideration the biases of the people who write the papers. Some of those biases are the same ones we see in other situations except that they are less obvious in the case of large-scale genome studies.

Laurence Hurst has written up a nice summary of the problem and I'd like to quote from his recent paper (Hurst, 2009).
In the 1970s and 80s there was a large school of evolutionary biology, much of it focused on understanding animal behavior, that to a first approximation assumed that whatever trait was being looked at was the product of selection. Richard Dawkins is probably the most widely known advocate for this school of thought, John Maynard Smith and Bill (WD) Hamilton its main proponents. The game played in this field was one in which ever more ingenious selectionist hypotheses would be put forward and tested. The possibility that selection might not be the answer was given short shrift.

By contrast, during the same period non-selectionist theories were gaining ground as the explanatory principle for details seen at the molecular level. According to these models, chance plays an important part in determining the fate of a new mutation – whether it is lost or spreads through a population. Just as a neutrally buoyant particle of gas has an equal probability of diffusing up or down, so too in Motoo Kimura's neutral theory of molecular evolution an allele with no selective consequences can go up or down in frequency, and sometimes replace all other versions in the population (that is, it reaches fixation). An important extension of the neutral theory (the nearly-neutral theory) considers alleles that can be weakly deleterious or weakly advantageous. The important difference between the two theories is that in a very large population a very weakly deleterious allele is unlikely to reach fixation, as selection is given enough opportunity to weed out alleles of very small deleterious effects. By contrast, in a very small population a few chance events increasing the frequency of an allele can be enough for fixation. More generally then, in large populations the odds are stacked against weakly deleterious mutations and so selection should be more efficient in large populations.

In this framework, mutations in protein-coding genes that are synonymous – that is, that replace one codon with another specifying the same amino acid and, therefore, do not affect the protein – or mutations in the DNA between genes (intergene spacers) are assumed to be unaffected by selection. Until recently, a neutralist position has dominated thinking at the genomic/molecular level. This is indeed reflected in the use of the term 'junk DNA' to describe intergene spacer DNA.

These two schools of thought then could not be more antithetical. And this is where genome evolution comes in. The big question for me is just what is the reach of selection. There is little argument about selection as the best explanation for gross features of organismic anatomy. But what about more subtle changes in genomes? Population genetics theory can tell you that, in principle, selection will be limited when the population comprises few individuals and when the strength of selection against a deleterious mutation is small. But none of this actually tells you what the reach of selection is, as a priori we do not know what the likely selective impact of any given mutation will be, not least because we cannot always know the consequences of apparently innocuous changes. The issue then becomes empirical, and genome evolution provides a plethora of possible test cases. In examining these cases we can hope to uncover not just what mutations selection is interested in, but also to discover why, and in turn to understand how genomes work. Central to the issue is whether our genome is an exquisite adaption or a noisy error-prone mess.
Sandwalk readers will be familiar with this problem. In the context of genome studies, the adaptationist approach is most often reflected as a bias in favor of treating all observations as evidence of functionality. If you detect it, then it must have been selected. If it was selected, it must be important.

As Hurst points out, the real question in evaluating genome studies boils down to a choice between an exquisitely adapted genome or one that is messy and full of mistakes. The battlefields are studies on the frequency of alternative splicing, transcription, the importance of small RNAs, and binding sites for regulatory proteins.
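
As a numerical aside, the population-size argument in the Hurst quote above (weakly deleterious alleles rarely fix in large populations) can be illustrated with Kimura's diffusion approximation for the fixation probability of a single new mutation. This is only a sketch under stated assumptions that are not in the original text: a diploid population whose effective size equals its census size N, additive selection with coefficient s, a starting frequency of 1/(2N), and an arbitrary illustrative value of s.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Probability that a single new mutant copy eventually fixes.

    Kimura's diffusion approximation for a diploid population of n
    individuals, assuming effective size equals census size, additive
    selection with coefficient s, and a starting frequency of 1/(2n).
    """
    if n <= 0:
        raise ValueError("population size must be positive")
    exponent = -4.0 * n * s
    if abs(exponent) < 1e-12:
        # Neutral limit: each of the 2n gene copies is equally likely to fix.
        return 1.0 / (2 * n)
    if exponent > 700.0:
        # Selection overwhelms drift: the denominator would overflow a
        # float, and the ratio is effectively zero.
        return 0.0
    # (1 - e^(-2s)) / (1 - e^(-4ns)), written with expm1 for accuracy.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


if __name__ == "__main__":
    s = -1e-5  # assumed, illustrative weakly deleterious coefficient
    for n in (100, 10_000, 1_000_000):
        u = fixation_probability(s, n)
        # u * 2n compares the result with the neutral expectation 1/(2n).
        print(f"N={n:>9,}  u={u:.3e}  relative to neutral={u * 2 * n:.3e}")
```

With the same small negative s, the allele behaves almost like a neutral one when N is 100 but is essentially never fixed when N is a million, which is the sense in which selection is "more efficient" in large populations.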

Let's take transcription studies as an example.
Consider, for example, the problem of transcription. Although maybe only 5% of the human genome comprises genes encoding proteins, the great majority of the DNA in our genome is transcribed into RNA [1]. In this the human genome is not unusual. But is all this transcription functionally important? The selectionist model would propose that the transcription is physiologically relevant. Maybe the transcripts specify previously unrecognized proteins. If not, perhaps the transcripts are involved in RNA-level regulation of other genes. Or the process of transcription may be important in keeping the DNA in a configuration that enables or suppresses transcription from closely linked sites.

The alternative model suggests that all this excess transcription is unavoidable noise resulting from promiscuity of transcription-factor binding. A solid defense can be given for this. If you take 100 random base pairs of DNA and ask what proportion of the sequence matches some transcription factor binding site in the human genome, you find that upwards of 50% of the random sequence is potentially bound by transcription factors and that there are, on average, 15 such binding sites per 100 nucleotides. This may just reflect our poor understanding of transcription factor binding sites, but it could also mean that our genome is mostly transcription factor binding site. If so, transcription everywhere in the genome is just so much noise that the genome must cope with.
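
The random-sequence argument in the quoted passage can be turned into a simple null expectation: how many chance matches to a set of short binding-site patterns would 100 bp of random DNA contain? The sketch below is a toy calculation, not the analysis Hurst cites. It assumes equal base frequencies and independent positions, and `TOY_LIBRARY` is a hypothetical handful of patterns loosely modeled on textbook consensus sequences, not a curated database of binding sites.

```python
# Bases allowed by each IUPAC symbol used here (a subset of the full code).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "R": "Y", "Y": "R", "S": "S", "W": "W", "N": "N"}


def match_probability(motif: str) -> float:
    """Chance that a random site matches, assuming 25% of each base."""
    p = 1.0
    for symbol in motif:
        p *= len(IUPAC[symbol]) / 4.0
    return p


def reverse_complement(motif: str) -> str:
    return "".join(COMPLEMENT[symbol] for symbol in reversed(motif))


def expected_matches(motif: str, seq_len: int = 100) -> float:
    """Expected chance matches on both strands of a random sequence."""
    positions = seq_len - len(motif) + 1
    if positions <= 0:
        return 0.0
    # A palindromic motif matches both strands at the same site, so it is
    # counted once; any other motif can match either strand independently.
    strands = 1 if motif == reverse_complement(motif) else 2
    return strands * positions * match_probability(motif)


# Hypothetical toy library, for illustration only.
TOY_LIBRARY = ["TATAAA", "CACGTG", "CANNTG", "GGGCGG", "TGASTCA", "CCAAT"]

if __name__ == "__main__":
    for motif in TOY_LIBRARY:
        print(f"{motif:<8} {expected_matches(motif):.4f}")
    total = sum(expected_matches(m) for m in TOY_LIBRARY)
    print(f"total expected chance matches per 100 bp: {total:.4f}")
```

The totals from a six-pattern toy library are not comparable to the figures in the quote. The point is only that the expectation grows linearly with the number of patterns and rises steeply as patterns get shorter or more degenerate, so chance matches become common once a library holds many short, degenerate motifs.
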
There is no definitive solution to this conflict. Both sides have passionate advocates and right now you can't choose one over the other. My own bias is that most of the transcription is just noise—it is not biologically relevant.

That's not the point, however. The point is that as a reader of the scientific literature you have to make up your mind whether the data and the interpretation are believable.

Here are two criteria that I use to evaluate a paper on genome level transcription.
  1. I look to see whether the authors are aware of the adaptation vs noise controversy. If they completely ignore the possibility that what they are looking at could be transcriptional noise, then I tend to dismiss the paper. It is not good science to ignore alternative hypotheses. Furthermore, such papers will hardly ever have controls or experiments that attempt to falsify the adaptationist interpretation. That's because they are unaware of the fact that a controversy exists.¹
  2. Does the paper have details about the abundance of individual transcripts? If the paper is making the case for functional significance then one of the important bits of evidence is reporting on the abundance of the rare transcripts. If the authors omit this bit of information, or skim over it quickly, then you should be suspicious. Many of these rare transcripts are present in less than one or two copies per cell and that's perfectly consistent with transcriptional noise, even if it's only one cell type that's expressing the RNA. There aren't many functional roles for an RNA whose concentration is in the nanomolar range or lower. Critical thinkers will have thought about the problem and be prepared to address it head-on.
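
To put "one or two copies per cell" into concentration terms, the arithmetic is just copies divided by Avogadro's number times the volume. The sketch below shows that conversion. The two volumes are assumed order-of-magnitude values chosen for illustration (real cells and nuclei vary widely), and they are not taken from any paper discussed here.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def molar_concentration(copies: float, volume_litres: float) -> float:
    """Molar concentration of `copies` molecules in the given volume."""
    if volume_litres <= 0:
        raise ValueError("volume must be positive")
    return copies / (AVOGADRO * volume_litres)


# Assumed order-of-magnitude volumes, for illustration only.
ASSUMED_VOLUMES_LITRES = {
    "bacterium-sized compartment (~1 fL)": 1e-15,
    "mammalian-sized cell (~2 pL)": 2e-12,
}

if __name__ == "__main__":
    for label, volume in ASSUMED_VOLUMES_LITRES.items():
        conc = molar_concentration(1, volume)
        print(f"1 copy in {label}: {conc:.2e} M")
```

Under these assumptions a single copy works out to about 1.7 nM in the 1 fL volume and a little under 1 pM in the 2 pL volume, so the exact figure depends heavily on which volume you think is relevant.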

1. Or, maybe they know there's a controversy but they don't want you to be thinking about it as you read their paper. Or, maybe they think the issue has been settled and the "messy" genome advocates have been routed. Either way, these are not authors you should trust.

Hurst, L.D. (2009) Evolutionary genomics and the reach of selection. Journal of Biology 8:12 [DOI:10.1186/jbiol113]


  1. Question: What about 'translational' noise? Does that exist? Or do you think once you get to the protein stage, the cell has 'invested' too much for it to be waste, thus it's 'doing something'?

  2. Has anyone tried building a null model of expected rates of accidental transcription, given positional information on a chromosome, or something similar? My background in molecular biology is limited to a couple courses, but it still seems like this would be the kind of tool that someone would have worked on...

  3. Question: What about 'translational' noise? Does that exist?

    Of course it exists. Noise exists everywhere; it's just a matter of degree and importance.

    For normal cell physiology, the translational noise is a non-issue. At least as non-significant as polymerase errors.

    Luckily, technology does not yet allow 'omics folks to quantify every mistranslated peptide at 10^(-6) abundance and claim its functional importance.


  4. Hi I don't have access to the article: Hurst, L.D. (2009) Evolutionary genomics and the reach of selection. Journal of Biology 8:12

    The sentence from this article is interesting:

    "Although maybe only 5% of the human genome comprises genes encoding proteins, the great majority of the DNA in our genome is transcribed into RNA [1]".

    Could anyone let me know what is that reference ([1])?

    This kind of information turns my ideas about genes and genome upside-down. I have always thought that junk-DNA is not transcribed.

  5. Since I come at molecular evolution from a protist angle, I think an important point often missed in regard to neutrality is that not all synonymous mutations are necessarily neutral by definition. Especially in genomes undergoing reduction (or small genomes in general, like viruses) there is a clear codon usage bias and even codon pair bias. These biases also seem, in many cases, to correlate with the relative abundance of certain tRNAs, suggesting an effect on translational efficiency.

    Right now I'm in the camp that leans towards much of this being transcriptional noise, but I don't think the prevalence and relative importance of small RNAs can be overlooked. Look at RNA editing as an example beyond just alternative splicing. Evolution has taken some bizarre twists and turns at the molecular level but, interestingly enough, the existence of all of these RNAs just leads me further in the "mess direction" as opposed to the perfectly adapted direction.

  6. Oh, forgot to add to Abbie that I am sure translational noise probably exists, just as transcriptional noise does.

  7. lazyelephant asks,

    Could anyone let me know what is that reference ([1])?

    The reference is Kapranov et al. (2007) Nat. Rev. Genet. 8:413.

    There are lots of other papers published since then. We've known since the 1970s that significant amounts of the genome, including repetitive sequences, are transcribed at low levels.

    The recent flurry of activity is based on chip technology and not Rot analysis. The interpretation of the results is very different from the consensus in the 1970's.

    Many modern scientists seem to be worried about the low number of genes in the human genome and they are looking for ways to explain what they think is a paradox; namely, that the complexity of humans isn't reflected in the number of genes. That's what's behind many of the claims of massive alternative splicing, an adaptive role for transposons, and abundant functional non-coding RNA's.

    It's what I call The Deflated Ego Problem.

  8. I thought the Deflated Ego Problem was older scientists pooh-poohing new ideas and data as "stuff we've known for the last 30 or 40 years" because the older scientists didn't come up with it 30 or 40 years ago and the thought that science goes on in important directions that differ from their pre-conceived expectations is difficult to bear. :)

    "New scientific truth usually becomes accepted, not because opponents become convinced, but because opponents die, and because the rising generation is familiar with the new truth at the outset"
    -- Max Planck, Naturwissenschaften 33 (1946), p. 230.

  9. Anonymous says,

    I thought the Deflated Ego Problem was older scientists pooh-poohing new ideas and data as "stuff we've known for the last 30 or 40 years" because the older scientists didn't come up with it 30 or 40 years ago and the thought that science goes on in important directions that differ from their pre-conceived expectations is difficult to bear. :)

    30 or 40 years ago we thought that noise was a property of living systems so we weren't all that surprised to find that much of the genome was transcribed at low levels every so often.

    The data isn't new, except that now we can identify the exact regions that are being transcribed.

    What's new is the interpretation. I don't object to a different interpretation (i.e. the rare RNAs are functional). What I object to is the fact that in advocating their pet hypothesis, many of the scientists are completely ignoring any other possibility—such as messy biology. That's not resisting change, that's objecting to bad science.

  10. How difficult would it be to engineer a genome (smaller than the human genome, but still containing many UTR sequences) with most of its "junk DNA" deleted, and to test its viability?
    I believe such an experiment would answer once and for all the question of whether the "transcriptional noise" is necessary for the organism's proper function or not.