Sunday, July 14, 2013

How Not to Do Science

Many reputable scientists are convinced that most of our genome is junk. However, there are still a few holdouts and one of the most prominent is John Mattick. He believes that most of our genome is made up of thousands of genes for regulatory noncoding RNA. These RNAs (about 100 of them for every single protein-coding gene) are mostly involved in subtle control of the levels of protein in human cells. (I'm not making this up. See: John Mattick on the Importance of Non-coding RNA.)

It was a reasonable hypothesis at one point in time.

How do you evaluate a hypothesis in science? Well, one of the things you should always try to do is falsify your hypothesis. Let's see how that works ...
  1. The RNAs should be conserved. FALSE
  2. The RNAs should be abundant (>1 copy per cell). FALSE
  3. There should be dozens of well-studied specific examples. FALSE
  4. The hypothesis should account for variations in genome size. FALSE
  5. The hypothesis should be consistent with other data, such as that on genetic load. FALSE
  6. The hypothesis should be consistent with what we already know about the regulation of gene expression. FALSE
  7. You should be able to refute existing hypotheses, such as transcription errors. FALSE
Normally, you would abandon a hypothesis that had such a bad track record but true believers aren't about to do that. So what's next? Maybe these regulatory RNAs don't show sequence conservation but maybe their secondary structures are conserved. In other words, these RNAs originated as functional RNAs with a secondary structure but over the course of time all traces of sequence conservation have been lost and only the "conserved" secondary structure remains.1 The Mattick lab looked at the "conservation" of secondary structure as an indicator of function using the latest algorithms (Smith et al., 2013). Here's how they describe their attempts to prove their hypothesis in light of conflicting data ...
The majority of the human genome is dynamically transcribed into RNA, most of which does not code for proteins (1–4). The once common presumption that most non–protein-coding sequences are nonfunctional for the organism is being adjusted to the increasing evidence that noncoding RNAs (ncRNAs) represent a previously unappreciated layer of gene expression essential for the epigenetic regulation of differentiation and development (5–8). Yet despite an exponential accumulation of transcriptomic data and the recent dissemination of genome-wide data from the ENCODE consortium (9), limited functional data have fuelled discourse on the amount of functionally pertinent genomic sequence in higher eukaryotes (1, 10–12). What is incontrovertible, however, is that evolutionary conservation of structural components over an adequate evolutionary distance is a direct property of purifying (negative) selection and, consequently, a sufficient indicator of biological function. The majority of studies investigating the prevalence of purifying selection in mammalian genomes are predicated on measuring nucleotide substitution rates, which are then rated against a statistical threshold trained from a set of genomic loci arguably qualified as neutrally evolving (13, 14). Conversely, lack of conservation does not impute lack of function, as variation underlies natural selection. Given that the molecular function of ncRNA may at least be partially conveyed through secondary or tertiary structures, mining evolutionary data for evidence of such features promises to increase the resolution of functional genomic annotations.
Here's what they found ...
When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5–22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional.
Apparently 13.6% of the human genome is a "large proportion." Taken at face value, however, the Mattick lab has now shown that the vast majority of transcribed sequences don't show any of the characteristics of functional RNA, including conservation of secondary structure. Of course, that's not the conclusion they emphasize in their paper.

Why not?

1. I can't imagine how this would happen, can you? You'd almost have to have selection AGAINST sequence conservation.

Smith, M.A., Gesell, T., Stadler, P.F. and Mattick, J.S. (2013) Widespread purifying selection on RNA structure in mammals. Nucleic Acids Research advance access July 11, 2013 [doi: 10.1093/nar/gkt596]


  1. Laurence A. Moran: “It was a reasonable hypothesis at one point in time”


    1. At no point in time was the hypothesis that “most of our genome is made up of thousands of genes for regulatory noncoding RNA” reasonable.

  2. There are bigger problems than passing off 13.6% of the human genome as most of the genome.

    So the sequence varies as if it were evolving neutrally, but the "secondary structure" is conserved.

    And the "secondary structure" is inferred by some kind of computational algorithm? Is there any experimental confirmation that this computational algorithm is reliable? Or experimental confirmation that conserved RNA "secondary structure" is really an indicator of functionality?

    I mean what's their negative control? You'd need to show that DNA sequence (if transcribed into RNA) without conserved secondary structure is NOT functional. I can't imagine any other negative control.

    The PI's two questions are always:

    1. How can we test this? and

    2. What's our negative control?

    1. So the thermodynamic scoring metrics used by the chosen algorithms are based on biochemical data, e.g. melting and stacking energies. There are several papers describing the accuracy of the employed tools. The SISSIz algorithm, as detailed in the manuscript, applies a very sophisticated randomization strategy (see Tanja Gesell's work) that combines both a nearest-neighbour model and a phylogenetic tree to produce a background model. The RNAalifold algorithm combines the thermodynamic scoring with a covariation score based on 'synonymous' (compensatory) mutations in the alignment.
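The covariation logic described above can be illustrated with a toy check. This is emphatically not RNAalifold's actual scoring function, just a sketch of the underlying idea: a predicted base pair gains evolutionary support when the sequence changes across species but complementarity is preserved (a compensatory double substitution). The alignment and column indices below are invented for illustration.

```python
# Toy covariation check: can two alignment columns base-pair (Watson-Crick
# or G-U wobble) in every species, even when the sequence itself changes?
# Compensatory double substitutions that preserve pairing are evidence
# that a secondary structure is under selection.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def pair_support(alignment, i, j):
    """Fraction of sequences in which columns i and j can base-pair."""
    ok = sum((s[i], s[j]) in PAIRS for s in alignment)
    return ok / len(alignment)

# Three hypothetical species; columns 0 and 6 covary (A-U -> G-C -> U-A)
# yet remain complementary in every sequence.
aln = ["AGGCACU",
       "GGGCACC",
       "UGGCACA"]
print(pair_support(aln, 0, 6))  # 1.0: perfect compensatory support
print(pair_support(aln, 1, 2))  # 0.0: G-G cannot pair in any species
```

The real method adds thermodynamic stability and a phylogenetically informed null model on top of this kind of signal, but the compensatory-substitution intuition is the same.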

      Take the MALAT1 example in the paper. The well-characterised mascRNA at the 3' end was more accurately predicted than in previous endeavours. This cloverleaf-like structure motif is preceded by a stable hairpin and is cleaved out by RNase P and exported to the cytoplasm. The cleaved 3' end of the transcript then forms an RNA triplex acting as a knot of sorts to prevent exonuclease degradation, compensating for the lack of a poly(A) tail. Prof. Joan Steitz's lab published the triplex finding while this manuscript was in publication; however, the tools we employed picked up additional evolutionary structural constraints in the regions between the hairpin and the cloverleaf.

      We spent much effort on the negative control issue, which was raised several times during peer review. The problem with using any native biological sequences is that you are forced to make assumptions about their biological functionality. Therefore, we used several in silico methods to scramble the alignments (2 are reported in the manuscript; however, we tested 4 additional ones, including those employed in older screens, cf. Nature Biotechnology & Genome Research).

      I think the issue of identifying true negatives in biology is one that Prof Moran's blog would be well suited for given his opinions and rants. I would be more than pleased if you could provide specific examples of genomic sequences that are devoid of any function or any insight into producing a computational model to reproduce/simulate this. Heck, perhaps this blog could be put to collaborative/constructive use instead of science-bashing?

    2. "I would be more than pleased if you could provide specific examples of genomic sequences that are devoid of any function"

      OK. Start with the megabase deletion mouse. Get the sequence of its deleted regions, and run your method on that.

      What? That never came up in group meeting?

      Next: recall that the bladderwort Utricularia gibba has a genome of 80 Mbp, 75% of which is coding. Now recall that its close relative U. prehensilis has a genome more than 4x larger. That genome should be about 75% junk. Align the two genomes, snip out the U. gibba aligned sequences, then run your method on the difference. Would the result be the same as or different from the human genome?

      A more distant relative, Genlisea hispidula, has a genome of 1510 Mbp. Junk-a-rama. Run your method on G. hispidula, or better, on the difference between the two genomes.

      If your method is F, then do this:

      F(G. hispidula) - F(U. gibba) = ??

      Would it be the same as the human, or different?

      This was my idea. If you publish it, I'm co-author.

      What? The genome of U. prehensilis has not been sequenced, you say? There's your NIH grant right there.

      "constructive use instead of science-bashing?"

      You got us, Inspector Colombo! We iz de anti-science roun here!

    3. Hi Martin,

      "I would be more than pleased if you could provide specific examples of genomic sequences that are devoid of any function or any insight into producing a computational model to reproduce/simulate this."

      Easily done. All of the basic computational models of nucleotide sequence evolution (let's say the HKY85 model or the GTR model, with nucleotide frequencies and exchangeabilities constant across sites) describe sequences that are devoid of function. This is because these simple models do not incorporate site-to-site rate variation or dependencies between sites, so there is no mechanism through which selection can come into the picture. These models can be used for either analysis or simulation, and all of the theory was worked out back in the '80s, so it has had lots of time to mature.

      If all you want is to simulate neutral nucleotide sequences, even an old program like Evolver (in the PAML package) will do it for you. Alternatively, you may find the HyPhy package (full disclosure: I work with the HyPhy team) has more flexibility and (we hope) a more user-friendly GUI.
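The simulation idea above can be sketched in a few lines. This is not Evolver or HyPhy, just a toy Jukes-Cantor-style model in Python with invented parameter values: every site mutates independently with the same probability, and every change is accepted because no selection acts on the sequence.

```python
import random

random.seed(1)
BASES = "ACGT"

def mutate(seq, mu):
    """Neutral, site-independent substitution: each site changes with
    probability mu to a uniformly chosen different base. There is no
    selection, so every substitution is accepted (Jukes-Cantor-like)."""
    return "".join(
        random.choice([x for x in BASES if x != b]) if random.random() < mu else b
        for b in seq
    )

# One ancestral "neutral" sequence diverging along two independent lineages.
ancestor = "".join(random.choice(BASES) for _ in range(10_000))
lineage1 = mutate(ancestor, 0.05)  # per-site substitution probability 0.05
lineage2 = mutate(ancestor, 0.05)

identity = sum(x == y for x, y in zip(lineage1, lineage2)) / len(ancestor)
print(f"pairwise identity after divergence: {identity:.3f}")
```

Under this model every pattern in the descendants is noise, which is exactly why such sequences make a defensible null: any "structure" an algorithm finds in them is a false positive by construction.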

  3. Why would a genome composed mostly of “junk DNA” need several overlapping DNA repair mechanisms dedicated to preventing random changes? Junk is junk: "who", or rather, "what", cares if any random change happens to this supposed "junk DNA"?

    On the other hand, if there are so many mechanisms dedicated to preventing random mutations, these systems would have to be switched off or become dysfunctional for evolutionary theory to be true. Too bad that a dysfunctional mutation protection system is the origin of cancer and hereditary diseases, which reduce the capacity to live and to reproduce.

    This leads us to another paradox in evolutionary theory: the tension between the necessity and the disadvantage of a dysfunctional mutation protection system.
    Well, not the first and not the last one.

    Here is a link to a video for those who like paradoxical comedy like I do:

    A link for those who would prefer to read:

    1. This is complete insanity.

      The existing knowledge of mutation rates fits perfectly within the neutral theory of molecular evolution, while it is very hard to see why the intelligent designer would have made prokaryotes so good at it while leaving us, the supposed center of creation, with such sloppy machinery.

      Lynch M. (2010) Evolution of the mutation rate. Trends Genet 26(8):345-52.

    2. Complete insanity is to believe the prokaryotes could have evolved into eukaryotes even though the mechanism of it is unknown and all available evidence points in the opposite direction.

    3. I am highly intrigued by the statement that "all available evidence points in the opposite direction".

      Could you please elaborate.

      Also, you posted this about 5 minutes after my post, which is not exactly sufficient time to read the paper I linked to, so I assume you haven't done so.

    4. I know the paper. Didn't have to read it.

      Look up Uprooting the tree of life, Scientific American 282(2):72–77, 2000 By Doolittle

      "...Many eukaryotic genes turn out to be unlike those of any known archaea or bacteria; they seem to have come from nowhere..."

    5. And that is evidence that archaea and bacteria evolved from eukaryotes????

      I fail to see the logic.

      Also, the "Uprooting the tree of life" does in no way support your view of the subject, so it's not clear to me why you bring it up except to engage in the time-honored creationist tactic of dishonest quote-mining...

    6. Vashti presents no evidence. As usual we get an authority quote, here chopped up so it is not even a full sentence. Creationists conceal the evidence in order to make it hard or impossible for us to scrutinize the data.

      On the subject of the mutation rate, it can be tuned by evolution of the polymerase enzymes. From extensive experimentation we know that a mutation rate anywhere from the natural rate to about 10x higher drives rapid increases in fitness. That's observed-- not speculation and not controversial.

    7. Vashti,

      Please don't be so much of an idiot.

      1. DNA repair mechanisms prevent changes anywhere in the DNA sequence, not just within junk DNA. Otherwise the probability for deleterious mutations in functional DNA would increase. DNA repair mechanisms cannot tell junk DNA from other DNA.

      2. DNA repair mechanisms are not perfect. We know this from experiments. There's nothing paradoxical about organisms having DNA repair mechanisms and still having enough mutations for evolution to occur.

      3. «Too bad that dysfunctional mutation protection system is the origin of cancer and hereditary diseases, which reduce the capacity to live and to reproduce» Exactly, idiot. Without DNA repair there would be too many mutations, and the probability of problems would increase. Lower mutation rates make it possible for organisms to survive long enough to reach reproductive age, and to maintain variability within populations, which in turn is part of what makes evolution possible.

      Truly, stop ridiculing yourself, Vashti. Ask yourself if it is really possible that scientists would not have noticed something that looks like such an obvious "problem" with such a widely accepted theory. Instead of presenting your "paradoxes" and misinformation as if you knew that such things contradict the science that you don't like, you should ask whether they really do. If you asked instead of asserting, at the very least you would not be presenting yourself as an arrogantly misinformed creationist idiot.

    8. Vash - have you ever read that 'Uprooting...' paper beyond the ENV soundbite?

  4. No matter how one wants to look at it, another big chunk of so-called "junk DNA" turned out to have some function.

    In other news, another big chunk of someone's ego also lost its function. At this pace, there will be nothing left in few years :)

    1. Vashti,

      You seem familiar with Doolittle’s work, so what do you think about his recent assertion (1) that:

      by developing a “larger theoretical framework, embracing informational and structural roles for DNA, neutral as well as adaptive causes of complexity, and selection as a multilevel phenomenon … much that we now call junk could then become functional” (emphasis added)?

      1. Doolittle WF. 2013. Is junk DNA bunk? A critique of ENCODE. Proc Natl Acad Sci U S A 110:5294-300.

    2. Also, what do you think about the nucleoskeletal and nucleotypic hypotheses on the biological roles of the so-called ‘junk DNA’ proposed by Thomas Cavalier-Smith, Michael Bennett and Ryan Gregory, who are some of the top experts on genome evolution?

    3. No matter how one wants to look at it, NONE of the RNA transcripts identified in this study have been shown experimentally to have function.

    4. @Diogenes, that statement is false. Please read the paper with more attention. I would like to clarify the semantics: Zero transcripts are identified in this study because it's a genomics paper; it identifies genomic regions showing evidence of evolutionary selection on RNA structure.

    5. I understand that. I know no transcripts are identified in the paper. My statement was intended as a satire of Vashti's statement, but it is not technically correct. The authors, if I understand correctly, operated under a what-if scenario: what if all this stuff were transcribed? I apologize for any misunderstanding.

    6. At the current rate of discovery of function in junk, all of it will be found functional in approximately 9000 years.

      "At this pace, there will be nothing left in few years :)"

      Right. Keep the faith brother!

    7. Let's ask Vashti to compute how long "a few years" will be at current rates of discovery of function.

      Oh wait, that might involve long division. Drat.

  5. Dear bloggers,

    The 13.6% is what is detected with high confidence by the algorithms employed, which only recover ~30-40% of true positive structures. I realise the manuscript is computationally intense, but it is clearly stated in the discussion that the 13.6% indicates that 13.6/0.4 = 34% of the genome is likely to be evolutionarily constrained at the RNA structure level (see below). Furthermore, these results are based on very specific search parameters (limited to 200 nucleotides). Larry, how many RNA transcripts are over 200 nt?
    Granted, 34% is not quite the >85% reported by ENCODE, but it is of the same order of magnitude (>3x more evolutionarily constrained regions & ~20x more than protein-coding genes).
    FYI, this is not trying to prove that the entire genome is functional. The problem was approached objectively with as few assumptions as possible (other than assuming all of the genome is transcribed, which was intrinsic to the methodology and not yet published by ENCODE at the time).

    [...] In this work, the practicality of RFAM alignments with regards to consensus sequence-based RNA structure prediction is 2-fold: (i) to calculate an upper limit for sliding-window predictions on validated data, and (ii) to estimate the experimental error incurred by multiple sequence alignment heuristics. By comparing both results, it is possible to extrapolate the approximate accuracy of classical scans for evolutionarily conserved RNA secondary structure.
    Hence, the RNA structure predictions we report using conservative thresholds are likely to span >13.6% of the human genome. This number is probably a substantial underestimate of the true proportion given the conservative scoring thresholds employed, the neglect of pseudoknots, the liberal distance between overlapping windows and the incapacity of the sliding-window approach to detect base-pair interactions outside the fixed window length. A less conservative estimate would place this ratio somewhere above 20% from the reported sensitivities measured from native RFAM alignments and over 30% from the observed sensitivities derived from sequence-based realignment of RFAM data (Table 1, Figure 1 and Supplementary Figure S4). [...] By breaking down the control data in function of their sequence characteristics and by reproducing experimental conditions through sequence-based realignment of the input, we set the foundation for an optimized genome-wide investigation of RNA secondary structure conservation.
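The extrapolation in the comment is simple arithmetic: the detected fraction divided by the assumed sensitivity of the screen. Both numbers come from the comment itself; they are reproduced here only to make the calculation explicit.

```python
# Extrapolating the "true" constrained fraction from the detected fraction,
# given an assumed true-positive rate (sensitivity) of the algorithms.
detected = 0.136     # fraction of the genome in high-confidence predictions
sensitivity = 0.40   # assumed sensitivity (~30-40% per the comment)

estimated_total = detected / sensitivity
print(f"estimated constrained fraction: {estimated_total:.0%}")  # 34%
```

Note that the whole extrapolation stands or falls on the sensitivity estimate: with a 30% sensitivity the same detected fraction would extrapolate to ~45%, which is why the choice of that denominator is exactly what the skeptical replies below are probing.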

    1. You didn't answer my question. What is the negative control?

      If an RNA transcript evolves at the neutral rate but does NOT have conserved secondary structure, how much of that has been shown experimentally to be NOT functional?

      And of RNA transcripts that evolve at the neutral rate but do have conserved secondary structure, how much has been shown experimentally to be functional? Percentage-wise.

    2. Reading the paper, I notice that half of the ECSs are in repeats.

      ECSs in repeats can be conserved for obvious reasons that have nothing to do with function in the "important for organismal fitness" sense.

    3. @Mart,

      A good control would be to use 3 billion base pairs of random sequence and ask what percentage is recognized by your algorithms. Did you do that? What's the answer? Do you subtract that number from your results?

      You said,

      Granted 34% is not quite the >85% reported by ENCODE, it is in the same order of magnitude (>3x more evolutionary constrained regions & ~20x more than protein coding genes).

      We know that protein encoding genes make up 25% of the genome, including introns. What percentage of that is predicted to have secondary structure? Did you subtract that value from your estimate?

      FYI, this is not trying to prove that the entire genome is functional.

      That's not exactly the truth, is it? Mattick's lab has been trying for years to prove that the vast majority of the genome produces regulatory RNAs. This is just part of that attempt and you would have been delighted if the number had turned out to be much larger.

      BTW, what is all this "functional" RNA supposed to be doing and does your speculation pass the onion test?

    4. Yes, indeed. I suppose by "obvious" you mean self-propagation? This is likely the case with ERVs, which showed the highest enrichment for ECS. How about ALUs? There is increasing evidence that ALUs are involved in regulating gene expression, particularly in the brain. It is well documented that SINEs and other repeats can be "exapted" (to paraphrase SJ Gould) or domesticated by the host. I'm sure this has been discussed elsewhere.

    5. @Larry

      Yes, that is almost exactly how we tested the false discovery rate.

      The 25% of the genome you speak of includes non-coding RNA elements (UTRs, mirtrons, etc.). How do you functionally discriminate between an mRNA and, say, a transcribed pseudogene that has lost protein-coding capacity yet maintained RNA structural and regulatory features (cf. Kevin Morris' PTEN paper in Nature)? It's a hard line to draw, especially when considering how the protein-coding world likely originated from the RNA world.

      I believe that these findings are quite delightful. If 75% had been the result, I would have strongly doubted the findings. RNA structure (and ncRNAs) can't account for everything in the cell.

      With regards to "what is all this functional RNA supposed to be doing?", that's a very pertinent question. Anyone looking to start a PhD?

      I'm not too familiar with Allium biology; is the onion genome pervasively transcribed in a developmentally coordinated manner? I suspect any plant would require a bunch of extra genetic material to fight off infections, respond to metabolic pressures, and deal with other issues related to having an immobile life with a limited diversity of specialised organs. In any case, this is not my field of expertise. Give me 30 onion omics datasets and I'll have a crack at answering some of these questions.

  6. 13.6% is a lot. Consider that Yogi Berra once made the observation that baseball is 90% mental and the other half is physical.

    1. Sounds like the 90-90 rule of project scheduling: the first 90% of a project takes 90% of the time, and the remaining 10% takes the other 90% of the time.

  7. so..... Mattick pretty much disproved himself. Cool.

  8. So Mart - do you real scientists really think that 13.6% is a 'large proportion'?