Sandwalk: How Not to Do Science

Sunday, July 14, 2013

How Not to Do Science

Theme: Genomes & Junk DNA

Many reputable scientists are convinced that most of our genome is junk. However, there are still a few holdouts and one of the most prominent is John Mattick. He believes that most of our genome is made up of thousands of genes for regulatory noncoding RNA. These RNAs (about 100 of them for every single protein-coding gene) are mostly involved in subtle controls of the levels of protein in human cells. (I'm not making this up. See: John Mattick on the Importance of Non-coding RNA.)

It was a reasonable hypothesis at one point in time.

How do you evaluate a hypothesis in science? Well, one of the things you should always try to do is falsify your hypothesis. Let's see how that works ...

The RNAs should be conserved. FALSE
The RNAs should be abundant (>1 copy per cell). FALSE
There should be dozens of well-studied specific examples. FALSE
The hypothesis should account for variations in genome size. FALSE
The hypothesis should be consistent with other data, such as that on genetic load. FALSE
The hypothesis should be consistent with what we already know about the regulation of gene expression. FALSE
You should be able to refute existing hypotheses, such as transcription errors. FALSE

Normally, you would abandon a hypothesis that had such a bad track record but true believers aren't about to do that. So what's next? Maybe these regulatory RNAs don't show sequence conservation but maybe their secondary structures are conserved. In other words, these RNAs originated as functional RNAs with a secondary structure but over the course of time all traces of sequence conservation have been lost and only the "conserved" secondary structure remains.¹ The Mattick lab looked at the "conservation" of secondary structure as an indicator of function using the latest algorithms (Smith et al., 2013). Here's how they describe their attempts to prove their hypothesis in light of conflicting data ...

The majority of the human genome is dynamically transcribed into RNA, most of which does not code for proteins (1–4). The once common presumption that most non–protein-coding sequences are nonfunctional for the organism is being adjusted to the increasing evidence that noncoding RNAs (ncRNAs) represent a previously unappreciated layer of gene expression essential for the epigenetic regulation of differentiation and development (5–8). Yet despite an exponential accumulation of transcriptomic data and the recent dissemination of genome-wide data from the ENCODE consortium (9), limited functional data have fuelled discourse on the amount of functionally pertinent genomic sequence in higher eukaryotes (1, 10–12). What is incontrovertible, however, is that evolutionary conservation of structural components over an adequate evolutionary distance is a direct property of purifying (negative) selection and, consequently, a sufficient indicator of biological function. The majority of studies investigating the prevalence of purifying selection in mammalian genomes are predicated on measuring nucleotide substitution rates, which are then rated against a statistical threshold trained from a set of genomic loci arguably qualified as neutrally evolving (13, 14). Conversely, lack of conservation does not impute lack of function, as variation underlies natural selection. Given that the molecular function of ncRNA may at least be partially conveyed through secondary or tertiary structures, mining evolutionary data for evidence of such features promises to increase the resolution of functional genomic annotations.

Here's what they found ...

When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5–22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional.

Apparently 13.6% of the human genome is a "large proportion." Taken at face value, however, the Mattick lab has now shown that the vast majority of transcribed sequences don't show any of the characteristics of functional RNA, including conservation of secondary structure. Of course, that's not the conclusion they emphasize in their paper.

Why not?

1. I can't imagine how this would happen, can you? You'd almost have to have selection AGAINST sequence conservation.

Smith, M.A., Gesell, T., Stadler, P.F. and Mattick, J.S. (2013) Widespread purifying selection on RNA structure in mammals. Nucleic Acids Research advance access July 11, 2013 [doi: 10.1093/nar/gkt596]

35 comments:

Claudiu Bandea, Sunday, July 14, 2013 4:44:00 PM
Laurence A. Moran: “It was a reasonable hypothesis at one point in time”

When?
Diogenes, Sunday, July 14, 2013 8:06:00 PM
There are bigger problems than passing off 13.6% of the human genome as most of the genome.

So the sequence varies as if it were neutral evolution, but the "secondary structure" is conserved.

And the "secondary structure" is inferred by some kind of computational algorithm? Any experimental confirmation this computational algorithm is reliable? Or experimental confirmation that conserved RNA "secondary structure" is really an indicator of functionality?

I mean what's their negative control? You'd need to show that DNA sequence (if transcribed into RNA) without conserved secondary structure is NOT functional. I can't imagine any other negative control.

The PI's two questions are always:

1. How can we test this? and

2. What's our negative control?
Newbie, Sunday, July 14, 2013 8:23:00 PM
Why would a genome composed mostly of “junk DNA” need several overlapping DNA repair mechanisms dedicated to preventing any random changes? Junk is junk, "who", or rather, "what" cares, if any random change happens to this supposed "junk DNA"?

On the other hand, if there are so many mechanisms dedicated to preventing random mutations, these systems would have to be switched off or become dysfunctional for evolutionary theory to be true. Too bad that a dysfunctional mutation protection system is the origin of cancer and hereditary diseases, which reduce the capacity to live and to reproduce.

This leads us to another paradox in evolutionary theory between the necessity and the disadvantage of a dysfunctional mutation protection system.
Well, not the first and not the last one.

Here is a link to a video for those who like paradoxical comedy like I do:

http://www.youtube.com/watch?v=dzh6Ct5cg1o

Link for those who would prefer to read:

http://www.arn.org/blogs/index.php/literature/2011/04/26/dna_repair_mechanisms_reveal_a_contradic
Newbie, Sunday, July 14, 2013 8:30:00 PM
No matter how one wants to look at it, another big chunk of so-called "junk DNA" turned out to have some function.

In other news, another big chunk of someone's ego also lost its function. At this pace, there will be nothing left in a few years :)
Mart, Sunday, July 14, 2013 10:15:00 PM
Dear bloggers,

The 13.6% is what is detected with high-confidence by the algorithms employed, which only predict ~30-40% of true positive structures. I realise the manuscript is computationally intense, but it is clearly stated in the discussion that the 13.6% is indicative that 13.6/0.4 = 34% of the genome is likely to be evolutionarily constrained at the RNA structure level (see below). Furthermore, these results are based on very specific search parameters (limited to 200 nucleotides). Larry, how many RNA transcripts are over 200 nt?
Granted, 34% is not quite the >85% reported by ENCODE, but it is in the same order of magnitude (>3x more evolutionarily constrained regions & ~20x more than protein coding genes).
FYI, this is not trying to prove that the entire genome is functional. The problem was approached objectively with as few assumptions as possible (other than assuming all of the genome is transcribed, which was intrinsic to the methodology and not yet published by ENCODE at the time).

[...] In this work, the practicality of RFAM alignments with regards to consensus sequence-based RNA structure prediction is 2-fold: (i) to calculate an upper limit for sliding-window predictions on validated data, and (ii) to estimate the experimental error incurred by multiple sequence alignment heuristics. By comparing both results, it is possible to extrapolate the approximate accuracy of a classical scans for evolutionarily conserved RNA secondary structure.
Hence, the RNA structure predictions we report using conservative thresholds are likely to span >13.6% of the human genome we report. This number is probably a substantial underestimate of the true proportion given the conservative scoring thresholds employed, the neglect of pseudoknots, the liberal distance between overlapping windows and the incapacity of the sliding-window approach to detect base-pair interactions outside the fixed window length. A less conservative estimate would place this ratio somewhere above 20% from the reported sensitivities measured from native RFAM alignments and over 30% from the observed sensitivities derived from sequence-based realignment of RFAM data (Table 1, Figure 1 and Supplementary Figure S4). [...] By breaking down the control data in function of their sequence characteristics and by reproducing experimental conditions through sequence-based realignment of the input, we set the foundation for an optimized genome-wide investigation of RNA secondary structure conservation.
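
The extrapolation in this comment (13.6/0.4 = 34%) is simple enough to check by hand, but a few lines of Python make the assumptions explicit. This is only a sketch: the function name `extrapolate_coverage` is made up for illustration, and the optional `fdr` discount is an assumption about how the 5–22% false discovery rate quoted in the abstract might be folded in, not something the paper or the comment does.

```python
def extrapolate_coverage(detected_pct, sensitivity, fdr=0.0):
    """Estimate the 'true' percentage of the genome covered by a feature.

    detected_pct: percentage actually detected (13.6 in the abstract).
    sensitivity:  assumed fraction of real structures the method finds
                  (the comment quotes roughly 0.3 to 0.4).
    fdr:          assumed fraction of detections that are false positives
                  (hypothetical adjustment, not part of the comment's math).
    """
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    if not 0 <= fdr < 1:
        raise ValueError("fdr must be in [0, 1)")
    # Discount false positives first, then scale up for missed structures.
    return detected_pct * (1 - fdr) / sensitivity


detected = 13.6  # percent of the human genome, from the abstract
for sensitivity in (0.3, 0.4):
    estimate = extrapolate_coverage(detected, sensitivity)
    print(f"sensitivity {sensitivity:.0%}: about {estimate:.1f}% of the genome")
```

The result is only as good as the sensitivity estimate, and dividing by a small number amplifies any error in it. Setting `fdr` to a value in the quoted 5–22% range pulls the estimate back down.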
Anonymous, Sunday, July 14, 2013 10:35:00 PM
13.6% is a lot. Consider that Yogi Berra once made the observation that baseball is 90% mental and the other half is physical.
Anonymous, Monday, July 15, 2013 2:47:00 AM
so..... Mattick pretty much disproved himself. Cool.
nmanning, Thursday, July 18, 2013 1:17:00 PM
So Mart - do you real scientists really think that 13.6% is a 'large proportion'?