Thursday, July 04, 2013

How to Make a Scientific Argument

The debate over the amount of junk in our genome is a genuine scientific debate. There are legitimate scientific points of view on both sides although the weight of evidence and logic is tilting heavily in favor of junk DNA. It looks more and more like most (~90%) of our genome is junk.

The problem with the debate is that the scientific literature is full of papers attacking junk DNA while there are very few papers promoting it. This is partly because there haven't been any new discoveries in favor of junk DNA. On the other hand, there have been quite a few discoveries showing that some small part of the genome that was thought to be junk might have a function. Even though these discoveries make an insignificant contribution to the big picture, they are often blown up out of all proportion and promoted as an end to junk DNA.

A recent paper in PLoS Genetics illustrates the problem.
Hangauer, M.J., Vaughn, I.W. and McManus, M.T. (2013) Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs. PLoS Genetics 9, e1003569. [doi: 10.1371/journal.pgen.1003569]

Much of the human genome is composed of intergenic sequence, the regions between genes. Intergenic sequence was once thought to be transcriptionally silent “junk DNA,” but it has recently become apparent that intergenic regions can be transcribed. However, the scope, nature, and identity of this intergenic transcription remain unknown. Here, by analyzing a large set of RNA-seq data, we found that >85% of the genome is transcribed, allowing us to generate a comprehensive catalog of an important class of intergenic transcripts: long intergenic noncoding RNAs (lincRNAs). We found that the genome encodes far more lincRNAs than previously known. A key question in the field is whether these intergenic transcripts are functional or transcriptional noise. We found that the lincRNAs we identified have many characteristics that are inconsistent with noise, including specific regulation of their expression, the presence of conserved sequence and evidence for regulated processing. Furthermore, these lincRNAs are strongly enriched with intergenic sequences that were previously known to be functional in human traits and diseases. This study provides an essential framework from which the functional elements in intergenic regions can be identified and characterized, facilitating future efforts toward understanding the roles of intergenic transcription in human health and disease.
Even if every one of their presumed lincRNAs has a biological function, it would only account for 2% of the genome. This hardly spells the end of the junk DNA debate.
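
As a rough check on that 2% figure, here is a minimal Python sketch of the arithmetic. It assumes the paper's most generous count of 53,864 candidate lincRNAs, a generous average length of 1,000 bp, and a haploid genome of roughly 3.0 billion bp. The average length and the genome size are round-number assumptions for illustration, not values taken from the paper.

```python
# Back-of-envelope check of the "about 2% of the genome" figure.
LINCRNA_COUNT = 53_864          # most generous candidate lincRNA count
ASSUMED_MEAN_LENGTH_BP = 1_000  # assumption: generous average lincRNA length
ASSUMED_GENOME_SIZE_BP = 3.0e9  # assumption: roughly 3 Gb haploid genome


def genome_fraction(count, mean_length_bp, genome_size_bp):
    """Fraction of the genome covered by `count` non-overlapping transcripts."""
    return count * mean_length_bp / genome_size_bp


fraction = genome_fraction(
    LINCRNA_COUNT, ASSUMED_MEAN_LENGTH_BP, ASSUMED_GENOME_SIZE_BP
)
print(f"{fraction:.1%}")  # prints 1.8%
```

A somewhat larger genome size or a shorter average length would push the estimate lower, so "2%" is, if anything, on the generous side under these assumptions.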

Here's how the authors of this paper begin the introduction ...
A large fraction of the human genome consists of intergenic sequence. Once referred to as “junk DNA”, it is now clear that functional elements exist in intergenic regions. In fact, genome wide association studies have revealed that approximately half of all disease and trait-associated genomic regions are intergenic [1]. While some of these regions may function solely as DNA elements, it is now known that intergenic regions can be transcribed [2]–[7], and a growing list of functional noncoding RNA genes within intergenic regions has emerged [8].
I believe that this is very deceptive. It doesn't take into account the total evidence in the scientific literature and it ignores history. It seems to me that part of the problem with this debate is that we have become very lax in our standards of scientific discourse.

I'm quite fond of a quotation by Richard Feynman. He makes the same point made by dozens of other respectable scientists.
Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can — if you know anything at all wrong, or possibly wrong — to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it.

Richard Feynman (1918-1988) "Cargo Cult Science" in Surely You're Joking, Mr. Feynman!
Let me restate Feynman's point in the context of the junk DNA debate. If you are going to argue for or against the presence of junk DNA then you owe it to your audience to present both sides of the issue. It's not good science to ignore all the evidence against your idea and only present the evidence that supports it.

This used to be the standard in scientific publications but, somehow, it isn't any more. Here's how that first paragraph should have been written. (I exaggerate a bit in order to make my point.)
The human genome contains about 25,000 known genes¹ that make up about 25% of the genome. Only a small fraction of this is present in mature functional RNAs of various sorts; the rest is mostly introns, and most intron RNA is discarded during processing.

Intergenic regions have a variety of functions, most of which have been known for three or four decades. They include regulatory sequences, centromeres, telomeres, SARs, and origins of replication. No reputable group of scientists has ever claimed that all intergenic DNA is junk, in spite of the fact that this myth is widely promoted in the scientific literature.

Known functional regions of the genome make up less than 10% of the total and much of the rest is thought to be junk DNA—this includes most of the introns. The evidence is based on decades of work on genetic load, the C-value paradox (genome comparisons), modern evolutionary theory (population genetics), and the human genome sequence showing that 50% of our genome is composed of broken transposons and pseudogenes.

It has been known since the early 1970s that much of our genome is transcribed at some time or another during development or in some tissues. This "pervasive transcription" appears to be transcriptional noise based on the fact that the transcripts are very rare and on the known frequency of spurious binding of transcription factors and RNA polymerase. Such an interpretation is consistent with the evidence that most of our genome is junk.

However, the function of most of these low-level transcripts is still an open question and it is possible that they represent functional RNAs, in which case a large fraction of our genome may not be junk after all. If true, it would mean that the human genome contains tens of thousands of genes that have remained completely undetected in spite of decades of work in biochemistry and molecular biology labs over a period of forty years or more. This extraordinary discovery would revolutionize our understanding of gene expression.

We investigated this question by focusing our attention on possible lincRNAs that are present in at least one copy per cell and show signs of conservation and regulation. We confirmed that >85% of the genome is transcribed but discovered that only about 2% produces lincRNAs that meet our minimal criteria for potential function. Our results suggest that most pervasive transcription does not produce functional RNAs, supporting the idea that it is transcriptional noise and that most of the genome is junk.
There, that's much better.

1. This includes many known genes that encode functional RNA such as ribosomal RNA, transfer RNA, and various other RNAs including regulatory RNAs, spliceosomal RNAs, microRNAs, etc.


  1. Why did they put the statement that "85% of the genome is transcribed" in the abstract when, in the paper itself, they apply filters (and we can debate the appropriateness of some of those) that leave a much smaller number of transcripts covering a lot less of the genome?

  2. At least this time Rinn removed RNAs with less than one copy per cell from their analyses. Still, the question remains how so few untranslated RNAs could confer any function. Some of my colleagues claimed that they would act as sinks for miRNAs, hindering the latter from interfering with mRNA levels. However, if that were true, one would expect higher copy numbers, as in the case of linc-RoR, for which the authors (Wang et al. 2013) stated:
    "To serve as a sponge, the abundance of linc-RoR should be comparable to or higher than miR-145. We therefore used quantitative real-time PCR to quantify the exact copy numbers of linc-RoR and miR-145 per cell (Figure S4C). As a result, we found that, in the self-renewal hESCs, the expression level of mature miR-145 was only about 10–20 copies per cell, whereas linc-ROR level was more than 100 copies per cell."

    I wouldn't be surprised, though, if in the near future one of the linc-RNA guys claims that there is enough overlap in the spectra of different lincRNAs to compensate for the low copy numbers of individual molecules.

    Back in his 2012 Nature Biotechnology paper (doi:10.1038/nbt.2024) Rinn didn't remove low copy number lincRNAs and estimated "that the lncRNAs we discovered were present at an average of ~0.0006 transcripts per cell, indicating expression in only a small subpopulation of the cells sampled."

    Back then he raised the possibility that every single cell, even when belonging to a single clone, may possess its own individual transcriptome.
    He stated:
    "Indeed the low expression of many bona fide transcripts implies that there are substantial transcriptomic differences between cells, even those in clonal cell culture, suggesting that each cell has an individual if not unique transcriptomic signature. This in turn challenges the notion that there may be a single, stable transcriptome by which a cell can be characterized, although broad cell types, such as fibroblasts, may show similar patterns."
    Does removing low copy number lincRNAs from his current analysis mean that he changed his mind?

    1. It's not certain that 1 FPKM means 1 copy per cell. That might be true for large neurons, but for many smaller cell types with less mRNA per cell, it's more like 5 FPKM = 1 copy.

      Nobody really has hard data on this though.

    2. @George Marinov,

      I don't understand what FPKM means so I just went with the authors' estimate that this corresponds to about one copy per cell. If the function requires hybridization to something in the cell then you're going to need a lot more than one copy per cell.

      However, I'm pleased that the proponents of functional RNAs are finally waking up to the idea that number of copies is important [How to Evaluate Genome Level Transcription Papers].

    3. Eventually these questions will be answered by single-cell RNA-seq, done of course in a way that allows absolute transcript copies to be counted. This will tell us in how many cells in a population, and at how many copies, things are expressed.

      I am glad they did not use the subcellular fractions from ENCODE though - that could have been a serious trap in terms of FPKM and absolute abundances (FPKM is a relative metric, not an absolute one) that they could have easily fallen into, but they didn't.
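
To make the point in the thread above concrete, here is a minimal Python sketch of why FPKM (fragments per kilobase of transcript per million mapped fragments) is a relative measure. The toy library, the gene names, and the per-cell transcript totals are all hypothetical; the conversion to copies per cell assumes that a transcript's share of the summed FPKM approximates its share of all transcript molecules in the cell.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def copies_per_cell(fpkm_value, fpkm_sum, transcripts_per_cell):
    """Scale a transcript's share of total FPKM by an ASSUMED per-cell total."""
    return fpkm_value / fpkm_sum * transcripts_per_cell


# Hypothetical toy library: name -> (mapped fragments, transcript length in bp)
toy_library = {
    "gene_a": (9_000, 3_000),
    "gene_b": (900, 1_500),
    "linc_x": (100, 1_000),
}
total_fragments = sum(frags for frags, _ in toy_library.values())
fpkm_values = {
    name: fpkm(frags, length, total_fragments)
    for name, (frags, length) in toy_library.items()
}
fpkm_sum = sum(fpkm_values.values())

# Sequencing twice as deep doubles every count but leaves FPKM unchanged.
deeper = fpkm(2 * 100, 1_000, 2 * total_fragments)
print(abs(deeper - fpkm_values["linc_x"]) < 1e-9)

# The same FPKM maps to different copy numbers under different assumed totals.
for assumed_total in (50_000, 500_000):
    copies = copies_per_cell(fpkm_values["linc_x"], fpkm_sum, assumed_total)
    print(assumed_total, round(copies, 1))
```

The identical FPKM value translates into a tenfold different copy number depending on the assumed amount of RNA per cell, which is why a fixed "1 FPKM = 1 copy" rule cannot hold across cell types.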

  3. Glad to see this being covered. I'll only add that since the paper is in PLoS, it has an open online comments section, which I believe is specifically included in all PLoS papers to encourage post-publication peer review and commentary. It might be valuable to include some of these criticisms there, where they'll be directly available to anyone who looks up the original paper.

  4. Larry - pre-mRNA transcription of the currently annotated 20,000 protein coding genes covers 40% of the human genome, not 25%. I've always used a rough 25% ballpark estimate for myself too, but because I'm writing a review (so I ought to get it right, not just ballpark it), I recently did the coverage calculation for myself from the current GENCODE (v17 Jun 2013) human annotation. Oddly, the Hangauer paper gets this coverage number right (see Fig 1A) but still claims that 97% of the genome is "intergenic".

    SPARC - Rinn is not an author on this paper. He's the academic editor of it for the journal.

    1. I have trouble believing that typical mRNA precursor transcripts are complementary to 40% of the genome. I bet that includes all kinds of spurious transcription start sites producing rare transcripts that just happen to run into the 5' end of a gene and all kinds of run-on transcripts that aren't normally part of the precursor.

      If you were to start at the beginning of the first exon and end at the end of the last one, how much of the genome is covered? (I realize that this doesn't include the whole gene.) Would it be closer to 25%, meaning that 15% more is probably extra transcription before the normal promoter and after the normal poly adenylation site?

    2. Ask that again? Answer's still 40% (annotated first exon to last exon *is* the extent of an annotated pre-mRNA, of course).

      If you mean, let's only look at mRNA isoforms that are more likely to be relevant to each gene -- setting some threshold on relative expression level amongst the set of known isoforms, for example -- yeah, I wish I could do that easily. I've tried to encourage the powers that be in the community to annotate transcripts quantitatively (major vs. minor isoforms at least) rather than annotating everything that's ever been seen with equal weight.

      I basically agree with your point. Though I bet GENCODE is both overannotated (extending transcripts because they've seen a rare isoform, as you say) and underannotated (I think we're still missing plenty of cell-type-specific alternative processing) -- and I bet a subset of "lincRNAs" represent the exons of such mRNAs. (It's a bet I'll win -- I know it's true in the FANTOM3 cDNAs. I wonder how prevalent the artifact is in more recent lincRNA collections.)

    3. Sorry, I meant first coding exon to last coding exon. We could look at some well-characterized genes where the regular transcription start site is known and the size of the mature mRNA has been observed repeatedly. Does the GENCODE annotation show a longer mature transcript?

    4. I dunno, UTRs cover a lot of territory. Counting the ATG to stop extent will underestimate pre-mRNA coverage by a lot.

      Somewhat related to your idea about looking at some anecdotes -- yeah, I've always wanted to compare GENCODE annotation (and the like in other organisms, like Drosophila) to old school Northerns. But scrabbling through old papers one at a time hasn't been appealing. If anyone knows of a collection of digitized Northern data for human genes, I'd love to know about it.

    5. Alternative promoters are an issue (e.g. the IGF-II gene has four different active promoters in humans and Ruminantia, three of which are conserved in rodents; the first coding exon is located downstream of the alternative non-coding first exons). However, this is an exception rather than the rule. Most transcript differences are currently attributed to alternative splicing. However, IMO, databases are overcrowded with noise. E.g., between 1990 and 1996 I prepared quite a few SPARC Northern blots from a variety of human tissues and cells, even from obscure sources like sperm cells and thrombocytes (the latter contain tons of SPARC mRNA), and I only ever detected two major transcripts, which were due to alternate polyA signals. Occasionally, a faint band of higher molecular weight would show up that could not be interpreted, but I never saw shorter ones. Primer extension showed that two different transcription start sites were used that are so close to each other that they cannot be distinguished by Northern blotting. Today nine transcripts are listed in ENCODE, with only one encoding the full-length 303 AA protein. Four transcripts are annotated as encoding shorter peptides of 149, 115, 111 and 53 AA. The remaining four are supposed to be non-coding. I guess these are products of splicing mishaps or splicing noise rather than products of regulated alternative splicing. I doubt that any of the non-full-length transcripts has any function.

      Unfortunately, alternative splice databases have always been a mess. Try to find a single constitutively spliced intron in the human dataset. When I tried some years ago, I didn't find any. I must admit, though, that I only searched by hand and in the end just used one I already had at hand.

    6. Sean Eddy:

      I dunno, UTRs cover a lot of territory. Counting the ATG to stop extent will underestimate pre-mRNA coverage by a lot.

      But surely 3' and 5' UTRs are not about as long as the gene parts (first exon to second exon)? Which is what this 40% figure pretty much requires...

    7. Why not?

      Even if you don't believe actual data (i.e. the actual statistics of the current human genome annotation), a back of the envelope calculation suffices: typical mRNA = 2-4kb. Typical protein = 300-400aa, thus ~1kb of coding. Not hard to believe more UTR than coding.

    8. GENCODE has 2.86% of the genome annotated as exons of protein coding genes, of that only 1.11% are annotated as CDS. 1.5% has been specifically annotated as UTRs, i.e. more than CDSs. Note that this does not sum to the total of the exons and I have no idea why that is but I would guess it's because of non-coding transcripts of protein coding genes for which the UTRs have not been specifically annotated.

    9. Your back of the envelope example surely uses mature mRNA. But an average intron is ~20X longer than an average exon. Bingo, there is no way the UTRs are about as long as the transcribed genes.

    10. My gut feeling says that 3'-UTRs are much longer than 5'-UTRs. Of the genes I've worked with, mammalian androgen receptor genes possess the longest 5'-UTR, at >1000 nucleotides (1124 nt in the human AR gene; don't trust the annotation in ENSEMBL); the others were about 100 nt or less. This estimate may be biased towards higher numbers because the genes I worked with contained untranslated first exons.
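
The coverage figures traded in the thread above can be lined up with a few lines of arithmetic. This is only a sketch: it takes the percentages exactly as quoted by the commenters (40% pre-mRNA coverage; 2.86% exons, 1.11% CDS and 1.5% UTR from GENCODE) and assumes, for simplicity, that the categories do not overlap.

```python
# Percent of the genome, as quoted in the comment thread (not re-derived here).
PRE_MRNA_PCT = 40.0  # annotated protein-coding pre-mRNA extent
EXON_PCT = 2.86      # exons of protein-coding genes
CDS_PCT = 1.11       # coding sequence
UTR_PCT = 1.5        # explicitly annotated UTRs

# Assumption: the categories are non-overlapping, so they can be subtracted.
unassigned_exon_pct = EXON_PCT - CDS_PCT - UTR_PCT
utr_to_cds_ratio = UTR_PCT / CDS_PCT
non_exonic_pre_mrna_pct = PRE_MRNA_PCT - EXON_PCT

# Back-of-envelope from the thread: a 300-400 aa protein needs 3 nt per codon.
coding_nt_range = (300 * 3, 400 * 3)

print(f"exon sequence not labelled CDS or UTR: {unassigned_exon_pct:.2f}%")
print(f"UTR to CDS ratio: {utr_to_cds_ratio:.2f}")
print(f"pre-mRNA extent outside exons: {non_exonic_pre_mrna_pct:.2f}%")
print(f"coding nucleotides for a typical protein: {coding_nt_range}")
```

On these numbers, UTRs do exceed coding sequence in the mature mRNA, but exons of any kind are a small slice of the 40% figure, so almost all of the pre-mRNA extent is intronic.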

  5. "We found that the genome encodes far more lincRNAs than previously known ..."

    Encodes long non-coding RNAs?

  6. Another relevant ref.

    Exaptation of Transposable Elements into Novel Cis-Regulatory Elements: Is the Evidence Always Strong?

  8. Here's a bit of shameless self-promotion, but it's relevant. We used a massively parallel reporter assay to compare the enhancer function of ~1,200 ChIP-seq peak sequences and ~900 unbound random genomic sequences with binding motifs. If you just assayed those sequences, you'd conclude that they are almost all functional, i.e. all these sequences can regulate transcription.

    But we also included ~1000 random DNA controls, totaling ~ 100 kb of random sequence. Result: most completely random DNA has **reproducible** regulatory activity. A true definition of function should not include most randomly generated sequences.

    1. That's very cool. I assume this is the PNAS paper that's in press? I can't wait to read it.

    2. Just came out in Early Edition this week - the link is in my comment. Fig. S4 is the key figure showing what random DNA does.

  9. That this Feynman guy has to tell people to report all the facts, pro and con, indicates there is a need to do this obvious thing.
    This issue makes a creationist point.
    The minute there is disagreement, everyone claims the researchers aren't doing the right science, or any science at all. How quickly confidence in people's scientific competence is shattered.
    Likewise, creationists rightly question conclusions in origin subjects.
    The "science" is not very well done after all. Lots of room for criticism.

    1. "This Feynman guy ..."

      When was the last time a creationist researcher gave serious weight to contrary evidence to their own theories? Never, you say?

    2. Robert Byers says,

      That this Feynman guy has to tell people to report all the facts, pro and con, indicates there is a need to do this obvious thing.

      There's definitely a need. Please tell your creationist friends ... and think about how it applies to you.

  10. Why is it that anti-junk people always switch to passive tense when they're lying? Passive tense pussies. I guess they think that it's OK to falsify the history of science so long as you don't name the specific person who did the thing that never happened, but instead the PTP (Passive Tense Pussy) switches to passive: "Non-coding DNA, long dismissed as Junk..." Dismissed by whom? In what paper in what journal what page number in what year?

    Intergenic sequence was once thought to be transcriptionally silent “junk DNA,”

    Thought to be BY WHOM? In what paper in what journal what page number in what year?

    A large fraction of the human genome consists of intergenic sequence. Once referred to as “junk DNA”,

    Referred to BY WHOM? In what paper in what journal what page number in what year? Pussies.

    We must ban the passive tense.

  11. Hey Larry,

    Thanks for your comment on the paper.

    "We confirmed that >85% of the genome is transcribed but discovered that only about 2% produces lincRNAs that meet our minimal criteria for potential function."

    Could you please cite a source that confirms that 2% estimate, or at least clarify what you meant by this point? I'm not a specialist, and I failed to find a proper and recent source that explains the extent of functional lincRNA in the human genome.

    Thanks again!

    1. Taking their most generous estimate, they identified 53,864 potential lincRNAs. If we assign a generous average length of 1000 bp, then this works out to 1.8% of the genome.

  12. Cornelius is at it again -