Sandwalk: How to Make a Scientific Argument

Thursday, July 04, 2013

How to Make a Scientific Argument

The debate over the amount of junk in our genome is a genuine scientific debate. There are legitimate scientific points of view on both sides although the weight of evidence and logic is tilting heavily in favor of junk DNA. It looks more and more like most (~90%) of our genome is junk.

The problem with the debate is that the scientific literature is full of papers attacking junk DNA while there are very few papers promoting it. This is partly because there haven't been any new discoveries in favor of junk DNA. On the other hand, there have been quite a few discoveries showing that some small part of the genome that was thought to be junk might have a function. Even though these discoveries make an insignificant contribution to the big picture, they are often blown up out of all proportion and promoted as an end to junk DNA.

A recent paper in PLoS Genetics illustrates the problem.

Hangauer, M.J., Vaughn, I.W. and McManus, M.T. (2013) Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs. PLoS Genetics 9, e1003569. [doi: 10.1371/journal.pgen.1003569]

Abstract
Much of the human genome is composed of intergenic sequence, the regions between genes. Intergenic sequence was once thought to be transcriptionally silent “junk DNA,” but it has recently become apparent that intergenic regions can be transcribed. However, the scope, nature, and identity of this intergenic transcription remain unknown. Here, by analyzing a large set of RNA-seq data, we found that >85% of the genome is transcribed, allowing us to generate a comprehensive catalog of an important class of intergenic transcripts: long intergenic noncoding RNAs (lincRNAs). We found that the genome encodes far more lincRNAs than previously known. A key question in the field is whether these intergenic transcripts are functional or transcriptional noise. We found that the lincRNAs we identified have many characteristics that are inconsistent with noise, including specific regulation of their expression, the presence of conserved sequence and evidence for regulated processing. Furthermore, these lincRNAs are strongly enriched with intergenic sequences that were previously known to be functional in human traits and diseases. This study provides an essential framework from which the functional elements in intergenic regions can be identified and characterized, facilitating future efforts toward understanding the roles of intergenic transcription in human health and disease.

Even if every one of their presumed lincRNAs has a biological function, together they would account for only about 2% of the genome. This hardly spells the end of the junk DNA debate.

Here's how the authors of this paper begin the introduction ...

A large fraction of the human genome consists of intergenic sequence. Once referred to as “junk DNA”, it is now clear that functional elements exist in intergenic regions. In fact, genome wide association studies have revealed that approximately half of all disease and trait-associated genomic regions are intergenic [1]. While some of these regions may function solely as DNA elements, it is now known that intergenic regions can be transcribed [2]–[7], and a growing list of functional noncoding RNA genes within intergenic regions has emerged [8].

I believe that this is very deceptive. It doesn't take into account the total evidence in the scientific literature and it ignores history. It seems to me that part of the problem with this debate is that we have become very lax in our standards of scientific discourse.

I'm quite fond of a quotation by Richard Feynman. He makes the same point made by dozens of other respectable scientists.

Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can — if you know anything at all wrong, or possibly wrong — to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it.

Richard Feynman (1918-1988) "Cargo Cult Science" in Surely You're Joking, Mr. Feynman!

Let me restate Feynman's point in the context of the junk DNA debate. If you are going to argue for or against the presence of junk DNA then you owe it to your audience to present both sides of the issue. It's not good science to ignore all the evidence against your idea and only present the evidence that supports it.

This used to be the standard in scientific publications but, somehow, it isn't any more. Here's how that first paragraph should have been written. (I exaggerate a bit in order to make my point.)

The human genome contains about 25,000 known genes¹ that make up about 25% of the genome. Only a small fraction of this is present in mature functional RNAs of various sorts—the rest is mostly introns and most intron RNA is discarded during processing.

Intergenic regions have a variety of functions, most of which have been known for three or four decades. They include regulatory sequences, centromeres, telomeres, SARs, and origins of replication. No reputable group of scientists has ever claimed that all intergenic DNA is junk, in spite of the fact that this myth is widely promoted in the scientific literature.

Known functional regions of the genome make up less than 10% of the total and much of the rest is thought to be junk DNA—this includes most of the introns. The evidence is based on decades of work on genetic load, the C-value paradox (genome comparisons), modern evolutionary theory (population genetics), and the human genome sequence showing that 50% of our genome is composed of broken transposons and pseudogenes.

It has been known since the early 1970s that much of our genome is transcribed at some time or another during development or in some tissues. This "pervasive transcription" appears to be transcriptional noise based on the fact that the transcripts are very rare and on the known frequency of spurious binding of transcription factors and RNA polymerase. Such an interpretation is consistent with the evidence that most of our genome is junk.

However, the function of most of these low-level transcripts is still an open question and it is possible that they represent functional RNAs, in which case a large fraction of our genome may not be junk after all. If true, it would mean that the human genome contains tens of thousands of genes that have remained completely undetected in spite of forty years or more of work in biochemistry and molecular biology labs. This extraordinary discovery would revolutionize our understanding of gene expression.

We investigated this question by focusing our attention on possible lincRNAs that are present in at least one copy per cell and show signs of conservation and regulation. We confirmed that >85% of the genome is transcribed but discovered that only about 2% produces lincRNAs that meet our minimal criteria for potential function. Our results suggest that most pervasive transcription does not produce functional RNAs, supporting the idea that it is transcriptional noise and that most of the genome is junk.

There, that's much better.

1. This includes many known genes that encode functional RNA such as ribosomal RNA, transfer RNA, and various other RNAs including regulatory RNAs, spliceosomal RNAs, microRNAs, etc.

32 comments :

Georgi Marinov said...: Why did they put the statement that "85% of the genome is transcribed" in the abstract when in the paper itself they apply filters that result in a much smaller number of transcripts (and we can debate the appropriateness of some of those) that cover a lot less of it?; Thursday, July 04, 2013 12:10:00 PM
SPARC said...: At least this time Rinn removed RNAs with less than one copy per cell from their analyses. Still, the question remains how so few untranslated RNAs could confer any function. Some of my colleagues have claimed that they act as sinks for miRNAs, preventing the latter from interfering with mRNA levels. However, if that were true one would expect higher copy numbers, as in the case of linc-RoR, for which the authors (Wang et al. 2013, http://dx.doi.org/10.1016/j.devcel.2013.03.002) stated:
"To serve as a sponge, the abundance of linc-RoR should be comparable to or higher than miR-145. We therefore used quantitative real-time PCR to quantify the exact copy numbers of linc-RoR and miR-145 per cell (Figure S4C). As a result, we found that, in the self-renewal hESCs, the expression level of mature miR-145 was only about 10–20 copies per cell, whereas linc-ROR level was more than 100 copies per cell."
I wouldn't be surprised, though, if in the near future one of the lincRNA guys claims that there is enough overlap among the spectra of different lincRNAs to compensate for the low copy numbers of individual molecules.

Back in his 2012 Nature Biotechnology paper (doi:10.1038/nbt.2024), Rinn didn't remove low copy number lincRNAs and estimated "that the lncRNAs we discovered were present at an average of ~0.0006 transcripts per cell, indicating expression in only a small subpopulation of the cells sampled."
Back then he raised the possibility that every single cell, even one belonging to a single clone, may possess its own individual transcriptome.
He stated:
"Indeed the low expression of many bona fide transcripts implies that there are substantial transcriptomic differences between cells, even those in clonal cell culture, suggesting that each cell has an individual if not unique transcriptomic signature. This in turn challenges the notion that there may be a single, stable transcriptome by which a cell can be characterized, although broad cell types, such as fibroblasts, may show similar patterns."
Does removing low copy number lincRNAs from his current analysis mean that he has changed his mind?; Thursday, July 04, 2013 3:40:00 PM
David said...: Glad to see this being covered. I'll only add that since the paper is in PLoS, it has an open online comments section, which I believe is specifically included in all PLoS papers to encourage post-publication peer review and commentary. It might be valuable to include some of these criticisms there, where they'll be directly available to anyone who looks up the original paper.; Thursday, July 04, 2013 4:33:00 PM
Sean Eddy said...: Larry - pre-mRNA transcription of the currently annotated 20,000 protein coding genes covers 40% of the human genome, not 25%. I've always used a rough 25% ballpark estimate for myself too, but because I'm writing a review (so I ought to get it right, not just ballpark it), I recently did the coverage calculation for myself from the current GENCODE (v17 Jun 2013) human annotation. Oddly, the Hangauer paper gets this coverage number right (see Fig 1A) but still claims that 97% of the genome is "intergenic".

SPARC - Rinn is not an author on this paper. He's the academic editor of it for the journal.; Thursday, July 04, 2013 4:38:00 PM
Anonymous said...: "We found that the genome encodes far more lincRNAs than previously known ..."

Encodes long non-coding RNAs?; Thursday, July 04, 2013 4:55:00 PM
PNG said...: Another relevant ref.

Exaptation of Transposable Elements into Novel Cis-Regulatory Elements: Is the Evidence Always Strong?
http://mbe.oxfordjournals.org/content/30/6/1239.abstract; Thursday, July 04, 2013 5:14:00 PM
Larry Moran said...: I have trouble believing that typical mRNA precursor transcripts are complementary to 40% of the genome. I bet that includes all kinds of spurious transcription start sites producing rare transcripts that just happen to run into the 5' end of a gene and all kinds of run-on transcripts that aren't normally part of the precursor.

If you were to start at the beginning of the first exon and end at the end of the last one, how much of the genome is covered? (I realize that this doesn't include the whole gene.) Would it be closer to 25%, meaning that 15% more is probably extra transcription before the normal promoter and after the normal polyadenylation site?; Thursday, July 04, 2013 5:16:00 PM
Sean Eddy said...: Ask that again? Answer's still 40% (annotated first exon to last exon *is* the extent of an annotated pre-mRNA, of course).

If you mean, let's only look at mRNA isoforms that are more likely to be relevant to each gene -- setting some threshold on relative expression level amongst the set of known isoforms, for example -- yeah, I wish I could do that easily. I've tried to encourage the powers that be in the community to annotate transcripts quantitatively (major vs. minor isoforms at least) rather than annotating everything that's ever been seen with equal weight.

I basically agree with your point. Though I bet GENCODE is both overannotated (extending transcripts because they've seen a rare isoform, as you say) and underannotated (I think we're still missing plenty of cell-type-specific alternative processing) -- and I bet a subset of "lincRNAs" represent the exons of such mRNAs. (It's a bet I'll win -- I know it's true in the FANTOM3 cDNAs. I wonder how prevalent the artifact is in more recent lincRNA collections.); Thursday, July 04, 2013 5:43:00 PM
Larry Moran said...: Sorry, I meant first coding exon to last coding exon. We could look at some well-characterized genes where the regular transcription start site is known and the size of the mature mRNA has been observed repeatedly. Does the GENCODE annotation show a longer mature transcript?; Thursday, July 04, 2013 6:16:00 PM
Sean Eddy said...: I dunno, UTRs cover a lot of territory. Counting the ATG to stop extent will underestimate pre-mRNA coverage by a lot.

Somewhat related to your idea about looking at some anecdotes -- yeah, I've always wanted to compare GENCODE annotation (and the like in other organisms, like Drosophila) to old school Northerns. But scrabbling through old papers one at a time hasn't been appealing. If anyone knows of a collection of digitized Northern data for human genes, I'd love to know about it.; Thursday, July 04, 2013 6:32:00 PM
Unknown said...: Here's a bit of shameless self-promotion, but it's relevant. We used a massively parallel reporter assay to compare the enhancer function of ~1,200 ChIP-seq peak sequences and ~900 unbound random genomic sequences with binding motifs. If you just assayed those sequences, you'd conclude that they are almost all functional, i.e. all these sequences can regulate transcription.

But we also included ~1000 random DNA controls, totaling ~ 100 kb of random sequence. Result: most completely random DNA has **reproducible** regulatory activity. A true definition of function should not include most randomly generated sequences.; Thursday, July 04, 2013 11:15:00 PM
SPARC said...: Alternative promoters are an issue (e.g. the IGF-II gene has four different active promoters in humans and Ruminantia, three of which are conserved in rodents; the first coding exon is located downstream of the alternative non-coding first exons). However, this is an exception rather than the rule. Most transcript differences are currently attributed to alternative splicing. However, IMO databases are overcrowded with noise. E.g., between 1990 and 1996 I prepared quite a few SPARC Northern blots from a variety of human tissues/cells, even from obscure sources like sperm cells and thrombocytes (the latter contain tons of SPARC mRNA), and I only ever detected two major transcripts, which were due to alternative polyA signals. Occasionally, a faint band of higher molecular weight would show up that could not be interpreted, but I never saw shorter ones. Primer extension showed that two different transcription start sites were used that are so close to each other that they cannot be distinguished by Northern blotting. Today nine transcripts are listed in ENCODE, with only one encoding the full length 303 AA protein. Four transcripts are annotated as encoding shorter peptides of 149, 115, 111 and 53 AA. The remaining four are supposed to be non-coding. I guess these are products of splicing mishaps or splicing noise rather than products of regulated alternative splicing. I doubt that any of the non-full length transcripts has any function.

Unfortunately, alternative splice databases have always been a mess. Try to find a single constitutively spliced intron in the human dataset. When I did some years ago I didn't find any. I must admit though that I only did a search by hand and finally just used one I had at hand anyway.; Friday, July 05, 2013 12:30:00 AM
Robert Byers said...: That this Feynman guy has to tell people to report all the facts, pro and con, indicates there is a need to do this obvious thing.
This issue makes a creationist point.
The minute there is disagreement then everyone claims the researchers aren't doing the right science or any. How quickly confidence in people's scientific competence is shattered.
Likewise creationists rightly question conclusions in origin subjects.
The "science" is not very well done after all. Lots of room for criticism.; Friday, July 05, 2013 12:59:00 AM
Georgi Marinov said...: It's not certain that 1 FPKM means 1 copy per cell. That might be true for large neurons, but for many smaller cell types with less mRNA per cell, it's more like 5 FPKM = 1 copy.

Nobody really has hard data on this though.; Friday, July 05, 2013 3:17:00 AM
AllanMiller said...: "This Feynman guy ..."

When was the last time a creationist researcher gave serious weight to contrary evidence to their own theories? Never, you say?; Friday, July 05, 2013 6:51:00 AM
Larry Moran said...: @Georgi Marinov,

I don't understand what FPKM means so I just went with the authors' estimate that this corresponds to about one copy per cell. If the function requires hybridization to something in the cell then you're going to need a lot more than one copy per cell.

However, I'm pleased that the proponents of functional RNAs are finally waking up to the idea that number of copies is important [How to Evaluate Genome Level Transcription Papers].; Friday, July 05, 2013 9:46:00 AM
Larry Moran said...: Robert Byers says,

That this Feynman guy has to tell people to report all the facts, pro and con, indicates there is a need to do this obvious thing.

There's definitely a need. Please tell your creationist friends ... and think about how it applies to you.; Friday, July 05, 2013 9:48:00 AM
Georgi Marinov said...: Eventually these questions will be answered by single-cell RNA-seq - of course, done in a way that allows absolute transcript copies to be counted. This will tell us in how many cells in a population and at how many copies things are expressed.

I am glad they did not use the subcellular fractions from ENCODE though - that could have been a serious trap in terms of FPKM and absolute abundances (FPKM is a relative metric, not an absolute one) that they could have easily fallen into, but they didn't.; Friday, July 05, 2013 9:51:00 AM
Diogenes said...: Why is it that anti-junk people always switch to passive tense when they're lying? Passive tense pussies. I guess they think that it's OK to falsify the history of science so long as you don't name the specific person who did the thing that never happened, but instead the PTP (Passive Tense Pussy) switches to passive: "Non-coding DNA, long dismissed as Junk..." Dismissed by whom? In what paper in what journal what page number in what year?

Intergenic sequence was once thought to be transcriptionally silent “junk DNA,”

Thought to be BY WHOM? In what paper in what journal what page number in what year?

A large fraction of the human genome consists of intergenic sequence. Once referred to as “junk DNA”,

Referred to BY WHOM? In what paper in what journal what page number in what year? Pussies.

We must ban the passive tense.; Friday, July 05, 2013 10:17:00 AM
Larry Moran said...: That's very cool. I assume this is the PNAS paper that's in press? I can't wait to read it.; Friday, July 05, 2013 10:27:00 AM
DK said...: Sean Eddy:

I dunno, UTRs cover a lot of territory. Counting the ATG to stop extent will underestimate pre-mRNA coverage by a lot.

But surely 3' and 5' UTRs are not about as long as the gene parts (first exon to second exon)? Which is what this 40% figure pretty much requires...; Friday, July 05, 2013 1:05:00 PM
Sean Eddy said...: Why not?

Even if you don't believe actual data (i.e. the actual statistics of the current human genome annotation), a back of the envelope calculation suffices: typical mRNA = 2-4kb. Typical protein = 300-400aa, thus ~1kb of coding. Not hard to believe more UTR than coding.; Friday, July 05, 2013 4:28:00 PM
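The back-of-the-envelope estimate above can be written out as a short sketch. It uses only the figures quoted in the comment (mature mRNA of 2-4 kb, proteins of 300-400 aa); the function name and the particular pairings of mRNA and protein sizes are illustrative assumptions, and the stop codon is ignored. Note that it concerns mature mRNA only, so introns play no part in it.

```python
CODON_NT = 3  # nucleotides per codon


def utr_length_nt(mrna_nt: int, protein_aa: int) -> int:
    """Untranslated length of a mature mRNA (hypothetical helper; ignores the stop codon)."""
    coding_nt = protein_aa * CODON_NT
    return mrna_nt - coding_nt


# ASSUMPTION: these pairings simply span the ranges quoted in the comment.
examples = [(2000, 400), (3000, 350), (4000, 300)]

for mrna_nt, protein_aa in examples:
    coding_nt = protein_aa * CODON_NT
    utr_nt = utr_length_nt(mrna_nt, protein_aa)
    print(f"mRNA {mrna_nt} nt, protein {protein_aa} aa: "
          f"coding {coding_nt} nt, UTR {utr_nt} nt")
```

At the midpoint of both ranges (3 kb mRNA, 350 aa protein) the untranslated portion is about 1,950 nt against 1,050 nt of coding sequence; only at the extreme of a short mRNA paired with a long protein does coding sequence exceed UTR.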
Georgi Marinov said...: GENCODE has 2.86% of the genome annotated as exons of protein coding genes, of that only 1.11% are annotated as CDS. 1.5% has been specifically annotated as UTRs, i.e. more than CDSs. Note that this does not sum to the total of the exons and I have no idea why that is but I would guess it's because of non-coding transcripts of protein coding genes for which the UTRs have not been specifically annotated.; Friday, July 05, 2013 4:44:00 PM
DK said...: Your back of the envelope example surely uses mature mRNA. But an average intron is ~20X longer than an average exon. Bingo, there is no way the UTRs are about as long as the transcribed genes.; Friday, July 05, 2013 9:41:00 PM
Unknown said...: Just came out in Early Edition this week - the link is in my comment. Fig. S4 is the key figure showing what random DNA does.; Friday, July 05, 2013 9:51:00 PM
Diogenes said...: Bitchin cool, Mike.; Friday, July 05, 2013 10:04:00 PM
SPARC said...: My gut feeling says that 3'-UTRs are much longer than 5'-UTRs. Of the genes I've worked with, mammalian androgen receptor genes possess the longest 5'-UTR, at >1000 nucleotides (1124 nt in the human AR gene; don't trust the annotation in ENSEMBL); the others were about 100 nt or less. This estimate may be biased towards higher numbers because the genes I worked with contained untranslated first exons.; Saturday, July 06, 2013 2:08:00 AM
un said...: Hey Larry,

Thanks for your comment on the paper.

"We confirmed that >85% of the genome is transcribed but discovered that only about 2% produces lincRNAs that meet our minimal criteria for potential function."

Could you please cite a source that confirms that 2% estimate, or at least clarify what you meant by this point? I'm not a specialist, and I failed to find a proper and recent source that explains the extent of functional lincRNA in the human genome.

Thanks again!; Sunday, July 07, 2013 3:47:00 AM
Larry Moran said...: Taking their most generous estimate, they identified 53,864 potential lincRNAs. If we assign a generous average length of 1000 bp, then this works out to 1.8% of the genome.; Sunday, July 07, 2013 9:04:00 AM
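The estimate in the reply above can be reproduced with a few lines of arithmetic. This is only a sketch: the genome size of 3.0 Gb is an assumption that is not stated in the comment (a figure closer to 3.2 Gb would give about 1.7%), the function name is hypothetical, and the transcripts are assumed not to overlap.

```python
def fraction_of_genome(n_transcripts: int, mean_length_bp: int,
                       genome_size_bp: float) -> float:
    """Fraction of the genome covered, assuming transcripts do not overlap."""
    return n_transcripts * mean_length_bp / genome_size_bp


N_LINCRNAS = 53_864      # most generous count cited in the comment
MEAN_LENGTH_BP = 1_000   # generous average length used in the comment
GENOME_SIZE_BP = 3.0e9   # ASSUMPTION: approximate human genome size, not from the source

coverage = fraction_of_genome(N_LINCRNAS, MEAN_LENGTH_BP, GENOME_SIZE_BP)
print(f"{coverage:.1%}")
```

Because both the count and the average length are deliberately generous, the result is best read as an upper bound on the share of the genome that these candidate lincRNAs could occupy.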
Anonymous said...: Cornelius is at it again - http://www.darwins-god.blogspot.se/2013/07/heres-new-paper-on-long-non-coding-rna.html.; Tuesday, July 09, 2013 3:11:00 PM
Rolf Aalberg said...: Re. Cornelius, Sorry, the page you were looking for in this blog does not exist.; Sunday, January 22, 2017 3:31:00 PM
