Sandwalk: How to Frame a Null Hypothesis

Wednesday, May 06, 2009

How to Frame a Null Hypothesis

A reader has alerted me to an article by Michael White at Adaptive Complexity: Genomic Junk and Transcriptional Noise.

With hot, new technologies, biologists are taking higher-resolution snapshots of what's going on inside the cell, but the results are stirring up controversy. One of the most interesting recent discoveries is that transcription is everywhere: DNA is transcribed into RNA all over the genome, even DNA that has long been thought to have a non-functional role. What is all of this transcription for? Does the 'dark matter' of the genome have some cryptic, undiscovered function?

Unfortunately, in all of the excitement over possible new functions, many biologists have forgotten how to frame a null hypothesis - the default scenario that you expect to see if there is no function to this transcribed DNA. As a result, the literature is teeming with wild, implausible speculation about how our excess DNA might be beneficial to us.

So here, let's step back and look at what we expect from DNA when it's playing absolutely no functional role; in other words, let's look at the null hypothesis of genomic junk and transcriptional noise. We can then take our null hypothesis and use it to look at a fascinating new study of how genomic parasites sculpt transcription in our cells.

If you are interested in what's wrong with science these days then you must read his article.

The point is not whether you believe that all transcription is adaptive and functional, or whether you believe that most of it is noise. The real point is that it is very bad science to ignore the null hypothesis and publish naive speculation as if it were the only possible explanation.

Whenever you see a paper that fails to address the null hypothesis you can be sure that you are reading bad science. Everything else in the paper is suspect.

The key fact that most scientists are overlooking is that RNA polymerase and the various transcription factors must bind non-specifically at thousands of sites in a random sequence of junk DNA. This is just basic biochemistry of the sort that should be taught in undergraduate classes. Transcription will be initiated by accident at some of these sites even though they are not functional promoters. Again, this is basic biochemistry.

[Image Credit: Horton et al. Principles of Biochemistry 4/e p.657]
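
To make the null expectation concrete, here is a minimal sketch (not from the original post) of how many exact matches to one short binding motif you would expect by chance in random DNA. It assumes independent positions, equal base frequencies, and a single fixed motif. Real binding sites are degenerate and chromatin limits access, so treat the output as an order-of-magnitude guide only. The function name, the 1 Mb sequence length, and the motif lengths are illustrative choices.

```python
def expected_chance_sites(seq_len: int, motif_len: int, both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed motif in random DNA.

    Assumptions (illustrative, not from the post): each position is
    independent and each base has probability 0.25.
    """
    if motif_len <= 0 or seq_len < motif_len:
        return 0.0
    positions = seq_len - motif_len + 1      # possible start positions
    strands = 2 if both_strands else 1       # a motif can occur on either strand
    return positions * strands * 0.25 ** motif_len


if __name__ == "__main__":
    # Hypothetical motif lengths; shorter motifs match far more often by chance.
    for motif_len in (6, 8, 10, 12):
        sites = expected_chance_sites(1_000_000, motif_len)
        print(f"motif length {motif_len}: about {sites:.1f} chance sites per Mb")
```

The point of the sketch is only that chance matches to short motifs are not rare in a large genome, so their mere presence cannot count as evidence of function.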

24 comments :

Harriet said...: My next probs and stats class is going to be required to read this article. :-); Wednesday, May 06, 2009 3:26:00 PM
Sigmund said...: Larry, you protesteth too much!
I disagree with your final paragraph. In a mammalian genome RNA Pol and transcription factors do not have to bind non specifically across non functional sequence. We have known for years that the chromatin structure of genomic regions is critical to the binding capacity of such factors. Promoter regions tend to be modified for access and downstream parts of genes and non functional parts of the genome correspondingly modified to prevent access to these factors.
A hypothesis that RNA Polymerase is simply binding and transcribing noisily across the genome at random is simply not supported by the current data (Encode project, CAGE tag deep sequence analysis etc).
I am not personally of the opinion that all or even most of the RNA that is transcribed has some adaptive function but I think the data is at least sufficient to suggest an agnostic approach to the question, at least in principle.; Wednesday, May 06, 2009 5:13:00 PM
Larry Moran said...: MartinC says,

I disagree with your final paragraph. In a mammalian genome RNA Pol and transcription factors do not have to bind non specifically across non functional sequence. We have known for years that the chromatin structure of genomic regions is critical to the binding capacity of such factors.

Yes, that's true. There are parts of the genome that are heterochromatic or at least bound in a "closed" conformation of chromatin. Those regions are less likely to bind RNA polymerase. It doesn't change the argument very much.

Promoter regions tend to be modified for access and downstream parts of genes and non functional parts of the genome correspondingly modified to prevent access to these factors.
A hypothesis that RNA Polymerase is simply binding and transcribing noisily across the genome at random is simply not supported by the current data (Encode project, CAGE tag deep sequence analysis etc).

Which part of the data rules out noise? If you have widespread transcription then it implies that a large part of the genome is available for binding, right?

I am not personally of the opinion that all or even most of the RNA that is transcribed has some adaptive function but I think the data is at least sufficient to suggest an agnostic approach to the question, at least in principle.

Agnosticism is good. I'd like to see a lot more of it. Can you point out a paper from one of the megaprojects that exhibits the kind of agnosticism that you admire?; Wednesday, May 06, 2009 5:20:00 PM
Sigmund said...: Larry, I tend to read most of the big project papers as data dumps rather than take their conclusions as gospel, so to speak. There's so much interesting data being produced recently that it's going to take several years before we put it into some sort of perspective.
Being overly speculative is simply a necessity for receiving continued funding these days. You won't get a decent publication by simply confirming previously known or speculated points.
As for an example of a decently done paper I would suggest Barski et al. in Cell 2007, High-Resolution Profiling of Histone Methylations in the Human Genome, looking at chromatin structure corresponding to transcription and silencing.; Wednesday, May 06, 2009 6:18:00 PM
John S. Wilkins said...: Agnosticism is good. I'd like to see a lot more of it.

Only in molecular biology, or in other fields as well?; Wednesday, May 06, 2009 6:43:00 PM
DK said...: There's so much interesting data being produced recently that it's going to take several years before we put it into some sort of perspective.

Like human, yeast and worm "protein interactomes" that overlap at best in a couple of percent of total "interactions". Isn't it more reasonable to discard them as massive noise artefacts than to spend years trying to make sense of something that obviously does not make sense?; Wednesday, May 06, 2009 8:07:00 PM
Anthonzi said...: These kinds of people are almost as bad as ID proponents sometimes ._.; Wednesday, May 06, 2009 11:15:00 PM
Georgi Marinov said...: Which part of the data rules out noise? If you have widespread transcription then it implies that a large part of the genome is available for binding, right?

If you have multiple CAGE tags, or TFs and PolII binding consistently mapping to the same sites in the middle of nowhere, this is good evidence that it is not just transcriptional noise and that things are more complicated than we thought.

It still does not mean those are functional transcripts, of course, although it seems certain at this point that the repertoire of functional RNA molecules that get produced is really greater than traditionally expected.

As was pointed out though, part of the problem may be the way the papers are presented. Those are indeed data-heavy papers that do not always have clear conclusions apparent in the data. But because something on the order of a million dollars and above has been spent, they have to be published in prestigious journals, which means that "a story" has to be present. So this may be the source of some of the stretching of the limits of sound scientific reasoning we see.; Thursday, May 07, 2009 12:26:00 AM
PonderingFool said...: If you have multiple CAGE tags, or TFs and PolII binding consistently mapping to the same sites in the middle of nowhere, this is good evidence that it is not just transcriptional noise and that things are more complicated than we thought.

***********************

Or it could still be noise: the sequence there is favored over other random sequences for whatever reason (including chance) that has nothing to do with the transcript made, hence it shows up over and over again. Certain sequences are favored by the polymerase. In long stretches of sequence you would expect certain regions to be favored by chance.; Thursday, May 07, 2009 8:41:00 AM
Sigmund said...: Pondering Fool said:
"Certain sequences are favored by the polymerase. In long stretches of sequence you would expect certain regions to be favored by chance."
I suppose the question we are asking is how do we distinguish the types of favored sites you mention from actual functional regions. Without the sort of whole genome approach that's been applied recently we are really just speculating, and even at this stage we still have a lot of confirmatory work to do to really work out the rules. I would, however, suggest that what we are discussing here is not just a matter of a random sequence that just happens to produce a higher than background spike of RNA PolII binding. The sort of things we see from the data are a convergence of many factors (RNA POLII binding, multiple independent chromatin modifications, DNAse accessibility, high numbers of transcripts etc). We know that these factors are associated with promoter or other such regulatory regions so the evidence does suggest something more than background noise. As I've said, we are at an early stage in the understanding of this but it's not a question of pure untestable speculation as some seem to imply.; Thursday, May 07, 2009 9:48:00 AM
Art said...: "The sort of things we see from the data are a convergence of many factors (RNA POLII binding, multiple independent chromatin modifications, DNAse accessibility, high numbers of transcripts etc). We know that these factors are associated with promoter or other such regulatory regions so the evidence does suggest something more than background noise. As I've said, we are at an early stage in the understanding of this but it's not a question of pure untestable speculation as some seem to imply."

I think that the "noise" explanation is still pretty good. From a paper by Neil et al:

"Our data reveal numerous new CUTs with such a potential regulatory role. However, most of the identified CUTs corresponded to transcripts divergent from the promoter regions of genes, indicating that they represent by-products of divergent transcription occurring at many and possibly most promoters. Eukaryotic promoter regions are thus intrinsically bidirectional, a fundamental property that escaped previous analyses because in most cases divergent transcription generates short-lived unstable transcripts present at very low steady-state levels."

The paper: Helen Neil, Christophe Malabat, Yves d’Aubenton-Carafa, Zhenyu Xu, Lars M. Steinmetz & Alain Jacquier. 2009. Widespread bidirectional promoters are the major source of cryptic transcripts in yeast. Nature 457, 1038.

A bit more about this subject.; Thursday, May 07, 2009 9:40:00 PM
Sigmund said...: Art, those papers you linked to do not support the idea Larry described in his final paragraph. Read what he said again. I agree with the conclusions in the papers (a lot of apparent non-coding transcripts seem to come from bidirectional promoters, and a lot of eukaryotic promoters seem to be inherently bidirectional). That is quite a different point from the one made by Larry. Whether the CUTs have a function in and of themselves is a different matter (there is evidence that some do in a sequence-specific manner, for instance those associated with CCND1, or in a non-sequence-specific manner as 'pioneer' transcripts that allow for the opening of chromatin for access to high output transcription of coding transcripts on the same or opposite strand), but that is a different question and one that really needs a lot more work in order to draw firm conclusions.; Friday, May 08, 2009 2:37:00 AM
Art said...: Larry:

"The key fact that most scientists are overlooking is that RNA polymerase and the various transcription factors must bind non-specifically at thousands of sites in a random sequence of junk DNA. This is just basic biochemistry of the sort that should be taught in undergraduate classes. Transcription will be initiated by accident at some of these sites even though they are not functional promoters. Again, this is basic biochemistry."

Neil et al.:

"However, most of the identified CUTs corresponded to transcripts divergent from the promoter regions of genes, indicating that they represent by-products of divergent transcription occurring at many and possibly most promoters."

One can be pedantic about this and find possible items of disagreement, but the basic gists of these two quotes are very similar.; Friday, May 08, 2009 8:18:00 PM
Sigmund said...: Art, take as a model a 1 Mb genomic sequence which contains a single well defined promoter of a known functional gene exactly at the center point.
Now ask yourself, if we look at the EST database results from multiple tissues (essentially a sampling of the transcripts from the 1 Mb segment) do the two paragraphs predict the same result?
The second paragraph suggests that we will see many ESTs corresponding to the known gene and others corresponding to transcription initiated at the same promoter, but in the opposite orientation (essentially transcription linked to the promoter but in more than one direction).
Larry's paragraph, however, suggests multiple initiation events throughout the 1 Mb segment.
If the important point here is to distinguish 'noise' from signal then it is certainly not pedantic to point out that these two models predict very different transcription profiles and thus different possible interpretations of 'noise'.; Saturday, May 09, 2009 6:31:00 AM
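
The thought experiment in the comment above can be sketched as a toy simulation. This is an editorial illustration, not something either commenter wrote: both models are deliberate caricatures, and the window size, spread, and function names are hypothetical. One model places every initiation event at the central promoter in either orientation; the other scatters initiation events uniformly across the 1 Mb segment. The sketch then compares the fraction of start sites that fall near the promoter.

```python
import random

SEGMENT_LEN = 1_000_000          # the 1 Mb segment from the thought experiment
PROMOTER_POS = SEGMENT_LEN // 2  # single promoter at the center point
WINDOW = 1_000                   # hypothetical cutoff for "promoter-linked"


def promoter_linked_starts(n, rng, spread=200):
    """Caricature of the bidirectional-promoter model: every start lies
    within `spread` bp of the promoter, on either strand."""
    return [(PROMOTER_POS + rng.randint(-spread, spread), rng.choice("+-"))
            for _ in range(n)]


def random_starts(n, rng):
    """Caricature of the random-initiation model: starts fall uniformly
    anywhere in the segment, on either strand."""
    return [(rng.randrange(SEGMENT_LEN), rng.choice("+-")) for _ in range(n)]


def fraction_near_promoter(starts):
    """Fraction of start sites within WINDOW bp of the promoter."""
    near = sum(1 for pos, _strand in starts if abs(pos - PROMOTER_POS) <= WINDOW)
    return near / len(starts)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print("promoter-linked:", fraction_near_promoter(promoter_linked_starts(10_000, rng)))
    print("random:", fraction_near_promoter(random_starts(10_000, rng)))
```

Under these assumptions the two models give very different spatial profiles, which is the distinction the comment draws. Real data would sit somewhere between the two extremes, and neither caricature says anything about function.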
Anonymous said...: Seconding the "you protesteth too much" remark.; Saturday, May 09, 2009 8:35:00 PM
Art said...: MartinC, I'm content to ask if Larry can reconcile his remarks with some of the interesting new results that have come out in the past few years. I can, but I would rather not be putting words in Larry's mouth.; Sunday, May 10, 2009 8:29:00 AM
Larry Moran said...: MartinC asks,

The sort of things we see from the data are a convergence of many factors (RNA POLII binding, multiple independent chromatin modifications, DNAse accessibility, high numbers of transcripts etc). We know that these factors are associated with promoter or other such regulatory regions so the evidence does suggest something more than background noise. As I've said, we are at an early stage in the understanding of this but it's not a question of pure untestable speculation as some seem to imply.

RNA POLII binding, multiple independent chromatin modifications, and DNAse accessibility are not independent variables. They would all be associated with random noise so you can't use them to distinguish between noise and function.

The abundance of transcripts, on the other hand, is important. That's why I list it as one of the criteria necessary to Evaluate Genome Level Transcription Papers.

Unfortunately, you won't find much information about the abundance of various transcripts in most of those papers. The authors know very well that they're dealing with only a few—perhaps less than one—transcripts per cell but for some strange reason they don't think it's important to mention this in the paper.; Sunday, May 10, 2009 9:08:00 AM
Sigmund said...: Transcript levels are certainly important and an unbiased deep sequencing approach using cDNA isolated from a known number of cells is probably the best way to examine this question but this is something that has only recently become methodologically possible.
It can be a mistake, however, to assume that low level transcription simply equates to non-functional noise. Indeed the figure of less than one transcript per cell that Larry mentioned is not exactly unusual for known protein encoding mRNAs. A single mRNA gives rise to several thousand molecules of protein such that a gene expressing an mRNA with a short half-life can still have important functional effects even though its average mRNA transcript level is below one per cell since the protein can still be present at several thousand copies. There are many low copy number RNAs that have recently been identified, that show evidence of function since siRNA targeting leads to important cellular effects (frequently at the level of chromatin remodelling of specific promoters).
Neither of these points allows us to propose a generalized model for transcriptional regulation but it should at least remind us to keep our minds open about the possibility that novel functional transcripts exist in the database.
It doesn't, by itself, rule out these same transcripts as simple noise either, but it suggests that a combination of chromatin analysis, transcriptional profiling and transcriptional functional analysis (siRNA targeting, for instance) provides the best route towards creating such a model.; Monday, May 11, 2009 8:09:00 AM
Georgi Marinov said...: The argument that low transcript levels mean noise is not a convincing one. If you look at some RNA-Seq data (which allows you to get a crude estimate of the number of transcripts per cell) one of the striking things to be noticed is that some very famous (and presumably essential) genes are expressed at a few transcripts per cell at most. Of course, this might be an artifact of the cell culture systems and tissues that the datasets I have looked at personally come from, but it is definitely not a result that supports the "low expression = non-functionality" argument.

I am not arguing that most of those transcripts are functional, let it be clear, but I don't think we should dismiss them without further consideration either. Probably a few of the novel RNA classes described will turn out to be reproducible errors inherent to the process, or having something to do with the silencing of the regions they originate from, or turn out to be trivial for some other reason, but some will turn out to be more than that. The future will tell; Monday, May 11, 2009 8:33:00 AM
Larry Moran said...: MartinC

It can be a mistake, however, to assume that low level transcription simply equates to non-functional noise.

I agree, that's why I would never make such a stupid argument. On the other hand, if you are going to argue that a low abundance transcript is functional then you have to invoke hypotheses that make those transcripts unusual.

What I'm challenging is the belief that because it exists, it must be functional. I'm also challenging the fact that most papers ignore the fact that these transcripts are rare.

Indeed the figure of less than one transcript per cell that Larry mentioned is not exactly unusual for known protein encoding mRNAs. A single mRNA gives rise to several thousand molecules of protein ...

That's an incorrect statement. A typical mammalian mRNA is only translated about 100 times or less. And there are very few proteins that can be functional in a mammalian cell at a concentration of only 100 molecules. A typical regulatory protein, for example, has to be present in >10,000 copies.

... such that a gene expressing an mRNA with a short half-life can still have important functional effects even though its average mRNA transcript level is below one per cell since the protein can still be present at several thousand copies.

Your reasoning is incorrect because your facts are wrong.; Monday, May 11, 2009 9:22:00 AM
Larry Moran said...: The argument that low transcript levels mean noise is not a convincing one.

I agree 100%. But low abundance is an important bit of information that's consistent with noise. That fact (low abundance) should not be ignored.

I am not arguing that most of those transcripts are functional, let it be clear, but I don't think we should dismiss them without further consideration either.

Let's be real clear on what I'm saying. I'm saying that it is scientifically unethical to claim that transcripts are functional simply because they exist. Ignoring an important counter-argument is not the way good scientists are supposed to behave.

I find it interesting that so few of you have found papers where the issue is treated, correctly, as a controversy.

Why is that?; Monday, May 11, 2009 9:29:00 AM
Georgi Marinov said...: Let's be real clear on what I'm saying. I'm saying that it is scientifically unethical to claim that transcripts are functional simply because they exist. Ignoring an important counter-argument is not the way good scientists are supposed to behave.
And I agree with this 100% too.

I find it interesting that so few of you have found papers where the issue is treated, correctly, as a controversy.

Why is that?
The cynical explanation, as I said above, is that when you have the hottest technology to come out since PCR, and you have spent a good amount of money to do the experiments (because this type of experiment is not cheap yet, although it soon will be), it is somewhat not in your best interest to treat the results as noise. I don't think it is all noise, as I said in previous posts, but I am rationalizing as to why, if it was noise, it would still be reported as more than that by the authors, who certainly should know what their data tell them better than anybody else. So what you do is say "Hey, we discovered such and such transcripts, we don't know what they do, but it would be interesting if they turn out to be functional".

The other explanation is that the standards of scientific reasoning one needs to meet in order to get the high-profile publication and the amount of publicity these papers receive aren't that high. It is as much a failure of authors as a failure of editors and reviewers.

Truth be told, I don't recall any of these papers (and I admit that I have yet to find time to read the FANTOM papers in depth) making the grand claim that everything they found is functional; they just do not spend too much time talking about the possibility that most of it is noise (which is still not the correct thing to do, of course).; Monday, May 11, 2009 9:45:00 AM
Larry Moran said...: Georgi Marinov says,

Truth be told, I don't recall any of these papers (and I admit that I have yet to find time to read the FANTOM papers in depth) making the grand claim that everything they found is functional, ...

You are correct. None of the papers makes the overt claim that everything is functional. Instead, they state or imply that a large percentage of the non-coding RNAs are functional.

Here's a recent review from a few weeks ago by John Mattick: The Genetic Signatures of Noncoding RNAs. What do you think of this form of scientific paper?; Tuesday, May 12, 2009 9:32:00 AM
Georgi Marinov said...: Well, the facts are facts, the question is how you interpret them when you don't have all the facts you need. If you ask me whether there is a lot of overselling in this article, the answer is yes, I agree with that. But this does not mean that we should automatically switch to the opposite extreme of the opinion spectrum either - that all ncRNA phenomena are products of that queen of the omics sciences, the artifactomics.

The correct position, in my view, is to admit that there is a lot we don't know and have yet to learn, then start figuring it out (which we are doing), but at the same time be very careful how we formulate and communicate our hypotheses about what might be going on to the public (which isn't happening). Because, as is well known, the subtle details of the scientific debate will almost certainly be ignored or misinterpreted.; Tuesday, May 12, 2009 3:51:00 PM
