
Wednesday, May 06, 2009

How to Frame a Null Hypothesis

A reader has alerted me to an article by Michael White at Adaptive Complexity: Genomic Junk and Transcriptional Noise.
With hot, new technologies, biologists are taking higher-resolution snapshots of what's going on inside the cell, but the results are stirring up controversy. One of the most interesting recent discoveries is that transcription is everywhere: DNA is transcribed into RNA all over the genome, even DNA that has long been thought to have a non-functional role. What is all of this transcription for? Does the 'dark matter' of the genome have some cryptic, undiscovered function?

Unfortunately, in all of the excitement over possible new functions, many biologists have forgotten how to frame a null hypothesis - the default scenario that you expect to see if there is no function to this transcribed DNA. As a result, the literature is teeming with wild, implausible speculation about how our excess DNA might be beneficial to us.

So here, let's step back and look at what we expect from DNA when it's playing absolutely no functional role; in other words, let's look at the null hypothesis of genomic junk and transcriptional noise. We can then take our null hypothesis and use it to look at a fascinating new study of how genomic parasites sculpt transcription in our cells.
If you are interested in what's wrong with science these days, then you must read his article.

The point is not whether you believe that all transcription is adaptive and functional, or whether you believe that most of it is noise. The real point is that it is very bad science to ignore the null hypothesis and publish naive speculation as if it were the only possible explanation.

Whenever you see a paper that fails to address the null hypothesis, you can be sure that you are reading bad science. Everything else in the paper is suspect.

The key fact that most scientists are overlooking is that RNA polymerase and the various transcription factors must bind non-specifically at thousands of sites in a random sequence of junk DNA. This is just basic biochemistry of the sort that should be taught in undergraduate classes. Transcription will be initiated by accident at some of these sites even though they are not functional promoters. Again, this is basic biochemistry.
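To get a feel for the scale involved, here's a minimal back-of-envelope sketch in Python (an illustration only: the 6 bp motif, exact matching, and equal base frequencies are simplifying assumptions, not measured values):

```python
# How often should a short promoter-like motif occur by chance
# in a random genome-sized sequence?
motif_length = 6                    # e.g. a TATAAA-like core element
p_match = 0.25 ** motif_length      # chance a random 6-mer matches exactly
genome_size = 3.2e9                 # approximate haploid human genome (bp)
strands = 2                         # both strands get scanned

expected_hits = genome_size * strands * p_match
print(f"Expected chance matches: {expected_hits:,.0f}")  # ~1.6 million
```

Even if only a tiny fraction of these chance matches ever recruit RNA polymerase, accidental initiation at thousands of sites is exactly what the null hypothesis predicts.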


[Image Credit: Horton et al. Principles of Biochemistry 4/e p.657]

24 comments:

Harriet said...

My next probs and stats class is going to be required to read this article. :-)

Sigmund said...

Larry, you protesteth too much!
I disagree with your final paragraph. In a mammalian genome RNA Pol and transcription factors do not have to bind non-specifically across non-functional sequence. We have known for years that the chromatin structure of genomic regions is critical to the binding capacity of such factors. Promoter regions tend to be modified for access, and downstream parts of genes and non-functional parts of the genome are correspondingly modified to prevent access to these factors.
A hypothesis that RNA polymerase is simply binding and transcribing noisily across the genome at random is not supported by the current data (ENCODE project, CAGE-tag deep sequencing analysis, etc.).
I am not personally of the opinion that all or even most of the RNA that is transcribed has some adaptive function, but I think the data are at least sufficient to suggest an agnostic approach to the question, at least in principle.

Larry Moran said...

MartinC says,

I disagree with your final paragraph. In a mammalian genome RNA Pol and transcription factors do not have to bind non-specifically across non-functional sequence. We have known for years that the chromatin structure of genomic regions is critical to the binding capacity of such factors.

Yes, that's true. There are parts of the genome that are heterochromatic or at least bound in a "closed" conformation of chromatin. Those regions are less likely to bind RNA polymerase. It doesn't change the argument very much.

Promoter regions tend to be modified for access, and downstream parts of genes and non-functional parts of the genome are correspondingly modified to prevent access to these factors.
A hypothesis that RNA polymerase is simply binding and transcribing noisily across the genome at random is not supported by the current data (ENCODE project, CAGE-tag deep sequencing analysis, etc.).


Which part of the data rules out noise? If you have widespread transcription then it implies that a large part of the genome is available for binding, right?

I am not personally of the opinion that all or even most of the RNA that is transcribed has some adaptive function, but I think the data are at least sufficient to suggest an agnostic approach to the question, at least in principle.

Agnosticism is good. I'd like to see a lot more of it. Can you point out a paper from one of the megaprojects that exhibits the kind of agnosticism that you admire?

Sigmund said...

Larry, I tend to read most of the big project papers as data dumps rather than take their conclusions as gospel, so to speak. There's so much interesting data being produced recently that it's going to take several years before we put it into some sort of perspective.
Being overly speculative is simply a necessity for receiving continued funding these days. You won't get a decent publication by simply confirming previously known or speculated points.
As for an example of a decently done paper, I would suggest Barski et al. in Cell 2007, "High-Resolution Profiling of Histone Methylations in the Human Genome", looking at chromatin structure corresponding to transcription and silencing.

John S. Wilkins said...

Agnosticism is good. I'd like to see a lot more of it.

Only in molecular biology, or in other fields as well?

DK said...

There's so much interesting data being produced recently that it's going to take several years before we put it into some sort of perspective.

Like the human, yeast and worm "protein interactomes" that overlap by at best a couple of percent of total "interactions". Isn't it more reasonable to discard them as massive noise artefacts than to spend years trying to make sense of something that obviously does not make sense?

Anthonzi said...

These kinds of people are almost as bad as ID proponents sometimes ._.

Georgi Marinov said...

Which part of the data rules out noise? If you have widespread transcription then it implies that a large part of the genome is available for binding, right?

If you have multiple CAGE tags, or TFs and Pol II binding consistently mapping to the same sites in the middle of nowhere, this is good evidence that it is not just transcriptional noise and that things are more complicated than we thought.

It still does not mean those are functional transcripts, of course, although it seems certain at this point that the repertoire of functional RNA molecules that get produced is greater than traditionally expected.

As was pointed out, though, part of the problem may be the way the papers are presented. Those are indeed data-heavy papers that do not always have clear conclusions apparent in the data. But because something on the order of a million dollars or more has been spent, they have to be published in prestigious journals, which means that "a story" has to be present. So this may be the source of some of the stretching of the limits of sound scientific reasoning that we see.

PonderingFool said...

If you have multiple CAGE tags, or TFs and Pol II binding consistently mapping to the same sites in the middle of nowhere, this is good evidence that it is not just transcriptional noise and that things are more complicated than we thought.

***********************

Or it could still be noise: the sequence there is, for whatever reason (including chance), favored over other random sequences in a way that has nothing to do with the transcript made, hence it shows up over and over again. Certain sequences are favored by the polymerase, and over long stretches of sequence you would expect certain regions to be favored by chance.
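A small simulation makes that concrete (a sketch of my own, using a toy affinity model rather than any measured polymerase preference):

```python
# Even with purely nonfunctional binding, the *same* sites in a fixed
# random sequence recur across replicate experiments, because relative
# affinity is determined by the sequence itself.
import random

random.seed(1)
seq = "".join(random.choice("ACGT") for _ in range(100_000))
consensus = "TATAAA"                 # toy preferred sequence
positions = range(len(seq) - 6)

def affinity(i):
    # toy model: affinity grows exponentially with consensus matches
    return 4 ** sum(a == b for a, b in zip(seq[i:i+6], consensus))

weights = [affinity(i) for i in positions]
uniform = [1] * len(weights)

def mock_experiment(w, n_events=500):
    # sample "binding events" in proportion to the given weights
    return set(random.choices(positions, weights=w, k=n_events))

shared_seq = len(mock_experiment(weights) & mock_experiment(weights))
shared_uni = len(mock_experiment(uniform) & mock_experiment(uniform))
print(f"replicate overlap, sequence-favored sites: {shared_seq}")
print(f"replicate overlap, uniform sampling:       {shared_uni}")
```

The sequence-favored replicates share many more sites than the uniform ones, so consistent mapping to the same positions is not, by itself, evidence against noise.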

Sigmund said...

Pondering Fool said:
"Certain sequences are favored by the polymerase. Long stretches of sequence you would expect by chance certain regions would be favored."
I suppose the question we are asking is how we distinguish the types of favored sites you mention from actual functional regions. Without the sort of whole-genome approach that's been applied recently we are really just speculating, and even at this stage we still have a lot of confirmatory work to do to really work out the rules. I would, however, suggest that what we are discussing here is not just a matter of a random sequence that happens to produce a higher-than-background spike of RNA Pol II binding. The sort of things we see from the data are a convergence of many factors (RNA Pol II binding, multiple independent chromatin modifications, DNase accessibility, high numbers of transcripts, etc.). We know that these factors are associated with promoter or other such regulatory regions so the evidence does suggest something more than background noise. As I've said, we are at an early stage in the understanding of this but it's not a question of pure untestable speculation as some seem to imply.

Art said...

"The sort of things we see from the data are a convergence of many factors (RNA POLII binding, multiple independent chromatin modifications, DNAse accessibility, high numbers of transcripts etc). We know that these factors are associated with promoter or other such regulatory regions so the evidence does suggest something more than background noise. As I've said, we are at an early stage in the understanding of this but its not a question of pure untestable speculation as some seem to imply."I think that the "noise" explanation is still pretty good. From a paper by Neil et al:

"Our data reveal numerous new CUTs with such a potential regulatory role. However, most of the identified CUTs corresponded to transcripts divergent from the promoter regions of genes, indicating that they represent by-products of divergent transcription occurring at many and possibly most promoters. Eukaryotic promoter regions are thus intrinsically bidirectional, a fundamental property that escaped previous analyses because in most cases divergent transcription generates short-lived unstable transcripts present at very low steady-state levels."The paper: Helen Neil, Christophe Malabat, Yves d’Aubenton-Carafa, Zhenyu Xu, Lars M. Steinmetz & Alain Jacquier. 2009. Widespread bidirectional promoters are the major source of cryptic transcripts in yeast. Nature 457, 1038.

A bit more about this subject.

Sigmund said...

Art, those papers you linked to do not support the idea Larry described in his final paragraph. Read what he said again. I agree with the conclusions in the papers (a lot of apparent non-coding transcripts seem to come from bidirectional promoters, and a lot of eukaryotic promoters seem to be inherently bidirectional), but that is quite a different point from the one made by Larry. Whether the CUTs have a function in and of themselves is a different matter (there is evidence that some do, either in a sequence-specific manner, for instance those associated with CCND1, or in a non-sequence-specific manner as 'pioneer' transcripts that allow the opening of chromatin for access to high-output transcription of coding transcripts on the same or opposite strand), but that is a different question and one that really needs a lot more work before we can draw firm conclusions.

Art said...

Larry:

"The key fact that most scientists are overlooking is that RNA polymerase and the various transcription factors must bind non-specifically at thousands of sites in a random sequence of junk DNA. This is just basic biochemistry of the sort that should be taught in undergraduate classes. Transcription will be initiated by accident at some of these sites even though they are not functional promoters. Again, this is basic biochemistry."Neil et al.:

"However, most of the identified CUTs corresponded to transcripts divergent from the promoter regions of genes, indicating that they represent by-products of divergent transcription occurring at many and possibly most promoters."One can be pedantic about this and find possible items of disagreement, but the basic gists of these two quotes are very similar.

Sigmund said...

Art, take as a model a 1 Mb genomic sequence that contains a single well-defined promoter of a known functional gene exactly at the center point.
Now ask yourself: if we look at the EST database results from multiple tissues (essentially a sampling of the transcripts from the 1 Mb segment), do the two paragraphs predict the same result?
The second paragraph suggests that we will see many ESTs corresponding to the known gene and others corresponding to transcription initiated at the same promoter, but in the opposite orientation (essentially transcription linked to the promoter but in more than one direction).
Larry's paragraph, however, suggests multiple initiation events throughout the 1 Mb segment.
If the important point here is to distinguish 'noise' from signal, then it is certainly not pedantic to point out that these two models predict very different transcription profiles and thus different possible interpretations of 'noise'.
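A quick simulation of that contrast (my own sketch; the 2 kb spread and 5 kb window are arbitrary illustrative choices):

```python
# Two versions of "noise" for a 1 Mb segment with one promoter at the
# midpoint: (a) accidental initiation anywhere in the segment,
# (b) initiation tied to the promoter but in either orientation.
import random

random.seed(0)
SEGMENT, PROMOTER = 1_000_000, 500_000

def uniform_initiation(n=1000):
    # Larry's model: accidental starts scattered across the segment
    return [random.randrange(SEGMENT) for _ in range(n)]

def bidirectional_promoter(n=1000, spread=2_000):
    # bidirectional-promoter model: starts clustered within ~2 kb
    return [int(random.gauss(PROMOTER, spread)) for _ in range(n)]

def frac_near_promoter(ests, window=5_000):
    return sum(abs(p - PROMOTER) < window for p in ests) / len(ests)

print(f"uniform model, ESTs within 5 kb:  {frac_near_promoter(uniform_initiation()):.0%}")
print(f"promoter model, ESTs within 5 kb: {frac_near_promoter(bidirectional_promoter()):.0%}")
# roughly 1% vs 99%: the two models predict very different EST profiles
```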

Anonymous said...

Seconding the "you protesteth too much" remark.

Art said...

MartinC, I'm content to ask if Larry can reconcile his remarks with some of the interesting new results that have come out in the past few years. I can, but I would rather not be putting words in Larry's mouth.

Larry Moran said...

MartinC asks,

The sort of things we see from the data are a convergence of many factors (RNA Pol II binding, multiple independent chromatin modifications, DNase accessibility, high numbers of transcripts, etc.). We know that these factors are associated with promoter or other such regulatory regions so the evidence does suggest something more than background noise. As I've said, we are at an early stage in the understanding of this but it's not a question of pure untestable speculation as some seem to imply.

RNA Pol II binding, multiple independent chromatin modifications, and DNase accessibility are not independent variables. They would all be associated with random noise so you can't use them to distinguish between noise and function.

The abundance of transcripts, on the other hand, is important. That's why I list it as one of the criteria necessary to Evaluate Genome Level Transcription Papers.

Unfortunately, you won't find much information about the abundance of various transcripts in most of those papers. The authors know very well that they're dealing with only a few transcripts per cell, perhaps less than one, but for some strange reason they don't think it's important to mention this in the paper.

Sigmund said...

Transcript levels are certainly important, and an unbiased deep sequencing approach using cDNA isolated from a known number of cells is probably the best way to examine this question, but this is something that has only recently become methodologically possible.
It can be a mistake, however, to assume that low-level transcription simply equates to non-functional noise. Indeed, the figure of less than one transcript per cell that Larry mentioned is not exactly unusual for known protein-encoding mRNAs. A single mRNA gives rise to several thousand molecules of protein, such that a gene expressing an mRNA with a short half-life can still have important functional effects even though its average mRNA transcript level is below one per cell, since the protein can still be present at several thousand copies. There are many low-copy-number RNAs that have recently been identified that show evidence of function, since siRNA targeting leads to important cellular effects (frequently at the level of chromatin remodelling of specific promoters).
Neither of these points allows us to propose a generalized model for transcriptional regulation, but they should at least remind us to keep our minds open to the possibility that novel functional transcripts exist in the database.
They don't, by themselves, rule out these same transcripts being simple noise either, but they suggest that a combination of chromatin analysis, transcriptional profiling, and transcriptional functional analysis (siRNA targeting, for instance) provides the best route towards creating such a model.

Georgi Marinov said...

The argument that low transcript levels mean noise is not a convincing one. If you look at some RNA-Seq data (which allows you to get a crude estimate of the number of transcripts per cell), one of the striking things you notice is that some very famous (and presumably essential) genes are expressed at a few transcripts per cell at most. Of course, this might be an artifact of the cell culture systems and tissues that the datasets I have looked at personally come from, but it is definitely not a result that supports the "low expression = non-functionality" argument.

I am not arguing that most of those transcripts are functional, let it be clear, but I don't think we should dismiss them without further consideration either. Probably a few of the novel RNA classes described will turn out to be reproducible errors inherent to the process, or to have something to do with the silencing of the regions they originate from, or to be trivial for some other reason, but some will turn out to be more than that. The future will tell.
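For what it's worth, the crude estimate mentioned above goes roughly like this (my own sketch, not from any particular paper; the 300,000 total and the proportionality assumption are ballpark simplifications):

```python
# Convert length-normalized RNA-Seq read densities into a rough
# copies-per-cell figure, assuming the total mRNA content is known.

TOTAL_MRNA_PER_CELL = 300_000   # assumed ballpark; varies by cell type

def copies_per_cell(reads, length_kb, library):
    """Crude copies/cell for one transcript.

    reads, length_kb -- read count and length for the transcript of interest
    library          -- (reads, length_kb) pairs for all detected transcripts
    """
    density = reads / length_kb
    total_density = sum(r / kb for r, kb in library)
    return TOTAL_MRNA_PER_CELL * density / total_density

# toy example: one rare transcript against a dominant background
library = [(50, 2.0), (2_000_000, 2.0)]
print(f"{copies_per_cell(50, 2.0, library):.1f} copies per cell")  # ~7.5
```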

Larry Moran said...

MartinC

It can be a mistake, however, to assume that low-level transcription simply equates to non-functional noise.

I agree; that's why I would never make such a stupid argument. On the other hand, if you are going to argue that a low-abundance transcript is functional, then you have to invoke hypotheses that make those transcripts unusual.

What I'm challenging is the belief that because it exists, it must be functional. I'm also objecting to the way most papers ignore the fact that these transcripts are rare.

Indeed, the figure of less than one transcript per cell that Larry mentioned is not exactly unusual for known protein-encoding mRNAs. A single mRNA gives rise to several thousand molecules of protein ...

That's an incorrect statement. A typical mammalian mRNA is only translated about 100 times or less. And there are very few proteins that can be functional in a mammalian cell at a concentration of only 100 molecules. A typical regulatory protein, for example, has to be present in >10,000 copies.

... such that a gene expressing an mRNA with a short half-life can still have important functional effects even though its average mRNA transcript level is below one per cell, since the protein can still be present at several thousand copies.

Your reasoning is incorrect because your facts are wrong.
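To make the arithmetic of the disagreement explicit (my sketch; the figures are the ones quoted in this thread, with "several thousand" read as 5,000):

```python
# One low-abundance mRNA per cell, under each side's translation figure,
# compared against the >10,000-copy threshold Larry gives for a typical
# regulatory protein.

mrnas_per_cell = 1
threshold = 10_000
figures = {"Larry (~100 translations per mRNA)": 100,
           "MartinC ('several thousand')": 5_000}

for label, per_mrna in figures.items():
    copies = mrnas_per_cell * per_mrna
    verdict = "below" if copies < threshold else "at/above"
    print(f"{label}: {copies:,} protein copies ({verdict} the threshold)")
# under either figure, a single mRNA per cell falls short of 10,000 copies
```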

Larry Moran said...

The argument that low transcript levels mean noise is not a convincing one.

I agree 100%. But low abundance is an important bit of information that's consistent with noise. That fact (low abundance) should not be ignored.

I am not arguing that most of those transcripts are functional, let it be clear, but I don't think we should dismiss them without further consideration either.

Let's be real clear on what I'm saying. I'm saying that it is scientifically unethical to claim that transcripts are functional simply because they exist. Ignoring an important counter-argument is not the way good scientists are supposed to behave.

I find it interesting that so few of you have found papers where the issue is treated, correctly, as a controversy.

Why is that?

Georgi Marinov said...

Let's be real clear on what I'm saying. I'm saying that it is scientifically unethical to claim that transcripts are functional simply because they exist. Ignoring an important counter-argument is not the way good scientists are supposed to behave.
And I agree with this 100% too.

I find it interesting that so few of you have found papers where the issue is treated, correctly, as a controversy.

Why is that?

The cynical explanation, as I said above, is that when you have the hottest technology to come out since PCR, and you have spent a good amount of money to do the experiments (because these types of experiments are not cheap yet, although they soon will be), it is somewhat not in your best interest to treat the results as noise. I don't think it is all noise, as I said in previous posts, but I am rationalizing why, if it were noise, it would still be reported as more than that by the authors, who certainly should know better than anybody else what their data tell them. So what you do is say, "Hey, we discovered such and such transcripts, we don't know what they do, but it would be interesting if they turn out to be functional."

The other explanation is that the standards of scientific reasoning one needs to meet in order to get the high-profile publication and the amount of publicity these papers receive aren't that high. It is as much a failure of authors as a failure of editors and reviewers.

Truth be told, I don't recall any of these papers (and I admit that I have yet to find time to read the FANTOM papers in depth) making the grand claim that everything they found is functional; they just do not spend much time talking about the possibility that most of it is noise (which is still not the correct thing to do, of course).

Larry Moran said...

Georgi Marinov says,

Truth be told, I don't recall any of these papers (and I admit that I have yet to find time to read the FANTOM papers in depth) making the grand claim that everything they found is functional, ...

You are correct. None of the papers makes the overt claim that everything is functional. Instead, they state or imply that a large percentage of the non-coding RNAs are functional.

Here's a review published a few weeks ago by John Mattick: The Genetic Signatures of Noncoding RNAs. What do you think of this kind of scientific paper?

Georgi Marinov said...

Well, the facts are facts; the question is how you interpret them when you don't have all the facts you need. If you ask me whether there is a lot of overselling in this article, the answer is yes, I agree with that. But this does not mean that we should automatically switch to the opposite extreme of the opinion spectrum either: that all ncRNA phenomena are products of that queen of the omics sciences, artifactomics.

The correct position, in my opinion, is to admit that there is a lot we don't know and have yet to learn, and then start figuring it out (which we are doing), while at the same time being very careful about how we formulate and communicate our hypotheses about what might be going on to the public (which isn't happening). Because, as is well known, the subtle details of the scientific debate will almost certainly be ignored or misinterpreted.