More Recent Comments

Friday, September 27, 2013

Transcription Initiation Sites: Do You Think This Is Reasonable?

I'm interested in how scientists read the scientific literature and in how they distinguish good science from bad science. I know that when I read a paper I usually make a pretty quick judgement based on my knowledge of the field and my model of how things work. In other words, I look at the conclusions first to see whether they conflict with or agree with my model.

Many of my colleagues do it differently. They focus on the actual experiments and reach a conclusion based on how they perceive the data. If the experiments look good and the data seem reliable then they tentatively accept the conclusions even if they conflict with the model they have in their mind. They are much more likely to revamp their model than I am.

I'm about to give you the conclusions from a recently published paper in Nature. I'd like to hear from all graduate students, postdocs, and scientists on how you react to those conclusions. Do you think the conclusions are reasonable (as long as the experiments are valid) or do you think that the conclusions are unreasonable, indicating that there has to be something wrong somewhere?

The paper is Venters and Pugh (2013). Its title is "Genomic organization of human transcription initiation complexes." You don't need to read the paper unless you want to get into a more detailed debate. All I want to hear about is your initial reaction to their final two paragraphs.
Consolidated genomic view of initiation

...The discovery that transcription of the human genome is vastly more pervasive than what produces coding mRNA raises the question as to whether Pol II initiates transcription promiscuously through random collisions with chromatin as biological noise or whether it arises specifically from canonical Pol II initiation complexes in a regulated manner. Our discovery of ~150,000 non-coding promoter initiation complexes in human K562 cells and more in other cell lines suggests that pervasive non-coding transcription is promoter-specific, regulated, and not much different from coding transcription, except that it remains nuclear and non-polyadenylated. An important next question is the extent to which transcription factors regulate production of ncRNA.

We detected promoter transcription initiation complexes at 25% of all ~24,000 human coding genes, and found that there were 18-fold more non-coding complexes than coding. We therefore estimate that the human genome potentially contains as many as 500,000 promoter initiation complexes, corresponding to an average of about one every 3 kilobases (kb) in the non-repetitive portion of the human genome. This number may vary more or less depending on what constitutes a meaningful transcription initiation event. The finding that these initiation complexes are largely limited to locations having well-defined core promoters and measured TSSs indicates that they are functional and specific, but it remains to be determined to what end. Their massive numbers would seem to provide an origin for the so-called dark matter RNA of the genome, and could house a substantial portion of the missing heritability.
Looking forward to hearing from you.

Keep in mind that this is a Nature paper that has been rigorously reviewed by leading experts in the field. Does that influence your opinion?

Venters, B.J. and Pugh, B.F. (2013) Genomic organization of human transcription initiation complexes. Nature Published online 18 September 2013 [doi: 10.1038/nature12535] [PubMed] [Nature]


Matt G said...

I was wondering when you'd get around to this paper!

My first thought was: they seem to assume that because these initiation sites are there, they must have a purpose. It has an almost IDC feel to it.

Bryan said...

Messed up my comment - what I meant to write was:

My knowledge of transcriptional start sites is limited, so this may reflect my ignorance more than anything, but from what I recall:

1) High-level transcription generally requires a large number of sequential (ignoring trans-acting elements) transcription factor binding sites, and
2) Individual transcription factor binding sites are often small and have degenerate sequences

It seems like no stretch to me (assuming my memory serves me correctly about the above facts) that random mutations should occasionally produce DNA sequences in our junk DNA that match these transcription factor binding sites. This would account for a) the high number of non-coding promoter initiation complexes observed in this paper, and b) the "over-abundance" of pervasive, low-level transcription we observe.

This part bugs me: "and could house a substantial portion of the missing heritability"

This seems to go back to the idea that the 18-24K genes we have aren't "enough" to make a human. Basically, a bit of hubris that doesn't pass the onion test...

Georgi Marinov said...

If you just do regular TBP ChIP-seq you get nowhere near 150,000 peaks.

1. wget
2. gunzip -c wgEncodeAwgTfbsSydhK562TbpIggmusUniPk.narrowPeak.gz | wc -l

18K reproducible TBP ChIP-seq sites in K562.

Now, I haven't had time to compare the TBP ChIP-seq and TBP ChIP-exo data in detail, but my alarm bells went off when I read the paper last week - I have a strong suspicion they drastically overcalled. Binding sites are always distributed on a continuum, and you can get as many as you want depending on where you draw the threshold (which does not mean all of them will be real; that's why we do things like IDR). But ChIP-exo is relatively new and there is no well-established way to call ChIP-exo peaks. So the lack of discussion of how exactly those 150,000 sites were called is disturbing.
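The threshold point above can be sketched with a toy simulation (entirely synthetic scores drawn from a long-tailed distribution, not real ChIP data): when binding signal lies on a continuum, the number of "peaks" you call is simply a function of where you cut.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "binding signal" for 1 million candidate sites: one long-tailed
# (exponential) continuum rather than two cleanly separated populations.
# Purely illustrative -- not modeled on any real experiment.
scores = rng.exponential(scale=1.0, size=1_000_000)

# The number of sites "called" depends entirely on the cutoff chosen.
for threshold in (2, 4, 6, 8):
    n_called = int((scores > threshold).sum())
    print(f"threshold {threshold}: {n_called} sites called")
```

Each halving of stringency inflates the call list severalfold, which is why reproducibility criteria like IDR, rather than a raw threshold, are used to decide how deep into the continuum to go.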

That said, the results of the paper are overall by no means surprising and by no means should be interpreted in the way the authors do (i.e., that these things matter) - for all I know, they show evidence for how the loose definition of TSSs makes it possible for many spurious initiation sites to exist in a large genome. That's why you probably need additional mechanisms for distinguishing true TSSs from the spurious ones. This paper from a few months ago, and a few similar ones I can't immediately recall, probably point in the direction of what's really making the system work:

Almada, A.E., Wu, X., Kriz, A.J., Burge, C.B. and Sharp, P.A. (2013) Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature 499:360-363.

Claudio said...

There's nothing wrong with those two paragraphs. Consider the press release for the paper instead:

Blas said...

"I know that when I read a paper I usually make a pretty quick judgement based on my knowledge of the field and my model of how things work. In other words, I look at the conclusions first to see whether they conflict with or agree with my model. "

So instead of the data, "your model" is the measure of reality.

Unknown said...

A recent paper from Michael Eisen's lab should prove an antidote to the one you've discussed:

Paris et al. (2013). Extensive Divergence of Transcription Factor Binding in Drosophila Embryos with Highly Conserved Gene Expression.

The authors examined the evolutionary conservation of transcription factor binding sites across four Drosophila species. They found mRNA levels were remarkably well conserved, but transcription factor binding was not. Their conclusion:

"That two thirds of the regions bound by any of the four TFs we examined are poorly conserved and thus are probably under weak or no purifying selection supports the emerging view that a large fraction of measurable biochemical events are not functional (in contrast to claims made by ENCODE [48])."

Larry Moran said...

Yes. That's exactly what I'm saying. My "model" is based on decades of previous work. I also know from experience that "data" isn't always what it seems.

Why didn't you answer the question?

judmarc said...

Nice, Jonathan. Almost makes one think that nothing in biology makes sense except in the light of evolution - oh, wait.... ;-)

(Hat tip to T. Dobzhansky.)

Anonymous said...

So last question first: the fact that it's Nature does influence my opinion. I'm more likely to accept it.
I could swear I saw this paper 2-3 weeks ago and read the abstract. My impression was that they made a good case for more pervasive functional transcription. I didn't look at the details, but I thought that rather than look at individual binding of proteins, which could be nonspecific, they looked for correlations that would indicate functionality beyond a certain expected level. That seems reasonable to me....and it IS Nature of course.
I skimmed through it to see if it supported the "80% functionality" claim of ENCODE, and though I missed the part you quote above, I got the impression they've only accounted for a small fraction of the genome.
Disclaimer- I was a grad student 13 years ago

Georgi Marinov said...

That paper cannot support or reject an 80% functionality claim (which wasn't even really made in the ENCODE paper, at least in the sense that most people understand functionality; it was all the writing and interviews about the ENCODE paper that did the damage).

It's a TBP ChIP-exo paper. The ENCODE claim is based on vastly more data of many different kinds.

Finally, there is no way that paper could be about "functional" pervasive transcription because there is really almost nothing about function in it to begin with.

John Harshman said...

Well, the way you say it does make it sound like a bad thing. What you're really doing is being a good Bayesian. If your prior probability is low, it takes a lot to raise the posterior probability to any reasonable level. Or, to put it another way, if we have good reasons to believe X is not true, it takes better reasons before we should believe X is true. Or, still another way, extraordinary claims demand extraordinary proof.
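The Bayesian point above can be made concrete with a toy calculation (the numbers are made up purely for illustration): the same moderately strong evidence that makes a mundane claim convincing barely moves an extraordinary one.

```python
# Toy Bayes update: how strong must evidence be to overcome a low prior?
# All numbers here are illustrative, not drawn from any real study.

def posterior(prior, bayes_factor):
    """Posterior probability of a claim, given a prior probability and a
    Bayes factor (how much more likely the evidence is if the claim is
    true than if it is false)."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)

# A "mundane" claim (prior 0.5) vs. an "extraordinary" one (prior 0.001),
# each supported by the same moderately strong evidence (Bayes factor 10):
print(posterior(0.5, 10))    # mundane claim: now well supported (~0.91)
print(posterior(0.001, 10))  # extraordinary claim: still unlikely (~0.01)
```

With a prior of 0.001, you would need a Bayes factor in the thousands before the posterior exceeds even-odds, which is one way of cashing out "extraordinary claims demand extraordinary proof."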

John Harshman said...

Really? The fact that it's in Nature tends to make me *less* likely to accept it. They seem too often to be going for sensationalism at the expense of science.

Mikkel Rumraket Rasmussen said...

Yes, exactly. When you have accumulated a large body of knowledge that all implies that a specific model is accurate, it takes more than a single observation to overturn the whole thing again. We need to make sure the new observation cannot be accounted for under the previous model. This is correct empirical reasoning. The large body of knowledge previously collected justifies remaining skeptical of new contradictory observations until they can be independently verified as truly constituting contradictory evidence.

As you say, extraordinary claims require extraordinary evidence. What determines how "extraordinary" a claim is is whether it contradicts already extremely well established facts.
In this respect, a 'mundane' claim is one that is already compatible with, or supports, the established facts. Such a claim would only require mundane evidence.

Creationists emphatically do NOT understand Bayesian empirical reasoning, which is why they also make horrible skeptics and have so many general issues with science.

In my opinion, Bayesian reasoning as a basis for all empirical reasoning (including how and why expressions like "extraordinary claims require extraordinary evidence" are so important, and what this really means both in theory and in practice) and as a rigorous approach to skepticism should be taught as early as primary school, and should most definitely be part of any and all critical thinking courses.

Anonymous said...

OK, well I thought that finding transcription initiation complexes is a bit more suggestive of function than individual protein binding, most of which is probably nonspecific. It suggests that more locations could be described as functional than are currently annotated, but far short of the ENCODE hyperbole and far short of what the DI would hope for.
I knew the Nature comment wouldn't go over well, but I decided to be honest. I do trust them more, despite the fact that either Nature or Science has published stuff on cold fusion, martian meteorite bacteria, arsenic-containing bacteria.....and I think I remember some homeopathy crap being published in the early 80s.
But wasn't this less about the paper and more of a 'sociology of science' survey by LarryM?

SPARC said...

I am not a specialist in sequence comparisons, but I found the consensus sequences suspicious. They used the following sequences


In addition, they allowed for up to 3 mismatches per site.
I may be wrong, but I guess this means that one can find a TATA-box-like sequence every 128 bp. By defining the spacing between the sites they added some further constraints, though.
However, IIRC it takes even less to obtain basal transcription. E.g., there are TATA-less promoters with or without initiators (INRs); OTOH, INRs are dispensable from TATA-containing promoters. Just combine a single SP1 site in either orientation with a TATA-box or an INR and you will obtain some transcription. I don't have time to re-read the paper, but I guess O'Shea-Greenfield and Smale (1992) had some data on this. BTW, SP1 is quite redundant as well: it's just a string of a few Gs interrupted by a single H, preferably a C. In addition, one should keep in mind that SP1 can be replaced by other TF binding sites. Thus, why should anybody be surprised by spurious background transcription?
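The back-of-envelope estimate above can be checked with a few lines. This sketch assumes, purely for illustration, an 8-bp motif scanned against a random genome with equal base frequencies; the 128-bp figure presumably reflects different motif lengths and spacing constraints.

```python
from math import comb

def match_rate(motif_len, max_mismatches):
    """Expected fraction of positions matching a fixed motif of length
    motif_len with up to max_mismatches mismatches, in a random genome
    with equal base frequencies. Back-of-envelope only."""
    # Number of length-motif_len sequences within max_mismatches of the motif:
    # choose which positions mismatch, times 3 alternative bases at each.
    hits = sum(comb(motif_len, k) * 3**k for k in range(max_mismatches + 1))
    return hits / 4**motif_len

# e.g. an 8-bp TATA-like consensus with up to 3 mismatches allowed:
rate = match_rate(8, 3)
print(f"one match every ~{1 / rate:.0f} bp per strand")
```

Under these toy assumptions a match turns up every few dozen base pairs per strand, so across ~1.5 Gb of non-repetitive sequence the number of chance motif occurrences dwarfs any plausible count of functional promoters.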

Anonymous said...

Biological slop is cell type specific, who cares? When I spill two buckets of paint, each will produce different patterns, but comparing those two patterns doesn't tell me much about the buckets of paint. Unfortunately, I see crap like this get forwarded around among teachers.

Peter said...

My impression is that the very first sentence sets up a false dichotomy. I presume the rest of the paper proceeds to pound the straw man into the ground.

The discovery that transcription of the human genome is vastly more pervasive than what produces coding mRNA raises the question as to whether Pol II initiates transcription promiscuously through random collisions with chromatin as biological noise or whether it arises specifically from canonical Pol II initiation complexes in a regulated manner.

My mental model for biological noise does not involve Pol II binding at random to DNA and initiating transcription. Rather, it assumes that in an extremely long sequence of random nucleotides, there will be a large number of sites that coincidentally resemble canonical Pol II initiation sites and thus acquire many or all of the same modifications as "true" promoter sites. Obviously, therefore, they will show up in ChIP experiments as more or less indistinguishable from the promoters of protein-coding genes.

This experiment therefore adds almost nothing to the sum of human knowledge. Potentially a close study of all the identified transcription start sites would be able to refine our understanding of the minimal sequence elements necessary to create a TSS. It tells you nothing whatsoever about whether the transcripts are functionally relevant to the cell.

whimple said...

It's interesting to hear the authors' opinions. Since the conclusion one way or the other isn't central to either my expertise or to how I design my next experiment, I'm perfectly happy to let the field sort it all out with data, check back in on the evolving story from time to time and get the final word from the field in about 10 years time or so.

Don't confuse a paper in Nature with a chapter in a textbook. Part of the value of these types of provocative (to those in the field) concluding statements is to spur the field on to reproduce/extend/refute the results, which is the essence of the scientific process.

Unknown said...

With ChIP-exo, they're looking for peak pairs, which probably means that you can call many more peaks with the same significance threshold, right? Whether it's meaningful to do so is another question...

One analysis from this paper seems to be missing - how many occurrences of their extended core promoter element motif are in the genome? If it occurs much more frequently than a typical transcription factor motif, then maybe 150,000 peaks wouldn't be so surprising.

Georgi Marinov said...

As I said, I have not yet looked at the data. But there is the basic observation that you don't see nearly as many sites in regular ChIP-seq. That means the order-of-magnitude difference is due to the ability to detect low-signal interactions that fall below the threshold of ChIP-seq. Which is not a bad thing on its own, but it raises the question of how frequent and important they are.

Unknown said...

I guess you were right in having doubts about this paper: it just got retracted this month -