Wednesday, January 08, 2014

The "duon" delusion and why transcription factors MUST bind non-functionally to exon sequences

This post is about a paper recently published in Science (Dec. 13, 2013) by John Stamatoyannopoulos and his collaborators at the University of Washington in Seattle, Washington, USA.

Stergachis, A.B., Haugen, E., Shafer, A., Fu, W., Vernot, B., Reynolds, A., Raubitschek, A., Ziegler, S., LeProust, E.M., Akey, J.M. and Stamatoyannopoulos, J.A. (2013) Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution. Science 342:1367-1372. [doi: 10.1126/science.1243490] [Abstract] [PDF]

Stamatoyannopoulos is one of the ENCODE workers. He recently gave a talk at the University of Toronto where he defended the idea that pervasive transcription and pervasive transcription factor binding are evidence of widespread function in the human genome. This paper looks at transcription factor binding sites in exon sequences (coding sequences) and finds lots of them. What this means is that stretches of coding region contain codons AND transcription factor binding sites (duh!).

This is such an important discovery (not!) that Stergachis et al. coined a new word, "duons," to describe sequences that have two meanings. The ridiculous hype over this paper is covered in a separate post []. Here, I want to look at the science.

Let's start by reviewing what we know about DNA binding proteins. Some of these proteins bind to specific sequences in DNA. The classic examples are the restriction enzymes (restriction endonucleases) produced by various bacterial species to protect themselves against invasion by foreign DNA. These enzymes recognize short sequences of DNA. They bind and cleave the DNA by cutting both strands [see Restriction, Modification, and Epigenetics].

Typical DNA binding proteins recognize specific sequences of about six base pairs. The restriction enzyme EcoR1, for example, binds to the sequence GAATTC. This sequence will occur quite often in any random stretch of DNA. You can calculate the frequency by determining the probability of GAATTC: it's 1 in 4^6, or one in 4096 base pairs. What this means is that EcoR1 will bind to any DNA about once every 4000 bp (4 kb).1
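The arithmetic can be checked with a quick simulation. This is only a minimal sketch: it assumes all four bases are equally likely and independent, which real genomes are not, and the helper names (`expected_spacing`, `count_sites`, `random_dna`) are made up for illustration.

```python
import random

def expected_spacing(site_length: int) -> int:
    """Average spacing (bp) between copies of one exact site in random DNA."""
    return 4 ** site_length

def count_sites(sequence: str, site: str) -> int:
    """Count occurrences of `site` in `sequence`, allowing overlaps."""
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count

def random_dna(length: int, seed: int = 0) -> str:
    """Random DNA with equal, independent base frequencies (an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))

length = 1_000_000
observed = count_sites(random_dna(length), "GAATTC")
print("expected spacing:", expected_spacing(6), "bp")
print("expected sites in 1 Mb:", round(length / expected_spacing(6)))
print("observed sites in 1 Mb of random DNA:", observed)
```

The observed count should land near the expected 244 sites per megabase, give or take sampling noise.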

Back in the olden days, before DNA sequencing became cheap and easy, we used to construct restriction maps of DNA to define genes. Here's an example from a paper we published over thirty years ago. It shows the DNA binding sites of various restriction enzymes on Drosophila melanogaster DNA clones containing hsp70 heat shock genes (Moran et al. 1979).

The important point here is that none of these bacterial enzymes will ever see Drosophila DNA outside of the laboratory but because of their binding properties they recognize their specific binding sequence whenever they encounter it. I could have done the same experiment using transcription factors. If I had several dozen transcription factors from human cells, I could have mapped their binding sites on my Drosophila DNA and made a figure just like the one shown above. Of course, none of those binding sites would be biologically relevant since the binding of a human transcription factor to fruit fly DNA isn't ever going to happen in the real world.

The coding regions of the genes are shown by the solid black bars in the figure. (These genes have no introns.) Note that the restriction enzyme binding sites are distributed fairly randomly but many of them bind to the coding region. What this means is that certain sequences in the coding region have a dual "meaning." Not only do they specify codons, they also specify the binding site for a restriction endonuclease. I suppose we could have made a big deal of this back in 1979 and called those sequences "duons" but I doubt very much this would have got past the reviewers. It's too obvious and it's not biologically relevant.

Same with transcription factor binding sites. If I had published a map of human transcription factor binding sites on Drosophila DNA nobody would think this remarkable enough to coin a new word for codons that are also transcription factor binding sites.

This brings us back to the Stergachis et al. (2013) paper. What they did was to map the sites in human DNA that bind human transcription factors. Recall that there have to be lots of these sites in our genome since most transcription factors recognize specific sequences of about 6 bp. Like restriction enzyme binding sites, these sequences will occur once every 4 kb (4000 bp) in random sequences of DNA. The human genome is not exactly random DNA but it's close enough for our purposes. For any given transcription factor, there will be about 800,000 binding sites in the human genome. Only a tiny percentage of these sites will be biologically relevant, leading to regular transcription of a nearby gene.
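Here is the back-of-the-envelope version of that estimate. It is a sketch under stated assumptions: a haploid genome of roughly 3.2 billion bp (a round figure not given in the post), equal base frequencies, and exact matches on one strand only. `expected_sites` is an illustrative name.

```python
GENOME_SIZE_BP = 3_200_000_000  # assumed round figure for the haploid human genome

def expected_sites(genome_size_bp: int, site_length: int) -> float:
    """Expected number of exact matches to one site of the given length."""
    return genome_size_bp / 4 ** site_length

# Longer recognition sequences match far less often by chance.
for k in (6, 7, 8):
    print(f"{k} bp site: ~{expected_sites(GENOME_SIZE_BP, k):,.0f} expected matches")
```

For a 6 bp site this gives about 780,000 matches, which is where the "about 800,000" figure comes from. Each extra base of specificity cuts the expectation fourfold.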

Stergachis et al. did not map transcription factor binding sites to naked human DNA. Instead, they mapped the binding sites in vivo which means that a given cell type had to produce the transcription factor and the DNA binding site had to be accessible. The latter distinction is important because a lot of our DNA is tightly bound to nucleosomes to make chromatin and in higher order chromatin structures the naked DNA is not "visible" to DNA binding proteins. When a gene is active, the chromatin opens up to form an "open" chromatin region where the DNA is exposed to transcription factors and RNA polymerase transcription complexes.

What this means is that only a subset of possible transcription factor binding sites can be detected in any given cell type. Many of these will be in or near active genes where the chromatin is in an "open" conformation. This includes protein coding genes and coding regions.

Stergachis et al. (2013) found a total of 11,588,043 transcription factor binding sites in 81 different cell types. The average was 1,018,514 different binding sites in a typical cell. They found a total of 24,842 binding sites within protein-coding exons. This corresponds to 1.8% of the total. In other words, 98.2% of the binding sites were in noncoding DNA and 1.8% were in coding DNA. This is pretty close to the distribution of coding and noncoding DNA in the genome suggesting strongly that the method is detecting random non-functional binding of transcription factors.

I conclude that the authors are detecting transcription factors binding to non-functional sites within coding regions. The vast majority of these sites have no biological relevance. They simply reflect the occurrence of fortuitous binding sites in the genome that just happen to match the specific binding site consensus sequence. This is not a big deal. In fact, it is predicted simply on the basis of our understanding of DNA binding proteins.

The authors do not address this possibility in their paper. Instead, they conclude ...
Our results indicate that simultaneous encoding of amino acid and regulatory information within exons is a major functional feature of complex genomes. The information architecture of the received genetic code is optimized for super-imposition of additional information, and this intrinsic flexibility has been extensively exploited by natural selection. Although TF binding within exons may serve multiple functional roles, our analyses above is agnostic to these roles, which may be complex.
While the authors are entitled to their opinion, they are NOT entitled to ignore other possible interpretations of their data. Especially since, in this case, the other interpretation contradicts the main conclusions of the paper. This is not how science should be done. This paper should never have been published as it is. The reviewers should be named and shamed. The editor(s) at Science should be fired.2

1. It's actually a bit more complicated than that. Most DNA binding proteins bind nonspecifically to any piece of DNA. It's not just part of the intrinsic affinity for a negatively charged double-helix, it's also biologically relevant since these proteins usually bind weakly to DNA and then scan the DNA in one dimension looking for their specific binding site [see Slip Slidin' Along - How DNA Binding Proteins Find Their Target]. This kind of binding would not show up in the methods used by Stergachis et al. However, specific DNA binding proteins will recognize sequences that are closely related, but not identical, to their binding site and these interactions could be detected. EcoR1, for example, will bind with appreciable affinity to sites that differ by just one base pair from GAATTC.

2. No editor at Science should be unaware of the controversy surrounding the ENCODE publicity fiasco of 2012. Thus, editors should be on high alert every time they receive a paper from another ENCODE lab. They should go out of their way to choose reviewers who have been critical of the claims of pervasive functionality.

[Image Credit: Moran, L.A., Horton, H.R., Scrimgeour, K.G., and Perry, M.D. (2012) Principles of Biochemistry 5th ed., Pearson Education Inc. page 581 [Pearson: Principles of Biochemistry 5/E] © 2012 Pearson Education Inc.]

Moran, L., Mirault, M.-E., Tissières, A., Lis, J., Schedl, P., Artavanis-Tsakonas, S., and Gehring, W.J. (1979) Physical map of two D. melanogaster DNA segments containing sequences coding for the 70,000 dalton heat shock protein.

Stergachis, A.B., Haugen, E., Shafer, A., Fu, W., Vernot, B., Reynolds, A., Raubitschek, A., Ziegler, S., LeProust, E.M., Akey, J.M. and Stamatoyannopoulos, J.A. (2013) Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution. Science 342:1367-1372. [doi: 10.1126/science.1243490] [Abstract] [PDF]


  1. Although TF binding within exons may serve multiple functional roles, our analyses above is agnostic to these roles

    So you have no evidence of functionality. IOW, you got squat.

  2. Let's run a few numbers. If a TF recognizes 6 bp, then it will occur at random every 4000 bp, as Larry pointed out.

    Let's assume these guys tested all coding exons. Say there are 20,000 coding genes [leaving out RNA genes] and the average is, say, 1500 bp. That's 30 million bps.

    If a TF recognizes 6 bp and would bind at random every 4000 bps, they should have found 30 million/4,000 = 7500 binding sites per TF they tested.

    In fact they found 24,842 TF binding sites, about 3.3 times more than you would expect at random for one TF that recognizes 6 bps.

    So how many TF's did they test for?

    This result would be affected by:

    1. If they did not test all coding exon regions;

    2. If their TF's recognize more than 6 bps (in which case the number of random hits should go down.)

    Anyone have stats on this? Georgi perhaps?
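The estimate in this comment can be written out as a short calculation. The gene count, average coding length and 4000 bp spacing are the commenter's own round numbers, not measured values, and the variable names are illustrative only.

```python
N_GENES = 20_000         # commenter's round number of protein-coding genes
MEAN_CODING_BP = 1_500   # commenter's assumed average coding length per gene
SPACING_BP = 4_000       # one chance match per ~4000 bp for a 6 bp site
OBSERVED_SITES = 24_842  # exonic binding sites quoted from the paper

coding_bp = N_GENES * MEAN_CODING_BP          # total coding sequence
expected_per_tf = coding_bp / SPACING_BP      # chance matches for a single TF
ratio = OBSERVED_SITES / expected_per_tf      # observed vs. one-TF expectation

print(f"coding sequence: {coding_bp:,} bp")
print(f"expected chance sites per TF: {expected_per_tf:,.0f}")
print(f"observed / expected: {ratio:.2f}")
```

The ratio depends entirely on the assumed inputs, which is the commenter's point: it changes if fewer exons were tested or if the factors recognize more than 6 bp.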

    1. Typically, one does not calculate the expected number of occurrences based on the k-mer length of the recognition sequence, but in bits based on PWMs (Position Weight Matrices). And you do not do this against a uniform base composition background model but against whatever the sequence composition of the genome you actually work with is.

      Restriction enzymes are actually different from eukaryote TFs in that respect - they have very strong sequence specificity even if within a short k-mer. All positions carry 2 bits of information in most cases. Transcription regulators differ too between bacteria and eukaryotes - bacterial ones tend to be a lot more specific than eukaryotic ones, which have very loose recognition preferences. Which is very interesting but is a long subject on its own.
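To make the "bits" idea concrete, here is a minimal sketch of information content for a position probability matrix, measured as relative entropy against a background. The uniform background and the `loose_motif` matrix are hypothetical examples invented for illustration, not data from the paper, and the function names are made up.

```python
import math

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background

def information_content(pwm, background=UNIFORM):
    """Total information in bits: sum over positions of sum p * log2(p / bg)."""
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0:  # a zero-probability base contributes nothing
                total += p * math.log2(p / background[base])
    return total

def exact_site_pwm(site):
    """Matrix for a factor that tolerates no mismatches at any position."""
    return [{b: (1.0 if b == base else 0.0) for b in "ACGT"} for base in site]

# Restriction-enzyme-like specificity: 2 bits at each of 6 positions.
strict = exact_site_pwm("GAATTC")

# Hypothetical looser motif: two positions accept either of two bases.
loose_motif = exact_site_pwm("GAATTC")
loose_motif[1] = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
loose_motif[4] = {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5}

for name, pwm in (("strict", strict), ("loose", loose_motif)):
    bits = information_content(pwm)
    print(f"{name}: {bits:.1f} bits, ~1 match per {2 ** bits:,.0f} bp")
```

Against a uniform background, 12 bits corresponds to the familiar one match per 4096 bp. Each bit lost to degeneracy doubles the number of chance matches. With a real genome's base composition as background the numbers shift, as the comment notes.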

      But all of that does not really matter here - there is the distinction between transcribed vs non-transcribed genes that Larry mentioned in the OP, with transcribed genes being more accessible to DNA binding proteins. That would account for the overrepresentation. It is not easy to disentangle cause from effect here though - one can argue that TFs bind to these genes because chromatin is more accessible, but on the other hand, one does expect to see expressed genes being regulated by TFs (because expression in eukaryotes is generally positively driven as you have to overcome the chromatin barrier to start transcribing).

      There is one prediction that can be tested and that can address the functional vs non-functional question (as a statistical trend, not for each individual instance) and it is that highly expressed genes should have more footprints under the "neutral" model. They did look at this - Figure S4 here:

      And they do find correlation between expression levels and the number of footprints.

      This can be further refined into housekeeping and non-housekeeping genes - generally housekeeping genes tend to be subject to less regulatory complexity than tissue-specific genes. They did not look at that and it would be interesting to parse this further.

    2. Thanks for the reply Georgi, but the plots you pointed to are a bit underwhelming.

      highly expressed genes should have more footprints under the "neutral" model.

      I'm not sure I understand that.

      We refer to Fig. S4 here:

      Fig S4-C is really underwhelming. It shows average gene expression for exonic regions in three categories: having 0, 1-4, and 5+ TF binding sites. The boxes are at nearly the same height, and have big error bars. I worry that the binning values could have been chosen to make the effect seem larger.

      S4-C shows a very weak correlation such that more TF binding sites corresponds to slightly more gene expression, which according to Georgi rejects or weakly supports the "neutral" model.

      S4-D is a table of Pearson's r-values, and they're weak, in the range 0.15-0.22. Again, rejects or weakly supports the "neutral" model.
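One way to see how weak those correlations are is to square them. For a simple linear relationship, r squared is the fraction of variance in one variable accounted for by the other. A minimal sketch (`variance_explained` is an illustrative name):

```python
def variance_explained(r: float) -> float:
    """Fraction of variance accounted for by a linear fit with correlation r."""
    return r * r

for r in (0.15, 0.22):
    print(f"r = {r:.2f} -> r^2 = {variance_explained(r):.3f}")
```

So the quoted r-values correspond to roughly 2% to 5% of the variance.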

      I'm not sure I get the principle, though. Do you mean that, if more TF binding were only due to, say, coding genes being in "open" regions of the genome (not bound to chromatin), then we should expect their expression level to be higher?

    3. The "neutral" model is that TFs bind to DNA because they can, not because they are necessarily needed. The expectation would then be that you would see more binding sites in coding sequence in highly expressed genes because highly expressed genes have a more open chromatin structure (because they are transcribed more often) and more binding sites are accessible. It does not have to be a very strong effect though, it's just a general positive correlation you expect to see. Thus my post above.

    4. The expectation would also be that you would see more binding sites in intronic sequence in highly expressed genes. I don't know if anyone has looked at that. Of course, housekeeping genes have fewer and shorter introns so one has to correct for that too. It's something to investigate further.

    5. Where by "binding sites" I mean "occupied binding sites", I always use those interchangeably and it's wrong :(

  3. The case that Stamatoyannopoulos and colleagues make in Science builds on ideas advanced by Tamar Schaap in 1971. They cite the many fine papers of Richard Grantham but, perhaps due to space pressure from the Editors, omit reference to Schaap’s seminal study. His paper in the Journal of Theoretical Biology (32, 293-298), and the work of Grantham et al., form the basis of webpages that deal extensively with the issues raised (e.g. Schaap ).

    Weatheritt and Babu comment on the paper in the same issue of Science. They point out that the ‘duon’ notion refers only to two of the many forms of information that compete for genome space. As to whether these forms of information can “harmoniously exist,” there has long existed evidence on the intragenomic conflicts requiring the “possible tradeoffs” to which they refer. Much of this was dealt with in my textbook – Evolutionary Bioinformatics – which is now in its second edition (2011) - and in my contribution to Lewin’s GENES XI (2014).

    Consistent with the Stamatoyannopoulos thesis, very strong selection acting at synonymous coding positions is becoming more widely recognized, for example in HIV-1 (Mayrose et al. 2013; Forsdyke 2014) and in the fruit fly genome (Lawrie et al. 2013).

    Lawrie et al. (2013) Strong purifying selection at synonymous sites in D. melanogaster. PLOS Genetics 9 e1003527.
    Mayrose et al. (2013) Synonymous site conservation in the HIV-1 genome. BMC Evolutionary Biology 13:164.
    Forsdyke DR (2014) Microbes and Infection (in press; DOI : 10.1016/j.micinf.2013.10.017)

  4. In the paper, they do look at conservation too, including at 4-fold degenerate sites, and that is a strong case for functionality of these binding sites. That said, there is nothing too surprising about regulatory elements located in coding sequence - enhancers located in introns are well known, some of the Pol3 promoters are specified by sequences quite downstream of the transcription start site, etc. etc. From the perspective of the transcription apparatus there is no coding vs noncoding sequence distinction - that is only made much further downstream by the translation machinery.

    The important thing here is that it was explicitly shown (well, OK, not quite, a lot of it is based on footprints and motifs plus some ChIP-seq, but is still strong evidence) that it is TFs that constrain some degenerate positions. Which is by no means a new idea but is still an important contribution.

    How the PR was handled is an entirely different subject...

    1. In the paper, they do look at conservation too ...

      Hmmm .... they find that coding regions of genes are conserved. But they also seem to find that the transcription factor binding sites are "significantly younger;" whatever that means. I've read that section of the paper several times and tried very hard to understand Figure 1 but I don't get it. I guess I'm too stupid to appreciate the sophistication of their assays.

      there is nothing too surprising about regulatory elements located in coding sequence

      Exactly, I was going to bring up Pol III promoters in my next post. We have known for over thirty years that there are transcription factor binding sites within the genes for transfer RNA and 5S RNA. Thus, these genes contain nucleotides that play a dual role, they determine part of the functional region of their RNA products AND they are the sites of transcription factor binding.

      This is in all of the textbooks. Undergraduates learn it in introductory courses. It's not a big deal and nobody has tried to make up a new word (like "duon") to describe this dual function. (And in this case it really is dual function, not just speculation.)

      Georgi, how many of the authors on this paper know about Pol III transcription? Do you think all of them do? Could you have written that paper without mentioning that internal promoters are old hat?

      I don't think I'm being unfair when I criticize ENCODE workers. That doesn't mean that ALL of you are ignorant. :-)

    2. 1) Undergraduates are sometimes taught about the details of Pol III transcription. It's by no means universally taught everywhere - I personally was never taught that.

      2) I can't comment on who knows what. First, I don't personally know the authors of that paper at all. Second, you have to spend a really large amount of time with someone to really get an idea what they know and what they don't. Very few people spend that much time together.

      3) In general, if you want my bleak, cynical, pessimistic view of life, completely unrelated to the subject here, one does not actually need to know much about anything to produce papers in biology today. You only need to know enough to do the research and write it up. The system certainly does not force you to learn much and there are no checks that even whatever little is learned is retained long-term. Unless you invest a great deal of effort on your own, most of it completely unrelated to your research, you leave graduate school knowing a lot less than you did when you entered it, because you will have forgotten a lot of what you learned as an undergrad (I have noticed this many times with myself, to my great dissatisfaction - things I had completely mastered years ago but have never touched since, and of which I now have only a vague recollection) and you will have only learned things in one narrow area. And you can actually be very successful following this model, while trying to keep up with the literature on a wide variety of topics and broaden your horizons by exploring other fields can actually hurt you, because it's time not spent directly doing research. Note that when I say this, I do not have anyone specific in mind, it's just how the system is set up.

  5. Nice analogy with restriction enzymes. So I guess we need someone to do the following experiment: make a transgenic mouse line that expresses a bacterial TF with no vertebrate counterpart (say, a sigma factor) and then do some ChIP-seq analyses. As you say, I bet they should find thousands of totally spurious binding sites wherever there is accessible chromatin, in a pattern resembling the binding of mammalian TFs. But who would have the money (and the courage) to do this?

    1. Everyone can do it - it's a <$2K experiment. You don't have to make a transgenic mouse - just express a GFP-tagged version of it in a cell line and ChIP.

      It has indeed not been done and the reason it has not been done is that everyone knows you will get a positive result. This is the same principle that ZFN, TALEN and CRISPR-based genome engineering technologies use and they work quite well.

    2. Ok, I agree that it would be cheaper with a cell line, it's just that usually I don't find work with cell lines very reliable :) and transgenic mouse lines could be deemed more "physiological". And yes, everyone knows the results will be "positive", perhaps that's why it's not done. Imagine if the binding pattern of sigma32 in neurons or pancreas were qualitatively similar to the endogenous binding of Otx2 or Ngn3? ENCODE people would find it quite difficult to explain.

    3. 1) There's been an entirely unnecessary demonization of ENCODE people - they happen to be a lot more reasonable than what you might be led to believe by what goes on in the blogosphere. What I would classify as really egregious claims (because they are backed by a long history of repeatedly making them before that) have been coming from people who are not part of ENCODE.

      2) I don't see why anything would be difficult to explain. Current thinking is that you can separate TFs into two broad categories - pioneer factors that can bind to compact chromatin and open it, and others that preferentially bind to open chromatin. There is a flaw in this, which is that the pioneer factors are not much different in their sequence preferences from the others - it is still 6-8 bp motifs and they do not open the chromatin everywhere those sequences are found, so there has to be another source of specificity, whether it is combinatorial occupancy or something else, but the distinction is useful for the discussion here. There is no histone code in bacteria because there are no histones; thus it is unlikely that sigma32 would act as a pioneer factor. Therefore the expectation is that it will bind to regions of already open chromatin that contain the recognition sequence.

    4. Don't want to demonise anyone. Let me be more precise: I find ChIP-seq work on TF binding and chromatin marks wonderful in the sense of opening up hypotheses about how gene regulation works. What I don't like in ENCODE and similar works is the automatic link between binding event/chromatin modification and biological function. In this regard, observing a bacterial TF binding in thousands of places in a mammalian genome would illustrate that the binding/function link should not be made until one gets more evidence, and that's what I find would be difficult to reconcile with the conclusions of ENCODE-like papers. I don't think ENCODE people are being dishonest or anything, it's just an interesting scientific debate that needs to be had. About pioneer vs. "conventional" factors, it's true a bacterial TF should behave more like a "conventional" TF, binding previously open chromatin (including exons). It would be great to check this.

    5. It has been checked but not exactly in the way you want it. People have been expressing exogenous DNA binding proteins in eukaryotic cells for a very long time, and they do bind to DNA. It's just that nobody has expressed a bacterial TF and then done ChIP-seq on it.

  6. Top 10 alternate meanings for the ENCODE acronym:

    #10 Enabling Numerous Claims Of Designer(ID) Efficiency

    #9 I can't actually think of any more. I suppose this is why I'm not a comedy writer

  7. Just out of curiosity: How reliable a method is genomic footprinting considered nowadays? Figure S3 in the supplement suggests that DNase-seq yields clear signals with full protection of the occupied TF binding sites. However, I wonder if full signals are displayed or if some lower part of the signal has been omitted. From what I remember of the days when genomic footprinting was done by running genomic DNA on sequencing gels, blotting, and hybridising with radioactive single-strand probes, and of the later PCR-based methods, genomic footprinting is quite a fuzzy business. Even in in vitro footprinting, DNase I treatment is critical because the result depends heavily on the concentration of the enzyme and the duration of the treatment. That is, one will either miss binding sites due to over-digestion or mistake DNase I-resistant sequences for sites bound by a TF.

  8. Larry Moran:
    "This corresponds to 1.8% of the total. In other words, 98.2% of the binding sites were in noncoding DNA and 1.8% were in coding DNA"

    And yet we distinguish between DNA sequence: exon vs. intron. What's the difference?

    Larry Moran
    "Undergraduates learn it in introductory courses. It's not a big deal and nobody has tried to make up a new word (like "duon") to describe this dual function."

    You always seem to overlook that the main advancement made with these genome-wide analyses is one of extent. Yes, we've known about TFs in exons but now we're getting an accurate sense of scale and pervasiveness across the genome. And just because the analysis and methods used in these papers are over your head doesn't mean that they're wrong.

    1. I'm happy to concede that the data is correct. It's the interpretation that's wrong.

      Genome-wide analyses may uncover something new but when authors claim that their genome-wide experiments are showing something that four decades of work with individual genes never showed, I have a right to be skeptical.

  9. I think the problem with the paper can be summarized by noting that they did not show that TFs actually bind to these duons and consequently didn't show such binding had any effect on gene expression. They only showed that TF binding sites exist inside coding regions of genes, which is not news by any stretch as mentioned repeatedly above.

    A ChIP-Seq experiment involves pulling down a particular factor and sequencing the DNA that comes along with it, then mapping it to the genome. If they had done that and shown that significant peaks of ChIP-Seq data overlap the TF recognition site within an exon, they would have shown that the TF binds to this intra-exonic region, but it would still say nothing about the effects of this binding on the expression of the gene. To show this, they could split the sample such that some of it is analyzed for expression and then compare the two data sets.

    Personally, I wouldn't dismiss the possibility of seeing misregulation of gene expression based on TF binding to intra-exonic regions, but this just hasn't been shown yet, certainly not in the ENCODE paper.

    Since they did none of the experiments I proposed or other similar ones to show actual binding and/or misregulation of gene expression as a result of such binding, I completely agree with the criticism above, but I will say that I don't think it's fair to criticize ENCODE as a whole for the far-reaching and unjustified conclusions of the paper. After all, ENCODE is an enormous project with so many researchers, they can't be held accountable for the works of individual labs or even several labs, least of all for conclusions made in a paper published by one of these labs, just because the authors also happen to be involved in other ENCODE projects. The reviewers of Science deserve most of the criticism for this misleading article and the even more misleading hype around it.