Sandwalk: The "duon" delusion and why transcription factors MUST bind non-functionally to exon sequences

Wednesday, January 08, 2014

This post is about a paper recently published in Science (Dec. 13, 2013) by John Stamatoyannopoulos and his collaborators at the University of Washington in Seattle, Washington, USA.

Stergachis, A.B., Haugen, E., Shafer, A., Fu, W., Vernot, B., Reynolds, A., Raubitschek, A., Ziegler, S., LeProust, E.M., Akey, J.M. and Stamatoyannopoulos, J.A. (2013) Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution. Science 342:1367-1372. [doi: 10.1126/science.1243490] [Abstract] [PDF]

Stamatoyannopoulos is one of the ENCODE workers. He recently gave a talk at the University of Toronto where he defended the idea that pervasive transcription and pervasive transcription factor binding are evidence of widespread function in the human genome. This paper looks at transcription factor binding sites in exon sequences (coding sequences) and finds lots of them. What this means is that stretches of coding region contain codons AND transcription factor binding sites (duh!).

This is such an important discovery (not!) that Stergachis et al. coined a new word, "duons," to describe sequences that have two meanings. The ridiculous hype over this paper is covered in a separate post. Here, I want to look at the science.

Let's start by reviewing what we know about DNA binding proteins. Some of these proteins bind to specific sequences in DNA. The classic examples are the restriction enzymes (restriction endonucleases) produced by various bacterial species to protect themselves against invasion by foreign DNA. These enzymes recognize short sequences of DNA. They bind and cleave the DNA by cutting both strands [see Restriction, Modification, and Epigenetics].

Typical DNA binding proteins recognize specific sequences of about six base pairs. The restriction enzyme EcoR1, for example, binds to the sequence GAATTC. This sequence will occur quite often in any random stretch of DNA. You can calculate the frequency by determining the probability of GAATTC at any given position: it's (1/4)⁶, or one in 4096. What this means is that EcoR1 will bind to any DNA about once every 4000 bp (4 kb).¹
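The arithmetic is easy to check. Here is a minimal sketch in Python, assuming that all four bases are equally likely and independent at every position (real genomes are not quite like that). The function names `site_probability`, `random_dna`, and `count_sites` are made up for this illustration.

```python
import random

def site_probability(site_length: int) -> float:
    """Chance that one specific site starts at a given position in random DNA."""
    return 0.25 ** site_length

def random_dna(length: int, seed: int = 0) -> str:
    """Random DNA with equal base frequencies (an idealization, not a real genome)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def count_sites(dna: str, site: str) -> int:
    """Count every occurrence of `site` in `dna`, including overlapping ones."""
    count = 0
    start = dna.find(site)
    while start != -1:
        count += 1
        start = dna.find(site, start + 1)
    return count

if __name__ == "__main__":
    p = site_probability(6)
    print(f"one site per {1 / p:.0f} bp on average")
    dna = random_dna(1_000_000)
    print(count_sites(dna, "GAATTC"), "GAATTC sites in 1 Mb of random sequence")
```

In a megabase of random sequence the count should land somewhere around 1,000,000/4096, or roughly 244, give or take the usual sampling noise.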

Back in the olden days, before DNA sequencing became cheap and easy, we used to construct restriction maps of DNA to define genes. Here's an example from a paper we published over thirty years ago. It shows the DNA binding sites of various restriction enzymes on Drosophila melanogaster DNA clones containing hsp70 heat shock genes (Moran et al. 1979).

The important point here is that none of these bacterial enzymes will ever see Drosophila DNA outside of the laboratory but because of their binding properties they recognize their specific binding sequence whenever they encounter it. I could have done the same experiment using transcription factors. If I had several dozen transcription factors from human cells, I could have mapped their binding sites on my Drosophila DNA and made a figure just like the one shown above. Of course, none of those binding sites would be biologically relevant since the binding of a human transcription factor to fruit fly DNA isn't ever going to happen in the real world.

The coding regions of the genes are shown by the solid black bars in the figure. (These genes have no introns.) Note that the restriction enzyme binding sites are distributed fairly randomly but many of them bind to the coding region. What this means is that certain sequences in the coding region have a dual "meaning." Not only do they specify codons, they also specify the binding site for a restriction endonuclease. I suppose we could have made a big deal of this back in 1979 and called those sequences "duons" but I doubt very much this would have got past the reviewers. It's too obvious and it's not biologically relevant.

Same with transcription factor binding sites. If I had published a map of human transcription factor binding sites on Drosophila DNA nobody would think this remarkable enough to coin a new word for codons that are also transcription factor binding sites.

This brings us back to the Stergachis et al. (2013) paper. What they did was to map the sites in human DNA that bind human transcription factors. Recall that there have to be lots of these sites in our genome since most transcription factors recognize specific sequences of about 6 bp. Like restriction enzyme binding sites, these sequences will occur once every 4 kb (4000 bp) in random sequences of DNA. The human genome is not exactly random DNA but it's close enough for our purposes. For any given transcription factor, there will be about 800,000 binding sites in the human genome. Only a tiny percentage of these sites will be biologically relevant, leading to regular transcription of a nearby gene.
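The 800,000 figure is just the size of the genome divided by 4096. A rough sketch, assuming a haploid genome of about 3.2 billion bp (a round number assumed here, not a figure from the paper) and one exact 6 bp recognition sequence counted on one strand; `expected_sites` is a hypothetical helper name.

```python
GENOME_SIZE_BP = 3_200_000_000  # assumed haploid human genome size (round number)
SITE_LENGTH = 6                 # recognition sequence length used in the text

def expected_sites(genome_size_bp: int, site_length: int) -> float:
    """Expected number of exact matches to one specific site in random DNA."""
    return genome_size_bp / 4 ** site_length

if __name__ == "__main__":
    n = expected_sites(GENOME_SIZE_BP, SITE_LENGTH)
    print(f"about {n:,.0f} expected sites")
```

With these inputs the estimate is 781,250, which rounds to the 800,000 quoted above. A longer recognition sequence would lower the number and a more tolerant one would raise it.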

Stergachis et al. did not map transcription factor binding sites to naked human DNA. Instead, they mapped the binding sites in vivo which means that a given cell type had to produce the transcription factor and the DNA binding site had to be accessible. The latter distinction is important because a lot of our DNA is tightly bound to nucleosomes to make chromatin and in higher order chromatin structures the naked DNA is not "visible" to DNA binding proteins. When a gene is active, the chromatin opens up to form an "open" chromatin region where the DNA is exposed to transcription factors and RNA polymerase transcription complexes.

What this means is that only a subset of possible transcription factor binding sites can be detected in any given cell type. Many of these will be in or near active genes where the chromatin is in an "open" conformation. This includes protein coding genes and coding regions.

Stergachis et al. (2013) found a total of 11,588,043 transcription factor binding sites in 81 different cell types. The average was 1,018,514 different binding sites in a typical cell. They found a total of 24,842 binding sites within protein-coding exons. This corresponds to 1.8% of the total. In other words, 98.2% of the binding sites were in noncoding DNA and 1.8% were in coding DNA. This is pretty close to the distribution of coding and noncoding DNA in the genome suggesting strongly that the method is detecting random non-functional binding of transcription factors.

I conclude that the authors are detecting transcription factors binding to non-functional sites within coding regions. The vast majority of these sites have no biological relevance. They simply reflect the occurrence of fortuitous binding sites in the genome that just happen to match the specific binding site consensus sequence. This is not a big deal. In fact, it is predicted simply on the basis of our understanding of DNA binding proteins.

The authors do not address this possibility in their paper. Instead, they conclude ...

Our results indicate that simultaneous encoding of amino acid and regulatory information within exons is a major functional feature of complex genomes. The information architecture of the received genetic code is optimized for super-imposition of additional information, and this intrinsic flexibility has been extensively exploited by natural selection. Although TF binding within exons may serve multiple functional roles, our analyses above is agnostic to these roles, which may be complex.

While the authors are entitled to their opinion, they are NOT entitled to ignore other possible interpretations of their data. Especially since, in this case, the other interpretation contradicts the main conclusions of the paper. This is not how science should be done. This paper should never have been published as it is. The reviewers should be named and shamed. The editor(s) at Science should be fired.²

1. It's actually a bit more complicated than that. Most DNA binding proteins bind nonspecifically to any piece of DNA. It's not just part of the intrinsic affinity for a negatively charged double-helix, it's also biologically relevant since these proteins usually bind weakly to DNA and then scan the DNA in one dimension looking for their specific binding site [see Slip Slidin' Along - How DNA Binding Proteins Find Their Target]. This kind of binding would not show up in the methods used by Stergachis et al. However, specific DNA binding proteins will recognize sequences that are closely related, but not identical, to their binding site and these interactions could be detected. EcoR1, for example, will bind with appreciable affinity to sites that differ by just one base pair from GAATTC.
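To get a feel for how much relaxed specificity matters, here is a small sketch that simply counts the sequences differing from GAATTC at exactly one position. It assumes random DNA with equal base frequencies and says nothing about actual binding affinities; `one_mismatch_variants` is a made-up name for this illustration.

```python
def one_mismatch_variants(site: str) -> set[str]:
    """All sequences that differ from `site` at exactly one position."""
    variants = set()
    for i, original in enumerate(site):
        for base in "ACGT":
            if base != original:
                variants.add(site[:i] + base + site[i + 1:])
    return variants

if __name__ == "__main__":
    site = "GAATTC"
    near = one_mismatch_variants(site)
    total = len(near) + 1  # the one-mismatch variants plus the exact site
    print(len(near), "one-mismatch variants")
    print(f"one acceptable sequence per {4 ** len(site) / total:.0f} bp on average")
```

There are 18 such variants, so 19 sequences in all, and one of them turns up about every 216 bp in random DNA instead of every 4096 bp.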

2. No editor at Science should be unaware of the controversy surrounding the ENCODE publicity fiasco of 2012. Thus, editors should be on high alert every time they receive a paper from another ENCODE lab. They should go out of their way to choose reviewers who have been critical of the claims of pervasive functionality.

[Image Credit: Moran, L.A., Horton, H.R., Scrimgeour, K.G., and Perry, M.D. (2012) Principles of Biochemistry 5th ed., Pearson Education Inc. page 581 [Pearson: Principles of Biochemistry 5/E] © 2012 Pearson Education Inc.]

Moran, L., Mirault, M.-E., Tissières, A., Lis, J., Schedl, P., Artavanis-Tsakonas, S., and Gehring, W.J. (1979) Physical map of two D. melanogaster DNA segments containing sequences coding for the 70,000 dalton heat shock protein.

Stergachis, A.B., Haugen, E., Shafer, A., Fu, W., Vernot, B., Reynolds, A., Raubitschek, A., Ziegler, S., LeProust, E.M., Akey, J.M. and Stamatoyannopoulos, J.A. (2013) Exonic Transcription Factor Binding Directs Codon Choice and Affects Protein Evolution. Science 342:1367-1372. [doi: 10.1126/science.1243490] [Abstract] [PDF]

22 comments:

Diogenes, Wednesday, January 08, 2014 2:20:00 PM
Although TF binding within exons may serve multiple functional roles, our analyses above is agnostic to these roles

So you have no evidence of functionality. IOW, you got squat.
Diogenes, Wednesday, January 08, 2014 2:27:00 PM
Let's run a few numbers. If a TF recognizes 6 bp, then it will occur at random every 4000 bp, as Larry pointed out.

Let's assume these guys tested all coding exons. Say there are 20,000 coding genes [leaving out RNA genes] and the average is, say, 1500 bp. That's 30 million bps.

If a TF recognizes 6 bp and would bind at random every 4000 bps, they should have found 30 million/4,000 = 7500 binding sites per TF they tested.

In fact they found 24,842 TF binding sites, about 3.3 times more than you would expect at random for one TF that recognizes 6 bps.

So how many TF's did they test for?

This result would be affected by:

1. If they did not test all coding exon regions;

2. If their TF's recognize more than 6 bps (in which case the number of random hits should go down.)

Anyone have stats on this? Georgi perhaps?
Donald Forsdyke, Wednesday, January 08, 2014 2:29:00 PM
The case that Stamatoyannopoulos and colleagues make in Science builds on ideas advanced by Tamar Schaap in 1971. They cite the many fine papers of Richard Grantham but, perhaps due to space pressure from the Editors, omit reference to Schaap’s seminal study. His paper in the Journal of Theoretical Biology (32, 293-298), and the work of Grantham et al., form the basis of webpages that deal extensively with the issues raised (e.g. Schaap http://post.queensu.ca/~forsdyke/bioinfo3.htm ).

Weatheritt and Babu comment on the paper in the same issue of Science. They point out that the ‘duon’ notion refers only to two of the many forms of information that compete for genome space. As to whether these forms of information can “harmoniously exist,” there has long existed evidence on the intragenomic conflicts requiring the “possible tradeoffs” to which they refer. Much of this was dealt with in my textbook – Evolutionary Bioinformatics – which is now in its second edition (2011) - and in my contribution to Lewin’s GENES XI (2014).

Consistent with the Stamatoyannopoulos thesis, very strong selection acting at synonymous coding positions is becoming more widely recognized, for example in HIV-1 (Mayrose et al. 2013; Forsdyke 2014) and in the fruit fly genome (Lawrie et al. 2013).

Lawrie et al. (2013) Strong purifying selection at synonymous sites in D. melanogaster. PLOS Genetics 9 e1003527.
Mayrose et al. (2013) Synonymous site conservation in the HIV-1 genome. BMC Evolutionary Biology 13:164.
Forsdyke DR (2014) Microbes and Infection (in press; DOI : 10.1016/j.micinf.2013.10.017)
Georgi Marinov, Wednesday, January 08, 2014 2:56:00 PM
In the paper, they do look at conservation too, including at 4-fold degenerate sites, and that is a strong case for functionality of these binding sites. That said, there is nothing too surprising about regulatory elements located in coding sequence - enhancers located in introns are well known, some of the Pol3 promoters are specified by sequences quite downstream of the transcription start site, etc. etc. From the perspective of the transcription apparatus there is no coding vs noncoding sequence distinction - that is only made much further downstream by the translation machinery.

The important thing here is that it was explicitly shown (well, OK, not quite, a lot of it is based on footprints and motifs plus some ChIP-seq, but is still strong evidence) that it is TFs that constrain some degenerate positions. Which is by no means a new idea but is still an important contribution.

How the PR was handled is an entirely different subject...
Unknown, Wednesday, January 08, 2014 2:59:00 PM
Nice analogy with restriction enzymes. So I guess we need someone to do the following experiment: make a transgenic mouse line that expresses a bacterial TF with no vertebrate counterpart (say, a sigma factor) and then do some ChIP-seq analyses. As you say, I bet they should find thousands of totally spurious binding sites wherever there is accessible chromatin, in a pattern resembling the binding of mammalian TFs. But who would have the money (and the courage) to do this?
Anonymous, Wednesday, January 08, 2014 5:25:00 PM
Top 10 alternate meanings for the ENCODE acronym:

#10 Enabling Numerous Claims Of Designer(ID) Efficiency

#9 I can't actually think of any more. I suppose this is why I'm not a comedy writer.
SPARC, Friday, January 10, 2014 12:16:00 AM
Just out of curiosity: How reliable a method is genomic footprinting considered nowadays? Figure S3 in the supplement suggests that DNase-seq yields clear signals with full protection of the occupied TF binding sites. However, I wonder if full signals are displayed or if some lower part of the signal has been omitted? From what I remember from the days when genomic footprinting was done by running genomic DNA on sequencing gels, blotting, and hybridizing with radioactive single-strand probes, and from the later PCR-based methods, genomic footprinting is quite a fuzzy business. Even in in vitro footprinting, DNase I treatment is critical because the result depends heavily on the concentration of the enzyme and the duration of the treatment. I.e., one will either miss binding sites due to over-digestion or consider DNase I-resistant sequences as being bound by a TF.
caynazzo, Friday, January 10, 2014 8:39:00 PM
Larry Moran:
"This corresponds to 1.8% of the total. In other words, 98.2% of the binding sites were in noncoding DNA and 1.8% were in coding DNA"

And yet we distinguish between DNA sequence: exon vs. intron. What's the difference?

Larry Moran
"Undergraduates learn it in introductory courses. It's not a big deal and nobody has tried to make up a new word (like "duon") to describe this dual function."

You always seem to overlook that the main advancement made with these genome-wide analyses is one of extent. Yes, we've known about TFs in exons but now we're getting an accurate sense of scale and pervasiveness across the genome. And just because the analysis and methods used in these papers are over your head doesn't mean that they're wrong.
Rona, Saturday, April 12, 2014 9:11:00 PM
I think the problem with the paper can be summarized by noting that they did not show that TFs actually bind to these duons and consequently didn't show such binding had any effect on gene expression. They only showed that TF binding sites exist inside coding regions of genes, which is not news by any stretch as mentioned repeatedly above.

A ChIP-Seq experiment involves pulling down a particular factor and sequencing the DNA that comes along with it, then mapping it to the genome. If they had done that and shown that significant peaks of ChIP-Seq data overlap the TF recognition site within an exon, they would have shown that the TF binds to this intra-exonic region, but it would still say nothing about the effects of this binding on the expression of the gene. To show this, they could split the sample so that part of it is analyzed for expression and then compare the two data sets.

Personally, I wouldn't dismiss the possibility of seeing misregulation of gene expression based on TF binding to intra-exonic regions, but this just hasn't been shown yet, certainly not in the ENCODE paper.

Since they did none of the experiments I proposed or other similar ones to show actual binding and/or misregulation of gene expression as a result of such binding, I completely agree with the criticism above, but I will say that I don't think it's fair to criticize ENCODE as a whole for the far-reaching and unjustified conclusions of the paper. After all, ENCODE is an enormous project with so many researchers, they can't be held accountable for the works of individual labs or even several labs, least of all for conclusions made in a paper published by one of these labs, just because the authors also happen to be involved in other ENCODE projects. The reviewers of Science deserve most of the criticism for this misleading article and the even more misleading hype around it.