
Thursday, March 14, 2024

Nils Walter disputes junk DNA: (8) Transcription factors and their binding sites

I'm discussing a recent paper published by Nils Walter (Walter, 2024). He is arguing against junk DNA by claiming that the human genome contains large numbers of non-coding genes.

This is the eighth post in the series. The first one outlines the issues that led to the current paper and the second one describes Walter's view of a paradigm shift/shaft. The third post describes the differing views on how to define key terms such as 'gene' and 'function.' In the fourth post I discuss his claim that differing opinions on junk DNA are mainly due to philosophical disagreements. The fifth, sixth, and seventh posts address specific arguments in the junk DNA debate.


DNA-binding proteins

I did my PhD in a lab that specialized in DNA binding proteins, so I became very familiar with the properties of these molecules. It was a time when the basic properties of DNA binding were being worked out in many labs, and the results of those studies have now been in the textbooks for many decades. Unfortunately, it seems that this kind of hard-core biochemistry is being ignored in most courses these days in favor of much more emphasis on human biochemistry, physiology, and disease.

Several DNA binding proteins have been studied intensely, beginning with the lac repressor in a classic series of papers by Lin and Riggs (Lin and Riggs, 1972; Riggs, Lin, and Wells, 1972; Lin and Riggs, 1975). I covered the details in previous posts but the take-home lesson is that DNA binding proteins interact both with DNA in general (non-specific binding) and with small, specific DNA binding sites (specific binding). The kinetics of the interactions and the association constants of the equilibria dictate that at any one time most DNA binding proteins will be bound non-specifically and only a small fraction will be bound to their specific binding site [DNA binding proteins].
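Here's a minimal sketch of that equilibrium argument in Python. It is not taken from the Lin and Riggs papers. It treats specific and non-specific sites as two independent classes of binding sites and solves the mass balance for free protein by bisection. Every number in the example (copy numbers, cell volume, and both association constants) is an illustrative assumption chosen only to show the order-of-magnitude behavior, and the function names are hypothetical.

```python
AVOGADRO = 6.022e23  # molecules per mole


def molar(copies, volume_litres):
    """Convert a copy number in a compartment to a molar concentration."""
    return copies / (AVOGADRO * volume_litres)


def partition(p_total, s_total, n_total, k_spec, k_nonspec):
    """Split a DNA binding protein among free, specific, and non-specific states.

    Concentrations are molar and association constants are in M^-1.
    The mass balance p_free + bound_specific + bound_nonspecific = p_total
    increases monotonically with p_free, so bisection finds the solution.
    """
    def bound(p_free):
        spec = s_total * k_spec * p_free / (1 + k_spec * p_free)
        nonspec = n_total * k_nonspec * p_free / (1 + k_nonspec * p_free)
        return spec, nonspec

    lo, hi = 0.0, p_total
    for _ in range(200):
        mid = (lo + hi) / 2
        spec, nonspec = bound(mid)
        if mid + spec + nonspec > p_total:
            hi = mid
        else:
            lo = mid
    p_free = (lo + hi) / 2
    spec, nonspec = bound(p_free)
    return {
        "free": p_free / p_total,
        "specific": spec / p_total,
        "nonspecific": nonspec / p_total,
        "site_occupancy": spec / s_total,
    }


# ASSUMED values for illustration only (not measured data):
# 10 protein molecules, 1 specific site, and 4.6 million non-specific sites
# in a 1 femtolitre cell, with a 10^8-fold preference for the specific site.
CELL_VOLUME = 1e-15  # litres
result = partition(
    p_total=molar(10, CELL_VOLUME),
    s_total=molar(1, CELL_VOLUME),
    n_total=molar(4.6e6, CELL_VOLUME),
    k_spec=1e12,
    k_nonspec=1e4,
)
for state, fraction in result.items():
    print(f"{state:>15}: {fraction:.3f}")
```

With these assumed numbers the specific site is occupied almost all the time, yet most of the protein molecules are sitting on non-specific DNA. The sheer excess of non-specific sites outweighs the enormous difference in affinity.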

The distribution of RNA polymerase in E. coli cells is a little more complex because RNA polymerase remains bound to the gene it's transcribing for a very long time as it moves from the promoter to the termination site. Nevertheless, data from many years ago show that a substantial percentage of RNA polymerase molecules are bound non-specifically at any one time. I'm including a figure from my textbook to illustrate this distribution.

This has enormous consequences for the binding of transcription factors in eukaryotic cells since the amount of DNA is huge compared to the target sites. This means that a large number of these transcription factors will be bound non-specifically at any one time. This problem was addressed by Lin and Riggs (1975) who calculated the enormous difficulty that a small number of lac repressor molecules would have finding an operator sequence in a human lymphocyte nucleus.
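The scaling problem can be sketched with a simple approximation. This is my own illustration, not the calculation in Lin and Riggs (1975). It assumes that non-specific DNA is in vast excess, that every base pair starts a non-specific site, and that the single specific site doesn't noticeably deplete the protein pool. The genome sizes, compartment volumes, and association constants below are assumptions for illustration, and the function names are hypothetical.

```python
AVOGADRO = 6.022e23  # molecules per mole


def operator_occupancy(n_proteins, genome_bp, volume_litres, k_spec, k_nonspec):
    """Approximate occupancy of one specific site when non-specific DNA is in excess."""
    to_molar = 1.0 / (AVOGADRO * volume_litres)
    p_total = n_proteins * to_molar
    dna_sites = genome_bp * to_molar
    # Non-specific DNA soaks up protein and lowers the free concentration.
    p_free = p_total / (1 + k_nonspec * dna_sites)
    return k_spec * p_free / (1 + k_spec * p_free)


def proteins_needed(target, genome_bp, volume_litres, k_spec, k_nonspec):
    """Copy number required to reach a target occupancy under the same approximation."""
    to_molar = 1.0 / (AVOGADRO * volume_litres)
    p_free = target / (k_spec * (1 - target))
    p_total = p_free * (1 + k_nonspec * genome_bp * to_molar)
    return p_total / to_molar


# ASSUMED, illustrative parameters (not measured values):
K_SPEC, K_NONSPEC = 1e12, 1e4                     # association constants, M^-1
BACTERIUM = dict(genome_bp=4.6e6, volume_litres=1e-15)
NUCLEUS = dict(genome_bp=6.4e9, volume_litres=2e-13)  # diploid genome, 200 fL nucleus

for name, compartment in (("bacterium", BACTERIUM), ("nucleus", NUCLEUS)):
    occupancy = operator_occupancy(10, k_spec=K_SPEC, k_nonspec=K_NONSPEC, **compartment)
    needed = proteins_needed(0.9, k_spec=K_SPEC, k_nonspec=K_NONSPEC, **compartment)
    print(f"{name}: occupancy with 10 molecules = {occupancy:.2f}, "
          f"molecules for 90% occupancy = {needed:.0f}")
```

The absolute numbers depend entirely on the assumed constants, so they shouldn't be read as a reproduction of any published estimate. The point is the trend: the same ten molecules that keep a site occupied in a bacterium leave it mostly empty in a nucleus, and restoring the occupancy takes far more copies of the protein.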

Similar calculations were performed in computer simulations by my fellow graduate student, Keith Yamamoto, using the estrogen receptor as a model. He calculated that you would need about 10,000 estrogen receptor molecules in each nucleus in order to activate the appropriate genes. That's because most of them would be bound non-specifically or to randomly occurring non-productive sites identical to the specific binding sites. He was purifying the estrogen receptor at the time and the yield he was seeing was consistent with this calculation (Yamamoto and Alberts, 1974; Yamamoto and Alberts, 1975).
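The point about randomly occurring sites is easy to check with a back-of-the-envelope calculation. The sketch below is not Yamamoto's simulation. It just counts the expected number of exact matches to one fixed sequence in random DNA, assuming equal base frequencies and independent positions (which real genomes only approximate) and assuming a non-palindromic site so that both strands can be counted separately. The genome size is a round number and the function name is hypothetical.

```python
def expected_chance_sites(genome_bp, site_length, both_strands=True):
    """Expected number of exact matches to one fixed sequence in random DNA."""
    positions = genome_bp - site_length + 1   # possible start positions
    per_position = 0.25 ** site_length        # chance of a match at one position
    strands = 2 if both_strands else 1
    return strands * positions * per_position


HUMAN_GENOME_BP = 3.1e9  # approximate haploid genome size (assumed round number)

for length in (6, 8, 10, 12, 16):
    count = expected_chance_sites(HUMAN_GENOME_BP, length)
    print(f"{length:>2} bp site: about {count:,.0f} chance matches")
```

Under these assumptions a 6 bp site turns up by chance more than a million times in a genome this size, and even a 16 bp site is expected to occur about once. Typical transcription factor binding sites are short, so perfect matches that have nothing to do with regulation are unavoidable.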

This is all covered in Chapter 8 of my book in a section titled "On the importance of DNA-binding proteins." The important properties of DNA-binding proteins have been confirmed time and time again in the 50 years since they were first discovered. Knowledgeable scientists know that large eukaryotic genomes will be littered with transiently bound transcription factors and other DNA binding proteins such as RNA polymerase. Some of these will trigger transcription at inappropriate sites that are unrelated to biologically relevant promoters, leading to a low level of spurious transcription throughout the genome.

This spurious transcription activity has been documented in several species and so has the transcription of random segments of DNA inserted into cells.

For this reason, many of us were shocked when ENCODE published their results in 2012 claiming that every transcription factor binding site was a true regulatory sequence. That didn't make any sense and the conclusion was immediately challenged. That's why the ENCODE researchers retracted their claim and now refer to these sites as "candidate" sites. They recognize that they need to offer further evidence that these sites are functional and not accidental.

Questioning the biochemistry

Nils Walter addresses this issue in a section of his paper titled "Transcription factors bind to random genome sequences!" I have to quote the entire first paragraph because I'm not sure I could do it justice by paraphrasing.

Transcription factors bind to random genome sequences!

One final argument of geneticists is the suggestion that many random pieces of DNA can promote transcription by recruiting transcription factors locally.[75] However, rarely is the transient binding of a single transcription factor sufficient to recruit an RNA polymerase molecule to a transcription site. Rather, a combinatorial cooperation between cis-regulatory sequence elements in the genome, trans-acting transcription factors and signaling molecules, and gene-distal, but cis-acting ncRNA enhancer transcripts is needed to initiate directional transcription events that govern the tissue-specific, spatiotemporally controlled expression dynamics of genes (Figure 3B).[98] Consequently, the still poorly understood constant spatial reorganization of chromosomes in the densely packed nucleus—guided by a plethora of enhancer lncRNAs (Figure 1)—is both the result of and prerequisite for correct transcriptional programs that allow for the plasticity and adaptability of the semi-autonomous gene expression observed in each individual cell of a multicellular eukaryotic organism.[99]

I don't know whether this is intentional obfuscation or just poorly explained. The issue is how many regulatory sites there are in the human genome. The original ENCODE report in 2012 claimed that 80% of our genome was functional and this included 8.5% devoted to regulatory sites as identified by transcription factor binding or open chromatin domains. They suggested that once they look at more transcription factors, the amount of DNA devoted to regulation could be at least 20%. [What did the ENCODE Consortium say in 2012?] The hype over the past few years has emphasized that there could be one million regulatory sites in the human genome.

The argument against this view is that a huge majority of those sites are not true regulatory sites. Instead, they are spurious binding sites just like the ones predicted 50 years ago. Walter doesn't confront that controversy. He briefly mentions the argument of the "geneticists" (i.e. biochemists) then moves on to a description of complex transcription initiation sites involving enhancer RNAs (eRNAs). That's a different issue. We still want to know how many of these sites are present and what percentage of the genome they occupy.

He says there are hundreds of thousands of these sites but he offers no evidence to support this claim and he makes no attempt to deal with the fact that "transcription factors bind to random genome sequences"—the title of this section of his essay.

So, what's the point? It has something to do with his model of transcription regulation as shown in Figure 3 of the paper. I'll reproduce it below.

This is a complicated model but the main point seems to be that humans have many regulatory RNAs that have evolved from transposons. These RNAs are required for complex transcription initiation.

Nils Walter is perfectly entitled to present his personal model of transcription initiation. The next step is to develop tests of that model to see if it is supported by evidence or refuted by evidence. That's normally the way science works. But this seems to be different. It seems like Walter is giving us this model as "evidence" that there must be abundant regulatory RNAs in spite of data that seems to refute that idea. In other words, he seems to be begging the question.

This is how I interpret his closing paragraph (below) in a section that was supposed to review the evidence for abundant regulatory sites in light of data that many of them are artifacts.

In this modern view of eukaryotic gene expression, only those transcription events will occur that are sufficiently robustly proofread by a sequence of kinetically controlled, reversible assembly events that have to enhance each other and outcompete a vast number of possible alternative events.[26] In the resulting holistic model (Figure 3B), the significant number of defined transcripts detected by ENCODE then become a signature of select cellular processes that are allowed to proceed among a much larger number of possible transcripts. While we still do not understand the phenotypic functions of a majority of these primarily non-protein coding RNAs, we have to assume that the likelihood is high for eventually finding many functions that evolution has preserved across the many generations of individuals. Collectively, all these organisms and their cells were exposed to a vast array of rapidly changing environmental conditions, which imprinted on their genome and were inherited by the following generations.

Here's what I think he's saying. He's agreeing that spurious transcription is a problem but he argues that humans (and other mammals?) had to evolve sophisticated mechanisms to overcome this problem so that only the proper genes are transcribed at significant rates. Part of that sophisticated mechanism requires hundreds of thousands of regulatory RNAs.

I don't find this very helpful. The expression of dozens and dozens of human genes has been examined in considerable detail and, as far as I know, with very few exceptions, their transcription can be adequately explained by transcription factors binding upstream of the promoter and interacting with RNA polymerase complexes to form an initiation complex. Scientists have taken these regulatory regions, fused them to reporter genes such as E. coli β-galactosidase, and reinserted them into mammalian cells to show that the regulatory sequences behave as predicted. In some cases, the constructs were used to create transgenic mice to show that they worked in whole organisms when inserted randomly in the genome. My colleagues and I did this in 1989 using 550 bp of the heat-shock inducible HSP70 regulatory region (Kothary et al., 1989).

There's no obvious reason to assume that the majority of human genes are regulated in a much more complicated manner involving thousands of regulatory RNAs. (I'm not ruling out the occasional exception.) If you are going to advance and defend such a model, I would have expected more data and more critical thinking in an essay on "Are non-protein coding RNAs junk or treasure?" I still don't know how many transcription start sites Nils Walter envisages or how much of the genome they are supposed to occupy. Those seem to be important questions if you are advocating that most of our genome is functional.


Kothary, R., Clapoff, S., Darling, S., Perry, M.D., Moran, L.A. and Rossant, J. (1989) Inducible expression of an hsp68-lacZ hybrid gene in transgenic mice. Development 105:707-714. [PDF]

Lin, S.-y. and Riggs, A.D. (1972) lac repressor binding to non-operator DNA: detailed studies and a comparison of equilibrium and rate competition methods. Journal of Molecular Biology 72:671-690. [doi: 10.1016/0022-2836(72)90184-2]

Lin, S.-y. and Riggs, A.D. (1975) The general affinity of lac repressor for E. coli DNA: implications for gene regulation in procaryotes and eucaryotes. Cell 4:107-111. [doi: 10.1016/0092-8674(75)90116-6]

Riggs, A., Lin, S. and Wells, R. (1972) Lac repressor binding to synthetic DNAs of defined nucleotide sequence. Proceedings of the National Academy of Sciences 69:761-764. [doi: 10.1073/pnas.69.3.761]

Yamamoto, K.R. and Alberts, B. (1974) On the specificity of the binding of the estradiol receptor protein to deoxyribonucleic acid. Journal of Biological Chemistry 249:7076-7086. [PDF]

Yamamoto, K.R. and Alberts, B. (1975) The interaction of estradiol-receptor protein with the genome: an argument for the existence of undetected specific sites. Cell 4:301-310. [doi: 10.1016/0092-8674(75)90150-6]

Walter, N.G. (2024) Are non‐protein coding RNAs junk or treasure? An attempt to explain and reconcile opposing viewpoints of whether the human genome is mostly transcribed into non‐functional or functional RNAs. BioEssays:2300201. [doi: 10.1002/bies.202300201]

1 comment :

SPARC said...

If the binding were as specific and exclusive as Walter and others assume, I wonder if they are unaware that EMSAs and footprinting experiments that employ nuclear extracts require the addition of non-specific competitor nucleic acids to reduce non-specific binding of proteins for which the labeled DNA doesn't contain any binding sites. Maybe they just do in vivo cross-linking, which tells you that a protein was bound to the DNA but not whether the binding was specific.