In spite of what you might have read, the human genome does not contain one million functional enhancers.
The Sept. 15, 2022 issue of Nature contains a news article on "Gene regulation" [Two-layer design protects genes from mutations in their enhancers]. It begins with the following sentence.
The human genome contains only about 20,000 protein-coding genes, yet gene expression is controlled by around one million regulatory DNA elements called enhancers.
Sandwalk readers won't need to be told the reference for such an outlandish claim because you all know that it's the ENCODE Consortium summary paper from 2012—the one that kicked off their publicity campaign to convince everyone of the death of junk DNA (ENCODE, 2012). ENCODE identified several hundred thousand transcription factor (TF) binding sites and in 2012 they estimated that the total number of base pairs invovled in regulating gene expression could account for 20% of the genome.
How many of those transcription factor binding sites are functional and how many are due to spurious binding to sites that have nothing to do with gene regulation? We don't know the answer to that question but we do know that there will be a huge number of spurious binding sites in a genome of more than three billion base pairs [Are most transcription factor binding sites functional?].
The scientists in the ENCODE Consortium didn't know the answer either but what's surprising is that they didn't even know there was a question. It never occured to them that some of those transcription factor binding sites have nothng to do with regulation.
Fast forward ten years to 2022. Dozens of papers have been published criticizing the ENCODE Consortium for their stupidity lack of knowledge of the basic biochemical properties of DNA binding proteins. Surely nobody who is interested in this topic believes that there are one million functional regulatory elements (enhancers) in the human genome?
Wrong! The authors of this Nature article, Ran Elkon at Tel Aviv University (Israel) and Reuven Agami at the Netherlands Cancer Institute (Amsterdam, Netherlands), didn't get the message. They think it's quite plausible that the expression of every human protein-coding gene is controlled by an average of 50 regulatory sites even though there's not a single known example any such gene.
Not only that, for some reason they think it's only important to mention protein-coding genes in spite of the fact that the reference they give for 20,000 protein-coding genes (Nurk et al., 2022) also claims there are an additional 40,000 noncoding genes. This is an incorrect claim since Nurk et al. have no proof that all those transcribed regions are actually genes but let's play along and assume that there really are 60,000 genes in the human genome. That reduces the average number of enhancers to an average of "only" 17 enhancers per gene. I don't know of a single gene that has 17 or more proven enhancers, do you?
Why would two researchers who study gene regulation say that the human genome contains one million enhancers when there's no evidence to support such a claim and it doesn't make any sense? Why would Nature publish this paper when surely the editors must be aware of all the criticism that arose out of the 2012 ENCODE publicity fiasco?
I can think of only two answers to the first question. Either Elkon and Agami don't know of any papers challenging the view that most TF binding sites are functional (see below) or they do know of those papers but choose to ignore them. Neither answer is acceptable.
I think that the most important question in human gene regulation is how much of the genome is devoted to regulation. How many potential regulatory sites (enhancers) are functional and how many are spurious non-functional sites? Any paper on regulation that does not mention this problem should not be published. All results have to interpreted in light of conflicting claims about function.
Here are some example of papers that raise the issue. The point is not to prove that these authors are correct - although they are correct - but to show that there's a controvesy. You can't just state that there are one million regulatory sites as if it were a fact when you know that the results are being challenged.
"The observations in the ENCODE articles can be explained by the fact that biological systems are noisy: transcription factors can interact at many nonfunctional sites, and transcription initiation takes place at different positions corresponding to sequences similar to promoter sequences, simply because biological systems are not tightly controlled." (Morange, 2014)
"... ENCODE had not shown what fraction of these activities play any substantive role in gene regulation, nor was the project designed to show that. There are other well-studied explanations for reproducible biochemical activities besides crucial human gene regulation, including residual activities (pseudogenes), functions in the molecular features that infest eukaryotic genomes (transposons, viruses, and other mobile elements), and noise." (Eddy, 2013)
"Given that experiments performed in a diverse number of eukaryotic systems have found only a small correlation between TF-binding events and mRNA expression, it appears that in most cases only a fraction of TF-binding sites significantly impacts local gene expression." (Palazzo and Gregory, 2014)
One surprising finding from the early genome-wide ChIP studies was that TF binding is widespread, with thousand to tens of thousands of binding events for many TFs. These number do not fit with existing ideas of the regulatory network structure, in which TFs were generally expected to regulate a few hundred genes, at most. Binding is not necessarily equivalent to regulation, and it is likely that only a small fraction of all binding events will have an important impact on gene expression. (Slattery et al., 2014)
Detailed maps of transcription factor (TF)-bound genomic regions are being produced by consortium-driven efforts such as ENCODE, yet the sequence features that distinguish functional cis-regulatory sites from the millions of spurious motif occurrences in large eukaryotic genomes are poorly understood. (White et al., 2013)
One outstanding issue is the fraction of factor binding in the genome that is "functional", which we define here to mean that disturbing the protein-DNA interaction leads to a measurable downstream effect on gene regulation. (Cusanovich et al., 2014)
... we expect, for example, accidental transcription factor-DNA binding to go on at some rate, so assuming that transcription equals function is not good enough. The null hypothesis after all is that most transcription is spurious and alterantive transcripts are a consequence of error-prone splicing. (Hurst, 2013)
... as a chemist, let me say that I don't find the binding of DNA-binding proteins to random, non-functional stretches of DNA surprising at all. That hardly makes these stretches physiologically important. If evolution is messy, chemistry is equally messy. Molecules stick to many other molecules, and not every one of these interactions has to lead to a physiological event. DNA-binding proteins that are designed to bind to specific DNA sequences would be expected to have some affinity for non-specific sequences just by chance; a negatively charged group could interact with a positively charged one, an aromatic ring could insert between DNA base pairs and a greasy side chain might nestle into a pocket by displacing water molecules. It was a pity the authors of ENCODE decided to define biological functionality partly in terms of chemical interactions which may or may not be biologically relevant. (Jogalekar, 2012)
Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A. V., Mikheenko, A., et al. (2022) The complete sequence of a human genome. Science, 376:44-53. [doi:10.1126/science.abj6987]
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57-74. [doi: 10.1038/nature11247]