Thursday, January 15, 2026

Even more regulatory elements?

The expression of genes is regulated at many levels but one of the most important is regulation at the level of transcription. Transcription initiation is controlled by transcription factors that bind to sequences near the promoter and either activate or repress transcription.

A lot of work has been done on transcription regulation in mammals over the past 40 years. The general impression from these detailed studies of individual genes is that regulation usually involves a relatively small number of transcription factors that bind to sequences within 1000 bp or so of the transcription start site.

This model was challenged by the ENCODE studies in 2012. ENCODE researchers claimed to have discovered hundreds of thousands of cis-regulatory elements (CREs) covering a substantial percentage of the genome. If they are correct, then this means that there are dozens of transcription factors controlling the expression of every gene.

All researchers need to realize that the best scientific practice is produced when, like Darwin, they persistently search for flaws in their arguments.

Bruce Alberts et al. (2015)
"Self-correction in science at work"
Science 348: 1420

Many scientists pointed out that what the ENCODE researchers were really looking at was transcription factor binding sites and not CREs. In a genome full of junk DNA, we expect a large number of spurious transcription factor binding sites. These sites are NOT CREs although they may be good candidates for biologically relevant regulatory sites. Later ENCODE researchers seemed to (reluctantly) agree with this criticism so they began to label those sites as "candidate" cis-regulatory elements or cCREs.

The controversy continues. I've blogged about it repeatedly in an effort to alert people to the real issue; namely, whether a transcription factor binding site is real or spurious [How many regulatory sites in the human genome?]. Last month I drew your attention to a study of TF binding sites in random DNA sequences inserted into human cells. That study confirmed that you could detect these sites in random DNA, suggesting that the ENCODE data might contain a lot of spurious sites that have nothing to do with regulation [The activity of "random" DNA supports the junk DNA model].

Now we're starting 2026 with another study demonstrating that ENCODE supporters haven't listened to any of the criticism leveled against their interpretation. For reasons that are very unclear to me, this most recent study was published in Nature, one of the most prestigious science journals.

Moore, J.E., Pratt, H.E., Fan, K., Phalke, N., Fisher, J., Elhajjajy, S.I., Andrews, G., Gao, M., Shedd, N. et al. (2026) An expanded registry of candidate cis-regulatory elements. Nature:1-10. [doi: 10.1038/s41586-025-09909-9]

Mammalian genomes contain millions of regulatory elements that control the complex patterns of gene expression. Previously, the ENCODE consortium mapped biochemical signals across hundreds of cell types and tissues and integrated these data to develop a registry containing 0.9 million human and 300,000 mouse candidate cis-regulatory elements (cCREs) annotated with potential functions. Here we have expanded the registry to include 2.37 million human and 967,000 mouse cCREs, leveraging new ENCODE datasets and enhanced computational methods. This expanded registry covers hundreds of unique cell and tissue types, providing a comprehensive understanding of gene regulation. Functional characterization data from assays such as STARR-seq, massively parallel reporter assay, CRISPR perturbation and transgenic mouse assays have profiled more than 90% of human cCREs, revealing complex regulatory functions. We identified thousands of novel silencer cCREs and demonstrated their dual enhancer and silencer roles in different cellular contexts. Integrating the registry with other ENCODE annotations facilitates genetic variation interpretation and trait-associated gene identification, exemplified by the identification of KLF1 as a novel causal gene for red blood cell traits. This expanded registry is a valuable resource for studying the regulatory genome and its impact on health and disease.

So now we have 2.37 million transcription factor binding sites that may or may not be true regulatory elements. They are "candidates" (cCREs) but the authors claim that this study provides "a comprehensive understanding of gene regulation" because 90% of these candidate sites are actually involved in regulation.

Let's think about that. 90% of 2.37 million is still 2.13 million sites. This means an average of 85 regulatory sites per gene if there are 25,000 genes. Does anyone seriously believe that the average human gene is controlled by that many regulatory sites? (Keep in mind that about 10,000 of those genes are housekeeping genes that are transcribed in almost every cell.)
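For anyone who wants to check the back-of-the-envelope arithmetic, here is a minimal Python sketch. The constant and function names are mine, purely for illustration. The 90% figure is my reading of the paper's claim, and the 25,000 gene count is the round number assumed above, not a value taken from the paper.

```python
# Back-of-the-envelope estimate using the round numbers from the paragraph above.
HUMAN_CCRES = 2_370_000      # human cCREs in the expanded registry (Moore et al., 2026)
FRACTION_REGULATORY = 0.90   # assumed: fraction taken to be genuinely regulatory
GENE_COUNT = 25_000          # assumed: round number of human genes


def sites_per_gene(total_sites, fraction, genes):
    """Average number of putative regulatory sites per gene."""
    return total_sites * fraction / genes


regulatory_sites = HUMAN_CCRES * FRACTION_REGULATORY
average = sites_per_gene(HUMAN_CCRES, FRACTION_REGULATORY, GENE_COUNT)

print(f"{regulatory_sites / 1e6:.2f} million sites, about {average:.0f} per gene")
```

Changing `GENE_COUNT` or `FRACTION_REGULATORY` shows how sensitive the per-gene average is to those assumptions. Even with generous adjustments, the number stays in the dozens.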

Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can—if you know anything at all wrong, or possibly wrong—to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it as well as those that agree with it.

Richard Feynman (1985)
"Cargo Cult Science"
in "Surely You're Joking, Mr. Feynman!"

Apparently the authors are quite comfortable with this conclusion. They note that the current CRE database covers 21% of the genome but there may be more sites that have yet to be discovered. Here's part of their discussion.

With a genomic footprint of 21%, the human registry represents a comprehensive catalogue of the cis-regulatory repertoire, as it integrates data across thousands of biosamples spanning most human organs and tissues. However, we recognize the need for further evaluation using single-cell data to determine whether the registry may miss high-activity CREs specific to numerically rare cell types. Additionally, the potential emergence of novel CREs under disease or stimulation conditions remains an open area for investigation. Our initial assessments using single-cell data (Supplementary Note 1.7) support the overall completeness of the registry, but future work will be necessary to refine and expand its coverage using these more granular datasets.

Note the subtle shift from cCREs to just CREs.

Let me be clear about my critique. I'm not denying that there may be a huge number of biologically relevant regulatory sites hidden in the junk DNA. I'm skeptical, but still trying to keep an open mind.

What I object to most strongly is the fact that Moore et al. don't even consider the possibility that they may be looking at spurious TF binding sites and they don't even discuss the implications of their conclusions.

The fact that this paper was published without acknowledging the controversy tells me that peer review has failed.


8 comments :

Anonymous said...

I always wonder what authors of such papers think about how many different states of a cell exist. Surely, during development there is a lot of differentiation to form the different cell types, tissues and organs. However, once adulthood is reached, how much regulation is required and, more importantly, why would a salamander need more DNA and multiple times as many regulatory elements as mammals to meet the demands of life? Or should we believe that the development of a salamander requires more regulation and that in adulthood salamanders are challenged by more environmental changes than human beings, toads or frogs? In addition, isn't much of the necessary day-to-day regulation primarily happening on the cellular and the protein level rather than through transcriptional changes?

gert korthof said...

Larry, I asked Google AI: "how many regulatory sites do human genes have?".
The answer included a reference to: 'UMass Chan scientists annotate largest map yet of human genome’s regulatory switches', 15 Jan 2026. (press release)
which is rather surprising because that means that AI answers are apparently permanently updated. A one-day old publication is listed. So, it is not true that AI uses a fixed database of data.

"This means an average of 85 regulatory sites per gene":
could 85 be approximately true if something like 10 regulatory sites per cell type are used and the rest is non-functional in that cell-type? So, each cell type uses a different set of regulatory sites? Add to this a lot of redundancy?

Larry Moran said...

@gert korthof: What exactly are you suggesting? Are you suggesting that the 10,000 housekeeping genes are controlled by different transcription factors in different cell types? Are you thinking that the gene for triose-phosphate isomerase (TPI1) is controlled by 10 specific transcription factors in skin cells but a different 10 transcription factors in liver cells?

gert korthof said...

Larry, I didn't mention 'housekeeping genes'! Is there consensus about the number of regulatory sites for non-housekeeping genes in humans? What is the number according to you?
PS: I found estimates of 3,140 to 6,909 housekeeping genes in humans, the rest would be non-housekeeping genes.

Larry Moran said...

@gert korthof: True, you didn't mention housekeeping genes but they are important when one is making claims about there being 85 regulatory sites per gene. You proposed that this number could be explained if there were genes that were expressed in multiple cell types but used different transcription factors in each cell type.

I asked the obvious question in order to clarify your proposal. Housekeeping genes are the classic example of genes that are expressed in multiple cell types. Why didn't you answer my question?

Do you believe that there are millions of biologically relevant regulatory sites in the human genome?

Do you agree that the authors of this paper should have mentioned the possible existence of spurious transcription factor binding sites?

Arthur Hunt said...

Larry, you asked "Are you suggesting that the 10,000 housekeeping genes are controlled by different transcription factors in different cell types? Are you thinking that the gene for triose-phosphate isomerase (TPI1) is controlled by 10 specific transcription factors in skin cells but a different 10 transcription factors in liver cells?"

This doesn't sound too far-fetched to me. Well, maybe 10 completely cell-specific factors is a bit much, but the general idea seems a reasonable suggestion.

Larry Moran said...

@Arthur Hunt: Let's think about what we would see if your speculation is correct.

Imagine that the genes for all the enzymes in the Krebs cycle are controlled by transcription factors A, B, and C in liver cells and by factors X, Y, and Z in muscle cells. If we look carefully at muscle cells we should be able to show that the genes for A, B, and C are not expressed in those cells and similarly the genes for X, Y, and Z are not expressed in liver cells.

Is there any evidence that this is true of general transcription factors that are required for a large number of housekeeping genes?

Also, we should see lots of examples of promoter bashing experiments where you get very different results depending on whether you do the experiments in HeLa cells or a neuroblastoma cell line. The scientific literature should be full of such anomalies, right?

Anonymous said...

This feels similar to the initial estimates of the number of human genes as the HGP was finishing up, with some put at 100,000+. Only now they can back up their guesses by abusing genome-wide data in their analyses.