Thursday, December 11, 2025

How many regulatory sites in the human genome?

The current best model of the human genome is that only 10% is functional and 90% is junk. This model was first developed over half a century ago (see Junk DNA). From the very beginning, the model recognized that regulatory sequences would make up a significant proportion of the functional elements but early suggestions that most of the repetitive DNA would turn out to be involved in regulation were rejected.

As more and more data accumulated on regulatory sequences, it became apparent that most regulatory sequences of pol II (RNA polymerase II) genes could be found in relatively short regions of DNA just upstream of the transcription start site. It also became apparent that for each transcription factor there were thousands of transcription factor binding sites even though only a small number were actually involved in genuine gene regulation.¹

The results from studying a large number of individual genes led to a model where a typical gene was regulated by binding a few transcription factors to the DNA within about 1000 bp of the start site. I suggested in my book that the amount of regulatory sequence should be about 200 bp per gene. If there are 25,000 genes (a generous estimate) then this amounts to about 5 million bp or less than 0.2% of the genome.
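The arithmetic behind that estimate is easy to check. Here's a minimal sketch; the per-gene figure and the gene count come from the paragraph above, but the genome size of about 3.1 billion bp is an assumption added here for illustration, since the post doesn't state the value used.

```python
# Back-of-envelope check of the regulatory sequence estimate.
REGULATORY_BP_PER_GENE = 200      # from the text
GENE_COUNT = 25_000               # from the text ("a generous estimate")
GENOME_SIZE_BP = 3_100_000_000    # ASSUMPTION: ~3.1 Gb haploid genome, not stated in the post

total_regulatory_bp = REGULATORY_BP_PER_GENE * GENE_COUNT
genome_fraction = total_regulatory_bp / GENOME_SIZE_BP

print(f"Total regulatory sequence: {total_regulatory_bp:,} bp")
print(f"Fraction of genome: {genome_fraction:.2%}")
```

With the assumed genome size this works out to 5,000,000 bp, or roughly 0.16% of the genome, which is consistent with "less than 0.2%".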

Most of these facts were forgotten when genomics supplanted hard-core biochemistry and molecular biology. The new genomics scientists seemed to be unaware of the binding properties of transcription factors and the extensive data on the regulation of specific well-studied genes. All of a sudden it became acceptable to claim that the human genome contained a million regulatory sites that took up a significant proportion of the genome. This is what the ENCODE researchers claimed in their initial publications. They said that the amount of DNA devoted to regulation could be as much as 20% [What did the ENCODE Consortium say in 2012]. That's ridiculous.

Regulatory sites were called cis-regulatory elements (CREs) and they were identified by simply looking at transcription factor binding sites and open chromatin regions. When it was pointed out that not all transcription factor binding sites were functional, the genomics researchers began to refer to these sites as candidate cis-regulatory elements (cCREs) but their publications still implied that huge numbers of them were genuine regulatory sites (Agarwal et al., 2025).

This was partly based on the false idea that all genomic transcripts were functional and that the human genome contained almost 100,000 genes. The general consensus in recent genomics papers is that there are at least one million regulatory sites, which corresponds to an average of 10 regulatory sites per gene if one accepts the absurdly high estimate of gene numbers. The fact that there were no well-studied examples of such genes didn't seem to bother anyone. They also weren't bothered by the fact that ten thousand protein-coding genes are housekeeping genes that are expressed in almost every cell; it seems unlikely that they would require that level of sophisticated regulation.

Let's look at a very recent paper to see how far the genomics attempt to study regulation has advanced. The title looks intriguing; maybe this is a serious attempt to identify genuine regulatory sites and eliminate all the noise. (Spoiler alert ... not!)

Yuan, S., Ni, P. and Su, Z. (2025) Prediction of target genes and functional types of cis-regulatory modules in the human genome reveals their distinct properties. BMC Biology 23:211. [doi: 10.1186/s12915-025-02313-9]

Abstract
Background
Cis-regulatory modules (CRMs) such as enhancers and silencers play critical roles in virtually all biological processes by enhancing and repressing, respectively, the transcription of their target genes in specific cell types. Although numerous CRMs have been predicted in genomes, identifying their target genes remains a challenge due to low quality of the predicted CRMs and the fact that CRMs often do not regulate their closest genes.

Results
We developed a method — correlation and physical proximity (CAPP) by leveraging our recently predicted 1.2 M CRMs in the human genome. CAPP is able to not only predict the CRMs’ target genes but also their functional types using only chromatin accessibility (CA) and RNA-seq data in a panel of cell/tissue types plus Hi-C data in a few cell types. Applying CAPP to a panel of only 107 cell/tissue types with CA and RNA-seq data available, we predict target genes for 14.3% of the 1.2 M CRMs, of which 1.4% are predicted as both enhancers and silencers (dual functional CRMs), 98.2% as exclusive enhancers, and 0.4% as exclusive silencers. Dual functional CRMs tend to regulate more distant genes than exclusive enhancers and silencers. Enhancers tend to cooperate with other enhancers, whereas silencers typically act independently. Silencers preferentially regulate genes expressed across many cell/tissue types, while enhancers are prone to regulate genes expressed in fewer cell/tissue types.

Conclusions
CAPP represents a significant advancement in predicting target genes and functional types of CRMs, especially dual functional CRMs, and different types of CRMs show distinct properties.

The authors begin by defining cis-regulatory modules (CRMs) as enhancers, silencers, promoters, and insulators. They then go on to discuss ways of associating CRMs with the genes they regulate. They locate 1.2 million CRMs in the human genome, mostly by identifying transcription factor binding sites.

There is no serious attempt to distinguish between real biologically relevant regulatory sites and spurious ones. What they should have done is what other researchers are now doing: refer to these sites as 'candidate' CRMs (cCRMs) in order to demonstrate that they were aware of the problem.

The authors developed an algorithm for identifying all CRMs that are located within a topologically associating domain (TAD) containing a gene. They then assume that the CRM regulates expression of that gene. The result is that 169,061 CRMs can be associated with a gene. I wrote to the senior author to find out how they estimated the number of genes and he replied by telling me that they used the NCBI annotation which listed both protein-coding and non-coding genes. There were 58,261 annotated genes at the time they did the analysis. The current version lists 59,792 genes of which 20,076 are protein-coding genes.

They found associated CRMs for only 43,521 of the 58,261 annotated genes, which means that there are at least three CRMs per gene on average. But are all of these real genes? I don't think so. I think the NCBI annotation is actually counting spurious transcripts as functional RNAs and arbitrarily assigning them to a gene. Therefore, some of these "regulatory sequences" (CRMs) are actually causing spurious transcription. They are not biologically relevant regulatory sequences.
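The numbers from the paper can be put side by side in a short sketch. All counts are the ones quoted above; the CRM total is the rounded 1.2 million from the abstract, so the last percentage is only approximate. The per-gene ratio assumes each CRM is counted once, which is a simplification since a CRM could be assigned to more than one gene.

```python
# Rough ratios from the figures quoted in the post.
CRMS_WITH_TARGETS = 169_061       # CRMs associated with a gene
GENES_WITH_CRMS = 43_521          # genes that have at least one associated CRM
ANNOTATED_GENES = 58_261          # NCBI annotation used by the authors
TOTAL_PREDICTED_CRMS = 1_200_000  # rounded "1.2 M" from the abstract

# ASSUMPTION: each CRM is counted once, so this is a lower bound on the
# average if some CRMs are assigned to several genes.
crms_per_gene = CRMS_WITH_TARGETS / GENES_WITH_CRMS
fraction_genes_covered = GENES_WITH_CRMS / ANNOTATED_GENES
fraction_crms_assigned = CRMS_WITH_TARGETS / TOTAL_PREDICTED_CRMS

print(f"CRMs per gene with a CRM: {crms_per_gene:.2f}")
print(f"Annotated genes with a CRM: {fraction_genes_covered:.1%}")
print(f"Predicted CRMs assigned to a gene: {fraction_crms_assigned:.1%}")
```

That gives about 3.9 CRMs per gene for roughly three quarters of the annotated genes, while about 14% of the predicted CRMs are assigned to any gene at all.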

What about all the other CRMs? The authors seem to imply in their discussion that many of these were excluded because they used strict criteria in defining their active CRMs. They also seem to imply that they "only" looked at 107 cell/tissue types and some of the other CRMs may be active in other tissues. What's missing from the discussion (and the entire paper) is the idea that many transcription factor binding sites may have nothing to do with regulating real genes.

The important point as far as I'm concerned is the idea that there might be 1.2 million functional regulatory sites for fewer than 25,000 real genes. I don't understand why this idea is being taken seriously. It looks to me like genomics and bioinformatics researchers are completely unaware of the properties of DNA binding proteins and the fact that 90% of our genome is junk.

Note: This is a long paper and much of it is incomprehensible even after reading it twice. It's possible that I could sort out all of the other data and the conclusions but it would take far too much time. I find it difficult to believe that the reviewers took the time.


1. For a detailed explanation see: Transcription factors and their binding sites, How many enhancers in the human genome?, and Are most transcription factor binding sites functional?.

Agarwal, V., Inoue, F., Schubach, M., Penzar, D., Martin, B.K., Dash, P.M., Keukeleire, P., Zhang, Z., Sohota, A., Zhao, J. et al. (2025) Massively parallel characterization of transcriptional regulatory elements. Nature 639:411-420. [doi: 10.1038/s41586-024-08430-9]

1 comment :

Mehrshad said...

Dear Dr. Moran,

You have argued that many scientists are overlooking false-positive and non-functional cis-regulatory sequences. I am curious about how you propose we should evaluate which sequences are truly functional versus non-functional.

I would appreciate it if your answer did not rely solely on sequence conservation, as molecular genetics research has repeatedly demonstrated that many regulatory regions can be functional despite lacking obvious conservation across species.

Thank you for your insights.