Sunday, October 15, 2023

On the conservation of regulatory sites in the human genome

There are a million potential transcription regulatory sites in the human genome. How many of these function as true regulatory sites?

One of the important questions about the human genome concerns how gene expression is regulated. The main controversy is over the number of functional regulatory sites and how that number relates to the abundance of junk DNA. Here's how one group addresses the problem by looking at the conservation of regulatory sites in mammals. Sequence conservation is the best genomic proxy for identifying functional sites.

Andrews, G., Fan, K., Pratt, H.E., Phalke, N., Zoonomia Consortium, Karlsson, E.K., Lindblad-Toh, K., Gazal, S., Moore, J.E. and Weng, Z. (2023) Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science 380:eabn7930. [doi: 10.1126/science.abn7930]

Understanding the regulatory landscape of the human genome is a long-standing objective of modern biology. Using the reference-free alignment across 241 mammalian genomes produced by the Zoonomia Consortium, we charted evolutionary trajectories for 0.92 million human candidate cis-regulatory elements (cCREs) and 15.6 million human transcription factor binding sites (TFBSs). We identified 439,461 cCREs and 2,024,062 TFBSs under evolutionary constraint. Genes near constrained elements perform fundamental cellular processes, whereas genes near primate-specific elements are involved in environmental interaction, including odor perception and immune response. About 20% of TFBSs are transposable element–derived and exhibit intricate patterns of gains and losses during primate evolution whereas sequence variants associated with complex traits are enriched in constrained TFBSs. Our annotations illuminate the regulatory functions of the human genome.

The authors introduce the issue by pointing out two different views of functional regulatory sites. First, there's the ENCODE view, which maps the binding sites of 1600 transcription factors and the associated methylation and histone modification patterns. This analysis creates a database of almost one million candidate cis-regulatory elements (cCREs). Second, there's the evolutionary perspective, which looks at conservation of regulatory sites as the prime indicator of function. Only a fraction of candidate sites are conserved. Does this mean that most of the cCREs are not functional?

Andrews et al. set out to identify all of the cCREs and transcription factor binding sites (TFBSs) that show evidence of conservation using an alignment of 241 mammalian genomes from the Zoonomia database and a program called phyloP.

They began with more than 920,000 human cCREs from the ENCODE Consortium results. Their results indicate that 47.5% of all cCREs are highly conserved because they align to almost all of the 240 non-human mammalian genomes. (I have no idea how the phyloP program calculates "conservation.") That means approximately 439,000 sites are likely to be genuine regulatory sequences, covering 4% of the human genome. If there are 25,000 genes, then each gene requires about 17 regulatory sequences.
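The per-gene figure is simple division, and it can be checked with a minimal Python sketch. The function name is hypothetical, the 25,000 gene count is the round number used above, and the 439,461 count comes from the paper's abstract.

```python
# Back-of-the-envelope check of the cCRE numbers quoted in this post.
# GENES is the round figure assumed in the post, not a measured value.
GENES = 25_000
TOTAL_CCRES = 920_000        # "more than 920,000" candidate cCREs
CONSERVED_FRACTION = 0.475   # 47.5% classified as highly conserved
CONSTRAINED_CCRES = 439_461  # count reported in the abstract


def ccres_per_gene(n_sites: int, n_genes: int = GENES) -> float:
    """Average number of sites per gene, assuming sites are spread evenly."""
    return n_sites / n_genes


if __name__ == "__main__":
    approx_conserved = TOTAL_CCRES * CONSERVED_FRACTION
    print(f"47.5% of 920,000 cCREs: {approx_conserved:,.0f}")
    print(f"constrained cCREs per gene: {ccres_per_gene(CONSTRAINED_CCRES):.1f}")
```

The first number comes out slightly below 439,461 because "more than 920,000" is rounded down. The per-gene value is an average only, so it says nothing about how the sites are actually distributed among genes.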

The next step was to examine 15.6 million TFBSs with a median length of 10 bp covering 5.7% of the human genome. They classified 32.5% of these sequences as highly conserved using the mysterious phyloP program. That means about 5.1 million functional transcription factor binding sites, but later on they reduce this to 2 million covering 0.8% of the genome. This is equivalent to an average of 80 per gene.
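The same kind of check works for the TFBS numbers. This is a rough sketch only: the helper names are hypothetical, the 25,000 gene count is the round number used above, and the genome size of 3.1 billion bp is my assumption, not a figure from the paper.

```python
# Rough consistency check of the TFBS numbers quoted in this post.
GENES = 25_000                # round figure assumed in the post
GENOME_BP = 3_100_000_000     # ASSUMPTION: approximate human genome size
TOTAL_TFBS = 15_600_000       # 15.6 million candidate TFBSs
CONSERVED_FRACTION = 0.325    # 32.5% classified as highly conserved
CONSTRAINED_TFBS = 2_024_062  # count reported in the abstract
CONSTRAINED_COVERAGE = 0.008  # 0.8% of the genome


def tfbs_per_gene(n_sites: int, n_genes: int = GENES) -> float:
    """Average number of binding sites per gene."""
    return n_sites / n_genes


def implied_bp_per_site(coverage: float, n_sites: int,
                        genome_bp: int = GENOME_BP) -> float:
    """Average site length implied by a genome coverage fraction."""
    return coverage * genome_bp / n_sites


if __name__ == "__main__":
    print(f"32.5% of 15.6 million: {TOTAL_TFBS * CONSERVED_FRACTION:,.0f}")
    print(f"constrained TFBSs per gene: {tfbs_per_gene(CONSTRAINED_TFBS):.0f}")
    bp = implied_bp_per_site(CONSTRAINED_COVERAGE, CONSTRAINED_TFBS)
    print(f"implied bp per constrained TFBS: {bp:.1f}")
```

Under that genome-size assumption, the 0.8% coverage works out to roughly 12 bp per site, which is in line with the stated median length of 10 bp.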

I don't believe that the authors have identified functional sites. There is no critical analysis of the results or the methodology and no attempt to rationalize the extraordinary claim that every gene requires so many regulatory sites. About 10,000 genes are regular housekeeping genes, such as those encoding the standard metabolic enzymes, and it's difficult to imagine that those genes require such complex regulation.


Image credit: ©Laurence A. Moran, What's in Your Genome?, p. 289.

2 comments:

  1. How did you get 4% of the genome for the cCREs? p129 of your book says 200bp/gene for several binding sites, and 17 is about 5x several, so 1000bp/gene, less than 1% in total.

  2. @Graham Jones: The 4% is their value, not mine. I don't agree with their estimate because it makes no sense. I assume that their average cCRE covers about 70 bp.
