Monday, February 05, 2018

ENCODE's false claims about the number of regulatory sites per gene

Some beating of dead horses may be ethical, when here and there they display unexpected twitches that look like life.

Zuckerkandl and Pauling (1965)

I realize that most of you are tired of seeing criticisms of ENCODE, but it's important to remember that most scientists fell hook, line, and sinker for the ENCODE publicity campaign, and they still don't know that most of the claims were ridiculous.

I was reminded of this when I re-read Brendan Maher's summary of the ENCODE results that were published in Nature on Sept. 6, 2012 (Maher, 2012). Maher's article appeared in the front section of the ENCODE issue (see note 1). With respect to regulatory sequences he said ...
The consortium has assigned some sort of function to roughly 80% of the genome, including more than 70,000 ‘promoter’ regions — the sites, just upstream of genes, where proteins bind to control gene expression — and nearly 400,000 ‘enhancer’ regions that regulate expression of distant genes ... But the job is far from done, says [Ewan] Birney, a computational biologist at the European Molecular Biology Laboratory’s European Bioinformatics Institute in Hinxton, UK, who coordinated the data analysis for ENCODE. He says that some of the mapping efforts are about halfway to completion, and that deeper characterization of everything the genome is doing is probably only 10% finished.
We knew back in 2012 that there were only about 25,000 genes so why are there 70,000 promoters? And if this is only 10% of the total then how can there be 700,000 promoters?

Similarly, if there really are 400,000 enhancers (whatever they are) then that's 16 per gene. Throw in the unknown 90% that have yet to be discovered and you have 160 per gene. Really?
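The per-gene figures above are simple ratios, and they can be checked with a short Python sketch. The counts are the ones quoted in the post; the helper names are hypothetical, and reading "only 10% finished" as a literal tenfold scale-up is the post's own rhetorical assumption, not an ENCODE estimate.

```python
# Figures quoted in the post (Maher, 2012, plus the ~25,000 gene count).
GENES = 25_000
PROMOTERS = 70_000
ENHANCERS = 400_000

# Assumption: "only 10% finished" is read literally, so the final
# count would be ten times the current one.
SCALE_UP = 10


def per_gene(count, genes=GENES):
    """Average number of elements per gene."""
    return count / genes


def extrapolated(count, scale=SCALE_UP):
    """Element count if the current catalogue is only 1/scale complete."""
    return count * scale


if __name__ == "__main__":
    for name, count in (("promoters", PROMOTERS), ("enhancers", ENHANCERS)):
        now = per_gene(count)
        later = per_gene(extrapolated(count))
        print(f"{name}: {now:g} per gene now, {later:g} per gene if only 10% done")
```

On these inputs the sketch gives 2.8 promoters and 16 enhancers per gene today, rising to 28 and 160 per gene under the tenfold extrapolation.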
The main ENCODE claim is that a substantial percentage of the genome is devoted to regulation ...
… even using the most conservative estimates, the fraction of bases likely to be involved in direct regulation, even though incomplete, is significantly higher than that ascribed to protein-coding exons (1.2%), raising the possibility that more information in the human genome may be important for gene regulation than for biochemical function. (ENCODE, 2012 p. 71)
Their value for coding regions is too high but let's parse what they mean based on the idea that regulatory sequences account for more than 1.2% of the genome. That works out to 38 Mb of DNA. If we take a generous estimate of 10 bp per regulatory site then there must be 3.8 million sites or 152 sites per gene. That makes no sense. It makes even less sense if Birney is right and this is only 10% of all functional sites.
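The chain from 1.2% of the genome to 152 sites per gene can be reproduced in a few lines of Python. This is only a sketch: the genome size of about 3.2 Gb is an assumption (the post does not state it, but it is the size implied by 1.2% being roughly 38 Mb), and the function name is hypothetical.

```python
# Assumption: haploid human genome of ~3.2 billion bp (not stated in the post).
GENOME_BP = 3_200_000_000
GENES = 25_000      # gene count used in the post
BP_PER_SITE = 10    # the post's "generous" size for one regulatory site


def sites_per_gene(regulatory_bp, bp_per_site=BP_PER_SITE, genes=GENES):
    """Regulatory sites per gene implied by a given amount of regulatory DNA."""
    sites = regulatory_bp // bp_per_site
    return sites / genes


# 1.2% of the genome, written as 12 parts per 1000 to stay in integer arithmetic.
exact_bp = GENOME_BP * 12 // 1000   # 38,400,000 bp
rounded_bp = 38_000_000             # the post rounds this to 38 Mb

if __name__ == "__main__":
    print(f"1.2% of the genome: {exact_bp:,} bp")
    print(f"sites per gene (38 Mb, as in the post): {sites_per_gene(rounded_bp):g}")
    print(f"sites per gene (unrounded): {sites_per_gene(exact_bp):g}")
```

With the rounded 38 Mb figure the result is 152 sites per gene, matching the post; without the rounding it is 153.6, which does not change the argument.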

ENCODE never seriously considered the possibility that most of their sites have no function. We now know this was a serious error that tainted their conclusions. It's very common for papers to be retracted when the authors make mistakes that invalidate their conclusions. I'm sure we aren't going to see any retractions but it would be really nice if Nature (and Science) would at least publish an article admitting that they were duped by Ewan Birney and the other ENCODE researchers.

1. Brendan Maher published an online news article on the Nature website on Sept. 6, 2012. He acknowledges that many of us were highly critical of the ENCODE hype but he still defends the idea that much of the genome is functional (Fighting about ENCODE and junk). In that post, he claims that at least 20% of the genome could be devoted to regulation.

ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57-74. [doi: 10.1038/nature11247]

Maher, B. (2012) ENCODE: The human encyclopaedia. Nature, 489:46-48. [PDF]

Zuckerkandl, E. and Pauling, L. (1965) in EVOLVING GENES AND PROTEINS, V. Bryson and H.J. Vogel eds. Academic Press, New York NY USA


  1. I was under the impression that Birney and Stamatoyannopoulos did issue a retraction of sorts. In a subsequent paper they said that by 'function' they didn't mean what everyone else meant by 'function'.

    1. @lantog I believe you're thinking of

      Defining functional DNA elements in the human genome

      It's not a retraction per se, but they definitely seemed to soften their position (especially in comparison to the hype-building interviews that a small number of the ENCODE researchers gave to the media back in 2012) regarding how much of the human genome must be functional. For example:

      "Thus, unanswered questions related to biological noise, along with differences in the resolution, sensitivity, and activity level of the corresponding assays, help to explain divergent estimates of the portion of the human genome encoding functional elements. Nevertheless, they do not account for the entire gulf between constrained regions and biochemical activity. Our analysis revealed a vast portion of the genome that appears to be evolving neutrally according to our metrics, even though it shows reproducible biochemical activity, which we previously referred to as “biochemically active but selectively neutral” (68). It could be argued that some of these regions are unlikely to serve critical functions, especially those with lower-level biochemical signal. However, we also acknowledge substantial limitations in our current detection of constraint, given that some human-specific functions are essential but not conserved and that disease-relevant regions need not be selectively constrained to be functional. Despite these limitations, all three approaches [genetic, evolutionary, and biochemical - Dave] are needed to complete the unfinished process of inferring functional DNA elements, specifying their boundaries, and defining what functions they serve at molecular, cellular, and organismal levels."

    2. In analogy to a "notpology", that is a "nottraction".