Friday, August 26, 2022

ENCODE and their current definition of "function"

ENCODE has mostly abandoned its definition of function based on biochemical activity and replaced it with "candidate" function or "likely" function, but the message isn't getting out.

Back in 2012, the ENCODE Consortium announced that 80% of the human genome was functional and junk DNA was dead [What did the ENCODE Consortium say in 2012?]. This claim was widely disputed, causing the ENCODE Consortium leaders to back down in 2014 and restate their goal (Kellis et al. 2014). The new goal is merely to map all the potential functional elements.

... the Encyclopedia of DNA Elements Project [ENCODE] was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types.

The new goal was repeated when the ENCODE III results were published in 2020, although you had to read carefully to recognize that they were no longer claiming to identify functional elements in the genome and they were raising no objections to junk DNA [ENCODE 3: A lesson in obfuscation and opaqueness].

The message doesn't seem to be getting through to the average scientist and certainly not to the general public. I was reminded of this yesterday when I saw the following edit to the Human genome article on Wikipedia under the subsection "Coding vs. noncoding DNA."

There is no consensus on what constitutes a "functional" element in the genome since geneticists, evolutionary biologists, and molecular biologists employ different definitions and methods.[41][42] In evolutionary definitions, "functional" DNA, whether it is coding or non-coding, contributes to the fitness of the organism, and therefore is under positive evolutionary pressure wheras "non-functional" DNA has no benefit to the organism and therefore is under neutral selective pressure. This type of DNA has been described as junk DNA[43][44] In genetic definitions, "functional" DNA is related to how DNA segments manifest by phenotype and "nonfunctional" is related to loss-of-function effects on the organism.[41] In biochemical definitions, "functional" DNA relates to how DNA affects biochemcial activity at the cellular DNA sequences that specify molecular products (e.g. noncoding RNAs) and biochemical activities with mechanistic roles in gene or genome regulation (i.e. DNA sequences that impact cellular level activity such as cell type, condition, and molecular processes).[45][41]

The "genetic" definition refers to the Kellis et al. (2014) paper (reference #41). A new reference (#45) has been added. It refers to an ENCODE III paper and it includes the following quotation.

"Operationally, functional elements are defined as discrete, linearly ordered sequence features that specify molecular products (for example, protein-coding genes or noncoding RNAs) or biochemical activities with mechanistic roles in gene or genome regulation (for example, transcriptional promoters or enhancers)."

The quotation is correct but the Wikipedia reference lists the wrong authors. Here's the correct citation.

Moore, J.E., Purcaro, M.J., Pratt, H.E., Epstein, C.B., Shoresh, N., Adrian, J. et al. (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583:699-710. [doi: 10.1038/s41586-020-2493-4]

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

I believe that the person who edited the Wikipedia article is misrepresenting the current position of ENCODE because the article goes on to say,

There is no consensus in the literature on the amount of functional DNA since, depending on how "function" is understood, ranges have been estimated from up to 90% of the human genome is nonfunctional DNA (junk DNA)[46] to 80% of the genome is actually functional.[47] It is also possible that junk DNA may acquire a function in the future and therefore may play a role in evolution,[48] but this is likely to occur only very rarely.[43] Finally DNA that is deliterious to the organism and is under negative selective pressure is called garbage DNA.[44]

The issue is how much of the human genome has a biologically relevant function. ENCODE is making an effort to distance itself from the naive claim that all biochemical markers define real functions. In the Moore et al. paper, for example, they make it very clear that transcription factor binding sites and DNase I hypersensitive sites identify candidate cis-regulatory elements (cCREs) and not proven ones. They then go on to describe experiments that might validate those candidate elements and point out that only a minority are supported by additional evidence of functionality.

Similarly, with transcripts they go out of their way to avoid stating that they all represent functional RNAs.

This is why they say in the last section of the paper that their goal is to identify potential functional DNA elements.

ENCODE element annotations aim to delineate specific segments of the human and mouse genomes that encode a potential biological function. We aim to predict the activities of ENCODE sequence elements within a given biological context or of the different combinations of elements that become active in different biological contexts.

The idea here is that the so-called "biochemical" elements are not necessarily functional.

Despite the very large number of biochemically defined elements within the ENCODE Encyclopedia, their functional annotation is currently limited to a few broad categories (enhancer, promoter, and insulator).

This brings us back to the quotation that was inserted in the Wikipedia article—a quotation that seems to imply that ENCODE still thinks that biochemical elements are real examples of function. Here's the full context of that quote with the part referenced in the Wikipedia article underlined (by me)—I've put a key word in bold face.

The human genome comprises a vast repository of DNA-encoded instructions that are read, interpreted, and executed by the cellular protein and RNA machinery to enable the diverse functions of living cells and tissues. The ENCODE Project aims to delineate precisely and comprehensively the segments of the human and mouse genomes that encode functional elements. Operationally, functional elements are defined as discrete, linearly ordered sequence features that specify molecular products (for example, protein-coding genes or noncoding RNAs) or biochemical activities with mechanistic roles in gene or genome regulation (for example, transcriptional promoters or enhancers). Commencing with the ENCODE Pilot Project in 2003 (which focused on a defined 1% of the human genome sequence) and scaling to the entire genome in a production phase II that began in 2007, ENCODE has applied a succession of state-of-the-art assays to identify likely functional elements with increasing precision across an expanding range of cellular and biological contexts.

I admit that the ENCODE researchers are being (deliberately) obtuse but, to me, the meaning is fairly clear if you read the entire paper. They do not claim that all of their biochemical elements are functional; they claim only that they are likely to be functional. More data is needed to prove function. In the paragraph above, they attempted to define what a real functional sequence would look like by describing an "operational" function as one that has real ("mechanistic") roles. That's NOT a definition of biochemical function as implied in the Wikipedia edit.[1]

I believe that it's no longer accurate to say that ENCODE still claims that 80% of our genome is functional. I think they have stopped making that claim in their scientific papers and have reverted to the claim that they are identifying "candidate" or "likely" functional elements as Kellis et al. said in 2014. As of 2022, they make no overt claim in the scientific literature about the amount of the genome that's functional.

It's time to stop repeating that 80% claim because even the ENCODE researchers know that it was wrong (and stupid) to say that in a scientific publication.

Of course, we all know that ENCODE is being disingenuous in two ways.

  1. They never, ever, discuss the possibility that their candidate regulatory elements or their transcripts could be spurious non-functional elements and they never, ever, mention the possibility that most of our genome is junk. They want to leave you with the impression that they just haven't nailed down the exact biological function of their candidates.
  2. In public, many of the ENCODE leaders continue to speak out against junk DNA [Manolis Kellis dismisses junk DNA]. The lesson they've learned is not to do that in print in the scientific literature.

[1] It's also an incredibly stupid definition, but that's not the point.

Kellis, M. et al. (2014) Defining functional DNA elements in the human genome. Proc. Natl. Acad. Sci. (USA) April 24, 2014 published online [doi: 10.1073/pnas.1318948111]

1 comment:

  1. I am not surprised that they walk their definitions back silently. For Mouse ENCODE, they never bothered to withdraw their claim that tissue-specific expression clusters more by species than by tissue, although Gilad and Mizrahi-Man had shown that the published conclusions were based on an error in the algorithm they used. For zoologists, the Mouse ENCODE publication was quite counterintuitive even back then, given the similarities in the genes, proteins, tissues, organs, and development of mice and men.
