Wednesday, December 14, 2016

The ENCODE publicity campaign of 2007

ENCODE¹ published the results of a pilot project in 2007 (Birney et al., 2007). They looked at 1% (30 Mb) of the genome with a view to establishing their techniques and dealing with large amounts of data from many different groups. The goal was to "provide a more biologically informative representation of the human genome by using high-throughput methods to identify and catalogue the functional elements encoded."

The most striking result of this preliminary study was the confirmation of pervasive transcription. Here's what the ENCODE Consortium leaders said in the abstract,
Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap with one another.
ENCODE concluded that 93% of the genome is transcribed in one tissue or another. There are two possible explanations that account for pervasive transcription.
  1. The genome is chock-full of genes. (A gene is a DNA sequence that's transcribed.) Known genes (~25,000) account for only 30% of the genome. The result suggests there are many thousands of genes yet to be discovered.
  2. A good percentage of these transcripts are accidental transcripts due to mistakes in initiating transcription at spurious promoters. They do not represent genes; they are junk RNA.
The main 2007 ENCODE paper in Nature sometimes recognizes these two possibilities and attempts to distinguish between them, but you have to read the paper very carefully to see any mention of #2.

In the main body of the paper, the authors note that most of the transcripts do not encode proteins, ruling out the presence of new protein-coding genes. They note that many of the transcripts come from DNA sequences within known genes and/or from the opposite strand. They confirm the existence of alternative splicing since the 487 known genes produce an average of 5.4 different transcripts per locus. This leads them to suggest (p. 802),
Instead of the traditional view that many genes have one or more alternative transcripts that code for alternative proteins, our data suggests that a given gene may both encode multiple protein products and produce other transcripts that include sequences from both strands and from neighbouring loci (often without encoding a different protein).
It's clear that the ENCODE authors aren't seriously considering explanation #2. They are focused on the idea that every single transcript is biologically relevant.

However, later on in the paper (p. 804), when the authors are considering intergenic transcripts, they do raise the question of relevance.
The biological relevance of these unannotated transcripts remains unanswered by these studies. Evolutionary information (detailed below) is mixed in this regard; for example, it indicates that unannotated transcripts show weaker evolutionary conservation than many other annotated features. As with other ENCODE detected elements, it is difficult to identify clear biological roles for the majority of these transcripts; such experiments are challenging to perform on such a large scale and, furthermore, it seems likely that many of the corresponding biochemical events may be evolutionarily neutral.
Recall that the main proxy for function is sequence conservation. In this case, the extra transcripts are not conserved. Furthermore, there's no evidence of biological function because it's too difficult to do the proper experiments. The obvious conclusion is that the transcripts probably don't have a biological function.

But that's clearly not the conclusion the authors favor or they would have stated it up front.

ENCODE looked at orthologous sequences in 14 other vertebrates and determined that only 4.9% of the sequences in their target regions were conserved. The conserved regions correspond to known functional elements such as coding regions and regulatory regions. The correlation between known function and conservation is striking.

But this is a problem. It leads to a discussion under the heading "Unconstrained experimentally identified functional elements." The "problem" is that ENCODE assumes mere existence is equivalent to function. It's the same problem they will have five years later (2012) when they publish their analysis of the rest of the genome. The ENCODE workers just don't seem to recognize that experimentally detectable elements (e.g. lots of transcripts) may not have a function.

How do they explain the fact that their "functional" elements fail to pass the most definitive test of biological function? They come up with five excuses, which they present as biological reasons.
  1. The nonconserved transcript may connect one or more conserved bits that are the core of the true functional element.
  2. The function may not require specific sequences; for example, transcription may be important to maintain an open chromatin conformation but the actual sequence of the transcript is irrelevant.
  3. There's a pool of neutral elements that evolve rapidly.
  4. Some of the elements might evolve a biological role and come under selection.
  5. A neutral element could replace an existing element.
The last three hypotheses should be grouped together. They represent a single teleological argument for future "function" in present-day nonfunctional elements. The idea is that this pool of neutral elements has a function: to prepare for future evolution.

The ENCODE researchers are very proud of their excuses, which they call hypotheses. They state (p. 812),
Our data support these hypotheses, and we have generalized this idea over many different functional elements. The presence of conserved function encoded by conserved orthologous bases is a commonplace assumption in comparative genomics; our findings indicate that there could be a sizable set of functionally conserved but non-orthologous elements in the human genome, and these seem unconstrained across mammals.
This confusing statement is remarkable for two reasons. First, the data actually supports the hypothesis that the nonconserved elements are junk; they do not have a biologically relevant function. That's the conclusion that most of us reached when we read the paper. Second, ENCODE is attacking the very idea that sequence conservation is a good proxy for function but they have no evidence to support such a claim. All they've done is come up with some rather silly ideas to avoid reaching the obvious conclusion.

How did everyone react to this 2007 paper? As you might imagine, the Nature News article that accompanied the paper focused on the silly ideas and not on the obvious conclusion: "Genome project turns up evolutionary surprises: Findings reveal how DNA is conserved across animals."
... the idea that important DNA might also be unstable is newer, and intriguing, because it undermines the assumption that biological function requires evolutionary constraint.

"We're generalizing this principle over mammals, and over many functional elements," says Ewan Birney, head of genome annotation at the European Bioinformatics Institute in Cambridge, UK, and a leader of ENCODE. "We're coming out quite strongly that this is not merely a curiosity of our genome—it's a really important part of the way our genome works."
What Nature is doing here is reflecting the firm belief of ENCODE workers that they have discovered pervasive biological function in spite of the fact that only 5% of the genome is conserved. What Nature is not doing is mentioning the alternative explanation that most of what ENCODE is reporting is junk.

The News & Views article was written by John Greally: "Genomics: Encyclopaedia of humble DNA." He says,
Researchers of the ENCODE consortium have analysed 1% of the human genome. Their findings bring us a step closer to understanding the role of the vast amount of obscure DNA that does not function as genes.

We usually think of the functional sequences in the genome solely in terms of genes, the sequences transcribed to messenger RNA to generate proteins. This perception is really the result of effective publicity by the genes, who take all of the credit even though their function is basically limited to communicating genomic information to the outside world. They have even managed to have the entire DNA sequence referred to as the 'genome', as if the collective importance of genes is all you need to know about the DNA in a cell.

We should have guessed that this was merely prima-donna behaviour on the part of narcissist genes when the sequencing of the human genome revealed that they comprise only a small percentage of the DNA. And our confidence should have been shaken when some sequences located far from any genes were found to be strikingly conserved, indicating that they have some important function. Now, on page 799 of this issue, the ENCODE Project Consortium shows through the analysis of 1% of the human genome that the humble, unpretentious non-gene sequences have essential regulatory roles.
You'll have to excuse John Greally for not understanding what a gene is. He seems to think that the only things that count as a gene are protein-coding exons.

If ENCODE is right, then genes make up more than 90% of our genome and there's no such thing as "a vast amount of obscure DNA that does not function as genes."

The take-home message from the News & Views article was that ENCODE has discovered a lot of biological function in our genome.

So, Nature published a sloppy paper that ignored the obvious conclusion from their data. They published an accompanying News article that promoted function in the face of facts. They published a News & Views article promoting the idea that most of our genome is functional but not genes.

Naturally, the rest of the world picked up on Nature's misrepresentation of the data. Nature has kindly made a list of headlines from around the world in June 2007 [search Google for "ENCODE Nature Publicaiton" (sic)]. You can see from the list below that the dominant theme was the idea that most of the genome is functional and not junk.

Nature learned its lesson in 2007. Five years later they launched an even bigger campaign to promote ENCODE and the idea that most of our genome is functional. Today, four years after that, the journal has still not recognized and admitted the extent of its misinformation campaign.

BBC News “Genome Further Unraveled”
Financial Times “Research Reveals Complexity in How Human Genes Interact”
The Guardian “Study Shines New Light on Genome”
The Times “DNA Analysis Provides New Insight into the Roots of Our Illnesses”
The Glasgow Herald “Genome ‘Junk’ May Be Key To How We Work”
Business Weekly “‘Parts List’ Could Reshape Genome Understanding”
New Scientist “‘Junk’ DNA Makes Compulsive Reading”
Nature “Genome Project Turns Up Evolutionary Surprises”
The Economist “Really New Advances”
ABC News “Landmark Genome Study Shows Complexity of Human ‘Code’”
Bloomberg “‘Junk’ Isn’t Junk”
Boston Globe “DNA Study Challenges Basic Idea of Genetics”
Boston Globe “Science: Miracles and Mysteries”
CBS News “DNA Decoding Landmark”
PBS Newshour “’Landmark’ Study Changes Long-Held DNA Beliefs”
Reuters “Human Instruction Book Not So Simple; Studies”
Washington Post “Human Genome Yields Up More Secrets”
Washington Post “Intricate Toiling Found in Nooks of DNA…”
Washington Post “Graphic from article above”
WebMD “Genetics Revolution Arrives”
Scientific American “The 1 Percent Genome Solution”
The Scientist “First Pages of Regulation ‘Encyclopedia’”
Science “DNA Study Forces Rethink of What It Means to Be a Gene”
Wired “Your Genome is Really, Really, REALLY Complicated”
National Public Radio “Reading between the Genomes”
Ars Technica “ENCODE Finds the Human Genome to Be an Active Place”
Chemical and Eng. News “Finding Function in the Genome”
GenomeWeb “Human Genome Not So Tidy After All, ENCODE Project Suggests”
Toronto Star “DNA ‘Junk’ Appears to Have Uses”
Agence France Presse “Landmark Study Prompts Rethink of Genetic Code”
La Repubblica “Svolta Nello Studio…”
El Mundo “Un Nuevo ‘Manual de Instrucciones’ del genoma…”
Folha de São Paulo “A Biologia Acaba…”
Frankfurter Neue Presse “Grammatik der Gene Viel Komplexer als Gedacht”
Belgium Cordis News “New Research Challenges Understanding of Human Genome”

1. Encyclopedia of DNA Elements.

Birney, E., Stamatoyannopoulos, J. A., Dutta, A., Guigó, R., Gingeras, T. R., Margulies, E. H., Weng, Z., Snyder, M., Dermitzakis, E. T., et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799-816. [doi: 10.1038/nature05874]


  1. It all comes down to the difference between "does something" and "functional". ENCODE doesn't know the difference between the two, or doesn't want others to know how they are different from one another. Larry and others do know the difference and aren't afraid to point it out.

  2. You say that they looked at 14 other vertebrates. I suppose that a salamander is among these. Do you know how much junk DNA they found in salamanders?

    1. Salamander genomes are huge

      Still beyond the reach of what can be sequenced and assembled well, although we're getting there

    2. Really Georgi?

      I'm not going to ask you why you don't have the answers right now.

      Aren't you puzzled by the notions "how random processes have fulfilled their responsibilities and kept the best in the nest and keep doing it?"

      You are not that stupid to believe this shit. Why are random processes smarter than you Georgie?

      Whether you come to the right conclusion or not (if you decide to analyze why you believe it) I'm more than fine with that.

      All the best!

  3. I know it would be tedious to do, but the list of newspaper reports at the end really should have each of those titles be an active link to the actual publication.

    1. You are correct. It would be tedious to search the archives for each title. But all you have to do is follow my advice and find the list published by Nature. Every article is there.