Sunday, September 09, 2012

Brendan Maher Writes About the ENCODE/Junk DNA Publicity Fiasco

Brendan Maher is a Feature Editor for Nature. He wrote a lengthy article for Nature when the ENCODE data was published on Sept. 5, 2012 [ENCODE: The human encyclopaedia]. Here's part of what he said,
After an initial pilot phase, ENCODE scientists started applying their methods to the entire genome in 2007. Now that phase has come to a close, signalled by the publication of 30 papers, in Nature, Genome Research and Genome Biology. The consortium has assigned some sort of function to roughly 80% of the genome, including more than 70,000 ‘promoter’ regions — the sites, just upstream of genes, where proteins bind to control gene expression — and nearly 400,000 ‘enhancer’ regions that regulate expression of distant genes.
I expect encyclopedias to be much more accurate than this.

As most people know by now, there are many of us who challenge the implication that 80% of the genome has a function (i.e., it's not junk).1 We think the Consortium was not being very scientific by publicizing such a ridiculous claim.

The main point of Maher's article was that the ENCODE results reveal a huge network of regulatory elements controlling expression of the known genes. This is the same point made by the ENCODE researchers themselves. Here's how Brendan Maher expressed it.

The real fun starts when the various data sets are layered together. Experiments looking at histone modifications, for example, reveal patterns that correspond with the borders of the DNaseI-sensitive sites. Then researchers can add data showing exactly which transcription factors bind where, and when. The vast desert regions have now been populated with hundreds of thousands of features that contribute to gene regulation. And every cell type uses different combinations and permutations of these features to generate its unique biology. This richness helps to explain how relatively few protein-coding genes can provide the biological complexity necessary to grow and run a human being.
I think that much of this hype comes from a problem I've called The Deflated Ego Problem. It arises because many scientists were disappointed to discover that humans have about the same number of genes as many other species yet we are "obviously" much more complex than a mouse or a pine tree. There are many ways of solving this "problem." One of them is to postulate that humans have a much more sophisticated network of control elements in our genome. Of course, this ignores the fact that the genomes of mice and trees are not smaller than ours.

Brendan Maher became aware of the controversy in the hours following publication of the ENCODE results. He published a follow-up article the next day [Fighting about ENCODE and junk]. This is the article that Ed Yong and others have pointed to as an example of responsible journalism. In fact, Ed Yong refers to it as ...
The ENCODE reactions have come thick and fast, and Brendan Maher has written the best summary of them. I’m not going to duplicate his sterling efforts. Head over to Nature’s blog for more.
Let's look at this "sterling effort."
... several critics have challenged some of the most prominently reported claims in the papers, the way their publication was handled and the indelicate use of the word ‘junk’ on some material promoting the research.

First up was a scientific critique that the authors had engaged in hyperbole. In the main ENCODE summary paper, published in Nature, the authors prominently claim that the ENCODE project has thus far assigned “biochemical functions for 80% of the genome”. I had long and thorough discussions with Ewan Birney about this figure and what it actually meant, and it was clear that he was conflicted about reporting it in the paper’s abstract.
We understand. After long and thorough discussions Brendan Maher decided to report the misleading figure exactly as Ewan Birney intended without highlighting any of the problems.
It’s a big number, to be sure. The protein-encoding portion of the genome — that which has historically been considered the most important part— represents a little more than 1%, and to imply that they found similarly important and interesting functions for another 79% is an extraordinary claim. Birney had said to me and reiterates in a Q&A-style blog post that it is also a loose interpretation of the word ‘functional’ that encompassed many categories of biochemical activity, from the very broad — such as actively producing or ‘transcribing’ RNA — to being attached to some sort of transcription-factor protein, all the way down to that narrow range of protein-encoding DNA within the 1%.
No knowledgeable scientist ever said that only 1% of our genome was functional. It's extremely annoying that journalists keep repeating stuff like this as though none of us ever knew about all the other functional parts of the genome that had been solidly proven decades before anyone ever dreamed of ENCODE. This is part of the problem.

But what "defense" is Brendan Maher actually mounting here? All he's saying is that Ewan Birney invented a ridiculous definition of function and that many journalists fell for it.
But hold on, said a number of genome experts: most of that activity isn’t particularly specific or interesting and may not have an impact on what makes a human a human (or what makes one human different from another). A blog post by Ed Yong discusses some of these critiques. It was already known, for example, that vast portions of the genome are transcribed into RNA. A small amount of that RNA encodes protein, and some serves a regulatory role, but the rest of it is chock-full of seemingly nonsensical repeats, remnants of past viruses and other weird little bits that shouldn’t serve a purpose.
Exactly. Scientists knew that, and much more. Why didn't science journalists also know that?
The paper does drill down somewhat into what the authors mean by functional elements. And Birney does the same in his blog. Excluding all but the sites where there is very probable active binding by a regulatory protein, “we see a cumulative occupation of 8% of the genome,” he writes. Add to that the 1% of protein-encoding DNA and you get 9%.
Genes make up about 20% of our genome (exons plus introns). There are about 25,000 known genes. Birney is saying that the ENCODE project identified 256,000,000 bp (8%) of sequence that's required for regulating gene expression. That's roughly 10,000 bp of sequence for every gene. Since the typical transcription factor binding site is 6-8 bp, this means at least 1000 transcription factor binding sites are controlling each gene.
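For anyone who wants to check that back-of-envelope arithmetic, here is a minimal Python sketch. It assumes a genome size of 3.2 billion bp, which is the figure implied by 8% = 256,000,000 bp; the variable names are only for illustration and the inputs are the rough estimates quoted above, not measured values.

```python
# Back-of-envelope check of the estimate in the paragraph above.
GENOME_BP = 3_200_000_000   # assumed genome size, implied by 8% = 256,000,000 bp
REGULATORY_FRACTION = 0.08  # Birney's "cumulative occupation of 8% of the genome"
GENE_COUNT = 25_000         # approximate number of known genes
SITE_BP_RANGE = (6, 8)      # typical transcription factor binding site length

regulatory_bp = round(GENOME_BP * REGULATORY_FRACTION)
bp_per_gene = regulatory_bp / GENE_COUNT

# Longer sites mean fewer sites per gene, so 8 bp gives the lower bound.
sites_per_gene = {size: bp_per_gene / size for size in SITE_BP_RANGE}

print(f"regulatory sequence: {regulatory_bp:,} bp")
print(f"sequence per gene: {bp_per_gene:,.0f} bp")
for size, count in sites_per_gene.items():
    print(f"{size} bp sites: about {count:,.0f} per gene")
```

With these inputs the sketch gives 10,240 bp per gene, and somewhere between about 1,280 and 1,707 binding sites per gene, which is where "at least 1000" comes from.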

I'd like to know of any gene where this kind of complex regulation has been demonstrated. Does it apply to the thousands of genes encoding basic metabolic enzymes like those of the citric acid cycle? Does it apply to all of the genes for ribosomal proteins or all of the known tRNA genes?

This doesn't make sense but I excuse Brendan Maher and other journalists in this case since you have to know a lot about gene regulation to see the absurdity in the ENCODE predictions.
Birney and his colleagues have estimated how complete their sampling is, and suspect that they will find another 11% of the genome with this kind of regulatory activity. That gets them to 20%. So, perhaps the main conclusion should have been that 20% of the genome in some situation can directly influence gene expression and phenotype of at least one human cell type. It’s a far cry from 80%, but a substantial increase from 1%.
If you thought 8% was ridiculous then 19% is even worse. Feel sorry for the poor pufferfish whose genome is only 12% as large as the human genome. Think of all the complex regulation that pufferfish just can't do.

I'm not saying that my estimates are definitive proof that the ENCODE conclusions are wrong and I'm not saying that the size of the pufferfish genome disproves the estimation Birney is making. What I'm saying is that results and conclusions have to be viewed skeptically and put into bigger context before believing that they are true. That's the job science journalists have undertaken, otherwise they are just the mouthpiece of the authors.
Some suggest that a majority of the genome does have an active role in biological functions. John Mattick, director of the Garvan Institute of Medical Research in Sydney, Australia, who I spoke to in the run up to the publication of these papers, argued that the ENCODE authors were being far too conservative in their claims about the significance of all that transcription. “We have misunderstood the nature of genetic programming for the past 50 years,” he told me. Having long argued that non-coding RNA has a crucial role in cell regulatory functions, his gentle criticism is that “they’ve reported the elephant in the room then chosen to otherwise ignore it”.
Good reporting. Yes, there are some other scientists who think that all of the human genome is functional. That's why this is a genuine scientific controversy. (I think Mattick is dead wrong [Genome Size, Complexity, and the C-Value Paradox].)
The 80% number may not have been ideal, but it did provide a headline figure that was impressive to the mainstream media. This is at the core of a related critique against the ENCODE researchers and the journals that published their papers. By bandying about this big number, press releases on the project touted the idea that ENCODE had demolished some long-standing notion that much of the genome is ‘junk’. Michael Eisen, an evolutionary biologist at the University of California, Berkeley, said in a blog post that this pushed “a narrative about their results that is, at best, misleading.”

That narrative goes something like this: scientists long thought the genome was littered with junk, evolutionary remnants that serve no purpose, but ENCODE has shown that 80% of the genome (and possibly more to come) does serve a purpose. That narrative appeared in many media reports on the publication. Many on Twitter and in online conversations bemoaned the rehashing of a junk-DNA debate that they considered imaginary or at least long-settled. Eisen, perhaps rightfully, puts the blame on press releases that touted the supposed paradigm shift: the one from Nature Publishing Group started thus: “Far from being junk, the vast majority of our DNA participates in at least one biochemical event in at least one cell type.” Eisen says that “the authors undoubtedly know, nobody actually thinks that non-coding DNA is ‘junk’ anymore. It’s an idea that pretty much only appears in the popular press, and then only when someone announces that they have debunked it.”

It is an old argument, but it’s not clear that it is a dead argument. Several researchers took issue with ENCODE’s suggestion that its wobbly 80% number in any way disproves that some DNA is junk. Larry Moran, a biochemist at the University of Toronto in Ontario argued on his blog that claims about disproving the existence of junk gives ammunition to creationists who like a tidy view of every letter in the genome having some sort of divine purpose. “This is going to make my life very complicated,” he writes.

Indeed, the papers have caught the attention of at least some creationists, and of just about everyone else. This was in part designed by the project leaders and editors, who organized a simultaneous release of the publications to maximize their impact. This was a major, time-consuming event that occupied a great deal of time from the scientists involved and from the editors at their respective journals.
So, what's up Brendan Maher? Are you saying that publication of an admittedly misleading number (80%) was acceptable because "it did provide a headline figure that was impressive to the mainstream media." Or, are you going to admit that you made a mistake?

(And, for the record, I did not mean that we should pay any attention at all to what the creationists think. When I said that "This is going to make my life very complicated" I was thinking more of how I was going to explain this to scientists and people interested in science. The damage done by this publicity campaign is that it misleads the general public, not just creationists.)
ENCODE was conceived of and practised as a resource-building exercise. In general, such projects have a huge potential impact on the scientific community, but they don’t get much attention in the media. The journal editors and authors at ENCODE collaborated over many months to make the biggest splash possible and capture the attention of not only the research community but also of the public at large. Similar efforts went into the coordinated publication of the first drafts of the human genome, another resource-building project, more than a decade ago. Although complaints and quibbles will probably linger for some time, the real test is whether scientists will use the data and prove ENCODE’s worth.
I'm sorry but if good journalists like Ed Yong think this is a "sterling effort" at defending the ENCODE Consortium against scientific criticism then we're in much more trouble than I originally thought.


1. Yes, I know that what the consortium actually said was that 80% has a "biochemical function" and that kind of function may, or may not, indicate a biologically relevant function. The distinction is not appreciated by the average reader and, quite frankly, not by the average science writer either.

9 comments :

  1. In another post Maher writes that "Nature Publishing Group started thus: “Far from being junk, the vast majority of our DNA participates in at least one biochemical event in at least one cell type.”"
    Since when do "biochemical events" count as function? I can correct them. EVERY nucleotide in every cell type participates in a biochemical event in that every one of them is replicated by DNA polymerase.

    RW

  2. Larry,
    You are right on. Keep up the good work. From reading your earlier postings, I have collected this:

    Junk DNA from LM 5 10 11
    Total Essential/Functional (so far) = 8.7%
    Total Junk (so far) = 65%
    Unknown (probably mostly junk) = 26.3%

    This needs to be emphasized. This will only change when function has been proven.
    Denis Castaing

  3. Genes make up about 20% of our genome (exons plus introns). There are about 25,000 known genes. Birney is saying that the ENCODE project identified 256,000,000 bp (8%) of sequence that's required for regulating gene expression. That's roughly 10,000 bp of sequence for every gene. Since the typical transcription factor binding site is 6-8 bp, this means at least 1000 transcription factor binding sites are controlling each gene.

    That's not what is claimed. Those base pairs are derived from ChIP-seq and DNAse-seq assays. When you do those you get a broader region of enrichment than the actual binding site, because that's what the resolution of the assay is.

    You can actually get the binding sites themselves by doing digital footprinting which ENCODE has done, in these two quite nice papers:

    http://www.cell.com/abstract/S0092-8674(12)00639-3
    http://www.nature.com/nature/journal/v489/n7414/fig_tab/nature11212_ft.html

    and by doing ChIP-exo-seq which was developed after ENCODE was already towards the end of its cycle but a lot more of it will be done in the future.

    Replies
    1. That's not what is claimed.

      When you claim that a particular region of DNA has a function you are talking about the actual binding site. The DNA immediately flanking the binding site has no function other than just happening to be next to a binding site.

      Is this yet another strange definition of "function"? Are there any more surprises?

      How much of that 8% could be deleted and/or mutated without any effect on the individual or the species?

    2. I don't have time to find a better illustration right now, but this one should do the job too:

      http://www.nature.com/nmeth/journal/v5/n9/images/nmeth.1246-F1.jpg

      Because you're size-selecting at ~200bp, the peak is always much much bigger than the 8bp of the binding site. So when you do peak calling you get something much bigger as that's what the resolution of the assay is; you have to bring in a lot of orthogonal evidence to say where the actual binding site is. You can have other binding sites very close by, etc. It's not at all straightforward to get the actual binding site and count just that.

      How much of that 8% could be deleted and/or mutated without any effect on the individual or the species?

      This is explicitly discussed in the Neph et al. Nature paper:

      http://www.nature.com/nature/journal/v489/n7414/fig_tab/nature11212_F3.html

      There is a strong anti-correlation between DNAse cleavage and conservation, even within the binding site itself (those base pairs that apparently don't participate in the protein-DNA interaction).

  4. Larry: "1. Yes, I know that what the consortium actually said was that 80% has a "biochemical function" and that kind of function may, or may not, indicate a biologically relevant function. The distinction is not appreciated by the average reader and, quite frankly, not by the average science writer either."

    As an average reader, I can testify that I do not understand the difference between "biochemical function" and "biologically relevant function."

    Replies
    1. "Biologically relevant function" is what the layman would call "function." It helps you grow or stay alive.

      "Biochemical function" means it interacts with any other molecule, or it gets transcribed into RNA. But the interaction with another molecule might be accidental. The transcription into RNA might be at a very low level, like one RNA molecule per cell, so there might be diverse examples of tiny amounts of junk RNA that could be basically wasted.

  5. It arises because many scientists were disappointed to discover that humans have about the same number of genes as many other species yet we are "obviously" much more complex than a mouse or a pine tree.

    I'm wondering why they would have been upset about this.

    Of course people are different than mice and pine trees, and mice are different from pine trees; "more complex" is probably a deceptive way of looking at it. I don't see any mice or pine trees blogging about whether or not DNA is largely junky. If that's a more complex behavior than mouse or pine tree behavior, perhaps it isn't based in our physical structure.

    How do you like them apples?

    Really, scientists had their feelings hurt because other organisms have more genes than we do?

  6. Really, scientists had their feelings hurt because other organisms have more genes than we do?

    Given how pervasive the myth of humans being the top of creation is, that doesn't surprise me in the least.
