Sunday, September 09, 2012

Ed Yong Updates His Post on the ENCODE Papers

For decades we've known that less than 2% of the human genome consists of exons and that protein encoding genes represent more than 20% of the genome. (Introns account for the difference between exons and genes.) [What's in Your Genome?]. There are about 20,500 protein-encoding genes in our genome and about 4,000 genes that encode functional RNAs for a total of about 25,000 genes [Humans Have Only 20,500 Protein-Encoding Genes]. That's a little less than the number predicted by knowledgeable scientists over four decades ago [False History and the Number of Genes]. The definition of "gene" is somewhat open-ended but, at the very least, a gene has to have a function [Must a Gene Have a Function?].

We've known about all kinds of noncoding DNA that's functional, including origins of replication, centromeres, genes for functional RNAs, telomeres, and regulatory DNA. Together these functional parts of the genome make up almost 10% of the total. (Most of the DNA giving rise to introns is junk in the sense that it is not serving any function.) The idea that all noncoding DNA is junk is a myth propagated by scientists (and journalists) who don't know their history.

We've known about the genetic load argument since 1968 and we've known about the C-Value "Paradox" and its consequences since the early 1970s. We've known about pseudogenes and we've known that almost 50% of our genome is littered with dead transposons and bits of transposons. We've known that about 3% of our genome consists of highly repetitive DNA that is not transcribed or expressed in any way. Most of this DNA is functional and a lot of it is not included in the sequenced human genome [How Much of Our Genome Is Sequenced?]. All of this evidence indicates that most of our genome is junk. This conclusion is consistent with what we know about evolution and it's consistent with what we know about genome sizes and the C-Value "Paradox." It also helps us understand why there's no correlation between genome size and complexity.

Many science writers published articles on the new ENCODE papers when the embargo was lifted last week. One of them was Ed Yong who blogs at Not Exactly Rocket Science. Ed Yong is one of the best science journalists in the world. He was taking on a very difficult job and, in my opinion, he didn't get it right. In light of the ENCODE/junk DNA fiasco that erupted following those publications, Ed has updated his blog post: ENCODE: the rough guide to the human genome.

Here's the slightly modified version of the text in the main body of the article. This was the part that upset me because it seemed to ignore all the evidence for junk DNA.
For years, we’ve known that only 1.5 percent of the genome actually contains instructions for making proteins, the molecular workhorses of our cells. But ENCODE has shown that the rest of the genome – the non-coding majority – is still rife with “functional elements”. That is, it’s doing something.

It contains docking sites where proteins can stick and switch genes on or off. Or it is read and ‘transcribed’ into molecules of RNA. Or it controls whether nearby genes are transcribed (promoters; more than 70,000 of these). Or it influences the activity of other genes, sometimes across great distances (enhancers; more than 400,000 of these). Or it affects how DNA is folded and packaged. Something.

According to ENCODE’s analysis, 80 percent of the genome has a “biochemical function”. More on exactly what this means later, but the key point is: It’s not “junk”. Scientists have long recognised that some non-coding DNA has a function, and more and more solid examples have come to light [edited for clarity - Ed]. But, many maintained that much of these sequences were, indeed, junk. ENCODE says otherwise. “Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more,” says Tom Gingeras, one of the study’s many senior scientists.

And what’s in the remaining 20 percent? Possibly not junk either, according to Ewan Birney, the project’s Lead Analysis Coordinator and self-described “cat-herder-in-chief”. He explains that ENCODE only (!) looked at 147 types of cells, and the human body has a few thousand. A given part of the genome might control a gene in one cell type, but not others. If every cell is included, functions may emerge for the phantom proportion. “It’s likely that 80 percent will go to 100 percent,” says Birney. “We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful.”
Ed Yong has now added several updates to his post in order to point to critics of the claims made by the leadership of the ENCODE Consortium. For example,
[Update 07/09 23:00] Birney was right about the scepticism. Gregory says, “80 percent is the figure only if your definition is so loose as to be all but meaningless.” Larry Moran from the University of Toronto adds, “Functional” simply means a little bit of DNA that’s been identified in an assay of some sort or another. That’s a remarkably silly definition of function and if you’re using it to discount junk DNA it’s downright disingenuous.”

This is the main criticism of ENCODE thus far, repeated across many blogs and touched on in the opening section of this post. There are other concerns. For example, White notes that many DNA-binding proteins recognise short sequences that crop up all over the genome just by chance. The upshot is that you’d expect many of the elements that ENCODE identified if you just wrote out a random string of As, Gs, Cs, and Ts. “I’ve spent the summer testing a lot of random DNA,” he tweeted. “It’s not hard to make it do something biochemically interesting.”

Gregory asks why, if ENCODE is right and our genome is full of functional elements, does an onion have around five times as much non-coding DNA as we do? Or why pufferfishes can get by with just a tenth as much? Birney says the onion test is silly. While many genomes have a tight grip upon their repetitive jumping DNA, many plants seem to have relaxed that control. Consequently, their genomes have bloated in size (bolstered by the occasional mass doubling). “It’s almost as if the genome throws in the towel and goes: Oh sod it, just replicate everywhere.” Conversely, the pufferfish has maintained an incredibly tight rein upon its jumping sequences. “Its genome management is pretty much perfect,” says Birney. Hence: the smaller genome.

But Gregory thinks that these answers are a dodge. “I would still like Birney to answer the question. How is it that humans “need” 100% of their non-coding DNA, but a pufferfish does fine with 1/10 as much [and] a salamander has at least 4 times as much?” [I think Birney is writing a post on this, so expect more updates as they happen, and this post to balloon to onion proportions].
These criticisms only hint at the much larger problem; namely, the fact that Birney (and most science journalists) ignored years of evidence supporting junk DNA. They also ignored many papers in the scientific literature that challenge the conclusions of the ENCODE pilot project in 2007 and challenge many other papers claiming that transcription and DNA binding were evidence of function.
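
White's point about short binding sequences turning up by chance can be checked with some simple arithmetic. Here's a minimal sketch, not anything taken from the ENCODE papers. It assumes a genome of about 3.2 billion base pairs, equal frequencies of the four bases (a simplification, since real genomes aren't uniform), and motif lengths of 6, 8, and 10 bp chosen purely for illustration. The function names and the test motif "GATTAC" are hypothetical.

```python
import random


def expected_sites(genome_length: int, motif_length: int) -> float:
    """Expected occurrences of ONE specific motif on one strand of a
    uniformly random sequence: (number of start positions) / 4**k."""
    positions = max(genome_length - motif_length + 1, 0)
    return positions / 4 ** motif_length


def count_sites(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, allowing overlaps."""
    last_start = len(sequence) - len(motif)
    return sum(1 for i in range(last_start + 1) if sequence.startswith(motif, i))


if __name__ == "__main__":
    GENOME_LENGTH = 3_200_000_000  # assumption: approximate haploid human genome size

    # Expected chance matches for motifs of a few illustrative lengths.
    for k in (6, 8, 10):
        print(f"{k}-bp motif: about {expected_sites(GENOME_LENGTH, k):,.0f} chance sites per strand")

    # Sanity check on a small simulated random sequence (seeded for repeatability).
    rng = random.Random(0)
    sequence = "".join(rng.choice("ACGT") for _ in range(200_000))
    observed = count_sites(sequence, "GATTAC")  # hypothetical 6-bp motif
    print(f"simulated: observed {observed}, expected {expected_sites(len(sequence), 6):.1f}")
```

Under these assumptions a specific 6-bp sequence is expected hundreds of thousands of times in a genome of this size, which is why binding to such a sequence can't, by itself, be taken as evidence of function.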

There's nothing fundamentally new in the ENCODE results that we didn't know before. What's new is the spin that flies in the face of evidence.

The most interesting update comment is ...
Update (07/09/12 11:00): The ENCODE reactions have come thick and fast, and Brendan Maher has written the best summary of them. I’m not going to duplicate his sterling efforts. Head over to Nature’s blog for more.
We're going to take a look at that summary in my next post.


[Image Credit: The human karyotype is from the Ensembl website.]

16 comments:

  1. Larry, shouldn't "functional" in the last sentence of this quote be "nonfunctional"?

    "We've known about pseudogenes and we've known that almost 50% of our genome is littered with dead transposons and bits of transposons. We've known that about 3% of our genome consists of highly repetitive DNA that is not transcribed or expressed in any way. Most of this DNA is functional and a lot of it is not included in the sequenced human genome "

    Replies
    1. No, most of it is required for centromere function.

  2. The "plants have relaxed control"-argument seems untenable under the observation that even within different species of Onions the genome sizes vary by a factor of 5. Why do SOME species of Onion have five times as much non-coding as OTHER species of Onion?

  3. Shouldn't "nonfunctional" be more accurately replaced with "not known to have a function as of now"? How do you know what appears "nonfunctional" today will remain so?

    Why do SOME species of Onion have five times as much non-coding as OTHER species of Onion?

    Perhaps that is a question that can only be answered by thinking like an onion. As a human, who knows why? Isn't that a question that risks falling into concepts of teleology and design?

    I could not venture to propose to any other person so great an alteration of terms, but you I am sure will give it an impartial consideration, and if you really think the change will produce a better understanding of your work, will not hesitate to adopt it.

    It is evidently also necessary not to personify “nature” too much,—though I am very apt to do it myself,—since people will not understand that all such phrases are metaphors.

    Natural selection, is, when understood, so necessary & self evident a principle, that it is a pity it should be in any way obscured; & it therefore occurs to me, that the free use of “survival of the fittest”,—which is a compact & accurate definition of it,—would tend much to its being more widely accepted and prevent its being so much misrepresented & misunderstood.


    http://www.darwinproject.ac.uk/entry-5140#back-mark-5140.f5

    Seems to me that you can either have your assertion that the formal, scientific description of the very real phenomenon of evolution isn't teleological and directed or you can change the way you talk about it to prevent yourselves giving it in those terms. As Wallace warned Darwin in that letter:

    Combined with the enormous multiplying powers of all organisms, & the “struggle for existence” leading to the constant destruction of by far the largest proportion,—facts which no one of your opponents, as far as I am aware, has denied or misunderstood,—“the survival of the fittest” rather than of those who were less fit, could not possibly be denied or misunderstood. Neither would it be possible to say, that to ensure the “survival of the fittest” any intelligent chooser was necessary,—whereas when you say natural selection acts so as to choose those that are fittest it is misunderstood & apparently always will be. Referring to your book I find such expressions as “Man selects only for his own good; Nature only for that of the being which she tends”.f6 This it seems will always be misunderstood; but if you had said “Man selects only for his own good; Nature, by the inevitable “survival of the fittest”, only for that of the being she tends”,—it would have been less liable to be so.

    If you want to ask "why" in that way, be prepared for people to automatically think in terms of choices being made from among alternative possibilities.

    Not that "survival of the fittest" would work any better. I think it would be better to just say that questions of design and teleology are not answerable with science and leave it at that, since people, including materialists, constantly lapse into language that will strongly imply design and teleology. Or stop complaining when what was predicted in 1866 happens.

  4. Birney says the onion test is silly. While many genomes have a tight grip upon their repetitive jumping DNA, many plants seem to have relaxed that control. Consequently, their genomes have bloated in size (bolstered by the occasional mass doubling). “It’s almost as if the genome throws in the towel and goes: Oh sod it, just replicate everywhere.” Conversely, the pufferfish has maintained an incredibly tight rein upon its jumping sequences. “Its genome management is pretty much perfect,” says Birney. Hence: the smaller genome.


    Maybe it's me but I don't see how that answers the criticism. So the pufferfish has been lucky enough to evolve a near-perfect genome management system, well, good for the pufferfish. That doesn't mean that every organism is going to have the same good fortune. Evolution doesn't work that way. Personally, I would expect it to be more difficult to develop tight genome regulation the more complex the organism. In other words, why should we expect the human genome to exhibit a similar degree of genome regulation to the pufferfish? Why shouldn't we expect to see more junk?


  5. Trouble is that many of these "dead transposons" have been identified by the ENCODE consortium and other researchers as having a regulatory role, among other functions:

    http://www.nature.com/scitable/topicpage/transposons-or-jumping-genes-not-junk-dna-1211

    Replies
    1. Quantify "many" as a fraction of the total number of known transposons, counting both those with no known function and those we know have none.

  6. I don't understand all the fuss about these papers. Some journalists, strangely, presented them as proof against evolution. If the papers are correct, they actually show that evolution took place quite precisely, by means of millions of years of precise modifications.
    I also don't understand the war-like attitude of some writers, such as David Ropeik [posted earlier], who describe this as a blow to science, when science itself has shown (whether true or not) that the "junk" theory is overoptimistic.
    Now, my third point that I don't understand is, again, the big fuss about these papers. We already knew that there is more to DNA than the roughly 2% that is coding! We already knew that a huge chunk of the rest is functional in some way or another (as mentioned earlier in this post). DNA need not all be functional just because we think it should be (on which I completely agree with Dr. Moran). It also doesn't make sense that millions of bases would be docking sites for the 20,000 proteins available (proteins mainly have cellular functions, unless there are patterned regions that can be modified collectively). So is the whole debate is on the percentage of the "junk" or the definition of it ? Cells are highly efficient systems and from the first look, it's hard to assert that 98% is junk! But there are many, many evidence-based claims to support it. And there is no contradiction, from an evolutionary point of view, to sweep away such junk. Many DNA regions may control the differentiation of cells, or the level of gene expression. Onions and fishes have very different evolutionary trees, and this may be associated with their genes (interbreeding, natural cloning, etc.) and doesn't necessarily explain the complexity of a species or lack of it. Finally and most importantly, it never means that you need large numbers for complexity; it can easily arise from scarcity. You can build DNA with almost infinite outcomes using only four bases, or moreover build a whole universe from four forces and a hydrogen atom with atomic number = 1.

    Replies
    1. Now, my third point that I don't understand is, again, the big fuss about these papers.

      The fuss is quite unfortunate. There is a huge amount of great work of high scientific value that ENCODE has done, and if all that the package is remembered for continues to be "ENCODE claimed 80% of the genome is functional and there is no junk DNA; how could they have done so, that's not true!!!", all that work will not have the impact it should have. All that is especially unfortunate given that I've been in countless ENCODE conference phone calls and numerous ENCODE meetings and I don't remember the words "junk DNA" being mentioned or discussed even once - I don't think that's what was on anyone's mind when this was done.

      So is the whole debate is on the percentage of the "junk" or the definition of it ?

      What ENCODE has done is assign biochemical activity to the genome - if, for example, there was a way to comprehensively assay H3K9me3 (mark of tightly packed repressed heterochromatin, most of which would be real "junk"; the reason there is no way to do that is that only less than 80% of the genome is actually directly visible to short read technologies as the rest is not uniquely mappable), we would probably call that as "assigned biochemical activity" (in this case being very repressed) and count that towards the total. That's not at all the same as claiming it has function in the classical sense of the word. Now, when you run the numbers for all the functional genomic assays that ENCODE has done, you indeed get large percentages (the biggest contributor being transcribed regions, especially in the nuclear subcellular fractions) but from the point of view of assigning biochemical activity to regions in the genome, that's a good thing - that was the goal of the exercise after all. I don't think anybody is seriously claiming all of those base pairs have function in the classic sense.

      Cells are highly efficient systems and from the first look, it's hard to assert that 98% is junk!

      Cells are only as efficient as they need to be in order to survive. Which means that in practice they get quite inefficient. Also, there is a difference between general cellular metabolic efficiency and efficiency of information storage in the genome. Finally, there are organisms on this planet with genomes 40 times larger than ours, some of them unicellular; one has to deal with that fact.

    2. Cells are only as efficient as they need to be in order to survive.

      This is exactly the kind of statement that gets you boys into trouble. It gets mighty close to being tautological.

      How do you know they are "only" that efficient? How do you know they might, at times, be more efficient than they need to be to survive? How do you know that cells that are less efficient than cells are wouldn't survive? Only, in that case, they wouldn't be in the same kind of organism, would they?

      Claiming to know more than you do isn't exactly becoming to the claims of science as it's purported to be.

    3. All that is especially unfortunate given that I've been in countless ENCODE conference phone calls and numerous ENCODE meetings and I don't remember the words "junk DNA" being mentioned or discussed even once - I don't think that's what was on anyone's mind when this was done.

      That surprises me a bit since the entire ENCODE project was designed to find functions in the genome.

      Here's what they say on their website ...

      The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence.

      The experimental design of all of the assays was intended to identify functional regions of the genome, no?

      Are you telling me that nobody was talking about possible nonfunctional regions and how they would distinguish functional regions from junk DNA? That's quite shocking.

      Were any of the investigators aware of the controversy surrounding the publication of the pilot project? Didn't they take steps to address those issues?

      I don't think anybody is seriously claiming all of those base pairs have function in the classic sense.

      Given the controversy and confusion that accompanied the publication of the pilot project I would have been inclined to make this point very clearly and very forcibly in all the papers. I would have taken great pains to point out that just because some DNA binds a factor or is transcribed doesn't mean that it has a biological function. It could still be junk DNA.

      Do you think the authors did that?

    4. Given the controversy and confusion that accompanied the publication of the pilot project I would have been inclined to make this point very clearly and very forcibly in all the papers. I would have taken great pains to point out that just because some DNA binds a factor or is transcribed doesn't mean that it has a biological function. It could still be junk DNA.

      Do you think the authors did that?


      Depends on how you read it - I myself have a very hard time reading these things from the perspective of someone who sees them for the first time because I am not such a person; I can read something and know what it actually means, others may take it to mean something else. Obviously, you want as many people as possible to read it properly, and if that's not the case, the situation is suboptimal. I don't think anyone with extensive functional genomics experience would take the ENCODE results to mean what has been portrayed in the media. For those who have no idea what a ChIP-seq peak looks like, it is different.

    5. Here's what they say on their website ...

      Yes, functional elements. I don't see the words "junk DNA" there.

      Are you telling me that nobody was talking about possible nonfunctional regions and how they would distinguish functional regions from junk DNA? That's quite shocking.

      Were any of the investigators aware of the controversy surrounding the publication of the pilot project? Didn't they take steps to address those issues?


      I don't know what the PIs have been discussing, I'm just a graduate student who has been involved in parts of this; there are many levels on which these things are happening. What I was saying is that to my knowledge (nobody has been in all the phone calls, I haven't either, I may have missed it if it has happened) there hasn't been a consortium-wide discussion of the "We know junk DNA is a 'controversial' topic because the creationists will be all over it, science education will suffer, Larry Moran will go ballistic on us, we should be very careful what we're saying" sort. People's goals are identifying enhancers, promoters, chromatin states, characterizing the transcriptome in depth, understanding the relation between transcription factor binding and expression, building regulatory networks, improving functional genomic assays, etc. That's more than enough to keep one fully occupied and there is not that much time for these more general topics. Again, I have to be very careful not to say something that does not correspond to reality - I have no way to know who has discussed what with whom and when if I have not been present, and I am not speaking for the consortium as a whole here; that's just based on what I have seen and been part of personally.

    6. People's goals are identifying enhancers, promoters, chromatin states, characterizing the transcriptome in depth, understanding the relation between transcription factor binding and expression, building regulatory networks, improving functional genomic assays, etc. That's more than enough to keep one fully occupied and there is not that much time for these more general topics.

      What I don't understand is how they could be so interested in identifying enhancers and promoters without being aware of the fact that there are many false positives.

      I think it's rather sad that graduate students and PIs are so absorbed in the technology that they don't spend any time thinking about the significance of their results. We see this in my department as well in those groups that are doing massive screenings.

      You should advise your group that it's time to make time for these more general topics. That's where the real science is. The rest is stamp collecting.

      Is your group now discussing this debate or do they still not have time?

    7. People are well aware of the false positives, my point was that those numbers everyone is fixating on are a very small part of what ENCODE has done and is doing, but are a huge source of misunderstanding, which is very unfortunate because the actual meaningful work gets lost as a result.

  7. Thanks for acknowledging the update, Larry.
