Tuesday, October 16, 2018

John Mattick's latest attack on junk DNA

John Mattick is the most prominent defender of the idea that the human genome is full of functional sequences. In fact, he is just about the only scientist of any prominence who's on that side of the debate. His main "evidence" is the fact that genomes are pervasively transcribed and that most of the transcripts are functional. Let's look at his latest review paper to see how well this argument stands up to close scrutiny (Mattick, 2018).1

As you read this post, keep in mind that in 2012 John Mattick was awarded a prize by the Human Genome Organization for proving his hypothesis [John Mattick Wins Chen Award for Distinguished Academic Achievement in Human Genetic and Genomic Research].
The Award Reviewing Committee commented that Professor Mattick’s “work on long non-coding RNA has dramatically changed our concept of 95% of our genome”, and that he has been a “true visionary in his field; he has demonstrated an extraordinary degree of perseverance and ingenuity in gradually proving his hypothesis over the course of 18 years.”
Mattick follows his usual format by giving us his version of history. He has argued for the past 15 years that the scientific community has been reluctant to accept the evidence of massive amounts of regulatory RNA genes because it conflicts with the standard paradigm of the supremacy of proteins. In the past he has claimed that this paradigm is based on the Central Dogma which states, according to him, that the only real function of DNA is to make proteins [How Much Junk in the Human Genome?]. As we shall see, he hasn't abandoned that argument but at least he no longer refers to the Central Dogma for support.

Mattick is also famous for arguing that there's a correlation between genome size and complexity; notably in a 2004 Scientific American article (Mattick, 2004) [Genome Size, Complexity, and the C-Value Paradox]. That's the article that has the famous Dog-Ass Plot (left) with humans representing the epitome of complexity and genome size. He claims that this correlation is evidence that most of the genomes of complex animals must have a function. He repeats this claim in his latest paper (see below).

He begins his review by pointing out that regulatory RNAs were discovered in the 1990s when the mechanism of RNA interference was worked out (RNAi).2 He notes that a role for some lncRNAs was also demonstrated in the 1990s; notably, a lncRNA involved in X-chromosome inactivation (Xist). The next major discovery was the fact that there are tens of thousands of RNAs produced in various tissues. This leads to the following ...
However, despite the RNAi precedent, and with few exceptions, the existence of these uncharacterized lncRNAs was initially ignored or dismissed as “transcriptional noise”. Not only was it unclear how they might fit into the existing conceptual framework of genetic information and gene regulation, assumed to be transacted by proteins acting in combinatoric fashion, their sheer number, if functional, threatened the primacy of this framework, which has long been an article of faith in molecular, cellular and developmental biology.
This is a classic example of a paradigm shaft. It's simply not true that there was some "article of faith" that attributed all regulation to proteins and excluded regulatory RNAs. The idea that most of the transcripts were noise was based on real data, not dogmatic resistance.
The possibility that lncRNAs might be functional also contradicted the widely held belief, dating from the late 1970s, that the intronic and ‘intergenic’ sequences from which they are transcribed, and which dominate the real estate of the mammalian genome, are largely evolutionary debris (‘junk’), comprised of hangovers from the prebiotic assembly of “genes” expanded by accumulation of retrotransposon parasites (“selfish DNA”).
The concept of massive amounts of junk DNA was not some sort of "belief" from the 1970s. It's based on solid evidence such as genetic load arguments; the resolution of the C-value paradox; the discovery of introns (junk); our understanding of Neutral Theory; the lack of conservation of much of the genome; and the discovery that much of our genome was littered with the debris of former transposons and genes (i.e. pseudogenes) [Five Things You Should Know if You Want to Participate in the Junk DNA Debate]. All of this evidence has to be refuted if you are attacking junk DNA. You can't just dismiss it out-of-hand as some sort of prejudice based on a false premise.

Mattick now realizes that another reasonable explanation for pervasive transcription is that most transcripts are just "noise" produced by mistakes in transcription. He seems to be aware of the criticisms of people like my colleagues Palazzo and Lee (2015) and by others like Kopp and Mendell (2018) [see How many lncRNAs are functional?].3 They argue that the default explanation for these transcripts is that they are non-functional junk RNA and the onus is on proponents of function to provide evidence that a huge percentage of them are functional. They point out that mistakes in transcription have been well-documented in the scientific literature so that the idea of noise, or junk RNA, is not just hypothetical but based on real data. They argue that most of these transcripts are present at concentrations too low to be functional and most are not conserved [Functional RNAs?]. This is perfectly consistent with transcriptional noise and inconsistent with function.

Let's see how Mattick deals with these arguments.
The idea that the most non-coding RNAs are noise from biologically inert regions of the genome was superficially bolstered by the observation that most are expressed at low levels and are generally less conserved than protein-coding sequences .... This is a circular argument of dubious merit, since there is increasing evidence that retrotransposon-derived sequences have been exapted for various functions and coopted as mobile modules to alter the patterns of gene expression.
This makes no sense. It's a fact that some transposon-related sequences have been secondarily exapted as regulatory sequences. That has no bearing on the fact that the majority of transcripts are present at very low concentrations and are not conserved. It's very unlikely that you could have a single functional regulatory RNA at a concentration less than one molecule per cell. It's extremely unlikely that you could have tens of thousands of such regulatory RNAs.
The comparison of the rate of divergence of an extant set of ancient repeats also does not include the (unknown) number that have diverged to the point of unrecognizability, which therefore underestimates the rate of their presumed neutral evolution, and the extent of evolutionary selection on the genome.
I think I understand what this means. I think he's saying that some sequences don't look conserved because they appear to be changing at the rate of neutral evolution but, in fact, the actual rate of neutral evolution is much greater than scientists realize so these sequences are actually evolving more slowly than the real neutral rate. In other words, they are under some low level of selective constraint. This means they are functional.

Mattick backs up this claim by referencing papers from his own lab. He claims to have shown that "at least 45% of the alignable regions of the mammalian genome are not evolving neutrally" (Oldmeadow et al., 2010). He also claims that "18% of the mammalian genome [is] conserved at the level of predicted RNA structure" (Smith et al., 2013).

I don't think this is a valid argument and I'm skeptical of his data but I leave it to the experts to deal with it. As far as I know, the overall divergence of genome sequences (e.g. chimpanzees and humans) is pretty much what you expect from the neutral rate suggesting that a huge percentage of the genome is evolving neutrally [Calculating time of divergence using genome sequences and mutation rates (humans vs other apes)].
The (lack of a high) conservation argument also fails to take into account the fact that adaptive radiation occurs mainly by the relatively rapid evolution of the regulatory sequences under positive selection, that such sequences have quite different structure–function constraints to proteins, and that they are subject to rapid turnover. Thus, many lncRNAs are likely to be lineage-specific.
This is a very common argument that's often used to rationalize the lack of sequence conservation. There's some validity to the argument because there are known examples of genes that evolved recently in a particular lineage. By definition these de novo genes are not conserved even though they are functional. However, what Mattick is proposing is that there are tens of thousands of these newly evolved genes in humans. Presumably every other mammalian species has evolved a similar number of new genes. I don't think there's any known mechanism that could select so rapidly for such a huge number of new functions covering most of the genome.4

It's true that there are some examples of putative noncoding genes that appear to be slightly conserved raising the possibility that they have become functional in the recent past but most of the cases are ambiguous and, furthermore, there are only a handful of examples. This challenge to the gold standard of sequence conservation is not new. It's trotted out in order to avoid the obvious conclusion whenever the conservation test for functionality fails [Ad hoc rescue]. I'm reminded of a paper from back in 2004 that addressed the issue (Wang et al., 2004). What the authors said back then is still valid today.
Given that all of the best techniques for detecting RNA genes depend on sequence conservation, the absence of this cannot be summarily dismissed, even if isolated examples of RNA genes being weakly conserved can be found. Extraordinary claims require extraordinary proof — this is particularly true when much of the data support an alternative interpretation that they are simply non-functional cDNAs.
Now we get to what I think is the heart of Mattick's argument for function. It's the same argument he used when he drew the dog-ass plot back in 2004.
At this point it is important to remember that the metazoan proteome is remarkably static. Both the nematode and human genomes contain ~20,000 protein-coding genes, most of which are functionally orthologous, despite orders of magnitude difference in their developmental (and cognitive) complexity. By contrast, the proportion of the genome that is non-protein-coding, and the number and range of non-coding RNAs expressed therefrom, increases with developmental complexity, raising the obvious possibility that these sequences are responsible for specifying developmental complexity and phenotypic diversity.
This is what I call the Deflated Ego Problem. People like John Mattick are convinced that humans are much more complex than, say, nematodes, and this calls out for an explanation in terms of the number of genes. They were disappointed when they learned that humans don't have very many more genes than less complex species so they have been proposing a number of explanations to restore their deflated ego. In Mattick's case, the explanation is that the number of protein-coding genes doesn't reflect complexity; instead, it's the massive increase in noncoding RNA genes that makes humans special.

There are four problems with this view.
  1. The correlation between genome size and complexity is spurious. If true, onions should be more complex than humans and so should some rats.
  2. There's no evidence that mammalian genomes contain tens of thousands of newly evolved noncoding RNA genes and certainly no evidence that they specify complexity.
  3. The field of evo-devo has demonstrated conclusively that significant changes in phenotype and complexity can be achieved by simply altering the timing of expression of the core genes. These differences are due almost exclusively to differences in transcription factors and transcription factor binding sites. Very few examples of regulatory RNAs have been discovered by workers in this field. The differences between, say, a whale and a wallaby, can be explained by what we already know without the need to postulate extraordinary new mechanisms. In other words, there's no great mystery that needs to be solved.
  4. The other solution to the C-value paradox is junk DNA. The idea that most of our genome is junk is well-supported by real data and it has much more explanatory power than Mattick's speculation. There's no reason to reject a perfectly good model in favor of one that lacks evidence and doesn't explain genome size differences or complexity.
One of the most common arguments for function is that the transcripts are only expressed in some cell types and not others. The claim is that specificity indicates regulation and regulation indicates function. This argument was refuted two decades ago when it was pointed out that spurious transcripts would exhibit the same pattern. Accidental transcription is triggered by the inappropriate binding of transcription factors to regions of junk DNA. Since different cell types have different transcription factors, it follows that each cell type will have its own collection of spurious transcripts (noise, junk RNA). Thus, the observed specificity of low concentration, non-conserved, unstable RNA transcripts is not an argument for function.

Proponents of function continue to ignore this point in their papers. Here's how Mattick uses the phoney argument in a section entitled "Evidence of Long Non-Coding RNA Functionality."
The case for lncRNA functionality is also supported by their dynamic expression patterns in differentiating cells and their highly specific spatial (including subcellular) localization, especially in the brain, which also explains their low abundance in RNAseq analyses of whole tissues. Indeed, high-resolution analyses using RNA capture technologies have revealed an extraordinary diversity of lncRNAs, most of which are likely to be cell-specific, and which have yet to be catalogued or characterized. Perhaps the most intriguing are the 3’UTR-derived lncRNAs that are expressed separately from, and appear to convey differentiation signals independently of, their normally associated mRNAs.
The next few paragraphs cover some known or strongly suggestive examples of functional RNAs involved in regulation. This is standard stuff in papers arguing for function. It's a red herring in the sense that nobody questions the role of some regulatory RNAs—after all, they've been known for more than 40 years. The real question is not whether there are some regulatory RNA genes but whether there are massive numbers (tens of thousands) of newly evolved noncoding genes in every mammalian lineage.

Pointing to a few known examples and then using them as an argument that every other putative RNA gene must also be real is a form of fallacious argument that deserves a name. A quick bit of sleuthing on the internet reveals that it already has a name—it's the Association Fallacy, also known as guilt by association. The fallacy is illustrated by the diagram on the right.

Imagine that you are comparing two sets; B and C, where B represents the set of functional genes and C represents the set of all known transcripts. You identify a subset of the transcripts that come from functional genes (A). The fallacy lies in assuming that because A exists it follows that all of C is also a subset of B.

In this case the fallacy is compounded by the fact that we already know of a subset of C that does not overlap with B. An example would be transcripts of a pseudogene. The only valid form of argument using this data would be to say that we know spurious transcripts exist and we know functional regulatory RNAs exist. What we don't know is how many transcripts fall into each category.
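
The set logic here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the locus names are invented, and the three sets are toy stand-ins for B (functional genes), C (loci that produce a known transcript), and A (transcripts already shown to come from functional genes).

```python
# Toy illustration of the association fallacy using Python sets.
# All locus names are hypothetical; none come from Mattick's paper.

# B: loci that are functional genes
functional_genes = {"geneX", "geneY", "geneZ"}

# C: loci that produce a known transcript
transcribed_loci = {"geneY", "geneZ", "pseudogene1", "intergenic1"}

# A: transcripts that have been shown to come from functional genes
known_functional_transcripts = {"geneY", "geneZ"}

# A really is a subset of both B and C ...
a_in_b_and_c = known_functional_transcripts <= (functional_genes & transcribed_loci)

# ... but that says nothing about whether all of C lies inside B.
c_subset_of_b = transcribed_loci <= functional_genes

# The part of C outside B: spurious transcripts, e.g. from a pseudogene.
spurious = transcribed_loci - functional_genes

print(a_in_b_and_c)      # True
print(c_subset_of_b)     # False
print(sorted(spurious))  # ['intergenic1', 'pseudogene1']
```

In real data the relative sizes of these sets are exactly what is in dispute. The sketch only shows that the existence of A, by itself, cannot settle the question.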

Contrast that with what John Mattick says ...
While skeptics remain, the most likely interpretation is that the documented functional examples are emblematic of an army of regulatory RNAs that guide epigenetic trajectories and specify cell state during a very complex and precise developmental ontogeny—from a single fertilized cell to a mobile, cognizant adult—and that most of the human genome is devoted to this purpose. Indeed, the proportion of the mammalian genome devoted to cognitive function, rather than body plan development, may be considerably underestimated, given the preponderance of lncRNA expression in the brain. Not surprisingly then, many lncRNAs are primate-specific.

Indeed, the growing body of evidence is now leading to a general acceptance of the relevance of (many or most) lncRNAs to cell and developmental biology, and increasingly neurobiology, with the debate, such as it remains, shifted to the proportion of lncRNAs that may be biologically relevant. For me, the best indicator, although by no means proof, is their precise expression patterns, on which basis one can project that most are likely to be functional.

If so, the current protein-centric framework for understanding the genetic programming of differentiation and development is incomplete, a legacy of the mechanical worldview that held sway at the birth of molecular biology. Reconsideration of this framework to incorporate not only proteins but also structural and regulatory RNAs is overdue.
I'm not convinced and it's difficult to imagine how Mattick could possibly defend this point of view in a debate with experts on junk DNA [The great junk DNA debate].

  1. This is a 12 page paper but 8 pages are taken up by references.
  2. This is false history. Regulatory RNAs were discovered in the 1970s.
  3. Neither of these papers appears in the list of 149 references.
  4. Can any experts confirm this speculation? Is natural selection strong enough to do the job?

Kopp, F., and Mendell, J.T. (2018) Functional Classification and Experimental Dissection of Long Noncoding RNAs. Cell, 172:393-407. [doi: 10.1016/j.cell.2018.01.011]

Mattick, J.S. (2004) The hidden genetic program of complex organisms. Sci Am. 291:60-67.

Mattick, J. (2018) The State of Long Non-Coding RNA Biology. Non-Coding RNA, 4:17-28. [doi: 10.3390/ncrna4030017]

Palazzo, A.F., and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in Genetics, 6. [doi: 10.3389/fgene.2015.00002]

Wang, J., Zhang, J., Zheng, H., Li, J., Liu, D., Li, H., Samudrala, R., Yu, J., and Wong, G. K.-S. (2004) Mouse transcriptome: neutral evolution of ‘non-coding’ complementary DNAs. Nature, 431: [doi: 10.1038/nature03016]


Georgi Marinov said...

One thing to note:

It is the year 2018, and people have been talking about this great genome-wide sea of ncRNAs for 15-20 years now.

One would have thought there would be more well characterized examples by now than what we have actually seen in practice.

Also, when one does genome-wide CRISPR screens targeting lincRNAs, the results are usually disappointing to say the least no matter what phenotype one screens for.

Mikkel Rumraket Rasmussen said...

Ironically, John Mattick seems to be a bad philosopher, and exhibits some of the same symptoms that usually afflict creationists. In particular, he seems to be committed to some-therefore-all, or black-white, types of thinking. No place for nuance. And don't even get me started on the crappy "article of faith" statement.

This black-white thinking shows particularly in the way he deals with the idea that in general you can estimate whether a locus is under positive selection by its degree of conservation, using sequence comparisons from closely related species. It is as if Mattick is saying that conservation can't come in degrees, or that the argument necessarily implies that exceptions cannot exist.

The conservation argument gives an indication. It was never asserted to be conclusive proof that a locus is functional or not, nor that there are no lineage-specific adaptations. But he sort of sets it up that way when he knocks it down.

He states that the conservation argument is "circular", but does not point out where in the argument the circularity lies. I don't see it, and again he fails to cite anyone engaging in circular reasoning (the one reference he gives is to a paper he co-authored).

He claims that initially the existence of uncharacterized lncRNAs was "ignored" or "dismissed as transcriptional noise", but gives no references to support that claim.

Mikkel Rumraket Rasmussen said...

I mean under purifying selection.

Larry Moran said...

As you know, absence of evidence is usually not the same as evidence of absence but in this case I think you make a good point. Proponents of massive amounts of regulatory RNAs are fond of pointing to various examples culled from a number of different species. You would think that the number of proven examples should increase rapidly from year to year but that’s not happening. I think that’s significant because it’s consistent with the hypothesis that the transcripts are junk.

I have a slightly different way of looking at it. If Mattick and his friends are correct then expression of the average gene should be regulated by several different regulatory RNAs. But many genes have been studied intensely by multiple labs over the past thirty years and none of those labs have stumbled upon a network of regulatory RNAs controlling their favorite gene.

We do not see these regulatory RNAs controlling expression of the enzymes of the citric acid cycle or glycolysis, for example. Nor do we see them in the well-studied examples of developmental genes. Those mysterious regulatory RNAs should be popping up everywhere but they’re not.

This is why Mattick is now falling back on expression in brain cells in his latest review. Now he’s implying that these transcripts have something to do with thought and consciousness. I don’t know how he explains pervasive transcription in plants.

br56u7 said...

I remember an earlier response on this blog to Mattick 2013, specifically against the point that transcription was cell specific. You stated that if random and stochastic transcription were to be predominant throughout the genome, then the transcription would have to be cell specific and developmentally regulated. What I'm interested in is your response to Qian 2016, which looked at where proteins bind in 75 organisms (including humans) and found that most transcription factor proteins avoided weak binding sites. What is your counter to this?

Mikkel Rumraket Rasmussen said...

Uhm, it almost sounds like the question answers itself. What explains that transcription factor proteins mostly avoid weak binding sites? That they are weak!

Jack said...


Reading "Blueprint: How DNA Makes Us Who We Are", the latest book by Robert Plomin, I have found a few fragments that confused me a little:

"Most mutations are in the other 98 per cent of DNA that does not code for a change in amino-acid sequence and used to be called ‘junk DNA’ because it is not translated into amino-acid sequences. Even within genes like the FTO gene, most of the DNA does not code for proteins. These non-coding stretches of genes, or introns, are spliced out of the RNA code before the RNA is translated into proteins. The remaining RNA segments, or exons, are spliced back together and proceed to be translated into amino-acid sequences.
We are still learning about the many ways in which mutations in these non-coding differences in DNA sequence make a difference. What we do know is that they do make a difference. Some research suggests that as much as 80 per cent of this non-coding DNA is functional, in that it regulates the transcription of other genes. This distinction is important because most DNA associations with psychological traits involve SNPs in non-coding regions of DNA rather than in classical genes."

And this:

"We used to think that this RNA message was always translated into amino-acid sequences, which are the building blocks of all proteins. However, DNA transcribed into RNA and translated into amino-acid sequences accounts for only 2 per cent of all DNA. These are the 20,000 classical genes mentioned earlier. Is the other 98 per cent of DNA junk? We now know that as much as half of all DNA cannot be junk, because it is transcribed into RNA even though it is not translated into RNA. Instead of being called junk DNA, it is called non-coding DNA because it does something, even though it does not code for amino-acid sequences. One reason why it must be important is that at least 10 per cent of this non-coding DNA is the same across related species, suggesting that it has some adaptive function because it has been conserved evolutionarily. Other more direct research suggests that as much as 80 per cent of this non-coding DNA is functional, in that it regulates the transcription of other genes. This new way of thinking about ‘genes’ is important because many DNA associations with complex traits are in these non-coding regions of DNA."

Of course the "translated into RNA" phrase seems to be a typo but I would like to know your opinion on the rest of those fragments.

"Unlike any other predictors, polygenic scores are just as predictive from birth as from any other age because inherited DNA sequence does not change during life."

Is it really so? What about somatic mutations? I read somewhere that in an adult there are about 5000 somatic mutations per cell and that their number grows as we get older. Can't they hit the places where the inherited sequences are and change them?

Larry Moran said...

I haven't yet read this book. The author sounds like just another scientist who doesn't really understand the topic he's writing about.

I find this puzzling. It's 2018 and a simple google search will uncover a ton of criticism of the views you quote. Why didn't Robert Plomin do a bit of research on the validity of the ENCODE results and the evidence for junk DNA before publishing his book?

Robert Plomin is a psychologist. Here's a brief bio of him from Wikipedia ...

In 2002, the Behavior Genetics Association awarded him the Dobzhansky Memorial Award for a Lifetime of Outstanding Scholarship in Behavior Genetics. He was awarded the William James Fellow Award by the Association for Psychological Science in 2004 and the 2011 Lifetime Achievement Award of the International Society for Intelligence Research. In 2017, Plomin received the APA Award for Distinguished Scientific Contributions. Plomin was ranked among the 100 most eminent psychologists in the history of science. In 2005, he was elected a Fellow of the British Academy (FBA), the United Kingdom's national academy for humanities and social sciences.

How is it possible for someone like this to be unaware of the controversy about junk DNA? Or is he aware of views that differ from his own and chooses to ignore them? How do we reconcile the gross ignorance behind the passages you quote and the fact that he's "among the 100 most eminent psychologists in the history of science"?

Christopher said...

Odd coincidence here. An old urban myth holds that "you only use 10% of your brain." Actual neurologists have frequently rebutted this, but the idea continues to circulate. There was even a recent Hollywood movie based on it -- the central character learns to use her whole brain at last and she can kick the arse of the baddies. Yeah!

Anyway, you say -- I am utterly unqualified to have an opinion and am not quarreling -- that 90% of DNA is essentially unused. To a layperson, it sounds like the same notion, transcribed from macrobiology to microbiology.

Jack said...

It's not so easy as it may seem at first sight. Look here:

Jack said...

From "Blueprint".

"Some random mutations in our DNA occur as time goes by, but the thousands of SNPs that are used to create polygenic scores will not change significantly. DNA can be damaged with aging, especially exacerbated by smoking, but this is also unlikely to affect polygenic scores"

What is the number of "some random mutations"? Michael Lynch says here that:
"an average adult cell will contain about 100 X 50 = 5000 de novo mutations. Although these will not all be independent, with about 10^13 cells in the human body, the total number of mutations carried by an adult will then be of order 10^16 with every nucleotide site having been mutated in thousands of cells."

So "some" means 10^16.

br56u7 said...

Well yes, obviously, but I meant how do you explain this if you believe that most of the genome is junk.

Mikkel Rumraket Rasmussen said...

I don't think the proteins care whether their binding activity yields a useful biological function or not. They just bind in accordance with the affinity they have to the specific sequence of DNA (in some places they bind more strongly, in others less so), and that's that. Whether that binding results in successful transcription, and whether the transcribed product is beneficial in some way, is a byproduct of the binding activity. The DNA has to be neither functional nor junk for binding to take place.

br56u7 said...

Thank you for your response

The Other Jim said...

40% or more of our genome is LINE elements and other transposable elements. Another 8-10% is from other viruses. These are mostly non-functional, with a few active ones, and maybe an even smaller number co-opted into some function.

If the DNA was all mysterious sequence of unknown origin, we would concede your point. But almost half is dead parasites...

See the previous post on this site...

Joe Felsenstein said...

And there's the Onion Test, and there's the issue of mutational load.

The Other Jim said...

I often have trouble explaining the mutation load argument to those unfamiliar with the basics of molecular evolution. I need more practice ;-)

Larry Moran said...

Here's my shot at explaining genetic load (mutation load).

Revisiting the genetic load argument with Dan Graur