Friday, April 15, 2022

Most lncRNAs are junk

A hard-hitting review will be published in Annual Review of Genomics and Human Genetics. It shows that the case for large numbers of functional lncRNAs is grossly exaggerated.

A long-time Sandwalk reader (Ole Kristian Tørresen) alerted me to a paper that's coming out next October in Annual Review of Genomics and Human Genetics. (Thank you, Ole.) The authors of the review are Chris Ponting from the University of Edinburgh (Edinburgh, Scotland, UK) and Wilfried Haerty at the Earlham Institute in Norwich, UK. They have been arguing the case for junk DNA for the past two decades but most of their arguments are ignored. This paper won't be so easy to ignore because it makes the case forcefully and critically reviews all the false claims for function. I'm going to quote a few juicy parts because I know that many of you will not be able to access the preprint.

Ponting, C.P. and Haerty, W. (2022) Genome-Wide Analysis of Human Long Noncoding RNAs: A Provocative Review. Annual Review of Genomics and Human Genetics 23. [doi: 10.1146/annurev-genom-112921-123710]

Do long noncoding RNAs (lncRNAs) contribute little or substantively to human biology? To address how lncRNA loci and their transcripts, structures, interactions, and functions contribute to human traits and disease, we adopt a genome-wide perspective. We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims. We discuss pitfalls of lncRNA experimental and computational methods as well as opposing interpretations of their results. The majority of evidence, we argue, indicates that most lncRNA transcript models reflect transcriptional noise or provide minor regulatory roles, leaving relatively few human lncRNAs that contribute centrally to human development, physiology, or behavior. These important few tend to be spliced and better conserved but lack a simple syntax relating sequence to structure and mechanism, and so resist simple categorization. This genome-wide view should help investigators prioritize individual lncRNAs based on their likely contribution to human biology.

LncRNAs are operationally defined as transcripts that are longer than 200 nucleotides and do not encode protein (long-noncoding RNAs). Many of them are capped and polyadenylated suggesting that they are RNA polymerase II transcripts.

The names of most RNA classes are used to describe RNA with a specific function (tRNA, microRNA, siRNA, snoRNA etc.) but that's not the case with lncRNAs because the majority of lncRNAs have no known function. This causes a bit of confusion since many people assume that whenever you describe something as a lncRNA it must have a function. It would be better to simply refer to these candidates as transcripts until a function has been assigned, but that's a losing battle.

Ponting and Haerty begin their review with an overview of some of the problems in the field of lncRNA.

Taking a gene-centric perspective on lncRNAs raises the problem that a lesson learned from one locus is rarely relevant to others. Our deep functional understanding of Xist—the master regulator of X chromosome inactivation—for example, has not aided investigation of tens of thousands of annotated lncRNAs. Whenever researchers propose a lncRNA’s mechanism, or its involvement in a pathology such as cancer, almost inevitably they herald this as revealing a new paradigm, one that possibly explains the mode of action of many other lncRNAs. Hundreds of publications state that lncRNAs are emerging as important regulators, elements, or components, and 30% of published reviews on lncRNAs since 2012 employed the term emerging. The implication is that lncRNAs are now being revealed almost as bright-colored butterflies, rather than plain-colored chrysalises. Nevertheless, very few lncRNAs have high-quality evidence for such colorful claims. Instead, low-quality evidence abounds, in part because the lncRNA literature has been contaminated by hundreds of paper-mill publications (Zhou et al., 2020) but also because molecular and cellular observations—such as RNA–molecule interactions and gene expression changes—are often deemed important without sufficient evidence.

As well as acclaiming hard-won advances in human lncRNA biology, it is critical that we recognize the field’s substantial knowledge gaps. The ubiquity of lncRNAs within and across eukaryotic species has led some to describe lncRNAs as major actors that contribute substantially to most cellular processes and whose RNA sequence variation will ultimately be recognized as greatly altering human traits and disease susceptibility (Mattick, 2009). Faced with the same evidence, others view the vast majority of lncRNAs as nonfunctional, spurious by-products of transcription (Palazzo and Lee, 2015). The truth lies across these two extremes: Some transcripts will lack RNA sequence–dependent function, whereas others will harbor variants that predispose individuals to disease.

The Zhou et al. (2021) reference was carefully chosen because here's how those authors begin their paper.

The discovery of one order of magnitude more transcripts coded for RNAs (i.e. non-coding RNAs) than proteins provided a paradigm shift in our understanding of genome regulation. Long non-coding RNAs (lncRNAs), in particular, have emerged as key players in essentially every biological process and associated with many human diseases including cancer, cardiovascular and neurodegenerative diseases.

With roughly 20,000 protein-coding genes in the human genome, "one order of magnitude more" amounts to a claim for about 200,000 lncRNA genes. It illustrates the kind of rhetoric that has become common in the scientific literature, and the fact that it was approved in the review process shows you what we're up against.

Ponting and Haerty note that there are at least 270,000 lncRNAs in some databases but only 17,944 are listed in version 38 of GENCODE. The latest version of GENCODE (release 40) has 18,805 lncRNA genes and 7,567 genes for small non-coding RNAs. If true, this means that there are more noncoding genes than protein-coding genes. Ponting and Haerty go on to note that there's a lot of hype associated with these numbers.
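The comparison in the paragraph above is simple arithmetic, and a few lines of Python make it explicit. This is only a sketch: the GENCODE and database counts are the ones quoted in this post, while the round figure of 20,000 protein-coding genes is an assumption added for illustration.

```python
# Counts quoted in the post.
LNCRNA_GENES_GENCODE_40 = 18_805       # GENCODE release 40
SMALL_NCRNA_GENES_GENCODE_40 = 7_567   # GENCODE release 40
LNCRNA_GENES_GENCODE_38 = 17_944       # GENCODE version 38
LNCRNAS_IN_SOME_DATABASES = 270_000    # "at least 270,000"

# ASSUMPTION: a round number for human protein-coding genes, not taken from the post.
PROTEIN_CODING_GENES = 20_000

# Total annotated noncoding genes in release 40.
noncoding_total = LNCRNA_GENES_GENCODE_40 + SMALL_NCRNA_GENES_GENCODE_40

# How many noncoding genes per protein-coding gene, if every annotation is real.
noncoding_per_coding = noncoding_total / PROTEIN_CODING_GENES

# How much larger the biggest database claims are than the GENCODE 38 annotation.
database_inflation = LNCRNAS_IN_SOME_DATABASES / LNCRNA_GENES_GENCODE_38

print(f"Noncoding genes in GENCODE 40: {noncoding_total:,}")
print(f"Noncoding per protein-coding gene: {noncoding_per_coding:.2f}")
print(f"Database count relative to GENCODE 38: {database_inflation:.1f}x")
```

The point of the calculation is only that "more noncoding genes than protein-coding genes" follows from the annotation counts if, and only if, you accept that every annotated locus is a real gene.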

To elevate the importance of lncRNA loci even further, some researchers claim that “the large majority of the human genome is transcribed into nonprotein-coding RNAs,” whereas “only ∼1.2% of the human genome encodes for protein-coding genes” (e.g., Johnsson et al., 2014: p. 1063). The truth, however, is less impressive: Human lncRNA exons span at most 2.3% of the human genome, and most intergenic RNA arises from transcription that is initiated within protein-coding genes. Moreover, most lncRNAs are expressed at low levels. These low levels mean that even if, very optimistically, the number of lncRNA loci is 10-fold greater than the number of protein-coding genes, their molecular output is considerably smaller. A claim that “around 98% of all transcriptional output in humans is non-coding RNA” (Mattick, 2001: p. 986) is plausible only when this includes intronic nucleotides of protein-coding gene transcripts.

This is important because it serves to illustrate a remarkable lack of scientific rigor among lncRNA scientists. They often make statements that just don't make sense. Ponting and Haerty don't mention the fact that these statements are approved during peer review but I think that's an important point that we should not ignore. (Protein-coding genes occupy about 40% of the genome but most of that is introns. It means that 40% of our genome will be transcribed even if there were no noncoding genes. Every scientist should know this.)
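A back-of-the-envelope calculation shows why the "98% of transcriptional output is noncoding" figure only works if introns are counted. The sketch below uses the two numbers quoted above (protein-coding genes spanning about 40% of the genome, and about 1.2% of the genome being protein-coding). It assumes, purely for illustration, that every protein-coding gene is transcribed evenly along its whole length and that UTRs and expression levels can be ignored. The function name is made up for this example.

```python
# Figures quoted in the post.
GENE_SPAN_FRACTION = 0.40   # share of the genome inside protein-coding genes
CODING_FRACTION = 0.012     # share of the genome that is protein-coding ("~1.2%")


def noncoding_share_of_primary_transcripts(gene_span: float, coding: float) -> float:
    """Fraction of nucleotides in primary transcripts that are not coding.

    ASSUMPTION: uniform transcription across each gene, so nucleotide shares
    in primary transcripts mirror nucleotide shares in the genome.
    """
    return 1.0 - coding / gene_span


share = noncoding_share_of_primary_transcripts(GENE_SPAN_FRACTION, CODING_FRACTION)
print(f"Noncoding share of protein-coding primary transcripts: {share:.0%}")
```

Under those assumptions, almost all of the nucleotides transcribed from protein-coding genes are intronic, which lands in the neighborhood of the quoted figure without invoking a single noncoding gene. Introns are spliced out and degraded quickly, so the noncoding share of mature RNA is much lower.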

LncRNA proponents often claim that the abundance of noncoding RNA genes explains human evolution, human complexity, and cognition. These arguments are "anthropocentric," according to Ponting and Haerty, and they often fail because other, less-complex, organisms have just as many hypothetical noncoding genes. Ponting and Haerty publish a table of all the logical fallacies in the lncRNA literature. Here's the list.

I can think of other logical fallacies but the point is well taken. It's about time that we started to call out scientists who are guilty of false logic.

Expression and the null hypothesis

Most lncRNAs are present at less than one copy per cell and this can easily be explained as transcription noise.

The cellular transcriptional machinery does not perfectly discriminate cryptic promoters from functional gene promoters. This machinery is abundant and so can engage sites momentarily depleted of nucleosomes and rapidly initiate transcription. The chance occurrence of splice sites can then facilitate the capping, splicing, and polyadenylation of long transcripts. A very large number of such rare RNA species are detectable in RNA-sequencing experiments whose properties are virtually indistinguishable from those of bona fide lncRNAs. Consequently, “a sensible [null] hypothesis is that most of the currently annotated long (typically >200 nt) noncoding RNAs are not functional, i.e., most impart no fitness advantage, however slight” (Ulitsky and Bartel, 2013: p. 26).

Subcellular location

Most lncRNAs appear transiently in the nucleus and that's exactly what you expect for transcriptional noise.

Enhancer lncRNAs

There are often transcripts produced in the vicinity of functional promoters. These are called enhancer RNAs and one of their functions could be simply to keep the promoter region in an open domain so that it works more effectively. It means that it's the act of transcription that's important and not the transcript itself. Enhancer RNAs will not be conserved because the sequence doesn't matter. Most of these transcripts are shorter than the typical lncRNA.

Evolution, conservation, and constraint

The failure to recognize the implications of the non-coding DNA will go down, I think, as the biggest mistake in the history of molecular biology.

John Mattick
ABC Australia

Sequence conservation is an important clue to function. Conserved RNAs often have a known biological function and are usually found in distantly related species—some of them are ancient. Most lncRNAs show no evidence of conservation (or purifying selection) and that strongly suggests that they are non-functional.

John Mattick and his colleagues dismiss this conclusion on two grounds. They claim that the presumed lack of conservation is an artifact based on a faulty assumption; namely, using transposon (TE) sequences to measure neutral evolution (Mattick and Dinger, 2013). They also claim that most lncRNAs are human specific and since they arose relatively recently in the human lineage, they won't show any evidence of conservation. [See The Junk DNA Controversy: John Mattick Defends Design and The biggest mistake in the history of molecular biology (not!)]

Ponting and Haerty dismiss both of those objections. Here's what they say about the first argument.

On the other side of this debate are evolutionary biologists who hold that a century-old theoretical evolutionary framework can be trusted to provide deep insight into molecular structure, function, and disease. With a neutral model of evolution, lncRNAs were estimated to contain only a small fraction (4.1–5.5%) of functional sequence, implying that mutations in the remaining sequence would not alter reproductive fitness. Mattick & Dinger (2013) responded that this model’s notion of selective neutrality was highly questionable. This was despite the model being founded on only one assumption—specifically, that mutations (in this case insertions or deletions) occur randomly within neutrally evolving sequence. Rather than assuming selective neutrality within ancient TE sequence, as Mattick & Dinger claimed, the model predicted that more than 99% of such sequence evolved neutrally.

The second argument suggests that new lncRNA genes will show no evidence of conservation but, since they are currently functional, they must be subject to purifying selection to preserve that function. That argument runs into some nasty little facts.

One of two polar opposite outcomes was expected from applying this constraint approach to the human population. In one, lncRNA sequence would be highly constrained even if it was poorly conserved in other species, indicative of important human-specific functions; in the other, human lncRNA sequence would be poorly constrained, consistent with its weak conservation over longer evolutionary intervals. Population data provided compelling evidence for this second outcome—specifically, that newly arising mutations in human lncRNAs are seldom deleterious. Recent evidence shows that strong selection is almost entirely absent in human lncRNAs whose sequence is not conserved in other species.


There are lots of studies that correlate various lncRNAs with certain phenotypes. This is taken as evidence that those lncRNAs have a function related to the observed phenotype. Ponting and Haerty discuss a number of different explanations that lncRNA proponents don't mention. They also point out that the best way to establish a connection between a lncRNA and a phenotype is to delete the DNA that's transcribed and observe the effect. This has been done in many cell lines but the results have usually been misinterpreted.

It would be better to delete the DNA from a living organism, but you can't do that experiment in humans. You can do it in mice, but only 10% of all lncRNAs have an orthologous region in the mouse genome.

In summary, among mouse lncRNA loci that have been targeted for disruption and phenotypic scrutiny, many have yielded either no in vivo phenotypes or effects that are not always replicated when different strategies to disrupt the locus are adopted. In the absence of strong evidence to the contrary, therefore, the expectation should be that natural mutations within human lncRNAs only rarely cause overt phenotypes.

Medical relevance

The scientific literature contains lots of speculation about an association between lncRNAs and human diseases. This is taken to be evidence that large numbers of lncRNAs are functional. However, the best studies have demonstrated a close association with disease for fewer than 100 lncRNA loci, and this is perfectly consistent with a small number of true functional lncRNA genes. It says nothing about the function of the vast majority of lncRNAs. Furthermore, even the best studies have been challenged for missing nearby protein-coding genes that could account for the association.

We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims.

Ponting & Haerty (2022)

Summary points

The review begins with a question: how do you choose a particular lncRNA to study given that you generally have no clues about its function? The purpose of their review, according to Ponting and Haerty, is to help scientists choose appropriate candidates for further study.

They summarize their answers at the end. You should choose a lncRNA that's conserved in other mammals. It should be abundant in the cells where it is expressed and show specific subcellular localization. It should interact with other molecules.
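Those four guidelines amount to a simple filter, which can be sketched in code. Everything specific in this sketch is an assumption made for illustration: the field names, the cutoff of one copy per cell, and the candidate records (which are invented, not real data). The review itself does not give numerical thresholds in the passages quoted here.

```python
from dataclasses import dataclass


@dataclass
class LncRNACandidate:
    name: str
    conserved_in_mammals: bool    # conserved in other mammals
    copies_per_cell: float        # abundance where it is expressed
    specific_localization: bool   # specific subcellular localization
    known_interactors: int        # other molecules it interacts with


# ASSUMPTION: illustrative cutoff for "abundant"; not a value from the review.
MIN_COPIES_PER_CELL = 1.0


def criteria_met(c: LncRNACandidate) -> int:
    """Count how many of the four guidelines a candidate satisfies."""
    return sum([
        c.conserved_in_mammals,
        c.copies_per_cell >= MIN_COPIES_PER_CELL,
        c.specific_localization,
        c.known_interactors > 0,
    ])


def prioritize(candidates: list[LncRNACandidate]) -> list[LncRNACandidate]:
    """Keep candidates meeting all four guidelines, most abundant first."""
    kept = [c for c in candidates if criteria_met(c) == 4]
    return sorted(kept, key=lambda c: c.copies_per_cell, reverse=True)


# Invented records, for illustration only.
candidates = [
    LncRNACandidate("hypothetical_A", True, 50.0, True, 3),
    LncRNACandidate("hypothetical_B", False, 0.2, False, 0),
    LncRNACandidate("hypothetical_C", True, 0.5, True, 1),
    LncRNACandidate("hypothetical_D", True, 5.0, True, 2),
]

for c in prioritize(candidates):
    print(c.name, criteria_met(c))
```

Requiring all four criteria is the strictest reading of the guidelines. A more forgiving version could rank candidates by how many criteria they meet rather than discarding those that miss one.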

I think we can all agree that by following those guidelines you are far more likely to find a real, biologically functional lncRNA to study, but that's not going to answer the big question: what about all the others?

Mattick, J.S. (2009) The genetic signatures of noncoding RNAs. PLOS Genetics 5:e1000459. [doi: 10.1371/journal.pgen.1000459]

Mattick, J.S. and Dinger, M.E. (2013) The extent of functionality in the human genome. The HUGO Journal 7:2. [doi: 10.1186/1877-6566-7-2]

Palazzo, A.F. and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in genetics 6:2(1-11). [doi: 10.3389/fgene.2015.00002]

Ulitsky, I. and Bartel, D.P. (2013) lincRNAs: genomics, evolution, and mechanisms. Cell 154:26-46. [doi: 10.1016/j.cell.2013.06.020]

Zhou, B., Ji, B., Liu, K., Hu, G., Wang, F., Chen, Q., Yu, R., Huang, P., Ren, J. and Guo, C. (2021) EVLncRNAs 2.0: an updated database of manually curated functional long non-coding RNAs validated by low-throughput experiments. Nucleic acids research 49:D86-D91. doi: [


SPARC said...

Rather than knocking out DNA sequences to prove the function of lncRNAs, it will soon be possible to target such transcripts directly with something like CRISPR/CasRx. However, who would do this? On the one side we have those who are not interested in a possible direct proof of their failures, and on the other side you have those who are wondering whether it is worth the effort to demonstrate that most spurious transcripts interpreted as lncRNAs don’t have any function.

Graham Jones said...

It seems to me that this paper describes a way in which non-functional RNAs with secondary structure can be quite easily created.

"we identified mutation patterns consistent with the TSM mechanism both among historical changes separating established evolutionary lineages and among recent mutations, likely destined for removal by drift and selection."

Template switching in DNA replication can create and maintain RNA hairpins

Anonymous said...

If I were you Larry, I'd stop drinking, especially the Diet Coke...

Corneel said...

Please tell us that isn't your comment below, J-Mac.