A large percentage of the human genome is transcribed at some time or another during development. The vast majority of those transcripts are very rare transcripts that look very much like spurious products of accidental transcription initiation at sequences resembling true promoters. They have been rejected by genome annotators. They do not define genes. They are junk RNA. Pervasive transcription does not mean that most of the genome is functional.
Among the transcripts is a class called long non-coding RNAs or lncRNAs. These are usually defined as capped and polyadenylated transcripts longer than 200 nucleotides. Many of them are processed by splicing. They look a lot like mRNA except they don't encode any polypeptides.[1]
We don't know how many of these RNAs exist because different labs use different criteria to describe them. Some databases exclude low abundance lncRNAs and some include non-polyadenylated RNAs. There is general agreement that they number in the tens of thousands. A common number in the scientific literature is 60,000 lncRNAs.
The latest build of the human genome lists 14,727 lncRNA genes [Ensembl Human Genome] but most workers in the field think this is too low. Let's be clear about these numbers. The Ensembl description is just wrong. They are assuming that the existence of a lncRNA means there is a gene that produces that RNA. But since we don't know whether these RNAs have a biological function, this must mean that Ensembl uses a very loose definition of "gene," one that includes any sequence that's transcribed whether or not it has a function. This seems to be what they are doing since they include pseudogenes in their gene count and they include an additional 198,000 "gene" transcripts. This is begging the question.[2]
I will use a definition of "gene" that restricts the term to those sequences that produce a functional product [What Is a Gene?].
A gene is a DNA sequence that is transcribed to produce a functional product.
The key question is not how many lncRNAs there are but how many are functional. That's the question I want to address but, before doing so, let me make it clear that even if all 60,000 lncRNAs are functional it doesn't make much difference in the junk DNA debate. I'll illustrate that with a simple calculation. Assume that the average size of a lncRNA gene is 1,000 bp. If all 60,000 are functional, that corresponds to only about 2% of the human genome.
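The calculation can be written out as a short sketch. The post does not state a genome size, so the haploid figure of roughly 3.2 billion bp used here is an assumption, and `genome_fraction` is just an illustrative helper name.

```python
# Assumption (not stated in the post): haploid human genome of ~3.2 billion bp.
GENOME_SIZE_BP = 3_200_000_000


def genome_fraction(n_genes: int, avg_gene_bp: int,
                    genome_bp: int = GENOME_SIZE_BP) -> float:
    """Fraction of the genome occupied by n_genes of a given average length."""
    return n_genes * avg_gene_bp / genome_bp


# Figures from the post: 60,000 lncRNAs at an assumed average of 1,000 bp each.
fraction = genome_fraction(60_000, 1_000)
print(f"{fraction:.3%} of the genome")
```

With these inputs the result is a little under 2%. Doubling the assumed average gene size would only double that figure, so the point holds over a wide range of assumptions.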
We should also recognize that the terminology is confusing and it muddles the issue. Most workers assume the default explanation is that lncRNAs are functional. As a result, the term has come to be associated with presumed functional sequences. My colleagues Alex Palazzo and Eliza Lee wrote a review on this topic last year and one of their main points is that this assumption is invalid (Palazzo and Lee, 2015). The true default assumption, according to them (and me), is ...
In the absence of sufficient evidence, a given ncRNA should be provisionally labeled as non-functional. Subsequently, if the ncRNA displays features/activities beyond what one would expect for the null hypothesis, then we can reclassify the ncRNA in question as being functional.
I'll be addressing a paper that looks at sequence conservation as an indicator of function so it's important to keep the correct null hypothesis in mind. We will be looking at the subset of ncRNAs that fit the lncRNA definition but it's hard to overcome the bias associated with the term since it has come to imply function. I'll just call them long low abundance transcripts or LOLATs.
Find a function
How many of the 60,000 or so LOLATs have a biological function? There's only one way to answer that question and that's to examine each and every one of them to see what they do inside the cell. Only a small number of these transcripts have been examined in this way. It's hard to get a handle on just how many have a proven function but I doubt it's more than 200 in humans. That's a small percentage of the 60,000 LOLATs.
In the absence of a proven function, there are several indirect indicators we can use to make a decision. One of them is the overall abundance of a given LOLAT inside the cell. It is generally agreed that almost all of them are present at less than one copy per cell [reviewed in Palazzo and Lee, 2015: see figure]. It's very difficult to see how anything could have a true biological function at such a low concentration, although that hasn't prevented some wild speculation and bizarre scenarios.
Based on what we know about the levels of these RNAs inside the cell, it's safe to conclude they are most likely due to spurious transcription with no function (junk RNA). The concentration data provides no support for the idea that most of those 60,000 transcripts are functional; therefore, the null hypothesis cannot be rejected.
Most of the LOLATs are present in only one cell type or tissue or in a small number of tissues. Many of them are only found in testes or brain—tissues that are notorious for producing junk RNA based on data going back over 40 years.
Proponents of function often claim that patterns of restricted expression are an indication of function. This is a bad argument since spurious transcription (the default explanation) should show the same thing. Different promoters/enhancers are exposed in different tissues and if spurious transcripts come from those regions then they will show restricted expression.
I don't understand why this argument for function is so widely believed. Restricted expression (specificity) does not cause us to reject the null hypothesis since the result is consistent with spurious transcription.
The "gold standard" for assessing function is sequence conservation. If the sequence of the LOLAT is conserved in other species then we can assume that it's under negative selection. It must have a function.
How many of the 60,000 LOLATs show evidence of conservation? All of the literature published so far indicates that this number is very low (<1,000). The evidence is reviewed in a paper published on the Nature Reviews Genetics website (advanced online publication).
Ulitsky, I. (2016) Evolution to the rescue: using comparative genomics to understand long non-coding RNAs. Nature Reviews Genetics: advanced online publication, Aug. 30, 2016 [doi: 10.1038/nrg.2016.85]
You have to read the paper carefully in order to understand the main message, which is that very few LOLATs show any evidence of sequence conservation. The null hypothesis cannot be rejected on the basis of sequence conservation. The conclusion, which is consistent with most of them being junk RNA resulting from spurious transcription, is muddled by frequent terminological issues. Here's an example ...
A key assumption made when using DNA sequence alignments to study lncRNA evolution is that lncRNA exons in one species align to lncRNA exons in the other species. However, transcription typically evolves faster than the underlying DNA sequence and thus, in many cases, lncRNA loci are homologous to non-transcribed sequences in the other species. Therefore, it is important to study lncRNAs by directly comparing lncRNA-producing loci, and such studies in multiple species have uncovered rapid turnover of lncRNA loci. For example, my laboratory found that in 17 vertebrates, more than 70% of lncRNAs have appeared in the past 50 million years. Splicing patterns also evolve rapidly, with only approximately 20% of splicing events in human lncRNAs conserved outside of primates. lncRNA loci are thus commonly gained and lost in evolution, and those lncRNAs that are retained drastically change their exon–intron architecture and their sequences across species in which the lncRNA is present.
It's reasonable to focus attention on exon sequences since, as a general rule, intron sequences are not conserved in functional genes. Ulitsky notes that when you align LOLAT sequences to the comparable region in other genomes, you usually find that the similar sequence in the other genome does not correspond to a known transcript. This is consistent with the idea that the relevant sequences in both genomes are junk DNA but the human region just happens to be transcribed by accident.
It's probably best to avoid using the word "homologous" to describe these similar sequences since that's a loaded word. If the sequences are junk then they will eventually drift apart and all sequence similarity will be lost.
Because there's so little evidence of conservation for most LOLATs, Ulitsky suggests restricting the sequence comparisons to regions where both species have a LOLAT at the same site. This is a small fraction of the total number of LOLATs. When he does this, he discovers that 70% of LOLATs show no evidence of long-term sequence conservation (Hezroni et al., 2015). Transcription appears to have arisen fairly recently (within 50 million years). Fewer than 100 lncRNA genes are shared between fish and tetrapods and only a few hundred additional lncRNA genes are shared between reptiles, birds, and mammals. About one thousand lncRNA genes appear to be common to various mammalian lineages. Thus, sequence conservation suggests that about one thousand LOLATs out of 60,000 might be functional.
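To put the counts quoted in this post on a common scale, here is a minimal sketch that expresses each as a share of the 60,000 total. The numbers are the rough figures given in the text (my guess of no more than 200 proven functions, and about one thousand loci shared across mammals). The dictionary labels and the helper name `share_of_total` are only illustrative.

```python
TOTAL_LOLATS = 60_000  # the commonly quoted total used throughout the post


def share_of_total(count: int, total: int = TOTAL_LOLATS) -> float:
    """Return count as a fraction of the total number of transcripts."""
    return count / total


# Rough counts quoted in the text; neither is a precise measurement.
estimates = {
    "proven function (upper guess)": 200,
    "shared across mammalian lineages": 1_000,
}

for label, count in estimates.items():
    print(f"{label}: {count:,} of {TOTAL_LOLATS:,} "
          f"({share_of_total(count):.2%})")
```

Either way the share is small: about 0.3% for proven function and about 1.7% for candidates supported by sequence conservation.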
Knowing that most LOLATs are recent, there are two different interpretations. Ulitsky prefers to think of new genes arising in the past 50 million years. I prefer to think of spurious transcription arising in some common ancestor. The bottom line is that even if we restrict our analysis to a particular LOLAT that is present in more than one species (a small subset) there's very little evidence of conservation. Not even the splicing pattern is conserved. Looks like junk RNA to me.
BTW, we also have to be very careful about using the word "conservation." If you compare any two junk DNA regions of the human and chimp genomes you'll find that they are at least 98% similar. That's a very high degree of sequence similarity but it does not indicate "conservation." The word "conservation" should be restricted to sequences that can be shown to be under negative selection. In the case of humans and chimps the sequence similarity would have to be almost 100% over a long stretch of DNA before you could assume conservation.
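A toy sketch can make the distinction between similarity and conservation concrete. It assumes a deliberately simple neutral model that is not from the post: every site differs independently with probability 2% (the complement of the 98% similarity mentioned above), ignoring mutation-rate variation, indels, and alignment error. The function names are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of matching positions in two gap-free sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def neutral_tail_probability(n_sites: int, max_diffs: int,
                             divergence: float) -> float:
    """P(at most max_diffs differences in n_sites) under a binomial model."""
    term = (1.0 - divergence) ** n_sites  # probability of zero differences
    total = term
    ratio = divergence / (1.0 - divergence)
    for i in range(max_diffs):
        # Binomial recurrence: P(i + 1) = P(i) * (n - i) / (i + 1) * p / (1 - p)
        term *= (n_sites - i) / (i + 1) * ratio
        total += term
    return total


NEUTRAL_DIVERGENCE = 0.02  # assumption: the 98% human-chimp similarity of junk DNA

print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))  # one mismatch in ten sites
# 98% identity over 1,000 sites (20 differences) is what neutrality predicts.
print(neutral_tail_probability(1_000, 20, NEUTRAL_DIVERGENCE))
# 99.5% identity over 1,000 sites (5 differences) is rare under this model.
print(neutral_tail_probability(1_000, 5, NEUTRAL_DIVERGENCE))
```

Under these assumptions, a 1,000 bp region with 98% identity is entirely unremarkable, while one with only a handful of differences is very unlikely to arise by drift alone. That is the sense in which similarity has to approach 100% over a long stretch before it counts as evidence of negative selection.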
Sloppy use of "conservation" carries the implication that the sequences are under selection.
Most of the review focuses on issues such as "Classes of lncRNA evolutionary trajectories," "Rapid turnover of lncRNAs in other phyla," "Evolutionary origins of new lncRNAs," and "Routes for increased complexity in lncRNA loci." Nowhere in the abstract or the concluding remarks do you see any mention of what I think is the main result; namely, most LOLATs show no evidence of conservation. I get the impression that the author remains unconvinced by the results of "using comparative genomics to understand long non-coding RNAs" (see the title of the article).
Here's the abstract ....
Long non-coding RNAs (lncRNAs) have emerged in recent years as major players in a multitude of pathways across species, but it remains challenging to understand which of them are important and how their functions are performed. Comparative sequence analysis has been instrumental for studying proteins and small RNAs, but the rapid evolution of lncRNAs poses new challenges that demand new approaches. Here, I review the lessons learned so far from genome-wide mapping and comparisons of lncRNAs across different species. I also discuss how comparative analyses can help us to understand lncRNA function and provide practical considerations for examining functional conservation of lncRNA genes.
To me, the rapid turnover of LOLATs indicates that most of them are spurious, non-functional transcripts. But when the author says, "the rapid evolution of lncRNAs poses new challenges" it seems to indicate that these are unusual genes that for some reason don't look conserved. That's a very different conclusion.
The author, Igor Ulitsky, should have directly addressed the important question in this review and given us some indication about what fraction of the 60,000 LOLATs represent truly functional genes and on what evidence he bases his conclusion if it's not sequence conservation.
This field is very confusing. Some labs have characterized individual RNAs and demonstrated that they have a biological function. As I mentioned earlier, there are only 200 or so of these proven genes. Many labs have taken a genomics approach and have promoted the idea that there are tens of thousands of functional lncRNA genes. However, when these labs try to find evidence of function the results are discouraging. The best interpretation is that most LOLATs are spurious, non-functional transcripts.
That doesn't sit well with researchers who have invested their reputations (and many grants and publications) in the idea that the human genome has a huge number of lncRNA genes. What to do? The answer is to make up excuses to explain why sequence conservation is not a good indication of function and avoid the obvious conclusion that the transcripts are junk. Recall that junk RNA is the null hypothesis; function must be demonstrated, not assumed.
The three most common excuses are ....
- Conservation is difficult to detect because the functional part of the exon sequences relies on the formation of secondary structure and not exact sequence conservation.
- Most lncRNA genes are species-specific. They have evolved recently and they help define one species from another.
- The actual sequence of lncRNA genes isn't important. The function of these genes is simply to produce a transcript of some sort and not one with a particular sequence. (These are "eRNAs" or "enhancer RNAs.")
Here are a couple of abstracts that illustrate this phenomenon. I have underlined some interesting phrases.
Johnsson, P., Lipovich, L., Grandér, D., and Morris, K. V. (2014) Evolutionary conservation of long non-coding RNAs; sequence, structure, function. Biochimica et Biophysica Acta (BBA)-General Subjects, 1840:1063-1071. [doi: 10.1016/j.bbagen.2013.10.035]
Recent advances in genomewide studies have revealed the abundance of long non-coding RNAs (lncRNAs) in mammalian transcriptomes. The ENCODE Consortium has elucidated the prevalence of human lncRNA genes, which are as numerous as protein-coding genes. Surprisingly, many lncRNAs do not show the same pattern of high interspecies conservation as protein-coding genes. The absence of functional studies and the frequent lack of sequence conservation therefore make functional interpretation of these newly discovered transcripts challenging. Many investigators have suggested the presence and importance of secondary structural elements within lncRNAs, but mammalian lncRNA secondary structure remains poorly understood. It is intriguing to speculate that in this group of genes, RNA secondary structures might be preserved throughout evolution and that this might explain the lack of sequence conservation among many lncRNAs.
Scope of review
Here, we review the extent of interspecies conservation among different lncRNAs, with a focus on a subset of lncRNAs that have been functionally investigated. The function of lncRNAs is widespread and we investigate whether different forms of functionalities may be conserved.
Lack of conservation does not imbue a lack of function. We highlight several examples of lncRNAs where RNA structure appears to be the main functional unit and evolutionary constraint. We survey existing genomewide studies of mammalian lncRNA conservation and summarize their limitations. We further review specific human lncRNAs which lack evolutionary conservation beyond primates but have proven to be both functional and therapeutically relevant.
Kapusta, A., and Feschotte, C. (2014) Volatile evolution of long noncoding RNA repertoires: mechanisms and biological implications. TRENDS in Genetics, 30:439-452. [doi: 10.1016/j.tig.2014.08.004]
Thousands of genes encoding long noncoding RNAs (lncRNAs) have been identified in all vertebrate genomes thus far examined. The list of lncRNAs partaking in arguably important biochemical, cellular, and developmental activities is steadily growing. However, it is increasingly clear that lncRNA repertoires are subject to weak functional constraint and rapid turnover during vertebrate evolution. We discuss here some of the factors that may explain this apparent paradox, including relaxed constraint on sequence to maintain lncRNA structure/function, extensive redundancy in the regulatory circuits in which lncRNAs act, as well as adaptive and non-adaptive forces such as genetic drift. We explore the molecular mechanisms promoting the birth and rapid evolution of lncRNA genes, with an emphasis on the influence of bidirectional transcription and transposable elements, two pervasive features of vertebrate genomes. Together these properties reveal a remarkably dynamic and malleable noncoding transcriptome which may represent an important source of robustness and evolvability.
1. There are a few exceptions. Some lncRNAs appear to be translated.
2. The question being begged is how many of those sequences have a true biological function.
3. Don't forget that almost all of these LOLATs are present at extremely low concentrations so they already have one strike against them. (Explaining that problem requires a different set of excuses.)
Palazzo, A.F., and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in Genetics, 6. [doi: 10.3389/fgene.2015.00002]
Hezroni, H., Koppstein, D., Schwartz, M. G., Avrutin, A., Bartel, D. P., and Ulitsky, I. (2015) Principles of long noncoding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell reports, 11:1110-1122. [doi: 10.1016/j.celrep.2015.04.023]