Friday, January 16, 2015

Functional RNAs?

One of the most important problems in biochemistry & molecular biology is the role (if any) of pervasive transcription. We've known for decades that most of the genome is transcribed at some time or other. In the case of organisms with large genomes, this means that tens of thousands of RNA molecules are produced from regions of the genome that are not (yet?) recognized as functional genes.

Do these RNAs have a function?

Most knowledgeable biochemists are aware of the fact that transcription factors and RNA polymerase can bind at many sites in the genome that have nothing to do with transcription of a normal gene. This simply has to be the case based on our knowledge of DNA binding proteins [see The "duon" delusion and why transcription factors MUST bind non-functionally to exon sequences and How RNA Polymerase Binds to DNA].

If you have a genome containing large amounts of junk DNA then it follows, as night follows day, that there will be a great deal of spurious transcription. The RNAs produced by these accidental events will not have a biological function.

The human genome is large and most knowledgeable biochemists think that 90% of it is junk. There should be a lot of junk RNA produced in any particular cell and if you look at a large number of different tissues you are bound to find that most of the genome is transcribed. In spite of the fact that this is the expected result if you understand the biochemistry, there are those who believe that most of these RNAs have a function—and therefore most of the genome is functional.

The journal Nature Structural & Molecular Biology decided to publish a special issue called Focus on Noncoding RNAs. It came out at the same time as a paper by my colleague Alex Palazzo and his graduate student, Eliza Lee (Palazzo and Lee, 2015). The contrast is remarkable.

Let's look at the Nature Structural & Molecular Biology papers first. Keep in mind that the most important question is whether these RNAs have a function or whether they are just junk RNA produced as a result of spurious transcription. The lead editorial sets the stage [The noncoding explosion].
"The long-held view that the primary role of RNA is to code for proteins has been severely undermined. This Focus explores the remarkable functional diversity of RNA in light of recent breakthroughs in noncoding-RNA biology.

"In 1958, Francis Crick postulated the 'central dogma' to describe the flow of genetic information from DNA to RNA to protein (Crick F.H., Symp. Soc. Exp. Biol. 12, 138–163, 1958). Experimental evidence then established the mechanistic pathway linking genes to proteins: mRNAs act as transitory templates, tRNAs serve as adaptors between nucleotide and amino acid sequences, and the ribosome functions as the molecular machine that drives protein synthesis. This body of work cemented a canonical view of RNA as primarily a 'coding molecule'. Although tRNAs and rRNAs have obvious noncoding functions, their roles are nevertheless intimately tied to translation, thus reinforcing the notion of RNA as template and structural component to aid in protein synthesis.

"The finding that RNA itself is capable of enzymatic catalysis in the 1980s jolted the community and eventually led to the 'RNA world' hypothesis, which proposes that self-replicating RNA molecules were precursors to life based on DNA, RNA and proteins. In comparison, the notion of RNA as a regulatory molecule is relatively recent, and the tremendous number, diversity and biological importance of noncoding RNAs (ncRNAs) are only beginning to be fully appreciated. In this issue, we present a special Focus on noncoding RNAs that explores the functional diversity of ncRNAs, discusses the molecular mechanisms of different RNA interference (RNAi) pathways and highlights the latest breakthroughs in ncRNA biology."
I'm sick and tired of supposedly intelligent people misrepresenting the Central Dogma and misrepresenting the history of a field in order to hype the latest discoveries. You would think that Nature publications would be particularly sensitive to this after the ENCODE disaster.

For the record, regulatory RNAs have been known for 40 years and most of the diverse small RNAs have been around for 20 years. These hardly count as "relatively recent" and it's nothing short of ridiculous to claim that they "are only beginning to be appreciated."

Don't forget that the important question is whether most of these RNAs have a biological function. In other words, SHOULD they be appreciated!

The editors have thought about this question. How do they deal with it?
"Thousands of lncRNAs have been discovered to date, but their functional characterization has remained a challenge. This is partly because of a shortage of experimental techniques to explore their functions. In their Perspective, Spitale, Chang and Chu (p 29) highlight recent technological advances that will aid in the functional characterization of lncRNAs and discuss their advantages and caveats. The sheer number and the increasing pace of the discovery of new lncRNAs also present a challenge in terms of lncRNA definition and annotation. This issue is addressed in a Commentary by Rinn and Mattick (p 5), who propose considerations and best practices for identifying and annotating lncRNAs. These guidelines should assist the growing research community embarking on the mechanistic investigation of lncRNAs."
I'm not going to bother discussing either of those papers in any detail. They don't address the question at all. The first paper (Chu et al. 2015) just talks about "...technologies that have finally made it possible to directly address the where, what and how of lncRNA function..." They assume that most of the RNAs have a function that's just waiting to be nailed down.

The second paper is by John Mattick and John Rinn (Mattick and Rinn, 2015). Asking these guys to write about whether most lncRNAs have a function is like asking Michael Behe to write a critical review of irreducible complexity. Mattick and Rinn don't discuss the important question at all. They're mostly concerned with how to classify all those thousands of lncRNAs that have been discovered.

If you really want to know about function then you have to read the Palazzo and Lee paper on "Non-coding RNA: what is functional and what is junk?"

They make the same points that have been made repeatedly over the past two decades. Clearly they haven't sunk in and need to be repeated. You begin by assuming, in the absence of evidence to the contrary, that the newly discovered RNAs don't have a function. They are spurious transcripts. That's the default hypothesis. How do you determine if a given RNA has a function?
  1. If it's present in significant amounts [see also: How to Evaluate Genome Level Transcription Papers]. We know that the vast majority of RNAs are present at less than one copy per cell. Palazzo and Lee point out that this is a good indication of lack of function although there are situations where low abundance RNA might still have a function.
  2. How many have a known function? As of December 2014, there are only 166 lncRNAs with a validated function. (I suspect that not all of them will pan out.) That's after 20 years of looking for function among tens of thousands of putative lncRNAs. It doesn't prove anything but it surely points in one direction.
  3. We expect functional RNAs to be conserved and most of them aren't. Palazzo and Lee have a good discussion about exceptions to the rule. They are correct to point out that you can have functional transcription without sequence conservation and you can have functional RNAs that have just evolved in one lineage. However, these exceptions cannot account for the thousands of nonconserved RNAs that are supposed to have a biological function.
  4. Cell specific transcription. It's often assumed that if an RNA is expressed in only certain tissues, or cells, that this is an indication of function. This is a bad assumption. If we are dealing with spurious transcripts then these will be produced when certain transcription factors bind nonspecifically to DNA. Since different cells have different transcription factors, it follows that they will produce different junk RNAs.
  5. What if the RNA is localized within the cell? Palazzo and Lee point out that most of these RNAs are only found in the nucleus and that's where you expect junk RNA. Some are exported to the cytoplasm but that's not a reliable indication of function.
The question has not been answered but if you're a betting person, I'd put my money on most of these RNAs turning out to be junk.
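
The claim that spurious binding is unavoidable in a large genome can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration under stated assumptions: it treats the genome as random sequence with equal base frequencies, counts exact matches to a single motif on one strand, and uses a haploid genome size of about 3.2 billion bp. The function name and the motif lengths are hypothetical choices, not taken from any of the cited papers.

```python
def expected_random_sites(motif_length: int, genome_size: int) -> float:
    """Expected number of exact matches to one specific motif on one strand.

    Assumes random sequence with all four bases equally likely, so any
    given position matches with probability (1/4) ** motif_length.
    """
    positions = genome_size - motif_length + 1  # possible start positions
    return positions / 4 ** motif_length


HAPLOID_GENOME_BP = 3_200_000_000  # assumption: roughly 3.2 Gb

# Motif lengths chosen for illustration: a short site, a slightly longer
# one, and a 12 bp site such as a dimer might recognize.
for length in (6, 8, 12):
    sites = expected_random_sites(length, HAPLOID_GENOME_BP)
    print(f"{length:>2} bp motif: about {sites:,.0f} chance matches")
```

Under these assumptions a 6 bp motif is expected several hundred thousand times by chance alone, while a 12 bp motif is expected only a couple of hundred times. Real genomes are not random sequence and real factors tolerate mismatches, so the numbers are a rough guide at best.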

Chu, C., Spitale, R.C., and Chang, H.W. (2015) Technologies to probe functions and mechanisms of long noncoding RNAs. Nature Structural & Molecular Biology 22:29-35. [doi: 10.1038/nsmb.2921]

Mattick, J.S. and Rinn, J.J. (2015) Discovery and annotation of long noncoding RNAs. Nature Structural & Molecular Biology 22:5-7. [doi: 10.1038/nsmb.2942]

Palazzo, A.F. and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Front. Genet. 6:2. [doi: 10.3389/fgene.2015.00002]


  1. I do wonder if there's any selection against spurious (should we say "ectopic"?) transcription factor binding sites and other control sequences. Is there evidence of such selection in any organism? Have there been any studies? There is at least selection against restriction sites, which are a sort of control sequence.

    1. To the extent that they are harmful, there is.

      Then what happens is of course a matter of selection coefficients and population genetic environment...

    2. A typical transcription factor binding site is about 6 bp. There should be about one every 4000 bp in the human genome. That's 750,000 sites per haploid genome or 1.5 million in a typical diploid cell. It's hard to imagine how there could be significant selection against one of them such that eliminating one would confer a selective advantage.

    3. This is where I take the argument that I usually use against Larry and use it to agree with him. Whether natural selection will be effective is a function of whether 4Ns exceeds 1 in absolute value, where s is the selection coefficient. In these cases (selection favoring deletion or change of one of these useless transcription sites, for example, or selection to delete a short stretch of bases from our junk DNA) the selection coefficient s is most likely not big enough to be effective. Even though 1/(4N) may be small the value of s is most likely a lot smaller than that.

    4. Let's clarify the terminology a bit here. Let's say we have s and it's negative. Are we only allowed to say that there is selection against the allele if |s| > 1/(4N), or can we also say that there is selection against it but it's overwhelmed by drift, so in the end it does not do anything? This is a source of confusion (I personally tend to do the latter).

    5. @Joe Felsenstein
      "Whether natural selection will be effective is a function of whether 4Ns exceeds 1 in absolute value, where s is the selection coefficient."

      Isn't time also a factor here? I'm of the understanding that you and Simon have pointed out several times that even very small s values become important on geological timescales.

    6. Georgi Marinov: I'd say there is selection as long as s isn't zero, but it is ineffective if |4Ns| is much less than 1, as then selection is overwhelmed by drift.

      Mikkel: No, if |4Ns| is quite small, drift overwhelms selection, and by about 10-20 N generations drift has finished fixing or losing the allele. Waiting longer won't help. It is only if we have a deterministic model, with an infinite population, that waiting long enough causes selection to have an effect. In effect, in that idealized case N is infinite, so |4Ns| is too.

    7. "A typical transcription factor binding site is about 6 bp."

      Irrelevant. So is a typical restriction site. The argument to make is that selection against spurious transcription (or spurious suppression) is so weak that it makes no difference.

    8. re: A typical transcription factor binding site is about 6 bp.

      Uhmm... actually I was under the impression that the consensus sequences for typical transcription factor binding sites were larger than restriction sites.

      for example:

      Yes the consensus logo for the LexA-binding motif has six crucial nucleotides. But these are mirrored by a complementary palindromic 6 base sequence further downstream.

      Classical footprint analyses in olden days before bioinformatics indicated a greater surface area of protein-DNA contact.

      Meanwhile I am reminded of crucial differences between Chimpanzees and Humans that were obviously subject to significant selection pressures

      Those 13 nucleotide changes in an 81 nucleotide stretch in just ONE enhancer, HACNS1 (what I thought typical for transcription factor binding sites), represent quite an anomalous mutation rate that could not be attributed to drift.

      OK, I have gotten off the topic of lncRNA, but I just wanted to clarify whether or not "a typical transcription factor binding site is about 6 bp."

      Again, thanks to everybody for their patience and indulgence.

    9. LexA is a bacterial TF

      There is a difference between prokaryotic and eukaryotic TFs - the prokaryotic ones have longer motifs. For which there are good evolutionary reasons.

      The 6 bp comments referred to the eukaryotic ones. There are of course plenty of eukaryotic examples with longer motifs - CTCF, NRSF, etc. But most are indeed short, 6 to 8 bp.

    10. Hi Georgi – thank you for your patient assistance

      I thought that Helix-turn-helix, Zinc fingers & Leucine zippers – all resulted from mixing and matching of protein subunits into different heterodimers or homodimers.

      By definition that meant that while one subunit would interact with the first 6 bps, the other subunit necessarily had to interact with another upstream/downstream 6 bps that did not necessarily need to be palindromic with the first unless we were discussing homodimers.

      We are still speaking of sequence identity of 12 AND NOT 6 bps.

      I remain intrigued by those 13 nucleotide changes in an 81 nucleotide stretch in just ONE enhancer, HACNS1 (are enhancer activator protein binding sites different perhaps?), representing quite an anomalous mutation rate that could not be attributed to drift.

      My understanding was that HACNS1 was first discovered by comparing Human/Chimp genome variation and focusing on areas of unusually high variability indicating putative high selection for mutation in short regions of DNA (in the case of HACNS1, a region of DNA spanning 81 bp).

      What am I missing here?

    11. I am not familiar with the HACNS1 case, but those changes clearly have to be not in a single TFBS but multiple ones. That's what an enhancer is - multiple TFBSs. And it pretty much has to be.

      Because while it is OK to point out how many random TFBS matches exist in a genome when discussing TF binding, what is less often noted is that the vast majority of those are not detectably occupied. So if there is occupancy detected by ChIP-seq, that's something you do want to pay attention to. It does not mean that it is functional, but you need to take it seriously - clearing the chromatin barrier is a significant achievement on its own that forces one to take note of it. Unfortunately we still don't have a good answer to the question why some sites are bound and others are not. The usual explanations are pioneer factors and combinatorial occupancy, but the pioneer factors do not open all matches to their motifs either and the combinatorial occupancy concept is still quite fuzzy when it comes to the specifics.

    12. Georgi

      I am still unclear on one point:

      I understand that Helix-turn-helix, Zinc fingers & Leucine zippers are part and parcel of the eukaryote TFBS story. If so, Transcription factors are dimers; if so, then in fact it is incorrect to claim that

      "A typical transcription factor binding site is about 6 bp ..."

      That would be true only for the monomer and not the functional dimer.

      Of course, you raise another excellent point! Eukaryote enhancers bind a minimum of 3 Activator proteins (if I am not mistaken) and up to 8 Activator Proteins in enhancers for many genes.

      This ups the ante considerably when considering the notion of fortuitous junk transcription.

  2. 'The Central Dogma: "Once information has got into a protein it can't get out again". Information here means the sequence of the amino acid residues, or other sequences related to it.'
    Crick (1956)

  3. I have a bookmark on page 57 of the 109 page book titled:
    "Immunology and the Quest for an HIV Vaccine: A New Perspective" see:
    because that's where I stopped about six months ago because it was so repetitive and uninformative.

    Anyway the authors claimed (repeatedly and without apparent evidence) that noncoding RNA elements were part of an unappreciated molecular immune system that operates entirely within individual cells. Their thesis seemed to be that since much of the Junk was old fragments of retroviruses, disabling these transcribed segments gave their "molecular immune system" practice for the big day when they were confronted by real retrovirus elements.

  4. Laurence A. Moran: “Most *knowledgeable* biochemists are aware of the fact that transcription factors and RNA polymerase can bind at many sites in the genome that have nothing to do with transcription of a normal gene”

    I think you underestimate the understanding and knowledge of the scientists working on genomics and gene expression. Similar to the ENCODE scientists, who represent some of the finest academic institutions in the world, I think all leading scientists working on lncRNAs are highly knowledgeable.

    However, in order to be competitive and at the top of their field, many of them choose to misrepresent the current knowledge in order to promote their studies and results. Ultimately, these shrewd scientists are themselves ‘victims’ of the current science enterprise, which is based on a deficient peer review system (see my note in PubMed Commons entitled: “Multiple knockout mouse models reveal that some lincRNAs might be required for life and brain development” at:

    You might also want to see a note entitled “Everlasting confusion on ‘functional DNA’ and ‘junk DNA’” addressing the ENCODE project ( Here is an excerpt from this note:

    “After all, the ENCODE ‘function fiasco’ was not the result of misunderstanding the concept of biological function, nor was it due to scientific incompetence as suggested by others (2). On the contrary, because it conflicted with some of the project’s objectives and with its significance, there was a concerted effort not to bring this concept forward (3); indeed, as clearly shown in a recent ENCODE publication (4), at least some ENCODE members seem well aware of the scientific rationale and criteria for addressing putative biological functions for genomic DNA….


    (2) Graur D et al., 2013. On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE. Genome Biol Evol., 5:578-90. Graur D, 2013.

    (3) Bandea CI. 2014. Closing the gap between ‘words’ and ‘facts’ in evaluating genome biology and the ENCODE project. PubMed Commons (National Library of Medicine; Bethesda, MD). Comment on: Doolittle WF. 2013. Is junk DNA bunk? A critique of ENCODE. Proc Natl Acad Sci USA., 110:5294-300.

    (4) Kellis M. et al., 2014. Defining functional DNA elements in the human genome. Proc Natl Acad Sci USA., 111:6131-8. Kellis M, 2014.”

    1. We've been through this so many times. Why do we have to do it again?

    2. @Georgi,

      Before engaging in discussions about ENCODE and related projects, it would make sense that you address some of the issues raised in previous posts (

      Laurence A. Moran; Friday, October 31, 2014 9:36:00 AM:

      “Georgi, there are 30 authors on the PNAS paper from last April. How many of them do you think are prepared to stand by everything that's in that paper and how many will claim that the paper may not represent their views because they never approved the draft that was sent to PNAS?”

      Georgi Marinov; Friday, October 31, 2014 2:49:00 PM:

      I have said enough over the many threads on these subjects here for my position to be clear to anyone who has read my posts.

      John Harshman; Friday, October 31, 2014 3:25:00 PM:

      No, that's the problem. Your position isn't clear. When you say the main Nature paper was "technically correct", that sounds to everyone like a way of avoiding the controversy by pedantic legalism. Larry's trying to pin you down here, and you keep squirming. Or that's how it looks to me and, I suspect, to other readers. An unequivocal statement would be nice.

      I also raised the following issues that you did not address:

      (1) Is there anything wrong with ENCODE’s flagship paper in Nature? If the answer is yes, please let us know what’s wrong with it.

      (2) Is there anything wrong with the presentation of ENCODE findings by Ewan Birney and other ENCODE leaders to science writers and media? If the answer is yes, please let us know what’s wrong with it.

    3. My posts in this very same thread provide more than sufficient information to answer your questions.

      Then there are things like Google Scholar that will give you even more information.

    4. Georgi. I'm afraid that this too looks like weaseling to me and, I suspect again, to other readers. I also suspect that everyone would be interested in your answers to the questions, and the fact that people keep asking them should suggest to you that they think you haven't answered them. Now if you don't care that people think you're being a weasel, no problem.
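
The drift-versus-selection exchange in the first comment thread (selection is effective only when |4Ns| exceeds about 1) can be illustrated numerically. The following is a minimal sketch, assuming Kimura's diffusion approximation for the fixation probability of a new mutation in a diploid population whose effective size equals N. The population size and the values of 4Ns are hypothetical and chosen only for illustration.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Kimura's diffusion approximation for a new mutation (frequency 1/(2N)).

    Assumes a diploid population whose effective size equals its census
    size N, and a selection coefficient s (negative = deleterious).
    """
    if s == 0.0:
        return 1.0 / (2 * n)  # neutral limit
    # (1 - exp(-2s)) / (1 - exp(-4Ns)); expm1 keeps precision for tiny s
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * n * s)


N = 10_000  # hypothetical population size, for illustration only
neutral = fixation_probability(0.0, N)

for four_ns in (-0.01, -0.1, -1.0, -10.0):
    s = four_ns / (4 * N)
    relative = fixation_probability(s, N) / neutral
    print(f"4Ns = {four_ns:>6}: fixation chance is {relative:.4f} x neutral")
```

When |4Ns| is far below 1, a slightly deleterious allele fixes almost as often as a neutral one, which is the sense in which drift overwhelms selection. Once |4Ns| reaches about 10, fixation of the deleterious allele becomes very rare compared with the neutral case.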