Sandwalk: A 2004 kerfuffle over pervasive transcription in the mouse genome

Wednesday, March 09, 2016

A 2004 kerfuffle over pervasive transcription in the mouse genome

The first drafts of the human genome sequence were published in 2001. There was still work to do on "finishing" the sequence but a lot of the International Human Genome Project (IHGP) team shifted to work on the mouse genome. The FANTOM Consortium and the RIKEN Genome Exploration Groups (I and II) published an analysis of mouse transcripts in December 2002.

Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H. et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 420:563-573. [doi: 10.1038/nature01266]

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 ‘transcriptional units’, contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense–antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.

I haven't shown the complete list of authors. Some of the others are members of the Mouse Genome Sequencing Consortium. There's an overlap between the authors of this 2002 paper and the ENCODE papers that were published in 2007 and 2012 (e.g. Ewan Birney). The significance of this overlap will become clear.

Okazaki et al. begin their paper by noting that the total number of genes in the human genome is still unknown. They point to the fact that parts of the human genome (notably chromosome 21) are pervasively transcribed. This suggests, according to them, that there may be many more genes yet to be discovered,

One significant class of ‘genes’ missing from the existing genome annotation are those that give rise to non-protein-coding RNAs. Non-coding RNAs, although not highly transcribed, constitute a major functional output of the genome. In addition to their role in protein synthesis (ribosomal and transfer RNAs), non-coding RNAs have been implicated in control processes such as genomic imprinting and perhaps more globally in control of genetic networks.

The authors set out to define the complete "transcriptome" of the mouse in order to discover new genes. They constructed and analyzed a set of 60,770 expressed sequence tags (ESTs). ESTs are cloned complementary DNA (cDNA) fragments copied from purified mouse RNA molecules. These were clustered into 33,409 transcription unit (TUs). Many, but not all, of the TUs are based on multiple overlapping ESTs. The average length of a TU is 1,970 bp (1.97 kb). These are potential genes.

There's a bit of a problem with definitions since Okazaki et al. DEFINE a TU as "as unit of genetic information transcribed into mRNA." Presumably, this reflects their belief that all ESTs represent messenger RNAs because the cDNAs were derived from poly A⁺ RNA. However, it's clear that some of these potential genes may specify functional RNAs that are not translated (= noncoding RNAs).

The results are described in the abstract. After examining the sequences, they conclude that only 17,594 of the 33,409 potential genes have significant coding potential. That's only 52% of the total. They don't talk much about the other 48% of TUs. Most of them are "unclassifiable." These 15,815 TUs "may represent [genes for] functional non-coding RNAs."

It would be interesting to know whether all 17,594 potential protein-coding genes have been confirmed by subsequent analysis over the past 14 years. I doubt it very much since the total number of mouse protein-coding genes is now estimated to be about 20,000 and most of them are not expressed in the tissues analyzed by the FANTOM Consortium.

Wang et al. (2004) decided to look at the 15,815 TUs that could potentially be genes for noncoding RNAs. Remember that there is considerable debate over the number of genes for functional noncoding RNAs. The ENCODE Consortium claims that pervasive transcription of the human genome indicates that it is full of such genes and they play a key role in regulating the expression of protein-coding genes.

Back in 2004, there were just as many skeptics as there are today. Wang et al. examined the potential genes to see if they were any more conserved than random DNA intergenic sequences in the mouse genome. If they represent randomly transcribed regions of the genome that produce junk RNA by accidental transcription then you expect them to be evolving at the neutral rate whereas if they really are genes for functional RNAs then you expect them to show evidence of sequence conservation.¹

Wang et al. compared all 33,409 TU sequences to the rat and human genomes. The divided the dataset into the same four categories used by Okazaki et al. and plotted the results as percentage of cDNAs (Y-axis) vs percentage sequence identity (X-axis) for rat (left) and human (right).

The purple lines represent typical conserved coding regions of homologous genes in the two species. This is a positive control. The other positive control is the small set of all known mouse genes for functional RNAs (brown line). The yellow line represents random intergenic sequences that are presumably junk DNA. They should be evolving at the neutral rate.

The other lines represent sequences of cDNAs from the FANTOM2 database as follows ...

Red: coding cDNAs - probably protein-coding genes (14,317)
Blue: marginal coding cDNAs - possible protein-coding genes (3,277)
Black: possibly genes for functional noncoding RNA (11,526)
Green: probably genes for functional noncoding RNA (3,450)

As you can see, the potential genes for protein-coding regions tend to be conserved but the data suggest that a good many of those potential genes are probably not real protein-coding genes. In the case of potential genes for functional noncoding RNA (black, green), the sequence similarities are no different than random neutrally-evolving DNA and very different from known genes for functional noncoding RNAs (brown).

Wang et al conclude the most of the 15,815 potential RNA-type genes are not genes at at all and the transcripts are junk RNA.

The simplest explanation is that non-functional transcripts can be produced at low copy numbers, escape the cell's messenger RNA surveillance system, and yet inflict no damage on the cell.

They go on to issue a caution that has largely been ignored.

Given that all of the best techniques for detecting RNA genes depend on sequence conservation, the absence of this cannot be summarily dismissed, even if isolated examples of RNA genes being weakly conserved can be found. Extraordinary claims require extraordinary proof — this is particularly true when much of the data support an alternative interpretation that they are simply non-functional cDNAs.

This is a direct criticism of the Okazaki et al. paper so the authors of that paper were invited to respond. There were 133 authors. We'll never know how many of them might have agreed with the criticism since the response came from Yoshihide Hayashizaki (Hayashizaki, 2004), the RIKEN Group leader at the Yokahama Institute in Yokahama, Japan. He acknowledges that he had help from several people who were not on the original list on authors—one of them was John Mattick.

Hayashizaki didn't like the Wang et al. paper. He had a couple of technical objections; namely, that the functional RNA positive control is only based on 19 RNAs and that the validity of the negative control (random junk DNA) is also questionable. The first objection is valid but if you look at the big picture it's not going to make much difference if the known functional RNAs are unrepresentative. The second objection assumes that intergenic sequences are actually conserved to some extent—in other words, they are not junk. Hayashizaki claims that "more of the genome is under evolutionary selection (both positive and negative) than has been appreciated."

He doesn't explain his view but I'm guessing that he looks at the average sequence similarity between rat and mouse (~83%) and between mouse and human (~65%) and assumes that the differences should be even greater if they are evolving neutrally.

But those technical objections are not his main arguments. I bet Sandwalk readers can already anticipate what Hayashizaki is going to say. Think about it before I give you the answer.

waiting ....

waiting ....

That's right! He pushes the same old replies that we've heard before from creationists. The RNAs must be functional because they are transcribed in a tissue-specific manner or in response to external stimuli. This is a silly argument since if the RNAs are just noise produced by accidental transcription then that kind of spurious transcription will still depend of the binding of transcription factors to random parts of the genome. Since the transcription factors are tissue-specific or activated by external stimuli, it follows that the spurious transcripts will show the same features as the real genes that are being turned on by those transcription factors.²

The second argument is that regulatory noncoding RNAs are "in the main, much less conserved than protein-coding sequences." Thus, the genes for these RNAs could look like they aren't conserved but still be genes for functional RNAs. They could also be mouse-specific genes that arose only in the mouse lineage even though there are related, nonfunctional, sequences in other species. This is an advanced form of question-begging.

Hayashizaki doesn't seem to have absorbed the main take-home message—the onus is on those who make extraordinary claims to come up with evidence for function. It's not good enough to just speculate with just-so stories since there's a valid alternative hypothesis based on the default assumption (neutral evolution).

The criticism and reply were published in October 2004. The following year, the same group published two more papers in the September 2 issue of Science. This was an issue devoted to functional RNAs.

There were numerous articles on the importance of noncoding RNAs in mammalian genomes and all of them were in "honor" of the two papers from the FANTOM Consortium and RIKEN (Carninci et al., 2005; Katayama et al., 2005).

Did anyone learn anything from the Wang et al. paper? Judge for yourselves by reading the press release from FANTOM/RIKEN [PDF].

The FANTOM Consortium for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute and Genome Science Laboratory, Discovery and Research Institute, RIKEN Wako Institute (Genome Network Core Group), announce the publishing of two milestone papers this week in the prestigious journal Science, which will transform our understanding of how the genes in mammals are controlled.

The past five years have seen the completion of several mammalian genome sequences, but these are of limited value unless we can decode the way that they are translated into functions required to create a mature animal. Only around 2% of the genome is translated into proteins (coding transcripts), the building blocks of the cells that make up our bodies. But which 2%, and how is it controlled?

The key intermediate is the transcriptome, which now has been subject to the most comprehensive characterization ever. The groundbreaking study has used new technology that accurately tags the beginning and end of each of over 20 million RNA messages (transcripts) created by genes, resulting in a powerful profile of the regulating control of genes. In addition, it has also shown that overlapping sense/antisense transcript pairs (both strands) are almost universal in the genome, and that S/AS pairs are especially abundant in imprinted loci, keeping with the putative role of non-coding RNA in the mechanism of gene silencing.

Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs that until recently was not suspected to exist or be relevant to our biology. The findings suggest that the difference between mouse and man may well lie in the control systems of these genes, and not in the structures of the proteins some of them code for.

"We have provided the biomedical research community with the tools to understand the controls that are needed to make a mammal. We have deciphered the genome sequence not only for the code for making the parts (proteins) of a mammal, but also the code for making the right forms, in the right amounts, in the right place, at the right time." states project leader Yoshihide Hyashizaki.

The FANTOM/RIKEN groups identified and named 56,722 protein-coding genes based on their cDNAs in the 2005 papers. The majority of these (65%) were supposed to produce multiple proteins by alternative splicing. (Nobody believes any of this in 2016.)

They claim that the mouse genome has up to 34,030 genes for noncoding RNAs. Here's what they say about the previous year's discussion.

The function of ncRNAs is a matter of debate (Wang et al., 2004). Some ncRNAs are highly conserved even in distant species: 1117 out of 2886 overlap chicken sequences, of which 780 do not overlap known CDS and 438 do not overlap known mRNAs on either strand, whereas 68 out of 2886 have BLAST-like alignment tool (BLAT) alignments to the Fugu genome, of which 40 do not overlap known CDS on either strand. These ncRNAs are at least as conserved as a reference set of known ncRNAs (Fig. 3A), contrary to a previous study (Wang et al., 2004).

I don't understand these numbers. What does it mean to say that 1117 ncRNAs overlap chicken sequences?

At least they acknowledged the earlier criticism, in contrast with the ENCODE paper seven years later. They tried to provide evidence that their noncoding RNAs were functional by showing that they were, on average, more highly conserved than random genomic DNA. Here's a copy of Figure 3A titled "Human-mouse conservation of coding and noncoding RNAs compared with random genomic sequence."

At first glance it certainly looks like their ncRNAs are more conserved than random sequences but I'm deeply suspicious of the results. I don't like the way they present the data since all parts of the curve are skewed downward by the small number of real, highly conserved, sequences at the top end. I don't count this as extraordinary evidence.

Lots of people disagree with me (surprise!). Here's a sample of comments in that very same issue of Science [Mapping RNA Form & Function].

Guy Riddihough in "In the Forests of RNA Dark Matter."

The phrase “dark matter” could well be ascribed to noncoding RNA in general. The discovery that much of the mammalian genome is transcribed, in some places without gaps (so-called transcriptional “forests”), shines a bright light on this embarrassing plentitude: an order of magnitude more transcripts than genes (pp. 1559, 1564, and 1529). Many of these noncoding RNAs (p. 1527) are conserved across species, yet their functions (if any) are largely unknown ...

John Mattick in "The Functional Genomics of Noncoding RNA."

That complex organisms have complex genetic programming should come as no surprise. That much of this programming may be transacted by noncoding RNAs may be. However, given the sheer extent of noncoding RNA transcription, it seems more and more likely that a large portion of the human genome may be functional by means of RNA. This also means that we may have seriously misunderstood the nature of genetic programming in the higher organisms (21) by assuming that most genetic information is expressed as and transacted by proteins, as it largely is in prokaryotes (22). If so, there is a long road ahead in functional genomics.

Jean-Michel Claverie in "Fewer Genes, More Noncoding RNA."

Recent data from the FANTOM 3 project (13, 14) confirm and amplify these findings. Through a technical tour de force, the members of this consortium have established that a staggering 62% of the mouse genome is transcribed. They have identified more than 181,000 independent transcripts, of which half consist of noncoding RNA. Moreover, they found that more than 70% of the mapped transcription units overlap to some extent with a transcript from the opposite strand (13, 14).

These results provide a solution to the discrepancy between the number of (protein-coding) genes and the number of transcripts—noncoding polyadenylated mRNA contributes to a large fraction of the 3′-EST sequences (and SAGE tags) subsequently clustered or remaining as singletons. Indeed, the noncoding Xist mRNA is abundantly represented in all EST projects. It is thus likely that sequences of noncoding transcripts have been accumulating in EST databases and have for the most part (including singleton and antisense ESTs) been erroneously interpreted as coming from the 3′-untranslated regions of protein-coding transcripts. Noncoding transcripts originating from intergenic regions, introns, or antisense strands have probably been right before our eyes for 8 years without having been discovered!

Some of us look back on those claims with amusement because we are convinced that mammalian genomes are full of junk and most of those transcripts are just noise or junk RNA.

However, the scary part is that there are still many biologists who believe that there are thousands and thousands of genes for regulatory RNAs, far more, in fact, than the number of protein-coding genes.

It's going be hard to dissuade them because they want to believe that mammalian genomes contain a lot of missing information that makes us more complex than nematodes or fruit flies.

1. There will be exceptions. It's possible to imagine a given RNA having a function that's not sequence dependent. But it's 2016, and after decades of work there are very few proven examples of such RNAs. It's safe to assume that sequence conservation is a good proxy for function.

2. The same argument is still being used today and it still makes no sense.

Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C. et al. (2005) The transcriptional landscape of the mammalian genome. Science, 309:1559-1563. [doi: 10.1126/science.1112014]

Hayashizaki, Y. (2004) Mouse transcriptome: Neutral evolution of ‘non-coding’complementary DNAs (reply). Nature, 431:759-760. [doi: 10.1038/nature03017]

Katayama, S., Tomaru, Y., Kasukawa, T., Waki, K., Nakanishi, M., Nakamura, M., Nishida, H., Yap, C., Suzuki, M., Kawai, J. et al. (2005) Antisense transcription in the mammalian transcriptome. Science, 309:1564-1566. [doi: 10.1126/science.1112009]

Wang, J., Zhang, J., Zheng, H., Li, J., Liu, D., Li, H., Samudrala, R., Yu, J., and Wong, G.K.-S. (2004) Mouse transcriptome: neutral evolution of ‘non-coding’complementary DNAs. Nature, 431:758-759. [doi: 10.1038/nature03016]

23 comments:

Jonathan BadgerWednesday, March 09, 2016 3:50:00 PM
It's going be hard to dissuade them because they want to believe that mammalian genomes contain a lot of missing information that makes us more complex than nematodes or fruit flies

Given that human behavior clearly is more complex than either nematodes or fruit flies despite not having significantly more protein coding genes, what would be alternative explanations for this? Is it all environment, and we haven't managed to provide the right education for flies than would allow them to reach their full potential? I'm no far of Mattick and his "dog's ass plots", but it seems the best argument against him is to show an alternative explanation that doesn't require thousands of RNA genes.
ReplyDelete
Replies
DiogenesWednesday, March 09, 2016 4:14:00 PM
Larry says: The average length of a TU is 1,970 bp (1.97 kb). These are potential genes.
...After examining the sequences, they conclude that only 17,594 of the 33,409 potential genes have significant coding potential. That's only 52% of the total. They don't talk much about the other 48% of TUs. Most of them are "unclassifiable." These 15,815 TUs "may represent [genes for] functional non-coding RNAs."

Worst case scenario:

15,815 TUs x 1,970 bp = 31.15 million bps

Assume numbers comparable to mouse in human genome.

31.15 million bps / 3.2 billion bps = 0.97%

I call this the "Divide by 3 Billion Problem"
ReplyDelete
Replies
SRMWednesday, March 09, 2016 5:40:00 PM
Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs....

If pervasive transcription really is largely the result of spurious transcription, then one will expect the same result when worms cells are analyzed in this fashion.

Then what will be the argument? It will have to be that, yes, the same amount of non-coding RNA synthesis occurs in worm cells, but in the case of humans more of that non-coding RNA is functional than in the worm... I suppose.

It will come down to the same problem associated with never (or at least not yet) actually demonstrating functionality of the elements they claim are functionally important.
ReplyDelete
Replies
AnonymousSunday, May 01, 2016 5:59:00 AM
It is better to know about transcription which will be translated into the protein, and regulatory sequences unit encoding for a protein may contain both a coding sequence, which direct and regulate the synthesis of that protein.
ReplyDelete
Replies

Add comment