
Wednesday, March 09, 2016

A 2004 kerfuffle over pervasive transcription in the mouse genome

The first drafts of the human genome sequence were published in 2001. There was still work to do on "finishing" the sequence but a lot of the International Human Genome Project (IHGP) team shifted to work on the mouse genome. The FANTOM Consortium and the RIKEN Genome Exploration Groups (I and II) published an analysis of mouse transcripts in December 2002.
Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H. et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 420:563-573. [doi: 10.1038/nature01266]

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 ‘transcriptional units’, contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense–antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.
I haven't shown the complete list of authors. Some of the others are members of the Mouse Genome Sequencing Consortium. There's an overlap between the authors of this 2002 paper and the ENCODE papers that were published in 2007 and 2012 (e.g. Ewan Birney). The significance of this overlap will become clear.

Okazaki et al. begin their paper by noting that the total number of genes in the human genome is still unknown. They point to the fact that parts of the human genome (notably chromosome 21) are pervasively transcribed. This suggests, according to them, that there may be many more genes yet to be discovered.
One significant class of ‘genes’ missing from the existing genome annotation are those that give rise to non-protein-coding RNAs. Non-coding RNAs, although not highly transcribed, constitute a major functional output of the genome. In addition to their role in protein synthesis (ribosomal and transfer RNAs), non-coding RNAs have been implicated in control processes such as genomic imprinting and perhaps more globally in control of genetic networks.
The authors set out to define the complete "transcriptome" of the mouse in order to discover new genes. They constructed and analyzed a set of 60,770 expressed sequence tags (ESTs). ESTs are cloned complementary DNA (cDNA) fragments copied from purified mouse RNA molecules. These were clustered into 33,409 transcription units (TUs). Many, but not all, of the TUs are based on multiple overlapping ESTs. The average length of a TU is 1,970 bp (1.97 kb). These are potential genes.

There's a bit of a problem with definitions since Okazaki et al. DEFINE a TU as "a unit of genetic information transcribed into mRNA." Presumably, this reflects their belief that all ESTs represent messenger RNAs because the cDNAs were derived from poly A+ RNA. However, it's clear that some of these potential genes may specify functional RNAs that are not translated (= noncoding RNAs).

The results are described in the abstract. After examining the sequences, they conclude that only 17,594 of the 33,409 potential genes have significant coding potential. That's only 52% of the total. They don't talk much about the other 48% of TUs. Most of them are "unclassifiable." These 15,815 TUs "may represent [genes for] functional non-coding RNAs."
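
For readers who want to check the split, here is a minimal Python sketch of the arithmetic. The counts come from the paragraph above; the variable names are just illustrative.

```python
# Counts quoted above from Okazaki et al. (2002); names are illustrative only.
total_tus = 33_409
coding_tus = 17_594                      # TUs with significant coding potential
noncoding_tus = total_tus - coding_tus   # everything else

# Express each group as a share of all TUs.
coding_share = 100 * coding_tus / total_tus
noncoding_share = 100 * noncoding_tus / total_tus

print(f"TUs without significant coding potential: {noncoding_tus:,}")
print(f"coding: {coding_share:.1f}%  other: {noncoding_share:.1f}%")
```

The remainder matches the 15,815 TUs mentioned above, and the split works out to a little over half coding and a little under half "other."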

It would be interesting to know whether all 17,594 potential protein-coding genes have been confirmed by subsequent analysis over the past 14 years. I doubt it very much since the total number of mouse protein-coding genes is now estimated to be about 20,000 and most of them are not expressed in the tissues analyzed by the FANTOM Consortium.

Wang et al. (2004) decided to look at the 15,815 TUs that could potentially be genes for noncoding RNAs. Remember that there is considerable debate over the number of genes for functional noncoding RNAs. The ENCODE Consortium claims that pervasive transcription of the human genome indicates that it is full of such genes and they play a key role in regulating the expression of protein-coding genes.

Back in 2004, there were just as many skeptics as there are today. Wang et al. examined the potential genes to see if they were any more conserved than random intergenic DNA sequences in the mouse genome. If they represent randomly transcribed regions of the genome that produce junk RNA by accidental transcription, then you expect them to be evolving at the neutral rate, whereas if they really are genes for functional RNAs, then you expect them to show evidence of sequence conservation.[1]

Wang et al. compared all 33,409 TU sequences to the rat and human genomes. They divided the dataset into the same four categories used by Okazaki et al. and plotted the results as percentage of cDNAs (Y-axis) vs percentage sequence identity (X-axis) for rat (left) and human (right).

The purple lines represent typical conserved coding regions of homologous genes in the two species. This is a positive control. The other positive control is the small set of all known mouse genes for functional RNAs (brown line). The yellow line represents random intergenic sequences that are presumably junk DNA. They should be evolving at the neutral rate.

The other lines represent sequences of cDNAs from the FANTOM2 database as follows ...
  • Red: coding cDNAs - probably protein-coding genes (14,317)
  • Blue: marginal coding cDNAs - possible protein-coding genes (3,277)
  • Black: possibly genes for functional noncoding RNA (11,526)
  • Green: probably genes for functional noncoding RNA (3,450)

As you can see, the potential genes for protein-coding regions tend to be conserved but the data suggest that a good many of those potential genes are probably not real protein-coding genes. In the case of potential genes for functional noncoding RNA (black, green), the sequence similarities are no different than random neutrally-evolving DNA and very different from known genes for functional noncoding RNAs (brown).
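
The logic of that comparison can be sketched in a few lines of Python. This is only an illustration of the idea (bin the percent identities for each class, then compare the distributions), not the method Wang et al. actually used. The function name `identity_profile` and every number below are hypothetical toy values, not data from the paper.

```python
from collections import Counter
from statistics import median

def identity_profile(identities, bin_width=10):
    """Return {bin_start: percent of sequences} for a list of percent identities."""
    if not identities:
        return {}
    # Put each value in a bin; 100% identity is folded into the top bin.
    counts = Counter(
        min(int(v // bin_width) * bin_width, 100 - bin_width) for v in identities
    )
    n = len(identities)
    return {b: 100 * c / n for b, c in sorted(counts.items())}

# Hypothetical toy values (percent identity to another species), NOT real data.
intergenic = [58, 61, 63, 66, 67, 70, 72]       # neutral (negative) control
candidate_ncrna = [57, 62, 64, 65, 68, 69, 73]  # putative noncoding TUs
known_ncrna = [85, 88, 90, 92, 95]              # positive control

for name, values in [("intergenic", intergenic),
                     ("candidate ncRNA", candidate_ncrna),
                     ("known ncRNA", known_ncrna)]:
    print(name, "median:", median(values), "profile:", identity_profile(values))
```

If the candidate profile sits on top of the intergenic control and far from the known functional RNAs, as in this toy example, the simplest reading is that the candidates are evolving neutrally.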

Wang et al. conclude that most of the 15,815 potential RNA-type genes are not genes at all and that the transcripts are junk RNA.
The simplest explanation is that non-functional transcripts can be produced at low copy numbers, escape the cell's messenger RNA surveillance system, and yet inflict no damage on the cell.
They go on to issue a caution that has largely been ignored.
Given that all of the best techniques for detecting RNA genes depend on sequence conservation, the absence of this cannot be summarily dismissed, even if isolated examples of RNA genes being weakly conserved can be found. Extraordinary claims require extraordinary proof — this is particularly true when much of the data support an alternative interpretation that they are simply non-functional cDNAs.
This is a direct criticism of the Okazaki et al. paper so the authors of that paper were invited to respond. There were 133 authors. We'll never know how many of them might have agreed with the criticism since the response came from Yoshihide Hayashizaki (Hayashizaki, 2004), the RIKEN Group leader at the Yokohama Institute in Yokohama, Japan. He acknowledges that he had help from several people who were not on the original list of authors; one of them was John Mattick.

Hayashizaki didn't like the Wang et al. paper. He had a couple of technical objections; namely, that the functional RNA positive control is only based on 19 RNAs and that the validity of the negative control (random junk DNA) is also questionable. The first objection is valid but if you look at the big picture it's not going to make much difference if the known functional RNAs are unrepresentative. The second objection assumes that intergenic sequences are actually conserved to some extent—in other words, they are not junk. Hayashizaki claims that "more of the genome is under evolutionary selection (both positive and negative) than has been appreciated."

He doesn't explain his view but I'm guessing that he looks at the average sequence similarity between rat and mouse (~83%) and between mouse and human (~65%) and assumes that the differences should be even greater if they are evolving neutrally.

But those technical objections are not his main arguments. I bet Sandwalk readers can already anticipate what Hayashizaki is going to say. Think about it before I give you the answer.

waiting ....

waiting ....

That's right! He pushes the same old replies that we've heard before from creationists. The RNAs must be functional because they are transcribed in a tissue-specific manner or in response to external stimuli. This is a silly argument since if the RNAs are just noise produced by accidental transcription then that kind of spurious transcription will still depend on the binding of transcription factors to random parts of the genome. Since the transcription factors are tissue-specific or activated by external stimuli, it follows that the spurious transcripts will show the same features as the real genes that are being turned on by those transcription factors.[2]

The second argument is that regulatory noncoding RNAs are "in the main, much less conserved than protein-coding sequences." Thus, the genes for these RNAs could look like they aren't conserved but still be genes for functional RNAs. They could also be mouse-specific genes that arose only in the mouse lineage even though there are related, nonfunctional, sequences in other species. This is an advanced form of question-begging.

Hayashizaki doesn't seem to have absorbed the main take-home message—the onus is on those who make extraordinary claims to come up with evidence for function. It's not good enough to just speculate with just-so stories since there's a valid alternative hypothesis based on the default assumption (neutral evolution).

The criticism and reply were published in October 2004. The following year, the same group published two more papers in the September 2 issue of Science. This was an issue devoted to functional RNAs.

There were numerous articles on the importance of noncoding RNAs in mammalian genomes and all of them were in "honor" of the two papers from the FANTOM Consortium and RIKEN (Carninci et al., 2005; Katayama et al., 2005).

Did anyone learn anything from the Wang et al. paper? Judge for yourselves by reading the press release from FANTOM/RIKEN [PDF].
The FANTOM Consortium for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute and Genome Science Laboratory, Discovery and Research Institute, RIKEN Wako Institute (Genome Network Core Group), announce the publishing of two milestone papers this week in the prestigious journal Science, which will transform our understanding of how the genes in mammals are controlled.

The past five years have seen the completion of several mammalian genome sequences, but these are of limited value unless we can decode the way that they are translated into functions required to create a mature animal. Only around 2% of the genome is translated into proteins (coding transcripts), the building blocks of the cells that make up our bodies. But which 2%, and how is it controlled?

The key intermediate is the transcriptome, which now has been subject to the most comprehensive characterization ever. The groundbreaking study has used new technology that accurately tags the beginning and end of each of over 20 million RNA messages (transcripts) created by genes, resulting in a powerful profile of the regulating control of genes. In addition, it has also shown that overlapping sense/antisense transcript pairs (both strands) are almost universal in the genome, and that S/AS pairs are especially abundant in imprinted loci, keeping with the putative role of non-coding RNA in the mechanism of gene silencing.

Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs that until recently was not suspected to exist or be relevant to our biology. The findings suggest that the difference between mouse and man may well lie in the control systems of these genes, and not in the structures of the proteins some of them code for.

"We have provided the biomedical research community with the tools to understand the controls that are needed to make a mammal. We have deciphered the genome sequence not only for the code for making the parts (proteins) of a mammal, but also the code for making the right forms, in the right amounts, in the right place, at the right time." states project leader Yoshihide Hyashizaki.
The FANTOM/RIKEN groups identified and named 56,722 protein-coding genes based on their cDNAs in the 2005 papers. The majority of these (65%) were supposed to produce multiple proteins by alternative splicing. (Nobody believes any of this in 2016.)

They claim that the mouse genome has up to 34,030 genes for noncoding RNAs. Here's what they say about the previous year's discussion.
The function of ncRNAs is a matter of debate (Wang et al., 2004). Some ncRNAs are highly conserved even in distant species: 1117 out of 2886 overlap chicken sequences, of which 780 do not overlap known CDS and 438 do not overlap known mRNAs on either strand, whereas 68 out of 2886 have BLAST-like alignment tool (BLAT) alignments to the Fugu genome, of which 40 do not overlap known CDS on either strand. These ncRNAs are at least as conserved as a reference set of known ncRNAs (Fig. 3A), contrary to a previous study (Wang et al., 2004).
I don't understand these numbers. What does it mean to say that 1117 ncRNAs overlap chicken sequences?

At least they acknowledged the earlier criticism, in contrast with the ENCODE paper seven years later. They tried to provide evidence that their noncoding RNAs were functional by showing that they were, on average, more highly conserved than random genomic DNA. Here's a copy of Figure 3A titled "Human-mouse conservation of coding and noncoding RNAs compared with random genomic sequence."

At first glance it certainly looks like their ncRNAs are more conserved than random sequences but I'm deeply suspicious of the results. I don't like the way they present the data since all parts of the curve are skewed downward by the small number of real, highly conserved, sequences at the top end. I don't count this as extraordinary evidence.

Lots of people disagree with me (surprise!). Here's a sample of comments in that very same issue of Science [Mapping RNA Form & Function].

Guy Riddihough in "In the Forests of RNA Dark Matter."
The phrase “dark matter” could well be ascribed to noncoding RNA in general. The discovery that much of the mammalian genome is transcribed, in some places without gaps (so-called transcriptional “forests”), shines a bright light on this embarrassing plentitude: an order of magnitude more transcripts than genes (pp. 1559, 1564, and 1529). Many of these noncoding RNAs (p. 1527) are conserved across species, yet their functions (if any) are largely unknown ...

John Mattick in "The Functional Genomics of Noncoding RNA."
That complex organisms have complex genetic programming should come as no surprise. That much of this programming may be transacted by noncoding RNAs may be. However, given the sheer extent of noncoding RNA transcription, it seems more and more likely that a large portion of the human genome may be functional by means of RNA. This also means that we may have seriously misunderstood the nature of genetic programming in the higher organisms (21) by assuming that most genetic information is expressed as and transacted by proteins, as it largely is in prokaryotes (22). If so, there is a long road ahead in functional genomics.

Jean-Michel Claverie in "Fewer Genes, More Noncoding RNA."
Recent data from the FANTOM 3 project (13, 14) confirm and amplify these findings. Through a technical tour de force, the members of this consortium have established that a staggering 62% of the mouse genome is transcribed. They have identified more than 181,000 independent transcripts, of which half consist of noncoding RNA. Moreover, they found that more than 70% of the mapped transcription units overlap to some extent with a transcript from the opposite strand (13, 14).

These results provide a solution to the discrepancy between the number of (protein-coding) genes and the number of transcripts—noncoding polyadenylated mRNA contributes to a large fraction of the 3′-EST sequences (and SAGE tags) subsequently clustered or remaining as singletons. Indeed, the noncoding Xist mRNA is abundantly represented in all EST projects. It is thus likely that sequences of noncoding transcripts have been accumulating in EST databases and have for the most part (including singleton and antisense ESTs) been erroneously interpreted as coming from the 3′-untranslated regions of protein-coding transcripts. Noncoding transcripts originating from intergenic regions, introns, or antisense strands have probably been right before our eyes for 8 years without having been discovered!
Some of us look back on those claims with amusement because we are convinced that mammalian genomes are full of junk and most of those transcripts are just noise or junk RNA.

However, the scary part is that there are still many biologists who believe that there are thousands and thousands of genes for regulatory RNAs, far more, in fact, than the number of protein-coding genes.

It's going to be hard to dissuade them because they want to believe that mammalian genomes contain a lot of missing information that makes us more complex than nematodes or fruit flies.

1. There will be exceptions. It's possible to imagine a given RNA having a function that's not sequence dependent. But it's 2016, and after decades of work there are very few proven examples of such RNAs. It's safe to assume that sequence conservation is a good proxy for function.

2. The same argument is still being used today and it still makes no sense.

Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C. et al. (2005) The transcriptional landscape of the mammalian genome. Science, 309:1559-1563. [doi: 10.1126/science.1112014]

Hayashizaki, Y. (2004) Mouse transcriptome: neutral evolution of ‘non-coding’ complementary DNAs (reply). Nature, 431:759-760. [doi: 10.1038/nature03017]

Katayama, S., Tomaru, Y., Kasukawa, T., Waki, K., Nakanishi, M., Nakamura, M., Nishida, H., Yap, C., Suzuki, M., Kawai, J. et al. (2005) Antisense transcription in the mammalian transcriptome. Science, 309:1564-1566. [doi: 10.1126/science.1112009]

Wang, J., Zhang, J., Zheng, H., Li, J., Liu, D., Li, H., Samudrala, R., Yu, J., and Wong, G.K.-S. (2004) Mouse transcriptome: neutral evolution of ‘non-coding’ complementary DNAs. Nature, 431:758-759. [doi: 10.1038/nature03016]


Jonathan Badger said...

It's going to be hard to dissuade them because they want to believe that mammalian genomes contain a lot of missing information that makes us more complex than nematodes or fruit flies.

Given that human behavior clearly is more complex than that of either nematodes or fruit flies despite our not having significantly more protein-coding genes, what would be alternative explanations for this? Is it all environment, and we haven't managed to provide the right education for flies that would allow them to reach their full potential? I'm no fan of Mattick and his "dog's ass plots", but it seems the best argument against him is to show an alternative explanation that doesn't require thousands of RNA genes.

The Lorax said...

The short response is differential gene regulation through classic well established mechanisms led to a more 'complex' brain that allows us to spend inordinate amounts of time arguing that we're better than everything else. I'll call this the Trump fallacy (patent pending).

Diogenes said...

Larry says: The average length of a TU is 1,970 bp (1.97 kb). These are potential genes.
...After examining the sequences, they conclude that only 17,594 of the 33,409 potential genes have significant coding potential. That's only 52% of the total. They don't talk much about the other 48% of TUs. Most of them are "unclassifiable." These 15,815 TUs "may represent [genes for] functional non-coding RNAs."

Worst case scenario:

15,815 TUs x 1,970 bp = 31.15 million bps

Assume numbers comparable to mouse in human genome.

31.15 million bps / 3.2 billion bps = 0.97%

I call this the "Divide by 3 Billion Problem"
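
The same back-of-the-envelope estimate as a short Python sketch. The inputs are the figures quoted above; the 3.2 billion bp genome size is the assumption already made in this estimate, and the variable names are illustrative.

```python
# Worst case: every non-coding TU is a real gene for a functional RNA.
n_tus = 15_815              # TUs without significant coding potential
mean_tu_length_bp = 1_970   # average TU length quoted in the post
genome_size_bp = 3.2e9      # assumed genome size (human-scale)

total_bp = n_tus * mean_tu_length_bp
fraction = total_bp / genome_size_bp

print(f"{total_bp:,} bp covered, {100 * fraction:.2f}% of the genome")
```

Even under that worst-case assumption the putative RNA genes cover only about one percent of the genome.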

judmarc said...

Human behavior is more complex. Let's assume this is because of bigger, more complex brains. How many new developmental genes do we need in order to explain this, or can it be explained by (mostly) the same old genes regulated/regulating differently, e.g., the genes responsible for brain development wait longer before "turning off" in humans? (Cf. various examples of neoteny.)

Jonathan Badger said...

But the problem is that the mechanisms of classic, well-established gene regulation involve protein-coding genes doing the regulating. So it isn't clear how significantly more complex behavior could arise by this method without a significantly larger number of protein-coding genes to serve as additional regulators.

Alternatively, there's the argument that our behavior isn't really all that more complex and we are just vain like the pre-Copernicans who believed we lived at the center of the universe. There's something to be said for this in comparison to other mammals, but it is a bit hard to argue in regard to flies, I think.

Larry Moran said...

You only have to divide by TWO billion because protein-coding genes take up about one third of the genome. :-)

SRM said...

Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs....

If pervasive transcription really is largely the result of spurious transcription, then one would expect the same result when worm cells are analyzed in this fashion.

Then what will be the argument? It will have to be that, yes, the same amount of non-coding RNA synthesis occurs in worm cells, but in the case of humans more of that non-coding RNA is functional than in the worm... I suppose.

It will come down to the same problem associated with never (or at least not yet) actually demonstrating functionality of the elements they claim are functionally important.

Georgi Marinov said...

There isn't a lot of intergenic space in worms.

SRM said...

Ah, thanks, that makes C. elegans a bad example but maybe the general sentiment would hold for non-mammalian organisms that do possess a lot of non-protein coding DNA.

John Harshman said...

No, you don't need thousands of new non-coding genes. You just need, if anything, a somewhat more complex regulatory network. Some of that is probably new families of regulatory RNAs and some of it is new transcription factor binding sites (for the same old transcription factors) in promoters. But only a modest increase in numbers of functional sequences seems to me necessary to produce considerably greater differentiation. We're still left with regulatory elements, RNA and DNA alike, being a few percent of the genome, or at least with no reason to think otherwise and much reason to think so.

John Harshman said...

Georgi: When you say "worm" do you just mean C. elegans or are you making a general case for long, roughly tubular invertebrates?

Georgi Marinov said...


Jonathan Badger said...

Well then we agree. We need somewhat more than what is currently known, but probably not the extremes that Mattick and ENCODE fanatics imagine to exist.

Georgi Marinov said...

A few things to note:

1) One can generate an enormous variety of distinct "cell types" using just a few genes. If we count each and every B and T cell expressing a different Ig/TCR as a distinct cell type, then the number of cell types produced by just one locus is truly astronomical. Something similar seems to be happening with brain cells (protocadherins, DSCAM, etc.)

2) I don't see why each and every individual neuron has to be genetically and precisely specified. They are in some nematodes, rotifers, etc. but those are weird cases. In a complex vertebrate brain I see absolutely no need for that. All that one needs is to specify a limited number of cell types and to control where and how many of them will be produced during development. Which does not take that much extra in terms of regulation, especially given how the "how many" part, which is the easiest to tweak, seems to have been most important in our own evolution.

There really is a lot less to explain than many people would want there to be. And one can get into some interesting speculations on why that is -- if you ask me, I would guess that there is a subconscious desire for the brain to be exquisitely and precisely specified in all its complexity, because if it was, then that might make it more understandable. The alternative (things are specified down to a certain level, below that there is a lot of self-organization with quite a few degrees of freedom going on) is not so attractive. Which, if we are to get to the bottom of it, might well be the same as one of the reasons people believe in deities and love conspiracy theories (better to have the feeling that someone is in control than to fully realize how messy and fragile everything is).

John Harshman said...

Do all nematodes have unusually small genomes? No, though the average size of recorded genomes does seem pretty small. However, the range I can see is from .02 to 2.1pg.

John Harshman said...

I never said we need more than is currently known. I think we're good already.

Georgi Marinov said...

The ones that have been sequenced are not very different from C. elegans. And the small ones are certainly a majority (looking at the genome size databases, there are only a couple of 2 Gb ones).

Diogenes said...


The Lorax said...

Yes, you can explain all these differences via classical gene regulation models. How else do you explain the differences between a chicken and a horse, and if you think these are due to differences in complexity, which is more complex? And I disagree with you that we need 'somewhat more'; our understanding of gene regulation is quite clear that numerous and profound differences can arise via minor changes (look at insect morphology, for example). But continue going with the Trump fallacy: we have all the complexity, we have the best complexity.

judmarc said...

The alternative (things are specified down to a certain level, below that there is a lot of self-organization with quite a few degrees of freedom going on) is not so attractive.

It's attractive to me because it seems correct.

Jonathan Badger said...

Well, you did mention "probably new families of regulatory RNAs", which would definitely count as something more than is currently known.

Beau Stoddard said...

The Trump fallacy? Brilliant Lorax.

Anonymous said...

It is better to know which transcripts will be translated into protein. A unit encoding a protein may contain both a coding sequence and regulatory sequences, which direct and regulate the synthesis of that protein.