Sunday, November 01, 2015

3,000 new genes discovered in the human genome - dark matter revealed

Let's look at a recent paper published by a large group of medical researchers at the University of California, Los Angeles (USA). The paper was published online a few days ago (Oct. 26, 2015) in Nature Immunology.

The authors claim to have discovered 3,000 previously unknown genes in the human genome.

The complete reference is ...
Casero, D., Sandoval, S., Seet, C.S., Scholes, J., Zhu, Y., Ha, V.L., Luong, A., Parekh, C., and Crooks, G.M. (2015) Long non-coding RNA profiling of human lymphoid progenitor cells reveals transcriptional divergence of B cell and T cell lineages. Nat Immunol, advance online publication. [doi: 10.1038/ni.3299]

Abstract: To elucidate the transcriptional 'landscape' that regulates human lymphoid commitment during postnatal life, we used RNA sequencing to assemble the long non-coding transcriptome across human bone marrow and thymic progenitor cells spanning the earliest stages of B lymphoid and T lymphoid specification. Over 3,000 genes encoding previously unknown long non-coding RNAs (lncRNAs) were revealed through the analysis of these rare populations. Lymphoid commitment was characterized by lncRNA expression patterns that were highly stage specific and were more lineage specific than those of protein-coding genes. Protein-coding genes co-expressed with neighboring lncRNA genes showed enrichment for ontologies related to lymphoid differentiation. The exquisite cell-type specificity of global lncRNA expression patterns independently revealed new developmental relationships among the earliest progenitor cells in the human bone marrow and thymus.
We'll start by looking at the press release from UCLA [UCLA researchers discover more than 3,000 genes in a little-studied part of the human genome]. Keep in mind that this is supposed to be one of the best universities in the USA (and the world).
Scientists at the UCLA Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research have discovered more than 3,000 previously unknown genes in a poorly understood part of the genome. These genes, found in rare cells in bone marrow and in the thymus, give scientists a new understanding of how the human immune system develops.
Let's make one thing very clear. The authors DID NOT discover 3,000 new "genes." What they may have discovered is 3,000 PUTATIVE genes that POSSIBLY specify functional lncRNAs. They have a shitload of work to do before they can conclude that any of those sequences are actually genes (see Palazzo and Lee, 2015).

Thankfully, there's nothing in the UCLA press release about junk DNA but it does refer to "dark matter" in the subtitle ("‘Dark matter’ of genome offers clues to how the immune system develops") and "dark matter" is mentioned by one of the lead authors ...
“The genes we found are called long non-coding RNAs, or LncRNAs,” said Gay Crooks, co-director of the UCLA Broad Stem Cell Research Center, a member of the UCLA Jonsson Comprehensive Cancer Center and co-senior author of the study. “They make up much of what we used to think of as the ‘dark matter’ of our genome because, unlike the better-known messenger RNA genes, they do not produce proteins. The function of LncRNAs is not well-known but it is becoming increasingly apparent that they are not inert; they have a critical role in controlling how other genes function.
Really?—everything that isn't a protein-coding gene is "dark matter"? And isn't it a bit confusing if one paragraph says that the function of lncRNAs is not well-known but another paragraph declares that they must be genes?

And what's this nonsense about lncRNA genes making up "much of what we used to think of as 'dark matter'?" Even if every single putative transcript turned out to be a functional RNA there would only be about 25,000 lncRNA genes making up 0.8% of the entire genome.
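The arithmetic behind that figure can be sketched in a few lines of Python. The genome size (about 3.2 billion bp) and the average of about 1 kb of sequence per lncRNA are assumptions chosen for illustration, not numbers taken from the paper.

```python
# Assumed values for illustration only (not from the paper).
GENOME_SIZE_BP = 3_200_000_000   # approximate haploid human genome size
ASSUMED_LNCRNA_BP = 1_000        # assumed average sequence per lncRNA gene

def genome_fraction_percent(n_genes, bp_per_gene=ASSUMED_LNCRNA_BP,
                            genome_bp=GENOME_SIZE_BP):
    """Percentage of the genome occupied by n_genes of the given length."""
    return 100 * n_genes * bp_per_gene / genome_bp

all_putative = genome_fraction_percent(25_000)   # every putative lncRNA gene
new_in_paper = genome_fraction_percent(3_000)    # the newly reported ones
print(round(all_putative, 1), round(new_in_paper, 1))
```

Under these assumptions 25,000 genes come to roughly 0.8% of the genome and 3,000 come to roughly 0.1%. A longer assumed length per gene raises both figures proportionally.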

I think there are two rules for science press releases that must be enforced [Stupid Harvard press release illustrates the importance of author responsibility].
  1. The press release must include the complete citation, including a link (doi). If this means delaying the press release for a day or two after the embargo is lifted then that's a small price to pay.
  2. The press release should always include a notice from at least one author affirming, in writing, that the press release is a complete and accurate report of the results and conclusions that have been published in the peer reviewed literature.
This UCLA press release contains a direct link to the paper in Nature Immunology but no citation. Of course, there's no disclaimer because that would grant authors veto power over what their university says about their science. That's not going to happen because the university wants to hype everything with a view to attracting donors and publicity.

Very few of you can read the paper because it's behind a paywall. You'll have to trust me when I tell you what's in it.

The first thing I did was search for "dark matter." There's nothing in the actual paper about "dark matter." There's no mention of "junk DNA."

The next thing that interested me was how they determined whether a transcript was a lncRNA. I looked for the methods section but it was only available in the online version of the paper. Here's what I found ...
The RABT (‘reference annotation–based transcript’ assembly) approach (34) was used for assembly of the transcriptome. Alignment files for each of our samples as well as those derived from publicly available RNA-Seq data sets were analyzed with the HTSeq Python software package (35), with our gene-annotation file to generate gene-level counts for each sample. Pairwise expression correlations between protein-coding genes and lncRNA genes were computed with a strategy previously described for characterization of the Gencode lncRNA catalog (15). Gene clustering of differentially expressed genes was performed through the use of the MBCluster.Seq (‘model-based clustering for RNA-seq data’) software package (23). The GREAT tool (24) was used for analysis of functional annotations of protein coding genes whose regulatory domains overlapped loci of lncRNA genes. Pair-wise differential expression analysis was performed with the DESeq package in software of the R project for statistical computing (36). Bayesian model polytomous selection and WGCNA were done as described (25,27). Peak detection and signal intensity analyses for ChIP-seq data were done with the MACS2 (‘model-based analysis of ChIP-Seq’) algorithm (37) and HOMER (‘hypergeometric optimization of motif enrichment’) suite of tools (38). A detailed description of the RNA-Seq and ChIP-Seq analysis methods is provided in the Supplementary Methods.
That's not very helpful. I still don't know what criteria they used to identify putative functional lncRNAs. They include a figure showing their "pipeline for the annotation of newly identified lncRNAs" but I don't find it very helpful.

Is there a Sandwalk reader out there who can explain to us in simple language how they distinguish functional lncRNAs from transcriptional noise?

The authors claim that their complete annotation database contained 18,268 lncRNA "genes." I don't know of any reliable annotation of the human genome that includes as many as 18,000 lncRNA genes. The latest ENSEMBL version [GRCh38.p3 (Genome Reference Consortium Human Build 38)] has 14,898 and that's a very generous (and incorrect) count of "genes."

One of the good things about this paper is that the authors include data on the expression levels of their putative genes. One of the bad things about this paper is that they follow the example of many other authors by converting the data into something called "FPKM." It turns out that 80% of their novel lncRNA "genes" are expressed at an FPKM > 1 in at least one cell type. (I don't know exactly what that means in terms of transcripts per cell but I think it's about one transcript per cell or less.)
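For readers who haven't met the unit: FPKM stands for fragments per kilobase of transcript per million mapped fragments. Here's a minimal Python sketch of that standard definition. The function name and the example numbers are made up for illustration and do not come from the paper.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)

# Hypothetical example: 20 fragments map to a 1 kb transcript
# in a library of 20 million mapped fragments.
example = fpkm(fragments=20, length_bp=1_000, total_mapped_fragments=20_000_000)
print(example)
```

With these made-up numbers, an FPKM of 1 corresponds to only 20 sequenced fragments in a library of 20 million. The value by itself says nothing about copies per cell.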

The most important part of this study should have been providing evidence for their claim that the new transcripts are functional and therefore that the complementary DNA sequences are actually genes. This would require a discussion of what a "gene" is and it would require that the authors refer to their transcripts as putative lncRNAs and the DNA sequences as "possible genes." They don't do that.

Part of any decent study on putative lncRNAs should include a discussion of the number of transcripts per cell and why the level that's detected is consistent with function. Part of any decent paper on the subject should include data on the conservation of the putative gene sequences within the human population and between species.

At some point in time the scientific community has to stand up and put a stop to publishing papers like this that don't measure up to the minimum standards of scientific publishing.

The Intelligent Design Creationists are already advertising this paper. Let's remember that even if all 3,000 transcripts turned out to be from real genes, this represents only 0.1% of the genome. It's not going to affect the discussion about the amount of junk DNA in our genome.

Palazzo, A. F., and Lee, E. S. (2015). Non-coding RNA: what is functional and what is junk? Frontiers in Genetics, 6. [doi: 10.3389/fgene.2015.00002]


  1. Setting the hype aside, there is another problem with papers like these -- I would have thought this was simply not news in 2015 as the time for that kind of paper passed a few years ago...

    I guess I was wrong and I should probably write a few of these myself...

  2. It turns out that 80% of their novel lncRNA "genes" are expressed at an FPKM > 1 in at least one cell type. (I don't know exactly what that means in terms of transcripts per cell but I think it's about one transcript per cell or less.)

    There really is no way to equate FPKM to copies per cell -- people have come up with empirical conversion factors that might be true for their particular experiments, but part of the problem with FPKM (and why people are beginning to suggest alternative methods) is that FPKM values depend on the number of transcripts per sample.

  3. Yeah, they are relative measures not comparable across experiments:
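A small Python sketch makes the point about relative measures concrete. It assumes an idealized experiment in which the expected number of fragments from a transcript is proportional to its copy number times its length. Every cell, gene name, and number below is hypothetical.

```python
def expected_fpkm(gene, copies_per_cell, lengths_bp):
    """Expected FPKM of `gene` under idealized, unbiased sampling.

    Expected fragments are proportional to copies * length, so sequencing
    depth cancels out and FPKM = copies * 1e9 / sum(copies * length).
    """
    total_weight = sum(copies_per_cell[g] * lengths_bp[g] for g in copies_per_cell)
    return copies_per_cell[gene] * 1e9 / total_weight

# Hypothetical transcript lengths and per-cell copy numbers.
lengths_bp = {"lncX": 1_000, "everything_else": 2_000}
small_cell = {"lncX": 1, "everything_else": 100_000}   # little total RNA
large_cell = {"lncX": 1, "everything_else": 400_000}   # four times more RNA

fpkm_small = expected_fpkm("lncX", small_cell, lengths_bp)
fpkm_large = expected_fpkm("lncX", large_cell, lengths_bp)
print(round(fpkm_small, 2), round(fpkm_large, 2))
```

Both hypothetical cells carry exactly one copy of lncX, yet its expected FPKM is about four times higher in the cell with less total RNA. That is why a single FPKM cutoff cannot be translated into copies per cell without extra information about the sample.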