I've discovered many more authors who seem to be ignorant of the scientific literature and far too willing to rely on the opinions of others instead of investigating for themselves. Many of these authors seem to be completely unaware of controversy and debate in the fields they are writing about. They act, and write, as if there were only one point of view worth considering: theirs.
How does this happen? It seems to me that it can only happen if they find themselves in an environment where skepticism and critical thinking are suppressed. Otherwise, how do you explain the way they write their papers? Are there no colleagues, post-docs, or graduate students who looked at the manuscript and pointed out the problems? Are there no referees who raised questions?
Let's look at a paper on functional elements in the human genome (Milligan and Lipovich, 2015). It wasn't published in a front-line journal but that shouldn't matter for the points I'd like to make. This is a review article so special rules apply. As a scientist, you are obliged to represent the field fairly and honestly when writing a review. Here's the abstract ...
In the more than one decade since the completion of the Human Genome Project, the prevalence of non-protein-coding functional elements in the human genome has emerged as a key revelation in post-genomic biology. Highlighted by the ENCODE (Encyclopedia of DNA Elements) and FANTOM (Functional Annotation of Mammals) consortia, these elements include tens of thousands of pseudogenes, as well as comparably numerous long non-coding RNA (lncRNA) genes. Pseudogene transcription and function remain insufficiently understood. However, the field is of great importance for human disease due to the high sequence similarity between pseudogenes and their parental protein-coding genes, which generates the potential for sequence-specific regulation. Recent case studies have established essential and coordinated roles of both pseudogenes and lncRNAs in development and disease in metazoan systems, including functional impacts of lncRNA transcription at pseudogene loci on the regulation of the pseudogenes’ parental genes. This review synthesizes the nascent evidence for regulatory modalities jointly exerted by lncRNAs and pseudogenes in human disease, and for recent evolutionary origins of these systems.

The authors are, of course, entitled to their opinion but they are not entitled to state it as if it were a fact. I do not believe that the prevalence of non-coding functional elements is a key "revelation" of the past 15 years.
For one thing, those elements that truly are functional were known BEFORE the human genome was sequenced. For another, it's not true, in my opinion, that there are huge amounts of functional DNA in the human genome. Any scientist who has kept up with the literature will know that the conclusions of the ENCODE Consortium and FANTOM are not universally accepted so they should not be quoted in an abstract as if they were necessarily true.
It would be okay to say something like this, "We believe that ENCODE and FANTOM have demonstrated that much of the human genome is functional but we will review and report contrary evidence and opinions."
The authors say that "tens of thousands of pseudogenes" are functional but there's no evidence at all that this is true. They also say that a similar number of lncRNA elements are functional but, again, there is no evidence that this is true. There may be lots of people who like to think that tens of thousands of DNA elements are functional (i.e. genes) because they produce functional RNAs but wishing is not evidence.
It would be okay to say, "After an extensive review of the literature we conclude that tens of thousands of pseudogenes, and a similar number of lncRNAs, are functional although we recognize that most scientists will disagree with our opinion."
There's a more fundamental problem with this abstract and it has to do with the connections between genome activities and disease. The implicit assumption in this paper, and in many other papers, is that the locus of disease-causing mutations pinpoints functional regions of the genome. This is not correct. You could easily have a mutation that enhances transcription in a junk DNA region and the aberrant transcription interferes with the expression of a nearby gene. An example might be a spurious mutation that leads to transcription of an adjacent pseudogene from the opposite strand and the resulting antisense RNA blocks translation of the mRNA from the active gene. That does not mean that the junk DNA and the pseudogene now have a function.
You can also have a mutation in the junk DNA part of a large intron creating a new splice site, leading to splicing errors that shut down proper gene expression. This does not mean that the site of the mutation has a function and can no longer be considered junk. We need to recognize that many disease-causing mutations might occur in junk DNA. These go by the unfortunate name of "gain-of-function" mutations.
The Milligan & Lipovich paper begins with ....
Redefining the Human Gene Count

You can guess where this is going. The authors are going to make the case that new data has forced us to recognize that there are genes for functional RNAs that don't encode proteins. This is a standard approach for a certain group of scientists who want to defend ENCODE and the functionality of most of our genome.
Classical definitions of genes focus on heritable sequences of nucleic acids which can encode a protein (White et al., 1994).
The set-up requires you to believe that during the 1990s everyone thought that the only kind of genes were those that encoded proteins. This is not true, but it is a misrepresentation that seems to be widely believed. I can assure you that knowledgeable scientists have known about genes for ribosomal RNAs and tRNAs for half a century and we've known about a host of other genes for functional RNAs for thirty years.
It may be the case that Michael Milligan and Leonard Lipovich were ignorant of non-protein-coding genes until very recently but it's not fair to imply that this misconception was shared by most knowledgeable scientists.
The reference (White et al., 1994) was not something I recognized so I tried to look it up. After a bit of searching I realized that the order of authors was incorrect and the real reference is Fields et al. (1994). It's a News & Views article in Nature Genetics entitled "How many genes in the human genome?" The authors are from Craig Venter's institute, TIGR (The Institute for Genomic Research).
Fields et al. know that defining the word "gene" is important so they say ...
Counting genes requires being clear about what counts as a gene. "Gene" is a notoriously slippery concept, and differing notions about what it means to identify one can lead to heated disagreements. Some define a gene physically as a region of DNA sequence containing a transcription unit and the associated regulatory sequences.

They refer to genes for small regulatory RNAs but decide to focus on transcription units that can be translated into proteins in the rest of their discussion.
It's not clear to me why Milligan & Lipovich use this reference to bolster their claim that "classical" definitions of genes focus on genes that encode proteins, unless they mean that Fields et al. were aware of the proper definition of gene but decided to restrict their count to protein-coding genes. (See What Is a Gene? for a more thorough discussion.)
Milligan & Lipovich continue the Introduction with ...
The question of how many genes the human genome contains has been an evolving point of contention since before the Human Genome Project. In 1994, the estimated total human protein-coding gene count was 64,000–71,000 genes (White et al., 1994). The higher gene estimate was based on partial genome sequencing, GC content, and genome size. The lower bound of 64,000 took into account expressed sequence tags (ESTs) and CpG islands as additional prediction factors. In 2000, a new count of actively transcribed genes was estimated at 120,000 using the TIGR Gene Index, based on ESTs, with the results from the Chromosome 22 Sequencing Consortium (Liang et al., 2000). 1 year later, Celera arrived at only 26,500–38,600 protein-coding genes using their completed human genome and comparative mouse genomics (Venter et al., 2001). The Human Genome Project, which used tiling-path sequencing as opposed to Celera’s shotgun sequencing, converged on a similar estimate (Lander et al., 2001).
- False History and the Number of Genes
- False History and the Number of Genes 2010
- Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome
- How many genes do we have and what happened to the orphans?
- Michael White's misleading history of the human gene
At least in this case the authors have read an "ancient" paper from 1994. It's the Fields et al. paper that I talked about above, except that they refer to it as White et al. (1994). It's actually a pretty good paper on the number of genes. They discuss estimates ranging from 14,000 to 100,000, recognizing that the problem was difficult. Unfortunately they don't discuss any of the genetic load predictions.
Fields et al. (1994) figure there are between 60,000 and 70,000 protein-coding genes in the human genome. But just because some people thought that there were so many genes doesn't mean that this was the value universally accepted by all knowledgeable scientists.
By the time the complete draft human genome sequence was published we already knew the sequences of chromosomes 21 and 22, and the gene frequency in these chromosomes gave rise to predictions of 40,000 to 45,000 genes in the whole genome (see Aparicio, 2000). These were likely to be overestimates since both of these small chromosomes are rich in genes compared to the rest of the genome. (At the time we didn't know that the algorithms for counting genes returned many false positives.) That means that the gene count was approaching the numbers estimated earlier (about 30,000, if you only count knowledgeable scientists).
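The extrapolation behind those chromosome-based estimates is simple arithmetic: count the annotated genes on the sequenced chromosomes, compute a gene density, and scale to the size of the whole genome. Here is a minimal sketch in Python; the per-chromosome figures are illustrative assumptions on my part (roughly the published annotation counts for chromosomes 21 and 22), not numbers taken from this post.

```python
# Back-of-the-envelope extrapolation of total human gene number from the
# gene density of chromosomes 21 and 22, in the spirit of the ~2000 estimates.
# ASSUMED figures, chosen for illustration only:
chr21_genes, chr21_mb = 225, 33.8   # approx. chromosome 21 annotation
chr22_genes, chr22_mb = 545, 33.4   # approx. chromosome 22 annotation
genome_mb = 3200                     # approx. haploid genome size in Mb

# Gene density on the two sequenced chromosomes (genes per Mb).
density = (chr21_genes + chr22_genes) / (chr21_mb + chr22_mb)

# Naive whole-genome extrapolation from that density.
estimate = density * genome_mb

print(f"density  = {density:.2f} genes/Mb")
print(f"estimate = {estimate:.0f} genes")
```

Because chromosomes 21 and 22 are gene-rich relative to the genome as a whole, any estimate of this kind is biased upward, which is exactly the caveat noted above; the published 40,000–45,000 figures also folded in other prediction factors.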
I find it interesting that Milligan & Lipovich take a different view of the history, saying that the estimates from chromosomes 21 and 22 predicted 120,000 genes. Their reference is Liang et al. (2000). It's true that Liang et al. worked at TIGR and it's true that their estimate was 120,000. However, that paper is in the same issue of Nature Genetics as the Aparicio (2000) paper I just quoted and two papers by Ewing and Green (2000) and Roest Crollius et al. (2000). The Ewing and Green estimate is 35,000 genes. The Roest Crollius et al. estimate is 28,000–34,000 genes. The papers were part of an issue on "The Nature of the Number."
So even if your version of ancient history only extends back to 1994, it's clear that by 2000 (one year before publication of the draft human genome sequence) most knowledgeable scientists—even those who were ignorant of the real ancient history from the 1960s—were thinking that the human genome had about 30,000 genes.
You may be wondering, as I did, why Milligan & Lipovich want to make a point about historical estimates of gene number when we already know the correct answer. I'm not sure why they think it's important. Clearly it's not important enough for them to have done a critical job of describing that history. Based on what I've seen in other papers, this sort of introduction seems designed to show you that there is a lot of "missing information" in the genome since scientists were expecting many more genes.
These are estimates of protein-coding genes. That's not because knowledgeable scientists didn't know about any other genes, it's because recognizing genes for functional RNAs is much more difficult. Samuel Aparicio explained it very nicely 15 years ago (Aparicio, 2000) ...
Although the tendency (especially in a pay-per-sequence access mode) is to assume that any transcript represents a gene, classical genetics demands some evidence of associated function. Crucially, what is not yet established (but is implied to be relatively abundant by these studies) is the extent of biological "noise" in the transcriptome of any given cell. In other words, what fraction of transcripts which can be isolated have any meaningful function? What fraction might be mere by-products of spurious transcription, spuriously fired off, perhaps on the antisense strand from promoters or CpG islands associated with protein coding genes (as seems to be the case with a number of imprinted genes)?

Lots and lots of scientists have expressed this cautionary view but no matter how many times it's published there are many more scientists who ignore the warning to this day. It's not a question of whether, in your opinion, the transcripts are functional in spite of the potential problems; it's that too many scientists won't even recognize that there's a problem.
Let's see the next paragraph in the Milligan & Lipovich paper.
Following the sequencing of the human genome, focus has shifted toward understanding gene function. In 2005, the FANTOM (Functional Annotation of Mammals) Consortium determined that the mouse genome harbored more non-coding genes than coding genes (Carninci and Hayashizaki, 2007). In a parallel project to FANTOM, the ENCODE (Encyclopedia of DNA Elements) Consortium began exhaustively surveyed the epigenetics and regulation of the whole genome (Birney et al., 2007; Consortium ENCODE Project, 2012). ENCODE’s continuing effort to recount human genes (GENCODE) using the study of genetic landmarks indicative of transcription and next generation sequencing has allowed them to arrive at a current total of just under 58,000 genes as of 2013 (gencodegenes.org). Of these 58,000 genes ENCODE only defines approximately 20,000 genes as coding, with almost all of the other genes being classified as pseudogenes and non-coding RNA (ncRNA). Early studies of the mouse transcriptome by the FANTOM Consortium first motivated the redefinition of a gene into a transcriptional unit as a consequence of large numbers of lncRNA genes discovered (Carninci et al., 2005).

Things are beginning to fall into place in this paper. The authors want you to believe that historical gene number estimates were much higher than the actual number of genes observed when the human genome sequence was published. That's because scientists thought that the only kind of genes were those that encode proteins, according to the myth. However, recent discoveries by ENCODE and FANTOM show that those scientists were wrong and there are actually genes for noncoding RNAs. Furthermore, those RNA genes outnumber the protein-coding genes by a large margin (38,000 to 20,000).
The caution expressed by Aparicio, and many others, is ignored. The rest of the paper consists of reviews of lncRNA functions and pseudogene functions. With respect to lncRNAs, there's no discussion of whether these lncRNAs represent "noise" and no critical review of the case for function. Even lack of conservation doesn't faze Milligan & Lipovich because these nonconserved genes for lncRNAs are still exaptive—they can easily become important functioning genes. As reservoirs for future change, they are "not disposable even when adaptation doesn't govern their existence."
Contrast this biased review with a review of lncRNAs published by my colleagues Alex Palazzo and Eliza Lee in the same journal a month earlier (Palazzo and Lee, 2015). They review the literature with a critical eye and conclude that ...
The genomes of large multicellular eukaryotes are mostly comprised of non-protein coding DNA. Although there has been much agreement that a small fraction of these genomes has important biological functions, there has been much debate as to whether the rest contributes to development and/or homeostasis. Much of the speculation has centered on the genomic regions that are transcribed into RNA at some low level. Unfortunately these RNAs have been arbitrarily assigned various names, such as “intergenic RNA,” “long non-coding RNAs” etc., which have led to some confusion in the field. Many researchers believe that these transcripts represent a vast, unchartered world of functional non-coding RNAs (ncRNAs), simply because they exist. However, there are reasons to question this Panglossian view because it ignores our current understanding of how evolution shapes eukaryotic genomes and how the gene expression machinery works in eukaryotic cells. Although there are undoubtedly many more functional ncRNAs yet to be discovered and characterized, it is also likely that many of these transcripts are simply junk. Here, we discuss how to determine whether any given ncRNA has a function. Importantly, we advocate that in the absence of any such data, the appropriate null hypothesis is that the RNA in question is junk.

I know for a fact that the Palazzo and Lee manuscript was reviewed by a number of knowledgeable and skeptical scientists before it was sent off. They even sent it to an old curmudgeon who criticizes everything.1
The question is, why didn't the Milligan & Lipovich paper get the same scrutiny before they sent it off to the journal?
The other part of the Milligan & Lipovich paper discusses possible functions of pseudogenes. Again, there's a remarkable lack of critical thinking. The only case presented is the case for function. There's no attempt whatsoever to critically analyze and defend their claim in the abstract and introduction that "... the prevalence of non-protein-coding functional elements in the human genome has emerged as a key revelation in post-genomic biology." It's a classic case of confirmation bias and this isn't supposed to happen in the scientific literature, especially in reviews.
1. They didn't need to change any of their main points in response to reviewers because they already knew how to read and interpret the literature correctly.
Aparicio, S.A.J.R. (2000) How to count… human genes. Nature Genetics, 25:129-130. [doi:10.1038/75949]
Ewing, B., and Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet. 25:232-234. [doi:10.1038/76115]
Fields, C., Adams, M.D., White, O., and Venter, J.C. (1994) How many genes in the human genome? Nature Genetics, 7:345-346. [PDF]
Liang, F., Holt, I., Pertea, G., Karamycheva, S., Salzberg, S.L., and Quackenbush, J. (2000) Gene Index analysis of the human genome estimates approximately 120,000 genes. Nat Genet, 25:239-240. [doi:10.1038/76126]
Milligan, M.J., and Lipovich, L. (2015) Pseudogene-derived lncRNAs: emerging regulators of gene expression. Frontiers in Genetics, 5:476. [doi: 10.3389/fgene.2014.00476]
Palazzo, A.F., and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in Genetics, 6:2 [doi: 10.3389/fgene.2015.00002]
Pertea, M., and Salzberg, S. (2010) Between a chicken and a grape: estimating the number of human genes. Genome Biology, 11:206. [doi:10.1186/gb-2010-11-5-206]
Roest Crollius, H., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quetier, F., Saurin, W., and Weissenbach, J. (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat Genet, 25(2), 235-238. [doi:10.1038/76118]