
Wednesday, January 08, 2020

Are pseudogenes really pseudogenes?

There are many junk DNA skeptics who claim that most of our genome is functional. Some of them have even questioned whether pseudogenes are mostly junk. The latest challenge comes from a recent review in Nature Reviews Genetics where the authors try to place the burden of proof on those who say that pseudogenes are broken, nonfunctional genes (Cheetham et al., 2019). The authors of the review try to make the case that we should not label a DNA sequence as a pseudogene until we can prove that it is truly nonfunctional junk.

I'm about to refute this ridiculous stance but first we need a little background.

What is a pseudogene?

The traditional definition of a pseudogene is a DNA sequence that resembles a known gene except that it carries mutations making it nonfunctional [Different kinds of pseudogenes - are they really pseudogenes?]. There are four types of pseudogenes but the two main classes are processed pseudogenes that arise from cDNA copies of functional RNA and duplicated pseudogenes that arise from a gene duplication event followed by inactivation of one of the copies (see figure below).

The formation of a pseudogene from a duplication event is part of the well-studied birth & death evolution of genes where death by inactivation or deletion is the most common fate [On the evolution of duplicated genes: subfunctionalization vs neofunctionalization] [Birth and death of genes in a hybrid frog genome].

The idea that most duplicated genes will become pseudogenes is consistent with a ton of data and fits well with our understanding of mutation rates and genome evolution. This is an important point. We don't arbitrarily assign the word "pseudogene" to any old DNA sequence. The designation is based on the fact that the duplicated region is no longer transcribed, or it is no longer correctly spliced, or that it carries mutations rendering the product nonfunctional. (In the case of protein-coding genes it could be that the reading frame is disrupted.) It's also important to understand that the frequency of these inactivating mutations and the rate of fixation of the resulting alleles are perfectly consistent with everything we know about molecular evolution.
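As a rough illustration of the reading-frame criterion for protein-coding genes, here is a minimal Python sketch. The function name `orf_defects` and the toy sequences are hypothetical, made up for this example. Real annotation relies on much more evidence than this, such as comparison with the parental gene, transcription and splicing data, and comparisons with related species.

```python
# Hypothetical helper: flags simple defects in a candidate coding sequence.
# This is an illustrative sketch, not how any annotation pipeline works.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_defects(cds: str) -> list:
    """Return a list of simple defects that would disrupt the reading frame."""
    seq = cds.upper()
    defects = []

    if not seq.startswith("ATG"):
        defects.append("missing start codon")

    if len(seq) % 3 != 0:
        defects.append("length not a multiple of three (possible frameshift)")

    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

    # A stop codon anywhere before the last codon truncates the product.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        defects.append("premature stop codon")

    if not codons or codons[-1] not in STOP_CODONS:
        defects.append("missing terminal stop codon")

    return defects


# Toy sequences, invented for illustration only.
print(orf_defects("ATGGCCAAATAA"))     # intact frame: []
print(orf_defects("ATGGCCTAAAAATAA"))  # ['premature stop codon']
```

An empty list only means that none of these simple defects were found. It says nothing about whether the sequence is transcribed or whether its product is functional.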

The same reasoning applies to processed pseudogenes. The integration into the genome of a cDNA copy of a functional RNA is a relatively rare event and it makes sense that the integrated sequence is nonfunctional since it lacks a promoter. Thus, it seems reasonable to assume that most of these sequences are dead-on-arrival; in other words, they are pseudogenes. This is especially true of DNA sequences that are missing part of the 5′ end of the gene.

Some pseudogenes appear to have a function

There are some examples of DNA sequences that appear to be pseudogenes but they also have functional regions. The best examples are duplicates that contain small RNA genes within their introns or genes that contain other functional regions like SARs and origins of replication. In those cases, the inactivated gene is still a pseudogene but the other functional regions are best characterized as something else.

There are also quite a few examples of pseudogenes that have secondarily acquired a distinct new function such as producing a small RNA that might have a regulatory function. The review by Cheetham et al. contains several examples of such pseudogenes. They are still pseudogenes but the region may now specify a new lncRNA gene or some other gene such as an siRNA gene.

There are some examples of gene duplicates producing a truncated protein that plays a role in regulating the expression of the normal full-length protein. These examples are rare but if they are biologically significant then they are examples of neofunctionalization. In this case, the gene duplicate is not a pseudogene, it has evolved a new function.

There is rampant speculation that transcription of the opposite strand of a pseudogene is common and that the antisense RNA plays a biologically significant role in regulating the expression of the parental gene. So far, there seems to be only a small number of examples that make sense.

Are most pseudogenes functional?

All of these functions have caused some workers to question the very existence of pseudogenes and there are many papers in the literature implying that most pseudogenes aren't pseudogenes at all. One of the common characteristics of these papers is a lack of critical thinking and skepticism. Very few of them even mention the possibility that they could be looking at spurious transcripts and/or interactions that are not biologically significant. Here are some examples ...
Balakirev, E.S., and Ayala, F.J. (2003) Pseudogenes: are they “junk” or functional DNA? Annual Review of Genetics, 37:123-151. [doi: 10.1146/annurev.genet.37.040103.103949]

Milligan, M.J., and Lipovich, L. (2015) Pseudogene-derived lncRNAs: emerging regulators of gene expression. Frontiers in Genetics, 5:476. [doi: 10.3389/fgene.2014.00476]

Xu, J., and Zhang, J. (2015) Are human translated pseudogenes functional? Molecular Biology and Evolution, 33:755-760. [doi: 10.1093/molbev/msv268]

Wen, Y.-Z., Zheng, L.-L., Qu, L.-H., Ayala, F.J., and Lun, Z.-R. (2012) Pseudogenes are not pseudo any more. RNA Biology, 9:27-32. [doi: 10.4161/rna.9.1.18277]

Johnsson, P., Morris, K.V., and Grandér, D. (2014) Pseudogenes: a novel source of trans-acting antisense RNAs. In Pseudogenes, Chapter 14 (pp. 213-226). Springer.

Pink, R.C., Wicks, K., Caley, D.P., Punch, E.K., Jacobs, L., and Carter, D.R.F. (2011) Pseudogenes: pseudo-functional or key regulators in health and disease? RNA, 17:792-798. [doi: 10.1261/rna.2658311]
The question is not whether some DNA regions that look like pseudogenes have a function. This is a fact. The real question is whether these regions are exceptions to the general rule that a pseudogene is a pseudogene or whether most pseudogenes have been mislabeled. To me, it seems almost irrational to assume that most pseudogenes are actually functional DNA segments because the direct and circumstantial evidence for junk is very strong.

A new dogma?

The examples above suggest to me that there's a new "dogma" emerging. It's based on the idea that there must be a lot more function in our genome than the data suggest and it's based on an uncritical acceptance of data published by ENCODE and their allies.

This new dogma is evident in the Cheetham et al. paper from Nature Reviews Genetics. They believe that calling something a pseudogene is "dogmatic" (old dogma) and that it has inhibited research. They want to impose a new dogma based on assuming that something has a function until you can prove that it's junk. Here's the abstract ....
Cheetham, S.W., Faulkner, G.J., and Dinger, M.E. (2019) Overcoming challenges and dogmas to understand the functions of pseudogenes. Nature Reviews Genetics. [doi: 10.1038/s41576-019-0196-1]

Pseudogenes are defined as regions of the genome that contain defective copies of genes. They exist across almost all forms of life, and in mammalian genomes are annotated in similar numbers to recognized protein-coding genes. Although often presumed to lack function, growing numbers of pseudogenes are being found to play important biological roles. In consideration of their evolutionary origins and inherent limitations in genome annotation practices, we posit that pseudogenes have been classified on a scientifically unsubstantiated basis. We reflect that a broad misunderstanding of pseudogenes, perpetuated in part by the pejorative inference of the ‘pseudogene’ label, has led to their frequent dismissal from functional assessment and exclusion from genomic analyses. With the advent of technologies that simplify the study of pseudogenes, we propose that an objective reassessment of these genomic elements will reveal valuable insights into genome function and evolution.
As the title suggests, Cheetham et al. want us to overcome dogma and see pseudogenes in a new light. Their main objection to labeling something a "pseudogene" is that it inhibits research.
... the annotation of genomic regions as pseudogenes constitutes an etymological signifier that an element has no function and is not a gene. As a result, pseudogene-annotated regions are largely excluded from functional screens and genomic analyses. Therefore, the process of pseudogene annotation is paramount in the consideration of which genomic elements are assessed for biological impact. However, with a growing number of instances of pseudogene-annotated regions later found to exhibit biological function, there is an emerging risk that these regions of the genome are prematurely dismissed as pseudogenic and therefore regarded as void of function.
This is a nonsensical argument. Let's assume that we have all agreed not to call a particular region a "pseudogene" and instead we just identify it as an unknown segment of DNA. Let's assume that you have a bundle of money and a new grad student or postdoc and you are looking for a project. If the region looks like a pseudogene by all the criteria that we have used in the past, are you going to ignore that evidence just because annotators don't call it a pseudogene? Are you going to put valuable resources into looking for a function? Of course not.

If it's true that the vast majority of pseudogenes really are pseudogenes then that's not inhibiting research. That's just looking at the evidence. It doesn't mean that every single pseudogene is actually nonfunctional junk; it just means that you need to have some evidence of function before you ask a student or postdoc to look for function. In the absence of evidence, the default assumption is pseudogene if it looks like a defective gene or a processed bit of cDNA.

If it looks like a pseudogene then call it a pseudogene

I think we can dismiss the argument that calling something a pseudogene will inhibit research. The second part of their argument focuses specifically on gene annotation where annotators look at the sequence and conclude that the region is probably a pseudogene. For example, they see a bit of DNA that resembles the coding region of a known gene except that it has a number of mutations interrupting the open reading frame. The region has nothing that looks like an intron and it lacks any of the characteristics of the parental gene promoter. The annotators assume that it's a processed pseudogene.

Similarly, annotators see a duplicated region of DNA that includes a known gene. They see that the duplicate has multiple mutations making it impossible to encode the same functional product as the parental gene. They note that closely related species do not have similar sequences and they note that the frequency of mutations is consistent with the fixation of neutral alleles by random genetic drift. They label it a pseudogene.

Cheetham et al. think there's something wrong with this analysis. Here's what they say ...
Therefore, we suggest that it may be useful to consider the annotation of pseudogenes in genomes as a prediction or a hypothesis rather than a classification. As discussed further below, the inherent semantic contradiction that arises when a pseudogene is found to have biological function raises the notion that the term pseudogene should be reserved for gene copies that have been empirically demonstrated to be defective rather than indicated by algorithmic prediction alone.
In other words, you have to prove that a sequence has no function before you can call it a pseudogene. That's absurd.

There are roughly 15,000 annotated pseudogenes in the human genome and they cover about 5% of the genome. Is it reasonable to demand that we prove that every one of these sequences lacks a function before we can confidently assume that they are pseudogenes? Of course not. The circumstantial evidence is more than sufficient to tentatively identify each of them as a pseudogene. If some researcher suspects that a label is incorrect then the burden of proof is on them to show that the DNA has a function.

When labels can be misleading

I understand where the authors are coming from because it's true that the sloppy assignment of labels can carry implications that confuse researchers. For example, I've argued previously that we shouldn't label every splice variant as an example of alternative splicing. Alternative splicing is a very real phenomenon but most splice variants are just mistakes in splicing and they have no biological function. This is the exact opposite of the pseudogene problem since, in the case of alternative splicing, the label implies function and leads researchers to make the unwarranted assumption that most genes are alternatively spliced [The frequency of splicing errors reflects the balance between selection and drift] [Are splice variants functional or noise?].

LncRNAs are another example of the misuse of labels. There are very real examples of functional lncRNAs but when you call every long transcript a lncRNA you are implying that they have a function when, in fact, most of them are spurious transcripts. It's interesting that Cheetham et al. don't seem to recognize that the misuse of the term "lncRNA" is a serious problem [How many lncRNAs are functional?].
The scenario is reminiscent of, and in many regards analogous to, the challenges that the lncRNA field underwent following the initial observation of their pervasive transcription in mammalian genomes. lncRNAs were similarly dismissed initially as emanating from ‘junk DNA’ or as transcriptional noise, largely by virtue of their definition as non-protein-coding, and were challenging to study due to their generally lower and more restricted expression patterns relative to mRNAs. Following a combination of technology developments, genome-wide studies, and detailed biochemical studies, lncRNAs are now routinely included in genome-wide analyses, and their functional potential as cellular regulators is widely recognized.
What they should have said was that the discovery of numerous long, low concentration transcripts led to a debate over whether they were functional or just junk RNA. We now know that the vast majority are probably spurious transcripts but a few, now called lncRNAs, have been shown to have a biological function.

Here's a case where the bias of the authors is showing. They want to believe that most of those transcripts have a function so they are comfortable with calling them all lncRNAs and putting the burden of proof on those who dismiss them as junk RNA. Similarly, they want to believe that most gene-like sequences are functional so they don't want to "mistakenly" label them pseudogenes because that conflicts with their bias.

We need to have a serious discussion about the use/misuse of labels such as "junk," "pseudogenes," "alternative splicing," and "lncRNA," but this paper is not a good beginning. We also need to have a serious debate about whether the default position in the absence of evidence should be function or nonfunction. Unfortunately, there's no trace of a serious discussion in this review.

Once again, we see a paper being published in a prestigious peer-reviewed journal where there's very little evidence that the reviewers did an appropriate job. Something is seriously wrong.

Cheetham, S.W., Faulkner, G.J., and Dinger, M.E. (2019) Overcoming challenges and dogmas to understand the functions of pseudogenes. Nature Reviews Genetics. [doi: 10.1038/s41576-019-0196-1]


  1. I am a biology undergraduate and I am so happy to have found this blog. I have nothing to add to the post, just want to say thank you professor Moran for showing me a new way to look at biochemistry and molecular biology.

  2. The Cheetham review article demands that researchers prove a negative proposition. It reminds me of Bertrand Russell's "prove to me there is not a teapot in space" analogy that so aptly dismisses the arguments for gods based on "proofs" that adamantly assert that the absence of evidence is not evidence of absence. It would mean the ruin of the scientific enterprise if such non-productive assertions were to become generally accepted as viable alternatives to evidence-based explorations and explanations of actual biological function.

    The ENCODE project continues to contaminate the scientific literature and that makes me... sad.

    1. I agree, placing the burden of proof on a negative proposition is very unscientific. The null hypothesis is called "null" for a reason.

  3. Important post, as are all of your critiques of current postmodern genomics.

  4. Unfortunately, online commenting on Nature's web pages was closed years ago and PubMed Commons was discontinued back in 2018. I question whether Nature Reviews Genetics would be willing to publish your remarks even if they had a section for this purpose.

  5. “In other words, you have to prove that a sequence has no function before you can call it a pseudogene.“

    You are right that this is absurd because it is practically impossible to prove a lack of function. Even if one could successfully show that a single pseudogene had no function in one tissue, it could always be argued that it might have a function elsewhere or at different developmental stages.

    There is a problem with the annotation of pseudogenes, but it isn’t the one the authors identify. Pseudogenes are classified based on the balance of the available evidence and determining the status of many pseudogenes is relatively straightforward; most have collected enough debilitating mutations for the decision to be clear cut (though there are exceptions: PLK5 ought to be a pseudogene, but it has an antibody in a published paper, so it is classified as coding).

    But the human genome also has many recent duplications. Many of these are highly sequence similar, which makes it difficult to distinguish which copies code for proteins and which do not, even with experimental evidence. The result often is that all members of the family are retained as coding genes, even when most are likely to be pseudogenes. The USP17L family is a good example.

    I don't believe that this paper will have much echo in the end. Researchers who follow the suggestions will soon find that there is little to support functionality in the vast majority of pseudogenes. In this at least it isn’t comparable to alternative splicing, because a clear proportion of alternative isoforms are likely to be functionally important. The discussion with regards to alternative splicing is the size of that proportion …

    One last thing, I wouldn't necessarily blame the referees. Editors are sometimes so enamored with a paper that they ignore referee suggestions.