Sandwalk: How many pseudogenes in the human genome?

There are somewhat fewer than 25,000 genes in the human genome, and there are probably about the same number of pseudogenes.

Pseudogenes are sequences that resemble real functional genes but they contain mutations that render them non-functional. They are very real examples of junk DNA.

There are four kinds of pseudogenes. Duplicated pseudogenes arise from a gene duplication event when one of the original copies mutates. Duplicated pseudogenes retain all of the features of the original gene, including introns and adjacent regulatory sequences. The inactivating mutation may occur in the gene itself—for example in the coding region of a protein coding gene—in which case the pseudogene may still be transcribed. Duplicated pseudogenes are usually found adjacent to their parent gene.

Processed pseudogenes arise when the normal transcript is copied by reverse transcriptase and the DNA copy is reintegrated into the genome. Processed pseudogenes don't have introns or regulatory sequences and they are not near their parent gene. Most processed pseudogenes come from transcripts that are expressed in the germ line.

Unitary pseudogenes come from single copies of genes that were no longer essential in the ancestral genome. The gene was mutated and the mutant allele was not detrimental so the gene became a pseudogene. These pseudogenes will have introns and regulatory sequences and they may still be transcribed. Unitary pseudogenes are located at the same locus as functional, and presumably still essential, genes in other species.

Polymorphic pseudogenes are like unitary pseudogenes except that the pseudogene allele has not become fixed in the genome so both the functional gene and the pseudogene are present in the population.
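The four categories above can be read as a small decision procedure. Here is a minimal sketch of that reading. The `Locus` fields and the `classify_pseudogene` function are hypothetical names made up for illustration, and the rules are a simplification of the descriptions above, not how any annotation pipeline actually works (for example, a processed copy of an intronless gene would fool the intron test).

```python
from dataclasses import dataclass


@dataclass
class Locus:
    """A non-functional gene copy, described by three illustrative features."""
    has_introns: bool                    # retains the introns of the original gene
    parent_gene_in_genome: bool          # a functional copy exists elsewhere in the same genome
    functional_allele_segregating: bool  # a working allele is still present in the population


def classify_pseudogene(locus: Locus) -> str:
    """Assign one of the four categories using simplified rules from the text."""
    # Not yet fixed: both the gene and the pseudogene exist in the population.
    if locus.functional_allele_segregating:
        return "polymorphic"
    # No functional copy anywhere in this genome: the single gene itself was lost.
    if not locus.parent_gene_in_genome:
        return "unitary"
    # A functional parent exists; introns distinguish a DNA-level duplication
    # from a reverse-transcribed copy of a spliced transcript.
    if locus.has_introns:
        return "duplicated"
    return "processed"


# A GULOP-like case: introns retained, no functional copy in the genome, fixed.
print(classify_pseudogene(Locus(True, False, False)))
```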

It's often quite difficult to identify pseudogenes, especially those derived from non-coding genes. Most of the computer programs designed to detect pseudogenes rely on comparing known protein-coding regions from an established database with sequences that resemble those coding regions but have various mutations that prevent production of a full-length functional protein. They don't look for pseudogenes related to non-coding genes.
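To make "mutations that prevent production of a full-length functional protein" concrete, here is a toy check for a candidate coding sequence. It assumes the candidate has already been found and extracted by aligning it to a known coding region, which is the hard part that real programs do. The function name `disabling_features` and the labels it returns are made up for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disabling_features(cds: str) -> list[str]:
    """List crude reasons why a candidate coding sequence cannot
    encode a full-length protein. Returns an empty list if none is found."""
    cds = cds.upper()
    problems = []
    # A length that is not a multiple of three is a crude proxy for a frameshift.
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("no start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon anywhere before the last codon truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")
    return problems


print(disabling_features("ATGAAATGA"))     # intact toy reading frame
print(disabling_features("ATGTAAAAATGA"))  # stop codon in the second position
```

A check like this only covers protein-coding candidates, which is exactly the limitation described above: it has nothing to say about pseudogenes derived from non-coding genes.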

It is quite easy to find pseudogenes in bacteria but much more difficult in complex eukaryotes because most of a protein-coding gene consists of introns and the coding DNA in the exons is quite short. Early attempts with the human genome gave results ranging from a little over 11,000 to more than 18,000 depending on the program that was used.

A comparison of the pseudogenes detected by various groups revealed surprisingly little overlap so the actual number of pseudogenes could have been as low as 7,000 (Harrow et al., 2012).
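"Overlap" between two sets of predictions is usually a question about genomic coordinates: do two calls cover any of the same bases? Here is a minimal sketch of that comparison. The coordinates are invented for illustration, the intervals are assumed to be half-open (start included, end excluded), and `overlaps` and `shared_calls` are hypothetical helper names, not functions from any published tool.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) calls share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def shared_calls(set_a, set_b):
    """Calls in set_a that overlap at least one call in set_b."""
    return [a for a in set_a if any(overlaps(a, b) for b in set_b)]


# Invented example calls from two hypothetical programs.
program_1 = [("chr1", 100, 500), ("chr1", 2000, 2600), ("chr2", 50, 300)]
program_2 = [("chr1", 450, 900), ("chr2", 1000, 1400)]

common = shared_calls(program_1, program_2)
print(len(common), "of", len(program_1), "calls from program 1 are also found by program 2")
```

If only the calls supported by every program are counted, the total can be much smaller than the number reported by any single program.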

The GENCODE annotation in 2012 listed about 11,000 pseudogenes derived from protein-coding genes (Harrow et al., 2012). The latest Ensembl/GENCODE annotation predicts 15,204 pseudogenes but it's not clear whether this includes non-coding genes [Human assembly gene annotation].

A recently published paper describes a new program that found a lot more pseudogenes.

Cabanac, S., Dunand, C. and Mathé, C. (2026) P-GRe: An efficient pipeline for pseudogenes annotation. Genomics 118:111216. [doi: 10.1016/j.ygeno.2026.111216]

Formerly considered as part of “junk DNA”, pseudogenes are nowadays known for their role in the post-transcriptional regulation of functional genes. Their identification also contributes to a better understanding of gene evolution, particularly in relation to adaptive responses and the evolution of multigene families. Despite this, there is, to our knowledge, no fully automatic pipeline allowing annotation of the pseudogenes on a whole genome. Here, we propose a new software named Pseudo-Gene Retriever (P-GRe). This is a completely automated pseudogene prediction tool requiring only a genome sequence, its corresponding GFF annotation file, and a protein sequences file. The aligner miniprot has been integrated in our pipeline, because of its high speed and sensitivity. With several filtering and post-analysis steps P-GRe outperforms existing software, while being more sensitive and bringing the new capacity of annotating unitary pseudogenes.

For now, I'll ignore the gratuitous comments about functional pseudogenes because there's no data in the paper on that topic—this is simply a paper describing yet another algorithm to detect pseudogenes related to protein coding genes.

Cabanac et al. detected 28,790 pseudogenes and that's far more than the number currently annotated in the latest reference genome. This could be due to the fact that the reference database of vertebrate proteins contained 19 million entries and these included transposon proteins and, presumably, virus and retrovirus sequences. I don't think that fragments of degenerate transposon genes should really be counted as pseudogenes.

The interesting part of this paper is that the authors characterize the various types of pseudogenes in Arabidopsis and humans. The surprising part is the number of unitary pseudogenes (~5200).

The classic example of a unitary pseudogene is the remnants of a gene for one of the enzymes in the pathway for synthesizing vitamin C [Human GULOP Pseudogene]. Our ancestral primates lost that gene, probably because they got sufficient vitamin C from eating fruit. It's difficult to imagine that there would have been 5000 other genes in our vertebrate ancestors that have been subsequently inactivated in the lineage leading to humans.

Maybe the authors are detecting virus and transposon genes?

There's a disturbing tendency in the pseudogene literature to attribute function to what is obviously junk DNA. This tendency is based, in part, on the discovery of a tiny number of examples of pseudogenes that have secondarily acquired a function. This is a classic example of a logical fallacy known as "cherry-picking" where scientists over-generalize from a few "exceptions that prove the rule." [Different kinds of pseudogenes - are they really pseudogenes?] [Are pseudogenes really pseudogenes?].

Harrow, J., Frankish, A., Gonzalez, J.M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B.L., Barrell, D., Zadissa, A. and Searle, S. et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Research 22:1760-1774. [doi: 10.1101/gr.135350.111]

14 comments:

Anonymous (Wednesday, April 15, 2026 8:34:00 AM)
What would you call a gene that is (very) useful in one cell type but nonfunctional in other cell types? An example would be the INS gene, which codes for insulin, but only in specific cells in the islets of Langerhans in the pancreas. Unless Casey Luskin is busy doing lab research on alternative splicing for this gene, it seems that it is junk in other cell types like a bone cell or a nerve cell.
SPARC (Wednesday, April 15, 2026 9:47:00 AM)
I guess the same is true for the majority of your genes. So why find a name for a type of gene that is not expressed ubiquitously in a multicellular organism when you can easily describe them in other terms, e.g., cell type specific genes?
Michael Tress (Wednesday, April 15, 2026 12:43:00 PM)
There are 120 unitary pseudogenes in GENCODE v49 as a reference. They aren't common at all. This study has 43 times as many. They do address the differences between the pseudogenes annotated for human and their predictions, and side-step the question by claiming that their predictions are close to another study. But they do not address the unitary pseudogene predictions at all. That one is the fault of the referee who asked them to explain the difference from the human annotation.

I think a lot of these predictions are likely to be alignment errors. They mention how they find a lot of pseudogenes with introns of 10 kb. You could drop 5 or 6 coding genes into introns of that size, easily. If they are joining up exons from non-adjacent (and possibly unrelated?) predicted pseudogenes, then of course all sorts of weird predictions will happen.
John Harshman (Wednesday, April 15, 2026 9:26:00 PM)
Would most recessive alleles be considered pseudogenes? For example, is the O allele in the ABO blood protein system considered a pseudogene?
SPARC (Thursday, April 16, 2026 5:45:00 AM)
According to Larry, yes [Different kinds of pseudogenes: Polymorphic pseudogenes].
Michael Tress (Thursday, April 16, 2026 6:15:00 AM)
Hi John,

The alleles on their own, no. The definition of pseudogenes applies to genomic coordinates. However, ABO itself fits the definition of a polymorphic pseudogene because of the inactive O allele (see Larry's page on polymorphic pseudogenes if anyone wants more information, but ignore the last sentence, which hasn't aged very well! https://sandwalk.blogspot.com/2015/11/different-kinds-of-pseudogenes_20.html).

I don't know if ABO was ever annotated as polymorphic, but right now it is annotated as coding, along with genes like ACTN3 that were previously considered polymorphic pseudogenes, in the case of ACTN3 because the reference genome has the functional allele. Genes without functional alleles in the reference genome are now tagged as "Loss of Function" at the transcript level (CASP12, for example).
Larry Moran (Thursday, April 16, 2026 10:11:00 AM)
@Michael Tress: Thank you for pointing out that the last sentence now looks pretty silly 11 years later. I crossed it out.
Michael Tress (Thursday, April 16, 2026 11:34:00 AM)
Sequencing has got an awful lot easier and cheaper over the last few years. Polymorphic pseudogenes are something that the human pangenome is supposed to "solve".
John Harshman (Thursday, April 16, 2026 9:56:00 PM)
According to Larry, and this makes sense to me too, the term "pseudogene" is indeed attached to alleles. The O allele is a pseudogene while the A and B alleles are functional genes. A polymorphic pseudogene is a gene in which at least one allele present at high frequency is without function and at least one is functional. If the pseudogene allele becomes fixed, we drop the "polymorphic" bit and just call that locus a pseudogene.
John Harshman (Friday, April 17, 2026 7:49:00 PM)
Hey, how about NUMTs? What category of pseudogene would they be?
João (Tuesday, April 21, 2026 9:30:00 PM)
Larry, you may wanna take a look at this paper:
https://www.science.org/content/article/scientists-stunned-fundamentally-new-way-life-produces-dna

Production of DNA from a protein!

Some claimed it violates the central dogma, and for that reason I'd like to see your take on the topic!
John Harshman (Saturday, April 25, 2026 6:11:00 PM)
I think it would violate the central dogma only if the DNA produced by the protein coded for the protein, i.e. could be transcribed and translated into that protein.
Vallve (Friday, May 08, 2026 11:16:00 AM)
About the possible violation of the central dogma and the production of DNA from a protein, I wonder whether the important point here is what the synthesized DNA is actually used for. The DRT3-generated repeat DNA does not appear to integrate into the bacterial genome or encode proteins. If it is just a repetitive defensive molecule, does it really represent biological “information flow” from protein to nucleic acid in the sense intended by the central dogma?

Tuesday, April 14, 2026
