Tuesday, April 21, 2009

Sequenced genomes contain thousands of "unknown" genes

The total number of genes in the human genome has dropped from the initial estimates of 30-35,000 to about 25,000. Of these, more than 4,000 encode functional RNAs, leaving about 20,500 protein-encoding genes in the human genome [Humans Have Only 20,500 Protein-Encoding Genes].

Up to 40% of these protein-encoding genes are "unknown" in the sense that no function has been assigned to their protein products. In the jargon of genomics, the genes are "unannotated," meaning that nobody has assigned a function to the gene in the human genome database (Reichardt, 2007).

That works out to roughly 8,000 unknown genes (40% of 20,500). About 1,000 of these are "orphan" genes: genes that have no homologues in other species, including chimpanzees (Clamp et al., 2007).

Humans aren't unique. All sequenced eukaryotic genomes have a high percentage (~30-40%) of "unknown" protein-encoding genes.

A new paper in PLoS ONE looks at the "unknown" genes in the filamentous fungus Neurospora crassa (pink bread mold) (Kasuga et al. 2009). The Neurospora genome has about 9,000 protein-encoding genes and more than half of them have not been annotated. They are the "unknown" genes.

The genomes of about 40 different species of fungus have been sequenced and many of these are filamentous fungi related to Neurospora. This makes it possible to compare the Neurospora genes to those in genomes at several evolutionary distances: closely related species, species in the same family (less closely related), species in the same phylum, and distantly related species. You can't do such an extensive study with the human genome because there aren't very many mammalian genomes that have been sequenced and carefully annotated. A draft sequence of the chimpanzee genome, for example, has been published but it is neither complete nor reliable enough for genomic comparisons. The only other primate genome is from macaque (Rhesus monkey) and that's far from finished. (The human and mouse genomes are the only ones listed as "complete" on the NCBI/Entrez website.)

The question is: are the unknown genes confined to Neurospora and its close relatives? If so, it would suggest that new genes have evolved within the past several million years and that's why we don't know their function.

Kasuga et al. created six sets of genes ...
  1. Genes with homologs in distantly related eukaryotes and possibly prokaryotes. These are ancient genes.
  2. Genes that are found only in fungi (Dikarya) and not in plants, animals, or protists.
  3. Genes found only in Ascomycetes.
  4. Genes confined to the Pezizomycotina clade to which Neurospora belongs.
  5. Genes found only in Neurospora.
  6. Others: genes that are found in some of the first groupings but not in all of the smaller groupings.

The classification depends on the similarity cutoff. If the lowest cutoff is 25% sequence identity, then there will be more homologs in the eukaryote or prokaryote class than if the cutoff is raised to 35%. The distribution of the various classes at each of three minimum sequence identity cutoffs is shown in their second figure.
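To see why the cutoff matters, here is a toy sketch of how a gene could be sorted into the six sets. This is not the pipeline Kasuga et al. used. The level names, the `classify` function, and the example identities are all hypothetical, and the only assumption is that each gene comes with the best percent identity found at each taxonomic level.

```python
# Toy sketch: assign one gene to a phylogenetic class from its best homolog hits.
# All names and numbers are hypothetical, not taken from Kasuga et al.

# Taxonomic levels ordered from broadest to narrowest. Each level holds the
# best hit found in that group but outside the next narrower group.
LEVELS = ["distant", "dikarya", "ascomycota", "pezizomycotina"]

CLASS_NAMES = {
    "distant": "1. ancient (eukaryote/prokaryote)",
    "dikarya": "2. fungi (Dikarya) only",
    "ascomycota": "3. Ascomycetes only",
    "pezizomycotina": "4. Pezizomycotina only",
}


def classify(best_identity, cutoff):
    """Return the class label for one gene.

    best_identity maps a level name to the highest percent identity of any
    homolog found at that level; a missing level means no hit at all.
    """
    present = [best_identity.get(level, 0.0) >= cutoff for level in LEVELS]
    if not any(present):
        return "5. Neurospora only (orphan)"
    broadest = present.index(True)
    # A clean class needs a homolog at every narrower level as well.
    if all(present[broadest:]):
        return CLASS_NAMES[LEVELS[broadest]]
    return "6. others"


example_gene = {
    "distant": 28.0,
    "dikarya": 40.0,
    "ascomycota": 55.0,
    "pezizomycotina": 80.0,
}

for cutoff in (25, 30, 35):
    print(cutoff, classify(example_gene, cutoff))
```

With these made-up numbers the same gene counts as ancient at the 25% cutoff but drops into the fungi-only set at 30% and 35%, because its best distant hit (28% identity) no longer passes.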

Taking the 30% threshold numbers (middle group), it looks like there are 2,358 highly conserved genes with homologs in distantly related eukaryotes and prokaryotes. In contrast, there are 2,219 genes that don't have homologs in any other species. These are the orphan genes in Neurospora.

You might expect that most of the unknown/unannotated genes would be confined to Neurospora and closely related species. You might expect that highly conserved genes would be more likely to have been identified. That's partly true. Here are the numbers.

Only 16.5% of the highly conserved genes are mystery genes of unknown function. While this is much lower than the total (56%), it's still surprising that so many of the core genes remain unidentified. Presumably they are doing something very important. There are dozens of thesis projects available for talented graduate students who want to make a valuable contribution to biology.

It's not a surprise that 94% of the orphans are unannotated. These genes are likely to be new genes that have evolved recently in Neurospora and they would be expected to carry out unusual reactions that aren't found in other species. These "genes" are also the ones most likely to be artifacts (false positives) of the gene searching software. They may not be genes at all.
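To put rough absolute numbers on those percentages, here is a back-of-the-envelope calculation that uses only the figures quoted above for the 30% threshold. The variable and function names are hypothetical, and the results are approximate because the percentages are rounded.

```python
# Figures quoted in the post (30% identity threshold).
conserved_genes = 2358          # homologs in distant eukaryotes and prokaryotes
orphan_genes = 2219             # no homologs in any other species
frac_conserved_unknown = 0.165  # 16.5% of conserved genes are unannotated
frac_orphan_unknown = 0.94      # 94% of orphans are unannotated


def estimated_unknown(n_genes, fraction):
    """Approximate number of unannotated genes in a class."""
    return round(n_genes * fraction)


conserved_unknown = estimated_unknown(conserved_genes, frac_conserved_unknown)
orphan_unknown = estimated_unknown(orphan_genes, frac_orphan_unknown)

print("conserved genes of unknown function:", conserved_unknown)
print("orphan genes of unknown function:", orphan_unknown)
```

That is close to 390 highly conserved genes with no assigned function, against close to 2,090 unannotated orphans.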

[Image Credit: Neurospora-National Institute of General Medical Sciences]

Clamp, M., Fry, B., Kamal, M., Xie, X., Cuff, J., Lin, M.F., Kellis, M., Lindblad-Toh, K. and Lander, E.S. (2007) Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. (USA) 104:19428-19433. [DOI:10.1073/pnas.0709013104]

Kasuga, T., Mannhaupt, G., and Glass, N.L. (2009) Relationship between Phylogenetic Distribution and Genomic Features in Neurospora crassa. PLoS ONE 4(4):e5286. [DOI:10.1371/journal.pone.0005286]

Reichardt, J.K.V. (2007) Quo vadis, genoma? A call to pipettes for biochemists. Trends in Biochemical Sciences (TIBS) 32:529-530. [DOI:10.1016/j.tibs.2007.10.001]


  1. Look at the zebrafish genome...it's a mess!

  2. Dikarya are the more derived Basidiomycetes and Ascomycetes with a dikaryotic phase, not all fungi.

  3. One of the interesting things is how underreported this is! The similarity between chimp and human is frequently given as 99%, but this is only when you are looking at shared genes. Check out Gollery et al., "What makes species unique?" in Genome Biology.

  4. I am analyzing cytochrome P450s from 64 sequenced fungal genomes, including correct gene assembly and naming according to a systematic nomenclature for this family. I have completed 2700+ sequences with about 1300 genes to go. Even in the Aspergillus species, after looking in detail at 8 genomes, there are new CYP families in each genome. There is a core of conserved families, but there are always new ones. I don't think de novo evolution of these families is responsible. I suspect there is a huge reservoir of genes that fungi are accessing by lateral gene transfer and this explains the novelty among even close species.
    As we get more and more genomes sequenced it may be possible to see the source of these novel genes.