In an earlier posting from 2007 [What Is a Gene?], I suggested the following ...
This essay describes various modern definitions of physical genes (Gene-D). I like to define a gene as “a DNA sequence that’s transcribed” but that’s a bit too brief for a formal definition. We need to include something that restricts the definition of gene to those entities that are biologically significant. Hence,
A gene is a DNA sequence that is transcribed to produce a functional product.
This eliminates those parts of the chromosome that are transcribed by accident or error. These regions are significant in large genomes; in fact, the confusion between accidental transcripts and real transcripts is responsible for the overestimates of gene number in many genome projects. (In technical parlance, most ESTs are artifacts and the sequences they come from are not genes.)
Let's not quibble about all of the exceptions. Most of them are covered in my original article and in the comments there. I want to concentrate here on the idea that a gene has to have a "function" of some sort. As I explained in the comments ....
I don't know if I can come up with a catchy definition of "function." What I mean is that the transcript or its product has to do some biochemical duty in order to qualify. It doesn't have to be an essential function but it has to make a difference of some sort.
This is important because there's a growing tendency to label all kinds of things as "genes" just because they produce small RNA molecules or, in some cases, a small protein. In most cases the products have no known biological function.
Here are a couple of examples.
De Novo Protein-Encoding Genes
It's plainly obvious that new genes must arise from time to time in various lineages. Lots of people are interested in the evolution of humans and in particular the changes that distinguish us from our closest cousins. Almost all of the changes can be explained by alterations in the timing or location of orthologous gene expression but that doesn't exclude the possibility that entirely new genes might arise de novo in some lineages.
Let's just think about genes that encode proteins. There are three steps required for the de novo creation of a new protein-encoding gene. (1) A part of the ancestral genome must be transcribed. (2) The transcript must contain an open reading frame with a start and stop codon. (3) The new protein must have a function.
That last step needs explaining. If the new protein doesn't have a function then the putative new gene is no different than a pseudogene or a mutant gene that produces a truncated protein because of a premature stop codon. It's also indistinguishable from bits of the genome that are accidentally transcribed and just happen to have an open reading frame.
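Step (2) is the one requirement that can be checked from sequence alone. Here's a rough sketch of what that check looks like, assuming the transcript is given as a DNA string, the standard stop codons (TAA, TAG, TGA), and the forward strand only. The function name `find_orfs` and the toy sequence are made up for illustration. Note what the sketch can't do: it says nothing about step (3).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end, n_codons) for each ATG...stop ORF on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    n_codons counts the start codon but not the stop codon.
    ORFs that run off the end of the sequence without a stop are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None  # look for the next ATG in this frame
    return orfs

# Toy transcript (hypothetical): one short ORF in the third reading frame.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Any stretch of DNA that passes this test is only a candidate. Random sequence will pass it from time to time, which is exactly why the function requirement matters.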
Wu et al. (2011) looked at the evolution of new genes in the lineage leading to humans. The title of their paper is: "De Novo Origin of Human Protein-Coding Genes." I want to challenge their definition of "gene" by suggesting that what they've really discovered are "potential" or "candidate" genes that don't deserve to be called "genes" until one discovers a biological function for their products.
The authors searched the human genome (build 56) for annotated "genes" with small open reading frames greater than 100 codons long. Then they examined the corresponding loci in the chimpanzee and orangutan genomes, looking for cases where there was no open reading frame in the other apes. Various expressed RNA databases and two expressed peptide databases were screened to see if the candidate genes were expressed as RNA and protein. They found 27 examples. These are the candidates for de novo genes in humans.
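The logic of that screen can be sketched in a few lines. This is my own toy version, not the authors' pipeline: it assumes the orthologous sequences have already been extracted as plain A/C/G/T strings, it treats "no open reading frame" as "no ORF longer than the threshold," and it takes the RNA and peptide database hits as simple true/false flags. The names `longest_orf_codons` and `is_de_novo_candidate` and the sequences are hypothetical.

```python
import re

# From every ATG, match codon by codon up to the first in-frame stop codon.
# The lookahead allows overlapping matches, so all three forward frames are covered.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")

def longest_orf_codons(seq):
    """Length in codons (start included, stop excluded) of the longest ORF; 0 if none."""
    lengths = (len(m.group(1)) // 3 - 1 for m in ORF_PATTERN.finditer(seq.upper()))
    return max(lengths, default=0)

def is_de_novo_candidate(human, outgroups, threshold=100,
                         rna_evidence=False, peptide_evidence=False):
    """True if the human locus has an ORF longer than `threshold` codons,
    no outgroup locus does, and both kinds of expression evidence exist."""
    human_has_orf = longest_orf_codons(human) > threshold
    outgroups_lack_orf = all(longest_orf_codons(s) <= threshold
                             for s in outgroups.values())
    return human_has_orf and outgroups_lack_orf and rna_evidence and peptide_evidence

# Toy loci (hypothetical), with a tiny threshold so the example stays readable.
human = "ATGGCTGCTGCTGCTGCTTAA"          # intact ORF
outgroups = {
    "chimpanzee": "ATGGCTTAAGCTGCTGCTTAA",  # premature stop codon
    "orangutan": "ACGGCTGCTGCTGCTGCTTAA",   # no start codon
}
print(is_de_novo_candidate(human, outgroups, threshold=3,
                           rna_evidence=True, peptide_evidence=True))
```

Notice that nothing in this filter asks whether the protein does anything. Every input is sequence or expression data, so the output can only ever be a list of candidates.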
Their collection did not contain some of the de novo "genes" reported by others. As it turns out, those "genes" were annotated in previous versions of the human genome (builds 40-55) but were dropped from the latest versions because there were no homologues in the other ape genomes. By using those older builds, Wu et al. discovered another 33 candidates for a total of 60 putative new protein-encoding genes in the human genome.
Wu et al. concede that the expression levels of these candidate genes are "very low" but unfortunately they don't give us any specific levels. This is important because there's plenty of evidence that the expressed RNA databases contain spurious transcripts [How to Evaluate Genome Level Transcription Papers].
I wonder how many spurious peptides are in the peptide databases? Wu et al. report that one of the peptides used to identify an earlier example of a de novo gene (Knowles and McLysaght, 2009) has been removed from the current build of PeptideAtlas. What happened to it?
The authors are aware of the fact that function is important, especially if they want to argue that these new genes conferred some selective advantage on our hominid ancestors. The only "evidence" they offer is that the putative genes are expressed at a low level in testis and brain but at an even lower level in other tissues. This is no evidence at all since we've known for fifty years that the complexity of RNA sequences in brain and testis is much higher than in other tissues. We still don't know whether that's due to elevated spurious transcription in those tissues or whether it is biologically significant.
Are these 60 candidates really new "protein-coding genes"? I don't think so. I don't think they can be called "genes" until it has been demonstrated that the products have a biological function. Guerzoni and McLysaght (2011) seem to agree because they write,
The observation by Wu et al. that some of the candidate de novo genes are expressed at their highest in brain tissues and testis is interesting, but by no means proves they are functional. A major challenge remains to demonstrate functionality of the de novo genes.
Genes that Encode Functional RNAs
The people who annotate the human genome are somewhat skeptical of these new genes and that's why so many putative genes have disappeared from the more recent builds. (But the Ensembl group still lists 434 "novel protein-coding genes.")
However, they don't seem to be as skeptical when it comes to genes that produce small RNAs. The most recent Ensembl build (GRCh37.p5, Feb 2009), for example, lists 12,523 RNA genes [Ensembl: Human Genome].
What are the criteria they use to prove that these are really genes? It can't have anything to do with biological function since it's simply not true that the human genome contains more than twelve thousand genes that produce an RNA whose function has been demonstrated.
Should that be a requirement before declaring that a bit of transcribed DNA is a gene? You're damn right it should because otherwise every bit of DNA that's accidentally transcribed in some tissue at some time during development qualifies as a gene. That makes no sense [What is a gene, post-ENCODE?].
1. That's why it's much more difficult than physics where there's talk about unifying the entire discipline under a single theory of everything. :-)
Guerzoni, D. and McLysaght, A. (2011) De novo origins of human genes. PLoS Genet. 7(11):e1002381. Epub 2011 Nov 10. [PLoS Genetics]
Knowles, D.G. and McLysaght, A. (2009) Recent de novo origin of human protein-coding genes. Genome Res. 19:1752-1759. [doi: 10.1101/gr.095026.109]
Wu, D.D., Irwin, D.M., and Zhang, Y.P. (2011) De novo origin of human protein-coding genes. PLoS Genet. 7(11):e1002379. Epub 2011 Nov 10. [PLoS Genetics]