Thursday, September 25, 2008

How Many Genes Do Nematodes Have? - Pristionchus pacificus Genome


Nematodes are small wormlike creatures that live almost everywhere. Many of them are parasites but there are thousands of species that live in the soil. "... it is said that if everything on the earth were to disappear except the nematodes, the outlines of everything would still be visible: the mountains, lakes and oceans, the plants and the animals would all be outlined by the nematodes living in every habitat."1

The free-living species Caenorhabditis elegans was chosen by Sydney Brenner as a model organism for the study of development [Nobel Laureates: Sydney Brenner, Robert Horvitz, John Sulston]. It turned out to be an excellent choice and by the mid 1990s this small metazoan (multi-cellular animal) was selected as the best metazoan candidate for genome sequencing.

The complete genome sequence was published in 1998. The genome is 100 Mb in size (= 100 million base pairs). This was smaller than the predicted size of the fruit fly genome (165 Mb) or the human genome (3,200 Mb). The first estimates of the number of genes were over 19,000 and at the time this was thought to be a reliable estimate although there were many, including me, who thought that it was probably too high.

Over the years we have become more skeptical of these initial gene counts because there are many problems. The location of genes is determined by sophisticated computer programs that are trained to recognize the important characteristics of gene sequences (protein coding genes). This year marks the tenth anniversary of the publication of the C. elegans genome sequence and most people will be surprised to learn that the annotation of this sequence is only now nearing completion.

A recent paper by James Thomas summarizes the results so far (Thomas, 2008).

Thomas points out that gene prediction suffers from the presence of false positives. One of the complications is pseudogenes, which are not easy to distinguish from real genes. Another complication is proving that a predicted gene is actually functional and not just a computational artifact. There is no better way to resolve these issues than by having real live people look at every potential gene. This is why annotation takes so long.

The latest estimate is 20,140 protein coding genes in the Caenorhabditis elegans genome. The coding regions (exons) would take up about 24 Mb of DNA or 24% of the genome. Most of the remainder is junk DNA.

The number of genes is remarkably close to the original prediction although it should be noted that estimates of the number of genes went up after the initial draft sequence was published. Nevertheless, unlike the gene count in humans, the number of genes has held pretty steady.

The number of genes can be compared to the number in the Drosophila melanogaster genome (~15,000) and the human genome (20,500). These are the only two other eukaryotic metazoan genomes2 that have been extensively annotated.

There are about 23,000 distinct transcripts from these genes. What that means is that roughly 18,000 genes produce a single transcript and about 2,000 produce two or three different transcripts by alternative splicing.
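As a quick sanity check on those figures, here is a minimal sketch in Python. It uses only the rounded numbers quoted above. The variable names are mine, and treating every alternatively spliced gene as making exactly two or exactly three transcripts is just the simplest reading of "two or three."

```python
# Rounded figures from the paragraph above (not exact counts).
single_transcript_genes = 18_000
spliced_genes = 2_000

# Bounds if every alternatively spliced gene made exactly 2, or exactly 3, transcripts.
low = single_transcript_genes + spliced_genes * 2
high = single_transcript_genes + spliced_genes * 3

print(f"{low:,} to {high:,} transcripts")
```

The quoted total of about 23,000 transcripts sits in the middle of that range, so the numbers hang together.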

The C. elegans genes can be divided into two categories. About 8,000 of them are unique and the remainder belong to gene families. A gene family consists of multiple copies of the same gene in the same genome. The copies (paralogues) may be identical or they may be quite different but still related. Some of the gene families are very large and some have only two members.

There seem to be about 3,000 gene families contributing to the 12,000 genes that are not unique. The bottom line is that there are about 11,000 (8K + 3K) different kinds of gene in C. elegans. Interestingly, only 1,800 of these genes are found in both insects (Drosophila) and primates (humans). The rest are restricted to just insects and nematodes or just nematodes (10,000 are found in other nematode species).
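The bookkeeping behind those numbers can be laid out in a few lines of Python. This is only a rough sketch using the rounded figures from the text. The variable names are mine, and treating the 1,800 shared genes as a share of the 11,000 kinds of gene is my reading of the paragraph, not something taken from Thomas (2008).

```python
total_genes = 20_140    # latest estimate of protein coding genes
unique_genes = 8_000    # genes with no paralogues (rounded)
gene_families = 3_000   # approximate number of gene families

# Genes that belong to some family, and the average family size.
family_genes = total_genes - unique_genes
mean_family_size = family_genes / gene_families

# Each family counts once as a "kind" of gene.
kinds_of_gene = unique_genes + gene_families

# Assumption: the 1,800 shared genes are counted among the kinds of gene.
shared_with_insects_and_primates = 1_800
shared_fraction = shared_with_insects_and_primates / kinds_of_gene

print(f"genes in families: {family_genes:,}")
print(f"mean family size: {mean_family_size:.1f}")
print(f"kinds of gene: {kinds_of_gene:,}")
print(f"shared with insects and primates: {shared_fraction:.0%}")
```

On these rounded figures an average gene family has about four members, and roughly one kind of gene in six is shared with both flies and humans.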

James Thomas points out that the determination of orthology (same genes in other species) is much more difficult than one might imagine. Many of the online databases, for example, contain erroneous entries based on faulty predictions. These false predictions propagate so that it often isn't reliable to use the database to confirm that a predicted gene actually exists. That's why he restricts his comparisons to well-annotated genomes wherever possible.

Partially annotated genome sequences of Caenorhabditis briggsae and Caenorhabditis remanei are available. Orthologous gene comparisons indicate that the three species are remarkably dissimilar for species within the same genus. They probably diverged at least 20 My ago.

A new nematode genome sequence was published this week. The species is Pristionchus pacificus, a parasite of the oriental beetle Exomala orientalis (Dieterich et al. 2008). The authors note that there is a different species of parasitic nematode associated with almost every species of beetle, which means that there are at least as many nematodes as insects.

The Pristionchus pacificus genome is 169 Mb in size, which is considerably larger than the size of the Caenorhabditis elegans genome (100 Mb). P. pacificus has 23,500 genes.

Some of the increase in genome size is due to more genes but this is only a minor difference. Some of it is due to the presence of additional copies of repetitive DNA sequences in P. pacificus but the increase doesn't account for the extra 69 Mb of DNA.
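To put rough numbers on that comparison, here is a small Python sketch. It uses only figures quoted in this post (the 20,140 gene estimate for C. elegans comes from Thomas, 2008). The variable names are mine, and the calculation says nothing about how much DNA each extra gene occupies.

```python
# Genome sizes in Mb and protein coding gene counts, as quoted in the post.
ce_genome_mb, pp_genome_mb = 100, 169
ce_genes, pp_genes = 20_140, 23_500

# Extra DNA in P. pacificus, absolute and relative to C. elegans.
extra_mb = pp_genome_mb - ce_genome_mb
genome_increase = extra_mb / ce_genome_mb

# Extra genes in P. pacificus, absolute and relative to C. elegans.
extra_genes = pp_genes - ce_genes
gene_increase = extra_genes / ce_genes

print(f"extra DNA: {extra_mb} Mb ({genome_increase:.0%} larger genome)")
print(f"extra genes: {extra_genes:,} ({gene_increase:.0%} more genes)")
```

The genome is about 69% larger but the gene count is only about 17% higher, which is why extra genes alone cannot explain the difference in size.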

The differences in gene number are almost entirely due to increases in the members of gene families in the P. pacificus genome. Several specific examples were given, notably 250 extra copies of ribosomal protein genes compared to C. elegans.

Another remarkable difference is in the number of genes involved in detoxification, or removal of poisonous substances. There are about 250 extra copies of gene family members in this category. The authors speculate that this expansion may be selection for detoxifying enzymes in parasites as opposed to the free-living C. elegans.

In addition to the various Caenorhabditis species, we now have a complete genome of the nematode Brugia malayi, the parasite responsible for filariasis in humans. Pristionchus diverged from Caenorhabditis about 350 My (million years) ago and Brugia diverged from the others about 900 My ago according to Dieterich et al. (2008). Thomas (2008) cautions that these divergence times are based on an underestimate of mutation/fixation rates and that nematodes may be evolving more rapidly than other phyla. Nevertheless, it is clear that nematodes are an ancient, diverse, and abundant group of animals.

1. Nematoda.

2. See the discussion in the comments for examples of other well-annotated eukaryotic genomes. Yeast is obvious but what about Arabidopsis?

[Photo Credit: Christina Beck]

Christoph Dieterich, Sandra W Clifton, Lisa N Schuster, Asif Chinwalla, Kimberly Delehaunty, Iris Dinkelacker, Lucinda Fulton, Robert Fulton, Jennifer Godfrey, Pat Minx, Makedonka Mitreva, Waltraud Roeseler, Huiyu Tian, Hanh Witte, Shiaw-Pyng Yang, Richard K Wilson, Ralf J Sommer (2008). The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nature Genetics. DOI: 10.1038/ng.227

J. H. Thomas (2008). Genome evolution in Caenorhabditis. Briefings in Functional Genomics and Proteomics, 7 (3), 211-216. DOI: 10.1093/bfgp/eln022


  1. "The number of genes can be compared to the number in the Drosophila melanogaster genome (~15,000) and the human genome (20,500). These are the only two other eukaryotic genomes that have been extensively annotated."

    What the heck, Larry. What do you consider higher plants (such as Arabidopsis, whose genome is extensively annotated and presented to the public in a very user-friendly format) to be?

  2. My impression is that the annotation of the Arabidopsis genome lags far behind that of the other genomes.

    Is this a false impression? Do you have a reference to the polished version of the genome?

    Many eukaryotic genomes have been sequenced but just because they've been sequenced and they have a website does not mean that the annotation and polishing are nearing completion.

    It took ten years for Drosophila and C. elegans. Are you saying that the Arabidopsis project went much faster?

  3. If that 900-mya estimate for a nematode divergence is approximately correct, then some nematode families may be as distinct from each other as other animal phyla. Stunning!

    Are there any parasitologists reading this who can tell me how specialized most parasites are? I tend to think that most are specialized to one host species, such that there may indeed be many more nematodes than insects. However, if most parasites are more generalized in their host-preferences, then the same species of nematode may parasitize many species of (for example) beetles. Do we have rough estimates available of parasite specialization?

  4. @Brummell: There's a note of caution that's deserved here. C. elegans reproduces primarily by selfing, which reduces its effective population size and thus tends to lead to the greater likelihood of fixation of slightly deleterious alleles (among many other things). This means that estimates of divergence times between nematodes (especially w.r.t. C. elegans and C. briggsae) may be overestimated. In general, estimates of divergence vary widely (e.g., elegans-briggsae diverged anywhere from 20-120 mya according to sources such as Cutter and Payseur 2003). I haven't read the papers detailing new estimates though.

  5. Larry, your impression about the Arabidopsis genome project is wrong. You can try the following to see how polished is the Ath genome, and how much information and how many resources are available:

    First, visit the link I gave in the previous comment. In the search box near the upper right corner, type in HSP70. You'll retrieve a list of genes (which will likely be a complete set of HSP70 homologs, as well as others that the search pulls up, owing to the idiosyncrasies of the search). Choose the first one, and you will retrieve a detailed annotation. (Actually, you will get the same for any of the genes in the list.) It will have a genome browser view, lists of insertion mutations you can obtain from stock centers, snps, other polymorphisms, gene expression data, miRNA target sites (if there are any), and much, much more. (You can do this for any of the 20,000+ Arabidopsis genes. Very few, if any, will be totally unannotated.)

    Now, click on the "Map Detail Image". This will bring you to the genome browser, that can be modified to show all these features and much more. Zoom out one or two clicks, and you will see that all of this information exists for most predicted or confirmed Arabidopsis genes. (I notice that the first HSP70 that is retrieved has some peptides from a large-scale proteomic study - how cool is that!?) Scroll down and you will see, at your fingertips, an astonishing amount of information.

    Lagging? I think not. In fact, I doubt that any other model system (well, except yeast and E. coli) can bring all of these items to bear, and on all of the predicted genes. I'm so sure of this that I can pick, sight unseen, Arabidopsis gene IDs and assign them as web-based research subjects for my class. I have yet to assign a gene that has no information.

    (Another way to grasp the detail and thoroughness of the Arabidopsis annotation is to get ahold of and browse through some tiling array data. But that requires access and tools that I cannot link to here.)

  6. art,

    I agree that there's a lot of information on the website. Most of it looks like computer generated summaries. If you look at the HSP70 gene names for instance, you will find everything but the kitchen sink. Nobody has made a decision about the correct name.

    Take the fifth gene on the HSP70 list (AT1G79920). What is it? Is it a member of the HSP70 gene family? (Hint: NO, it isn't!) A well annotated genome wouldn't have these ambiguities.

    Can you tell me where to find information on the amount of the genome that has been sequenced and the number of scaffolds?

  7. Larry, gimmee a break. Of course the search I described will find all entries that mention HSP70, for any reason. This includes authentic HSP70's and other HSPs and proteins (that, somewhere in their annotation entry, may also mention HSP70).

    A well-annotated genome will have gene maps (fl cDNAs, splicing patterns, open reading frames, etc.), expression data (ESTs, microarrays, MPSS, and other data; tissue-specificity, developmental timing, responses to biotic and abiotic stimuli, effects of mutations, etc.), mutant information (of all manner), small RNA information, transcript info (antisense transcripts? alternatively-processed RNAs? non-coding RNAs?), proteome information, subcellular distributions, and much more. The Arabidopsis database has all this and more, AND the information is not derived from computer predictions. It is data-driven, thru and thru. (That's right - every line in the genome browser is informed by data.)

    A relatively-recent reference: . You'll see that the Ath genome is complete, for all practical purposes, and very well annotated.

  8. I would have guessed that S. cerevisiae would have been by far the best annotated eukaryotic genome.

  9. Oops! Yeast is one of the well-annotated genomes. It wasn't included in the study because it isn't an animal.

    I should have said well-annotated animal genomes and I've just changed it.

    However, the debate over Arabidopsis is still valid. I really don't know how good that genome is and I haven't seen a paper that tells me.

  10. art says,

    A well-annotated genome will have gene maps (fl cDNAs, ...

    I think we may be using different definitions of "well-annotated." What I mean by the term is not just that all of the data is compiled and presented in a nice attractive format. It also has to be thoroughly reviewed by real live human beings in order to eliminate errors and misinterpretations.

    That's absolutely critical if you are going to use the genome for serious cross-species comparisons.

    I gave you a clear example of what needs to be done. One of the genes you asked me to look at is not clearly identified. If one were to do a computer driven search of the Arabidopsis genome in order to extract HSP70 genes it would certainly pick out AT1G79920 since the first line of the description says "heat shock protein 70, putative / HSP70, putative." This is the only identification in the Entrez Gene entry [844332] and it's just plain wrong.

    This is an HSP91 gene, not an HSP70 gene. They are very different. Those kinds of errors have to be removed from a well-annotated genome and it has to be done manually. That's why it takes so long.

    Here's another example. The first "HSP70" gene on my list is AT1G09080. This is correct, it is one of the versions of the BiP gene in Arabidopsis and those genes are important members of the HSP70 gene family.

    If you look at the complete description of the gene you will see near the end of a long list of gene name synonyms the words "contains InterPro domain Heat shock protein 70." This is the sort of thing that's added by computer-driven database collations. It's an important clue to the fact that this is an HSP70 gene.

    Unfortunately, the clue is too far down in the description list to make it into Entrez Gene [837429]. The Entrez Gene record just copies the first few words of the Arabidopsis genome data and that's not enough to identify the gene as an important member of the HSP70 gene family. When a human being eventually gets around to examining this gene, that kind of computer generated sloppiness will be fixed.

    Here's the Entrez Gene record for the human version of the same gene [3309]. Note that the human gene has an official name (HSPA5). This is one of the things that annotation (my version) does. Also note that in the Entrez Gene record for the human gene under "RefSeq status" it says "VALIDATED" whereas for the Arabidopsis gene it says "PROVISIONAL."

    Good annotation also applies to things like putative alternative splicing. If you look at the human, C. elegans, and Drosophila genomes you'll see that most of the silly alternative splice predictions have been removed by intelligent annotators. I'm not sure that this has been done for Arabidopsis genes. Has it?

  11. Larry,

    Thanks for flagging these two papers that I had not yet discovered. I read them with great interest. Here is a paper published earlier this year regarding the chromosomal binding sites for six of our old friends:

    Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm PLoS Biol. 2008 Feb;6(2):e27, PMID: 18271625

    You may have already seen it. If not I think you may find it of interest; particularly in relation to the discussion of genomes that are "ubiquitously transcribed," and the locations of putative DNA binding sites.

  12. It also has to be thoroughly reviewed by real live human beings in order to eliminate errors and misinterpretations.

    What sort of reviews need to be performed by human beings, and what are the common errors that need to be corrected?

  13. A small correction about Pristionchus pacificus: it is NOT a parasite of beetles but lives in a NECROMENIC association with them.
    For review please see: Dieterich C, Sommer RJ. How to become a parasite - lessons from the genomes of nematodes. Trends Genet. 2009;25(5):203-209.