Sandwalk: How Many Genes Do Nematodes Have? - Pristionchus pacificus Genome

Thursday, September 25, 2008

How Many Genes Do Nematodes Have? - Pristionchus pacificus Genome

Nematodes are small wormlike creatures that live almost everywhere. Many of them are parasites but there are thousands of species that live in the soil. "... it is said that if everything on the earth were to disappear except the nematodes, the outlines of everything would still be visible: the mountains, lakes and oceans, the plants and the animals would all be outlined by the nematodes living in every habitat."¹

The free-living species Caenorhabditis elegans was chosen by Sydney Brenner as a model organism for the study of development [Nobel Laureates: Sydney Brenner, Robert Horvitz, John Sulston]. It turned out to be an excellent choice and by the mid 1990s this small metazoan (multi-cellular animal) was selected as the best metazoan candidate for genome sequencing.

The complete genome sequence was published in 1998. The genome is 100 Mb in size (= 100 million base pairs). This was smaller than the predicted size of the fruit fly genome (165 Mb) or the human genome (3,200 Mb). The first estimates of the number of genes were over 19,000 and at the time this was thought to be a reliable estimate, although there were many, including me, who thought that it was probably too high.

Over the years we have become more skeptical of these initial gene counts because there are many problems. The location of genes is determined by sophisticated computer programs that are trained to recognize the important characteristics of gene sequences (protein coding genes). This year marks the tenth anniversary of the publication of the C. elegans genome sequence and most people will be surprised to learn that the annotation of this sequence is only now nearing completion.

A recent paper by James Thomas summarizes the result so far (Thomas, 2008).

Thomas points out that gene prediction suffers from the presence of false positives. One of the complications is pseudogenes, which are not easy to distinguish from real genes. Another complication is proving that a predicted gene is actually functional and not just a computational artifact. There is no better way to resolve these issues than by having real live people look at every potential gene. This is why annotation takes so long.

The latest estimate is 20,140 protein coding genes in the Caenorhabditis elegans genome. The coding regions (exons) would take up about 24 Mb of DNA or 24% of the genome. Most of the remainder is junk DNA.

The number of genes is remarkably close to the original prediction although it should be noted that estimates of the number of genes went up after the initial draft sequence was published. Nevertheless, unlike the gene count in humans, the number of genes has held pretty steady.

The number of genes can be compared to the number in the Drosophila melanogaster genome (~15,000) and the human genome (20,500). These are the only two other ~~eukaryotic~~ metazoan genomes² that have been extensively annotated.

There are about 23,000 distinct transcripts from these genes. What that means is that roughly 18,000 genes produce a single transcript and about 2,000 produce two or three different transcripts by alternative splicing.
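The arithmetic behind those rounded numbers can be checked in a few lines. This is only a back-of-the-envelope sketch: the average of 2.5 transcripts per alternatively spliced gene is an assumption (the midpoint of "two or three"), and the variable names are made up for illustration.

```python
# Rough check of the transcript count, using the rounded figures from the text.
single_transcript_genes = 18_000   # genes producing one transcript
multi_transcript_genes = 2_000     # genes producing two or three transcripts
avg_transcripts_multi = 2.5        # ASSUMPTION: midpoint of "two or three"

total_genes = single_transcript_genes + multi_transcript_genes
total_transcripts = (single_transcript_genes * 1
                     + multi_transcript_genes * avg_transcripts_multi)

print(total_genes)        # 20000
print(total_transcripts)  # 23000.0
```

With these inputs the totals land on roughly 20,000 genes and 23,000 transcripts, which matches the figures quoted above.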

The C. elegans genes can be divided into two categories. About 8,000 of them are unique and the remainder belong to gene families. A gene family consists of multiple copies of the same gene in the same genome. The copies (paralogues) may be identical or they may be quite different but still related. Some of the gene families are very large and some have only two members.

There seem to be about 3,000 gene families contributing to the 12,000 genes that are not unique. The bottom line is that there are about 11,000 (8K + 3K) different kinds of gene in C. elegans. Interestingly, only 1,800 of these genes are found in both insects (Drosophila) and primates (humans). The rest are restricted to just insects and nematodes or just nematodes (10,000 are found in other nematode species).
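For readers who want to see the bookkeeping, here is a minimal sketch using the numbers quoted in the post. The variable names are invented, and treating the 1,800 shared genes as a fraction of the 11,000 kinds is an assumption about how that sentence should be read.

```python
# Gene-family bookkeeping with the figures quoted in the post.
total_genes = 20_140     # latest estimate of protein coding genes
unique_genes = 8_000     # genes with no paralogues
gene_families = 3_000    # approximate number of gene families

genes_in_families = total_genes - unique_genes        # the "12,000" in the text
avg_family_size = genes_in_families / gene_families   # members per family
kinds_of_gene = unique_genes + gene_families          # the "8K + 3K"

shared_with_flies_and_humans = 1_800
# ASSUMPTION: the 1,800 is measured against the 11,000 kinds of gene.
fraction_shared = shared_with_flies_and_humans / kinds_of_gene

print(genes_in_families)           # 12140
print(f"{avg_family_size:.1f}")    # 4.0
print(kinds_of_gene)               # 11000
print(f"{fraction_shared:.0%}")    # 16%
```

On this reading, an average family has about four members, and only about one kind of gene in six is shared with both flies and humans.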

James Thomas points out that the determination of orthology (same genes in other species) is much more difficult than one might imagine. Many of the online databases, for example, contain erroneous entries based on faulty predictions. These false predictions propagate so that it often isn't reliable to use the database to confirm that a predicted gene actually exists. That's why he restricts his comparisons to well-annotated genomes wherever possible.
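To see how a single faulty prediction can distort an orthology call, consider a toy version of the reciprocal-best-hit heuristic. This is a common textbook shortcut, not the method Thomas describes, and every gene name and similarity score below is hypothetical.

```python
# Toy reciprocal-best-hit (RBH) orthology call.
# All gene names and scores are HYPOTHETICAL; higher score = more similar.
scores = {
    ("ce_geneA", "dm_gene1"): 950,
    ("ce_geneA", "dm_gene2"): 400,
    ("ce_geneB", "dm_gene1"): 650,
    ("ce_geneB", "dm_gene2"): 700,
    ("ce_geneC", "dm_gene2"): 720,  # imagine this one is a mispredicted gene
}

def best_hits(scores, forward=True):
    """Return the best-scoring partner for each gene on the query side."""
    best = {}
    for (a, b), score in scores.items():
        query, target = (a, b) if forward else (b, a)
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def reciprocal_best_hits(scores):
    """Keep only pairs where each gene is the other's best hit."""
    fwd = best_hits(scores, forward=True)
    rev = best_hits(scores, forward=False)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)

pairs = reciprocal_best_hits(scores)
print(pairs)  # [('ce_geneA', 'dm_gene1'), ('ce_geneC', 'dm_gene2')]
```

In this made-up example the questionable ce_geneC outscores ce_geneB and takes its place as the partner of dm_gene2. If ce_geneC were a false prediction, the error would pass straight into the orthology table, which is the kind of propagation the paragraph above warns about.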

Partially annotated genome sequences of Caenorhabditis briggsae and Caenorhabditis remanei are available. Orthologous gene comparisons indicate that the three species are remarkably dissimilar for species within the same genus. They probably diverged at least 20 My ago.

A new nematode genome sequence was published this week. The species is Pristionchus pacificus, a parasite of the oriental beetle Exomala orientalis (Dieterich et al. 2008). The authors note that there is a different species of parasitic nematode associated with almost every species of beetle, which means that there are at least as many nematodes as insects.

The Pristionchus pacificus genome is 169 Mb in size, which is considerably larger than the size of the Caenorhabditis elegans genome (100 Mb). P. pacificus has 23,500 genes.

Some of the increase in genome size is due to more genes but this is only a minor difference. Some of it is due to the presence of additional copies of repetitive DNA sequences in P. pacificus but the increase doesn't account for the extra 69 Mb of DNA.
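A rough calculation shows why the extra genes account for so little of the difference. It is a sketch under loud assumptions: it takes the 24% coding fraction quoted earlier for C. elegans, assumes P. pacificus genes have the same average coding length, and counts exons only (introns and regulatory sequence would add more).

```python
# How much of the extra 69 Mb could the extra genes explain?
ce_genome_mb, pp_genome_mb = 100, 169
ce_genes, pp_genes = 20_140, 23_500
ce_coding_fraction = 0.24  # coding share of the C. elegans genome, from the text

# Average coding sequence per C. elegans gene, in Mb.
coding_per_gene_mb = ce_genome_mb * ce_coding_fraction / ce_genes

extra_genome_mb = pp_genome_mb - ce_genome_mb
extra_genes = pp_genes - ce_genes
# ASSUMPTION: P. pacificus genes have the same average coding length.
extra_coding_mb = extra_genes * coding_per_gene_mb
share_of_extra = extra_coding_mb / extra_genome_mb

print(extra_genome_mb)                # 69
print(extra_genes)                    # 3360
print(f"{extra_coding_mb:.1f} Mb")    # 4.0 Mb
print(f"{share_of_extra:.0%}")        # 6%
```

Under these assumptions the additional coding sequence comes to about 4 Mb, only a small slice of the 69 Mb difference.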

The differences in gene number are almost entirely due to increases in the members of gene families in the P. pacificus genome. Several specific examples were given, notably 250 extra copies of ribosomal protein genes compared to C. elegans.

Another remarkable difference is in the number of genes involved in detoxification, or removal of poisonous substances. There are about 250 extra copies of gene family members in this category. The authors speculate that this expansion may reflect selection for detoxifying enzymes in parasites as opposed to the free-living C. elegans.

In addition to the various Caenorhabditis species, we now have a complete genome of the nematode Brugia malayi, the parasite responsible for filariasis in humans. Pristionchus diverged from Caenorhabditis about 350 My (million years) ago and Brugia diverged from the others about 900 My ago according to Dieterich et al. (2008). Thomas (2008) cautions that these divergence times are based on an underestimate of mutation/fixation rates and that nematodes may be evolving more rapidly than other phyla. Nevertheless, it is clear that nematodes are an ancient, diverse, and abundant group of animals.

1. Nematoda.

2. See the discussion in the comments for examples of other well-annotated eukaryotic genomes. Yeast is obvious but what about Arabidopsis?

[Photo Credit: Christina Beck]

Christoph Dieterich, Sandra W Clifton, Lisa N Schuster, Asif Chinwalla, Kimberly Delehaunty, Iris Dinkelacker, Lucinda Fulton, Robert Fulton, Jennifer Godfrey, Pat Minx, Makedonka Mitreva, Waltraud Roeseler, Huiyu Tian, Hanh Witte, Shiaw-Pyng Yang, Richard K Wilson, Ralf J Sommer (2008). The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nature Genetics. DOI: 10.1038/ng.227

J. H. Thomas (2008). Genome evolution in Caenorhabditis. Briefings in Functional Genomics and Proteomics, 7 (3), 211-216. DOI: 10.1093/bfgp/eln022

14 comments :

Anonymous said...: "The number of genes can be compared to the number in the Drosophila melanogaster genome (~15,000) and the human genome (20,500). These are the only two other eukaryotic genomes that have been extensively annotated."

What the heck, Larry. What do you consider higher plants (such as Arabidopsis, whose genome is extensively annotated and presented to the public in a very user-friendly format) to be?; Friday, September 26, 2008 12:42:00 PM
Larry Moran said...: My impression is that the annotation of the Arabidopsis genome lags far behind that of the other genomes.

Is this a false impression? Do you have a reference to the polished version of the genome?

Many eukaryotic genomes have been sequenced but just because they've been sequenced and they have a website does not mean that the annotation and polishing are nearing completion.

It took ten years for Drosophila and C. elegans. Are you saying that the Arabidopsis project went much faster?; Friday, September 26, 2008 2:04:00 PM
TheBrummell said...: If that 900-mya estimate for a nematode divergence is approximately correct, then some nematode families may be as distinct from each other as other animal phyla. Stunning!

Are there any parasitologists reading this who can tell me how specialized most parasites are? I tend to think that most are specialized to one host species, such that there may indeed be many more nematodes than insects. However, if most parasites are more generalized in their host-preferences, then the same species of nematode may parasitize many species of (for example) beetles. Do we have rough estimates available of parasite specialization?; Friday, September 26, 2008 4:50:00 PM
Carlo said...: @Brummell: There's a note of caution that's deserved here. C. elegans reproduces primarily by selfing, which reduces its effective population size and thus tends to lead to the greater likelihood of fixation of slightly deleterious alleles (among many other things). This means that estimates of divergence times between nematodes (especially w.r.t. C. elegans and C. briggsae) may be overestimated. In general, estimates of divergence vary widely (e.g., elegans-briggsae diverged anywhere from 20-120 mya according to sources such as Cutter and Payseur 2003). I haven't read the papers detailing new estimates though.; Friday, September 26, 2008 5:03:00 PM
Anonymous said...: Larry, your impression about the Arabidopsis genome project is wrong. You can try the following to see how polished is the Ath genome, and how much information and how many resources are available:

First, visit the link I gave in the previous comment. In the search box near the upper right corner, type in HSP70. You'll retrieve a list of genes (which will likely be a complete set of HSP70 homologs, as well as others that the search pulls up, owing to the idiosyncrasies of the search). Choose the first one, and you will retrieve a detailed annotation. (Actually, you will get the same for any of the genes in the list.) It will have a genome browser view, lists of insertion mutations you can obtain from stock centers, snps, other polymorphisms, gene expression data, miRNA target sites (if there are any), and much, much more. (You can do this for any of the 20,000+ Arabidopsis genes. Very few, if any, will be totally unannotated.)

Now, click on the "Map Detail Image". This will bring you to the genome browser, that can be modified to show all these features and much more. Zoom out one or two clicks, and you will see that all of this information exists for most predicted or confirmed Arabidopsis genes. (I notice that the first HSP70 that is retrieved has some peptides from a large-scale proteomic study - how cool is that!?) Scroll down and you will see, at your fingertips, an astonishing amount of information.

Lagging? I think not. In fact, I doubt that any other model system (well, except yeast and E. coli) can bring all of these items to bear, and on all of the predicted genes. I'm so sure of this that I can pick, sight unseen, Arabidopsis gene IDs and assign them as web-based research subjects for my class. I have yet to assign a gene that has no information.

(Another way to grasp the detail and thoroughness of the Arabidopsis annotation is to get ahold of and browse through some tiling array data. But that requires access and tools that I cannot link to here.); Friday, September 26, 2008 5:24:00 PM
Larry Moran said...: art,

I agree that there's a lot of information on the website. Most of it looks like computer generated summaries. If you look at the HSP70 gene names for instance, you will find everything but the kitchen sink. Nobody has made a decision about the correct name.

Take the fifth gene on the HSP70 list (AT1G79920). What is it? Is it a member of the HSP70 gene family? (Hint: NO, it isn't!) A well annotated genome wouldn't have these ambiguities.

Can you tell me where to find information on the amount of the genome that has been sequenced and the number of scaffolds?; Friday, September 26, 2008 8:57:00 PM
Anonymous said...: Larry, gimmee a break. Of course the search I described will find all entries that mention HSP70, for any reason. This includes authentic HSP70's and other HSPs and proteins (that, somewhere in their annotation entry, may also mention HSP70).

A well-annotated genome will have gene maps (fl cDNAs, splicing patterns, open reading frames, etc.), expression data (ESTs, microarrays, MPSS, and other data; tissue-specificity, developmental timing, responses to biotic and abiotic stimuli, effects of mutations, etc.), mutant information (of all manner), small RNA information, transcript info (antisense transcripts? alternatively-processed RNAs? non-coding RNAs?), proteome information, subcellular distributions, and much more. The Arabidopsis database has all this and more, AND the information is not derived from computer predictions. It is data-driven, thru and thru. (That's right - every line in the genome browser is informed by data.)

A relatively-recent reference: http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D1009 . You'll see that the Ath genome is complete, for all practical purposes, and very well annotated.; Friday, September 26, 2008 9:45:00 PM
Anonymous said...: I would have guessed that S. cerevisiae would have been by far the best annotated eukaryotic genome.; Friday, September 26, 2008 10:55:00 PM
Larry Moran said...: Oops! Yeast is one of the well-annotated genomes. It wasn't included in the study because it isn't an animal.

I should have said well-annotated animal genomes and I've just changed it.

However, the debate over Arabidopsis is still valid. I really don't know how good that genome is and I haven't seen a paper that tells me.; Saturday, September 27, 2008 9:19:00 AM
Larry Moran said...: art says,

A well-annotated genome will have gene maps (fl cDNAs, ...

I think we may be using different definitions of "well-annotated." What I mean by the term is not just that all of the data is compiled and presented in a nice attractive format. It also has to be thoroughly reviewed by real live human beings in order to eliminate errors and misinterpretations.

That's absolutely critical if you are going to use the genome for serious cross-species comparisons.

I gave you a clear example of what needs to be done. One of the genes you asked me to look at is not clearly identified. If one were to do a computer driven search of the Arabidopsis genome in order to extract HSP70 genes it would certainly pick out AT1G79920 since the first line of the description says "heat shock protein 70, putative / HSP70, putative." This is the only identification in the Entrez Gene entry [844332] and it's just plain wrong.

This is an HSP91 gene, not an HSP70 gene. They are very different. Those kinds of errors have to be removed from a well-annotated genome and it has to be done manually. That's why it takes so long.

Here's another example. The first "HSP70" gene on my list is AT1G09080. This is correct, it is one of the versions of the BiP gene in Arabidopsis and those genes are important members of the HSP70 gene family.

If you look at the complete description of the gene you will see near the end of a long list of gene name synonyms the words "contains InterPro domain Heat shock protein 70." This is the sort of thing that's added by computer-driven database collations. It's an important clue to the fact that this is an HSP70 gene.

Unfortunately, the clue is too far down in the description list to make it into Entrez Gene [837429]. The Entrez Gene record just copies the first few words of the Arabidopsis genome data and that's not enough to identify the gene as an important member of the HSP70 gene family. When a human being eventually gets around to examining this gene, that kind of computer generated sloppiness will be fixed.

Here's the Entrez Gene record for the human version of the same gene [3309]. Note that the human gene has an official name (HSPA5). This is one of the things that annotation (my version) does. Also note that in the Entrez Gene record for the human gene under "RefSeq status" it says "VALIDATED" whereas for the Arabidopsis gene it says "PROVISIONAL."

Good annotation also applies to things like putative alternative splicing. If you look at the human, C. elegans, and Drosophila genomes you'll see that most of the silly alternative splice predictions have been removed by intelligent annotators. I'm not sure that this has been done for Arabidopsis genes. Has it?; Saturday, September 27, 2008 10:06:00 AM
MDPerry said...: Larry,

Thanks for flagging these two papers that I had not yet discovered. I read them with great interest. Here is a paper published earlier this year regarding the chromosomal binding sites for six of our old friends:

Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm PLoS Biol. 2008 Feb;6(2):e27, PMID: 18271625

You may have already seen it. If not I think you may find it of interest; particularly in relation to the discussion of genomes that are "ubiquitously transcribed," and the locations of putative DNA binding sites.; Thursday, October 09, 2008 3:05:00 PM
Larry Moran said...: Marcus says,

You may have already seen it.

Yes.

Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm; Friday, October 10, 2008 8:59:00 AM
Anonymous said...: It also has to be thoroughly reviewed by real live human beings in order to eliminate errors and misinterpretations.

What sort of reviews need to be performed by human beings, and what are the common errors that need to be corrected?; Wednesday, October 15, 2008 10:17:00 AM
Amit Sinha said...: A small correction about Pristionchus pacificus : It is NOT a parasite of the beetles but lives in a NECROMENIC association with them.
For review please see: Dieterich C, Sommer RJ. How to become a parasite - lessons from the genomes of nematodes. Trends Genet. 2009;25(5):203-209.; Thursday, October 15, 2009 5:41:00 AM
