Comments on Sandwalk: How Many Genes Do Nematodes Have? - Pristionchus pacificus Genome

A small correction about Pristionchus pacificus : ...

2009-10-15T05:41:02.316-04:00

A small correction about Pristionchus pacificus : It is NOT a parasite of the beetles but lives in a NECROMENIC association with them.
For review please see: Dieterich C, Sommer RJ. How to become a parasite - lessons from the genomes of nematodes. Trends Genet. 2009;25(5):203-209.

It also has to be thoroughly reviewed by real live...

2008-10-15T10:17:00.000-04:00

It also has to be thoroughly reviewed by real live human beings in order to eliminate errors and misinterpretations.

What sort of reviews need to be performed by human beings, and what are the common errors that need to be corrected?

Marcus says, You may have already seen it. Yes. Trans...

2008-10-10T08:59:00.000-04:00

Marcus says,

You may have already seen it.

Yes.

Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm

Larry,Thanks for flagging these two papers that I ...

2008-10-09T15:05:00.000-04:00

Larry,

Thanks for flagging these two papers that I had not yet discovered. I read them with great interest. Here is a paper published earlier this year regarding the chromosomal binding sites for six of our old friends:

Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm PLoS Biol. 2008 Feb;6(2):e27, PMID: 18271625

You may have already seen it. If not I think you may find it of interest; particularly in relation to the discussion of genomes that are "ubiquitously transcribed," and the locations of putative DNA binding sites.

art says,A well-annotated genome will have gene ma...

2008-09-27T10:06:00.000-04:00

art says,

A well-annotated genome will have gene maps (fl cDNAs, ...

I think we may be using different definitions of "well-annotated." What I mean by the term is not just that all of the data is compiled and presented in a nice attractive format. It also has to be thoroughly reviewed by real live human beings in order to eliminate errors and misinterpretations.

That's absolutely critical if you are going to use the genome for serious cross-species comparisons.

I gave you a clear example of what needs to be done. One of the genes you asked me to look at is not clearly identified. If one were to do a computer driven search of the Arabidopsis genome in order to extract HSP70 genes it would certainly pick out AT1G79920 since the first line of the description says "heat shock protein 70, putative / HSP70, putative." This is the only identification in the Entrez Gene entry [844332] and it's just plain wrong.

This is an HSP91 gene, not an HSP70 gene. They are very different. Those kinds of errors have to be removed from a well-annotated genome and it has to be done manually. That's why it takes so long.

Here's another example. The first "HSP70" gene on my list is AT1G09080. This is correct, it is one of the versions of the BiP gene in Arabidopsis and those genes are important members of the HSP70 gene family.

If you look at the complete description of the gene you will see near the end of a long list of gene name synonyms the words "contains InterPro domain Heat shock protein 70." This is the sort of thing that's added by computer-driven database collations. It's an important clue to the fact that this is an HSP70 gene.

Unfortunately, the clue is too far down in the description list to make it into Entrez Gene [837429]. The Entrez Gene record just copies the first few words of the Arabidopsis genome data and that's not enough to identify the gene as an important member of the HSP70 gene family. When a human being eventually gets around to examining this gene, that kind of computer-generated sloppiness will be fixed.
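To make the two examples above concrete, here is a minimal sketch of how a keyword search over description text can go wrong and how a manual call overrides it. The record contents are hypothetical stand-ins: only the gene IDs, the quoted description fragments, and the HSP91/HSP70 calls come from this comment. The names `RECORDS`, `CURATED_FAMILY`, `keyword_hits`, and `curated_members` are invented for illustration and are not part of any real database API.

```python
# Illustrative sketch only: the records below are hypothetical stand-ins.
# Only the gene IDs and the quoted description fragments come from the comment.

RECORDS = {
    # First line of the description, as quoted in the comment.
    "AT1G79920": "heat shock protein 70, putative / HSP70, putative",
    # Placeholder text; only the trailing InterPro phrase is from the comment.
    "AT1G09080": (
        "placeholder name; placeholder synonym 1; placeholder synonym 2; "
        "contains InterPro domain Heat shock protein 70"
    ),
}

# Manual calls taken from the comment: AT1G79920 is HSP91, AT1G09080 is HSP70.
CURATED_FAMILY = {"AT1G79920": "HSP91", "AT1G09080": "HSP70"}

KEYWORDS = ("hsp70", "heat shock protein 70")


def keyword_hits(records, keywords=KEYWORDS, max_words=None):
    """Return sorted gene IDs whose description mentions any keyword.

    If max_words is given, only the first max_words words of each
    description are searched, mimicking a truncated summary field.
    """
    hits = []
    for gene_id, description in records.items():
        words = description.split()
        if max_words is not None:
            words = words[:max_words]
        text = " ".join(words).lower()
        if any(keyword in text for keyword in keywords):
            hits.append(gene_id)
    return sorted(hits)


def curated_members(curated, family):
    """Return sorted gene IDs that a human curator assigned to `family`."""
    return sorted(g for g, fam in curated.items() if fam == family)


if __name__ == "__main__":
    print(keyword_hits(RECORDS))               # full-text keyword search
    print(keyword_hits(RECORDS, max_words=4))  # truncated summaries only
    print(curated_members(CURATED_FAMILY, "HSP70"))
```

With these stand-in records, the full-text search returns both genes (one of them a false positive), the truncated search returns only the mislabeled AT1G79920, and the curated lookup returns only AT1G09080. That is the gap manual review is meant to close.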

Here's the Entrez Gene record for the human version of the same gene [3309]. Note that the human gene has an official name (HSPA5). This is one of the things that annotation (my version) does. Also note that in the Entrez Gene record for the human gene under "RefSeq status" it says "VALIDATED" whereas for the Arabidopsis gene it says "PROVISIONAL."

Good annotation also applies to things like putative alternative splicing. If you look at the human, C. elegans, and Drosophila genomes you'll see that most of the silly alternative splice predictions have been removed by intelligent annotators. I'm not sure that this has been done for Arabidopsis genes. Has it?

Oops! Yeast is one of the well-annotated genomes. ...

2008-09-27T09:19:00.000-04:00

Oops! Yeast is one of the well-annotated genomes. It wasn't included in the study because it isn't an animal.

I should have said well-annotated animal genomes and I've just changed it.

However, the debate over Arabidopsis is still valid. I really don't know how good that genome is and I haven't seen a paper that tells me.

I would have guessed that S. cerevisiae would have...

2008-09-26T22:55:00.000-04:00

I would have guessed that S. cerevisiae would have been by far the best annotated eukaryotic genome.

Larry, gimmee a break. Of course the search I des...

2008-09-26T21:45:00.000-04:00

Larry, gimmee a break. Of course the search I described will find all entries that mention HSP70, for any reason. This includes authentic HSP70's and other HSPs and proteins (that, somewhere in their annotation entry, may also mention HSP70).

A well-annotated genome will have gene maps (fl cDNAs, splicing patterns, open reading frames, etc.), expression data (ESTs, microarrays, MPSS, and other data; tissue-specificity, developmental timing, responses to biotic and abiotic stimuli, effects of mutations, etc.), mutant information (of all manner), small RNA information, transcript info (antisense transcripts? alternatively-processed RNAs? non-coding RNAs?), proteome information, subcellular distributions, and much more. The Arabidopsis database has all this and more, AND the information is not derived from computer predictions. It is data-driven, thru and thru. (That's right - every line in the genome browser is informed by data.)

A relatively-recent reference: http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D1009 . You'll see that the Ath genome is complete, for all practical purposes, and very well annotated.

art,I agree that there's a lot of information on t...

2008-09-26T20:57:00.000-04:00

art,

I agree that there's a lot of information on the website. Most of it looks like computer-generated summaries. If you look at the HSP70 gene names for instance, you will find everything but the kitchen sink. Nobody has made a decision about the correct name.

Take the fifth gene on the HSP70 list (AT1G79920). What is it? Is it a member of the HSP70 gene family? (Hint: NO, it isn't!) A well-annotated genome wouldn't have these ambiguities.

Can you tell me where to find information on the amount of the genome that has been sequenced and the number of scaffolds?

Larry, your impression about the Arabidopsis genom...

2008-09-26T17:24:00.000-04:00

Larry, your impression about the Arabidopsis genome project is wrong. You can try the following to see how polished the Ath genome is, and how much information and how many resources are available:

First, visit the link I gave in the previous comment. In the search box near the upper right corner, type in HSP70. You'll retrieve a list of genes (which will likely be a complete set of HSP70 homologs, as well as others that the search pulls up, owing to the idiosyncrasies of the search). Choose the first one, and you will retrieve a detailed annotation. (Actually, you will get the same for any of the genes in the list.) It will have a genome browser view, lists of insertion mutations you can obtain from stock centers, snps, other polymorphisms, gene expression data, miRNA target sites (if there are any), and much, much more. (You can do this for any of the 20,000+ Arabidopsis genes. Very few, if any, will be totally unannotated.)

Now, click on the "Map Detail Image". This will bring you to the genome browser, which can be modified to show all these features and much more. Zoom out one or two clicks, and you will see that all of this information exists for most predicted or confirmed Arabidopsis genes. (I notice that the first HSP70 that is retrieved has some peptides from a large-scale proteomic study - how cool is that!?) Scroll down and you will see, at your fingertips, an astonishing amount of information.

Lagging? I think not. In fact, I doubt that any other model system (well, except yeast and E. coli) can bring all of these items to bear, and on all of the predicted genes. I'm so sure of this that I can pick, sight unseen, Arabidopsis gene IDs and assign them as web-based research subjects for my class. I have yet to assign a gene that has no information.

(Another way to grasp the detail and thoroughness of the Arabidopsis annotation is to get ahold of and browse through some tiling array data. But that requires access and tools that I cannot link to here.)

@Brummell: There's a note of caution that's deserv...

2008-09-26T17:03:00.000-04:00

@Brummell: There's a note of caution that's deserved here. C. elegans reproduces primarily by selfing, which reduces its effective population size and thus increases the likelihood that slightly deleterious alleles become fixed (among many other things). This means that divergence times between nematodes (especially between C. elegans and C. briggsae) may be overestimated. In general, estimates of divergence vary widely (e.g., elegans-briggsae diverged anywhere from 20-120 mya according to sources such as Cutter and Payseur 2003). I haven't read the papers detailing new estimates though.
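As a rough sketch of the selfing point: in the standard idealized model of a partially selfing population (an assumption here, not something taken from the papers under discussion), with census size N, selfing rate σ, and equilibrium inbreeding coefficient F, the effective population size is approximately

```latex
F = \frac{\sigma}{2 - \sigma}, \qquad
N_e = \frac{N}{1 + F} = N\left(1 - \frac{\sigma}{2}\right)
```

So complete selfing (σ = 1) halves N_e in this model. The symbols N, σ, and F are introduced here for illustration only. Because the efficacy of selection scales with the product of N_e and the selection coefficient, a smaller N_e lets more slightly deleterious alleles drift to fixation.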

If that 900-mya estimate for a nematode divergence...

2008-09-26T16:50:00.000-04:00

If that 900-mya estimate for a nematode divergence is approximately correct, then some nematode families may be as distinct from each other as other animal phyla. Stunning!

Are there any parasitologists reading this who can tell me how specialized most parasites are? I tend to think that most are specialized to one host species, such that there may indeed be many more nematodes than insects. However, if most parasites are more generalized in their host preferences, then the same species of nematode may parasitize many species of (for example) beetles. Do we have rough estimates available of parasite specialization?

My impression is that the annotation of the Arabid...

2008-09-26T14:04:00.000-04:00

My impression is that the annotation of the Arabidopsis genome lags far behind that of the other genomes.

Is this a false impression? Do you have a reference to the polished version of the genome?

Many eukaryotic genomes have been sequenced but just because they've been sequenced and they have a website does not mean that the annotation and polishing are nearing completion.

It took ten years for Drosophila and C. elegans. Are you saying that the Arabidopsis project went much faster?

"The number of genes can be compared to the number...

2008-09-26T12:42:00.000-04:00

"The number of genes can be compared to the number in the Drosophila melanogaster genome (~15,000) and the human genome (20,500). These are the only two other eukaryotic genomes that have been extensively annotated."

What the heck, Larry. What do you consider higher plants (such as Arabidopsis, whose genome is extensively annotated and presented to the public in a very user-friendly format) to be?