The prevalence of errors and omissions in sequence databases is one of the ugly little secrets of molecular biology [
Errors in Sequence Databases]. We know how to fix the problem; it requires careful annotation by knowledgeable experts. Unfortunately, this is time-consuming and expensive since you have to hire annotators. One other possibility is to allow open access to all existing records in databases such as
GenBank,
RefSeq, or
PDB. This ain't gonna happen because there's no way to verify the changes to make sure they are valid. The people who control these databases are very reluctant to allow open access and the authors of the database entries are uneasy about allowing others to insert annotations into their records.
But there are other models that might work. A recent paper by Huss et al. (2008) in
PLoS Biology describes a possible solution. They point out that Wikipedia seems to be a successful model of collaborative effort to ensure accuracy. Why not adopt this model for gene annotation?
Some human genes already had Wikipedia entries, and these entries were being updated and annotated by various users. To encourage this process, Huss et al. (2008) created stub entries on Wikipedia for every human gene. Here's how they describe it in their paper.
In principle, a comprehensive gene wiki could have naturally evolved out of the existing Wikipedia framework, and as described above, the beginnings of this process were already underway. However, we hypothesized that growth could be greatly accelerated by systematic creation of gene page stubs, each of which would contain a basal level of gene annotation harvested from authoritative sources. Here we describe an effort to automatically create such a foundation for a comprehensive gene wiki. Moreover, we demonstrate that this effort has begun the positive-feedback loop between readers, contributors, and page utility, which will promote its long-term success.
Today, anyone with access to Wikipedia can contribute to annotating human genes. Two examples of well-annotated genes are
HSP90 and
NF-κB.
Let's look at some examples of stub entries to see how the process might work. I've chosen the human members of the HSP70 multigene family because I'm familiar with these genes. All members of the family function as molecular chaperones, helping to ensure that proteins are properly folded [
Heat Shock and Molecular Chaperones].
There are two major inducible genes called HSPA1A and HSPA1B. They are adjacent to one another on chromosome 6. The database entries for these genes are confusing and in most cases it's almost impossible to discern which gene is being referred to.
Here's the Wikipedia stub for
HSPA1A. Clearly there's an opportunity to modify this entry in order to make it clear that there are two very similar genes and to point to the proper sequence records for this gene. The second gene, HSPA1B, has its own entry in EntrezGene so I was expecting to find it on Wikipedia. Unfortunately, it's not there. A search for
HSPA1B redirects you to HSPA1A. So right away we have a problem. Someone made a decision to merge these entries on Wikipedia making it very difficult to correctly annotate the separate genes.
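Anyone who wants to check this kind of merge for other gene symbols can ask Wikipedia directly instead of clicking through by hand. The following is only a minimal sketch, assuming the standard MediaWiki `action=query` endpoint with the `redirects` parameter. The function names and the User-Agent string are made up for illustration, and the answer you get today may differ from what I saw, since Wikipedia pages change.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# English Wikipedia's MediaWiki API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a query URL that asks the API to resolve redirects for `title`."""
    params = {"action": "query", "titles": title, "redirects": 1, "format": "json"}
    return API_URL + "?" + urlencode(params)


def redirect_target(response, title):
    """Return the page that `title` redirects to, or None if it is not a redirect.

    `response` is the decoded JSON from the API. When a redirect was followed,
    the reply carries a list under query -> redirects with "from"/"to" pairs.
    """
    for entry in response.get("query", {}).get("redirects", []):
        if entry.get("from") == title:
            return entry.get("to")
    return None


def fetch_redirect_target(title):
    """Query Wikipedia over the network (hypothetical helper, needs internet access)."""
    request = Request(
        build_query_url(title),
        # Illustrative User-Agent string; use one that identifies your own script.
        headers={"User-Agent": "gene-wiki-redirect-check/0.1 (example script)"},
    )
    with urlopen(request) as reply:
        return redirect_target(json.load(reply), title)
```

Calling `fetch_redirect_target("HSPA1B")` returns the title of the page the gene symbol has been merged into, or `None` if the symbol has a page of its own. Running the same check over a list of gene symbols would show how many separate EntrezGene entries share a single Wikipedia page.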
HSPA1L is an intronless gene closely linked to HSPA1A and HSPA1B. HSPA1L is not heat shock inducible; instead, it is developmentally regulated. The gene is expressed exclusively in the testes. The stub entry for this gene [
HSPA1L] includes an RNA expression profile that beautifully illustrates the developmental regulation but there's nothing in the annotations that mentions this. This is an excellent opportunity to correct an omission in the existing databases.
Let's look at one more example to see how useful the Wikipedia effort might be. The HSPA4 gene is identified in all databases as a member of the HSP70 gene family. It's usually called "Heat shock 70kDa protein 4." The Wikipedia stub reflects the GenBank annotation [
HSPA4]. However, it has been known for a long time that this gene is
NOT a member of the HSP70 gene family. The annotation is incorrect. Instead, this gene is
Apg-2, an HSP100 homologue not related to HSP70. The original error is due to
Fathallah et al. (1993), who sequenced the first example. They mistakenly called it a novel hsp70 gene, partly because of sequencing errors and partly because of an overactive imagination. Mistakes such as these are extremely difficult to remove from the databases, but we now have an opportunity to correct the error on the Wikipedia entry.
Putting the human genes on Wikipedia is almost as good as allowing open access to the primary sequence databases. The project will only succeed if scientists make the effort to edit the Wikipedia entries. It's unlikely that most gene entries will be modified, but even if only a subset is annotated, that's better than none at all. It would be nice if the RefSeq records could point to the Wikipedia records. That would encourage people to make comments on Wikipedia.
Huss III, J.W., Orozco, C., Goodale, J., Wu, C., Batalov, S., Vickers, T.J., Valafar, F., and Su, A.I. (2008) A Gene Wiki for Community Annotation of Gene Function. PLoS Biol 6(7): e175 [doi:10.1371/journal.pbio.0060175]