Friday, June 20, 2008

Errors in Sequence Databases

Sandra Porter at Discovering Biology in a Digital World brings up an issue that has been bugging me for two decades [Biologists vs. the Age of Information]. The issue is the accuracy of information in biological databases.
Let's begin with GenBank - GenBank is the main database of nucleotide sequences at the NCBI. Sequence data are submitted to GenBank by researchers or sequencing centers. If mistakes are found, the information in the records can be updated by the submitters or by third parties if the corrected versions are published. This correction activity doesn't always happen though, and the requirement for third party annotations to be published makes it pretty unlikely that anyone will submit small corrections to a sequence.

This is why we see these kinds of quotes from Steven Salzberg (3):
So you think that gene you just retrieved from GenBank [1] is correct? Are you certain? If it is a eukaryotic gene, and especially if it is from an unfinished genome, there is a pretty good chance that the amino acid sequence is wrong. And depending on when the genome was sequenced and annotated, there is a chance that the description of its function is wrong too.
This is a serious problem. Most people don't realize that GenBank is full of sequences that are known to be incorrect and/or poorly annotated. In most cases, the errors are relatively minor such as one or two incorrect codons or deletion of a single codon. In other cases, the errors are more important, such as a pseudogene being represented as a real gene, or missing exons. Sometimes the identity of a gene is completely wrong. I've even seen examples where the species is incorrectly identified.

Sandra asks,
So what do we do? Do we care if the database information is up-to-date? If so, who should be responsible for the updates?

I'm sure some people would like the NCBI to be the final authority and just fix everything but I don't think that's very realistic.

Other people have proposed that wikis are the answer. Maybe they're right, but I really wonder if researchers would be any better at updating wikis than they are at updating information in places like the NCBI.

Well, dear readers, what do you think? Does GenBank need to be fixed? Do we just need more alternatives? Does it even matter?
Back in 1992, I spent part of a summer at the GenBank site in Los Alamos (New Mexico, USA). That was before GenBank moved to NCBI in Bethesda. My task was to explore the possibility of curating GenBank to fix all the errors. I worked with the HSP70 sequences since I had already documented most of the errors in those sequences (The HSP70 Sequence Database).

We decided that I could make corrections to any HSP70 sequence as long as I annotated the changes and got permission from the authors by phone (see note 1 below). This didn't work. Most of the authors were unwilling to allow changes because they weren't aware that there was a conflict between their sequences and the aligned sequence database. They didn't even know that others had sequenced the same gene and gotten a different sequence.

We discussed this problem. At the time, everyone was aware of the fact that the SwissProt database was curated and that the curators were making decisions on their own about which sequences were correct and which ones were errors. Here's an example of the entry for human HSPA1A showing the conflicts and variations.

Sometimes the SwissProt curators get it wrong and identify the correct sequence as an error and vice versa. Sometimes they really screw up. Here's an example of that mistake [P23931].

Curating a sequence database is incredibly expensive. You need to hire hundreds of competent workers who can analyze every sequence as it comes in. There are some tools that will help identify errors but in order to reach an acceptable level of accuracy you need to build aligned sequence databases for every gene. That can't be done automatically; you need to have real people look at the data and make the best alignment if you are going to use it to make judgements on the accuracy of a submitted sequence.
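To make the idea of checking a submission against an aligned sequence database concrete, here is a minimal sketch in Python. It assumes the alignment has already been built (by a person, as described above) and simply flags residues that disagree with a strongly supported column consensus. The function name `flag_conflicts`, the `min_support` threshold, and the toy sequences are all made up for illustration; this is not the procedure used at GenBank or SwissProt, and the majority can be wrong, so every flagged position would still need a human to look at it.

```python
from collections import Counter


def flag_conflicts(alignment, min_support=0.75):
    """Flag residues that disagree with a well-supported column consensus.

    `alignment` maps a sequence name to an already-aligned sequence
    (equal lengths, '-' for gaps). Returns a dict mapping each name to a
    list of (column, observed_residue, consensus_residue) tuples, with
    columns numbered from 1.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    conflicts = {name: [] for name in names}
    for col in range(length):
        column = [alignment[name][col] for name in names]
        consensus, count = Counter(column).most_common(1)[0]
        if count / len(column) < min_support:
            # No clear majority: the code cannot say which version is the
            # error, so the column is left for a human curator.
            continue
        for name, observed in zip(names, column):
            if observed != consensus:
                conflicts[name].append((col + 1, observed, consensus))
    return conflicts


# Toy, invented fragments (not real database entries).
example = {
    "seq_A": "MAKAAAIGIDLGTTYS",
    "seq_B": "MAKAAAIGIDLGTTYS",
    "seq_C": "MAKAAAVGIDLGTTYS",  # differs at column 7
    "seq_D": "MAKAAAIGIDLG-TYS",  # gap at column 13
}
report = flag_conflicts(example)
for name, hits in report.items():
    print(name, hits)
```

A check like this only points at candidate conflicts. Deciding whether a flagged residue is a sequencing error, a real variant, or a problem with the alignment itself is the expensive part that needs people.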

The final decision at GenBank was to forget about correcting errors and treat the database as an archive of submitted sequences. It would be up to every researcher to become aware of the error-prone nature of the database before drawing any conclusions. I think this was the correct decision; it was the only realistic decision. Unfortunately, the average researcher doesn't realize how many errors are being propagated in the sequence databases.

1. It was a huge ego-trip to have the power to change records in GenBank. All of the changes I made to other people's sequences have been removed but the ones I made to my own sequences are still there. You can check out [M76613] to see an example of what an annotated sequence could have looked like. Note the references to "old-sequence," "conflict," "variation," and "unsure." These represent differences between the genomic sequence and our older error-prone cDNA sequences.


  1. I'm not familiar with how the database works, but could they set up a comment section where any interested party could informally note potential problems? It wouldn't change the sequence in the database, but would alert others that it might be unreliable.

  2. A directly related aspect is the way that these errors propagate through the database. Someone annotates a new gene in a newly sequenced organism, finds it to be most highly related to an existing one with a wrong annotation; and hey presto, you now have two wrong annotations.

  3. I assume some sort of higher data aggregation is used in other cases to push down the errors to a reasonable level, trying to include reports of unreliability and all that.

    How do astronomers handle this? They also have huge (and perhaps incompatible) databases with automatic pattern recognition for stars, fixes for optical alignment and other data effects, et cetera. In general they would perhaps not be affected so much by errors, thanks to the large statistics, but I'm fairly sure they also have situations where errors matter a lot.

  4. Don't forget all the analysis that relies on annotations - e.g., gene ontology (GO) enrichment analysis of gene expression data.

    What we need are means of keeping track of the precise source and procedures used to arrive at the annotations, and systematic and periodic checking of annotations for consistency against other sources of information, and means for automatically propagating corrections through the dependency chain.

    I have not seen too many papers on detecting and correcting annotation errors. But I vaguely remember reading an article by Andorf et al. that appeared in BMC Bioinformatics in 2007 that described - if my memory serves me right - systematic errors that they found in a family of mouse gene annotations that had propagated to rat gene annotations.

  5. GenBank is designed to be a database of primary sequence data. RefSeq is an annotated and curated database for model organisms with DNA, mRNA and protein sequences derived from GenBank. RefSeq is analogous to a "review article," whereas GenBank is the primary literature (See
    RefSeq works great as a source, as long as your organism is part of the database.