Sandwalk: Errors in Sequence Databases

Friday, June 20, 2008

Errors in Sequence Databases

Sandra Porter at Discovering Biology in a Digital World brings up an issue that has been bugging me for two decades [Biologists vs. the Age of Information]. The issue is the accuracy of information in biological databases.

Let's begin with GenBank - GenBank is the main database of nucleotide sequences at the NCBI. Sequence data are submitted to GenBank by researchers or sequencing centers. If mistakes are found, the information in the records can be updated by the submitters or by third parties if the corrected versions are published. This correction activity doesn't always happen though, and the requirement for third party annotations to be published makes it pretty unlikely that anyone will submit small corrections to a sequence.

This is why we see these kinds of quotes from Steven Salzberg (3):

So you think that gene you just retrieved from GenBank [1] is correct? Are you certain? If it is a eukaryotic gene, and especially if it is from an unfinished genome, there is a pretty good chance that the amino acid sequence is wrong. And depending on when the genome was sequenced and annotated, there is a chance that the description of its function is wrong too.

This is a serious problem. Most people don't realize that GenBank is full of sequences that are known to be incorrect and/or poorly annotated. In most cases, the errors are relatively minor such as one or two incorrect codons or deletion of a single codon. In other cases, the errors are more important, such as a pseudogene being represented as a real gene, or missing exons. Sometimes the identity of a gene is completely wrong. I've even seen examples where the species is incorrectly identified.

Sandra asks,

So what do we do? Do we care if the database information is up-to-date? If so, who should be responsible for the updates?

I'm sure some people would like the NCBI to be the final authority and just fix everything but I don't think that's very realistic.

Other people have proposed that wikis are the answer. Maybe they're right, but I really wonder if researchers would be any better at updating wikis than they are at updating information in places like the NCBI.

Well, dear readers, what do you think? Does GenBank need to be fixed? Do we just need more alternatives? Does it even matter?

Back in 1992, I spent part of a summer at the GenBank site in Los Alamos (New Mexico, USA). That was before GenBank moved to NCBI in Bethesda. My task was to explore the possibility of curating GenBank to fix all the errors. I worked with the HSP70 sequences since I had already documented most of the errors in those sequences (The HSP70 Sequence Database).

We decided that I could make corrections to any HSP70 sequence as long as I annotated the changes and got permission from the authors by 'phone.¹ This didn't work. Most of the authors were unwilling to allow changes because they weren't aware of the fact that there was a conflict between their sequences and the aligned sequence database. They didn't even know that others had sequenced the same gene and gotten a different sequence.

We discussed this problem. At the time, everyone was aware of the fact that the SwissProt database was curated and that the curators were making decisions on their own about which sequences were correct and which ones were errors. Here's an example of the entry for human HSPA1A showing the conflicts and variations.

Sometimes the SwissProt curators get it wrong and identify the correct sequence as an error and vice versa. Sometimes they really screw up. Here's an example of that mistake [P23931].

Curating a sequence database is incredibly expensive. You need to hire hundreds of competent workers who can analyze every sequence as it comes in. There are some tools that will help identify errors but in order to reach an acceptable level of accuracy you need to build aligned sequence databases for every gene. That can't be done automatically; you need to have real people look at the data and make the best alignment if you are going to use it to make judgements on the accuracy of a submitted sequence.
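To make the idea concrete, here is a minimal Python sketch of the kind of tool that can help identify candidate errors once an aligned set of sequences exists. It is only an illustration: the function name `flag_conflicts`, the `min_support` threshold, and the toy alignment are all assumptions made up for this example, not real HSP70 data and not anything GenBank or SwissProt actually ran. It flags the columns where one sequence disagrees with a strong majority of the others; deciding which side is the error still takes a person looking at the alignment.

```python
from collections import Counter


def flag_conflicts(alignment, query_id, min_support=0.75):
    """Return (column, query_residue, consensus_residue, support) tuples for
    columns where the query disagrees with a well-supported majority of the
    other aligned sequences. Columns are numbered from 1."""
    query = alignment[query_id]
    others = [seq for sid, seq in alignment.items() if sid != query_id]
    if not others:
        raise ValueError("need at least one other sequence to compare against")
    if any(len(seq) != len(query) for seq in others):
        raise ValueError("all aligned sequences must have the same length")

    conflicts = []
    for col, residue in enumerate(query):
        # Tally what the other sequences have in this alignment column.
        counts = Counter(seq[col] for seq in others)
        consensus, n = counts.most_common(1)[0]
        support = n / len(others)
        # Only report disagreements with a strongly supported consensus.
        if residue != consensus and support >= min_support:
            conflicts.append((col + 1, residue, consensus, support))
    return conflicts


# Toy, made-up alignment (hypothetical data); '-' marks a gap.
toy_alignment = {
    "submitted": "MAKAPAVGIDLGT-YSC",
    "ref_a":     "MAKAAAVGIDLGTTYSC",
    "ref_b":     "MAKAAAVGIDLGTTYSC",
    "ref_c":     "MAKAAAIGIDLGTTYSC",
    "ref_d":     "MSKAAAVGIDLGTTYSC",
}

for col, found, expected, support in flag_conflicts(toy_alignment, "submitted"):
    print(f"column {col}: submitted has {found!r}, "
          f"{support:.0%} of the others have {expected!r}")
```

A flagged column is only a lead. It could be a sequencing error, a real variation between species or alleles, or a bad alignment, which is why the judgement call can't be left to the script.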

The final decision at GenBank was to forget about correcting errors and treat the database as an archive of submitted sequences. It would be up to every researcher to become aware of the error-prone nature of the database before drawing any conclusions. I think this was the correct decision; it was the only realistic decision. Unfortunately, the average researcher doesn't realize how many errors are being propagated in the sequence databases.

1. It was a huge ego-trip to have the power to change records in GenBank. All of the changes I made to other people's sequences have been removed but the ones I made to my own sequences are still there. You can check out [M76613] to see an example of what an annotated sequence could have looked like. Note the references to "old-sequence," "conflict," "variation," and "unsure." These represent differences between the genomic sequence and our older error-prone cDNA sequences.

5 comments :

Anonymous said...: I'm not familiar with how the database works, but could they set up a comment section where any interested party could informally note potential problems? It wouldn't change the sequence in the database, but would alert others that it might be unreliable.; Saturday, June 21, 2008 12:09:00 AM
Anonymous said...: A directly related aspect is the way that these errors propagate through the database. Someone annotates a new gene in a newly sequenced organism, finds it to be most highly related to an existing one with a wrong annotation; and hey presto, you now have two wrong annotations.; Saturday, June 21, 2008 10:26:00 PM
Torbjörn Larsson said...: I assume some sort of higher data aggregation is used in other cases to push down the errors to a reasonable level, trying to include reports of unreliability and all that.

How do astronomers handle this? They also have huge (and perhaps incompatible) databases with automatic pattern recognition for stars, fixes for optical alignment and other data effects, et cetera. In general they would perhaps not be much affected by errors, thanks to the sheer size of their statistics, but I'm fairly sure they also have situations where errors mean a lot.; Sunday, June 22, 2008 12:24:00 AM
Anonymous said...: Don't forget all the analysis that relies on annotations - e.g., gene ontology (GO) enrichment analysis of gene expression data.

What we need are means of keeping track of the precise source and procedures used to arrive at the annotations, and systematic and periodic checking of annotations for consistency against other sources of information, and means for automatically propagating corrections through the dependency chain.

I have not seen too many papers on detecting and correcting annotation errors. But I vaguely remember reading an article by Andorf et al. that appeared in BMC Bioinformatics in 2007 that described - if my memory serves me right - systematic errors that they found in a family of mouse gene annotations that had propagated to rat gene annotations.; Monday, June 23, 2008 3:30:00 PM
Anonymous said...: GenBank is designed to be a database of primary sequence data. RefSeq is an annotated and curated database for model organisms with DNA, mRNA and protein sequences derived from GenBank. RefSeq is analogous to a "review article", whereas GenBank is the primary literature (see http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.GenBank_ASM).
RefSeq works great as a source for sequences...as long as your organism is part of the database.; Wednesday, June 25, 2008 6:51:00 PM

