Saturday, January 17, 2009

BioGPS

 
BioGPS is billed as a "Biology Gene Portal System." It's another database. You can read the review on genomeweb but you will have to register [GNF Team Rolls Out BioGPS Gene Portal for Users and Contributors].

The brains behind BioGPS is Andrew Su at the Genomics Institute of the Novartis Research Foundation (GNF) in San Diego (USA). According to the genomeweb article ...
As scientists move forward in analyzing experimental results, they generally consult up to a dozen "standard web sites" Su said, such as Entrez Gene, Ensembl, UniProt, or the Mouse Genome Informatics site. Each site delivers "partially overlapping gene annotation," so users must visit each, enter their search, learn the interface, and learn how to find each of the genes of interest on that site, he said. "Often that is a quite daunting process."

The idea behind BioGPS, Su said, is to avoid that process as well as reveal to researchers smaller and less-known gene portals that scientists might have missed.
Call me skeptical. The author of the article, Vivian Marx, contacted me and asked me to check out BioGPS. I have a long-standing interest in biological databases dating back to an early attempt to improve and update GenBank by adding annotation. That attempt was a failure—for very sound reasons [Errors in Sequence Databases].

I looked at my favorite genes on BioGPS. Here's the link to their homepage: BioGPS. The first thing you notice is that the database is restricted to rat, mouse, and human genes. The second thing you notice is that there's no value added. The data appears to be copied from other databases. This includes all of the errors, omissions, and misinterpretations found at each site. The emphasis is on expression data; that's what overwhelms the visible record of each gene.

Here's an example. This is the human HSPA1L gene. It happens to be a member of the HSP70 gene family. HSP70 proteins are the major chaperones of the cell. The HSPA1L version is specifically expressed in testes.


The expression data is correct but none of the databases mention that this gene is a developmentally regulated member of the HSP70 gene family even though that information has been in the literature for almost twenty years. You don't learn anything from visiting BioGPS that you wouldn't learn from visiting most other databases and, more importantly, you don't learn the information that might be most important to your research because it isn't in any of the databases. Anyone looking at this record would be puzzled by the lack of connection between the correct expression profile and all of the other information.

It gets worse. If you check out the rat HSPA1L gene you won't even learn that it is developmentally regulated because the expression profile doesn't include testes. The links to this gene suggest that it responds to stress, but it doesn't.

This is just one example of the problems with biological databases. Collecting together links from a variety of databases doesn't help. It just ensures that the errors from each database will be combined, creating maximum confusion.

I'm quoted correctly in the article ...
Larry Moran, a biochemist at the University of Toronto, told BioInform by e-mail that he had looked at a few of his "favorite genes" in the portal. "I don't think it's a very useful database," he said, since it is a summary of information gleaned from other databases with "no attempt at annotation."

In addition, he said, "much of the information is wrong or misleading," such as some of the expression profiles, which "seem to be incorrect; probably because the data is for another gene and not the one in the database record."

Users "who would rely on that sort of expression data would be making a very serious mistake," he said.

Reacting to these comments, Su said, "I think it is a good thing, in terms of making those errors more widely seen. The more eyes that see it, the more likely that that error will be fixed."

Being able to detect errors, however, has to be connected to the ability to fix it, he said. "This is the wiki principle, everybody can edit it, everybody can fix it, everybody has the responsibility and the power to make sure it's correct."
In an ideal world, researchers will fix errors in the databases and a Wiki-like system seems like a good idea. The experiment is already underway [A Gene Wiki]. But, as it turns out, this approach is incredibly naive as I discovered from attempts to fix GenBank a few decades ago. Nobody's going to do it. It's way too much work and there's no motivation to share information on public databases.

I received an email message from one of the authors of the expression data. As you might expect, the expression data profiles that are so prominently featured in the BioGPS database records are from the team at The Genomics Institute of the Novartis Research Foundation (e.g. Su et al., 2004). Much of it may be correct (it certainly succeeded with the HSPA1L gene), but I think it's wrong for HSPA1A.

My correspondent pointed out that his expression data has been widely used by hundreds of researchers and the papers have tons of citations.1 He described several studies that have made important discoveries based on the expression profiles that have been published. I don't doubt that this is true. That's not the point. The point is whether taking the expression data and adding links from other sources makes BioGPS a valuable resource.

Not as far as I can see.


1. The idea that a paper must be correct just because it is widely quoted is something that troubles me greatly. It seems to be part of the new way of doing science.

Su, A.I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K.A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., Cooke, M.P., Walker, J.R. and Hogenesch, J.B. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. (USA) 101:6062-6067. [PubMed]

10 comments :

  1. Hi Larry -- why don't we let your readers decide whether you summarized my email accurately.

    Larry,

    I read your comments in Genome Web with regard to Andy Su's BioGPS project. I know how easy it is to be misquoted, and suspect that may be at play here.

    However, as the senior author of the two existing gene atlas papers I thought I would respond to the two comments they ran with.

    First, "much of the information is wrong or misleading". To be precise, none of the information is wrong -- all of the probe sets are annotated and we provide this information in a convenient format. Some of the probe sets may not be accurate reporters of expression, not sensitive enough, cross hybridize with other genes, etc. and we address these issues in the paper and provide recommendations for interpretation (Su et al., PNAS, 2002; Su et al., PNAS, 2004). More importantly, these resources are in wide and productive use in the research community (as evidenced by their citation and more than 10,000,000! instances they have been utilized -- you read the number correctly). To my delight, most of the use is by everyday biologists like myself -- this is what we intended. Some of the follow up work has been phenomenal, for example, Vamsi Mootha and his colleagues at the Broad have cloned not one but three different human disease loci using this resource! In short, just because you did not find the resource useful does not mean it is uniformly so -- the body of evidence in fact strongly argues against it.

    Second, your comment that you "don't think it's very useful because it's a summary of existing data" misses the point entirely. BioGPS is designed to seamlessly and extensibly integrate existing data from several to many sources to allow people to more quickly accomplish their work. For example, you can collate expression information from several databases including GNF's, the Allen Brain project, Gensat, etc., and your own datasets without doing any of the heavy lifting yourself. If you want to work on the gene in the lab, another view allows you to get reagent information from public and commercial sources -- compare prices of antibodies if you want, easily w/o having to go to each database and look things up.

    I realize the paper isn't yet written, so you can't benefit from explanations of the intent and features. However, I respectfully suggest injecting a little caution in publicly commenting on projects that are outside of your area of expertise.

    Finally, I am a fan of your science blog -- with regards,

    John

    --
    John Hogenesch, Ph.D.
    Associate Professor, Pharmacology
    Associate Director, Penn Genome Frontiers Institute
    Institute for Translational Medicine and Therapeutics
    University of Pennsylvania School of Medicine
    810 BRBII/III
    421 Curie Blvd
    Philadelphia 19104-6160
    phone 484-842-4232
    hogenesc@mail.med.upenn.edu
    http://bioinf.itmat.upenn.edu/hogeneschlab/

  2. This is more generally related to your idea of a gene wiki. I think one of the reasons there is little interest in doing that is that there is no recognition, and a possible loss of control, when editing a wiki. I spend some time updating Wikipedia articles on some genes/processes in which I am interested, but it doesn't go on my CV and it's not considered an authoritative source. A lot of people would consider it a waste of time until some sort of acknowledgement is given to these efforts. That's why it's easier to just mash together a dozen databases and claim that it's comprehensive (all the while being able to blame the source databases for any errors).

  3. The main point of BioGPS is that it's extensible and customizable. The goal is not creating new data -- rather, it is providing easier access to existing resources, and hooks so that other developers or data providers do not have to recode the underpinnings.

    Use case: let's say you want to buy an antibody against YFG (your favorite gene). BioGPS provides a way to enter YFG and get links to commercially available antibodies without you having to know about the companies a priori or enter YFG at each one.

    Finally, no one is claiming BioGPS is comprehensive -- no annotation effort can ever be comprehensive. We'll never know all the things there are to know about our genes.

  4. G8TR says,

    Hi Larry -- why don't we let your readers decide whether you summarized my email accurately.

    Thanks for quoting the email message. I have a policy of not quoting directly from personal email.

  5. Speech recognition software for repetitive motion disorder. Thanks for pointing out the error, though, it was certainly germane to the discussion.

  6. BioGPS's target audience

    "Larry's right, we're not attempting to do annotation, so BioGPS might not be useful to him. But he's not our target audience either..."

  7. Here's what the BioGPS team says on BioGPS's target audience.

    Larry has focused his entire scientific career on the study of HSP70 family genes. For people like Larry who only care about a handful of genes, they really don't have a great need for gene portals. They know their genes backward and forward, and they get their information directly by following the primary literature. Relative to that, every gene portal will be missing important information.

    You miss the point. There are hundreds of people who are knowledgeable about a small subset of genes. Almost all of them agree that the existing databases are not very accurate with respect to their genes.

    What's the logical conclusion?

    I suppose you could conclude that the only thing wrong with biological databases is that they aren't very accurate for those genes that have been intensively studied but they are extremely useful for all the other genes.

    Large scale experiments, such as expression studies, are very important and useful but the very nature of such work means that the researchers don't know very much about the genes they're working with.

    The goal is to connect the survey results with the experts on individual genes to see if the survey results are accurate. You don't do this by simply linking to existing biological databases and hoping that everyone will assume the database entries are accurate, and so are the survey results.

    Real science means getting down and dirty and exploring the details. Superficial isn't going to work and it could, in fact, be very harmful.

  8. Reply to Larry's comment posted here.

    "Does your blender also cook eggs for you in the morning? If not, is the blender useless?"
