Saturday, February 06, 2021

The 20th anniversary of the human genome sequence: 1. Access to the data and the complicity of Science

The first drafts of the human genome sequence were published 20 years ago. The paper from the International Human Genome Project (IHGP) was published in Nature on February 15, 2001 and the paper from Celera was published in Science on February 16, 2001.

The original agreement was to publish both papers in Science but IHGP refused to publish its sequence in that journal when it chose to violate its own policy by allowing Celera to restrict access to its data. I highly recommend James Shreeve's book The Genome War for the history behind these publications. It paints an accurate, but not pretty, picture of science and politics.

Lander, E., Linton, L., Birren, B., Nusbaum, C., Zody, M., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A. and Sougnez, C. (2001) Initial sequencing and analysis of the human genome. Nature 409:860-921. doi: 10.1038/35057062

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., Yandell, M., Evans, C., Holt, R., Gocayne, J., Amanatides, P., Ballew, R., Huson, D., Wortman, J., Zhang, Q., Kodira, C., Zheng, X., Chen, L., Skupski, M., Subramanian, G., Thomas, P., Zhang, J., Gabor Miklos, G., Nelson, C., Broder, S., Clark, A., Nadeau, J., McKusick, V. and Zinder, N. (2001) The sequence of the human genome. Science 291:1304-1351. doi: 10.1126/science.1058040

Science marks the anniversary by publishing a special issue devoted to The human genome at 20. Here's the introduction to the nine articles along with the titles of the articles.

Millions of people today have access to their personal genomic information. Direct-to-consumer services and integration with other “big data” increasingly commoditize what was rightly celebrated as a singular achievement in February 2001 when the first draft human genomes were published. But such remarkable technical and scientific progress has not been without its share of missteps and growing pains. Science invited the experts below to help explore how we got here and where we should (or ought not) be going.

  • An ethos of rapid data sharing, more relevant than ever
  • Lack of diversity hinders the promise of genome science
  • Algorithmic biology unleashed
  • Value and affordability in precision medicine
  • End the entanglement of race and genetics
  • Genetic privacy in the post-COVID world
  • Emerging ethics in Indigenous genomics
  • Polygenic risk in a diverse world
  • Risks of genomic surveillance and how to stop it

As you might guess from reading the titles, the articles consist largely of woke platitudes and very little science. They serve mainly to illustrate the depths to which science journalism has sunk.

Most of the articles don't deserve any attention but one of them stands out in terms of irony and hypocrisy. It's the very first article on the importance of data sharing. Here's the juicy part ...

Sharing data can save lives. The “Bermuda Principles” for public data disclosure are a fundamental legacy of producing the first human reference DNA sequence during the Human Genome Project (HGP). Since the 1990s, these principles have become a touchstone for open science.

... The HGP set a high bar. Its core values of open science and rapid data flow persist, fomented by the urgency of rapid data sharing in biomedicine.

Recall that Science editors placed severe restrictions on the use of the Celera genome sequence when they published the draft sequence in 2001. Here's what they said back then ...

As described in the accompanying editorial, Science has taken care to craft a policy which guarantees that the data on which Celera's analyses are based will be available for examination. But the purpose of insisting that primary scientific data be released is not merely to ensure that the published conclusions are correct, but also to permit building on these results, to allow further scientific advancement. Bioinformatics research is particularly dependent on unencumbered access to data, including the ability to reanalyze and repost results. Thus the statement that “… any scientist can examine and work with Celera's sequence in order to verify or confirm the conclusions of the paper, perform their own basic research, and publish the results” is inaccurate with respect to research in bioinformatics. For example, a genome-wide analysis and reannotation of additional features identified in Celera's database could not be published or posted on the Web without compromising the proprietary nature of the underlying data. Nor could this information be combined with the resources available from other databases—such as the information from additional species necessary for cross-species comparisons, or data from microarray and proteomics resources that would permit queries based on a combination of genome sequence data, expression patterns, and structural information.

You have to read between the lines but what this means is that Celera did not put their data in public databases such as GenBank. If you wanted to see the primary data you needed to get permission from Celera and in some cases you had to submit a letter from your university agreeing to Celera's terms and conditions. I remember that I wanted to compare the IHGP and Celera sequences of my favorite gene but I couldn't get access to the Celera data.

Recall that restricting access to the Celera data was the reason why IHGP abandoned Science and published in Nature. The decision by Science editor Donald Kennedy to surrender to Celera prompted several prominent scientists to write letters to the journal. Here's how James Shreeve describes it in his book ...

The decision nonetheless outraged the academics. Harold Varmus, by then president of the Memorial Sloan-Kettering Cancer Center, wrote a letter to Kennedy co-signed by a number of heavyweights warning the editor that he was setting a dangerous precedent by allowing the company [Celera] to dictate the terms by which its published data could be used. The British Drosophila expert Michael Ashburner kited a series of explosive missives to Kennedy announcing his intention to cut all ties with Science, and to advise all his colleagues to do the same.

"You have lowered a proud journal to the level of a newspaper Sunday supplement, accepting a paid advertisement in the guise of a scientific paper ..." Ashburner wrote. "The problem comes, of course, because Celera, particularly in the form of Craig [Venter], want[s] the best of both worlds. They want the commercial advantage of having done a whole genome shotgun sequence and they (or at least Craig) want the academic kudos which goes with it."

When Kennedy stood his ground, the public program withdrew its paper from consideration at Science and submitted it to the British journal Nature instead. Francis Collins, who throughout the latest wrangling had been doing his best to maintain friendly relations with Venter, called him at his home to tell him the news. Venter took it in stride, "That's too bad, Francis," he said, "But this way, at least we both get our own covers." (pp. 361-362)

Given this history, you would think that an article in this week's Science on the ethos of data sharing would at least mention the controversy, wouldn't you?


  1. 'The Common Thread' by John Sulston and Georgina Ferry is also an excellent book. It's a biography of Sulston, but obviously features a lot of info on sequencing the genome.

  2. I was aghast to see those nine articles

    You nailed it, Larry.