An open letter to bioinformatics researchers (fwd)

Peter van Heusden pvh at egenetics.com
Wed Dec 13 06:12:53 PST 2000


The following email is a letter written by Ewan Birney and Sean Eddy, who are both significant players in the 'open source bioinformatics' scene. The limitations in the email are clear (see Anton Pannekoek's 'Revolt of the scientists', which assesses the outrage of scientists against Cold War science and points at the limits of that outrage), but I thought people might be interested anyway.

On GenomeWeb (http://genomeweb.com/articles/view-article.asp?Article=20001211131433), the letter is quoted in part, with a comment from Barbara Jasny that "Science would be receiving a copy of the full sequence in escrow, on a DVD-ROM. This copy of the sequence would provide security for Celera's assurance that it would uphold its end of the agreement. 'We will be keeping it safely with other AAAS valuables,' said Jasny." Certainly sounds like a disc worth stealing!

Peter
--
Peter van Heusden <pvh at egenetics.com>
NOTE: I do not speak for my employer, Electric Genetics
"Criticism has torn up the imaginary flowers from the chain not so that man shall wear the unadorned, bleak chain but so that he will shake off the chain and pluck the living flower." - Karl Marx, 1844
OpenPGP: 1024D/0517502B : DE5B 6EAA 28AC 57F7 58EF 9295 6A26 6A92 0517 502B

---------- Forwarded message ----------
Date: Sat, 9 Dec 2000 19:03:10 +0000 (GMT)
From: Ewan Birney <birney at ebi.ac.uk>
To: bioperl-l at bioperl.org, biojava-l at biojava.org, biopython at biopython.org,
    bioxml-dev at bioxml.org, ensembl-dev at ebi.ac.uk, apollo at ebi.ac.uk
Subject: An open letter to bioinformatics researchers

Dear fellow bioinformatics developers:

By now you have probably heard that Celera Genomics has submitted their human genome paper to the journal Science. Science and Celera have agreed to special terms for the release of the human genome sequence data. It will be made available through the Celera website, and will not be submitted to the international DNA database consortium (GenBank, EMBL and DDBJ). Science's statement regarding the agreement is at: http://www.sciencemag.org/feature/data/announcement/genomesequenceplan.shl

All major journals, including Science, have a policy of deposition of sequence data with the "appropriate data bank". The accepted community standard is submission to GenBank/EMBL/DDBJ. The reason for this deposition is to make the results of the work openly available for future research. This principle was specifically mentioned in the Clinton/Blair statement on human genome sequencing (http://www.usinfo.state.gov/topical/global/biotech/00031401.htm), which strongly upheld the view that "unencumbered access" to genome data was critical.

The terms of the Celera/Science agreement will give us access to the genome sequence, but not unencumbered access. Celera is proposing to publish their data under an MTA (Material Transfer Agreement) which would prevent large-scale downloads and incorporation of this data into GenBank/EMBL/DDBJ. In order to download the data, you and your institution will have to sign a contract guaranteeing that you will not "redistribute" the Celera data.

Science believes that the deal is an adequate compromise because it gives us the right to download the data and publish our results. We believe Science is thinking in terms of single-gene biology, not large-scale bioinformatics. It is probably not hard for you to imagine scenarios in bioinformatics in which "publication" and "redistribution" are virtually the same thing; we cannot imagine Celera allowing us to incorporate data into Pfam, for example, nor into Ensembl.

We are asking for your support in writing to Science to politely insist that genome sequence papers should be accompanied by unencumbered deposition to GenBank/EMBL/DDBJ. Please note that we have no issue with Celera either keeping this data unpublished for commercial reasons or combining their data with freely available data from the public genome projects. We would defend their right to do either. Our view is simply that the genome community has established a clear principle that published genome data must be deposited in the international databases, that bioinformatics is fueled by this principle, and that Science therefore threatens to set a precedent that undermines our research.

We encourage you to express your views on this matter to Donald Kennedy (kennedyd at kennedyd.pobox.stanford.edu), the Editor-in-Chief of Science, and/or to Barbara Jasny (bjasny at aaas.org), the managing editor in charge of genomics papers at Science.

Here is a Q/A about some points.

* Why does this matter?

A classic example of how our field began to have an impact on molecular biology was Russ Doolittle's discovery of a significant sequence similarity between a viral oncogene and a cellular growth factor receptor. Russ could not have found that result if he did not have an aggregate database of previously published sequences. We have come a long way from Russ and his son typing data into the NEWAT protein sequence database by hand.

Throughout the 1980s the international database community fought hard to insist that DNA sequence data be deposited into the public domain databases. Journals now generally require deposition as a condition of accepting a paper. The forming of these databases and the international agreements on data sharing between the European, American and Japanese databases fostered the rapid development of bioinformatics research. We now all take for granted the fact that large DNA databases are accessible from a single point of contact, and the identifiers are coordinated worldwide.

Bioinformatics research relies on open data with minimal legal encumbrances submitted to public databases. Without these databases there is no real substrate for bioinformatics research.

* What would happen if this precedent were set?

There are a number of consequences if Science set a precedent that allowed people to publish DNA data under a variety of MTAs.

- One would not be able to form a single DNA database on which to
  do bioinformatics research, and the derivative databases (Swissprot,
  PIR, Pfam, PROSITE, etc.) would not be legal.

- Bench biologists would have to visit a number of websites and
  possibly enter into a number of different contracts for access to DNA
  data. Unexpected informative homologies could become prohibitively
  difficult to find.

- You may need to get a legal review before you can publish
  the results of an analysis, if your analysis is large-scale and
  detailed enough that it could be reasonably interpreted as a
  "redistribution" of the primary sequence data. You could
  be sued for breach of contract for a Web Supplement page
  that discloses extensive sequence data supporting your results.

- Scientific openness will be undermined. Efforts to engage the
  community in cooperative annotation of large genomes, for instance,
  would be blocked -- we can't usefully annotate a genome we can't freely
  redistribute.

* Celera paid for it. Can't they set their own access terms?

Absolutely. We have no issue with Celera's commercial data gathering, and their right to set their own access terms to their data. We do feel, though, that scientific publications carry a certain ethical responsibility. The purpose of a paper is to enable the community to efficiently build on your work. There is always a tension between disclosing your work to your competitors (this is not unique to private companies!) and receiving scientific credit for your work via publication. This tension is natural, and maintaining a consistent and acceptable balance is the reason that scientists and journals establish community standards that dictate how data are required to be disclosed. In this case, the clearly accepted community standard is that DNA sequence data are deposited in GenBank/EMBL/DDBJ upon publication.

We certainly do not blame Celera (much) for seeking a special deal that lets them have their cake and eat it too -- they would understandably like scientific credit for their terrific and important work in human sequencing, and they would also like a profitable business model.

We do blame Science for failing to take a strong stand in upholding accepted scientific publication practices. We cannot accept that it is necessary to sacrifice ethics for expediency.

* Science claims they are honouring their own policy. What gives?

Science now claims that all their policy really requires is that archival data be available via a publicly accessible database. We think this is a conveniently revisionist view of their own policy, which states (in Instructions to Authors):

"archival data sets (such as sequence and structural data) must be deposited with the appropriate data bank and the identifier code should be sent to Science for inclusion in the published manuscript (coordinates must be released at the time of publication)"

Notice the use of the definite article in "THE appropriate data bank", the notion of "deposition", and the additional rider that the identifier code should be sent.

The spirit of this statement seems clear to us. Science's statement anticipates that there is an appropriate, single, aggregate community database for each sort of archival data, whether DNA sequence, protein structure coordinates, or something else. Sensibly, they don't name every possible database for every possible archival data set. They expect that recognized community standards exist. In no way does Science's statement seem consistent with the view that an individual lab could start its own "public" DNA sequence database and send a meaningless internal database identifier; to try to read it that way is a post hoc rationalisation.

* What can Science do? This is a done deal.

It's true that this is a done deal. Science and Celera have mutually agreed to the general terms of data release. But there are two ways that we can minimize the damage.

First, the details of the agreement are not set. In particular, there is no definition of allowed "publication" versus prohibited "redistribution". Science could specify definitions that did not interfere with noncommercial uses of the data in bioinformatics, allowing us redistribution rights if it made sense in the context of our project (for example, a genome annotation project like Ensembl).

Second, and preferably, Science -- or even the peer reviewers -- can uphold Science's own data access policy, and reject the paper.

Incidentally, they might also choose to enforce Science's policy on prior publication, which states "...the main findings of a paper should not have been reported in the mass media. Authors are, however, permitted to present their data at open meetings but should not overtly seek media attention." If I issued a press release upon submission of a manuscript to Science, as Celera did, Science would rightly fire it back to me without review.

* What can I do?

Agitate. Let Science know that you care. They consider this deal to be a trial balloon for future genome papers. Even if we can't change the deal with Celera, we can try to make sure it's a one-time-only deal that's viewed as a Big Mistake. Write a letter to Science and tell them how their actions would impact your research, both in the long term and in the short term. Also, you can pass on this open letter to other bioinformatics researchers you know.

Dr Sean Eddy, Alvin Goldfarb Professor of Computational Biology, Howard Hughes Medical Institute, Washington University in St. Louis, USA

Dr Ewan Birney, Team Leader, Genomic Annotation, European Bioinformatics Institute, UK


