PANGAEA (Publishing Network for Geoscientific & Environmental Data) is a repository for geoscience data with many features similar to Dryad, including use of DOIs for data files.  A recent press release reports that Elsevier and PANGAEA have implemented reciprocal linking between data in the repository and journal articles.   Research data sets deposited at PANGAEA are now automatically linked to the corresponding articles in Elsevier journals on its electronic platform ScienceDirect and vice versa.   The data are freely available from the publication’s page in ScienceDirect, without a login or subscription.

Try it out:

  1. From this PANGAEA record, follow the DOI to the article in ScienceDirect (citations and abstracts only, unless you or your institution have subscription access)
  2. The PANGAEA link is to the right of the article with Supplementary Data beside it

This valuable two-way connectivity between data and article is most easily achieved when the data are captured at the time of article submission.  See this previous post for more on Dryad’s approach to this problem, which is designed to work across multiple publishers.

Similar to the appearance of the PANGAEA logo in the online version of the article, we are toying with the idea of calling attention to the link in the opposite direction by placing  journal cover images next to article DOIs in the Dryad display.  We’d like to hear your thoughts on that.  Is it helpful signage?  Or distracting eye candy?

“For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.” The just-released Panton Principles propose that “data related to published science should be explicitly placed in the public domain.”

The creators recommend “adopting and acting on the following principles:”

  1. When publishing data make an explicit and robust statement of your wishes.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
  4. Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

These principles were written by Peter Murray-Rust, Cameron Neylon, Rufus Pollock and John Wilbanks at the Panton Arms in Cambridge, UK, and then refined by the Open Knowledge Foundation Working Group on Open Data in Science. There are open data web buttons available, and individuals and organizations can endorse the principles here.

There are lots of opinions and answers to this question.  For starters, here’s a lively blog post, responding to this editorial last April.  Consider also this blog post.

What do you think are the barriers to data sharing?

Data from: Thompson S, Daniels K. 2010. A porous convection model for small-scale grass patterns. American Naturalist 175: E10-E15. Dryad Digital Repository. http://hdl.handle.net/10255/dryad.857

The Journal of Evolutionary Biology, the journal of the European Society for Evolutionary Biology, has just published an editorial supporting data archiving. The editorial is now available online:

The need for archiving data in evolutionary biology.  Allen J. Moore, Mark A. McPeek, Mark D. Rausher, Loren Rieseberg, Michael C. Whitlock.  Journal of Evolutionary Biology 2010.
Published Online: Feb 9 2010
DOI: 10.1111/j.1420-9101.2010.01937.x


The journal Evolution has joined other Dryad partner journals in announcing a new data archiving policy mandating, as a condition of publication, that the data used in a paper be made publicly available.

The editorial says

Data that are properly archived are saved for posterity, and the archives also function to preserve data in a useable form for the original authors. Moreover, if datasets are put into a readily interpretable format while the methods and structure of the data are foremost in the scientists’ minds, that data can be used later more easily by those scientists and others.

When fully in place, the policy will require authors to archive the data required to support the conclusions in their published paper, along with sufficient details that a third party can reasonably interpret those data correctly.

To be implemented next year, the policy is parallel to those already announced by two other prominent journals, also Dryad partners:

  • Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data Archiving. American Naturalist. 175:145-146, doi:10.1086/650340
  • Rieseberg, L., T. Vines, and N. Kane. Editorial and retrospective 2010. Molecular Ecology. 19:1-22, doi:10.1111/j.1365-294X.2009.04450.x

A new commentary piece, Linking big: the continuing promise of evolutionary synthesis,  in the journal Evolution describes the promise of “synthetic science,”  which includes re-use of data sets,  research results, or unconnected methods or concepts,  leading to new discoveries or trends.    The authors, who all are affiliated with the National Evolutionary Synthesis Center (NESCent),  argue for removing the cultural and technological barriers to enable new breakthroughs.

“By putting together pieces of prior research, it is possible to transform how you do science and open the doors to findings that previously were unattainable,” said Brian Sidlauskas, a fish biologist from Oregon State University and lead author on the Evolution article. “But such an approach runs counter to the way science traditionally has been conducted, so pursuing synthetic science is somewhat risky.”

“We need to reduce the risk, remove the barriers, and encourage more pursuit of synthesis because the potential,” he added, “is staggering.”

Sidlauskas cites access to actionable data as one of the major obstacles. “When you’re looking to synthesize data from several hundred individual studies, data formatting, storage and accessibility become huge issues,” he said.   He says that  “…the vast majority of data supporting previous studies are unavailable, often because the data are lost or preserved in inaccessible forms (notebooks, floppy disks).”

The article refers to Dryad as

… working to alleviate the problem of data availability by providing an open-access home for ecological and evolutionary data that does not fit into more specialized repositories. Dryad actively works with a coalition of journals and scientific societies to make deposition of all data a normal part of the research workflow. As more journals require data deposition as part of the manuscript publication process, the opportunities for potential syntheses linking such data will increase substantially.

“It’s kind of an open-source approach to science,” Sidlauskas added. “Data archives may require some kind of proprietary protection for a few months or years, but after a certain amount of time, they should become public domain. Only by saving the data that underlie today’s science will we allow future scientists to use those data in ways that may far exceed what the original researchers envisioned.”

Other authors on the commentary piece include Ganeshkumar Ganapathy, of the National Evolutionary Synthesis Center (NESCent); Einat Hazkani-Covo, Duke University Medical Center; Kristin P. Jenkins, NESCent; Hilmar Lapp, NESCent; Lauren W. McCall, NESCent; Samantha Price, University of California-Davis; Ryan Scherle, NESCent; Paula A. Spaeth, Northland College; and David M. Kidd, NERC Centre for Population Biology, Imperial College London.

CITATION: Sidlauskas, B., G. Ganapathy, et al. (2010). “Linking big: The continuing promise of evolutionary synthesis.” Evolution doi: 10.1111/j.1558-5646.2009.00892.x.

A strong editorial on data archiving is now available online in the February issue of The American Naturalist.

Authors Michael C. Whitlock, Mark A. McPeek, Mark D. Rausher, Loren Rieseberg, and Allen J. Moore present the case for the importance of data archiving in science.   This is the first of several coordinated editorials soon to appear in major journals:

To promote the preservation and fuller use of data, The American Naturalist, Evolution, the Journal of Evolutionary Biology, Molecular Ecology, Heredity, and other key journals in evolution and ecology will soon introduce a new data‐archiving policy. The policy has been enacted by the Executive Councils of the societies owning or sponsoring the journals.

Citation: Am Nat 2010. Vol. 175, pp. 145–146. DOI: 10.1086/650340

In order to make data submission to Dryad as easy as possible for authors, the system piggybacks in an innovative way on the journal submission process.  The key is that most authors will be submitting their data to Dryad immediately after they learn that their final manuscript has been accepted by the journal.  Through behind-the-scenes communication with the journal, Dryad will already know the “vital information” about that paper before the author comes to Dryad to submit data.  This saves them from the laborious and error-prone task of filling in the paper details at Dryad.  We call this process “submission integration”, and it is one of the fundamental services provided to partner journals.

A screenshot of the Dryad submission page.

Most journals employ one of a small number of manuscript management software systems to interact with authors, editors and reviewers. These software systems regularly employ customizable email form letters to communicate among the various parties.  Through emails that are automatically sent, and automatically processed upon receipt, Dryad can ensure that authors need not re-enter data that is already available to the journal, that the journal knows the web address that authors can use to access the submission page for that specific article, and – once data has been submitted – that the journal and the author receive notice about the record identifier to include in print.

We’re happy to report that after several months of testing, this system is ready to roll out.  The first guinea pig for testing was The American Naturalist, which publishes a relatively small number of data papers.  Then came Molecular Ecology, which publishes a whole lot more.  We are now in the process of setting up submission integration with a long list of partner journals, thanks to Tim Vines of Molecular Ecology, who has written easy-to-follow instructions for the many journals that use the popular Manuscript Central software.

As a teaser for things to come, we are working to make data archiving even more like falling off of a log, by implementing one-stop data deposition, through Dryad, to one or more specialized repositories required by our partner journals.  Techniques like submission integration and handshaking should greatly facilitate submission to the repository and the usefulness of the data records.

For the curious, here’s a little more detail on how submission integration works.

First, the journal automatically sends an email to Dryad upon acceptance of a manuscript. Dryad parses the incoming email and creates an (empty) record for each new article, with a unique identifier based upon the manuscript number.

Second, the author receives the link to the submission page for that article.  Since the bibliographic information about the paper is already stored in Dryad, all the author needs to do is follow the link, log in, and upload their data files. Not only does this save the author from needlessly re-entering author names, paper title and so on, but it also helps to ensure the information is accurate and properly formatted. Ideally, the author also provides a ReadMe document to promote reusability, and optional metadata to make the data more easily discoverable.

Third, upon submission, unique identifiers such as Handles or Digital Object Identifiers (DOIs) are assigned to the data. These identifiers can be resolved to web addresses.  The identifier for the whole record, or what we call the “data package”, is then included in the article according to the conventions of each journal, so that readers of the article can easily find the record in Dryad.

Most data packages will become available as soon as the issue comes out, although some may have an embargo of up to one year.  For more gory details, see our wiki pages.
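To make the first two steps concrete, here is a minimal sketch in Python. It is not Dryad’s actual code: the “Field: value” email layout, the field names, the `pending-` identifier scheme, and the `repository.example.org` address are all assumptions made up for illustration. It only shows the general idea of turning a journal’s form letter into an empty record and a per-article submission link.

```python
import re
from dataclasses import dataclass, field

# Hypothetical layout: the journal's form letter carries "Field: value" lines.
FIELD_PATTERN = re.compile(r"^(?P<key>[A-Za-z ]+):\s*(?P<value>.+)$")


@dataclass
class PackageRecord:
    """An empty data package created when a manuscript is accepted."""
    identifier: str
    journal: str
    title: str
    authors: list
    data_files: list = field(default_factory=list)  # stays empty until the author uploads


def parse_acceptance_email(body: str) -> dict:
    """Collect the 'Field: value' lines of an email body into a dict."""
    fields = {}
    for line in body.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            fields[match.group("key").strip().lower()] = match.group("value").strip()
    return fields


def create_empty_record(fields: dict) -> PackageRecord:
    """Build a record whose identifier is derived from the manuscript number."""
    manuscript = fields["manuscript number"]
    return PackageRecord(
        identifier=f"pending-{manuscript.lower()}",  # assumed naming scheme
        journal=fields["journal"],
        title=fields["article title"],
        authors=[name.strip() for name in fields["authors"].split(";")],
    )


def submission_link(record: PackageRecord,
                    base_url: str = "https://repository.example.org/submit") -> str:
    """Return the per-article link that would be sent on to the author."""
    return f"{base_url}?record={record.identifier}"
```

Because the bibliographic fields come straight from the journal’s email, the author in this sketch never types them; the only thing left to add is the contents of `data_files`.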

Gate at the British Library
(source: gaspa)

The Dryad Management Board recently held their Winter 2009 meeting at the British Library Conference Center in London. The meeting was attended by 13 journal representatives and 4 members of the Dryad development team. A few highlights from the meeting:

Dryad now includes 489 data files in 163 data packages, though a large proportion of this content has been imported from the Systematic Biology archives.

The rate of submissions to Dryad is slowly increasing. Dryad has been able to accept submissions from authors since early 2009. Two journals, The American Naturalist and Molecular Ecology, have completed initial integration with Dryad, allowing their authors to use a more streamlined submission process. The Journal of Heredity is making progress on integration, and several other journals expect to integrate in the near future.

We are currently improving the user interface for locating and obtaining data. We are developing more sophisticated tools for curation, and we are working with several partner repositories to replicate content and provide federated searching services. For more detail, see the Dryad Development Plan.

The board discussed the role of identifiers in Dryad and whether DOIs should be assigned to Dryad’s holdings. Representatives from CrossRef and DataCite led discussions on the advantages of DOIs. The board unanimously recommended that each Dryad data package be given a DOI (a data package is all data associated with a single article). The executive committee will determine whether DOIs should be used at more granular levels (e.g., the individual files within a data package).
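For readers wondering what “resolving” an identifier looks like in practice, here is a small sketch in Python. It assumes the public proxy servers at doi.org and hdl.handle.net, and it tells the two identifier types apart by the “10.” prefix that DOI names carry; the function name is ours, and the example identifiers are ones cited elsewhere on this blog.

```python
from urllib.parse import quote

DOI_PROXY = "https://doi.org/"
HANDLE_PROXY = "https://hdl.handle.net/"


def resolver_url(identifier: str) -> str:
    """Turn a DOI or Handle into a web address served by a public proxy."""
    ident = identifier.strip()
    if ident.lower().startswith("doi:"):
        ident = ident[4:].strip()
    if "/" not in ident:
        raise ValueError(f"expected a prefix/suffix identifier, got {identifier!r}")
    # DOI names begin with "10."; anything else is treated as a plain Handle.
    proxy = DOI_PROXY if ident.startswith("10.") else HANDLE_PROXY
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return proxy + quote(ident, safe="/")


print(resolver_url("doi:10.1086/650340"))
print(resolver_url("10255/dryad.857"))
```

The address returned is only the proxy’s entry point; where it redirects depends on what the identifier’s owner has registered.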

The longest discussion of the meeting focused on plans for transitioning Dryad from the current grant funding to a model that is more sustainable for the long term. Todd Vision presented a cost model created by the Dryad development team and consultant Lorraine Eakin. Consultants from Charles Beagrie Limited presented an analysis of expected staffing needs and potential revenue streams. The board provided guidance on the schedule and methods for pursuing revenue from a variety of sources.

Community engagement emerged as a critical factor in ensuring long-term sustainability. Towards that end, the board discussed many ideas for increasing the visibility of the repository. Notable steps include increasing the frequency of posts on this blog, having a more visible presence at scientific meetings, and expanding use of social networking tools like Facebook and Twitter.

Once the Dryad development team compiles all notes from the meeting, we will release a more detailed report.

Dryad was at the 5th International Digital Curation Conference in London last week,  getting several prominent mentions by speakers, and with a poster on our research supporting the curation workflow, available here.

A pre-conference workshop on Citability of Research Data provided an introduction to DataCite, a cooperative effort of the German National Library of Science and Technology (TIB), the British Library, and the Canada Institute for Scientific and Technical Information (CISTI), among others, with the goal:

…to establish a not-for-profit agency that enables organisations to register research datasets and assign persistent identifiers to them, so that research datasets can be handled as independent, citable, unique scientific objects.

This was followed by a useful discussion of the barriers and challenges, which produced a nice little checklist of things to do:

  • change scientific culture around data
  • gain journal/publisher support
  • facilitate good data management
  • yes, terminology matters!
  • resolve data granularity issues
  • encourage authors to deposit data, and make it easy for them…

Here are some more highlights from the meeting. See the IDCC’s videos of the sessions, or the Digital Curation Blog for more.

Dryad board member William Michener presented on DataONE, and made a prominent mention of Dryad in the discussion afterwards.  Thanks, Bill!

In his keynote, Ed Seidel, Associate Director, Directorate of Mathematical and Physical Sciences, National Science Foundation, said

  • publicly funded data should be made available
  • simply ‘expecting’ researchers to share data = like expecting teenagers to clean their rooms
  • we need “executable publications” that include code and data with paper to run and reproduce science
  • and then he called for journals to require data deposition, “If journals require data associated with publication to be available; that would be a major push.”

Timo Hannay, Publishing Director, Nature Publishing Group, began his closing keynote address by saying that “at lunch 3 separate people were kind enough to point out that supplementary information was [no good] in PDF.”  Other tidbits from his talk:

  • journals need to become more like databases, more structured, more searchable
  • we are joining the dots across the intellectual terra incognita
  • all information is inter-connected
  • the associations between facts are just as important as the facts themselves; we have increasingly interconnected data sets, and are building one global computer and one global database
  • this is vast and messy and inconsistent and immensely valuable
  • there must be more efficient ways to do peer review but no one has come up with one yet
  • Q: do authors send data?  what do you do with it?
    • A: supplementary info is a catchall phrase
    • some of it is data, not most of it
    • we just take the file and put it online and link to it
    • it’s mostly Excel spreadsheets
    • our system used to just put it into a PDF– have fixed that
    • progress is slow, and depends on authors
    • interested to see encouraging making usable data available

One interesting paper from Australia, by Dr Andrew Treloar, Australian National Data Service (ANDS), identified data sharing verbs; these are proposed “as a useful way to design and structure flexible services in a heterogeneous environment.”

  1. create/capture
  2. store– “ANDS doesn’t do storage but we care that it happens”
  3. describe– info for discovery, determination of value, access, & re-use
  4. identify– using handles; just joined DataCite, so can now generate DOIs; have an “Identify My Data” service; want data to be a first-class output
  5. register– host a registry of collections
  6. discover– offer discovery services
  7. access– 4 ways: direct link, link to data repository, contact info to get data, or metadata only
  8. exploit, or use– build on what’s available

For more detail see the full paper here.  The full IDCC programme is here, and all the recorded sessions are available here.  Next year the IDCC will be in Chicago.  If you like O’Hare in Dec., this should be a real treat!
