A shiny new look and lots more info

We encourage you to visit the Dryad homepage today and check out our new look.  We’ve made many changes, both large and small, and added lots of new content.

Highlights include:

  • A new Ideas Forum, where you can let us know what features you’d like us to work on next, upvote or comment on ideas submitted by others, and check back to see our responses.
  • New membership and pricing plans, which we will feature in upcoming posts.
  • Updates about our Annual Membership Meeting and related events from 22-24 May in Oxford, UK.
  • An Integrated Journals page that helps depositors see which journals are coordinating the submission process with Dryad, figure out at which stage in the publication process to submit data for their chosen journal, and more.
  • Prominent positioning of Dryad’s Terms of Service, which we view as a two-way compact with our users. We wrote it in plain language and sincerely want it to be read!
  • Improved accessibility for persons with visual disabilities (following the guidelines in Section 508 of the U.S. Code)
  • Improved navigation, including an integrated page of Frequently Asked Questions
  • More intuitive search and browse of data packages and a revamped layout for the data package page

There are lots more improvements underway.  Not all of these will be immediately obvious to website visitors, but you can expect to see more changes over the coming months.  Thanks to all who have provided feedback and helped with usability testing, and please let us know what you think!

New journals integrate data submission with Dryad

Dryad is pleased to announce that a diverse array of new partner journals completed submission integration during the first quarter of 2013.  Authors submitting to these journals will benefit from streamlined data deposition, while the journals will benefit from articles enhanced by a tighter linkage to the underlying data.

Submission integration is completely free, and can be implemented with a wide variety of manuscript submission systems.  We welcome inquiries from other journals that wish to integrate submission with Dryad, and encourage authors from non-integrated journals to let their editors know if it is a service that they would value.

  • eLife is a prestigious new open-access journal published by the Howard Hughes Medical Institute,  the Max Planck Society, and the Wellcome Trust.
  • Journal of Open Public Health Data (JOPHD) is a new journal from Ubiquity Press that publishes peer-reviewed data papers describing public health datasets with high reuse potential.  The data itself must be made freely available in a public repository.

Each journal that integrates with Dryad chooses whether to have authors archive their data prior to peer review or after manuscript acceptance.  Of these six journals, GMS Medical Sciences, eLife, and the Journal of Open Public Health Data chose to have their authors submit data prior to peer review.

Hope and change for research data in the US

On Friday, the Obama administration made a long-awaited announcement regarding public access to the results of federally funded research in the United States.

There has been considerable attention given to the implications for research publications (a concise analysis here).  Less discussed so far, but just as far-reaching: the new policy also has quite a lot to say about research data, a topic on which the White House solicited, and received, an earful of input just over a year ago.

What does the directive actually require?  All federal government agencies with at least $100M in R&D expenditures must develop, in the next six months, policies for digital data arising from non-classified research that address a host of objectives, including:

  • to “maximize access, by the general public and without charge, to digitally formatted scientific data created with federal funds” while recognizing that there are cases in which preservation and access may not be desirable or feasible.
  • to promote greater use of data management plans for both intramural and extramural grants and contracts, including review of such plans and mechanisms for ensuring compliance
  • to allow inclusion of appropriate costs for data management and access in grants
  • to promote the deposit of data in publicly accessible databases
  • to address issues of attribution to scientific data sets
  • to support training in data management and stewardship
  • to “outline options for developing and sustaining repositories for scientific data in digital formats, taking into account the efforts of public and private sector entities”

Interestingly, the directive is silent on the issue of embargo periods for research data, neither explicitly allowing nor disallowing them.

In the words of White House Science Advisor John Holdren:

…the memorandum requires that agencies start to address the need to improve upon the management and sharing of scientific data produced with Federal funding. Strengthening these policies will promote entrepreneurship and jobs growth in addition to driving scientific progress. Access to pre-existing data sets can accelerate growth by allowing companies to focus resources and efforts on understanding and fully exploiting discoveries instead of repeating basic, pre-competitive work already documented elsewhere.

The breadth of research impacted by this directive is notable.  Based on the White House’s proposed 2013 budget, the covered agencies would spend more than $60 billion on R&D.  A partial list includes:

  • The National Institutes of Health (NIH)
  • The National Science Foundation (NSF)
  • The National Aeronautics and Space Administration (NASA)
  • The Department of Energy (DOE)
  • The Department of Agriculture (USDA)
  • The National Oceanic and Atmospheric Administration (NOAA)
  • The National Institute of Standards and Technology (NIST)
  • The Department of the Interior (which includes the Geological Survey)
  • The Environmental Protection Agency (EPA)
  • and even the Smithsonian Institution

We applaud OSTP for moving to dramatically improve the availability of research data collected in the public interest with federal funds.

You can read the full memo here: the data policies are covered in Section 4.

How to decide what data should be archived at publication

The following guest post is from Tim Vines, Managing Editor of Molecular Ecology and Molecular Ecology Resources.  ME and MER have among the most effective data archiving policies of any Dryad partner journal, as measured by the availability of data for reuse [1].  In this post, which may be useful to other journals figuring out how to support data archiving, Tim explains how Molecular Ecology’s approach has been refined over time.

Ask almost anyone in the research community, and they’ll say that archiving the data associated with a paper at publication is really important. Making sure it actually happens is not quite so simple. One of the main obstacles is that it’s hard to decide which data from a study should be made public, and this is mainly because consistent data archiving standards have not yet been developed.

It’s impossible for anyone to write exhaustive journal policies laying out exactly what each kind of study should archive (I’ve tried), so the challenge is to identify for each paper which data should be made available.

Before I describe how we currently deal with this issue, I should give some history of data archiving at Molecular Ecology. In early 2010 we joined with the five other big evolution journals in adopting the ‘Joint Data Archiving Policy’, which mandates that “authors make all the data required to recreate the results in their paper available on a public archive”. This policy came into force in January 2011, and since all five journals brought it in at the same time it meant that no one journal suffered the effects of bringing in a (potentially) unpopular policy.

To help us see whether authors really had archived all the required datasets, we started requiring that authors include a ‘Data Accessibility’ (DA) section in the final version of their manuscript. This DA section lists where each dataset is stored, and normally appears after the references.  For example:

Data Accessibility:

  • DNA sequences: Genbank accessions F234391-F234402
  • Final DNA sequence assembly uploaded as online supplemental material
  • Climate data and MaxEnt input files: Dryad doi:10.5521/dryad.12311
  • Sampling locations, morphological data and microsatellite genotypes: Dryad doi:10.5521/dryad.12311

We began back in 2011 by including a few paragraphs about our data archiving policies in positive decision letters (i.e. ‘accept, minor revisions’ and ‘accept’), asking for a DA section to be added to the manuscript during the final revisions. I would also add a sticky note to the ScholarOne Manuscripts entry for the paper indicating which datasets I thought should be listed. Most authors added the DA section, but generally only included some of the data. I then switched to putting my list into the decision letter itself, just above the policy text. For example:

“Please don’t forget to add the Data Accessibility section; it looks like this needs a file giving sampling details, morphology and microsatellite genotypes for all adults and offspring. Please also consider providing the input files for your analyses.”

This was much more effective than expecting the authors to work out which data we wanted. However, it still meant that I was combing through the abstract and the methods trying to work out what data had been generated in that manuscript.

We use ScholarOne Manuscripts’ First Look system for handling accepted papers, and we don’t export anything to be typeset until we’re satisfied with the DA section. Being strict about this makes most authors deal with our DA requirements quickly (they don’t want their paper delayed), but a few take longer while we help authors work out what we want.

The downside of this whole approach is that it takes me quite a lot of effort to work out what should appear in the DA section, and would be impossible in a journal where an academic does not see the final version of the paper. A more robust long-term strategy has to involve the researcher community in identifying which data should be archived.

I’ll flesh out the steps below, but simply put, our new approach is to ask authors to include a draft Data Accessibility section at initial submission. This draft DA section should list each dataset and say where the authors expect to archive it. As long as the DA section is there (even if it’s empty) we send the paper on to an editor. If it makes it to reviewers, we ask them to check the DA section and point out which datasets are missing.

A paper close to acceptance can thus contain a complete or nearly complete DA section. Furthermore, any deficiencies should have been pointed out in review and corrected in revision. The editorial office now has the much easier task of checking over the final DA section and making sure that all the accession numbers etc. are added before the article is exported to be typeset.

The immediate benefit is that authors are encouraged to think about data archiving while they’re still writing the paper – it’s thus much more an integral part of manuscript preparation than an afterthought. We’ve also found that a growing proportion of papers (currently about 20%) are being submitted with a completed DA section that requires no further action on our part. I expect that this proportion will be more like 80% in two years, as this seems to be how long it takes to effect changes in author or reviewer behavior.

Since the finer details may be of interest, I’ve broken down the individual steps below:

1) The authors submit their paper with a draft ‘Data Accessibility’ (DA) statement in the manuscript; this lists where the authors plan to archive each of their datasets. We’ve included a required checkbox in the submission phase that states ‘A draft Data Accessibility statement is present in the manuscript’.

2) Research papers submitted without a DA section are held in the editorial office checklist and the authors contacted to request one. In the first few months of using this system we have found that c. 40% of submissions don’t have the statement initially, but after we request it the DA is almost always emailed within 3-4 days. If we don’t hear back within five working days we unsubmit the paper; this has happened to only about 5% of papers.

3) If the paper makes it out to review, the reviewers are asked to check whether all the necessary datasets are listed, and if not, request additions in the main body of their review. Specifically, our ‘additional questions’ section of the review tab in S1M now contains the question: “Does the Data Accessibility section list all the datasets needed to recreate the results in the manuscript? If ‘No’, please specify which additional data are needed in your comments to the authors.”  Reviewers can choose ‘yes’, ‘no’ or ‘I didn’t check’; the latter is important because reviewers who haven’t looked at the DA section aren’t forced to arbitrarily click ‘yes’ or ‘no’.

4) The decision letter is sent to the authors with the question from (3) included. Since we’re still in the early days of this system and less than a quarter of our reviewers understand how to evaluate the DA section, I am still checking the data myself and requesting that any missing datasets be included in the revision. This is much easier than before as there is a draft DA section to work with and sometimes some feedback from the reviewers.

5) The editorial office then makes sure that any deficiencies identified by myself or the reviewers are dealt with by the time the paper goes to be typeset; this is normally dealt with at the First Look stage.

I’d be very happy to help anyone who would like to know more about this system or its implementation – please contact me at managing.editor@molecol.com.

[1] Vines TH, Andrew RL, Bock DG, Franklin MT, Gilbert KJ, Kane NC, Moore JS, Moyers BT, Renaut S, Rennison DJ, Veen T, Yeaman S. Mandated data archiving greatly improves access to research data. FASEB J. 2013 Jan 8. Epub ahead of print.  Update: Also available from arXiv.

Lee Dirks: friend, colleague and information scientist par excellence

We are profoundly saddened by the untimely and tragic death of our dear friend and colleague Lee Dirks, who was killed together with his wife Judy Lew in a road accident in the Peruvian Andes.

Lee had recently been elected to the Board of Directors for Dryad.  He also served on the Board of Visitors for the UNC School of Information and Library Science (of which he was a proud alumnus) and was a member of the Board of the SILS Metadata Research Center.  Lee made a name for himself in recent years as Director of Education and Scholarly Communication at Microsoft.

Lee was a visionary information scientist, a warm and generous personality, and a man who loved adventure.  The number of people whose lives he touched in his own short life was staggeringly large.

Lee and his wife are survived by their two young daughters, who were at home in Seattle at the time of the accident.  Our thoughts are with them.  And we will miss Lee greatly.

The Dryad June 2012 Newsletter: Lots of news, and a new format

We are experimenting with a nimble new format for our newsletter, in which each item consists of an individual blog post.  All the news items are also available in one PDF document if you’d prefer.

  1. Stakeholder governance.  “The scientific, educational, and charitable mission of Dryad is to promote the availability of data underlying findings in the scientific literature for research and educational reuse. The vision of Dryad is a scholarly communication system in which learned societies, publishers, institutions of research and education, funding bodies and other stakeholders collaboratively sustain and promote the preservation and reuse of data underlying the scholarly literature.”  This Mission Statement is from Dryad’s new Bylaws, which were approved this month by a vote of its Interim Partners. Since its inception, Dryad has been guided by the idea that an enduring community resource requires stakeholder governance…
  2. Sustainability planning.  Another important milestone was reached when the organization officially adopted a cost recovery plan to ensure Dryad’s sustainability.  The plan was the result of several years of deliberation among Dryad’s Interim Partners, experts in sustainability, and many prospective Member organizations…
  3. Summer 2011 Interim Board meeting. The governance and cost recovery plan emerged from a consultation process that culminated in a meeting of the Dryad Interim Board in Vancouver, Canada in July 2011. In addition to the governance and sustainability plans, participants also made progress on a number of important policy issues. Several of these bear on what content Dryad will accept…
  4. New funding from the US National Science Foundation. Earlier this year, the NSF, through its Advances in Biological Informatics program, announced a new award of $2.4M over four years to enable Dryad to scale up its technical infrastructure to support the rapidly expanding user base of journals and researchers, ensure that the repository is meeting the needs of that user base…
  5. New integrated journals.  In recent months, more journals have implemented submission integration with Dryad to make data archiving easier for authors.  Technically, the process entails setting up semi-automated communications between Dryad and the manuscript submission system of the journal.  Currently 24 journals have implemented submission integration…
  6. New features. A number of enhancements to Dryad have been made in recent months, including these three that were in high demand from users…

If you do not yet receive our newsletters by email and would like to, please sign up for our low traffic Dryad-announcements mailing list.

NSF provides further support to Dryad

Scaling up. Courtesy of Swamibu via flickr, CC-BY-NC

The US National Science Foundation, through its Advances in Biological Informatics program, has announced a new award of $2.4M over four years to Duke University (NESCent), the University of North Carolina Chapel Hill (Metadata Research Center), and North Carolina State University (Digital Library).

The award will enable Dryad to scale up its technical infrastructure to support the rapidly expanding user base of journals and researchers, ensure that the repository is meeting the needs of that user base, and to complete the transition to a financially independent non-profit organization.

This is one of a new breed of Development Awards being made by ABI, in which the review criteria judge the ability of the project to produce “robust, broadly-adopted cyberinfrastructure” with an emphasis on “user engagement, design quality, engineering practices, management plan, and dissemination”.

Repositories such as Dryad enable researchers to comply with funding agency expectations for long-term data preservation and availability, and we are grateful to NSF for its continuing support of this mission.

1E+3

Fig 1. Helen of Troy, detail from an Attic red-figure krater, c. 450–440 BC

It is said that a picture is worth a thousand words and that Helen of Troy (Fig 1) had a face that launched a thousand ships.  Why is the number 1000 significant to those of us at Dryad today?  (Especially since its place in literature is ultimately an accident of our decimal number system [1]).

The reason is that Dryad released its 1000th data package.  The lucky submission is: Hager R, Cheverud JM, Wolf JB (2011) Data from: Genotype dependent responses to levels of sibling competition over maternal resources in mice. doi:10.5061/dryad.8qq3p0d8  [2]. This (arbitrary, but see [3]) milestone has put us in a reflective mood, and so here we take the opportunity to consider what it means.

First, it encourages us that Dryad’s multipronged approach to making data available for reuse (raising awareness of the issues, coordinating data archiving policy across journals, providing a user-friendly submission interface, paying attention to the incentives of researchers) is bearing fruit.  As a result of this strategy, the rate of submissions continues to grow; over 60% of submissions are from the past nine months alone.  Since a picture is worth a thousand words, see Fig 2.

Figure 2. Data packages submitted to Dryad through September 2011

We are mindful that it will take some time before we can measure the impact of the availability of these data for reuse, but there are encouraging signs from the frequency with which data are being downloaded.  We will discuss those results in a separate post.

What else can we learn from these first 1000 submissions?  One lesson is the importance of making data submission integral to publication. While there are 88 different journals in which the corresponding articles appear, about three quarters of the submissions come from the first nine journals that worked to integrate manuscript and data submission with Dryad [4].  Journal policy matters, and the enthusiasm with which journals implement policy matters.

As far as disciplinary diversity goes, the first 1000 submissions are dominated by journals in evolutionary biology and ecology.  Dryad’s first biomedical journal partner, BMJ Open, was integrated within the past few months, and as a result of many other new journal partnerships being developed, we expect submissions to the repository to represent a much broader array of basic and applied biosciences in the near future.

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte.  Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half.  As one might expect, many of the files are spreadsheets or in tabular text format.  Thus, the files are rich in information but not so difficult to transfer or store.

We are pleasantly surprised to report that most authors, most of the time, see the value in having their data released at the same time as the article is published.  Authors are making their data available immediately upon publication, or earlier, for over 90% of data files.  In nearly all cases where files are put under embargo, authors choose to release them one-year post-publication rather than requesting a longer embargo from the journal.

Thomson Reuters indexes more than half a million abstracts annually in BIOSIS.  A difficult-to-estimate, but undoubtedly substantial, fraction of this literature reports on data that cannot be, or is not, archived in a specialized public data repository.  This helps put Dryad’s 1000 data packages in perspective.   As a discipline, we still have a long way to go to preserve and make available for reuse all the “published” data that has no home.  But every data package that is submitted to Dryad is a little victory for the transparency and robustness of science.

So here’s to the first thousand.  May they have plenty of company in the coming years.

Footnotes:

  1. Things might have turned out very differently, judging by the presence of early vertebrate fossils with more than five digits (see http://en.wikipedia.org/wiki/Polydactyly_in_early_tetrapods)
  2. To celebrate, we are sending a Dryad-logo coffee mug to Dr. Reinmar Hager, who submitted the 1000th data package.
  3. Random cool fact about the number 1000.  It is “the smallest number that generates three primes in the fastest way possible by concatenation of decremented numbers (1000999, 1000999998997, and 1000999998997996995994993 are prime) … [excluding] the number itself” (see http://primes.utm.edu/curios/page.php/1000.html).
  4. This includes a collection of legacy data packages from the Systematic Biology archives that was submitted en masse to Dryad in mid-2009.

Why aren’t cancer microarray datasets archived more often?

A new study in PLoS ONE by Heather Piwowar, a postdoctoral associate affiliated with DataONE, Dryad, and NESCent, reveals interesting trends in the archiving of data underlying published microarray results.  From the press release:

By querying the full text of the scientific literature through websites like Google Scholar and PubMed Central, Piwowar identified eleven thousand studies that collected a particular type of data about cellular activity, called gene expression microarray data. Only 45% of recent gene expression studies were found to have deposited their data in the public databases developed for this purpose. The rate of data publication has increased only slightly from 2007 to 2009. Data is shared least often from studies on cancer and human subjects: cancer studies make their data available for wide reuse half as often as similar studies outside cancer.

“It was disheartening to discover that studies on cancer and human subjects were least likely to make their data available. These data are surely some of the most valuable for reuse, to confirm, refute, inform and advance bench-to-bedside translational research,” Piwowar said.

“We want as much scientific progress as we can get from our tax and charity dollars. This requires increased access to data resources. Data can be shared while maintaining patient privacy,” Piwowar added, noting that patient re-identification is rarely an issue for gene expression microarray studies.

Reference:  Piwowar, H. (2011). “Who shares? Who doesn’t? Factors associated with openly archiving raw research data.” PLoS ONE 6(7): e18657. doi:10.1371/journal.pone.0018657

“In the spirit of the topic”, the data behind the study are publicly available in Dryad at doi:10.5061/dryad.mf1sd