
Christopher Pirrone excavating an odontocete skull (photo by Robert Boessenecker)

Perhaps it’s understandable that paleontologists are committed to preserving the scientific record, since they spend a lot of time and energy finding and extracting shreds of evidence millions of years old.  Now, thanks to a partnership between Dryad and The Paleontological Society announced last year [1], coupled with strong data archiving policies adopted by two of its journals (Paleobiology and the Journal of Paleontology), a rich trove of data will be available for future researchers to unearth from Dryad.

For both journals, authors are being instructed to deposit the underlying data at the time their manuscript is submitted, so that editors and referees will be able to review it prior to acceptance.  Once published on Dryad, the data will be independently discoverable and citable, while at the same time prominently linked both to and from the original article.  Researchers are able to track the reuse impact of their data, independent of the citation impact of their article, by monitoring downloads from Dryad.

Preserved for ages.

Smilodon, by Charles Knight (1905), from a mural at the American Museum of Natural History.

Here’s an example from a recent issue of Paleobiology to sink your teeth into:

Article: Meachen-Samuels JA (2012) Morphological convergence of the prey-killing arsenal of sabertooth predators. Paleobiology 38(1): 1-14. doi:10.1666/10036.1

Data: Meachen-Samuels JA (2012) Data from: Morphological convergence of the prey-killing arsenal of sabertooth predators. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.h58q6

References:

[1]  Callaway E (2011) Fossil data enter the web period. Nature 472, 150. http://dx.doi.org/10.1038/472150a

A recent issue of BMJ highlighted the problem of missing clinical trial data from medical research, exploring both the causes and consequences of unpublished evidence.  One of the articles, from Andrew Prayle and colleagues [1], examined compliance with the US Food and Drug Administration’s ostensibly mandatory requirement that clinical trials report their results in ClinicalTrials.gov, as required by the FDA Amendments Act (FDAAA) of 2007. Alarmingly, they found that only 22% of trials that should have reported results had actually done so.  Interestingly, industry-funded trials reported results at a higher frequency than trials with other funders.  They conclude:

If the reporting rate does not increase, the laudable FDAAA legislation will not achieve its goal of improving the accessibility of trial results.

Fortunately for those interested in this research, the authors have ensured that their own data are available by depositing them in Dryad, where they have already been downloaded by over 100 users.

For more on the disturbing state of affairs in reporting of clinical trial data, we offer the irrepressible Ben Goldacre speaking at the Strata 2012 conference in February.

[1] Prayle AP, Hurley MN, Smyth AR (2012) Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study. BMJ 343: d7373. doi:10.1136/bmj.d7373

Scaling up. Courtesy of Swamibu via flickr, CC-BY-NC

The US National Science Foundation, through its Advances in Biological Informatics program, has announced a new award of $2.4M over four years to Duke University (NESCent), the University of North Carolina Chapel Hill (Metadata Research Center), and North Carolina State University (Digital Library).

The award will enable Dryad to scale up its technical infrastructure to support the rapidly expanding user base of journals and researchers, to ensure that the repository is meeting the needs of that user base, and to complete the transition to a financially independent non-profit organization.

This is one of a new breed of Development Awards being made by ABI, in which the review criteria judge the ability of the project to produce “robust, broadly-adopted cyberinfrastructure” with an emphasis on “user engagement, design quality, engineering practices, management plan, and dissemination”.

Repositories such as Dryad enable researchers to comply with funding agency expectations for long-term data preservation and availability, and we are grateful to NSF for its continuing support of this mission.

Until recently, Mark Hahnel was a PhD student in stem cell biology. Frustrated by seeing how much of his own research output didn’t make it to publications, he endeavored to do something about it by developing a scientific file sharing platform called FigShare. Recently, Mark and FigShare were taken under the wing of Digital Science, a Nature Publishing Group spinoff, and a sleek new FigShare was relaunched in January 2012 with many more features and an ambitious scope.

FigShare allows researchers to publish all of their research outputs in seconds in an easily citable, sharable and discoverable manner. All file formats can be published, including videos and datasets that are often demoted to the supplemental materials section in current publishing models. By opening up the peer review process, researchers can easily publish null results, avoiding the file drawer effect and helping to make scientific research more efficient.

Users do not have to pay for access to the content: public data is made available under the terms of a CC0 waiver and other content under CC-BY.  And FigShare is currently providing unlimited public space and 1GB of private storage space for free.

This is a promising solution for getting negative and otherwise unpublished results out into the world (figures, tables, data, etc.) in a way that is discoverable and citable.  Importantly, much of this content would not be appropriate for Dryad, since it is not associated with (and not documented by) an authoritative publication.

There are clearly some challenges to the FigShare model.  A big one, shared with many other Open Science experiments that disseminate prior to peer review, is ensuring that there is adequate documentation for users to assess fitness for reuse.  Another challenge that Dryad is greatly concerned about is guaranteeing that the content will still be usable, and that there will be the means to host it, ten or twenty years down the road.  These are reflections of larger unanswered questions about how the research community can best take advantage of the web for scholarly communication, and how to optimize filtering, curating or preserving such communications. To answer these questions, the world of open data needs many more innovative projects like FigShare.

Considering FigShare’s relaunch suggests a few strengths of the Dryad model:

  • Dryad works with journals to integrate article and data submission, streamlining the deposit process.
  • Dryad curators review files for technical problems before they are released, and ensure that their metadata enables optimal retrieval.
  • Dryad’s scope is focused on data files associated with published articles in the biosciences (plus software scripts and other files important to the article).
  • Dryad can make data securely available during peer review, at the request of the journal.
  • Dryad is community-led, with priorities and policies shaped by the members of the Dryad Consortium, including scientific societies, publishers, and other stakeholder organizations.
  • Dryad can be accessed programmatically through a sitemap or OAI-PMH interface.
  • Dryad content is searchable and replicated through the DataONE network, and it handshakes with other repositories to coordinate data submission.
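
As a rough illustration of the programmatic access mentioned in the list above, the sketch below builds a standard OAI-PMH ListRecords request and pulls Dublin Core titles out of a response. The verb and the oai_dc metadata prefix come from the OAI-PMH protocol itself; the endpoint URL, the function names, and the sample response are placeholders invented for this example, not Dryad's actual address or records.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: the post does not give the real OAI-PMH URL.
BASE_URL = "https://repository.example.org/oai/request"

# Namespace for Dublin Core elements in an oai_dc record.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting by datestamp (YYYY-MM-DD).
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


# Hand-written placeholder response, not real repository output.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example data package</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2012-01-01"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL over HTTP and follow resumption tokens for large result sets; both are left out here to keep the sketch short.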

For more about Dryad, browse the repository or see Why Should I Choose Dryad for My Data?

A file sharing platform and a data repository are different animals, to be sure; both have a place in a lively open data ecosystem. We wish success to the Digital Science team, and look forward to both working together, and challenging each other, to better meet the needs of the research community.  To see what other options are out there for different disciplines and types of data, DataCite provides an updated list of research data repositories.

Our last post celebrated the 1000th data package in Dryad. This week, with the release of two data packages associated with articles in Ecological Monographs, we celebrate another important milestone, our 100th journal.

We believe this validates one of the premises on which Dryad was founded, that a non-specialist data repository can serve as shared infrastructure for a large and diverse set of journals.  As a group, they have little in common, serving authors and readers from many different research communities, nationalities, types of institutional affiliation, etc., and working with many different kinds of data.  Some are owned by societies, some by commercial publishers, some by not-for-profits.  Some are Open Access, many are not.  Some have specialized disciplinary or taxonomic scope (e.g. journals that publish on birds, herps, insects, mammals, plants, protists, or viruses) while some publish findings from all corners of science (Nature, PNAS, Science).

Interestingly, this set of 100 is roughly five times the number of journals that have integrated manuscript submission with Dryad in order to facilitate authors’ data archiving.  While the integrated journals still account for the majority of new data submissions, we are pleased to continue receiving data volunteered by authors publishing in outlets new to Dryad.

The journals that have integrated their manuscript processing with Dryad to date are mostly, though not exclusively, from the fields of evolutionary biology and ecology:

  • The American Naturalist
  • Biological Journal of the Linnean Society
  • BMJ Open (an important first step in that it is our first integrated biomedical journal)
  • Ecological Monographs
  • Evolution
  • Evolutionary Applications
  • Heredity
  • Journal of Evolutionary Biology
  • Journal of Heredity
  • Molecular Ecology and Molecular Ecology Resources
  • Paleobiology
  • Pensoft Publishers – 8 different journals
  • Systematic Biology

But Dryad’s broadening disciplinary coverage is best illustrated by listing some of the journals with content in the repository that have not, at least not yet, implemented integrated submission:

  • Animal Behaviour
  • Bioinformatics
  • Biotropica
  • Conservation Genetics
  • Environmental Microbiology
  • Evolution and Development
  • Frontiers in Psychology
  • Genome Biology and Evolution
  • Human Genomics
  • Integrative and Comparative Biology
  • Journal of Biogeography
  • Journal of Fish and Wildlife Management
  • The Journal of Parasitology
  • Limnology and Oceanography
  • The Plant Cell
  • PLoS Pathogens
  • Symbiosis
  • Toxicon

And we are particularly pleased by the irony of hosting data from Genesis ;)

If you are an editor, publisher, or just a passionate reader of a journal that currently has content in Dryad (you can find out for yourself here), and you would like to talk about how manuscript submission integration could strengthen the service that Dryad provides to your journal, then please contact us.

1E+3

Fig 1. Helen of Troy, detail from an Attic red-figure krater, c. 450–440 BC

It is said that a picture is worth a thousand words and that Helen of Troy (Fig 1) had a face that launched a thousand ships.  Why is the number 1000 significant to those of us at Dryad today?  (Especially since its place in literature is ultimately an accident of our decimal number system [1]).

The reason is that Dryad released its 1000th data package.  The lucky submission is: Hager R, Cheverud JM, Wolf JB (2011) Data from: Genotype dependent responses to levels of sibling competition over maternal resources in mice. doi:10.5061/dryad.8qq3p0d8  [2]. This (arbitrary, but see [3]) milestone has put us in a reflective mood, and so here we take the opportunity to consider what it means.

First, it encourages us that Dryad’s multipronged approach to making data available for reuse (raising awareness of the issues, coordinating data archiving policy across journals, providing a user-friendly submission interface, paying attention to the incentives of researchers) is bearing fruit.  As a result of this strategy, the rate of submissions continues to grow; over 60% of submissions are from the past nine months alone.  Since a picture is worth a thousand words, see Fig 2.

Figure 2. Data packages submitted to Dryad through September 2011

We are mindful that it will take some time before we can measure the impact of the availability of these data for reuse, but there are encouraging signs from the frequency with which data are being downloaded.  We will discuss those results in a separate post.

What else can we learn from these first 1000 submissions?  One is the importance of making data submission integral to publication. While there are 88 different journals in which the corresponding articles appear, about three quarters of the submissions come from the first nine journals that worked to integrate manuscript and data submission with Dryad [4].  Journal policy matters, and the enthusiasm with which journals implement policy matters.

As far as disciplinary diversity goes, the first 1000 submissions are dominated by journals in evolutionary biology and ecology.  Dryad’s first biomedical journal partner, BMJ Open, was integrated within the past few months, and as a result of many other new journal partnerships being developed, we expect submissions to the repository to represent a much broader array of basic and applied biosciences in the near future.

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte.  Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half.  As one might expect, many of the files are spreadsheets or in tabular text format.  Thus, the files are rich in information but not so difficult to transfer or store.

We are pleasantly surprised to report that most authors, most of the time, see the value in having their data released at the same time as the article is published.  Authors are making their data available immediately upon publication, or earlier, for over 90% of data files.  In nearly all cases where files are put under embargo, authors choose to release them one year after publication rather than requesting a longer embargo from the journal.

Thomson Reuters indexes more than half a million abstracts annually in BIOSIS.  A difficult-to-estimate, but undoubtedly substantial, fraction of this literature reports on data that cannot be, or is not, archived in a specialized public data repository.  This helps put Dryad’s 1000 data packages in perspective.   As a discipline, we still have a long way to go to preserve and make available for reuse all the “published” data that has no home.  But every data package that is submitted to Dryad is a little victory for the transparency and robustness of science.

So here’s to the first thousand.  May they have plenty of company in the coming years.

Footnotes:

  1. Things might have turned out very differently, judging by the presence of early vertebrate fossils with more than five digits (see http://en.wikipedia.org/wiki/Polydactyly_in_early_tetrapods)
  2. To celebrate, we are sending a Dryad-logo coffee mug to Dr. Reinmar Hager, who submitted the 1000th data package.
  3. Random cool fact about the number 1000.  It is “the smallest number that generates three primes in the fastest way possible by concatenation of decremented numbers (1000999, 1000999998997, and 1000999998997996995994993 are prime) … [excluding] the number itself” (see http://primes.utm.edu/curios/page.php/1000.html).
  4. This includes a collection of legacy data packages from the Systematic Biology archives that was submitted en masse to Dryad in mid-2009.

Early in the process of depositing data to the Dryad repository,  authors are asked to consent to the explicit release of their data into the public domain under the terms of a Creative Commons Zero (CC0) waiver. We are frequently asked why Dryad uses CC0 rather than a license such as CC-BY, and it is important for all users to understand the rationale for this, as well as its implications.

Obviously, one of the primary purposes of archiving data in Dryad is to enable its reuse by others.  Having clear and open terms of reuse helps realize that goal.  (Along with having well-organized data, good documentation, persistent file-formats, etc.)

CC0 was crafted specifically to reduce any legal and technical impediments, be they intentional or unintentional, to the reuse of data.   In most cases, CC0 does not actually affect the legal status of the data, since facts in and of themselves are not eligible for copyright in most countries (e.g. see this commentary from Bitlaw regarding U.S. copyright law).  But where they are, CC0 waives copyright and related rights to the extent permitted by law.

Importantly, CC0 does not exempt those who reuse the data from following community norms for scholarly communication.  It does not exempt researchers from reusing the data in a way that is mindful of its limitations.  Nor does it exempt researchers from the obligation of citing the original data authors.  However, like other scientific norms, these expectations are best articulated and enforced by the community itself through processes such as peer review.

In fact, by removing unenforceable legal barriers, CC0 facilitates the discovery, reuse, and citation of that data.

“Community norms can be a much more effective way of encouraging positive behaviour, such as citation, than applying licenses. A well functioning community supports its members in their application of norms, whereas licences can only be enforced through court action and thus invite people to ignore them when they are confident that this is unlikely.” (Panton Principles FAQ)

Dryad’s policy ultimately follows the recommendations of Science Commons, which discourage researchers from presuming copyright and using licenses that include “attribution” and “share-alike” conditions for scientific data.

Both of these conditions can put legitimate users in awkward positions.  First, specifying how “attribution” must be carried out may put a user at odds with accepted citation practice:

“when you federate a query from 50,000 databases (not now, perhaps, but definitely within the 70-year duration of copyright!) will you be liable to a lawsuit if you don’t formally attribute all 50,000 owners?” (Science Commons Database Protocol FAQ)

While “share-alike” conditions create their own unnecessary legal tangle:

“ ‘share-alike’ licenses typically impose the condition that some or all derivative products be identically licensed. Such conditions have been known to create significant “license compatibility” problems under existing license schemes that employ them. In the context of data, license compatibility problems will likely create significant barriers for data integration and reuse for both providers and users of data.” (Science Commons Database Protocol FAQ)

Thus,

“… given the potential for significantly negative unintended consequences of using copyright, the size of the public domain, and the power of norms inside science, we believe that copyright licenses and contractual restrictions are simply the wrong tool [for data], even if those licenses and contracts are used with the best of intentions.” (Science Commons Database Protocol FAQ)

Furthermore, Dryad’s use of CC0 to make the terms of reuse explicit has some important advantages:

  • interoperability: Since CC0 is both human and machine-readable, other people and indexing services will automatically be able to determine the terms of use.
  • universality: CC0 is a single mechanism that is both global and universal, covering all data and all countries.  It is also widely recognized.
  • simplicity: there is no need for humans to make, and respond to, individual data requests, and no need for click-through agreements.  This allows more scientists to spend their time doing science.
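
To make the interoperability point concrete, here is a minimal sketch of how an indexing service might check, without human involvement, whether a record is released under CC0. The CC0 1.0 URI is the one published by Creative Commons; the record layout, the "rights" field name, and the is_cc0 helper are assumptions made up for this example and do not describe Dryad's actual metadata schema.

```python
from urllib.parse import urlparse

# Path component of the canonical CC0 1.0 URI published by Creative Commons.
CC0_PATH = "/publicdomain/zero/1.0"


def is_cc0(rights_uri):
    """Return True if the URI points at the CC0 1.0 waiver.

    Tolerates http vs https and a trailing slash.
    """
    parsed = urlparse(rights_uri.strip())
    return (
        parsed.netloc.lower() == "creativecommons.org"
        and parsed.path.rstrip("/") == CC0_PATH
    )


# Hypothetical metadata record; the "rights" key is an assumption.
record = {
    "title": "Example data package",
    "rights": "http://creativecommons.org/publicdomain/zero/1.0/",
}

if __name__ == "__main__":
    print(is_cc0(record["rights"]))
```

Because the terms are identified by a single well-known URI, a check like this needs no per-dataset negotiation or click-through step.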

If you have data that, due to pre-existing agreements, cannot be released under the terms of CC0, please do not deposit that data in Dryad.  Journals that require data archiving in Dryad as a condition of publication can make exceptions for such special cases.

Footnote:  Interestingly, the repository had originally applied CC-BY to all its contents.  The very deliberate decision to use CC0 instead, made by Dryad’s Board in May of 2009, required us to obtain permission from all the early contributors to change the terms of reuse of their content.   And today, there are still a few items in Dryad under CC-BY for which permission was not granted.
