Archive for the ‘New features’ Category

1E+3

Fig 1. Helen of Troy, detail from an Attic red-figure krater, c. 450–440 BC

It is said that a picture is worth a thousand words and that Helen of Troy (Fig 1) had a face that launched a thousand ships. Why is the number 1000 significant to those of us at Dryad today? (Especially since its place in literature is ultimately an accident of our decimal number system [1].)

The reason is that Dryad released its 1000th data package.  The lucky submission is: Hager R, Cheverud JM, Wolf JB (2011) Data from: Genotype dependent responses to levels of sibling competition over maternal resources in mice. doi:10.5061/dryad.8qq3p0d8  [2]. This (arbitrary, but see [3]) milestone has put us in a reflective mood, and so here we take the opportunity to consider what it means.

First, it encourages us that Dryad’s multipronged approach to making data available for reuse (raising awareness of the issues, coordinating data archiving policy across journals, providing a user-friendly submission interface, paying attention to the incentives of researchers) is bearing fruit.  As a result of this strategy, the rate of submissions continues to grow; over 60% of submissions are from the past nine months alone.  Since a picture is worth a thousand words, see Fig 2.

Figure 2. Data packages submitted to Dryad through September 2011

We are mindful that it will take some time before we can measure the impact of the availability of these data for reuse, but there are encouraging signs from the frequency with which data are being downloaded. We will discuss those results in a separate post.

What else can we learn from these first 1000 submissions? One lesson is the importance of making data submission integral to publication. While there are 88 different journals in which the corresponding articles appear, about three quarters of the submissions come from the first nine journals that worked to integrate manuscript and data submission with Dryad [4]. Journal policy matters, and the enthusiasm with which journals implement policy matters.

As far as disciplinary diversity goes, the first 1000 submissions are dominated by journals in evolutionary biology and ecology.  Dryad’s first biomedical journal partner, BMJ Open, was integrated within the past few months, and as a result of many other new journal partnerships being developed, we expect submissions to the repository to represent a much broader array of basic and applied biosciences in the near future.

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte.  Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half.  As one might expect, many of the files are spreadsheets or in tabular text format.  Thus, the files are rich in information but not so difficult to transfer or store.

We are pleasantly surprised to report that most authors, most of the time, see the value in having their data released at the same time as the article is published. Authors are making their data available immediately upon publication, or earlier, for over 90% of data files. In nearly all cases where files are put under embargo, authors choose to release them one year post-publication rather than requesting a longer embargo from the journal.

Thomson Reuters indexes more than half a million abstracts annually in BIOSIS.  A difficult-to-estimate, but undoubtedly substantial, fraction of this literature reports on data that cannot be, or is not, archived in a specialized public data repository.  This helps put Dryad’s 1000 data packages in perspective.   As a discipline, we still have a long way to go to preserve and make available for reuse all the “published” data that has no home.  But every data package that is submitted to Dryad is a little victory for the transparency and robustness of science.

So here’s to the first thousand.  May they have plenty of company in the coming years.

Footnotes:

  1. Things might have turned out very differently, judging by the presence of early vertebrate fossils with more than five digits (see http://en.wikipedia.org/wiki/Polydactyly_in_early_tetrapods).
  2. To celebrate, we are sending a Dryad-logo coffee mug to Dr. Reinmar Hager, who submitted the 1000th data package.
  3. Random cool fact about the number 1000.  It is “the smallest number that generates three primes in the fastest way possible by concatenation of decremented numbers (1000999, 1000999998997, and 1000999998997996995994993 are prime) … [excluding] the number itself” (see http://primes.utm.edu/curios/page.php/1000.html).
  4. This includes a collection of legacy data packages from the Systematic Biology archives that was submitted en masse to Dryad in mid-2009.

Read Full Post »

Why does Dryad use CC0?

Early in the process of depositing data to the Dryad repository,  authors are asked to consent to the explicit release of their data into the public domain under the terms of a Creative Commons Zero (CC0) waiver. We are frequently asked why Dryad uses CC0 rather than a license such as CC-BY, and it is important for all users to understand the rationale for this, as well as its implications.

Obviously, one of the primary purposes of archiving data in Dryad is to enable its reuse by others.  Having clear and open terms of reuse helps realize that goal.  (Along with having well-organized data, good documentation, persistent file-formats, etc.)

CC0 was crafted specifically to reduce any legal and technical impediments, be they intentional or unintentional, to the reuse of data. In most cases, CC0 does not actually affect the legal status of the data, since facts in and of themselves are not eligible for copyright in most countries (e.g. see this commentary from Bitlaw regarding U.S. copyright law). But where they are, CC0 waives copyright and related rights to the extent permitted by law.

Importantly, CC0 does not exempt those who reuse the data from following community norms for scholarly communication.  It does not exempt researchers from reusing the data in a way that is mindful of its limitations.  Nor does it exempt researchers from the obligation of citing the original data authors.  However, like other scientific norms, these expectations are best articulated and enforced by the community itself through processes such as peer review.

In fact, by removing unenforceable legal barriers, CC0 facilitates the discovery, re-use, and citation of that data.

“Community norms can be a much more effective way of encouraging positive behaviour, such as citation, than applying licenses. A well functioning community supports its members in their application of norms, whereas licences can only be enforced through court action and thus invite people to ignore them when they are confident that this is unlikely.” (Panton Principles FAQ)

Dryad’s policy ultimately follows the recommendations of Science Commons, which discourage researchers from presuming copyright and using licenses that include “attribution” and “share-alike” conditions for scientific data.

Both of these conditions can put legitimate users in awkward positions.  First, specifying how “attribution” must be carried out may put a user at odds with accepted citation practice:

“when you federate a query from 50,000 databases (not now, perhaps, but definitely within the 70-year duration of copyright!) will you be liable to a lawsuit if you don’t formally attribute all 50,000 owners?” (Science Commons Database Protocol FAQ)

While “share-alike” conditions create their own unnecessary legal tangle:

“ ‘share-alike’ licenses typically impose the condition that some or all derivative products be identically licensed. Such conditions have been known to create significant “license compatibility” problems under existing license schemes that employ them. In the context of data, license compatibility problems will likely create significant barriers for data integration and reuse for both providers and users of data.” (Science Commons Database Protocol FAQ)

Thus,

“… given the potential for significantly negative unintended consequences of using copyright, the size of the public domain, and the power of norms inside science, we believe that copyright licenses and contractual restrictions are simply the wrong tool [for data], even if those licenses and contracts are used with the best of intentions.” (Science Commons Database Protocol FAQ)

Furthermore, Dryad’s use of CC0 to make the terms of reuse explicit has some important advantages:

  • interoperability: Since CC0 is both human and machine-readable, other people and indexing services will automatically be able to determine the terms of use.
  • universality: CC0 is a single mechanism that is both global and universal, covering all data and all countries.  It is also widely recognized.
  • simplicity: there is no need for humans to make, and respond to, individual data requests, and no need for click-through agreements.  This allows more scientists to spend their time doing science.
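To illustrate the interoperability point, here is a minimal sketch of how an indexing service might recognize a CC0 waiver from a machine-readable rights statement. It assumes the rights statement is recorded as the Creative Commons URL for CC0; the `record` dictionary and its `rights` field are hypothetical and do not describe Dryad's actual metadata schema.

```python
from urllib.parse import urlparse

# Path prefix shared by all versions of the CC0 public domain dedication.
CC0_PATH_PREFIX = "/publicdomain/zero/"


def is_cc0(license_url: str) -> bool:
    """Return True if the URL points at a Creative Commons Zero waiver."""
    parsed = urlparse(license_url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host == "creativecommons.org" and parsed.path.startswith(CC0_PATH_PREFIX)


# Hypothetical metadata record, for illustration only.
record = {
    "title": "Example data package",
    "rights": "http://creativecommons.org/publicdomain/zero/1.0/",
}

if is_cc0(record["rights"]):
    print("Reusable without a data request or click-through agreement")
```

Because the terms are expressed as a single well-known URL, a check like this needs no human in the loop, which is the practical benefit the list above describes.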

It is important to note that if you have data that, due to pre-existing agreements, cannot be released under the terms of CC0, please do not deposit that data to Dryad.  Journals that require data archiving in Dryad as a condition of publication can make exceptions for such special cases.

Footnote:  Interestingly, the repository had originally applied CC-BY to all its contents.  The very deliberate decision to use CC0 instead, made by Dryad’s Board in May of 2009, required us to obtain permission from all the early contributors to change the terms of reuse of their content.   And today, there are still a few items in Dryad under CC-BY for which permission was not granted.

Read Full Post »

Behind a scientific finding, in addition to unique data, there is often unique software. If Dryad archives data in part to allow others to validate the findings reported in the literature, then should we not also enable researchers to archive the software that was used to process, analyze and, in the case of simulations, create those data?

Some users have already deposited software source code alongside their data (e.g. doi:10.5061/dryad.8384, doi:10.5061/dryad.18) [1]. If users are willing and able to release their code under a CC-Zero waiver [2], then there is nothing stopping this practice. In fact, Creative Commons and the Free Software Foundation have recently stated that CC-Zero is appropriate for release of software to the public domain [3].

Yet, a number of journal partners and users have requested that Dryad provide more, or different, options for software, and that authors should not be required to waive legal rights with CC-Zero. Since software is clearly a creative work, source code unambiguously carries copyrightable intellectual property. Enabling a greater range of licensing options could open the door to more authors archiving software that is integral to their paper, and this would further Dryad’s mission of enabling scientists to validate and build upon previous work. So, how should we do that?

One important consideration is that we aim to make the submission process as easy as possible for users. This would be compromised by presenting a confusing array of licensing options, and having those differ between types of files.

The principal desiderata of a license for deposited software are more or less the same as for data: freedom to reuse, modify (analogous to “recombine” for data), and redistribute (in original or modified form), with no more than attribution expected or required. It turns out that these are also the principles common to all licenses approved by the Open Source Initiative, or OSI [4].

So, could we just pick one of the minimally restrictive OSI-approved licenses (since we want to facilitate reuse rather than hamper it), and require release of software under those terms? We are currently of the opinion that the answer is “no”, for a couple of reasons:

(1) Some, though not all, software will already be licensed. Asking a user to choose a different one would clearly be a burden, since changing a license requires express consent from all copyright holders, including possibly the employer or funder.

(2) If the software includes third-party code to which a ‘share-alike’ license has been assigned (e.g. the GNU General Public License, or GPL [5]), then the user is required to release the code under equivalent licensing terms. Unlike for data, it would be highly unusual to combine software source code from many different sources, and so this does not pose an insurmountable barrier to archiving and reuse for scientific purposes.

Given the above, our current thinking is that Dryad should enable users to select any OSI-approved license they deem appropriate. However, we also wish to strongly guide users, when there is no prior license assigned to any part of their software, to choose either a non-share-alike OSI license or a CC-Zero waiver. It is currently unclear whether dedicating software to the public domain with CC-Zero would be of as much value as it is for data [6]. We’d welcome your thoughts on that.

There are some other considerations on our plate, as well:

  • We want to be careful to avoid steering users away from using a public source code repository when that is more appropriate [7]. Is it better for Dryad to host code snapshots, or to direct users to specific versions of software in a public code repository?
  • Some users bundle software and data together in tarballs or zip archives. Since we cannot easily assign different terms to the data and software within such a combined file, it could increase the burden on users to separate these components out.
  • In addition to software, there is other content that publishers host in Supplemental Materials that some of our partner journals would like Dryad to host, instead. To the extent that some of this content is neither data nor software, should we be recognizing a third category of intellectual property, to which a license such as CC-BY [8] would be assigned?

If you have opinions or ideas, we would like to encourage you to share them with us as public comments on this blog. What’s the best way to accommodate software (and other non-data material) within Dryad?

Notes

[1] Some software source code in Dryad is already available under grandfathered license terms, such as in doi:10.5061/dryad.18.

[2] Dryad currently requires users to assign CC-Zero to all archived files. This waives all copyright and related rights in the data (to the extent legally possible in an author’s jurisdiction), effectively dedicating the data to the public domain. The use of CC-Zero is predicated on most data being “facts”, and facts in most jurisdictions cannot be copyrighted, although this is not universally true (e.g. photographs). Note that Dryad has a policy that the original article and the data package are to be cited when the data are reused, but we feel that this is most appropriately enforced through scholarly practice, not through a license.

[3] According to the Creative Commons FAQ, CC-Zero “is suitable for dedicating your copyright and related rights in computer software to the public domain, to the fullest extent possible under law. Unlike CC licenses, which should not be used for software, CC0 is compatible with many software licenses, including the GPL”.

[4] http://www.opensource.org/

[5] http://www.gnu.org/licenses/gpl.html

[6] For the motivation behind the recommended use of CC-Zero for data, see the Science Commons Protocol for Implementing Open Access Data

[7] Public open source code repositories include generic ones, such as Sourceforge, as well as those specific to particular types of code, such as R-forge for R, and CPAN for Perl. For more about best practices in scientific software development, see Baxter SM, Day SW, Fetrow JS, Reisinger SJ (2006) Scientific Software Development Is Not an Oxymoron. PLoS Comput Biol 2(9): e87. doi:10.1371/journal.pcbi.0020087

[8] http://creativecommons.org/licenses/by/3.0

[9] Many thanks to H. Lapp for starting this post. I (T. Vision) take responsibility for the opinions expressed here, as well as any sins of omission or commission.

Read Full Post »

Credit: adamthelibrarian, from Flickr

This is an important month, because a host of our partner journals are implementing new policies on data archiving, and, in the U.S., the National Science Foundation is asking its new grantees to have explicit data management plans.  There are over 1000 data files from over 50 journals now in Dryad, and much of this content has been submitted only within the past year. Clearly, Dryad’s role in supporting the growing data archiving mandates from journals and funders continues to expand.

New Features
In the past few months, several new features have been added to Dryad.  Users can now save an incomplete submission and come back later to complete it.  They can see a listing of their completed and in progress submissions.  Users can download data citations to their favorite bibliography management programs and upload them to their favorite social bookmarking tools.  A new “faceted search” interface allows users to find data more easily, and also displays related content in other repositories, including ecological and environmental science data (from the Knowledge Network for Biocomplexity) and phylogenetic data (from TreeBASE). To provide an early indication of scientific impact, users can see how often data have been viewed and downloaded.

An important new feature is “handshaking”, which is what we call the process whereby authors upload some of their data to Dryad, and the information is conveyed behind the scenes to a specialized repository. The aim of handshaking is to reduce the time and effort needed to deposit data when there are different repositories managing different aspects of the data. Handshaking also enables persistent linkages among data in the different repositories. As a first foray into handshaking, we now offer users the option of initiating a deposit in TreeBASE, the primary repository for published phylogenetic data, whenever a NEXUS file is uploaded to Dryad. Alternatively, the option is available to deposit in another repository first, and report the identifiers to Dryad to ensure that users can find all the data relevant to a given article. We will be working in the months ahead to handshake with other specialized repositories required by our partner journals.
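As a rough illustration of the trigger for handshaking, the sketch below shows one way an uploaded file could be recognized as NEXUS: files in that format begin with a `#NEXUS` header line. How Dryad actually detects NEXUS files is not described here, so the function name and the decision to inspect only the first non-blank line are assumptions made for this example.

```python
def looks_like_nexus(text: str) -> bool:
    """Return True if the first non-blank line is the #NEXUS header."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # The header is treated case-insensitively in this sketch.
            return stripped.upper().startswith("#NEXUS")
    return False  # empty input


# Hypothetical upload contents, for illustration only.
uploaded = "#NEXUS\nBEGIN TAXA;\n  DIMENSIONS NTAX=2;\nEND;\n"

if looks_like_nexus(uploaded):
    print("Offer the depositor the option of a TreeBASE deposit")
```

A check on content rather than on the file extension is used here because depositors may name phylogenetic files in many different ways.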

See our recent blog post about these features for more details.

Data Deposit in Three Easy Steps: The Movie
Are you looking for a way to show a colleague how straightforward data archiving can be? We’ve added a short (2-minute) video to the site that walks users through the deposit process in three easy steps. The video is also available at SciVee.

Journals Implement Joint Data Archiving Policy
Starting this month, a number of Dryad partner journals have implemented a Joint Data Archiving Policy that requires, as a condition of publication, that authors deposit the data underlying their article in a public repository.  Some of the journals implementing this policy include: The American Naturalist, Evolution, Evolutionary Applications, Heredity, Journal of Evolutionary Biology, and Molecular Ecology. A recent TREE article by Michael Whitlock suggests how “data generators, data re-users, and journals can maximize the fairness and scientific value of data archiving.”

A growing number of journals now integrate their submission process with Dryad, meaning that the repository and journal exchange information to facilitate the author’s data deposition process and to ensure persistent linkage between articles and data. The current list includes The American Naturalist, The Biological Journal of the Linnean Society, Evolution, Journal of Evolutionary Biology, Journal of Heredity, Molecular Ecology, and Molecular Ecology Resources. And more are on the way (stay tuned).

NSF Data Management Plan Mandate
Starting this month, the U.S. National Science Foundation is requiring grant applicants to provide a data management plan describing how data will be collected, preserved and made available, and these plans will be subject to peer review.  We encourage applicants to leverage Dryad in their data management plans as a solution for the long-term preservation and dissemination of the data associated with their publications.  There are some pointers to resources for data management planning on the Dryad website.

Dryad UK Project
The Joint Information Systems Committee (JISC) in the UK has made an award to Dryad, through Oxford University and the British Library, to expand the scope of the journals involved, including into the areas of infectious disease and epidemiology, and to create a UK mirror of Dryad. More information is here and at the Dryad UK site.

New Twitter Feed for Data Deposits
Interested in keeping up with new data available in Dryad?  Follow our Twitter feed (@datadryadnew) or subscribe to our RSS feed. We also Tweet general news about the repository and the world of data science as @datadryad.

Browse and search the repository at http://datadryad.org/
Follow Dryad on Twitter http://twitter.com/datadryad

This blog post is the first issue of the Dryad newsletter, summarizing recent achievements and milestones of the data repository.  If you’d like to receive future newsletters by email, please sign up for the Dryad Users mailing list.

Read Full Post »

Are you curious about what’s involved in depositing data in Dryad? Or looking for a quick way to show colleagues how straightforward data archiving can be? Dryad’s new 2-minute video demonstrates the data deposit process from start to finish.

How to deposit data in Dryad

The video is embedded on the Dryad website, and also available on SciVee. Feel free to link to it and share it with colleagues.

Read Full Post »

Ever wonder what happens to your Dryad data behind the scenes? Here’s a quick overview.

Once a depositor has uploaded their data files and finalized their submission, the Dryad curator is notified of the new content. The curator looks at the uploaded files to make sure they really do contain data (and not, say, the article manuscript or pictures of kittens). The curator then exerts some quality control on the metadata, the description of the article and data files. She corrects errors, such as typos or formatting tags that are displaying incorrectly, and may enrich the metadata, by adding taxon name keywords, for example. Advanced metadata enrichment issues include the tricky realm of name authority control, which ensures that all works by a given author are gathered together despite the varying forms of their name.

Once the curator approves the submission, the metadata description of the data goes live in the repository. The status of the data files themselves depends upon the embargo options selected by the depositor. Dryad DOIs (Digital Object Identifiers) are sent to the depositor and, in the case of our integrated partner journals, to the journal editors, so that they can be included in all forms of the final published article, and allow readers of the article to find the supporting data.

After the article is published, the curator adds complete article citation information, including a hyperlinked article DOI, to the Dryad record, and updates any data file embargoes, if needed.

The outcome is data files, which

  • are securely deposited in the repository, and linked to the journal article,
  • have a unique, permanent identifier that can be cited, and
  • can be discovered independently of the article, as well as through the article.
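To make the citable identifier concrete, here is a minimal sketch that turns a Dryad DOI, written in the `doi:` form used in citations, into a link that resolves through the DOI Foundation's resolver at `https://doi.org/`. The function name is an assumption for this example, and the validation is deliberately loose.

```python
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Convert 'doi:10.xxxx/suffix' or '10.xxxx/suffix' into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # All DOIs start with the directory indicator "10.".
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + doi


# A Dryad DOI in the citation form used on this blog.
print(doi_to_url("doi:10.5061/dryad.18"))
```

Because the identifier is independent of where the files are hosted, the same link can appear in the published article, in the Dryad record, and in later citations of the data.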

Additionally, authors can now track the views and downloads of their data files.   Dryad displays the number of times the data package has been viewed, and the number of times each component data file has been both viewed and downloaded.

Read Full Post »

We’ve created a new Twitter feed for announcing all new data packages added to Dryad. It’s @datadryadnew; follow it if you want to keep an eye on what is going into the repository.

Our @datadryad feed is also available, for updates on the Dryad repository and data sharing in general.

Read Full Post »
