What matters to you when looking for research data in a repository? The UK-based Digital Curation Centre (DCC) is looking for Dryad users to complete a 10-minute questionnaire on this. Results will contribute to an assessment framework for Dryad, and the questionnaire includes entry to a competition for $80/£50 Amazon tokens. The DCC is carrying this out as part of the Dryad UK project, which also involves the British Library and Oxford University’s Image Bioinformatics Lab.
A new study in PLoS ONE by Heather Piwowar, a postdoctoral associate affiliated with DataONE, Dryad, and NESCent, reveals interesting trends in the archiving of data underlying published microarray results. From the press release:
By querying the full text of the scientific literature through websites like Google Scholar and PubMed Central, Piwowar identified eleven thousand studies that collected a particular type of data about cellular activity, called gene expression microarray data. Only 45% of recent gene expression studies were found to have deposited their data in the public databases developed for this purpose. The rate of data publication has increased only slightly from 2007 to 2009. Data is shared least often from studies on cancer and human subjects: cancer studies make their data available for wide reuse half as often as similar studies outside cancer.
“It was disheartening to discover that studies on cancer and human subjects were least likely to make their data available. These data are surely some of the most valuable for reuse, to confirm, refute, inform and advance bench-to-bedside translational research,” Piwowar said.
“We want as much scientific progress as we can get from our tax and charity dollars. This requires increased access to data resources. Data can be shared while maintaining patient privacy,” Piwowar added, noting that patient re-identification is rarely an issue for gene expression microarray studies.
Reference: Piwowar, H. (2011). “Who shares? Who doesn’t? Factors associated with openly archiving raw research data.” PLoS ONE 6(7): e18657. doi:10.1371/journal.pone.0018657
“In the spirit of the topic”, the data behind the study are publicly available in Dryad at doi:10.5061/dryad.mf1sd
Behind a scientific finding, in addition to unique data, there is often unique software. If Dryad archives data in part to allow others to validate the findings reported in the literature, then should we not also enable researchers to archive the software that was used to process, analyze, and, in the case of simulations, create those data?
Some users have already deposited software source code alongside their data (e.g. doi:10.5061/dryad.8384, doi:10.5061/dryad.18). If users are willing and able to release their code under a CC-Zero waiver, then there is nothing stopping this practice. In fact, Creative Commons and the Free Software Foundation have recently stated that CC-Zero is appropriate for release of software to the public domain.
Yet, a number of journal partners and users have requested that Dryad provide more, or different, options for software, and that authors should not be required to waive legal rights with CC-Zero. Since software is clearly a creative work, source code unambiguously carries copyrightable intellectual property. Enabling a greater range of licensing options could open the door to more authors archiving software that is integral to their paper, and this would further Dryad’s mission of enabling scientists to validate and build upon previous work. So, how should we do that?
One important consideration is that we aim to make the submission process as easy as possible for users. This would be compromised by presenting a confusing array of licensing options, and having those differ between types of files.
The principal desiderata of a license for deposited software are more or less the same as for data: freedom to reuse, modify (analogous to “recombine” for data), and redistribute (in original or modified form), with no more than attribution expected or required. It turns out that these are also the principles common to all licenses approved by the Open Source Initiative, or OSI.
So, could we just pick one of the minimally restrictive OSI-approved licenses (since we want to facilitate reuse rather than hamper it), and require release of software under those terms? We are currently of the opinion that the answer is “no”, for a couple of reasons:
(1) Some, though not all, software will already be licensed. Asking a user to choose a different one would clearly be a burden, since changing a license requires express consent from all copyright holders, including possibly the employer or funder.
(2) If the software includes third-party code to which a ‘share-alike’ license has been assigned (e.g. the GNU General Public License, or GPL), then the user is required to release the code under equivalent licensing terms. Unlike for data, it would be highly unusual to combine software source code from many different sources, and so this does not pose an insurmountable barrier to archiving and reuse for scientific purposes.
Given the above, our current thinking is that Dryad should enable users to select any OSI-approved license they deem appropriate. However, we also wish to strongly guide users, when there is no prior license assigned to any part of their software, to choose either a non-share-alike OSI license or a CC-Zero waiver. It is currently unclear whether dedicating software to the public domain with CC-Zero would be of as much value as it is for data. We’d welcome your thoughts on that.
There are some other considerations on our plate, as well:
- We want to be careful to avoid steering users away from using a public source code repository when that is more appropriate. Is it better for Dryad to host code snapshots, or to direct users to specific versions of software in a public code repository?
- Some users bundle software and data together in tarballs or zip archives. Since we cannot easily assign different terms to the data and software within such a combined file, it could increase the burden on users to separate these components out.
- In addition to software, there is other content that publishers host in Supplemental Materials that some of our partner journals would like Dryad to host instead. To the extent that some of this content is neither data nor software, should we be recognizing a third category of intellectual property, to which a license such as CC-BY would be assigned?
If you have opinions or ideas, we would like to encourage you to share them with us as public comments on this blog. What’s the best way to accommodate software (and other non-data material) within Dryad?
 Some software source code in Dryad is already available under grandfathered license terms, such as in doi:10.5061/dryad.18.
 Dryad currently requires users to assign CC-Zero to all archived files. This waives all copyright and related rights in the data (to the extent legally possible in an author’s jurisdiction), effectively dedicating the data to the public domain. The use of CC-Zero is predicated on most data being “facts”, and facts in most jurisdictions cannot be copyrighted, although this is not universally true (e.g. photographs). Note that Dryad has a policy that the original article and the data package are to be cited when the data are reused, but we feel that this is most appropriately enforced through scholarly practice, not through a license.
 According to the Creative Commons FAQ, CC-Zero “is suitable for dedicating your copyright and related rights in computer software to the public domain, to the fullest extent possible under law. Unlike CC licenses, which should not be used for software, CC0 is compatible with many software licenses, including the GPL.”
 For the motivation behind the recommended use of CC-Zero for data, see the Science Commons Protocol for Implementing Open Access Data
 Public open source code repositories include generic ones, such as Sourceforge, as well as those specific to particular types of code, such as R-forge for R, and CPAN for Perl. For more about best practices in scientific software development, see Baxter SM, Day SW, Fetrow JS, Reisinger SJ (2006) Scientific Software Development Is Not an Oxymoron. PLoS Comput Biol 2(9): e87. doi:10.1371/journal.pcbi.0020087
 Many thanks to H. Lapp for starting this post. I (T. Vision) take responsibility for the opinions expressed here, as well as any sins of omission or commission.
It would be a good idea to identify a suitable data repository and be ready to deposit your files in it, because this month marks the implementation of the Joint Data Archiving Policy. The policy, endorsed by a consortium of prominent journals and societies, states that journals will require
as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive.
The policy can be customized by each journal, and enables both embargoes and editorial discretion to make special exceptions. Blanket exemptions apply to sensitive data such as identifiable human records and endangered species localities.
The journals (and corresponding societies) implementing the policy this month are:
- The American Naturalist (American Society of Naturalists)
- Evolution (Society for the Study of Evolution)
- Evolutionary Applications
- Heredity (The Genetics Society)
- Journal of Evolutionary Biology (European Society for Evolutionary Biology)
- Molecular Biology and Evolution (Society for Molecular Biology and Evolution)
- Molecular Ecology
- Systematic Biology (Society for Systematic Biology)
A sampling of the revised Instructions to Authors includes:
- The American Naturalist: “The American Naturalist requires authors to deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required. There are many possible archives that may suit a particular data set, including the Dryad repository for ecological and evolutionary biology data (http://datadryad.org). All accession numbers for GenBank, TreeBASE, and Dryad must be included in accepted manuscripts before they go to Production. Any impediments to data sharing should be brought to the attention of the editors at the time of submission.”
- Journal of Evolutionary Biology: “The editors and publisher of this journal expect authors to make the data underlying published articles available. An investigator who feels that reasonable requests have not been met by the authors should correspond with the Editor-in-Chief. Authors must use the appropriate database to deposit detailed information supplementing submitted papers, and quote the accession number in their manuscripts.”
- Molecular Ecology: “Data Accessibility: To enable readers to locate archived data from Molecular Ecology papers, as of January 2011 we will require that authors include a ‘Data Accessibility’ section after their references. This should list the data base and respective accession numbers for all data from the manuscript that has been made publicly available…. Please note that this section must be complete prior to the submission of the final version of your manuscript. Papers lacking this section will not be sent to Production.”
At Dryad, we have been working for some time now with editors and publishers at these and other partner journals to support the implementation of this policy. If you submit an article to a “JDAP journal,” you will be invited to simultaneously submit your data to Dryad. This may occur either prior to review or, depending on the journal, at the time your article is accepted. Dryad and the journal communicate behind the scenes to make it as easy as possible for you to deposit your data, and also ensure that a permanent, resolvable, and citable data identifier is published in the final article. That way, in the future, no one need be frightened by the question “do you know where your data are?”
Ever wonder what happens to your Dryad data behind the scenes? Here’s a quick overview.
Once a depositor has uploaded their data files and finalized their submission, the Dryad curator is notified of the new content. The curator looks at the uploaded files to make sure they really do contain data (and not, say, the article manuscript or pictures of kittens). The curator then performs quality control on the metadata, that is, the description of the article and data files. She corrects errors, such as typos or formatting tags that are displaying incorrectly, and may enrich the metadata by adding taxon name keywords, for example. Advanced metadata enrichment issues include the tricky realm of name authority control, which ensures that all works by a given author are gathered together despite the varying forms of their name.
Once the curator approves the submission, the metadata description of the data goes live in the repository. The status of the data files themselves depends upon the embargo options selected by the depositor. Dryad DOIs (Digital Object Identifiers) are sent to the depositor and, in the case of our integrated partner journals, to the journal editors, so that they can be included in all forms of the final published article, and allow readers of the article to find the supporting data.
After the article is published, the curator adds complete article citation information, including a hyperlinked article DOI, to the Dryad record, and updates any data file embargoes, if needed.
The outcome is data files, which
- are securely deposited in the repository, and linked to the journal article,
- have a unique, permanent identifier that can be cited, and
- can be discovered independently of the article, as well as through the article.
Additionally, authors can now track the views and downloads of their data files. Dryad displays the number of times the data package has been viewed, and the number of times each component data file has been both viewed and downloaded.
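For readers who think in code, the curation steps above can be pictured as a small state machine. This is only an illustrative sketch, not Dryad's actual software: the names `Status`, `Submission`, `approve`, and `record_publication` are hypothetical, and the checks are simplified assumptions based on the overview in this post.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Status(Enum):
    SUBMITTED = auto()   # depositor has finalized the upload
    APPROVED = auto()    # curator has checked the files and metadata
    PUBLISHED = auto()   # article citation has been added to the record


@dataclass
class Submission:
    """Hypothetical stand-in for a data package moving through curation."""
    files: List[str]
    keywords: List[str] = field(default_factory=list)
    status: Status = Status.SUBMITTED
    article_doi: Optional[str] = None


def approve(sub: Submission, extra_keywords: List[str]) -> None:
    """Curator step: confirm files exist, enrich the metadata, approve."""
    if sub.status is not Status.SUBMITTED:
        raise ValueError("only a newly submitted package can be approved")
    if not sub.files:
        raise ValueError("a submission must contain at least one data file")
    sub.keywords.extend(extra_keywords)  # e.g. taxon name keywords
    sub.status = Status.APPROVED


def record_publication(sub: Submission, article_doi: str) -> None:
    """After the article appears: attach its DOI to the data record."""
    if sub.status is not Status.APPROVED:
        raise ValueError("the package must be approved before publication")
    sub.article_doi = article_doi
    sub.status = Status.PUBLISHED
```

In this sketch, a package can only move forward one step at a time, which mirrors the order described above: submission, curator approval, then the addition of the article citation once the paper is out.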
Data files in Dryad don’t just get dumped in there. Someone is there to look after the accuracy and completeness of the metadata, to migrate data files into new formats when necessary, to help users with new submissions, and generally mind the details so that others can find and reuse the data files down the road. This activity is called curation, and it is a critical behind-the-scenes function of a digital repository. Here, we’d like to take this opportunity to introduce Dryad’s lead curator, Elena Feinstein.
Elena, who hails from Atlanta, has degrees in biology from NYU, education from Emory, and library & information science from the University of North Carolina (UNC) at Chapel Hill. Before coming to Dryad, she taught high school and was a science librarian at UNC. Now, Elena works with the UNC Metadata Research Center curating Dryad’s content and continually improving all aspects of the way the repository manages its metadata.
When she’s not working on Dryad, Elena volunteers with the Durham Central Market co-op grocery store, and cooks and bakes until the wee hours.
Next time you submit data to Dryad, rest assured it will receive some quality attention from Elena.
If you’re in London this week, don’t miss Science Online London on Friday and Saturday, Sept. 3-4. Hosted by the British Library, Mendeley, and Nature, this meeting is an opportunity not just to listen but to connect, engage, and interact.
Stop by the British Library booth to find out more about Dryad’s expansion under the new JISC grant involving Oxford University and BL.
Meeting topics include:
- How is the web changing the way we conduct, communicate, share, and evaluate research? How can we employ these trends for the greater good?
- How is the internet changing the way we work with data?
- How are blogs and social networking facilitating scientific discussion? What challenges do we face?
- What challenges and opportunities are there when engaging with the public?
In particular, these sessions on Friday may be of interest to those involved in data sharing:
- Breakout 1: Publishing primary research data
- Breakout 8: Connecting scientific resources
Follow the conference on Twitter @soloconf (comment with hashtag #solo10).