Dryad to join launch of the Data Curation Network

Alfred P. Sloan Foundation grant will fund implementation of a shared staffing model across seven academic libraries and Dryad

We’re thrilled to announce that Dryad will participate in a three-year, multi-institutional effort to launch the Data Curation Network. The implementation — led by the University of Minnesota Libraries and backed by a $526,438 grant from the Alfred P. Sloan Foundation — builds on previous work to better support researchers faced with a growing number of requirements to openly and ethically share their research data.

The result of many months of research and planning, the project brings together eight partners: seven academic library institutions and Dryad.

Currently, staff at each of these institutions provide their own data curation services. But because data curation requires a specialized skill set — spanning a wide variety of data types and discipline-specific data formats — institutions cannot reasonably expect to hire an expert in each area.

Curation workflow for the DCN

The Data Curation Network is intended to serve as a cross-institutional staffing model, seamlessly connecting a network of expert data curators to local datasets and supplementing local curation expertise. The project aims to increase local capacity, strengthen cross-institutional collaboration, and ensure that researchers and institutions share data ethically and appropriately.

Lisa R. Johnston, Principal Investigator for the DCN and Director of the Data Repository for the University of Minnesota (DRUM), explains:

Functionally, the Data Curation Network will serve as the ‘human layer’ in a local data repository stack that provides expert services, incentives for collaboration, normalized curation practices, and professional development training for an emerging data curator community.

For our part, the Dryad curation team is excited to join a collegial network of professionals, to help develop shared procedures and understandings, and to learn from the partners’ experience and expertise (as they may learn from ours).

As an independent, non-profit repository, we are especially pleased to be working more closely with the academic library community, and we hope this project can provide a launchpad for future international collaborations among organizations with similar missions but differing structures and funding models.

Watch this space for news as the project develops, and follow the DCN on Twitter: #DataCurationNetwork

Improvements in data-article linking

Dryad is a curated, non-profit, general-purpose repository specifically for data underlying scientific and medical publications — mainly journal articles. As such, we place great importance on linking data packages to the articles with which they are associated, and we try our best to encourage authors and journals to link back to the Dryad data from the article, ideally in the form of a reference in the works cited section. (There’s still a long way to go in this latter effort; see this study from 2016 for evidence.)

Submission integration provides closer coordination between Dryad and journals throughout the publishing workflow, and simplifies the data submission process for authors. We’ve already implemented this free service with 120 journals. If you’re interested in integrating your journal, please contact us.

We’re excited to share a few recent updates that are making our data-article links more efficient to create, more discoverable, and more re-usable by other publishers and systems.

The Automated Publication Updater

One of the greatest housekeeping challenges for our curation team lies in finding out when the articles associated with Dryad data packages become available online. Once they do, we want to add the article citation and DOI link to our record as quickly as possible, and to release any data embargoes placed “until the article appears.” Historically, we’ve achieved this through a laborious patchwork of web searches, journal alert emails, and notifications from authors or editors themselves.

But over the past year or so, we’ve built and refined a webapp that we call the APU (Automated Publication Updater). This super-handy tool compares data packages in the Dryad workflow with publication metadata available at Crossref. When a good match is found, it automatically updates the article-related fields in the Dryad record, and then sends our curation team an email alert so they can validate the match and finalize the record. Curators can easily run the webapp as often as needed (usually a few times a week).
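
For the curious, here is a minimal sketch of the flavor of matching the APU does. This is an illustration, not our production code: it asks the Crossref REST API for the best bibliographic match to a data package’s article metadata and returns the candidate for a human to review.

    import requests

    def find_article(title, author_surname):
        """Ask Crossref for the closest match to a data package's
        associated-article metadata (illustrative sketch only)."""
        params = {
            "query.bibliographic": title,   # title words, journal, year, etc.
            "query.author": author_surname,
            "rows": 1,                      # return only the best-scoring candidate
        }
        resp = requests.get("https://api.crossref.org/works", params=params, timeout=30)
        resp.raise_for_status()
        items = resp.json()["message"]["items"]
        if not items:
            return None
        best = items[0]
        # 'score' is Crossref's relevance score; the real APU applies its own
        # sanity checks, and a curator confirms every match before release.
        return best.get("DOI"), best.get("title"), best.get("score")

    print(find_article("Example article title", "Surname"))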

While the APU doesn’t find everything, it has dramatically improved the efficiency with which we add article information and links to Dryad records — and our curators’ happiness levels. Big win. (If you’re interested in the technical details, you can find them on our wiki.)

Scholix

Dryad is also pleased to be a contributor to Scholix, or Scholarly Link Exchange, an initiative of the Research Data Alliance (RDA) and the World Data System (WDS). Scholix is a high-level interoperability framework for exchanging information about the links between scholarly literature and data.

  • The problem: many disconnected sources of scholarly output, with differing practices around persistent identifier (PID) systems, ways of referencing data, and the timing of data citation.
  • The Scholix solution: a standard set of guidelines for exposing and consuming data-article links, using a system of hubs.

Here’s how it works:

  1. As a DataCite member repository, Dryad provides our data-publication links to DataCite, one of the Scholix Hubs.
  2. Those links are made available via Scholix aggregators such as the DLI service.
  3. Publishers can then query the DLI to find datasets related to their journal articles, and generate/display a link back to Dryad, driving web traffic to us, increasing data re-use, and facilitating research discovery.

Crossref publishers, DataCite repositories/data centers, and institutional repositories can all participate — information on how to join is available on the Scholix website.
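
To make step 3 concrete, here is a rough sketch of a link lookup against ScholeXplorer, the OpenAIRE-run aggregator behind the DLI service. The endpoint, parameter, and response keys below are assumptions based on our reading of the ScholeXplorer v2 API; consult the Scholix website for the current interface.

    import requests

    # Hypothetical lookup: find datasets linked to an article DOI.
    # Endpoint, parameter, and response keys are assumptions; check the
    # Scholix/ScholeXplorer documentation for the current interface.
    SCHOLEX_API = "http://api.scholexplorer.openaire.eu/v2/Links"

    resp = requests.get(SCHOLEX_API,
                        params={"sourcePid": "10.1234/example-doi"},  # placeholder DOI
                        timeout=30)
    resp.raise_for_status()

    for link in resp.json().get("result", []):
        target = link.get("target", {})
        if target.get("Type") == "dataset":
            for ident in target.get("Identifier", []):
                print(ident.get("ID"), ident.get("IDScheme"))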

Programmatic data access by ISSN

Did you know that content in Dryad is available via a variety of APIs (Application Programming Interfaces)? Details are available on the “Data Access” page of our wiki.

The newest addition to this list is the ability to access Dryad data packages via journal ISSN. So, for example, if you wanted access to all Dryad content associated with the journal Evolution Letters, you would format your query as follows:

https://datadryad.org/api/v1/journals/2056-3744/packages

If you’re a human instead of a machine, you might prefer to visit our “journal page” for Evolution Letters:

https://datadryad.org/journal/2056-3744
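
And if you are a machine after all, here is a minimal sketch of calling the ISSN endpoint from Python. We’re assuming a JSON response for illustration; the actual response format and any paging options are documented on the “Data Access” wiki page.

    import requests

    # All Dryad data packages associated with Evolution Letters (ISSN 2056-3744)
    url = "https://datadryad.org/api/v1/journals/2056-3744/packages"

    resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
    resp.raise_for_status()

    # Assumed response shape: a JSON list of package identifiers (e.g. DOIs).
    for package in resp.json():
        print(package)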

————

Dryad is committed to values of openness, collaboration, standardization, seamless integration, reduction of duplicated effort, and increased visibility of research products (okay, data especially). The above examples are just some of the ways we’re working in this direction.

If you’re part of an organization that shares these values, please contact us to find out how you can be part of Dryad.

How do researchers pay for data publishing? Results of a recent submitter survey

As a non-profit repository dependent on support from members and users, Dryad is greatly concerned with the economics and sustainability of data services. Our business model is built around Data Publishing Charges (DPCs), designed to recover the basic costs of curating and preserving data. Dryad DPCs can be covered in three ways (see the sketch following this list):

  1. The DPC is waived if the submitter is based in a country classified by the World Bank as a low-income or lower-middle-income economy.
  2. For many journals, the society or publisher will sponsor the DPC on behalf of their authors (to see whether this applies, look up your journal).
  3. In the absence of a waiver or a sponsor, the DPC is US$120, payable by the submitter.
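
In code terms, the decision works roughly like this (a simplified sketch; in practice, waivers and sponsorships are checked automatically during submission):

    def data_publishing_charge(country_income_band, journal_sponsors_dpc):
        """Simplified sketch of how a Dryad DPC is determined."""
        if country_income_band in ("low-income", "lower-middle-income"):
            return 0    # waived, per World Bank classification
        if journal_sponsors_dpc:
            return 0    # sponsored by the society or publisher
        return 120      # US$, payable by the submitter

    print(data_publishing_charge("upper-middle-income", journal_sponsors_dpc=False))  # 120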

Our long-term aim is to increase sponsorships and reduce the financial responsibility of individual researchers.

Last year, we launched a pilot study sponsored by the US National Science Foundation to test the feasibility of having a funding agency directly sponsor the DPC. We conducted a survey of Dryad submitters as part of the pilot, hoping to learn more about how researchers plan and pay for data archiving.

Initial survey results

We first want to say a hearty THANK YOU to our participants for giving us so much good information to work with! (Ten participants were randomly selected to receive gift cards as a sign of our appreciation.) Respondents were located around the world, with nearly all based at academic institutions.

Survey respondents' positions

A word about the selection of survey participants: we know that approximately one-third of all Dryad data publications have neither a sponsor nor a waiver, meaning the researcher is responsible for covering the $120 charge. We wanted to learn more about payment methods and funding sources for these non-sponsored DPCs.

We specifically solicited researchers for our survey who had 1) submitted to Dryad in the previous year and 2) paid their Data Publishing Charge directly (via credit card or voucher code). The survey questions focused on a few topics:

  • Grant funding and Data Management Plans,
  • Where the money for their Data Publishing Charges ultimately came from, and
  • Whether funding concerns affect their data archiving behavior.

A few highlights are presented below; we intend to dig deeper into the survey results (and other information gathered as part of the pilot study) and report on them publicly in the coming months.

Planning for data in grant proposals

Nearly 72% of respondents indicated that the research associated with their publication/data was supported by a grant. We wanted to know how (or whether) researchers planned ahead for archiving their data in their grant proposals, and the results were enlightening:

  • 43% did not include a Data Management Plan (DMP) as part of their proposal for funding.
  • Of those who did submit a DMP, only about 46% committed to archiving their data as part of that plan.
  • A whopping 96% said they did not specifically budget for data archiving in their proposal.
  • Only 41% were able to archive their data within the grant funding period, while 59% were unable to, or were unsure.

As these results indicate, data management and stewardship are still not high priorities at the grant proposal stage. Even when researchers plan for data deposition, they don’t consider the associated costs. And even if they do (hypothetically) have funding specifically for data, the timing may not allow them to use it before the grant expires.

These factors suggest that if funding agencies want to prioritize supporting data stewardship, they should make funds available for this purpose outside the traditional grant structure.

Show me the money

When submitters pay the Dryad Data Publishing Charge themselves, where does that money come from? Are submitters being reimbursed? If so, how/by whom?

Our results showed that, unfortunately, about a quarter of our participants paid their DPCs out of pocket and did not receive any reimbursement. Approximately the same number paid the charge themselves but were reimbursed (by their institution, a grant, or some combination of these), and 37% of DPCs were paid directly by the institution (using an institutional credit card or voucher code).

How was the Dryad DPC paid?

 

Some respondents view self-funding of data publication as worthwhile:

My belief is that scientific data should be publicly available and I am willing to cover the costs myself if supervisors (grant holders) do not.

As long as the cost is reasonable, in the worse case scenario I pay from my pocket. Better the data are safe and easily accessible for years to come than stored in spurious formats and difficult-to-access servers.

But for many others, covering the payment can be a real pain point:

I paid the processing charge myself mainly because our University’s reimbursement process was so laborious, I felt it easier just to get it over and done with myself and absorb the relatively small cost personally.

I just have to beg and plead for funding support each time.

If I am publishing after the postdoc ends then I am no longer paid to work on the project. Since I have had four postdocs, each lasting less than two years, this has happened for all my publications.

Examples from the “other” payment category shown above illustrate the scrappiness of researchers in finding funding:

I paid this from flexible research funds that were recently awarded by my institution. Had that not occurred, I would have had to pay personally and not be reimbursed.

I used my RTF (research trust fund) since I didn’t have dedicated grant funding.

Scavenged money from other projects.

Key takeaways

Our preliminary results show that at a time of more and stronger open data policies, paying for data publication remains far from straightforward, with much of the burden passed along to individual researchers.

Concerns about funding for open data can have real impacts on research availability and publication choice. More than 15% of our participants indicated that they have collected data in the last few years that they have been unable to archive due to lack of funds. Meanwhile, over 40% say that when choosing which journal(s) to submit to, sponsorship of the Dryad DPC does, or at least may, influence their decision.

The good news is that during our 8-month pilot implementation period, the US National Science Foundation sponsored nearly 200 Data Publishing Charges for which researchers would otherwise have been responsible.

We at Dryad are committed to finding and implementing solutions, and very much appreciate the feedback and support we receive from the research and publishing community. Stay tuned for more lessons learned.

Making open data useful: A drug safety case study

We’re pleased to present a guest post from data scientist Juan M. Banda, the lead author of an important, newly-available resource for drug safety research. Here, Juan shares some of the context behind the data descriptor in Scientific Data and associated data package in Dryad. – EH

_____

As I sit in a room full of over one hundred bio-hackers at the 2016 BioHackathon in Tsuruoka, Yamagata, Japan, the need for publicly available and accessible data for research use is acutely evident. Organized by Japan’s National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS), this yearly hackathon gathers people from organizations and universities all over the world, including the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), with the purpose of extending and interlinking resources like PubChem, PhenomeCentral, Bio2RDF, and PubAnnotation.

The end goal: finding better ways to access data that will allow researchers to focus on analysis of the data rather than preparation.

In the same spirit, our publication “A curated and standardized adverse drug event resource to accelerate drug safety research” (doi:10.1038/sdata.2016.26; data in Dryad at http://doi.org/10.5061/dryad.8q0s4) helps researchers in the drug safety domain with the standardization and curation of the freely available data from the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS).

FAERS collects information on adverse events and medication errors reported to the FDA, and comprises over 10 million records collected from 1969 to the present. As one of the most important resources for drug safety efforts, the FAERS database has been used in at least 750 publications indexed by PubMed, and it was probably manipulated, mapped, and cleaned independently by the vast majority of those publications’ authors. This cleaning and mapping process takes a considerable amount of time — hours that could have been spent analyzing the data further.

Our publication aims to eliminate this needless duplication of work and allow researchers to focus their efforts on developing methods to analyze this information.

As part of the Observational Health Data Sciences and Informatics (OHDSI) collaborative, whose mission is to “Improve health, by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care,” we decided to tackle the task of cleaning and curating the FAERS database for our community and the wider drug safety community. By providing a general common data model (CDM) and a general vocabulary to standardize how electronic patient data is stored, OHDSI allows its participants to join a research network with over 655 million patients.

With a significant fraction of the community’s research focused on drug safety, it was a natural decision to standardize the FAERS database with the OMOP vocabulary and give all researchers on our network access to FAERS. Since the OMOP vocabulary incorporates general vocabularies such as SNOMED, MeSH, and RxNorm, among others, the usability of this resource is not limited to participants in this community.

To curate this dataset, we took the source FAERS data in CSV format and de-duplicated case reports. We then performed value imputation for certain missing fields. Drug names were standardized to RxNorm ingredients and standard clinical names (for multi-ingredient drugs). This mapping is tricky because some drug names have spelling errors, and some are non-prescription drugs or international brand names. We achieved coverage of 93% of the drug names, which in turn cover 95% of the case reports in FAERS.
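
As a toy illustration of the kind of cleaning involved (the real pipeline is far more thorough, and the column and file names here are invented for the example):

    import pandas as pd

    # Toy FAERS-style cleaning; column and file names are invented.
    reports = pd.read_csv("faers_demo.csv")

    # 1. De-duplicate case reports, keeping the latest version of each case.
    reports = (reports.sort_values("version")
                      .drop_duplicates(subset="case_id", keep="last"))

    # 2. Standardize free-text drug names via a lookup table built from RxNorm.
    rxnorm_map = pd.read_csv("rxnorm_lookup.csv")  # drug_name_raw -> rxnorm_ingredient
    reports = reports.merge(rxnorm_map, on="drug_name_raw", how="left")

    # Spelling errors, foreign brand names, and OTC products are what keep
    # real-world mapping coverage near 93% rather than 100%.
    coverage = reports["rxnorm_ingredient"].notna().mean()
    print(f"Drug name mapping coverage: {coverage:.0%}")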

For the first time, the indications and reactions have been mapped to SNOMED-CT from their original MedDRA format. Coverage for indications and reactions is around 64% and 80%, respectively. The OMOP vocabulary allows RxNorm drug codes as well as SNOMED-CT codes to reside in the same unified vocabulary space, simplifying use of this resource. We also provide the complete source code we developed, so that researchers can refresh the dataset with each new quarterly FAERS data release and improve the mappings if needed. We encourage users to contribute the results of their efforts back to the OHDSI community.

Built with a firm commitment to making open data easier to use, this resource allows researchers to work from a professionally curated (and refreshable) version of the FAERS data, enabling them to focus on improving drug safety analyses and finding more potentially harmful drugs, as part of OHDSI’s core mission.

Still from OHDSI video

The data:

http://doi.org/10.5061/dryad.8q0s4

A full description of the dataset in Scientific Data:

http://www.nature.com/articles/sdata201626

 

— Juan M. Banda

Sci-Hub stories: Digging into the downloads

The following is a guest post from science journalist John Bohannon. We asked him to give us some background on his recent dataset in Dryad and the analysis of that data in Science. What stories will you find in the data? – EH

_______


Sci-Hub is the world’s largest repository of pirated journal articles. We will probably look back and see it as inevitable. Soon after it became possible for people to share copyrighted music and movies on a massive scale, technologies like Napster and BitTorrent arrived to make the sharing as close to frictionless as possible. That hasn’t made the media industry collapse, as many people predicted, but it certainly brought transformation.

Unlike the media industry, journal publishers do not share their profits with the authors. So where will Sci-Hub push them? Will it be a platform like iTunes, with journals selling research papers for $0.99 each? Or will Sci-Hub finally propel the industry into the arms of the Open Access movement? Will nonprofit scientific societies and university publishers go extinct along the way, leaving just a few giant, for-profit corporations as the caretakers of scientific knowledge?

There are as many theories and predictions about the impact of Sci-Hub as there are commentators on the Internet. What is lacking is basic information about the site. Who is downloading all these Sci-Hub papers? Where in the world are they? What are they reading?

48 hours of Sci-Hub downloads. Each event is color-coded by the local time: orange for working hours (8am-6pm) and blue for the night owls working outside those hours.

Sometimes all you need to do is ask. So I reached out directly to Alexandra Elbakyan, who created Sci-Hub in 2011 as a 22-year-old neuroscience graduate student in Kazakhstan and has run it ever since. For someone denounced as a criminal by powerful corporations and scholarly societies, she was quite open and collaborative. I explained my goal: to let the world see how Sci-Hub is being used, mapping the global distribution of its users at the highest resolution possible while protecting their privacy. She agreed, not realizing how much data-wrangling it would ultimately take us.

Two months later, Science and Dryad are publicly releasing a data set of 28 million download request records from 1 September 2015 through 29 February 2016, timestamped down to the second. Each record includes the DOI of the paper, allowing as rich a bibliographic exploration as you have CPU cycles to burn. The 3 million IP addresses have been converted into arbitrary codes. Elbakyan converted the IP addresses into geolocations using a database I purchased from the company MaxMind. She then clustered each geolocation to the coordinates of the nearest city using the Google Maps API. Sci-Hub users cluster to 24,000 unique locations.

The big take-home? Sci-Hub is everywhere. Most papers are being downloaded from the developing world: The top 3 countries are India, China, and Iran. But the rich industrialized countries use Sci-Hub, too. A quarter of the downloads came from OECD nations, and some of the most intense download hotspots correspond to the campuses of universities in the US and Europe, which supposedly have the most comprehensive journal access.

But these data have many more stories to tell. How do the reading habits of researchers differ by city? What are the hottest research topics in Indonesia, Italy, Brazil? Do the research topics shift when the Sci-Hub night owls take over? My analysis indicates a bimodal distribution over the course of the day, with most locations surging around lunchtime, and the rest peaking at 1am local time. The animated map above shows just 2 days of the data.

Something everyone would like to know: What proportion of downloaded articles are actually unavailable from nearby university libraries? Put another way: What is the size of the knowledge gap that Sci-Hub is bridging?

Download the data yourself and let the world know what you find.
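
If you want a starting point, here is a tiny sketch of a first look in pandas. The column names and file layout are assumptions; check the README that accompanies the Dryad data package.

    import pandas as pd

    # Assumed column layout; consult the dataset's README for the real one.
    cols = ["timestamp", "doi", "user_code", "country", "city", "lat", "lon"]
    downloads = pd.read_csv("scihub_data.tab", sep="\t", names=cols)

    # Ten most-downloaded papers in this slice of the logs
    print(downloads["doi"].value_counts().head(10))

    # Crude working-hours vs. night-owl split by hour of day
    downloads["hour"] = pd.to_datetime(downloads["timestamp"]).dt.hour
    after_hours = downloads[(downloads["hour"] < 8) | (downloads["hour"] >= 18)]
    print(f"Share of downloads outside 8am-6pm: {len(after_hours) / len(downloads):.0%}")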

The data:

http://dx.doi.org/10.5061/dryad.q447c

My analysis of the data in Science:

http://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone

 

 — John Bohannon

2015 stats roundup

While gearing up for the Dryad member meeting (to be held virtually on 24 May – save the date!) and the publication of our annual report, we’re taking a look at last year’s numbers.

2015 was a “big” year for Dryad in many respects. We added staff and integrated several new journals and publishing partners. But perhaps most notably, the Dryad repository itself is growing very rapidly. We published 3,926 data packages this past year — a 44% increase over 2014 — and blew past the 10,000 mark for total data packages in the repository.

Data package size

Perhaps the “biggest” Dryad story from last year is the increase in the mean size of data packages published. In 2014, that figure was 212MB. In 2015, it more than doubled to 481MB, an increase of a whopping 127%.

This striking statistic is part of the reason why, at the beginning of 2016, we doubled the maximum package size before overage fees kick in (to 20GB) and simplified and reduced those fees. We want researchers to continue to archive more (and larger) data files, and to do so sustainably. Meanwhile, we continue to welcome many submissions on the smaller end of the scale.


Distribution of Dryad data package size by year. Boxplot shows median, 1st and 3rd quartiles, and 95% confidence interval of median. Note the log scale of the y-axis.

In 2015, the mean number of files in a data package was about 3.4; the largest package contained 104 files. To see how times have changed, compare this to a post from 2011 (celebrating our 1,000th submission), where we noted:

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte. Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half. As one might expect, many of the files are spreadsheets or in tabular text format. Thus, the files are rich in information but not so difficult to transfer or store.

We have yet to do a full analysis of file formats deposited in 2015, but we see among the largest files many images and videos, as would be expected, but also a notable increase in the diversity of DNA sequencing-related file formats.

So not only are there now more and bigger files in Dryad, there’s also greater complexity and variety. We think this shows that more people are learning about the benefits of archiving and reusing multiple file types, and that researchers (and publishers) are broadening their view of what qualifies as “data.”

Download counts

So who had the biggest download numbers in 2015? Interestingly, nearly all of last year’s most-downloaded data packages are from genetics/genomics. Three of the top five are studies of specific wild populations — sailfin mollies (fish), blue tits (birds), and bighorn sheep — and how they adapt to changing circumstances.

Another top package presents a model for dealing with an epidemic that had a deadly impact on humans in 2015. And rounding out the top 5 is an open source framework for reconstructing the relationships that unite all lineages — a “tree of life.”

In 5th place, with 367 downloads:

In 4th place, with 601 downloads:

In 3rd place, with 1,324 downloads:

In 2nd place, with 1,868 downloads:

And this year’s WINNER, with 2,678 downloads:

The above numbers are presented with the usual caveats about bots, which we aim to filter out, but cannot do with perfect accuracy. (Look for a blog post on this topic in the near future).

As always, we owe a huge debt to our submitters, partners, members and users for supporting Dryad and open data in 2015!

New partnership with The Company of Biologists

We are delighted to announce the launch of a new partnership with The Company of Biologists to support their authors in making the data underlying their research available to the community.

The Company of Biologists is a not-for-profit publishing organization dedicated to supporting and inspiring the biological community. The Company publishes five specialist peer-reviewed journals: Development, Journal of Cell Science, Journal of Experimental Biology, Disease Models & Mechanisms, and Biology Open.

The Company of Biologists offers further support to the biological community by facilitating scientific meetings, providing travel grants for researchers and supporting research societies.

Manuscript submission for all COB journals is now integrated with data submission to Dryad, meaning COB authors can conveniently submit their data packages and manuscripts at the same time. Dryad then makes the data securely available to journal reviewers, and releases them to the public if/when the paper is published.

We congratulate The Company of Biologists on taking this important step to help facilitate open data. To learn more about how your organization or journal can partner with Dryad, please contact us.