Archive for July, 2016


The question of who should pay for the preservation and stewardship of open research data remains unresolved, at a time when journals and funders alike are adopting strong open data policies. As a non-profit repository that relies on financial support from members and users, we at Dryad deal with this question daily, and are eager to help find new and sustainable solutions.

Along these lines, if you submit your data to Dryad, you will soon notice that we ask for information about your grant support. That’s because we’re running a pilot project with the US National Science Foundation (NSF) to test the feasibility of having a funding organization directly sponsor Data Publication Charges (DPCs).

During this pilot implementation, if your research was supported by a grant from the US NSF, and your DPC would not otherwise be waived or sponsored by another organization, this grant information can be used to charge the DPC directly to a fund set aside as part of this project.


Entering grant information at data submission is optional. Nonetheless, we encourage researchers to fill out the funding information in order to benefit from NSF funds, to enable awardees to receive credit from their institutions and funders for the open availability and reuse of the data, and to promote the data’s discoverability.

Direct funder sponsorship of data archiving has some significant features:

Researchers also stand to benefit — they have an interest in seeing their data responsibly curated and preserved, even if they publish and archive data after their grant funds have expired.  And we are excited by the prospect of increasing the proportion of data packages for which the DPC is sponsored or waived (which is currently just over 2/3).

We aim to work out the details of achieving the goals above, and to evaluate any downsides, as part of the pilot. We will also be surveying researchers to better understand what happens when data is not sponsored by a payment plan. From that, we will be able to develop recommendations for what Dryad, funding organizations, and institutions can do to facilitate the DPC payment process for researchers.

We are grateful to the NSF Advances in Bioinformatics program for the supplemental funding behind this project, and we hope that many researchers will take advantage of the opportunity to have their DPC covered by the NSF funds, which will be available at least through February 2017.  Please let me know (at director@datadryad.org) if you have any questions or feedback!


We’re pleased to present a guest post from data scientist Juan M. Banda, the lead author of an important, newly-available resource for drug safety research. Here, Juan shares some of the context behind the data descriptor in Scientific Data and associated data package in Dryad. – EH


As I sit in a room full of over one hundred bio-hackers at the 2016 Biohackathon in Tsuruoka, Yamagata, Japan, the need to have publicly available and accessible data for research use is acutely evident. Organized by Japan’s National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS), this yearly hackathon gathers people from organizations and universities all over the world, including the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), with the purpose of extending and interlinking resources like PubChem, PhenomeCentral, Bio2RDF, and PubAnnotation.

The end goal: finding better ways to access data that will allow researchers to focus on analysis of the data rather than preparation.

In the same spirit, our publication “A curated and standardized adverse drug event resource to accelerate drug safety research” (doi:10.1038/sdata.2016.26; data in Dryad at http://doi.org/10.5061/dryad.8q0s4) helps researchers in the drug safety domain by standardizing and curating the freely available data from the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS).

FAERS collects information on adverse events and medication errors reported to the FDA, and comprises over 10 million records collected from 1969 to the present. As one of the most important resources for drug safety efforts, the FAERS database has been used in at least 750 publications as reported by PubMed, and was probably manipulated, mapped, and cleaned independently by the vast majority of the authors of those publications. This cleaning and mapping process takes a considerable amount of time: hours that could have been spent analyzing the data further.

Our publication aims to eliminate this needless work and allow researchers to focus their efforts on developing methods to analyze this information.

As part of Observational Health Data Sciences and Informatics (OHDSI), whose mission is to “Improve health, by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care,” we decided to tackle the task of cleaning and curating the FAERS database for our community, and the wider drug safety community. By providing a general common data model (CDM) and a general vocabulary to standardize how electronic patient data is stored, OHDSI allows its participants to join a research network with over 655 million patients.

With a significant fraction of the community’s research being focused on drug safety, it was a natural decision to standardize the FAERS database with the OMOP vocabulary, to allow all researchers on our network access to FAERS. Since the OMOP vocabulary incorporates general vocabularies such as SNOMED, MeSH, and RxNorm, among others, the usability of this resource is not limited to participants of this community.

In order to curate this dataset, we took the source FAERS data in CSV format and de-duplicated case reports. We then performed value imputation for certain fields that were missing. Drug names were standardized to RxNorm ingredients and standard clinical names (for multi-ingredient drugs). This mapping is tricky because some drug names have spelling errors, and some are non-prescription drugs or international brand names. We achieved coverage of 93% of the drug names, which in turn cover 95% of the case reports in FAERS.
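To make the difference between name-level and report-level coverage concrete, here is a minimal Python sketch of the de-duplication and mapping steps described above. It is an illustration only, not the code released with the dataset: the toy records, the `INGREDIENT_MAP` lookup, and the function names are all hypothetical, and each case is simplified to a single drug, whereas real FAERS case reports can list several.

```python
from collections import Counter

# Hypothetical toy records: (case_id, report_version, raw_drug_name).
# Real FAERS files have many more fields and several drugs per case.
reports = [
    ("c1", 1, "Asprin"),
    ("c1", 2, "ASPIRIN "),
    ("c2", 1, "ibuprofen"),
    ("c3", 1, "mystery tonic"),
    ("c4", 1, "aspirin"),
]

# Hypothetical lookup from a normalized raw name to a standard ingredient.
# A real mapping would target RxNorm concepts rather than plain strings.
INGREDIENT_MAP = {
    "aspirin": "aspirin",
    "asprin": "aspirin",  # common misspelling
    "ibuprofen": "ibuprofen",
}


def normalize(name):
    """Lowercase a raw drug name and collapse surrounding/internal whitespace."""
    return " ".join(name.lower().split())


def deduplicate(records):
    """Keep only the highest-version row for each case id."""
    latest = {}
    for case_id, version, drug in records:
        if case_id not in latest or version > latest[case_id][0]:
            latest[case_id] = (version, drug)
    return {case_id: drug for case_id, (_, drug) in latest.items()}


def coverage(cases):
    """Return (share of distinct names mapped, share of case reports mapped)."""
    names = Counter(normalize(drug) for drug in cases.values())
    mapped = [name for name in names if name in INGREDIENT_MAP]
    name_coverage = len(mapped) / len(names)
    report_coverage = sum(names[name] for name in mapped) / sum(names.values())
    return name_coverage, report_coverage


cases = deduplicate(reports)
name_cov, report_cov = coverage(cases)
print(f"distinct names mapped: {name_cov:.0%}, case reports covered: {report_cov:.0%}")
```

In this toy example the unmapped name appears in only one report, so report-level coverage comes out higher than name-level coverage. The same effect can arise in real data when the names that fail to map are the rare ones.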

For the first time, indications and reactions have been mapped to SNOMED-CT from their original MedDRA format. Coverage for indications and reactions is around 64% and 80%, respectively. The OMOP vocabulary allows RxNorm drug codes as well as SNOMED-CT codes to reside in the same unified vocabulary space, simplifying use of this resource. We also provide the complete source code we developed, so that researchers can refresh the dataset with the new quarterly FAERS data releases and improve the mappings if needed. We encourage users to contribute the results of their efforts back to the OHDSI community.

With a firm commitment to making open data easier to use, this resource allows researchers to utilize a professionally curated (and refreshable) version of the FAERS data, enabling them to focus on improving drug safety analyses and finding more potentially harmful drugs, as a part of OHDSI’s core mission.


Still from OHDSI video

The data: http://doi.org/10.5061/dryad.8q0s4


A full description of the dataset in Scientific Data: doi:10.1038/sdata.2016.26



— Juan M. Banda
