Archive for the ‘Data reuse’ Category

We present a guest post from researcher Falk Lüsebrink highlighting the benefits of data sharing. Falk is currently working on his PhD in the Department of Biomedical Magnetic Resonance at the Otto-von-Guericke University in Magdeburg, Germany. Here, he talks about his experience of sharing early MRI data and the unexpected impact that it is having on the research community.

Early release of data

The first time I faced a decision about publishing my own data was while writing a grant proposal. One of our proposed objectives was to acquire ultrahigh resolution brain images in vivo, making use of an innovative development: a combination of an MR scanner with ultrahigh field strength and a motion correction setup to remediate subject motion during data acquisition. While waiting for the funding decision, I simply could not resist acquiring a first dataset. We scanned a highly experienced subject for several hours, allowing us to acquire in vivo images of the brain with a resolution far beyond anything achieved thus far.

MRI data showing the cerebellum in vivo at (a) neuroscientific standard resolution of 1 mm, (b) our highest achieved resolution of 250 µm, and (c) state-of-the-art 500 µm resolution.

When our colleagues saw the initial results, they encouraged us to share the data as soon as possible. Through Scientific Data and Dryad, we were able to do just that. The combination of a peer-reviewed open access journal and an open access digital repository for the data was perfect for presenting our initial results.

17,000 downloads and more

‘Sharing the wealth’ seems to have been the right decision; in the three months since we published our data, there has been an enormous amount of activity:

A distinct need for data re-use

MRI studies are highly interdisciplinary, opening up numerous opportunities for sharing and re-using data. For example, our data might be used to build MR brain atlases and illustrate brain structures in much greater detail, or even for the first time. This could advance our understanding of brain functions. Algorithms used to quantify brain structures needed in the research of neurodegenerative disorders could be enhanced, increasing accuracy and reproducibility. Furthermore, by making available raw signals measured by the MR scanner, image reconstruction methods could be used to refine image quality or reduce the time it takes to collect the data.

There are also opportunities beyond those that our particular dataset offers. A recently emerging trend in MRI comes from the field of machine learning. Neural networks are being built to perform and potentially improve all kinds of tasks, from image reconstruction to image processing and even diagnostics. To train such networks, huge amounts of data are necessary; these data could come from repositories open to the public. Such re-use of MRI data by researchers in other disciplines is having a strong impact on the advancement of science. By publicly sharing our data, we are allowing others to pursue new and exciting directions.

Download the data for yourself and see what you can do with it. In the meantime, I am still eagerly awaiting the acceptance of the grant application . . . but that’s a different story.

The data: http://dx.doi.org/10.5061/dryad.38s74

The article: http://dx.doi.org/10.1038/sdata.2017.32

— Falk Lüsebrink


We’re pleased to present a guest post from data scientist Juan M. Banda, the lead author of an important, newly-available resource for drug safety research. Here, Juan shares some of the context behind the data descriptor in Scientific Data and associated data package in Dryad. – EH


As I sit in a room full of over one hundred bio-hackers at the 2016 Biohackathon in Tsuruoka, Yamagata, Japan, the need to have publicly available and accessible data for research use is acutely evident. Organized by Japan’s National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS), this yearly hackathon gathers people from organizations and universities all over the world, including the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), with the purpose of extending and interlinking resources like PubChem, PhenomeCentral, Bio2RDF, and PubAnnotation.

The end goal: finding better ways to access data that will allow researchers to focus on analysis of the data rather than preparation.

In the same spirit, our publication “A curated and standardized adverse drug event resource to accelerate drug safety research” (doi:10.1038/sdata.2016.26; data in Dryad at http://doi.org/10.5061/dryad.8q0s4) helps researchers in the drug safety domain with the standardization and curation of the freely available data from the U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS).

FAERS collects information on adverse events and medication errors reported to the FDA, and comprises over 10 million records collected from 1969 to the present. As one of the most important resources for drug safety efforts, the FAERS database has been used in at least 750 publications as reported by PubMed and was probably manipulated, mapped and cleaned independently by the vast majority of the authors of said publications. This cleaning and mapping process takes a considerable amount of time, hours that could have been spent analyzing the data further.

Our publication hopes to eliminate this needless work and allow researchers to focus their efforts on developing methods to analyze this information.

As part of Observational Health Data Sciences and Informatics (OHDSI), whose mission is to “Improve health, by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care,” we decided to tackle the task of cleaning and curating the FAERS database for our community, and the wider drug safety community. By providing a general common data model (CDM) and a general vocabulary to standardize how electronic patient data is stored, OHDSI allows its participants to join a research network with over 655 million patients.

With a significant fraction of the community’s research being focused on drug safety, it was a natural decision to standardize the FAERS database with the OMOP vocabulary, to allow all researchers on our network access to FAERS. Since the OMOP vocabulary incorporates general vocabularies such as SNOMED, MeSH, and RxNorm, among others, the usability of this resource is not limited to participants of this community.

To curate this dataset, we took the source FAERS data in CSV format and de-duplicated case reports. We then performed value imputation for certain fields that were missing. Drug names were standardized to RxNorm ingredients and standard clinical names (for multi-ingredient drugs). This mapping is tricky because some drug names have spelling errors, and some are non-prescription drugs or international brand names. We achieved coverage of 93% of the drug names, which in turn cover 95% of the case reports in FAERS.

For the first time, the indications and reactions have been mapped to SNOMED-CT from their original MedDRA format. Coverage for indications and reactions is around 64% and 80%, respectively. The OMOP vocabulary allows RxNorm drug codes as well as SNOMED-CT codes to reside in the same unified vocabulary space, simplifying use of this resource. We also provide the complete source code we developed in order to allow researchers to refresh the dataset with the new quarterly FAERS data releases and improve the mappings if needed. We encourage users to contribute the results of their efforts back to the OHDSI community.
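To make the curation steps described above more concrete, here is a minimal Python sketch of de-duplicating case reports and measuring mapping coverage. It is not the published pipeline: the field names (case_id, version, drug_name), the "keep the highest version" rule, and the tiny lookup table are all assumptions made for illustration.

```python
# Illustrative sketch only: field names, the lookup table, and the
# "keep the highest version" rule are assumptions, not the published code.

def deduplicate(reports):
    """Keep one report per case_id, preferring the highest version number."""
    latest = {}
    for report in reports:
        current = latest.get(report["case_id"])
        if current is None or report["version"] > current["version"]:
            latest[report["case_id"]] = report
    return list(latest.values())


def normalize(name):
    """Lower-case the name and collapse whitespace before lookup."""
    return " ".join(name.lower().split())


def mapping_coverage(reports, ingredient_lookup):
    """Return (share of distinct drug names mapped, share of reports mapped)."""
    names = {normalize(r["drug_name"]) for r in reports}
    mapped_names = {n for n in names if n in ingredient_lookup}
    mapped_reports = [
        r for r in reports if normalize(r["drug_name"]) in mapped_names
    ]
    return len(mapped_names) / len(names), len(mapped_reports) / len(reports)


# Hypothetical case reports; case 1 appears twice in different versions.
reports = [
    {"case_id": 1, "version": 1, "drug_name": "Aspirin "},
    {"case_id": 1, "version": 2, "drug_name": "ASPIRIN"},
    {"case_id": 2, "version": 1, "drug_name": "ibuprofen"},
    {"case_id": 3, "version": 1, "drug_name": "ibuprofin"},  # misspelled, unmapped
    {"case_id": 4, "version": 1, "drug_name": "Ibuprofen"},
]
lookup = {"aspirin": "aspirin", "ibuprofen": "ibuprofen"}  # hypothetical mapping

unique_reports = deduplicate(reports)
name_coverage, report_coverage = mapping_coverage(unique_reports, lookup)
print(f"names mapped: {name_coverage:.0%}, reports mapped: {report_coverage:.0%}")
```

In this toy example, two of three distinct names are mapped but three of four reports are covered, which illustrates why coverage of case reports can exceed coverage of drug names: the unmapped names tend to be rare spellings.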

With a firm commitment to making open data easier to use, this resource allows researchers to utilize a professionally curated (and refreshable) version of the FAERS data, enabling them to focus on improving drug safety analyses and finding more potentially harmful drugs, as a part of OHDSI’s core mission.


Still from OHDSI video

The data: http://doi.org/10.5061/dryad.8q0s4

A full description of the dataset in Scientific Data: doi:10.1038/sdata.2016.26


— Juan M. Banda


While gearing up for the Dryad member meeting (to be held virtually on 24 May – save the date!) and publication of our annual report, we’re taking a look at last year’s numbers.

2015 was a “big” year for Dryad in many respects. We added staff, and integrated several new journals and publishing partners. But perhaps most notably, the Dryad repository itself is growing very rapidly. We published 3,926 data packages this past year — a 44% increase over 2014 — and blew past the 10,000 mark for total data packages in the repository.

Data package size

Perhaps the “biggest” Dryad story from last year is the increase in the mean size of data packages published. In 2014, that figure was 212MB. In 2015, it more than doubled to 481MB, an increase of a whopping 127%.

This striking statistic is part of the reason we opted at the beginning of 2016 to double the maximum package size before overage fees kick in (to 20GB), and to simplify and reduce our overage fees. We want researchers to continue to archive more (and larger) data files, and to do so sustainably. Meanwhile, we do continue to welcome many submissions on the smaller end of the scale.
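For readers who want to check the arithmetic, the growth in mean package size quoted above can be reproduced with a few lines of Python. The function and variable names are ours; only the two mean sizes come from the post.

```python
def percent_increase(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


mean_2014_mb = 212  # mean data package size in 2014, from the post
mean_2015_mb = 481  # mean data package size in 2015, from the post

growth = percent_increase(mean_2014_mb, mean_2015_mb)
print(f"{growth:.0f}%")  # prints "127%"
```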


Distribution of Dryad data package size by year. Boxplot shows median, 1st and 3rd quartiles, and 95% confidence interval of median. Note the log scale of the y-axis.

In 2015, the mean number of files in a data package was about 3.4, with 104 as the largest number of files in any data package. To see how times have changed, compare this to a post from 2011 (celebrating our 1,000th submission), where we noted:

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte. Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half. As one might expect, many of the files are spreadsheets or in tabular text format. Thus, the files are rich in information but not so difficult to transfer or store.

We have yet to do a full analysis of file formats deposited in 2015, but we see among the largest files many images and videos, as would be expected, but also a notable increase in the diversity of DNA sequencing-related file formats.

So not only are there now more and bigger files in Dryad, there’s also greater complexity and variety. We think this shows that more people are learning about the benefits of archiving and reusing multiple file types, and that researchers (and publishers) are broadening their view of what qualifies as “data.”

Download counts

So who had the biggest download numbers in 2015? Interestingly, nearly all of last year’s most-downloaded data packages are from genetics/genomics. 3 of the top 5 are studies of specific wild populations and how they adapt to changing circumstances: sailfin mollies (fish), blue tits (birds), and bighorn sheep.

Another top package presents a model for dealing with an epidemic that had a deadly impact on humans in 2015. And rounding out the top 5 is an open source framework for reconstructing the relationships that unite all lineages — a “tree of life.”

In 5th place, with 367 downloads:

In 4th place, with 601 downloads:

In 3rd place, with 1,324 downloads:

In 2nd place, with 1,868 downloads:

And this year’s WINNER, with 2,678 downloads:

The above numbers are presented with the usual caveats about bots, which we aim to filter out but cannot do with perfect accuracy. (Look for a blog post on this topic in the near future.)

As always, we owe a huge debt to our submitters, partners, members and users for supporting Dryad and open data in 2015!


The reason why Dryad is in the business of archiving, preserving, and providing access to research data is so that it will be reused, whether for deeper reading of the publication, for post-publication review, for education, or for future research. While it’s not yet as easy as we would like to track data reuse, one metric that is straightforward to collect is the number of times a dataset has been downloaded, and this is one of two data reuse statistics reported by our friends at ImpactStory and Plum Analytics.


The numbers are very encouraging. There are already over a quarter million downloads for the 8,897 data files released in 2014 (from 2,714 data packages). That’s over 28 downloads per data file. While there is always the caveat that some downloads may be due to activity from newly emerged bots that we have yet to recognize and filter out, we think it is safe to say that most of these downloads are from people.

To celebrate, we would like to pay special tribute to the top five data packages from 2014, as measured by the maximum number of downloads for any single file (since many data packages have more than one) at the time of writing. They cover a diversity of topics from livestock farming in the Paleolithic to phylogenetic relationships among insects. That said, we are struck by the impressively strong showing for plant science — 3 of the top 5 data packages.

In 5th place, with 453 downloads

In 4th place, with 581 downloads

In 3rd place, with 626 downloads

In 2nd place, with 4,672 downloads

And in 1st place, with a staggering 34,879 downloads

Remarkably, given the number of downloads, this last data package was only released in November.
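The ranking criterion used above, the maximum number of downloads for any single file in a data package, can be sketched in Python as follows. The package names and per-file counts are hypothetical placeholders, not Dryad's actual records, and this is only one plausible way to compute such a ranking.

```python
# Hypothetical per-file download counts, keyed by a made-up package name.
downloads_per_file = {
    "package_a": [120, 453, 30],
    "package_b": [581],
    "package_c": [10, 20, 626],
    "package_d": [400, 5],
}


def top_packages(downloads, n=5):
    """Rank packages by the largest download count of any single file."""
    ranked = sorted(
        downloads.items(), key=lambda item: max(item[1]), reverse=True
    )
    return [(name, max(counts)) for name, counts in ranked[:n]]


ranking = top_packages(downloads_per_file, n=3)
for place, (name, count) in enumerate(ranking, start=1):
    print(f"{place}. {name}: {count} downloads")
```

Ranking by the single most-downloaded file, rather than by the sum over files, avoids favoring packages simply because they contain many files.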

We’d like to thank all of our users, whether you contribute data or reuse it (or both), for helping make science just a little more transparent, efficient, and robust this past year. And we are looking forward to finding out some more of what you did with all those downloads in 2015!




