Most popular data from 2018

As we begin a new year and celebrate the major milestone of more than 25,000 data packages published, it’s a great time to highlight the value for re-use of the scholarly resources that are openly available and licensed in Dryad. 

So, which data packages published in 2018 have received the most downloads? Here are some at the top of the list.

Whale songs

Stafford et al (2018) Extreme diversity in the songs of Spitsbergen’s bowhead whales 

Here’s a lovely example of “data” that can have uses well beyond research. We’d love to know what people might be doing with these audio files. Meditating to them? Incorporating them into musical compositions?


All about the data

It’s perhaps not surprising that Dryad data packages associated with Scientific Data get a lot of downloads, as it is a journal specifically for “descriptions of scientifically valuable datasets, and research that advances the sharing and reuse of scientific data.” These three resources are proving especially popular:

  • Bennett et al (2018) GlobTherm, a global database on thermal tolerances for aquatic and terrestrial organisms
  • Faraut et al (2018) Dataset of human medial temporal lobe single neuron activity during declarative memory encoding and recognition 
  • Kummu et al (2018) Gridded global datasets for Gross Domestic Product and Human Development Index over 1990-2015 


Avian functional traits

Storchová L, Hořák D (2018) Life-history characteristics of European birds

This is an example of a dataset compiled specifically for re-use. According to the authors, “Recently, functional aspects of avian diversity have been used frequently in comparative analyses as well as in community ecology studies; thus, open access to complete datasets of traits will be valuable.” To make the data as useful as possible, they included a broad spectrum of traits and provided the file in an accessible format: ASCII text, tab delimited, not compressed. Given the large number of downloads, it has indeed proven valuable!
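Part of what makes a tab-delimited ASCII file so reusable is that it can be read with nothing more than a standard library. Here is a minimal sketch in Python; the column names and the two sample rows are made up for illustration and are not taken from the actual Dryad file.

```python
import csv
import io

# Made-up excerpt for illustration only; the real file's columns will differ.
SAMPLE = (
    "Species\tClutch_size\tMigratory\n"
    "Species A\t5\t1\n"
    "Species B\t9\t0\n"
)


def read_traits(handle):
    """Parse a tab-delimited trait table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_traits(io.StringIO(SAMPLE))

# Every value arrives as a string, so the comparison is against "1", not 1.
migratory = [row["Species"] for row in rows if row["Migratory"] == "1"]
print(migratory)
```

For a file on disk, the same function can be given `open(path, newline="", encoding="ascii")` instead of the in-memory sample.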

Improving clinical research transparency

Kilicoglu et al (2018) Automatic recognition of self-acknowledged limitations in clinical research literature

Here’s another dataset created for the purpose of improving research — in this case, reporting of limitations in clinical studies. The machine-learning techniques tested here can be incorporated into the workflows of other projects, to support efforts in increasing transparency.


Huge thanks are due to researchers who take the time and effort to publish their data, to the journals who support them in doing so (including those highlighted above), and to the Dryad member organizations who make it all possible. Here’s to the next 25,000, and the millions of downloads they will produce!


Dryad celebrates international data

There’s been important discussion lately about how to make research more inclusive, equitable, diverse, and global. See the recent 2018 International Open Access Week, and International Data Week, happening now in Gaborone, Botswana, with the theme “Digital Frontiers of Global Science.”

Dryad is among the organizations seeking to provide sustainable, open scholarly infrastructure that is accessible to all. As such, we use the CC0 license exclusively, and offer fee waivers for researchers based in countries classified by the World Bank as low-income or lower-middle-income economies. Our burgeoning partnership with California Digital Library promises to make data publishing even easier for all researchers.

In celebration of a global perspective, the Dryad curation team has selected a few data packages that highlight both a wide geographic range and a collaborative approach to research projects.

Penguin imaging and classification in Antarctica


Data from: Time-lapse imagery and volunteer classifications from the Zooniverse Penguin Watch project / associated article in Scientific Data

Data from: A remote-controlled observatory for behavioural and ecological research: a case study on emperor penguins / associated article in Methods in Ecology and Evolution

Antarctica may be a fine spot for penguins, but the cold conditions make it an inhospitable location for human beings to spend long periods. It is especially challenging for scientists engaged in gathering data under the frigid conditions and for their equipment. Two recent Dryad data packages highlight how scientists have addressed this chilly challenge with the use of remote observation systems. One provides data from a remote‐controlled system designed for information gathering, and the other employs citizen science to process large numbers of time-lapse images gathered remotely from an automated system.

The images that comprise the data from the Zooniverse project Penguin Watch are much more than just cool photos of penguins. They are the result of automated time-lapse cameras used for reliably and consistently monitoring wild penguin populations. The data includes 73,802 photos captured by 15 different Penguin Watch cameras, and the authors expressed the hope that annotated time-lapse imagery can be used to train machine learning algorithms to extract data automatically and perhaps for computer vision development.

The video and images from Richter et al. were taken by a self-sufficient remote-controlled observatory designed to operate year-round in extreme cold-weather conditions. The observatory has been capturing high-resolution images of penguins, along with other data, since 2013 using “multiple overview cameras and a high-resolution steerable camera with a telephoto lens.” The resulting images and video provide information on the life cycle, demographics, and behavior of the animals. For example, the dataset shows how the movement of penguins as individuals and as a group might be associated with the speed and direction of the wind.

Both datasets show how remote observation systems can be used by human investigators in various locations to collect data on animal populations, even in areas of the world which provide challenges to scientists.

— Debra Fagan

Collaborating across disciplines in Indonesia


Data from: Competing for blood: the ecology of parasite resource competition in human malaria-helminth co-infections / associated article in Ecology Letters

An international team of researchers reveals new knowledge about “co-infections,” multiple infectious diseases that attack the immune system at once. Budischak et al. (2018) used principles of ecological theory to answer questions about helminth-malaria co-infection in human hosts. Rather than measuring prevalence of malaria after deworming, as previous studies had done with varied results, Budischak et al. measured the density of specific species within an individual over time.

The researchers hypothesized that competition for resources, in this case red blood cells, would have an effect on the density of those species within the host. They analyzed data and samples originally collected for a 2-year placebo-controlled deworming trial in Indonesia, and found that when bloodsucking helminth species were removed, the density of Plasmodium vivax, which relies specifically on young red blood cells, increased 2.75-fold. This increase is enough to adversely affect the health of an individual, and heighten the chances that mosquitoes will transmit the P. vivax from one individual to another.

The researchers suggest that where resources allow, health care providers should consider the specific species that are co-infecting an individual, and weigh the cost-benefits of deworming at that time. These findings lay the groundwork for novel treatments of malaria and worm infections.

— Erin Clary

Assessing the potential of environmental citizen science in East Africa


Data from: Developing the global potential of citizen science: Assessing opportunities that benefit people, society and the environment in East Africa / associated article in the Journal of Applied Ecology

Citizen science projects often suffer from limited visibility in developing countries. Recognizing this difficulty, these authors undertook a collaborative process with experts to assess the potential for environmental citizen science in East Africa. The .csv file published in Dryad contains scores given by workshop participants in relation to various opportunities, benefits and barriers, which serve as the basis for principles that are applicable more widely.

Importantly, the project emphasizes the benefits of citizen science not just to the natural environment, but for creating a more informed and empowered populace.

Fighting lupus in Latin America


Data from: First Latin American clinical practice guidelines for the treatment of systemic lupus erythematosus / associated article in Annals of the Rheumatic Diseases

Dryad recently published data underlying collaborative research by the Latin American Group for the Study of Lupus (GLADEL) and the Pan-American League of Associations of Rheumatology (PANLAR). Both groups consisted of experienced Latin American rheumatologists who gathered together in Panama City to discuss special problems faced by patients with systemic lupus erythematosus (SLE) in Latin America.

The group started the research process by putting together a list of questions addressing clinical issues most commonly seen in Latin American patients. The team used the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system to answer these questions with the best available evidence. Summarized preliminary findings were used to develop a framework for therapies and treatments. The underlying dataset published by Dryad consists of tables describing the groups’ main findings of therapeutic interventions by organ/systems in SLE using the GRADE approach.

This dataset has potential for reuse and would be an excellent resource for the study of lupus, in the hope of improving outcomes in Latin America and worldwide.

— Shavon Stewart

And Now, the Numbers . . .

As the new year begins, we take note of the increasing diversity of fields represented in data archived at Dryad and review the numbers for 2016.

Dryad Grows into a General Repository

We are excited to see Dryad’s role in the preservation of data expand into new areas and fields in 2016. Researchers submitted more data involving human subjects and data from social media. In addition, a quick look at our most popular data shows that two of the top five downloaded packages were from the fields of cardiology and science journalism. While Dryad’s origins are in the life sciences, it is increasingly being used as a general repository for data from a myriad of fields.

Let’s take a look at the numbers for 2016:

Increase in Number of Data Packages and Data Files

Our curators were busy! The total number of published data packages (sets of data files associated with a publication) at the end of the year was a whopping 15,325. Our curators meticulously archived 4,307 packages, a 10% increase from 2015. The size of data packages also continued to grow – from an average of 481MB to an average of 573MB, an increase of about 20%.

At the end of 2016, we were closing in on 50,000 archived data files; by January of this year, we passed that mark.

In a future blog, we’ll talk about the integration of new journals into the Dryad submission process, new members, and new partnerships. For now, we’ll just note that there was a 22% increase in the number of journals that have data in Dryad linking back to the article.

New Fields

We’ve seen a significant uptick in human subjects data and social media data this year, which has prompted us to develop an FAQ on cleaning and de-identification of human subjects data for public access. As the idea of what data should be preserved continues to broaden, submissions of these kinds of data will only increase. We’ll keep you updated about this trend in future blogs.

Top Downloads

Let’s take a look at the most popular data published in 2016, in terms of downloads. The top 5 downloads include data on plant genetics, the early history of ray-finned fishes, and, not surprisingly in this age, the effects of climate change on boreal forests.

Also of interest are data from an article in Science evaluating how people make use of Sci-Hub, an open source scholarly library. Our guest blog on these data by science journalist John Bohannon generated a lot of interest this year and was one of our most popular blog posts ever.

Another significant development in 2016 came from the medical sciences. A comparison of coronary diagnostic techniques marked Dryad’s first submission from one of the top five cardiology journals, JACC: Cardiovascular Interventions.

The fact that 2 of the 5 top downloads come from fields outside of life sciences clearly indicates that data in Dryad now cover a broad range of fields.

Top 5 Downloads of Data Archived in 2016

Article | Number of Downloads
Wagner MR et al. (2016) Host genotype and age shape the leaf and root microbiomes of a wild perennial plant. Nature Communications 7: 12151. | 3123
Bohannon J et al. (2016) Who’s downloading pirated papers? Everyone. Science 352(6285): 508-512. | 2969
D’Orangeville L et al. (2016) Northeastern North America as a potential refugium for boreal forests in a warming climate. Science 352(6292): 1452-1455. | 741
Johnson NP et al. (2016) Continuum of vasodilator stress from rest to contrast medium to adenosine hyperemia for fractional flow reserve assessment. JACC: Cardiovascular Interventions 9(8): 757-767. | 453
Lu J et al. (2016) The oldest actinopterygian highlights the cryptic early history of the hyperdiverse ray-finned fishes. Current Biology 26(12): 1602–1608. | 423

Overall, we’ve had a great year and are delighted to be seeing a broader range of data from an increasing number of journals and fields. Thanks to our Board of Directors, members, and of course our staff for providing their support to make 2016 a notable year for Dryad!

Making open data useful: A drug safety case study

We’re pleased to present a guest post from data scientist Juan M. Banda, the lead author of an important, newly-available resource for drug safety research. Here, Juan shares some of the context behind the data descriptor in Scientific Data and associated data package in Dryad. – EH


As I sit in a room full of over one hundred bio-hackers at the 2016 Biohackathon in Tsuruoka, Yamagata, Japan, the need to have publicly available and accessible data for research use is acutely evident. Organized by Japan’s National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS), this yearly hackathon gathers people from organizations and universities all over the world, including the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), with the purpose of extending and interlinking resources like PubChem, PhenomeCentral, Bio2RDF, and PubAnnotation.

The end goal: finding better ways to access data that will allow researchers to focus on analysis of the data rather than preparation.

In the same spirit, our publication “A curated and standardized adverse drug event resource to accelerate drug safety research” (doi:10.1038/sdata.2016.26; data in Dryad) helps researchers in the drug safety domain by standardizing and curating the freely available data from the U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS).

FAERS collects information on adverse events and medication errors reported to the FDA, and comprises over 10 million records collected from 1969 to the present. As one of the most important resources for drug safety efforts, the FAERS database has been used in at least 750 publications as reported by PubMed and was probably manipulated, mapped and cleaned independently by the vast majority of the authors of said publications. This cleaning and mapping process takes a considerable amount of time: hours that could have been spent analyzing the data further.

Our publication hopes to eliminate this needless work and allow researchers to focus their efforts in developing methods to analyze this information.

As part of the Observational Health Data Sciences and Informatics (OHDSI) initiative, whose mission is to “Improve health, by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care,” we decided to tackle the task of cleaning and curating the FAERS database for our community, and the wider drug safety community. By providing a general common data model (CDM) and a general vocabulary to standardize how electronic patient data is stored, OHDSI allows its participants to join a research network with over 655 million patients.

With a significant fraction of the community’s research being focused on drug safety, it was a natural decision to standardize the FAERS database with the OMOP vocabulary, to allow all researchers on our network access to FAERS. Since the OMOP vocabulary incorporates general vocabularies such as SNOMED, MeSH, and RxNorm, among others, the usability of this resource is not limited to participants of this community.

In order to curate this dataset, we took the source FAERS data in CSV format and de-duplicated case reports. We then performed value imputation for certain fields that were missing. Drug names were standardized to RxNorm ingredients and standard clinical names (for multi-ingredient drugs). This mapping is tricky because some drug names have spelling errors, and some are non-prescription drugs, or international brand names. We achieved coverage of 93% of the drug names, which in turn cover 95% of the case reports in FAERS.
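To give a feel for the kind of work involved, here is a minimal Python sketch of two of the steps described above: de-duplicating case reports and mapping free-text drug names to a standard ingredient. It is not the pipeline released with the publication. The field names (`case_id`, `version`, `drug_name`), the lookup table, and the sample records are all hypothetical, and real drug-name mapping needs far more than exact matching after whitespace and case cleanup.

```python
def deduplicate(reports):
    """Keep only the highest-numbered version of each case (hypothetical fields)."""
    latest = {}
    for report in reports:
        current = latest.get(report["case_id"])
        if current is None or report["version"] > current["version"]:
            latest[report["case_id"]] = report
    return list(latest.values())


def normalize(name):
    """Lower-case a drug name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def map_drugs(reports, lookup):
    """Attach a standard ingredient to each report; return the share of reports mapped."""
    mapped = 0
    for report in reports:
        report["ingredient"] = lookup.get(normalize(report["drug_name"]))
        if report["ingredient"] is not None:
            mapped += 1
    return mapped / len(reports) if reports else 0.0


# Hypothetical sample records and lookup table, for illustration only.
reports = [
    {"case_id": 1, "version": 1, "drug_name": "Aspirin "},
    {"case_id": 1, "version": 2, "drug_name": "  ASPIRIN"},
    {"case_id": 2, "version": 1, "drug_name": "asprin 100mg tab"},
]
lookup = {"aspirin": "aspirin"}

deduped = deduplicate(reports)
coverage = map_drugs(deduped, lookup)
print(len(deduped), coverage)
```

The misspelled "asprin 100mg tab" stays unmapped in this sketch, which is exactly the kind of case that makes the real mapping tricky.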

For the first time, indications and reactions have been mapped to SNOMED-CT from their original MedDRA format. Coverage for indications and reactions is around 64% and 80%, respectively. The OMOP vocabulary allows RxNorm drug codes as well as SNOMED-CT codes to reside in the same unified vocabulary space, simplifying use of this resource. We also provide the complete source code we developed in order to allow researchers to refresh the dataset with the new quarterly FAERS data releases and improve the mappings if needed. We encourage users to contribute the results of their efforts back to the OHDSI community.

With a firm commitment to making open data easier to use, this resource allows researchers to utilize a professionally curated (and refreshable) version of the FAERS data, enabling them to focus on improving drug safety analyses and finding more potentially harmful drugs, as a part of OHDSI’s core mission.


Still from OHDSI video

The data:

A full description of the dataset in Scientific Data:


— Juan M. Banda

Sci-Hub stories: Digging into the downloads

The following is a guest post from science journalist John Bohannon. We asked him to give us some background on his recent dataset in Dryad and the analysis of that data in Science. What stories will you find in the data? – EH



Sci-Hub is the world’s largest repository of pirated journal articles. We will probably look back and see it as inevitable. Soon after it became possible for people to share copyrighted music and movies on a massive scale, technologies like Napster and BitTorrent arrived to make the sharing as close to frictionless as possible. That hasn’t made the media industry collapse, as many people predicted, but it certainly brought transformation.

Unlike the media industry, journal publishers do not share their profits with the authors. So where will Sci-Hub push them? Will it be a platform like iTunes, with journals selling research papers for $0.99 each? Or will Sci-Hub finally propel the industry into the arms of the Open Access movement? Will nonprofit scientific societies and university publishers go extinct along the way, leaving just a few giant, for-profit corporations as the caretakers of scientific knowledge?

There are as many theories and predictions about the impact of Sci-Hub as there are commentators on the Internet. What is lacking is basic information about the site. Who is downloading all these Sci-Hub papers? Where in the world are they? What are they reading?

48 hours of Sci-Hub downloads. Each event is color-coded by the local time: orange for working hours (8am-6pm) and blue for the night owls working outside those hours.

Sometimes all you need to do is ask. So I reached out directly to Alexandra Elbakyan, who created Sci-Hub in 2011 as a 22-year-old neuroscience graduate student in Kazakhstan and has run it ever since. For someone denounced as a criminal by powerful corporations and scholarly societies, she was quite open and collaborative. I explained my goal: To let the world see how Sci-Hub is being used, mapping the global distribution of its users at the highest resolution possible while protecting their privacy. She agreed, not realizing how much data-wrangling it would ultimately take us.

Two months later, Science and Dryad are publicly releasing a data set of 28 million download request records from 1 September 2015 through 29 February 2016, timestamped down to the second. Each includes the DOI of the paper, allowing as rich a bibliographic exploration as you have CPU cycles to burn. The 3 million IP addresses have been converted into arbitrary codes. Elbakyan converted the IP addresses into geolocations using a database I purchased from the company Maxmind. She then clustered each geolocation to the coordinates of the nearest city using the Google Maps API. Sci-Hub users cluster to 24,000 unique locations.
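For readers curious what this kind of processing looks like, here is a minimal Python sketch of the two ideas: replacing each IP address with an arbitrary code, and snapping a coordinate to the nearest city. It is an illustration only, not the code used for the released data set, which relied on a Maxmind database and the Google Maps API. The three-city list, its rounded coordinates, and the reserved documentation IP addresses are assumptions made for the example.

```python
import math


def pseudonymize(ips):
    """Replace each distinct IP with an arbitrary integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(ip, len(codes)) for ip in ips]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_city(lat, lon, cities):
    """Return the name of the city in `cities` closest to (lat, lon)."""
    return min(cities, key=lambda city: haversine_km(lat, lon, city[1], city[2]))[0]


# Tiny illustrative city list (name, latitude, longitude); coordinates are rounded.
CITIES = [
    ("Paris", 48.86, 2.35),
    ("London", 51.51, -0.13),
    ("Berlin", 52.52, 13.41),
]

codes = pseudonymize(["192.0.2.1", "198.51.100.7", "192.0.2.1"])
city = nearest_city(49.0, 2.5, CITIES)
print(codes, city)
```

Because the codes carry no information about the original addresses, repeat visits from the same address can still be linked without revealing who made them.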

The big take-home? Sci-Hub is everywhere. Most papers are being downloaded from the developing world: The top 3 countries are India, China, and Iran. But the rich industrialized countries use Sci-Hub, too. A quarter of the downloads came from OECD nations, and some of the most intense download hotspots correspond to the campuses of universities in the US and Europe, which supposedly have the most comprehensive journal access.

But these data have many more stories to tell. How do the reading habits of researchers differ by city? What are the hottest research topics in Indonesia, Italy, Brazil? Do the research topics shift when the Sci-Hub night owls take over? My analysis indicates a bimodal distribution over the course of the day, with most locations surging around lunchtime, and the rest peaking at 1am local time. The animated map above shows just 2 days of the data.
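If you want to explore the time-of-day pattern yourself, one possible starting point is to bucket download events by local hour. The Python sketch below assumes each event has a UTC timestamp string and a known UTC offset for its location. Both are hypothetical inputs for illustration: the released records would first need to be joined with time zone information, and their timestamp format may differ.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def local_hour_counts(events):
    """Count events per local hour; events are (utc_timestamp, utc_offset_hours) pairs."""
    counts = Counter()
    for stamp, offset_hours in events:
        utc = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        local = utc.astimezone(timezone(timedelta(hours=offset_hours)))
        counts[local.hour] += 1
    return counts


def is_working_hours(hour):
    """True for 8am up to (but not including) 6pm local time."""
    return 8 <= hour < 18


# Hypothetical events, for illustration only.
events = [
    ("2015-09-01 06:30:00", 5.5),  # 12:00 local
    ("2015-09-01 23:10:00", 2),    # 01:10 local, next day
    ("2015-09-01 12:05:00", 0),    # 12:05 local
]

counts = local_hour_counts(events)
night_owls = sum(n for hour, n in counts.items() if not is_working_hours(hour))
print(sorted(counts.items()), night_owls)
```

Plotting `counts` for each location would show whether it peaks around lunchtime or in the small hours.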

Something everyone would like to know: What proportion of downloaded articles are actually unavailable from nearby university libraries? Put another way: What is the size of the knowledge gap that Sci-Hub is bridging?

Download the data yourself and let the world know what you find.

The data:

My analysis of the data in Science:


 — John Bohannon

2015 stats roundup

While gearing up for the Dryad member meeting (to be held virtually on 24 May – save the date!) and publication of our annual report, we’re taking a look at last year’s numbers.

2015 was a “big” year for Dryad in many respects. We added staff, and integrated several new journals and publishing partners. But perhaps most notably, the Dryad repository itself is growing very rapidly. We published 3,926 data packages this past year — a 44% increase over 2014 — and blew past the 10,000 mark for total data packages in the repository.

Data package size

Perhaps the “biggest” Dryad story from last year is the increase in the mean size of data packages published. In 2014, that figure was 212MB. In 2015, it more than doubled to 481MB, an increase of a whopping 127%.
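For anyone who wants to check the arithmetic, the percentage increase can be reproduced in a few lines of Python. The function name is hypothetical and the two inputs are the mean package sizes quoted above.

```python
def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100.0


# Mean data package size in MB: 212 in 2014, 481 in 2015.
increase = percent_change(212, 481)
print(round(increase))  # 127
```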

This striking statistic is part of the reason we opted at the beginning of 2016 to double the maximum package size before overage fees kick in (to 20GB), and to simplify and reduce our overage fees. We want researchers to continue to archive more (and larger) data files, and to do so sustainably. Meanwhile, we do continue to welcome many submissions on the smaller end of the scale.


Distribution of Dryad data package size by year. Boxplot shows median, 1st and 3rd quartiles, and 95% confidence interval of median. Note the log scale of the y-axis.

In 2015, the mean number of files in a data package was about 3.4, with 104 as the largest number of files in any data package. To see how times have changed, compare this to a post from 2011 (celebrating our 1,000th submission), where we noted:

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte. Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half. As one might expect, many of the files are spreadsheets or in tabular text format. Thus, the files are rich in information but not so difficult to transfer or store.

We have yet to do a full analysis of file formats deposited in 2015, but we see among the largest files many images and videos, as would be expected, but also a notable increase in the diversity of DNA sequencing-related file formats.

So not only are there now more and bigger files in Dryad, there’s also greater complexity and variety. We think this shows that more people are learning about the benefits of archiving and reusing multiple file types, and that researchers (and publishers) are broadening their view of what qualifies as “data.”

Download counts

So who had the biggest download numbers in 2015? Interestingly, nearly all of last year’s most-downloaded data packages are from genetics/genomics. 3 of the top 5 are studies of specific wild populations and how they adapt to changing circumstances: Sailfin Mollies (fish), blue tits (birds), and bighorn sheep, specifically.

Another top package presents a model for dealing with an epidemic that had a deadly impact on humans in 2015. And rounding out the top 5 is an open source framework for reconstructing the relationships that unite all lineages — a “tree of life.”

In 5th place, with 367 downloads:

In 4th place, with 601 downloads:

In 3rd place, with 1,324 downloads:

In 2nd place, with 1,868 downloads:

And this year’s WINNER, with 2,678 downloads:

The above numbers are presented with the usual caveats about bots, which we aim to filter out, but cannot do with perfect accuracy. (Look for a blog post on this topic in the near future.)

As always, we owe a huge debt to our submitters, partners, members and users for supporting Dryad and open data in 2015!

What were the most downloaded data packages in 2014?

The reason why Dryad is in the business of archiving, preserving, and providing access to research data is so that it will be reused, whether for deeper reading of the publication, for post-publication review, for education, or for future research. While it’s not yet as easy as we would like to track data reuse, one metric that is straightforward to collect is the number of times a dataset has been downloaded, and this is one of two data reuse statistics reported by our friends at ImpactStory and Plum Analytics.


The numbers are very encouraging. There are already over a quarter million downloads for the 8,897 data files released in 2014 (from 2,714 data packages). That’s over 28 downloads per data file. While there is always the caveat that some downloads may be due to activity from newly emerged bots that we have yet to recognize and filter out, we think it is safe to say that most of these downloads are from people.

To celebrate, we would like to pay special tribute to the top five data packages from 2014, as measured by the maximum number of downloads for any single file (since many data packages have more than one) at the time of writing. They cover a diversity of topics from livestock farming in the Paleolithic to phylogenetic relationships among insects. That said, we are struck by the impressively strong showing for plant science — 3 of the top 5 data packages.

In 5th place, with 453 downloads

In 4th place, with 581 downloads

In 3rd place, with 626 downloads

In 2nd place, with 4,672 downloads

And in 1st place, with a staggering 34,879 downloads

Remarkably, given the number of downloads, this last data package was only released in November.

We’d like to thank all of our users, whether you contribute data or reuse it (or both), for helping make science just a little more transparent, efficient, and robust this past year. And we are looking forward to finding out some more of what you did with all those downloads in 2015!