
We’re coming off of a big month which included a two-day Dryad board meeting, International Data Week in Denver, and the Open Access Publishers meeting (COASP) in Arlington, VA. Combined with Open Access Week, we’ve been basking in all things #openscience at Dryad.

International Data Week 2016

International Data Week was a collection of three different events: SciDataCon 2016, the International Data Forum, and the 8th Research Data Alliance Plenary Meeting. While it was my first time attending RDA and SciDataCon, it wasn’t the first time for the many Dryad board members who have been actively participating in these forums for years.

Dryad staff had the pleasure of participating in a few panels over the week. As part of SciDataCon, Elizabeth Hull discussed protecting human subjects in an open data repository. At the RDA 8th Plenary, I participated in a panel discussion of the challenges surrounding the sustainability of data infrastructure. (The talk is available on the RDA website; the panel starts at minute 30.)

Participating in IDW reminded me how important our diverse community of stakeholders and members is to furthering the adoption of open data. Dryad members create a community and support our mission. Our members benefit by receiving discounts on data publication fees and by relying on a repository that stays current with the evolving needs and mandates that surround open data. We work together to help make open data easy and affordable for authors.

Asking OA publishers to be more open

Following International Data Week, I had the opportunity to participate for the first time in the Open Access Scholarly Publishers Association meeting, COASP 2016. Heather Joseph, Executive Director of SPARC, kicked off the meeting with a keynote that urged attendees to consider how they would complete the phrase “Open in order to . . .” as a way to ensure that we all keep our sights on working toward something more than just ‘open for the sake of open’. Some of the other memorable talks addressed the challenges of mapping connections from articles to other related outputs, and discussed the growing interest in alternative revenue models to article processing charges (APCs). I had the privilege of delivering a keynote entitled “Be More Open”, which highlighted the connections between the Open Access and Open Data movements, and I encouraged OASPA to add open data policies to its membership requirements.

I’d like to thank the organizers and sponsors of International Data Week and COASP 2016 for making these important conversations possible. In addition, I would also like to encourage any interested stakeholders to join Dryad and support open data.


We are pleased to have received a Sustaining Award from the U.S. National Science Foundation.  Sustaining Awards are an innovative proposal track, developed within NSF’s Advances in Bioinformatics program, that provides “limited support for the cost of ongoing operations and maintenance of existing cyberinfrastructure that is critical for the continued advance of priority biological research.”

The award is to the University of North Carolina at Chapel Hill, with Dryad as a subawardee. The grant provides approximately $762K in funding over three years (starting 1-Sep-2016).

From the abstract:

This award will enable Dryad to achieve the scale required for sustainability through continued growth and extension to new research communities. At the same time, it will enable the continued growth of the repository’s valuable collection of diverse and high-quality data for research and education.

The full project description is publicly available and more information about the award is at the NSF Funding Database.

We are grateful to NSF, which has generously supported the Dryad Digital Repository since its inception in 2008, including a recently funded small-scale pilot study to explore direct sponsorship of data publication charges.

 

One of the most rewarding things about working for Dryad is collaborating with talented and passionate professionals from across the globe who are dedicated to increasing the availability of open data. This summer, two new people were officially elected to serve on Dryad’s Board of Directors and we are excited to have them on our governance team.

Jennifer Lin, Director of Product Management at Crossref, comes to us with lots of experience in product development and management, community outreach, scholarly communications, and more. Based in California, USA, Jennifer was instrumental in helping Dryad integrate our data submission system with PLOS journals during her tenure there. She is a data sharing evangelist, and passionate about tools for making data reusable and discoverable. We are thrilled to have her direct her energy and enthusiasm Dryad’s way.

Johan Nilsson is also new to the Dryad board and comes from the Oikos Editorial Office, a society-owned publishing foundation based at Lund University, Sweden. Johan previously worked as a research scientist in evolutionary ecology. He has a strong interest in scientific communication and social media engagement, and focuses particularly on how the benefits of open science (and open data in particular) can be better communicated to researchers. We value his expertise and perspective on how Dryad can best serve its users.

We would be remiss if we didn’t also publicly welcome Ingrid Dillo, who was appointed to the board early in 2016. Ingrid is deputy director at DANS (Data Archiving and Networked Services). She holds a PhD in history and has a long record of policy development at DANS, the National Library of the Netherlands, and the Dutch Ministry of Education, Culture and Science. She is especially interested in research data management and the certification of trustworthy digital repositories. We are already relying on Ingrid’s expertise and learning from her work with groups like the Research Data Alliance.

Candidates for Dryad’s 12-member Board of Directors are nominated by Member organizations, and four of the Directors are elected or re-elected every year. Once on the Board, Directors serve as individuals rather than organizational representatives. The rotating Board aims for both diversity of perspective and depth of expertise. We are delighted to have achieved both with our new Directors. We welcome them onboard and wish to extend a heartfelt thanks to Directors past, present, and future for their contributions and dedication to Dryad’s mission.


The question of who should pay for the preservation and stewardship of open research data remains unresolved, at a time when journals and funders alike are adopting strong open data policies. As a non-profit repository that relies on financial support from members and users, we at Dryad deal with this question daily, and are eager to help find new and sustainable solutions.

Along these lines, if you submit your data to Dryad, you will soon notice that we will ask for information about your grant support. That’s because we’re running a pilot project with the US National Science Foundation (NSF) to test the feasibility of having a funding organization directly sponsor Data Publication Charges (DPCs).

During this pilot implementation, if your research was supported by a grant from the US NSF, and your DPC would not otherwise be waived or sponsored by another organization, this grant information can be used to charge the DPC directly to a fund set aside as part of this project.


Entering grant information at data submission is optional. Nonetheless, we encourage researchers to fill out the funding information in order to benefit from NSF funds, to enable awardees to receive credit from their institutions and funders for the open availability and reuse of the data, and to promote the data’s discoverability.

Direct funder sponsorship of data archiving has some significant advantages.

Researchers also stand to benefit — they have an interest in seeing their data responsibly curated and preserved, even if they publish and archive data after their grant funds have expired.  And we are excited by the prospect of increasing the proportion of data packages for which the DPC is sponsored or waived (which is currently just over 2/3).

We aim to work out the details of achieving the goals above, and to evaluate any downsides, as part of the pilot. We will also be surveying researchers to better understand what happens when data is not sponsored by a payment plan. From that, we will be able to develop recommendations for what Dryad, funding organizations, and institutions can do to facilitate the DPC payment process for researchers.

We are grateful to the NSF Advances in Bioinformatics program for the supplemental funding behind this project, and we hope that many researchers will take advantage of the opportunity to have their DPC covered by the NSF funds, which will be available at least through February 2017.  Please let me know (at director@datadryad.org) if you have any questions or feedback!

We’re pleased to present a guest post from data scientist Juan M. Banda, the lead author of an important, newly-available resource for drug safety research. Here, Juan shares some of the context behind the data descriptor in Scientific Data and associated data package in Dryad. – EH

_____

As I sit in a room full of over one hundred bio-hackers at the 2016 Biohackathon in Tsuruoka, Yamagata, Japan, the need to have publicly available and accessible data for research use is acutely evident. Organized by Japan’s National Bioscience Database Center (NBDC) and the Database Center for Life Science (DBCLS), this yearly hackathon gathers people from organizations and universities all over the world, including the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), with the purpose of extending and interlinking resources like PubChem, PhenomeCentral, Bio2RDF, and PubAnnotation.

The end goal: finding better ways to access data that will allow researchers to focus on analysis of the data rather than preparation.

In the same spirit, our publication “A curated and standardized adverse drug event resource to accelerate drug safety research” (doi:10.1038/sdata.2016.26; data in Dryad at http://doi.org/10.5061/dryad.8q0s4) helps researchers in the drug safety domain with the standardization and curation of the freely available data from the US Food and Drug Administration (FDA) adverse event reporting system (FAERS).

FAERS collects information on adverse events and medication errors reported to the FDA, and comprises over 10 million records collected from 1969 to the present. As one of the most important resources for drug safety efforts, the FAERS database has been used in at least 750 publications indexed in PubMed, and was probably manipulated, mapped, and cleaned independently by the vast majority of the authors of those publications. This cleaning and mapping process takes a considerable amount of time — hours that could have been spent analyzing the data further.

Our publication hopes to eliminate this needless work and allow researchers to focus their efforts on developing methods to analyze this information.

As part of Observational Health Data Sciences and Informatics (OHDSI), whose mission is to “Improve health, by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care,” we decided to tackle the task of cleaning and curating the FAERS database for our community, and the wider drug safety community. By providing a general common data model (CDM) and a general vocabulary to standardize how electronic patient data is stored, OHDSI allows its participants to join a research network with over 655 million patients.

With a significant fraction of the community’s research being focused on drug safety, it was a natural decision to standardize the FAERS database with the OMOP vocabulary, to allow all researchers on our network access to FAERS. Since the OMOP vocabulary incorporates general vocabularies such as SNOMED, MeSH, and RxNORM, among others, the usability of this resource is not limited to participants of this community.

In order to curate this dataset, we took the source FAERS data in CSV format and de-duplicated case reports. We then performed value imputation for certain fields that were missing. Drug names were standardized to RxNorm ingredients and standard clinical names (for multi-ingredient drugs). This mapping is tricky because some drug names have spelling errors, and others are non-prescription drugs or international brand names. We achieved coverage of 93% of the drug names, which in turn cover 95% of the case reports in FAERS.

For the first time, the indications and reactions have been mapped to SNOMED-CT from their original MedDRA format. Coverage for indications and reactions is around 64% and 80%, respectively. The OMOP vocabulary allows RxNorm drug codes as well as SNOMED-CT codes to reside in the same unified vocabulary space, simplifying use of this resource. We also provide the complete source code we developed, in order to allow researchers to refresh the dataset with the new quarterly FAERS data releases and improve the mappings if needed. We encourage users to contribute the results of their efforts back to the OHDSI community.
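To give a concrete flavor of these steps, here is a minimal sketch (in Python with pandas) of two of them: keeping only the latest version of each case report and normalizing drug names to RxNorm ingredients. The column names and the tiny lookup table are illustrative assumptions, not the actual FAERS schema; the complete, authoritative pipeline is in the source code released with the paper.

    import pandas as pd

    # Toy stand-in for FAERS case/drug records (real data: doi:10.5061/dryad.8q0s4)
    reports = pd.DataFrame({
        "caseid":      [100, 100, 101, 102],
        "caseversion": [1,   2,   1,   1],   # later versions supersede earlier ones
        "drugname":    ["TYLENOL", "TYLENOL", "acetaminophen 500 MG", "Aspirin"],
    })

    # 1. De-duplicate: keep only the latest version of each case report
    latest = (reports.sort_values("caseversion")
                     .drop_duplicates(subset="caseid", keep="last"))

    # 2. Standardize drug names to RxNorm ingredients via a lookup table
    #    (the real mapping also handles misspellings and international brand names)
    rxnorm_map = {
        "TYLENOL": "acetaminophen",
        "acetaminophen 500 MG": "acetaminophen",
        "Aspirin": "aspirin",
    }
    latest["rxnorm_ingredient"] = latest["drugname"].map(rxnorm_map)

    print(latest[["caseid", "drugname", "rxnorm_ingredient"]])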

With a firm commitment to making open data easier to use, this resource allows researchers to utilize a professionally curated (and refreshable) version of the FAERS data, enabling them to focus on improving drug safety analyses and finding more potentially harmful drugs, as a part of OHDSI’s core mission.


Still from OHDSI video

The data:

http://doi.org/10.5061/dryad.8q0s4

A full description of the dataset in Scientific Data:

http://www.nature.com/articles/sdata201626

 

— Juan M. Banda

On May 24, we held the first virtual Dryad Community Meeting, which allowed us to connect both with our membership and with the larger open data community, far and wide. The theme was “Leadership in data publishing: Dryad and learned societies.”

Following an introduction and update about Dryad from yours truly, we heard from representatives of three of Dryad’s member societies about their experiences.

All three societies require that data be archived in an appropriate repository as a condition of publication in their journals. Yet, they have each taken considerable time and effort to develop policies that address the needs and concerns of their different communities.

Bruna spoke about working with an audience that routinely gathers data for very long-term studies. For many Biotropica authors, embargoes are seen as an important prerequisite for data publishing. Their data policy “includes a generous embargo period of up to three years to ensure authors have ample time to publish multiple papers from more complex or long-term data sets”. Biotropica’s policy also recommends those “who re-use archived data sets to include as fully engaged collaborators the scientists who originally collected them”. To address initial resistance to data archiving, and to build understanding and consensus, Biotropica “enlisted its critics” to contribute to a paper discussing the pros and cons of data publication. Out of this process emerged an innovative policy that went into effect at the start of 2016.

Meaden, by contrast, noted that only 8% of Proceedings B authors elect to embargo data in Dryad, and the standard embargo is for only one year after publication. She credited clearer author instructions and a data availability statement in the manuscript submission system as key elements that have increased the availability of data associated with Royal Society publications.

Newton discussed BES’ move from “encouraging data publication” in 2012 to requiring it in 2014. As shown below, this resulted in an impressive increase in the availability of data. Next, the society is looking to develop guidance on data reuse etiquette. Newton noted that this effort would “need to be community-led.”


Slide from Erika Newton’s presentation, illustrating the rise in data deposits for BES journals associated with the change in data policy.

While each speaker reported on unique challenges, all shared commonalities, such as:

  • involving the specific community in policy decisions
  • incrementally increasing efforts to make data available
  • the importance of clear author instructions 

We greatly appreciate the excellent contributions from the panelists, as well as all the members and other attendees who participated and contributed to the lively Q&A.

We are also pleased that the virtual format was well received. In our follow-up survey, many of the attendees said they found it easy to ask questions and appreciated the ability to join remotely.

Our aim is that these meetings continue to be a valued forum for our diverse community of stakeholders to share knowledge and discuss emerging issues. If you have suggestions on topics for future meetings, or an interest in becoming a member, please reach out to me at director@datadryad.org.


 

The following is a guest post from science journalist John Bohannon. We asked him to give us some background on his recent dataset in Dryad and the analysis of that data in Science. What stories will you find in the data? – EH

_______


Sci-Hub is the world’s largest repository of pirated journal articles. We will probably look back and see it as inevitable. Soon after it became possible for people to share copyrighted music and movies on a massive scale, technologies like Napster and BitTorrent arrived to make the sharing as close to frictionless as possible. That hasn’t made the media industry collapse, as many people predicted, but it certainly brought transformation.

Unlike the media industry, journal publishers do not share their profits with the authors. So where will Sci-Hub push them? Will it be a platform like iTunes, with journals selling research papers for $0.99 each? Or will Sci-Hub finally propel the industry into the arms of the Open Access movement? Will nonprofit scientific societies and university publishers go extinct along the way, leaving just a few giant, for-profit corporations as the caretakers of scientific knowledge?

There are as many theories and predictions about the impact of Sci-Hub as there are commentators on the Internet. What is lacking is basic information about the site. Who is downloading all these Sci-Hub papers? Where in the world are they? What are they reading?

48 hours of Sci-Hub downloads. Each event is color-coded by the local time: orange for working hours (8am-6pm) and blue for the night owls working outside those hours.

Sometimes all you need to do is ask. So I reached out directly to Alexandra Elbakyan, who created Sci-Hub in 2011 as a 22-year-old neuroscience graduate student in Kazakhstan and has run it ever since. For someone denounced as a criminal by powerful corporations and scholarly societies, she was quite open and collaborative. I explained my goal: to let the world see how Sci-Hub is being used, mapping the global distribution of its users at the highest resolution possible while protecting their privacy. She agreed, not realizing how much data-wrangling it would ultimately take us.

Two months later, Science and Dryad are publicly releasing a data set of 28 million download request records from 1 September 2015 through 29 February 2016, timestamped down to the second. Each includes the DOI of the paper, allowing as rich a bibliographic exploration as you have CPU cycles to burn. The 3 million IP addresses have been converted into arbitrary codes. Elbakyan converted the IP addresses into geolocations using a database I purchased from the company MaxMind. She then clustered each geolocation to the coordinates of the nearest city using the Google Maps API. Sci-Hub users cluster to 24,000 unique locations.
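As a rough illustration of those privacy steps (not the actual pipeline used for the release), the sketch below shows how an IP address might be replaced with an arbitrary code and resolved to a coarse location. It assumes the geoip2 Python library and a local copy of a MaxMind city database; the further step of snapping each point to the nearest city is only noted in a comment.

    import itertools
    import geoip2.database  # MaxMind's client library

    reader = geoip2.database.Reader("GeoLite2-City.mmdb")  # assumed local database file
    codes = {}
    counter = itertools.count(1)

    def anonymize(ip):
        # Replace each IP address with an arbitrary code; the mapping is never released
        if ip not in codes:
            codes[ip] = "user-{:07d}".format(next(counter))
        return codes[ip]

    def coarse_location(ip):
        # Look up approximate latitude/longitude for an IP address
        rec = reader.city(ip)
        return rec.location.latitude, rec.location.longitude

    # In the released data, each coordinate pair is additionally snapped to the
    # nearest city (done with the Google Maps API), so exact user locations are
    # never published.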

The big take-home? Sci-Hub is everywhere. Most papers are being downloaded from the developing world: The top 3 countries are India, China, and Iran. But the rich industrialized countries use Sci-Hub, too. A quarter of the downloads came from OECD nations, and some of the most intense download hotspots correspond to the campuses of universities in the US and Europe, which supposedly have the most comprehensive journal access.

But these data have many more stories to tell. How do the reading habits of researchers differ by city? What are the hottest research topics in Indonesia, Italy, Brazil? Do the research topics shift when the Sci-Hub night owls take over? My analysis indicates a bimodal distribution over the course of the day, with most locations surging around lunchtime, and the rest peaking at 1am local time. The animated map above shows just 2 days of the data.
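If you want to poke at that daily pattern yourself, here is a minimal sketch of tabulating downloads by hour of day from the released files. The file name and column layout are assumptions for illustration; check the README in the Dryad package (doi:10.5061/dryad.q447c) for the actual format, and shift timestamps to each location's local time before looking for the lunchtime and 1am peaks.

    import pandas as pd

    # Assumed layout: tab-separated records with a timestamp, DOI, anonymized
    # user code, and city-level location columns; adjust to the actual files.
    logs = pd.read_csv(
        "scihub_sample.tab",
        sep="\t",
        names=["timestamp", "doi", "user_code", "country", "city", "lat", "lon"],
        parse_dates=["timestamp"],
    )

    # Count downloads in each hour of the day (0-23)
    hourly = logs["timestamp"].dt.hour.value_counts().sort_index()
    print(hourly)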

Something everyone would like to know: What proportion of downloaded articles are actually unavailable from nearby university libraries? Put another way: What is the size of the knowledge gap that Sci-Hub is bridging?

Download the data yourself and let the world know what you find.

The data:

http://dx.doi.org/10.5061/dryad.q447c

My analysis of the data in Science:

http://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone

 

 — John Bohannon