
Archive for the ‘Data reuse’ Category

In our latest post, our Executive Director Melissanne Scheld sits down with Dryad’s Board of Directors Chair, Professor Charles Fox, to discuss the challenges researchers face today, how Dryad is helping alleviate some of those pain points, why Dryad has had such staying power in a quickly changing industry . . . and then we move on to dessert.

Chuck Fox

Can you tell us a little about your professional background and how that intersects with Dryad’s mission?

I wear two hats in my professional life – I am an evolutionary ecologist who studies various aspects of insect biology at the University of Kentucky, and I am a journal editor (Executive Editor of Functional Ecology).

My involvement with open data and Dryad began fortuitously in 2006. The British Ecological Society was invited to send a representative to a Data Registry Workshop, organized by the Ecological Society of America, to be held that December in Santa Barbara, California. I am (and was at that time) an editor of one of the British Ecological Society’s journals, Functional Ecology, and I live in the U.S. So Lindsay Haddon, who was Publications Manager for the BES, asked me to attend the workshop as their representative. Before that meeting I don’t recall having thought much about open data or data archives, but I was excited to attend in part because the topic intrigued me and, selfishly, because my parents live in southern California and this was an opportunity to visit them. The discussions at that meeting, plus those at a couple of follow-up meetings over the next few years, including one at NESCent in Durham, North Carolina, and another in Vancouver, convinced me that data publishing, and open data more generally, should be a part of research publication. So I began lobbying the BES to adopt an open data policy and become a founding member of Dryad. I wrote a proposed data policy – just a revision of the Joint Data Archiving Policy (JDAP) that many ecology and evolution journals adopted – and submitted that proposal to the BES’ publication committee. It took a few years, but in 2011 the BES adopted that data policy across their suite of journals and became a member of Dryad. The BES has since been a strong supporter of open data and has required data publication as a condition of publishing a manuscript in one of their journals. Probably because I was a vocal proponent of data policies at BES meetings (along with a few others, most notably Tim Coulson), I was nominated to be a Dryad board member, and was elected to the board in 2013.

As an educator, what are some of the biggest changes you’ve seen in the classroom during your career?

When I started teaching, first as a graduate student (teaching assistant) and then as a young university professor, we didn’t have PowerPoint and digital projectors. So I made heavy use of a chalkboard (or dry erase board) during lecture, and used an overhead projector for more complicated graphics. Students had to take detailed notes on the lecture, which required them to write furiously all throughout the class. Nowadays I produce detailed PowerPoint slides that include most of the material I cover, so I write very little on the chalkboard. And, because I can provide my slides to students before class – as a pdf that they can print and bring to class – the students are freed from scribbling furiously to capture every detail. Students still need to take some notes (my slides do not include every detail), but they are largely freed to listen to the lecture and participate in class discussions. I am not convinced, though, that these changes have led to improved learning, at least not in all students. Having information too easily available, including downloadable class materials, seems to cause some students to actually disengage from class, and ultimately do poorly – possibly because they think they don’t need to attend class, or engage when they do attend, since they have all of the materials easily accessible outside the classroom.

What do you think the biggest challenges are for open science research today?

I have been amazed at how quickly open data has become accepted as the standard in the ecology and evolution research communities. When data policies were first proposed to journals there was substantial resistance to their adoption – journals were nervous about possibly driving away authors, and editors (who are also researchers) shared the views that were common in the community regarding ownership of their own data – but over just a few years the resistance largely disappeared among editors, societies and publishers, such that a large proportion of the top journals in the field have adopted policies requiring data to be published alongside research manuscripts. That said, some significant challenges remain, both on the researcher side and on the repository side. On the repository side, sustainable funding remains the largest hurdle. Data repositories cost money to run, for staff and infrastructure alike. Dryad has been relying on a mix of data publication charges (DPCs) and grants to fund its mission. This has worked for us so far, but constantly chasing grants is a lot of work for those writing them, and the cost to researchers paying DPCs, albeit small, is not trivial for those without grant support.

On the researcher side, though data publishing has mostly become an accepted part of research publication in the community, there remain many important cultural and practical challenges to making open data universally practiced. These include the development of standards for data citation and reuse (not restrictions on data reuse, but community expectations for citation and collaboration), balancing views of data ownership with the needs of the community, balancing the concerns of researchers who produce long-term datasets with those of the community, and others. We also need to improve education about data, such as teaching our students how to organize and properly annotate their datasets so that they are useful to other researchers after publication. Even when data are made available by researchers, actually using those data can be challenging if they are not well organized and annotated.

When researchers are deciding in which repository to deposit their research data, what values and functions should they consider?

Researchers should choose a repository that best fits the type of data they have to deposit and the community that will likely be reusing it. There are many repositories that handle specialized data types, such as genetic sequence data or data to be used for phylogenetic analysis. If your data suits a specialized archive, choose that. But the overwhelming majority of data generated by ecologists don’t fit into specialized archives. It’s for these types of data that Dryad was developed.

So what does Dryad offer researchers? From the perspective of the dataset author, Dryad links your dataset directly to the manuscript you have published about the dataset. This provides users detailed metadata on the contents of your dataset, helping them understand the dataset and use it correctly for future research. Dryad also ensures that your dataset is discoverable, whether you start at the journal page, on Dryad’s site, or at any of a large number of collaborator services. The value of Dryad to the dataset user is similar – easy discoverability of data and clear links to the data collection details (i.e., links to the associated manuscripts).

You’ve held several roles on Dryad’s Board of Directors – what about this organization compels you to volunteer your free time?

My experiences as a scientist, a journal editor, and a participant in open data discussions have convinced me that data publication is an essential part of research publication. For decades, or even centuries, we’ve relied on a publishing model where researchers write manuscripts that describe the work they have done and summarize their results and conclusions for the broader community. That’s the typical journal paper, and it was the limit of what could be done in an age where everything had to fit onto the printed page and be distributed on paper. Nowadays we have near-infinite space in a digital medium to not just summarize our results, but also provide all of the details, including the actual data, as part of the research presentation. It will always be important to have an author summarize their findings and place their work into context – that intellectual contribution is an essential part of communicating your research – but there’s no reason that’s where we need to stop. I imagine a world where a reader can click on a figure, or table, or other part of a manuscript and be taken directly to the relevant details – the actual data presented in the figure, the statistical models underlying the analyses, more detailed descriptions of study sites or organisms, and possibly many other types of information about the experiment, data collection, equipment used, results, etc. We shouldn’t be constrained by historical limitations of the printed page. We’re not yet even close to where I think we can and should be going, but making data an integral part of research publication is a huge step in the right direction. So I enthusiastically support journal mandates that require data to be published alongside each manuscript presenting research results. And facilitating this is a core part of Dryad’s mission, which leads me to enthusiastically support both Dryad’s mission and the organization itself!

Pumpkin or apple pie?  

Those are my two favorite pies, so it’s a tough question. If served a la mode, i.e., with ice cream, then I’d most often pick apple pie. But, without ice cream, I’d have to choose pumpkin pie.

Stay tuned for future conversations with industry thought leaders and other relevant blog posts here at Dryad News and Views.

 


Two cheetahs running. Image credit: Cat Specialist Group, catsg.org

Dryad is thrilled to announce a strategic partnership with California Digital Library (CDL) to address researcher needs by leading an open, community-supported initiative in research data curation and publishing.

Dryad was founded 10 years ago with the mission of providing open, not-for-profit infrastructure for data underlying the scholarly literature, and the vision of promoting a world where research data is openly available and routinely re-used to create knowledge.

20,000 data publications later, that message has clearly resonated. The Dryad model of embedding data publication within journal workflows has proven highly effective, and combined with our data curation expertise, has made Dryad a name that is both known and trusted in the research community. But a lot has changed in the data publishing space since 2008, and Dryad needs to change with it.

Who/what is CDL?

CDL was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. Since then, in collaboration with the UC libraries and other partners, they have assembled one of the world’s leading digital research libraries and changed the ways that faculty, students, and researchers discover and access information.

CDL has long-standing interest and experience in research data management (RDM) and data publishing. CDL’s digital curation program, the University of California Curation Center (UC3), provides digital preservation, data curation, and data publishing services, and has a history of coordinating collaborative projects regionally, nationally, and internationally. It is baked into CDL’s strategic vision to build partnerships to better promote and make an impact in the library, open research, and data management spaces (e.g., DMPTool, HathiTrust).

Why a partnership?

CDL and Dryad have a shared mission of increasing the adoption and availability of open data. By joining forces, we can have a much bigger impact. This partnership is focused on combining CDL’s institutional relationships, expertise, and nimble technology with Dryad’s position in the researcher community, curation workflows, and publisher relationships. By working together, we plan to create global efficiencies and minimize needless duplication of effort across institutions, freeing up time and funds, and, in particular, allowing institutions with fewer resources to support research data publishing and ensure data remain open.

Our joint Dryad-CDL initiative will increase adoption of open data by meeting researchers where they already are. We will leverage the strengths of both organizations to offer new products and services and to build broad, sustainable, and productive approaches to data curation. We plan to move quickly to provide new value:

  • For researchers: We will launch a new, modern and easier-to-use platform. This will provide a higher level of service, and even more seamless integration into regular workflows than Dryad currently offers
  • For journals and publishers: We will offer new integration paths that will allow direct communication with manuscript processing systems, better reporting, and more comprehensive curation services
  • For academic institutions: We will work directly with institutions to craft right-sized offerings to meet your needs

We have many details to hammer out and a lot of work to do, but among our first steps will be to reach out to you — each of the groups above — to discuss your needs, wants, and preferred methods of supporting this effort. With your help, this partnership will help us grow Dryad as a globally-accessible, community-led, non-commercial, low-cost service that focuses on breaking down silos between publishing, libraries, and research.

As this partnership is taking shape, we ask for community input on how our collective efforts can best meet the needs of researchers, publishers, and institutions. Please stay tuned for further announcements and information over the coming months. We hope you share our excitement as we step into Dryad’s next chapter.


Alfred P. Sloan Foundation grant will fund implementation of shared staffing model across 7 academic libraries and Dryad

We’re thrilled to announce that Dryad will participate in a three-year, multi-institutional effort to launch the Data Curation Network. The implementation — led by the University of Minnesota Libraries and backed by a $526,438 grant from the Alfred P. Sloan Foundation — builds on previous work to better support researchers faced with a growing number of requirements to openly and ethically share their research data.

The result of many months of research and planning, the project brings together eight partners:

Currently, staff at each of these institutions provide their own data curation services. But because data curation requires a specialized skill set — spanning a wide variety of data types and discipline-specific data formats — institutions cannot reasonably expect to hire an expert in each area.

Curation workflow for the DCN

The intent of the Data Curation Network is to serve as a cross-institutional staffing model that seamlessly connects a network of expert data curators to local datasets and supplements local curation expertise. The project aims to increase local capacity, strengthen cross-institutional collaboration, and ensure that researchers and institutions ethically and appropriately share data.

Lisa R. Johnston, Principal Investigator for the DCN and Director of the Data Repository for the University of Minnesota (DRUM), explains:

Functionally, the Data Curation Network will serve as the ‘human layer’ in a local data repository stack that provides expert services, incentives for collaboration, normalized curation practices, and professional development training for an emerging data curator community.

For our part, the Dryad curation team is excited to join a collegial network of professionals, to help develop shared procedures and understandings, and to learn from the partners’ experience and expertise (as they may learn from ours).

As an independent, non-profit repository, we are especially pleased to get to work more closely with the academic library community, and hope this project can provide a launchpad for future, international collaborations among organizations with similar missions but differing structures and funding models.

Watch this space for news as the project develops, and follow the DCN on Twitter: #DataCurationNetwork


Dryad is a general purpose repository for data underlying scholarly publications. Each new submission we receive is reviewed by our curation team before the data are archived. Our main priority is to ensure compliance with Dryad’s Terms of Service, but we also strongly believe that curation activities add value to your data publication, since curated data are more likely to be FAIR (findable, accessible, interoperable, and reusable).


Before we register a DOI, a member of our curation team will check each data package to ensure that the data files can be opened, that they appear to contain information associated with a scientific publication, and that metadata for the associated publication are technically correct. We prefer common, non-proprietary file types and thorough documentation, and we may reach out if we are unable to view files as provided.

Our curators are also on the lookout for sensitive information such as personally identifiable human subjects data or protected location information, and for files that contain copyright and license statements that are incompatible with our required CC0 waiver.

To make the data archiving process more straightforward for authors, our curation team has written sets of guidelines to consult when preparing a data submission for a public repository such as Dryad. We hope these guidelines will help you as you prepare your Dryad data package, and that they will shorten the time from submission to registered data DOI!

A series of blog posts will highlight each of the guidelines we’ve created. First up is our best practices for sharing human subjects data in an open access repository, from former Dryad curator Rebecca Kameny.

— Erin Clary, Senior Curator – curator@datadryad.org

_______________

Preparing human subjects data for open access

Collecting, cleaning, managing, and analyzing your data is one thing, but what happens when you are ready to share your data with other researchers and the public?

Because our researchers come from fields that run the gamut of academia — from biology, ecology, and medicine, to engineering, agriculture, and sociology — and because almost any field can make use of data from human subjects, we’ve provided guidance for preparing such data for open access. We based our recommendations and requirements on well-respected national and international sources from government institutions, universities, and peer-reviewed publications.

Dryad curators will review data files for compliance with these recommendations, and may make suggestions to authors; however, authors who submit data to Dryad are ultimately responsible for ensuring that their data are properly anonymized and can be shared in a public repository.

In a nutshell, Dryad does not allow any direct identifiers, but we do allow up to three indirect identifiers. Sound simple? It’s not. If the study involves a vulnerable population (such as children or indigenous people), if the number of participants is small, or if the data are sensitive (e.g., HIV status, drug use), three indirect identifiers may be too many. We evaluate each submission on a case-by-case basis.

If you have qualitative data, you’ll want to pay close attention to open-ended text, and may need to replace names with pseudonyms or redact identifiable text.

Quick tips for preparing human subjects data for sharing

  • Ensure that there are no direct identifiers.
  • Remove any nonessential identifying details.
  • Reduce the precision of a variable – e.g., remove day and month from date of birth; use county instead of city; add or subtract a randomly chosen number.
  • Aggregate variables that are potentially revealing, such as age.
  • Restrict the upper or lower ranges of a continuous variable to hide outliers by collapsing them into a single code.
  • Combine variables by merging data from two variables into a summary variable.
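To make these tips concrete, here is a minimal sketch of what the precision-reduction and aggregation steps might look like in code. Everything here is illustrative: the record, the field names, and the binning choices are assumptions, not Dryad requirements, and real de-identification decisions should still be made case by case.

```python
import datetime

# A hypothetical participant record (all values invented for illustration).
record = {
    "participant_id": "P-1042",          # already a pseudonymous ID, not a name
    "date_of_birth": datetime.date(1984, 6, 17),
    "city": "Durham",
    "county": "Durham County",
    "age": 39,
}

def deidentify(rec):
    """Apply the tips above: reduce precision, aggregate, drop revealing detail."""
    out = {"participant_id": rec["participant_id"]}
    # Reduce precision: keep only the year from the date of birth.
    out["birth_year"] = rec["date_of_birth"].year
    # Use the coarser geographic unit (county instead of city).
    out["county"] = rec["county"]
    # Aggregate a revealing variable: bin exact age into 10-year ranges.
    lo = (rec["age"] // 10) * 10
    out["age_range"] = f"{lo}-{lo + 9}"
    return out

print(deidentify(record))
```

Note that the original `city` and `date_of_birth` fields are simply never copied into the output, which is usually safer than trying to mask them in place.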

It’s also good research practice to provide clear documentation of your data in a README file. Your README should define your variables and allowable values, and can be used to alert users to any changes you made to the original dataset to protect participant identity.
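As a sketch of what such a README might contain (every file name, variable, and value below is illustrative, not a Dryad requirement):

```text
README for example_survey_data.csv (illustrative)

Variables and allowable values:
  participant_id   Pseudonymous identifier, P-0001 through P-0250
  age_range        Age in 10-year bins ("20-29", "30-39", ...)
  county           County of residence
  response_score   Integer 1-5; NA = not reported

Changes made to protect participant identity:
  - All direct identifiers (names, addresses, phone numbers) removed
  - Date of birth reduced to 10-year age bins
  - City of residence replaced with county
```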

Our guidelines expand upon the tips above, and link to some useful references that will provide further guidance to anyone who would like to share human subjects data safely.


In 2011 Peggy Schaeffer penned an entry for this blog titled “Why does Dryad use CC0?” While 2011 seems like a long time ago, especially in our rapidly evolving digital world, the information in that piece is still as valid and relevant now as it was then. In fact, Dryad curators routinely direct authors to that blog entry to help them understand and resolve licensing issues. Since dealing with licensing matters can be confusing, it seems about time to revisit this briefly from a practical perspective.

Dryad uses Creative Commons Zero (CC0) to promote the reuse of data underlying scholarly literature. CC0 provides consistent, clear, and open terms of reuse for all data in our repository by allowing researchers, authors, and others to waive all copyright and related rights for a work and place the work in the public domain. Users know they can reuse any data available in Dryad with minimal impediments; authors gain the potential for more citations without having to spend time responding to requests from those wishing to use their data. In other words, CC0 helps eliminate the headaches associated with copyright and licensing issues for all stakeholders, leading to more data reuse.

So what does this mean in practical terms? Dryad’s curators have come up with a few suggestions to keep in mind as you prepare your data for submission. These tips can help you manage the CC0 requirements and avoid any problems:

DO:

  • Make sure any software included with your submission can be released under CC0. For example, licenses such as GPL or MIT are common and are not compatible with CC0. Be sure there are no licensing statements displayed in the software itself or in associated readme files.
  • Be aware that there are software applications out there that automatically place any output produced by the software under a non-CC0 compatible license. Consider this when you are deciding which software to use to prepare your data.
  • Know the terms of use for any information you get from a website or database.
  • Ensure that any images, videos, or other media that are not your own work can be released under CC0.
  • Be sure to clean up your data before submitting it, especially if you are compressing it using a tool such as zip or tar. Remove anything that can’t be released under CC0, along with any other extraneous materials, such as user manuals for hardware or software tools. Not only does removing extraneous files lessen the chance something will conflict with Dryad’s CC0 policy, it also makes your data more streamlined and easier to use.

DON’T:

  • Don’t add text anywhere in your data submission requiring permission or attribution for reuse. Community norms do a great job of putting in place the expectation that anyone reusing your data will provide the proper citations. CC0 actually encourages citation by keeping the process as simple as possible.
  • Don’t include your entire manuscript or parts of your manuscript in your data package. Most publications have licensing that restricts reuse and is not compatible with CC0.
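As a rough self-check against the points above, you could scan a prepared submission folder for common license markers before uploading. This is only an illustrative sketch, not an official Dryad tool: the file names and keywords it looks for are assumptions, and a clean scan does not guarantee CC0 compatibility.

```python
import os

# License markers that commonly conflict with CC0 (illustrative, not exhaustive).
LICENSE_KEYWORDS = (
    "gnu general public license",
    "mit license",
    "creative commons attribution",
)
SUSPECT_FILENAMES = ("license", "license.txt", "licence", "copying")

def find_license_conflicts(root):
    """Flag files whose name or contents suggest a non-CC0 license statement."""
    flagged = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if name.lower() in SUSPECT_FILENAMES:
                flagged.append(path)
                continue
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    text = fh.read(100_000)  # only scan the first ~100 KB
            except OSError:
                continue
            if any(kw in text.lower() for kw in LICENSE_KEYWORDS):
                flagged.append(path)
    return flagged
```

Anything this flags is worth a manual look before submission; binary formats and images would still need to be checked by eye.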

I hope this post leaves you with a little more understanding about why Dryad uses CC0 and with a few tips that will help make following Dryad’s CC0 requirement easier.

 


We present a guest post from researcher Falk Lüsebrink highlighting the benefits of data sharing. Falk is currently working on his PhD in the Department of Biomedical Magnetic Resonance at the Otto-von-Guericke University in Magdeburg, Germany. Here, he talks about his experience of sharing early MRI data and the unexpected impact that it is having on the research community.

Early release of data

The first time I faced a decision about publishing my own data was while writing a grant proposal. One of our proposed objectives was to acquire ultrahigh resolution brain images in vivo, making use of an innovative development: a combination of an MR scanner with ultrahigh field strength and a motion correction setup to remediate subject motion during data acquisition. While waiting for the funding decision, I simply could not resist acquiring a first dataset. We scanned a highly experienced subject for several hours, allowing us to acquire in vivo images of the brain with a resolution far beyond anything achieved thus far.


MRI data showing the cerebellum in vivo at (a) neuroscientific standard resolution of 1 mm, (b) our highest achieved resolution of 250 µm, and (c) state-of-the-art 500 µm resolution.

When our colleagues saw the initial results, they encouraged us to share the data as soon as possible. Through Scientific Data and Dryad, we were able to do just that. The combination of a peer-reviewed open access journal and an open access digital repository for the data was perfect for presenting our initial results.

17,000 downloads and more

‘Sharing the wealth’ seems to have been the right decision; in the three months since we published our data, there has been an enormous amount of activity:

A distinct need for data re-use

MRI studies are highly interdisciplinary, opening up numerous opportunities for sharing and re-using data. For example, our data might be used to build MR brain atlases and illustrate brain structures in much greater detail, or even for the first time. This could advance our understanding of brain functions. Algorithms used to quantify brain structures needed in the research of neurodegenerative disorders could be enhanced, increasing accuracy and reproducibility. Furthermore, by making available raw signals measured by the MR scanner, image reconstruction methods could be used to refine image quality or reduce the time it takes to collect the data.

There are also opportunities beyond those that our particular dataset offers. A recent emerging trend in MRI comes from the field of machine learning. Neuronal networks are being built to perform and potentially improve all kinds of tasks, from image reconstruction, to image processing, and even diagnostics. To train such networks, huge amounts of data are necessary; these data could come from repositories open to the public. Such re-use of MRI data by researchers in other disciplines is having a strong impact on the advancement of science. By publicly sharing our data, we are allowing others to pursue new and exciting directions.

Download the data for yourself and see what you can do with it. In the meantime, I am still eagerly awaiting the acceptance of the grant application . . . but that’s a different story.

The data: http://dx.doi.org/10.5061/dryad.38s74

The article: http://dx.doi.org/10.1038/sdata.2017.32

— Falk Lüsebrink


We’re pleased to present a guest post from data scientist Juan M. Banda, the lead author of an important, newly-available resource for drug safety research. Here, Juan shares some of the context behind the data descriptor in Scientific Data and associated data package in Dryad. – EH

_____

As I sit in a room full of over one hundred bio-hackers at the 2016 Biohackathon in Tsuruoka, Yamagata, Japan, the need to have publicly available and accessible data for research use is acutely evident. Organized by Japan’s National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS), this yearly hackathon gathers people from organizations and universities all over the world, including the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), with the purpose of extending and interlinking resources like PubChem, PhenomeCentral, Bio2RDF, and PubAnnotation.

The end goal: finding better ways to access data that will allow researchers to focus on analysis of the data rather than preparation.

In the same spirit, our publication “A curated and standardized adverse drug event resource to accelerate drug safety research” (doi:10.1038/sdata.2016.26; data in Dryad at http://doi.org/10.5061/dryad.8q0s4) helps researchers in the drug safety domain with the standardization and curation of the freely available data from the U.S. Food and Drug Administration (FDA) adverse event reporting system (FAERS).

FAERS collects information on adverse events and medication errors reported to the FDA, and comprises over 10 million records collected from 1969 to the present. As one of the most important resources for drug safety efforts, the FAERS database has been used in at least 750 publications as reported by PubMed, and was probably manipulated, mapped, and cleaned independently by the vast majority of the authors of said publications. This cleaning and mapping process takes a considerable amount of time — hours that could have been spent analyzing the data further.

Our publication hopes to eliminate this needless work and allow researchers to focus their efforts in developing methods to analyze this information.

As part of the Observational Health Data Sciences and Informatics (OHDSI) community, whose mission is to “Improve health, by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care,” we decided to tackle the task of cleaning and curating the FAERS database for our community, and the wider drug safety community. By providing a general common data model (CDM) and a general vocabulary to standardize how electronic patient data is stored, OHDSI allows its participants to join a research network with over 655 million patients.

With a significant fraction of the community’s research being focused on drug safety, it was a natural decision to standardize the FAERS database with the OMOP vocabulary, to allow all researchers on our network access to FAERS. Since the OMOP vocabulary incorporates general vocabularies such as SNOMED, MeSH, and RxNORM, among others, the usability of this resource is not limited to participants of this community.

In order to curate this dataset, we took the source FAERS data in CSV format and de-duplicated case reports. We then performed value imputation for certain fields that were missing. Drug names were standardized to RxNorm ingredients and standard clinical names (for multi-ingredient drugs). This mapping is tricky because some drug names have spelling errors, and some are non-prescription drugs, or international brand names. We achieved coverage of 93% of the drug names, which in turn cover 95% of the case reports in FAERS.
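As a simplified illustration of the de-duplication and name-standardization steps described above (the real pipeline is in the authors’ released source code; the field names and normalization rules here are assumptions for demonstration only):

```python
# Hypothetical FAERS-like rows: case reports are versioned, and only the most
# recent version of each case should be kept.
reports = [
    {"caseid": "100", "caseversion": 1, "drugname": "ASPIRIN."},
    {"caseid": "100", "caseversion": 2, "drugname": "aspirin"},
    {"caseid": "200", "caseversion": 1, "drugname": "Tylenol"},
]

def latest_versions(rows):
    """De-duplicate: keep only the highest caseversion seen for each caseid."""
    best = {}
    for row in rows:
        cid = row["caseid"]
        if cid not in best or row["caseversion"] > best[cid]["caseversion"]:
            best[cid] = row
    return list(best.values())

def normalize_drugname(name):
    """Crude cleanup before vocabulary mapping: trim, drop trailing dots, upper-case."""
    return name.strip().rstrip(".").upper()

deduped = latest_versions(reports)
print([(r["caseid"], normalize_drugname(r["drugname"])) for r in deduped])
```

The actual mapping to RxNorm ingredients and SNOMED-CT codes involves far more than string cleanup, which is exactly why a shared, curated release saves the community so much duplicated effort.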

For the first time, the indication and reactions have been mapped to SNOMED-CT from their original MedRA format. Coverage for indications and reactions is around 64% and 80%, respectively. The OMOP vocabulary allows RxNorm drug codes as well as SNOMED-CT codes to reside in the same unified vocabulary space, simplifying use of this resource. We also provide the complete source code we developed in order to allow researchers to refresh the dataset with the new quarterly FAERS data releases and improve the mappings if needed. We encourage users to contribute the results of their efforts back to the OHDSI community.

With a firm commitment to making open data easier to use, this resource allows researchers to utilize a professionally curated (and refreshable) version of the FAERS data, enabling them to focus on improving drug safety analyses and finding more potentially harmful drugs, as a part of OHDSI’s core mission.


Still from OHDSI video

The data:

http://doi.org/10.5061/dryad.8q0s4

A full description of the dataset in Scientific Data:

http://www.nature.com/articles/sdata201626

 

— Juan M. Banda

