In May, the National Science and Technology Council (NSTC) released “a consistent set of desirable characteristics for data repositories” to help government agencies provide guidance to their research communities on selecting appropriate data repositories. These characteristics, covering a range of features from organizational sustainability to data quality assurance, will help to ensure that data resulting from federally funded research are broadly accessible, robustly curated, and preserved over the long term.
As a mission-driven data publishing platform curating tens of thousands of datasets, many originating from federally funded research, Dryad is grateful to have had the chance to submit feedback in anticipation of these recommendations and is now pleased to share the ways in which our infrastructure and services naturally align with the NSTC’s recommendations.
Organizational Infrastructure
Free and Easy Access
The repository provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and policy requirements related to maintaining privacy and confidentiality, Tribal and national data sovereignty, and protection of sensitive data.1
Dryad publishes research data and associated metadata exclusively under a Creative Commons Zero (CC0) License to ensure the broadest possible dissemination. We make data publicly available only after they are curated by our team, which ensures that data are appropriate for sharing openly under a CC0 license, sensitive information has been removed, files are accessible and understandable for other users, and descriptive metadata are provided to facilitate downstream discovery and reuse. Dryad does not publish datasets containing identifiable human subject information; our curation process ensures that data pertaining to human subjects are properly anonymized. Our team of expert curators works to minimize the delay from submission to publication.
To further equitable access and representation, Dryad offers fee waivers for submissions originating from researchers based in countries classified by the World Bank as low-income or lower-middle-income economies. We support the CARE Principles for Indigenous Data Governance and look forward to developing a common vision for implementation with other repositories.
Clear Use Guidance
The repository ensures datasets are accompanied by documentation describing terms of dataset access and use (e.g., reuse licenses and need for approval by a data use committee).
Our curation team checks that data are appropriate for sharing under a CC0 license, and that open sharing isn’t restricted by prior agreement, such as with study participants. The use of CC0 reduces legal and technical impediments to the reuse of data, whether intentional or unintentional.
Risk Management
The repository has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.
Through our curation process, submissions that include sensitive data are either adjusted by the authors to anonymize or otherwise shield details (such as location information for endangered species) or turned away. Dryad does not publish sensitive data.
Retention Policy
The repository provides documentation on policies for data retention.
Our policies to permanently preserve and archive deposited data are set out in our Terms of Service.
Long-term Organizational Sustainability
The repository has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; has contingency plans to ensure data are available and maintained during and after unforeseen events.
All data published with Dryad are preserved in Merritt, a CoreTrustSeal certified repository maintained by the California Digital Library (CDL). Merritt ensures bit-level preservation and actively manages three copies of all files and digital objects in the system through use of external (cloud) storage providers distributed across two geographic regions. Dryad’s full data portfolio is also mirrored in Zenodo. CDL assures permanent preservation of data deposited with Merritt. As a core service of a well-established institution, the CDL benefits from secure permanent funding, providing reasonable expectation of its long-term sustainability. In the event of unforeseen circumstances, CDL commits to “make reasonable efforts to find another curatorial organization … willing to take on custodial responsibility for all managed content.”
Digital Object Management
Unique Persistent Identifiers
The repository assigns a dataset a citable, unique persistent identifier (PID or DPI), such as a digital object identifier (DOI), to support data discovery, reporting (e.g., of research progress), and research assessment (e.g., identifying the outputs of Federally funded research). The unique PID points to a persistent location that remains accessible even if the dataset is de-accessioned or no longer available.
Every dataset submitted to Dryad is assigned a DataCite DOI. After publication, datasets can be versioned. All versions of a dataset remain accessible, but the dataset DOI always resolves to the newest version.
Metadata
The repository ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the communities that the repository serves.
Dryad is a generalist open data publishing platform that invites submission of any research data that doesn’t already have a home in a specialist repository. As such, our metadata schema and curation process are designed to be broad and inclusive. We support the DataCite metadata schema out of the box, require that broadly applicable infrastructure PIDs such as ORCID, FundRef, and ROR are tied to every publication, and use the OECD classification to capture fields of study.
Curation and Quality Assurance
The repository provides or facilitates expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.
Dryad is the first generalist open data publishing platform to introduce curation. Our team of expert curators checks every submission to ensure the validity of files and metadata. Where needed, they correspond with authors to resolve issues and enhance metadata quality. Our curation process ensures that all datasets published with Dryad can be appropriately accessed and reused.
Broad and Measured Reuse
The repository ensures datasets are accompanied by metadata that describe terms of reuse and provides the ability to measure attribution, citation, and reuse of data (e.g., through assignment of adequate and openly accessible metadata and unique PIDs).
In addition to providing DOIs for every dataset, Dryad provides suggested citations for all datasets and publishes usage metrics that conform with Make Data Count standards.
Common Format
The repository allows datasets and metadata to be accessed, downloaded, or exported from the repository in widely used, preferably non-proprietary, formats consistent with standards used in the disciplines the repository serves.
Dryad requires datasets to use open, common file formats. Our curators check that files can be opened with widely available software. Dryad uses the DataCite metadata schema, and JSON metadata records for all datasets can be accessed through our API.
Provenance
The repository has mechanisms in place to record the origin, chain of custody, version control, and any other modifications to submitted datasets and metadata.
Dryad retains a full audit trail throughout the curation process, recording every action taken. Every dataset submitted to Dryad is assigned a DataCite DOI. Any edits made to a dataset after publication create a new version of the submission using a versioned DOI. Dryad’s curation team reviews and publishes changes and makes the most recent version of the dataset available for download. Prior versions, organized by date of publication, also remain accessible and downloadable.
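To make the versioning behavior concrete, here is a minimal Python sketch of a dataset whose identifier always resolves to the newest version while earlier versions stay retrievable in order of publication. It is an illustration only, not Dryad's data model: the class names, fields, and the placeholder identifier are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Version:
    """One published state of a dataset (hypothetical structure)."""
    number: int
    published: date
    files: tuple[str, ...]


@dataclass
class Dataset:
    """A dataset identifier plus its append-only version history."""
    doi: str  # placeholder identifier for illustration, not a real DOI
    versions: list[Version] = field(default_factory=list)

    def publish(self, published: date, files: tuple[str, ...]) -> Version:
        """Append a new version; earlier versions are never overwritten."""
        version = Version(len(self.versions) + 1, published, files)
        self.versions.append(version)
        return version

    def latest(self) -> Version:
        """Return the version the dataset identifier resolves to: the newest."""
        if not self.versions:
            raise LookupError("no published versions")
        return self.versions[-1]

    def history(self) -> list[Version]:
        """Return all versions ordered by publication date, oldest first."""
        return sorted(self.versions, key=lambda v: v.published)
```

Because versions are immutable and only ever appended, a correction after publication adds a record instead of rewriting one, which is what keeps prior versions downloadable.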
Technology
Authentication
The repository supports authentication of data submitters. The repository has technical capabilities that facilitate associating submitter PIDs with those assigned to their deposited digital objects, such as datasets.
Depositing authors are required to authenticate via ORCID. If depositing authors provide contact information for their co-authors, Dryad also prompts those authors to authenticate with ORCID.
Long-term Technical Sustainability
The repository has a plan for long-term management of data, building on a stable technical infrastructure and funding plans.
All data published with Dryad are preserved in Merritt, a CoreTrustSeal certified repository maintained by the California Digital Library (CDL). Merritt ensures bit-level preservation and actively manages three copies of all files and digital objects in the system through use of external (cloud) storage providers distributed across two geographic regions. All data files are stored along with a SHA-256 checksum of the file content. Regular checks of files against their checksums are made. The audit process cycles continually, with a current cycle time of approximately two months.
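For readers unfamiliar with checksum audits, the idea can be sketched in a few lines of Python: store a SHA-256 digest alongside each file, then periodically recompute the digest and compare. This is a minimal sketch under stated assumptions, not Merritt's implementation; the function names and the manifest layout (a mapping from file name to expected digest) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> list[str]:
    """Return names of files that are missing or whose digest has changed.

    `manifest` maps a file name (relative to `root`) to the digest that was
    recorded when the file was stored (assumed layout for this example).
    """
    failures = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```

An empty list from `audit` means every file still matches its recorded digest; any name returned would need to be repaired from one of the other stored copies.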
Security and Integrity
The repository has documented measures in place to meet well established cybersecurity criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data (e.g., the NIST Cybersecurity Framework).
Dryad is GDPR compliant and follows best practices for privacy and security. Users who do not Submit Content to Dryad are not asked to provide any personally identifying information. Dryad implements and follows commercially reasonable electronic security measures to secure the systems through which information is collected or stored. Security protections, and all other elements of this policy, extend to data copies and backups implemented for business continuity. For site security purposes, and to ensure that this service remains available to all Users, we employ software programs to monitor traffic and to identify unauthorized attempts to upload or change information or to otherwise cause damage. In the event of authorized law enforcement investigations, and pursuant to any required legal process, information from these sources may be used to help identify an individual.
1 Italicized quotes at the beginning of each section are taken from the Desirable Characteristics of Data Repositories for Federally Funded Research.