What if answers to the world’s most pressing questions and challenges lie in data we already have? What could researchers achieve if they could mine vast troves of existing data for new purposes and at an unprecedented scale? Artificial intelligence (AI) technologies offer exciting possibilities that require AI-ready data to succeed.
AI-assisted scientific discovery requires AI-ready data
AI opens up promising new opportunities for scientific discovery when applied to research data. Unsupervised learning and natural language processing (NLP) can assist researchers in discovering and analyzing massive datasets that are cumbersome to work with using traditional methods. AI models can be trained to extract insights, reveal new patterns, and even generate “digitally accurate, interpretable and reproducible description[s] of natural phenomena.” Researchers can also feed real-world experimental data into AI models to improve them.
However, most labs cannot, on their own, generate the volume of training data required to deploy these models with the best results. And much of the data they can aggregate from public sources lacks the machine-readable metadata, clear file structures, and robust documentation needed for effective reuse. The lack of appropriate data for machine input creates a bottleneck for researchers and limits the potential to apply AI to scientific research questions.
“Effective and trustworthy data-driven science requires the use of data at scale and a transition from … silo-based approaches … towards more networked scholarship. However, the vast majority of public domain data … is still not reusable … mainly because the data are poorly described for third-party use.” (Sansone et al., 2023)
Dryad makes data AI-ready through curation and connections
Enter open, curated research data. AI-ready data refers to data that is organized, evaluated by Dryad data curators, and prepared in a way that makes it easy for researchers to use for AI modeling. Dryad provides a large corpus of this kind of well-structured, well-documented data. This data can be combined with datasets from specialist repositories and a researcher’s own data to create comprehensive datasets that fuel AI-driven research. Accessing the wide range of datasets on a “generalist” platform like Dryad, and potentially combining them with data sourced elsewhere, facilitates the integration of knowledge “from various fields and knowledge systems” that can lead to “more accurate models and foster curiosity-driven research.” Dryad is also an invaluable resource for researchers who lack access to expensive equipment or to distant or off-limits field sites, or who face other barriers to collecting the data they need themselves.
How do we make our data AI-ready? By adhering to the FAIR principles (Findable, Accessible, Interoperable, Reusable), which means ensuring that datasets are well-organized, properly documented, and easily accessible for machine harvesting and analysis. Dryad supports FAIR data through data curation and data connections.
Data curation by our team of reliable curators verifies that data files are accessible and usable, enhances metadata quality and completeness, and offers guidance for authors on recommended data-sharing practices. It also helps ensure that data are appropriate for sharing and do not contain personally identifiable, sensitive, or copyrighted information.
We create data connections by maximizing the use of persistent identifiers and linking to other research outputs to build robust, machine-readable linkages between data and its creators, funders, and associated outputs.
What will you do with Dryad data?
With over 60,000 datasets covering a wide range of research areas, and licensed for reuse, Dryad offers a trove of information for researchers exploring a variety of domains and methods. Data may be accessed through our website or via our API. Examples of the kinds of datasets available, and what they can be used for, include:
- DNA and RNA sequences from various organisms can be used for sequence analysis and classification tasks.
- High-throughput gene expression datasets from microarray or RNA sequencing experiments are suitable for gene expression profiling and predictive modeling.
- Datasets representing interactions between proteins are useful for network analysis and prediction of protein functions.
- Datasets containing survey responses from human subjects on various topics are suitable for predictive modeling and analysis of social trends.
- Collections of textual data, such as articles, books, or social media posts, can be used for natural language processing tasks such as sentiment analysis, topic modeling, and text classification.
- Datasets containing measurements of atmospheric variables, climate model outputs, and weather forecasts are suitable for climate modeling and prediction tasks.
- Time-series data from sensors measuring physical quantities such as temperature, pressure, and humidity can be applied to predictive maintenance, anomaly detection, and forecasting applications.
- Datasets containing information about patient demographics, treatments, and outcomes are useful for clinical prediction modeling and personalized medicine.
- Datasets comprising medical images such as MRI, CT, and histopathology images are suitable for image classification, segmentation, and diagnosis tasks.
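For readers who want to explore datasets like these programmatically, here is a minimal sketch of a keyword search against the API. It is an illustration under stated assumptions, not official client code: the base URL, the `/search` path, and the `q`, `page`, and `per_page` parameters reflect our understanding of the public API (version 2) and should be checked against the current API documentation. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL and "/search" path of the public API (v2).
# Verify both against the current API documentation before relying on them.
API_BASE = "https://datadryad.org/api/v2"


def build_search_url(query, page=1, per_page=20):
    """Build a search URL for a free-text query (no network access)."""
    # Assumption: "q", "page", and "per_page" are the accepted parameter names.
    params = urllib.parse.urlencode(
        {"q": query, "page": page, "per_page": per_page}
    )
    return f"{API_BASE}/search?{params}"


def search_datasets(query, page=1, per_page=20, timeout=30):
    """Fetch one page of search results and return the parsed JSON.

    This performs a real HTTP request, so it needs network access.
    """
    url = build_search_url(query, page, per_page)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URL rather than calling the API, so the sketch runs offline.
    print(build_search_url("gene expression", per_page=5))
```

Calling `search_datasets("gene expression")` would return the raw JSON for one page of results. The structure of that response is not shown here because it depends on the API version in use.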
Dryad’s straightforward submission form and thorough curation process make it easy for researchers to make their data AI-ready. In addition, researchers can take data preparation steps to make their datasets more suitable for AI applications, such as:
- explaining or filling in null values and gaps,
- removing or tagging outliers,
- building a comprehensive data dictionary or codebook,
- documenting data processing steps, and
- publishing data in a machine-readable file format.
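Several of the preparation steps above can be sketched in a few lines of Python using only the standard library. The example below is an illustration, not a Dryad requirement or tool: the sample data, column names, and descriptions are hypothetical, and the 1.5 × IQR rule is just one common way to tag outliers.

```python
import csv
import io
import json
import statistics

# Hypothetical sample data; an empty cell stands for a missing measurement.
RAW_CSV = """site,temp_c,humidity_pct
A,21.5,40
B,22.1,
C,21.8,42
D,95.0,41
E,22.4,43
"""

# Hypothetical column descriptions that form the core of the data dictionary.
DESCRIPTIONS = {
    "site": "Sampling site identifier",
    "temp_c": "Air temperature in degrees Celsius",
    "humidity_pct": "Relative humidity in percent",
    "temp_c_outlier": "True if temp_c lies outside the 1.5 * IQR fences",
    "humidity_pct_outlier": "True if humidity_pct lies outside the 1.5 * IQR fences",
}


def load_rows(text):
    """Parse CSV text into dicts, turning empty cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]


def tag_outliers(rows, column, k=1.5):
    """Add a boolean '<column>_outlier' flag based on the k * IQR rule.

    Values are tagged rather than removed, so reusers can decide for themselves.
    Missing values are never flagged.
    """
    values = [float(r[column]) for r in rows if r[column] is not None]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    for r in rows:
        value = r[column]
        r[f"{column}_outlier"] = value is not None and not (low <= float(value) <= high)
    return rows


def build_data_dictionary(rows, descriptions):
    """Describe every column and count its missing values."""
    return {
        column: {
            "description": descriptions.get(column, "UNDOCUMENTED"),
            "n_missing": sum(1 for r in rows if r[column] is None),
            "missing_value_note": "Empty cell means the value was not recorded",
        }
        for column in rows[0]
    }


rows = load_rows(RAW_CSV)
for numeric_column in ("temp_c", "humidity_pct"):
    tag_outliers(rows, numeric_column)
data_dictionary = build_data_dictionary(rows, DESCRIPTIONS)

# JSON is a machine-readable format that can be shared alongside the data.
data_dictionary_json = json.dumps(data_dictionary, indent=2)
```

Tagging outliers instead of deleting them keeps the original measurements intact, and saving the data dictionary as JSON next to the data files documents both the columns and how gaps are encoded.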
Are you a researcher using AI to accelerate discoveries? Join our user group to help us make Dryad work better for you, or reach out to learn more about our plans to increase AI-readiness.
Funding
This work was, in part, funded by the U.S. National Institutes of Health, Office of Data Science Strategy and the Generalist Repository Ecosystem Initiative (GREI) OTA-21-00 [3OT2DB000005-01S3]. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the NIH.
Feedback and questions are always welcome at hello@datadryad.org.
To keep in touch with the latest updates from Dryad, follow us on LinkedIn, Mastodon, and Bluesky and subscribe to our quarterly newsletter.