Archive for the ‘Data availability’ Category

A study providing new insights into the citation boost from open data has been released in preprint form on PeerJ by Dryad researchers Heather Piwowar and Todd Vision. The researchers looked at thousands of papers reporting new microarray data and thousands of cited instances of data reuse. They found that the citation boost, while more modest than seen in earlier studies (overall, ~9%), was robust to confounding factors, distributed across many archived datasets, continued to grow for at least five years after publication, and was driven to a large extent by actual instances of data reuse. Furthermore, they found that the intensity of dataset reuse has been rising steadily since 2003.

Heather, a post-doc based in Vancouver, may be known to readers of this blog for her earlier work on data sharing, her blog, her role as cofounder of ImpactStory, or her work to promote access to the literature for text mining. Recently Tim Vines, managing editor of Molecular Ecology and a past member of Dryad’s Consortium Board, managed to pull Heather briefly away from her many projects to ask her about her background and latest passions:

TV: Your research focus over the last five years has been on data archiving and science publishing: how did your interest in this field develop?

HP: I wanted to reuse data.  My background is electrical engineering and digital signal processing: I worked for tech companies for 10 years. The most recent was a biotech developing predictive chemotherapy assays. Working there whetted my appetite for doing research, so I went back to school for my PhD to study personalized cancer therapy.

My plan was to use data that had already been collected, because I’d seen first-hand the time and expense that goes into collecting clinical trials data. Before I began, though, I wanted to know whether the data in NCBI’s databases was good quality (highly selective journals like Nature often require data archiving), or whether it was instead mostly the dregs of research, because that was all investigators were willing to part with. I soon realized that no one knew… and that it was important, and we should find out. Studying data archiving and reuse became my new PhD topic, and my research passion.

My first paper was rejected from a High Profile journal.  Next I submitted it to PLOS Biology. It was rejected from there too, but they mentioned they were starting this new thing called PLOS ONE.  I read up (it hadn’t published anything yet) and I liked the idea of reviewing only for scientific correctness.

I’ve become more and more of an advocate for all kinds of open science as I’ve run into barriers that prevented me from doing my best research.  The barriers kept surprising me. Really, other fields don’t have a PubMed? Really, there is no way to do text mining across all scientific literature?  Seriously, there is no way to query that citation data by DOI, or export it other than page by page in your webapp, and you won’t sell subscriptions to individuals?  For real, you won’t let me cite a URL?  In this day and age, you don’t value datasets as contributions in tenure decisions?  I’m working for change.

TV: You’ve been involved with a few of the key papers relating data archiving to subsequent citation rate. Could you give us a quick summary of what you’ve found?

HP: Our 2007 PLOS ONE paper was a small analysis related to one specific data type: human cancer gene expression microarray data.  About half of the 85 publications in my sample had made their data publicly available.  The papers with publicly available data received about 70% more citations than similar studies without available data.

I later discovered there had been an earlier study in the field of International Studies — it has the awesome title “Posting your data: will you be scooped or will you be famous?”  There have since been quite a few additional studies of this question, the vast majority finding a citation benefit for data archiving.  Have a look at (and contribute to!) this public Mendeley group initiated by Joss Winn.

There was a significant limitation to these early studies: they didn’t control for several important confounders of citation rate (number of authors, for example). Thanks to Angus Whyte at the Digital Curation Centre (DCC) for conversations on this topic. Todd Vision and I have been working on a larger study of data citation and data reuse to address this and to understand deeper patterns of data reuse. Our conclusions:

After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported.  We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data.  Other factors that may also contribute to the citation boost are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

TV: Awareness of data archiving and its importance for the progress of science has increased massively over the past five years, but very few organizations have actually introduced mandatory archiving policies. What do you see as the remaining obstacles?

HP: Great question. I don’t know. Someone should do a study!  Several journals have told me it is simply not a high priority for them: it takes time to write and decide on a policy, and they don’t have time.  Perhaps wider awareness of the Joint Data Archiving Policy will help.

Some journals are afraid authors will choose a competitor journal if they impose additional requirements. I’m conducting a study to monitor the attitudes, experiences, and practices of authors in journals that have adopted the JDAP policy and similar authors who publish elsewhere. The study will run for 3 years, so although I have more than 2500 responses there is still another whole year of data collection to go. Stay tuned :)

Keep an eye on Journal Research Data Policy Bank (JoRD) to stay current on journal policies for data archiving.

Funders, though.  Why aren’t more funders introducing mandatory public data archiving policies (with appropriate exceptions)?  I don’t know.  They should.  Several are taking steps towards it, but golly it is slow.  Is anyone thinking of the opportunity cost of moving this slowly?  More specific thoughts in my National Science Foundation RFI response with coauthor Todd Vision.

TV: You’re a big advocate of ‘open notebook’ science. How did you first get interested in working in this way?

HP: I was a grad student, hungry for information. I wanted to know if everyone’s science looked like my science. Was it messy in the same ways? What processes did they have that I could learn from? What were they excited about *now* — findings and ideas that wouldn’t hit journal pages for months or years?

This was the same time that Jean-Claude Bradley was starting to talk about open notebook science in his chemistry lab. I was part of the blogosphere conversations, and had a fun time at ISMB 2007 going around to all the publisher booths asking about their policies on publishing results that had previously appeared on blogs and wikis (my blog posts from the time; for a current resource, see the list of journal responses maintained by F1000 Posters).

TV: It’s clearly a good way to work for people whose work is mainly analysis of data, but how can the open notebook approach be adapted to researchers who work at the bench or in the field?

HP: Jean-Claude Bradley has shown it can work very well in a chemistry lab. I haven’t worked in the field, so I don’t want to presume to know what is possible or easy: I’m guessing that in many cases it wouldn’t be easy. That said, more often than not, where there is a will there is a way!

TV: Given the growing concerns over the validity of the results in scientific papers, do you think that external supervision of scientists (i.e. mandated open notebook science) would ever become a reality?

HP: I’m not sure. Such a policy may well have disadvantages that outweigh its advantages. It does sound like a good opportunity to do some research, doesn’t it? A few grant programs could have a precondition that the awardees be randomized to different reporting requirements; then we monitor and see what happens. Granting agencies ought to be doing A LOT MORE EXPERIMENTING to learn the implications of their policies, followed by quick and open dissemination of the results of the experiments, and refinements in policies to reflect this growing evidence-base.

TV: You’re involved in a lot of initiatives at the moment. Which ones are most exciting for you? 

HP: ImpactStory. The previous generation of tools for discovering the impact of research are simply not good enough. We need ways to discover citations to datasets, in citation lists and elsewhere. Ways to find blog posts written about research papers — and whether those blog posts, in turn, inspire conversation and new thinking. We need ways to find out which research is being bookmarked, read, and thought about even if that background learning doesn’t lead to citations. Research impact isn’t the one-dimensional winners-and-losers situation we have now with our single-minded reliance on citation counts: it is multi-dimensional — research has an impact flavour, not an impact number.

Metrics data locked behind subscription paywalls might have made sense years ago, when gathering citation data required a team of people typing in citation lists.  That isn’t the world we live in any more: keeping our evaluation and discovery metrics locked behind subscription paywalls is simply neither necessary nor acceptable.  Tools need to be open, provide provenance and context, and support a broad range of research products.

We’re realizing this future through ImpactStory: a nonprofit organization dedicated to telling the story of our research impact. Researchers can build a CV that includes citations and altmetrics for their papers, datasets, software, and slides: embedding altmetrics on a CV is a powerful agent of change for scholars and scholarship. ImpactStory was co-founded by Jason Priem and me, is funded by the Alfred P. Sloan Foundation while we become self-sustaining, and is committed to building a future that is good for scholarship. Check it out, and contact us if you want to learn more: team@impactstory.org

Thanks for the great questions, Tim!

Read Full Post »

PubMed and GenBank, from the National Center for Biotechnology Information (NCBI), are hugely popular resources for searching and retrieving article abstracts and nucleotide sequence data, respectively.  PubMed indexes the vast majority of the biomedical literature, and deposition of nucleotide sequences in GenBank or one of the other INSDC databases is a near universal requirement for publication in a scientific journal.

Thanks to NCBI’s “LinkOut” feature, it is now easy to find associated data in Dryad from either PubMed or GenBank. For example, this Dryad data package is linked from:

  • the article’s abstract in PubMed. “LinkOut” is at the bottom of the page; expand “+” to see the links to Dryad and other resources.
  • nucleotide data associated with the same publication in GenBank. “LinkOut” is in the right-hand navigation bar.

LinkOut allows the data from an article to be distributed among repositories without compromising its discoverability.
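For readers who want to retrieve these links programmatically, below is a minimal sketch using NCBI’s public E-utilities “elink” endpoint, whose llinks command lists the LinkOut URLs registered for a record. This is an illustration rather than an official Dryad or NCBI tool, and the PubMed ID in it is a hypothetical placeholder:

```python
# Minimal sketch: list the LinkOut providers (Dryad among them, where present)
# registered for a PubMed record, via NCBI's E-utilities "elink" endpoint.
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
pmid = "12345678"  # placeholder -- use a PubMed ID from an article with Dryad data

# cmd=llinks asks elink for the LinkOut URLs attached to this record.
url = f"{EUTILS}?dbfrom=pubmed&id={pmid}&cmd=llinks"
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Each ObjUrl element describes one LinkOut link: provider name and target URL.
for obj_url in tree.iter("ObjUrl"):
    provider = obj_url.findtext("Provider/Name", default="?")
    target = obj_url.findtext("Url", default="?")
    print(f"{provider}: {target}")
```

Filtering the provider names for Dryad would then isolate the data package links from the article’s other LinkOut resources.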

At Dryad, we intend to expand on this feature in a couple of ways. First, we plan to make Dryad content searchable via PubMed and GenBank identifiers, which, because of their wide use, will provide a convenient gateway for other biomedical databases to link out to Dryad. Second, we will be using open web standards to expose relationships between content in Dryad and other repositories, not just NCBI. For example, keen eyes may have noted the relationship of the Dryad data package in the example above to two records in TreeBASE.

To learn more about how Dryad implements NCBI’s LinkOut feature, please see our wiki.

Read Full Post »

The following guest post is from Tim Vines, Managing Editor of Molecular Ecology and Molecular Ecology Resources.  ME and MER have among the most effective data archiving policies of any Dryad partner journal, as measured by the availability of data for reuse [1].  In this post, which may be useful to other journals figuring out how to support data archiving, Tim explains how Molecular Ecology’s approach has been refined over time.


Ask almost anyone in the research community, and they’ll say that archiving the data associated with a paper at publication is really important. Making sure it actually happens is not quite so simple. One of the main obstacles is that it’s hard to decide which data from a study should be made public, and this is mainly because consistent data archiving standards have not yet been developed.

It’s impossible for anyone to write exhaustive journal policies laying out exactly what each kind of study should archive (I’ve tried), so the challenge is to identify for each paper which data should be made available.

Before I describe how we currently deal with this issue, I should give some history of data archiving at Molecular Ecology. In early 2010 we joined with the five other big evolution journals in adopting the ‘Joint Data Archiving Policy’, which mandates that “authors make all the data required to recreate the results in their paper available on a public archive”. This policy came into force in January 2011, and since all six journals brought it in at the same time, no single journal suffered the effects of introducing a (potentially) unpopular policy.

To help us see whether authors really had archived all the required datasets, we started requiring that authors include a ‘Data Accessibility’ (DA) section in the final version of their manuscript. This DA section lists where each dataset is stored, and normally appears after the references. For example:

Data Accessibility:

  • DNA sequences: GenBank accessions F234391-F234402
  • Final DNA sequence assembly uploaded as online supplemental material
  • Climate data and MaxEnt input files: Dryad doi:10.5521/dryad.12311
  • Sampling locations, morphological data and microsatellite genotypes: Dryad doi:10.5521/dryad.12311

We began back in 2011 by including a few paragraphs about our data archiving policies in positive decision letters (i.e. ‘accept, minor revisions’ and ‘accept’), which asked for a DA section to be added to the manuscript during the final revisions. I would also add a sticky note to the ScholarOne Manuscripts entry for the paper indicating which datasets I thought should be listed. Most authors added the DA section, but generally only included some of the data. I then switched to putting my list into the decision letter itself, just above the policy text. For example:

“Please don’t forget to add the Data Accessibility section- it looks like this needs a file giving sampling details, morphology and microsatellite genotypes for all adults and offspring. Please also consider providing the input files for your analyses.”

This was much more effective than expecting the authors to work out which data we wanted. However, it still meant that I was combing through the abstract and the methods trying to work out what data had been generated in that manuscript.

We use ScholarOne Manuscripts’ First Look system for handling accepted papers, and we don’t export anything to be typeset until we’re satisfied with the DA section. Being strict about this makes most authors deal with our DA requirements quickly (they don’t want their paper delayed), but a few take longer while we help them work out what we need.

The downside of this whole approach is that it takes me quite a lot of effort to work out what should appear in the DA section, and would be impossible in a journal where an academic does not see the final version of the paper. A more robust long-term strategy has to involve the researcher community in identifying which data should be archived.

I’ll flesh out the steps below, but simply put, our new approach is to ask authors to include a draft Data Accessibility section at initial submission. This draft DA section should list each dataset and say where the authors expect to archive it. As long as the DA section is there (even if it’s empty), we send the paper on to an editor. If it makes it to reviewers, we ask them to check the DA section and point out which datasets are missing.

A paper close to acceptance can thus contain a complete or nearly complete DA section. Furthermore, any deficiencies should have been pointed out in review and corrected in revision. The editorial office now has the much easier task of checking over the final DA section and making sure that all the accession numbers etc. are added before the article is exported to be typeset.

The immediate benefit is that authors are encouraged to think about data archiving while they’re still writing the paper – it’s thus much more an integral part of manuscript preparation than an afterthought. We’ve also found that a growing proportion of papers (currently about 20%) are being submitted with a completed DA section that requires no further action on our part. I expect that this proportion will be more like 80% in two years, as this seems to be how long it takes to effect changes in author or reviewer behavior.

Since the fine grain of the details may be of interest, I’ve broken down the individual steps below:

1) The authors submit their paper with a draft ‘Data Accessibility’ (DA) statement in the manuscript; this lists where the authors plan to archive each of their datasets. We’ve included a required checkbox in the submission phase that states ‘A draft Data Accessibility statement is present in the manuscript’.

2) Research papers submitted without a DA section are held in the editorial office checklist and the authors are contacted to request one. In the first few months of using this system we have found that c. 40% of submissions don’t have the statement initially, but after we request it the DA is almost always emailed within 3-4 days. If we don’t hear back within five working days we unsubmit the paper; this has happened to only about 5% of papers.

3) If the paper makes it out to review, the reviewers are asked to check whether all the necessary datasets are listed, and if not, request additions in the main body of their review. Specifically, our ‘additional questions’ section of the review tab in S1M now contains the question: “Does the Data Accessibility section list all the datasets needed to recreate the results in the manuscript? If ‘No’, please specify which additional data are needed in your comments to the authors.”  Reviewers can choose ‘yes’, ‘no’ or ‘I didn’t check’; the latter is important because reviewers who haven’t looked at the DA section aren’t forced to arbitrarily click ‘yes’ or ‘no’.

4) The decision letter is sent to the authors with the question from (3) included. Since we’re still in the early days of this system and less than a quarter of our reviewers understand how to evaluate the DA section, I am still checking the data myself and requesting that any missing datasets be included in the revision. This is much easier than before as there is a draft DA section to work with and sometimes some feedback from the reviewers.

5) The editorial office then makes sure that any deficiencies identified by me or the reviewers are dealt with by the time the paper goes to be typeset; this normally happens at the First Look stage.

I’d be very happy to help anyone who would like to know more about this system or its implementation – please contact me at managing.editor@molecol.com

[1] Vines TH, Andrew RL, Bock DG, Franklin MT, Gilbert KJ, Kane NC, Moore JS, Moyers BT, Renaut S, Rennison DJ, Veen T, Yeaman S. Mandated data archiving greatly improves access to research data. FASEB J. 2013 Jan 8. Epub ahead of print.  Update: Also available from arXiv.

Read Full Post »

If you have data packages in Dryad, consider adding a button like this next to each one on the publication list of your website or your electronic CV.

You can link the button to the individual data package page on Dryad to enrich your publication list and make your data easy to find.
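If your publication list is plain HTML, the sketch below shows one way the button markup might be generated; the DOI, image path, and helper name are hypothetical placeholders, not an official Dryad snippet (the button graphic itself is available from the publicity page mentioned below).

```python
# Hypothetical helper: build an HTML button link from a Dryad data package DOI.
def dryad_button(package_doi: str, img_src: str = "images/data-in-dryad.png") -> str:
    """Return an anchor tag wrapping the button image, pointing at the
    data package page via the DOI resolver."""
    url = "http://dx.doi.org/" + package_doi
    return f'<a href="{url}"><img src="{img_src}" alt="Data in Dryad"/></a>'

# Example with a made-up DOI:
print(dryad_button("10.5061/dryad.example"))
```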

Props to our early adopters below.  Check out their pages for some examples.

For other ways to show your support, please visit our page of publicity material on the Dryad wiki.  Let us know if you come up with creative ways to promote your data in Dryad. And additional suggestions are always welcome at help@datadryad.org.

Have at it!

Read Full Post »

Dryad is delighted to join with PLOS today to announce our partnership with PLOS Biology, as described here on the official PLOS Biology blog, Biologue. As the first Public Library of Science (PLOS) journal to partner with Dryad to integrate manuscript submission, “PLOS Biology can offer authors a seamless tying together of an article with its underlying data; [and] can also provide confidential access for editors and reviewers to data associated with articles under review.”

Here’s how it works: during manuscript evaluation, PLOS Biology invites authors to deposit the underlying data files in Dryad, sending them a link to Dryad that enables a streamlined upload process (no need to enter the article details). Authors may deposit complex and varied data types in multiple formats, and these files are then accessible to editors and reviewers via anonymous, secure access during the manuscript review process. Behind the scenes, the journal’s editorial system and the Dryad repository exchange metadata, ensuring that upon publication the article links to the associated data in Dryad, permanently connecting the published article with its securely archived, publicly available data.

Dr. Theodora Bloom, Chief Editor of PLOS Biology, notes that journals “are uniquely well-placed to help researchers ensure that all data underlying a study are made available alongside any published articles.”

We welcome PLOS Biology authors and editors to Dryad, and look forward to extending this partnership to other PLOS journals.

Read Full Post »

A number of enhancements to the repository have been made in recent months, including these four that were in high demand from users:

  • First, we have modified our submission process so that data can be deposited prior to editorial review of the manuscript. Journals that integrate manuscript and data submission at the review stage can now offer their editors and peer reviewers anonymous access to the data in Dryad while the manuscript is in review. This option is currently being used by several of our partner journals, including BMJ Open, Molecular Ecology, and Systematic Biology, and is available to any existing or future integrated journal. Note: authors still begin their data deposit process at the journal.
  • Second, when authors submit data associated with previously published articles, they can pull up the article information using the article DOI or its PubMed ID, greatly simplifying the deposition process for legacy data.
  • Third, Dryad now supports versioning of data files. Authors can upload new versions of their files to correct or update the original file. Once logged in to their Dryad account, the My Submissions option appears under My Account in the left side-menu. Prior unfinished and completed submissions are listed; selecting an archived submission allows the author to add a new file. Note that earlier versions of the file will still be available to users, but the metadata may be modified to reflect the reason for the update. The DOIs will be appended with a version number (e.g., “.1”, “.2”) so that each version can be uniquely referenced (see the short sketch after this list). By default, users will be shown the most current version of each data file, and will be notified of the existence of any previous or subsequent versions.
  • Fourth, access and download statistics have been displayed for content in the repository since late 2010; Dryad now displays the statistics for an article’s data together on one page, so you can see at a glance how many times the page has been viewed and how many times each component data file has been downloaded. Check out this example from Evolutionary Applications.
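To make the versioning convention above concrete, here is a tiny sketch, assuming a hypothetical data package DOI, of how the version-suffixed identifiers map onto resolver URLs; it illustrates the naming convention only and is not a Dryad API:

```python
# Illustrative only: the ".1", ".2" suffixes described above give each file
# version its own uniquely citable identifier.
def versioned_doi(base_doi: str, version: int) -> str:
    """Append a version suffix, e.g. '.2' for the second version."""
    return f"{base_doi}.{version}"

def resolver_url(doi: str) -> str:
    """Turn a 'doi:...' identifier into a resolvable dx.doi.org URL."""
    return "http://dx.doi.org/" + doi.split("doi:", 1)[-1]

base = "doi:10.5061/dryad.example"  # made-up data package DOI
for v in (1, 2):
    print(resolver_url(versioned_doi(base, v)))
# -> http://dx.doi.org/10.5061/dryad.example.1
# -> http://dx.doi.org/10.5061/dryad.example.2
```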

Read Full Post »

Christopher Pirrone excavating an odontocete skull (photo by Robert Boessenecker)

Perhaps it’s understandable that paleontologists are committed to preserving the scientific record, since they spend a lot of time and energy finding and extracting shreds of evidence millions of years old.  Now, thanks to a partnership between Dryad and The Paleontological Society announced last year [1], coupled with strong data archiving policies adopted by two of its journals (Paleobiology and the Journal of Paleontology), a rich trove of data will be available for future researchers to unearth from Dryad.

For both journals, authors are being instructed to deposit the underlying data at the time their manuscript is submitted, so that editors and referees will be able to review it prior to acceptance.  Once published on Dryad, the data will be independently discoverable and citable, while at the same time prominently linked both to and from the original article.  Researchers are able to track the reuse impact of their data, independent of the citation impact of their article, by monitoring downloads from Dryad.

Preserved for ages: Smilodon, by Charles Knight (1905), from a mural at the American Museum of Natural History.

Here’s an example from a recent issue of Paleobiology to sink your teeth into:

Article: Meachen-Samuels JA (2012) Morphological convergence of the prey-killing arsenal of sabertooth predators. Paleobiology 38(1): 1-14. doi:10.1666/10036.1

Data: Meachen-Samuels JA (2012) Data from: Morphological convergence of the prey-killing arsenal of sabertooth predators. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.h58q6

References:

[1]  Callaway E (2011) Fossil data enter the web period. Nature 472, 150. http://dx.doi.org/10.1038/472150a

Read Full Post »
