
Archive for the ‘Data availability’ Category

The following is a guest post from Tom Jefferson of The Cochrane Collaboration, Peter Doshi of the University of Maryland and Carl Heneghan of the University of Oxford. We asked them to tell the story behind their recent Cochrane systematic review [1] and dataset in Dryad [2], which holds valuable lessons about the evidence base on which major public health recommendations are decided. -TJV

In the late 2000s, half the world was busy buying and stockpiling the neuraminidase inhibitors oseltamivir (Tamiflu, Roche) and zanamivir (Relenza, GSK) in fear of an influenza pandemic.

The advice to stockpile for a pandemic, and also to use the drugs in non-pandemic, seasonal influenza seasons, came from such august bodies as the World Health Organization (WHO), the US Centers for Disease Control and Prevention (CDC) and its European counterpart, the ECDC. However, they were stockpiling on the basis of an unclear rationale, conflating the effect of the antiviral drugs on the complications of influenza (mainly pneumonia and hospitalizations) with their capacity to slow viral spread, buying time for vaccines to be rapidly produced and deployed.

It has since become clear that none of these parties had seen all the clinical trial evidence for these drugs. They had based their recommendations on reviews of “the literature”, which sounds impressive but in fact refers to short trial reports published in journal articles rather than the underlying detailed raw data. For example, key assumptions about antiviral performance in the US national pandemic plan trace back to a six-page journal article written by Roche, which reported a pooled analysis of 10 randomized trials, only two of which have ever been published.

In contrast, each of the corresponding internal clinical study reports for these 10 trials runs to thousands of pages (for background on what clinical study reports are, see here). Despite the stockpiling, these reports have never been reviewed by the CDC, ECDC, or WHO. Both the WHO and the CDC refused to answer our questions about the evidence base for their policies.

Our Cochrane systematic review of neuraminidase inhibitors, funded by the National Institute for Health Research in the UK, was based on analysis of the full clinical study reports for these drugs, not short journal publications. We obtained these reports from the European Medicines Agency, Roche, and GlaxoSmithKline.  It took us nearly four years to obtain the full set of reports. The story of how we got hold of the complete set of clinical trials with no access restrictions is told in our essay “Multisystem failure: the story of anti-influenza drugs”.

With the publication of our review, we are making all 107 full clinical study reports publicly available. If you disagree with our findings, if you want to carry out your own analysis or if you are just curious to see what around 150,000 pages of data look like, they are one click away. Now the discussion about how well these drugs work can happen with all parties able to independently analyze all the trial evidence. This is called open science.

Be aware that GSK and Roche carried out some minimal redactions to protect investigator and participant identity. While protecting participant identity is understandable, the EMA takes a different view of protecting investigator identity: “names of experts or designated personnel with legally defined responsibilities and roles with respect to aspects of the Marketing Authorisation dossier (e.g. QP, QPPV, Clinical expert, Investigator) are included in the dossier because they have a legally defined role or responsibility and it is in the public interest to release this data”.

References

  1. Jefferson T, Jones MA, Doshi P, Del Mar CB, Hama R, Thompson MJ, Spencer EA, Onakpoya I, Mahtani KR, Nunan D, Howick J, Heneghan CJ (2014) Neuraminidase inhibitors for preventing and treating influenza in healthy adults and children. Cochrane Database of Systematic Reviews, online in advance of print. doi:10.1002/14651858.CD008965.pub4
  2. Jefferson T, Jones MA, Doshi P, Del Mar CB, Hama R, Thompson MJ, Spencer EA, Onakpoya I, Mahtani KR, Nunan D, Howick J, Heneghan CJ (2014) Data from: Neuraminidase inhibitors for preventing and treating influenza in healthy adults and children. Dryad Digital Repository. doi:10.5061/dryad.77471


We are delighted to announce the availability of the data underlying the book “40 Years of Evolution” by Peter and Rosemary Grant. In this new book, the Grants give an account of their classic, long-term study of Darwin’s finches on one of the Galápagos Islands. From the announcement by Princeton University Press:

The authors used a vast and unparalleled range of ecological, behavioral, and genetic data–including song recordings, DNA analyses, and feeding and breeding behavior–to measure changes in finch populations on the small island of Daphne Major in the Galápagos archipelago. They find that natural selection happens repeatedly, that finches hybridize and exchange genes rarely, and that they compete for scarce food in times of drought, with the remarkable result that the finch populations today differ significantly in average beak size and shape from those of forty years ago. The authors’ most spectacular discovery is the initiation and establishment of a new lineage that now behaves as a new species, differing from others in size, song, and other characteristics. The authors emphasize the immeasurable value of continuous long-term studies of natural populations and of critical opportunities for detecting and understanding rare but significant events.

“40 Years of Evolution”, written in a style accessible to researchers, students and a more general audience, includes over 100 line drawings illustrating quantitative patterns among the many variables the authors have studied. There are 82 data files being made available in Dryad for researchers and students to explore the numbers behind those figures. We are proud to be the custodians of this unique scientific resource.

For students and teachers interested in the Grants’ long-term studies of Darwin’s Finches, we also recommend the excellent background material and hands-on data analysis activities from the HHMI BioInteractive site.

Data citation: Grant PR, Grant BR (2013) Data from: 40 years of evolution. Darwin’s finches on Daphne Major Island. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.g6g3h
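The citation above follows the same pattern as the other data citations on this page: authors, year, “Data from:” plus the title, the repository name, and a persistent identifier. As a purely illustrative sketch (the function name and fields below are ours, not part of any Dryad API), assembling such a citation string from its metadata parts is mechanical:

```python
def format_data_citation(authors, year, title, repository, doi):
    """Assemble a data citation in the style used on this page:
    Authors (Year) Data from: Title. Repository. doi:DOI
    All field names here are illustrative, not a Dryad API."""
    return f"{authors} ({year}) Data from: {title}. {repository}. doi:{doi}"

citation = format_data_citation(
    "Grant PR, Grant BR", 2013,
    "40 years of evolution. Darwin's finches on Daphne Major Island",
    "Dryad Digital Repository", "10.5061/dryad.g6g3h",
)
print(citation)
```

Note that the bare DOI form (`doi:10.5061/...`) and the resolvable URL form (`http://dx.doi.org/10.5061/...`) identify the same dataset; either can appear in a reference list.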

 


The Data Citation Synthesis Group has released a draft Declaration of Data Citation Principles and invites comment.

This has been a very interesting and positive collaborative process and has involved a number of groups and committed individuals. Encouraging the practice of data citation, it seems to me, is one of the key steps towards giving research data its proper place in the literature.

As the preamble to the draft principles states:

Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice.

In support of this assertion, and to encourage good practice, we offer a set of guiding principles for data citation.

Please do comment on these principles. We hope that with community feedback and support, a finalised set of principles can be widely endorsed and adopted.

Discussion on a variety of lists is welcome, of course. However, if you want the Synthesis Group to take full account of your views, please be sure to post your comments on the discussion forum.

Some notes and observations on the background to these principles

I would like to add here some notes and observations on the genesis of these principles. As has been widely observed, a number of groups and interested parties have been exploring the principles of data citation for several years. To mention only some of the sources and events that affected my own thinking on the matter: there was the 2007 D-Lib article by Micah Altman and Gary King, which offered ‘A Proposed Standard for the Scholarly Citation of Quantitative Data’, and Toby Green’s 2009 OECD White Paper ‘We need publishing standards for datasets and data tables’. Micah Altman and Mercè Crosas organised a workshop at Harvard in May 2011 on Data Citation Principles. Later the same year, the UK Digital Curation Centre published a guide to citing data.

The CODATA-ICSTI Task Group on Data Citation Standards and Practices (co-chaired by Christine Borgman, Jan Brase and Sarah Callaghan) has been in existence since 2010. In collaboration with the US National CODATA Committee and the Board on Research Data and Information, it organised a major workshop in August 2011, which was reported in ‘For Attribution: Developing Data Attribution and Citation Practices and Standards’.

The CODATA-ICSTI Task Group then started work on a report covering data citation principles, eventually entitled ‘Out of Cite, Out of Mind’ – drafts were circulated for comment in April 2013 and the final report was released in September 2013.

Following the first ‘Beyond the PDF’ meeting in January 2011, participants produced the Force11 Manifesto, ‘Improving Future Research Communication and e-Scholarship’, which places considerable weight on the availability of research data and the citation of those data in the literature. At ‘Beyond the PDF II’ in Amsterdam, in March 2013, a group comprising Mercè Crosas, Todd Carpenter, David Shotton and Christine Borgman produced ‘The Amsterdam Manifesto on Data Citation Principles’. In the very same week, in Gothenburg, an RDA Birds of a Feather group was discussing the more specific problem of how to support, technologically, the reliable and efficient citation of dynamically changing or growing datasets and subsets thereof. And the broader issues of the place of data in research publication were being considered in the ICSU World Data System Working Group on Data Publication. This group has, in turn, formed the basis for an RDA Interest Group. Oooffff!

How great a thing is collaboration?

From June 2013, as the Force11 group was preparing its website and activities to take forward the work on the Amsterdam Manifesto, calls came in from a number of sources for these various groups and initiatives to coordinate and collaborate. This was admirably well received, and by July the ‘Data Citation Synthesis Group’ had come into being with an agreed mission statement:

The data citation synthesis group is a cross-team committee leveraging the perspectives from the various existing initiatives working on data citation to produce a consolidated set of data citation principles (based on the Amsterdam Manifesto, the CODATA and other sets of principles provided by others) in order to encourage broad adoption of a consistent policy for data citation across disciplines and venues. The synthesis group will review existing efforts and make a set of recommendations that will be put up for endorsement by the organizations represented by this synthesis group.

The synthesis group will produce a set of principles, illustrated with working examples, and a plan for dissemination and distribution. This group will not be producing detailed specifications for implementation, nor focus on technologies or tools.

As has been noted elsewhere, the group comprised 40 individuals and brought together a large number of organisations and initiatives. What followed over the summer was a set of weekly calls to discuss and align the principles. I must say, I thought these were admirably organised, and they benefitted considerably from participants’ efforts to prepare documents comparing the various groups’ statements. The face-to-face meeting of the group, at which much of the detailed discussion to finalise the draft was undertaken, was hosted (with a funding contribution from CODATA) at the US National Academy of Sciences between the 2nd RDA Plenary and the DataCite Summer Meeting (which CODATA also co-sponsored). It has been intellectually stimulating and a real pleasure to contribute to these discussions and to witness so many informed and engaged people bashing out these issues.

The principles developed by the Synthesis Group are now open for comment, and I urge as many researchers, editors and publishers as possible who believe that data have a place in scholarly communications to comment on them and, in due course, to endorse them and put them into practice.

Are we finally at the cusp of real change in practice? Will we now start seeing the practice of citing data sources become more and more widespread? It’s too soon to say for sure, but I hope these principles, and the work on which they build, have got us to a stage where we can start really believing the change is well underway.

Simon Hodson is Executive Director of CODATA and a member of the Dryad Board of Directors.  This post was originally published on the CODATA blog.


We are celebrating the recent publication in Dryad of the first data to accompany a book [1, 2]. Odd Couples: Extraordinary Differences Between the Sexes in the Animal Kingdom, from Princeton University Press, examines the occasionally surprising gender differences in animals, and what it means to be male or female in the animal kingdom. It is intended for both general and scientific readers.

The author, Daphne Fairbairn, a professor of biology at the University of California, Riverside, and Editor-in-Chief of Evolution, a Dryad partner journal, describes the data as:

…a survey of all recorded sexual dimorphisms in all of the animal classes that contain dioecious species (species with separate sexes).  It categorizes the prevalence of dioecy, the types of differences between the sexes (size, shape, color, etc.) and the magnitude of the differences.  I use this survey to construct frequency plots in the book, but there was no room to publish the full survey results.  This is the first time that such a survey has been done and I am hoping that it will prove useful to other biologists who might use the data for hypothesis testing.  I might even get around to this myself!

I think these archived data are one of the most significant contributions of the book to the scientific literature, even though they will not be important for non-specialist readers.

While most data in Dryad accompany journal articles, we are happy to see data archiving catching on with other types of publications, such as books, dissertations and conference proceedings. Please contact us if you are interested in submitting data and have any questions about its suitability for Dryad.

[1] Fairbairn DJ (2013) Data from: Odd couples: extraordinary differences between the sexes in the animal kingdom. Dryad Digital Repository. doi:10.5061/dryad.n48cm

[2] Fairbairn DJ (2013) Odd Couples: Extraordinary Differences Between the Sexes in the Animal Kingdom. Princeton University Press. ISBN: 9780691141961.



Dryad is a nonprofit organization fully committed to making scientific and medical research data permanently available to all researchers and educators, free of charge and without barriers to reuse. For the past four years, we have engaged experts and consulted our many stakeholders in order to develop a sustainability plan that will ensure Dryad’s content remains free to users indefinitely. The resulting plan allows Dryad to recoup its operating costs in a way that recovers revenue fairly and scalably. It includes revenue from submission fees, membership dues, grants and donations.

A one-time submission fee will offset the actual costs of preserving data in Dryad. The majority of costs are incurred at the time of submission, when curators process new files, and long-term storage costs scale with each submission, so this transparent one-time charge ensures that resources scale with demand. Dryad offers a variety of pricing plans for journals and other organizations, such as societies, funders and libraries, to purchase discounted submission fees on behalf of their researchers. For data packages not covered by a pricing plan, the researcher pays upon submission. Waivers are provided to researchers from developing economies. See Pricing Plans for a complete list of fees and payment options. Submission fees will apply to all new submissions starting September 2013.

Membership dues will supplement submission fees, allowing Dryad to maintain its strong ties to the research community through its volunteer Board of Directors, Annual Membership Meetings, and other outreach activities to researchers, educators and stakeholder organizations. See Membership Information.

Grants will fund research, development and innovation.

Donations will support all of the above efforts.  In addition, Dryad will occasionally appeal to donors to fund special projects or specific needs, such as preservation of valuable legacy datasets and deposit waivers for researchers from developing economies.

We are grateful for all the input we have received into our sustainability plan, and look forward to your continued support in carrying out our nonprofit mission for many long years to come.


A study providing new insights into the citation boost from open data has been released in preprint form on PeerJ by Dryad researchers Heather Piwowar and Todd Vision. The researchers looked at thousands of papers reporting new microarray data and thousands of cited instances of data reuse. They found that the citation boost, while more modest than seen in earlier studies (overall, ~9%), was robust to confounding factors, distributed across many archived datasets, continued to grow for at least five years after publication, and was driven to a large extent by actual instances of data reuse. Furthermore, they found that the intensity of dataset reuse has been rising steadily since 2003.

Heather, a post-doc based in Vancouver, may be known to readers of this blog for her earlier work on data sharing, her blog, her role as cofounder of ImpactStory, or her work to promote access to the literature for text mining. Recently Tim Vines, managing editor of Molecular Ecology and a past member of Dryad’s Consortium Board, managed to pull Heather briefly away from her many projects to ask her about her background and latest passions:

TV: Your research focus over the last five years has been on data archiving and science publishing. How did your interest in this field develop?

HP: I wanted to reuse data.  My background is electrical engineering and digital signal processing: I worked for tech companies for 10 years. The most recent was a biotech developing predictive chemotherapy assays. Working there whetted my appetite for doing research, so I went back to school for my PhD to study personalized cancer therapy.

My plan was to use data that had already been collected, because I’d seen first-hand the time and expense that go into collecting clinical trials data. Before I began, though, I wanted to know whether the stuff in NCBI’s databases was good quality (deposited because highly selective journals like Nature often require data archiving) or whether it was mostly the dregs of research, because that was all investigators were willing to part with. I soon realized that no one knew… and that it was important, and we should find out. Studying data archiving and reuse became my new PhD topic, and my research passion.

My first paper was rejected from a High Profile journal.  Next I submitted it to PLOS Biology. It was rejected from there too, but they mentioned they were starting this new thing called PLOS ONE.  I read up (it hadn’t published anything yet) and I liked the idea of reviewing only for scientific correctness.

I’ve become more and more of an advocate for all kinds of open science as I’ve run into barriers that prevented me from doing my best research.  The barriers kept surprising me. Really, other fields don’t have a PubMed? Really, there is no way to do text mining across all scientific literature?  Seriously, there is no way to query that citation data by DOI, or export it other than page by page in your webapp, and you won’t sell subscriptions to individuals?  For real, you won’t let me cite a URL?  In this day and age, you don’t value datasets as contributions in tenure decisions?  I’m working for change.

TV: You’ve been involved with a few of the key papers relating data archiving to subsequent citation rate. Could you give us a quick summary of what you’ve found?

HP: Our 2007 PLOS ONE paper was a small analysis related to one specific data type: human cancer gene expression microarray data.  About half of the 85 publications in my sample had made their data publicly available.  The papers with publicly available data received about 70% more citations than similar studies without available data.

I later discovered there had been an earlier study in the field of International Studies — it has the awesome title “Posting your data: will you be scooped or will you be famous?”  There have since been quite a few additional studies of this question, the vast majority finding a citation benefit for data archiving.  Have a look at (and contribute to!) this public Mendeley group initiated by Joss Winn.

There was a significant limitation to these early studies: they didn’t control for several important confounders of citation rate (number of authors, for example). Thanks to Angus Whyte at the Digital Curation Centre (DCC) for conversations on this topic. Todd Vision and I have been working on a larger study of data citation and data reuse to address this and to understand deeper patterns of data reuse. Our conclusions:

After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported.  We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data.  Other factors that may also contribute to the citation boost are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

TV: Awareness of data archiving and its importance for the progress of science has increased massively over the past five years, but very few organizations have actually introduced mandatory archiving policies. What do you see as the remaining obstacles?

HP: Great question. I don’t know. Someone should do a study!  Several journals have told me it is simply not a high priority for them: it takes time to write and decide on a policy, and they don’t have time.  Perhaps wider awareness of the Joint Data Archiving Policy will help.

Some journals are afraid authors will choose a competitor journal if they impose additional requirements. I’m conducting a study to monitor the attitudes, experiences, and practices of authors in journals that have adopted JDAP policy and similar authors who publish elsewhere.  The study will run for 3 years, so although I have more than 2500 responses there is still another whole year of data collection to go.  Stay tuned :)

Keep an eye on Journal Research Data Policy Bank (JoRD) to stay current on journal policies for data archiving.

Funders, though.  Why aren’t more funders introducing mandatory public data archiving policies (with appropriate exceptions)?  I don’t know.  They should.  Several are taking steps towards it, but golly it is slow.  Is anyone thinking of the opportunity cost of moving this slowly?  More specific thoughts in my National Science Foundation RFI response with coauthor Todd Vision.

TV: You’re a big advocate of ‘open notebook’ science. How did you first get interested in working in this way?

HP: I was a grad student, hungry for information.  I wanted to know if everyone’s science looked like my science.  Was it messy in the same ways?  What processes did they have that I could learn from?  What were they excited about *now* — findings and ideas that wouldn’t hit journal pages for months or years?

This was the same time that Jean-Claude Bradley was starting to talk about open notebook science in his chemistry lab.  I was part of the blogosphere conversations, and had a fun time at ISMB 2007 going around to all the publisher booths asking about their policies on publishing results that had previously appeared on blogs and wikis (my blog posts from the time; for a current resource see the list of journal responses maintained by F1000 Posters).

TV: It’s clearly a good way to work for people whose work is mainly analysis of data, but how can the open notebook approach be adapted to researchers who work at the bench or in the field?

HP: Jean-Claude Bradley has shown it can work very well in a chemistry lab.  I haven’t worked in the field, so I don’t want to presume to know what is possible or easy: I’m guessing in many cases it wouldn’t be.  That said, more often than not, where there is a will there is a way!

TV: Given the growing concerns over the validity of the results in scientific papers, do you think that external supervision of scientists (i.e. mandated open notebook science) would ever become a reality?

HP: I’m not sure.  Such a policy may well have disadvantages that outweigh its advantages.  It does sound like a good opportunity to do some research, doesn’t it?  A few grant programs could have a precondition that the awardees be randomized to different reporting requirements, then we monitor and see what happens. Granting agencies ought to be doing A LOT MORE EXPERIMENTING to learn the implications of their policies, followed by quick and open dissemination of the results of the experiments, and refinements in policies to reflect this growing evidence-base.

TV: You’re involved in a lot of initiatives at the moment. Which ones are most exciting for you? 

HP: ImpactStory.  The previous generation of tools for discovering the impact of research are simply not good enough.  We need ways to discover citations to datasets, in citation lists and elsewhere.  Ways to find blog posts written about research papers — and whether those blog posts, in turn, inspire conversation and new thinking.  We need ways to find out which research is being bookmarked, read, and thought about even if that background learning doesn’t lead to citations.  Research impact isn’t the one dimensional winners-and-losers situation we have now with our single-minded reliance on citation counts: it is multi-dimensional — research has an impact flavour, not an impact number.

Metrics data locked behind subscription paywalls might have made sense years ago, when gathering citation data required a team of people typing in citation lists.  That isn’t the world we live in any more: keeping our evaluation and discovery metrics locked behind subscription paywalls is simply neither necessary nor acceptable.  Tools need to be open, provide provenance and context, and support a broad range of research products.

We’re realizing this future through ImpactStory: a nonprofit organization dedicated to telling the story of our research impact.  Researchers can build a CV that includes citations and altmetrics for their papers, datasets, software, and slides: embedding altmetrics on a CV is a powerful agent of change for scholars and scholarship.  ImpactStory was co-founded by me and Jason Priem, is funded by the Alfred P. Sloan Foundation while we become self-sustaining, and is committed to building a future that is good for scholarship.  Check it out! And contact us if you want to learn more: team@impactstory.org.

Thanks for the great questions, Tim!


PubMed and GenBank, from the National Center for Biotechnology Information (NCBI), are hugely popular resources for searching and retrieving article abstracts and nucleotide sequence data, respectively.  PubMed indexes the vast majority of the biomedical literature, and deposition of nucleotide sequences in GenBank or one of the other INSDC databases is a near universal requirement for publication in a scientific journal.

Thanks to NCBI’s “LinkOut” feature, it is now easy to find associated data in Dryad from either PubMed or GenBank. For example, this Dryad data package is linked from:

  • the article’s abstract in PubMed: “LinkOut” is at the bottom of the page; expand “+” to see the links to Dryad and other resources.
  • nucleotide data associated with the same publication in GenBank: “LinkOut” is in the right-hand navigation bar.

LinkOut allows the data from an article to be distributed among repositories without compromising its discoverability.
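These LinkOut associations can also be queried programmatically through NCBI’s public E-utilities. As a minimal sketch (the `elink` endpoint and its `cmd=llinks` mode are documented E-utilities features, but the PubMed ID below is purely illustrative), building such a query looks like this:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def linkout_query(record_id, dbfrom="pubmed"):
    """Build an elink URL whose XML response lists the LinkOut
    providers (Dryad among them, where a link exists) for one record."""
    params = {"dbfrom": dbfrom, "id": str(record_id), "cmd": "llinks"}
    return f"{EUTILS}?{urlencode(params)}"

# Illustrative PubMed ID only; fetching this URL returns the provider list.
print(linkout_query(12345678))
```

The same function with `dbfrom="nuccore"` would target a GenBank nucleotide record instead of a PubMed abstract.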

At Dryad, we intend to expand on this feature in a couple of ways. First, we plan to make Dryad content searchable via PubMed and GenBank identifiers, which, because of their wide use, will provide a convenient gateway for other biomedical databases to link out to Dryad.  Second, we will be using open web standards to expose relationships between content in Dryad and other repositories, not just NCBI.  For example, keen eyes may have noted the relationship of the Dryad data package in the example above to two records in TreeBASE.

To learn more about how Dryad implements NCBI’s LinkOut feature, please see our wiki.

