An interview with Heather Piwowar: on data archiving, open notebook science, and discovering your impact flavor

A study providing new insights into the citation boost from open data has been released in preprint form on PeerJ by Dryad researchers Heather Piwowar and Todd Vision. The researchers looked at thousands of papers reporting new microarray data and thousands of cited instances of data reuse. They found that the citation boost, while more modest than seen in earlier studies (overall, ~9%), was robust to confounding factors, distributed across many archived datasets, continued to grow for at least five years after publication, and was driven to a large extent by actual instances of data reuse. Furthermore, they found that the intensity of dataset reuse has been rising steadily since 2003.

Heather, a post-doc based in Vancouver, may be known to readers of this blog for her earlier work on data sharing, her blog, her role as cofounder of ImpactStory, or her work to promote access to the literature for text mining. Recently Tim Vines, managing editor of Molecular Ecology and a past member of Dryad’s Consortium Board, managed to pull Heather briefly away from her many projects to ask her about her background and latest passions:

TV: Your research focus over the last five years has been on data archiving and science publishing. How did your interest in this field develop?

HP: I wanted to reuse data.  My background is in electrical engineering and digital signal processing: I worked for tech companies for 10 years, most recently a biotech developing predictive chemotherapy assays. Working there whetted my appetite for doing research, so I went back to school for my PhD to study personalized cancer therapy.

My plan was to use data that had already been collected, because I’d seen first-hand the time and expense that goes into collecting clinical trial data.  Before I began, though, I wanted to know whether the data in NCBI’s databases was good quality, deposited because highly selective journals like Nature often require data archiving, or whether it was instead mostly the dregs of research, because that was all investigators were willing to part with.  I soon realized that no one knew… and that it was important, and we should find out.  Studying data archiving and reuse became my new PhD topic, and my research passion.

My first paper was rejected from a High Profile journal.  Next I submitted it to PLOS Biology. It was rejected from there too, but they mentioned they were starting this new thing called PLOS ONE.  I read up on it (it hadn’t published anything yet) and I liked the idea of reviewing only for scientific correctness.

I’ve become more and more of an advocate for all kinds of open science as I’ve run into barriers that prevented me from doing my best research.  The barriers kept surprising me. Really, other fields don’t have a PubMed? Really, there is no way to do text mining across all scientific literature?  Seriously, there is no way to query that citation data by DOI, or export it other than page by page in your webapp, and you won’t sell subscriptions to individuals?  For real, you won’t let me cite a URL?  In this day and age, you don’t value datasets as contributions in tenure decisions?  I’m working for change.

TV: You’ve been involved with a few of the key papers relating data archiving to subsequent citation rate. Could you give us a quick summary of what you’ve found?

HP: Our 2007 PLOS ONE paper was a small analysis related to one specific data type: human cancer gene expression microarray data.  About half of the 85 publications in my sample had made their data publicly available.  The papers with publicly available data received about 70% more citations than similar studies without available data.

I later discovered there had been an earlier study in the field of International Studies — it has the awesome title “Posting your data: will you be scooped or will you be famous?”  There have since been quite a few additional studies of this question, the vast majority finding a citation benefit for data archiving.  Have a look at (and contribute to!) this public Mendeley group initiated by Joss Winn.

There was a significant limitation to these early studies: they didn’t control for several important confounders of citation rate (number of authors, for example).  Thanks to Angus Whyte at the Digital Curation Centre (DCC) for conversations on this topic.  Todd Vision and I have been working on a larger study of data citation and data reuse to address this and to understand deeper patterns of data reuse.  Our conclusions:

After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported.  We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data.  Other factors that may also contribute to the citation boost are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
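To make the “accounting for other factors” point concrete, here is a minimal sketch of the kind of regression that can separate an open-data effect from confounders such as author count and journal. This is illustrative only, not the actual analysis from the paper; the input file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: one row per paper, with a citation count, a 0/1
# open_data indicator, and candidate confounders as columns.
papers = pd.read_csv("papers.csv")

# Citation counts are overdispersed, so a negative binomial regression is a
# common choice. exp(beta) - 1 for the open_data coefficient approximates the
# proportional citation boost after adjusting for the other covariates.
model = smf.negativebinomial(
    "citations ~ open_data + n_authors + journal_impact_factor + pub_year",
    data=papers,
).fit()

print(model.summary())
boost = np.exp(model.params["open_data"]) - 1
print(f"Estimated citation boost from open data: {boost:.0%}")
```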

TV: Awareness of data archiving and its importance for the progress of science has increased massively over the past five years, but very few organizations have actually introduced mandatory archiving policies. What do you see as the remaining obstacles?

HP: Great question. I don’t know. Someone should do a study!  Several journals have told me it is simply not a high priority for them: it takes time to write and decide on a policy, and they don’t have time.  Perhaps wider awareness of the Joint Data Archiving Policy will help.

Some journals are afraid authors will choose a competitor journal if they impose additional requirements. I’m conducting a study to monitor the attitudes, experiences, and practices of authors in journals that have adopted the JDAP policy, and of similar authors who publish elsewhere.  The study will run for three years, so although I have more than 2,500 responses, there is still another whole year of data collection to go.  Stay tuned 🙂

Keep an eye on the Journal Research Data Policy Bank (JoRD) to stay current on journal policies for data archiving.

Funders, though.  Why aren’t more funders introducing mandatory public data archiving policies (with appropriate exceptions)?  I don’t know.  They should.  Several are taking steps towards it, but golly it is slow.  Is anyone thinking of the opportunity cost of moving this slowly?  More specific thoughts in my National Science Foundation RFI response with coauthor Todd Vision.

TV: You’re a big advocate of ‘open notebook’ science. How did you first get interested in working in this way?

HP: I was a grad student, hungry for information.  I wanted to know if everyone’s science looked like my science.  Was it messy in the same ways?  What processes did they have that I could learn from?  What were they excited about *now*: findings and ideas that wouldn’t hit journal pages for months or years?

This was around the same time that Jean-Claude Bradley was starting to talk about open notebook science in his chemistry lab.  I was part of the blogosphere conversations, and had fun at ISMB 2007 going around to all the publisher booths asking about their policies on publishing results that had previously appeared on blogs and wikis (my blog posts from the time; for a current resource, see the list of journal responses maintained by F1000 Posters).

TV: It’s clearly a good way to work for people whose work is mainly analysis of data, but how can the open notebook approach be adapted to researchers who work at the bench or in the field?

HP: Jean-Claude Bradley has shown it can work very well in a chemistry lab.  I haven’t worked in the field, so I don’t want to presume to know what is possible or easy; I’m guessing that in many cases it wouldn’t be easy.  That said, more often than not, where there is a will there is a way!

TV: Given the growing concerns over the validity of the results in scientific papers, do you think that external supervision of scientists (i.e. mandated open notebook science) would ever become a reality?

HP: I’m not sure.  Such a policy may well have disadvantages that outweigh its advantages.  It does sound like a good opportunity to do some research, doesn’t it?  A few grant programs could make it a precondition that awardees be randomized to different reporting requirements; then we could monitor and see what happens. Granting agencies ought to be doing A LOT MORE EXPERIMENTING to learn the implications of their policies, followed by quick and open dissemination of the results of those experiments, and refinements in policies to reflect this growing evidence base.

TV: You’re involved in a lot of initiatives at the moment. Which ones are most exciting for you? 

HP: ImpactStory.  The previous generation of tools for discovering the impact of research is simply not good enough.  We need ways to discover citations to datasets, in citation lists and elsewhere.  Ways to find blog posts written about research papers, and whether those blog posts, in turn, inspire conversation and new thinking.  We need ways to find out which research is being bookmarked, read, and thought about, even if that background learning doesn’t lead to citations.  Research impact isn’t the one-dimensional winners-and-losers situation we have now with our single-minded reliance on citation counts; it is multi-dimensional. Research has an impact flavour, not an impact number.
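One way to picture the flavour idea: impact as a vector of signals rather than a single count. This is a conceptual sketch only; the metric names are illustrative, not ImpactStory’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class ImpactFlavour:
    """A research product's impact as several signals, not one number."""
    citations: int
    mendeley_readers: int
    blog_mentions: int
    data_reuses: int

# A paper and a dataset with similar citation counts can have very
# different flavours; a single scalar would hide that difference.
paper = ImpactFlavour(citations=12, mendeley_readers=340,
                      blog_mentions=5, data_reuses=0)
dataset = ImpactFlavour(citations=9, mendeley_readers=8,
                        blog_mentions=0, data_reuses=27)

print(paper)
print(dataset)
```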

Metrics data locked behind subscription paywalls might have made sense years ago, when gathering citation data required a team of people typing in citation lists.  That isn’t the world we live in any more: keeping our evaluation and discovery metrics locked behind subscription paywalls is simply neither necessary nor acceptable.  Tools need to be open, provide provenance and context, and support a broad range of research products.

We’re realizing this future through ImpactStory: a nonprofit organization dedicated to telling the story of our research impact.  Researchers can build a CV that includes citations and altmetrics for their papers, datasets, software, and slides: embedding altmetrics on a CV is a powerful agent of change for scholars and scholarship.  ImpactStory was co-founded by Jason Priem and me, is funded by the Alfred P. Sloan Foundation while we become self-sustaining, and is committed to building a future that is good for scholarship.  Check it out, and contact us if you want to learn more: team@impactstory.org

Thanks for the great questions, Tim!
