
The following is a guest post from science journalist John Bohannon. We asked him to give us some background on his recent dataset in Dryad and the analysis of that data in Science. What stories will you find in the data? – EH

_______


Sci-Hub is the world’s largest repository of pirated journal articles. We will probably look back and see it as inevitable. Soon after it became possible for people to share copyrighted music and movies on a massive scale, technologies like Napster and BitTorrent arrived to make the sharing as close to frictionless as possible. That hasn’t made the media industry collapse, as many people predicted, but it certainly brought transformation.

Unlike the media industry, journal publishers do not share their profits with the authors. So where will Sci-Hub push them? Will it be a platform like iTunes, with journals selling research papers for $0.99 each? Or will Sci-Hub finally propel the industry into the arms of the Open Access movement? Will nonprofit scientific societies and university publishers go extinct along the way, leaving just a few giant, for-profit corporations as the caretakers of scientific knowledge?

There are as many theories and predictions about the impact of Sci-Hub as there are commentators on the Internet. What is lacking is basic information about the site. Who is downloading all these Sci-Hub papers? Where in the world are they? What are they reading?

48 hours of Sci-Hub downloads. Each event is color-coded by the local time: orange for working hours (8am-6pm) and blue for the night owls working outside those hours.
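The color rule in the caption can be written down as a small function. This is a minimal sketch, not the code used to make the animation: the function name `color_for_local_hour` is hypothetical, and it assumes that "8am-6pm" means 08:00 up to, but not including, 18:00 on a 24-hour clock.

```python
def color_for_local_hour(hour: int) -> str:
    """Return the map color for a download at the given local hour (0-23).

    Assumption: "working hours (8am-6pm)" covers 08:00 up to, but not
    including, 18:00. The original analysis may have treated the
    boundaries differently.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    return "orange" if 8 <= hour < 18 else "blue"


if __name__ == "__main__":
    # Print the color assigned to a few sample local hours.
    for h in (1, 8, 13, 18):
        print(h, color_for_local_hour(h))
```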

Sometimes all you need to do is ask. So I reached out directly to Alexandra Elbakyan, who created Sci-Hub in 2011 as a 22-year-old neuroscience graduate student in Kazakhstan and has run it ever since. For someone denounced as a criminal by powerful corporations and scholarly societies, she was quite open and collaborative. I explained my goal: to let the world see how Sci-Hub is being used, mapping the global distribution of its users at the highest resolution possible while protecting their privacy. She agreed, not realizing how much data-wrangling it would ultimately take us.

Two months later, Science and Dryad are publicly releasing a data set of 28 million download request records from 1 September 2015 through 29 February 2016, timestamped down to the second. Each includes the DOI of the paper, allowing as rich a bibliographic exploration as you have CPU cycles to burn. The 3 million IP addresses have been converted into arbitrary codes. Elbakyan converted the IP addresses into geolocations using a database I purchased from the company Maxmind. She then clustered each geolocation to the coordinates of the nearest city using the Google Maps API. Sci-Hub users cluster to 24,000 unique locations.
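For readers who want to start exploring, here is a minimal sketch of tallying download records by country or by DOI. The column layout (timestamp, DOI, anonymized user code, country, city, coordinates, tab-separated) and the three sample rows are assumptions made up for illustration, so check the files in the Dryad package before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, invented for this sketch; the real files
# may use different columns, ordering, or separators.
SAMPLE = (
    "2015-09-01 00:00:05\t10.1000/example.1\tuserA\tIndia\tDelhi\t28.6,77.2\n"
    "2015-09-01 00:00:09\t10.1000/example.2\tuserB\tIran\tTehran\t35.7,51.4\n"
    "2015-09-01 00:00:12\t10.1000/example.1\tuserC\tIndia\tMumbai\t19.1,72.9\n"
)

# Assumed column names, in the assumed order.
FIELDS = ["timestamp", "doi", "user_code", "country", "city", "coords"]


def read_records(handle):
    """Return a reader yielding one dict per tab-separated record."""
    return csv.DictReader(
        handle, fieldnames=FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )


def count_by(rows, key):
    """Count records per distinct value of the given column."""
    return Counter(row[key] for row in rows)


if __name__ == "__main__":
    records = list(read_records(io.StringIO(SAMPLE)))
    print(count_by(records, "country").most_common())
    print(count_by(records, "doi").most_common(1))
```

To run this against a real file, pass an open file handle to `read_records` in place of the `io.StringIO` wrapper, after adjusting `FIELDS` to match the actual columns.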

The big take-home? Sci-Hub is everywhere. Most papers are being downloaded from the developing world: The top 3 countries are India, China, and Iran. But the rich industrialized countries use Sci-Hub, too. A quarter of the downloads came from OECD nations, and some of the most intense download hotspots correspond to the campuses of universities in the US and Europe, which supposedly have the most comprehensive journal access.

But these data have many more stories to tell. How do the reading habits of researchers differ by city? What are the hottest research topics in Indonesia, Italy, Brazil? Do the research topics shift when the Sci-Hub night owls take over? My analysis indicates a bimodal distribution over the course of the day, with most locations surging around lunchtime, and the rest peaking at 1am local time. The animated map above shows just 2 days of the data.

Something everyone would like to know: What proportion of downloaded articles are actually unavailable from nearby university libraries? Put another way: What is the size of the knowledge gap that Sci-Hub is bridging?

Download the data yourself and let the world know what you find.

The data:

http://dx.doi.org/10.5061/dryad.q447c

My analysis of the data in Science:

http://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone

 

 — John Bohannon

While gearing up for the Dryad member meeting (to be held virtually on 24 May – save the date!) and publication of our annual report, we’re taking a look at last year’s numbers.

2015 was a “big” year for Dryad in many respects. We added staff, and integrated several new journals and publishing partners. But perhaps most notably, the Dryad repository itself is growing very rapidly. We published 3,926 data packages this past year — a 44% increase over 2014 — and blew past the 10,000 mark for total data packages in the repository.

Data package size

Perhaps the “biggest” Dryad story from last year is the increase in the mean size of data packages published. In 2014, that figure was 212MB. In 2015, it more than doubled to 481MB, an increase of a whopping 127%.
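The 127% figure follows directly from the two means: (481 − 212) / 212 ≈ 1.27. As a quick check, using a helper whose name `percent_increase` is purely illustrative:

```python
def percent_increase(old: float, new: float) -> float:
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # Mean data package size in MB for 2014 and 2015, as quoted above.
    print(round(percent_increase(212, 481)))
```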

This striking statistic is part of the reason we opted at the beginning of 2016 to double the maximum package size before overage fees kick in (to 20GB), and simplified and reduced our overage fees. We want researchers to continue to archive more (and larger) data files, and to do so sustainably. Meanwhile, we do continue to welcome many submissions on the smaller end of the scale.


Distribution of Dryad data package size by year. Boxplot shows median, 1st and 3rd quartiles, and 95% confidence interval of median. Note the log scale of the y-axis.

In 2015, the mean number of files in a data package was about 3.4, with 104 as the largest number of files in any data package. To see how times have changed, compare this to a post from 2011 (celebrating our 1,000th submission), where we noted:

Interestingly, most of the deposits are relatively small in size. Counting all files in a data package together, almost 80% of data packages are less than one megabyte. Furthermore, the majority of data packages contain only one data file and the mean is a little less than two and a half. As one might expect, many of the files are spreadsheets or in tabular text format. Thus, the files are rich in information but not so difficult to transfer or store.

We have yet to do a full analysis of file formats deposited in 2015, but we see among the largest files many images and videos, as would be expected, but also a notable increase in the diversity of DNA sequencing-related file formats.

So not only are there now more and bigger files in Dryad, there’s also greater complexity and variety. We think this shows that more people are learning about the benefits of archiving and reusing multiple file types, and that researchers (and publishers) are broadening their view of what qualifies as “data.”

Download counts

So who had the biggest download numbers in 2015? Interestingly, nearly all of last year’s most-downloaded data packages are from genetics/genomics. Three of the top five are studies of specific wild populations and how they adapt to changing circumstances: Sailfin Mollies (fish), blue tits (birds), and bighorn sheep.

Another top package presents a model for dealing with an epidemic that had a deadly impact on humans in 2015. And rounding out the top 5 is an open source framework for reconstructing the relationships that unite all lineages — a “tree of life.”

In 5th place, with 367 downloads:

In 4th place, with 601 downloads:

In 3rd place, with 1,324 downloads:

In 2nd place, with 1,868 downloads:

And this year’s WINNER, with 2,678 downloads:

The above numbers are presented with the usual caveats about bots, which we aim to filter out but cannot catch with perfect accuracy. (Look for a blog post on this topic in the near future.)

As always, we owe a huge debt to our submitters, partners, members and users for supporting Dryad and open data in 2015!

Dryad is very pleased to announce more integrations from charter member and partner Oxford University Press Journals. Oxford University Press (OUP) publishes over 300 journals, many with the support of learned societies. As part of Oxford University, OUP brings a rich history of working with researchers.

OUP has integrated seven more journals with Dryad, all of which can provide secure links to data during the peer review process:


  • Behavioral Ecology – published on behalf of the International Society for Behavioral Ecology, Behavioral Ecology publishes studies on the whole range of behaving organisms, including plants, invertebrates, vertebrates, and humans. Data publication is sponsored for Behavioral Ecology authors.
  • BioScience – published on behalf of the American Institute of Biological Sciences, BioScience has been publishing current research in Biology since 1964.
  • Environmental Epigenetics is an open access journal that publishes research in any area of science and medicine related to the field of epigenetics.
  • Toxicological Sciences is the official journal of the Society of Toxicology and publishes influential research in toxicology. Data publication is sponsored for Toxicological Sciences authors.
  • Journal of Urban Ecology is an open access journal which covers all aspects of urban environments. This includes the biology of the organisms that inhabit urban areas, the diversity of ecosystem services, and human social issues encountered within urban landscapes.
  • Virus Evolution serves the community of virologists, evolutionary biologists and ecologists who are interested in the genetic diversity and evolution of non-cellular forms of life.
  • Work, Aging and Retirement reflects a broad community of professionals in the fields of psychology, sociology, economics, gerontology, business and management, and industrial labor relations.

Integration with Dryad ensures bidirectional links between the article and the data, and increased visibility for both. It also simplifies the process of data submission for authors. All data in Dryad is reviewed by professional curators who perform basic checks to ensure discoverability and proper metadata, and becomes freely accessible online once approved.

Oxford University Press is increasing its commitment to authors and to quality by making it easy to publish datasets alongside the manuscript, and by allowing data to be available during the peer review process.

We’re delighted to build our partnership with Dryad by integrating this set of OUP journals. Providing authors with a simple and user-friendly route to data sharing helps to increase transparency and reproducibility of published research, and ultimately must be good for science. We hope to integrate more of our journals in the near future.

– Jennifer Boyd, Senior Publisher Life Science Journals, OUP

To learn more about journal integration with Dryad and DPCs, contact us.

We are delighted to announce the launch of a new partnership with The Company of Biologists to support their authors in making the data underlying their research available to the community.

The Company of Biologists is a not-for-profit publishing organization dedicated to supporting and inspiring the biological community. The Company publishes five specialist peer-reviewed journals:

The Company of Biologists offers further support to the biological community by facilitating scientific meetings, providing travel grants for researchers and supporting research societies.

Manuscript submission for all COB journals is now integrated with data submission to Dryad, meaning COB authors can conveniently submit their data packages and manuscripts at the same time. Dryad then makes the data securely available to journal reviewers, and releases them to the public if/when the paper is published.

We congratulate The Company of Biologists on taking this important step to help facilitate open data. To learn more about how your organization or journal can partner with Dryad, please contact us.


Over the last few years, we’ve learned a lot about what is needed to curate, preserve, and provide access to data for the long term, as well as to sustain an independent not-for-profit organization. We’ve also paid close attention to the needs and wants of our user community and members. To meet these needs, we are revising our pricing structure for the first time since it was introduced in 2013.

  • Submissions initiated after 4 January 2016 will have a base Data Publication Charge (DPC) of US$120.
  • Pricing is now the same for all journals – there will no longer be an additional surcharge for non-integrated publications.
  • We encourage individuals and small groups to purchase bundles of DPC vouchers in advance and in any quantity. Purchases over 25 DPCs will enjoy a discount.
  • As a further user benefit, we will be doubling the maximum package size before overage fees kick in (to 20GB) and simplifying and reducing the overage fees.
  • We will continue to waive DPCs for researchers from World Bank low-income and low-middle-income economies upon request.
  • Membership fees are not changing, but Dryad members will be entitled to receive larger discounts on DPCs.
  • As always, there are no fees to download or reuse data from Dryad.
  • Integrating Dryad’s system with partner journals remains a free service.

Dryad’s Board of Directors will continue to keep a close eye on the repository’s sustainability progress. We anticipate this price structure will remain stable for the foreseeable future and are always seeking opportunities for savings and efficiencies.

We are grateful to our community supporters and take seriously the responsibility to ensure the long-term availability of the research data entrusted to us.

Prepaid data submission vouchers can be purchased at current pricing levels ($80 apiece) through January 4th (and at the new price of $120 apiece after that), by contacting help@datadryad.org.

Payment plans are either subscription or usage-based. Organizations and individuals may also make advance purchases of any number of DPCs and are eligible for bulk discounts for purchases of 25 or more.

What exactly do your DPCs cover?

The following breakdown of expenses reflects projected costs in the near future, extrapolating from historic growth rates. Approximately half of costs are associated with Repository Management, including membership-based nonprofit governance, communications with Dryad’s many stakeholders, members and partners, and upkeep of software systems (Repository Maintenance). Another quarter of the costs are due to the curation and user support provided to each data package, part of Dryad’s unique service offering and commitment to quality.

Since Dryad is a virtual organization, Infrastructure & Facilities largely covers server costs, digital storage, and interoperability technologies such as Digital Object Identifiers (DOIs). A small fraction goes to community outreach activities to help encourage data publication best practices and raise awareness of Dryad. Administrative Support covers essential functions such as accounting and contract review.

Finally, Research and Development is essential for building new features to support changing technology and user expectations. R&D expenses are included here, but would ordinarily be covered through special project grants and not considered an operating expense paid for through DPCs.

We expect that as efficiencies are put into place, volume increases, and further economies of scale are realized, the percentage of the DPC supporting Repository Management will decrease and other areas, most notably Curation, will increase.


Dryad is very pleased to announce a new partnership with the University of Chicago Press – Journals Division. Founded in 1890, Chicago is one of the oldest continuously operating university presses in the United States and currently the largest. Chicago has recently integrated two additional journals with Dryad, Physiological and Biochemical Zoology (PBZ) and International Journal of Plant Sciences (IJPS), and is sponsoring Data Publication Charges (DPCs) for both titles. PBZ and IJPS join their sister publication, The American Naturalist, a Dryad partner since its inception.

Integration with Dryad

  • Ensures bidirectional links between the article and the data, and increased visibility for both
  • Simplifies the process of data submission for authors
  • Takes advantage of Dryad’s professional curators who perform basic checks to ensure discoverability and proper metadata
  • Ensures that the data is freely accessible once the article becomes available online

Physiological and Biochemical Zoology publishes original research in the areas of animal physiology and biochemistry. PBZ focuses on ecological, evolutionary and behavioral aspects of morphological, physiological, and biochemical mechanisms. PBZ’s integration will allow authors to make their data available to journal editors during peer review.

The International Journal of Plant Sciences has been publishing plant science research since 1875. IJPS covers a wide range of topics including genetics and genomics, developmental and cell biology, biochemistry and physiology, morphology and anatomy, systematics, evolution, paleobotany, ecology, and plant-microbe interactions. IJPS will accept data from authors at the time of article acceptance.

The University of Chicago Press – Journals Division is increasing its commitment to authors and the STM field by making it easy to publish datasets alongside the manuscript, and by taking the extra step of covering the cost of data publication on behalf of authors. To learn more about journal integration with Dryad and DPCs, contact us.

Did you ever wonder what goes on behind the scenes when Dryad curators review data files submitted by authors?  There are no wizards behind our curtains, just real live information specialists and trained data curators.

by Kaptain Kobold via Flickr


Dryad’s curation process is intentionally lightweight, so it doesn’t delay the availability of the data. Curators don’t review the scientific merit of the files – that is left to peer reviewers and the scientific community. Instead, we rely on our curators’ expertise in library and information science to ensure the integrity and preservation of the data.

Curators perform basic checks on each submission (can the files be opened? are they free of copyright restrictions? do they appear to be free of sensitive data?). The completeness and correctness of the metadata are checked and the DOI is officially registered. During their work, Dryad curators encounter thousands of data files in any number of file formats. Our team examines all of these data files to ensure they do, in fact, include data, and not manuscripts or pictures of kittens.
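The first of those checks, whether a file can be opened at all, is easy to approximate in code. The sketch below is illustrative only and is not Dryad's curation tooling: the function name `basic_file_check` and the choice to flag empty files are assumptions. Questions about copyright and sensitive data still need a human curator.

```python
from pathlib import Path
from typing import List


def basic_file_check(path: str) -> List[str]:
    """Return a list of problems found with the file (empty if none).

    Hypothetical helper: it only confirms the path is a regular file,
    is not empty, and can be read. It does not inspect the contents.
    """
    p = Path(path)
    if not p.is_file():
        return [f"{path}: not a regular file"]

    problems = []
    if p.stat().st_size == 0:
        problems.append(f"{path}: file is empty")
    try:
        # Reading a small chunk is enough to confirm the file opens.
        with p.open("rb") as handle:
            handle.read(1024)
    except OSError as exc:
        problems.append(f"{path}: cannot be read ({exc})")
    return problems
```

Calling `basic_file_check` on each file in a submission and collecting the non-empty results gives a short list of items worth a closer look.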

Curators may communicate directly with submitters to address issues and/or to make suggestions about enhancing the description and reusability of the data package. They can also create new versions of data packages should corrections or additions be needed after archiving. Ultimately, the responsibility for the content of the files rests with the submitters, but Dryad’s curators can help to catch and fix many common problems – and some rare ones, too.

Since Dryad’s inception, curation operations have been led by the Metadata Research Center (MRC), directed by Dr. Jane Greenberg, initially at the University of North Carolina at Chapel Hill and now at Drexel University. The team is supervised by Senior Curator Erin Clary, and all of its current members are students in, or graduates of, Library and Information Science (LIS) or Informatics Master’s programs.

So, (wizard) hats off to all our behind-the-curtains data curators, whose vital contributions ensure that the data in the repository is findable and usable. If you have a question about Dryad curation or need advice on preparing your data for archiving, don’t hesitate to email us at curator@datadryad.org.
