Why Scientists Don't Share Data and How to Fix It

Towards a Better Science

Recently, there's been a highly encouraging push towards a more open science with a greater emphasis on reproducibility, spurred by a few high-profile retractions and a growing awareness of the slippery nature of several scientific findings.

This slow shift has important implications. While in theory journals encourage or require authors to share published data with anyone who requests it, in practice obtaining the data is often very difficult or even unpleasant, requiring multiple requests to journal editors to compel the authors to share the raw data. Why is data sharing so controversial and complex?

Incentives and Data Sharing

Andrew Gelman wrote about this from the point of view of political science and statistics, both on his blog (1, 2) and here. As he discusses, part of the problem is technical, and part is a matter of misaligned incentives.

The technical part of the problem is that sharing big data sets is very complex and, until recently, getting funding to work on infrastructure was very difficult. Time spent figuring out how to put a big data set online in a way that can be productively mined has potentially very little return. Thankfully, there's some effort on this front, with Titus Brown and the Moore Foundation stepping in to hopefully make this easier.

The other obstacle in the way of greater data sharing is a matter of incentives. Putting raw data in a nice format and annotating it properly takes a lot of time and yields very few tangible rewards. PIs often build careers from the results of a long-running study, and sharing all the data can put their competitive advantage at risk. In biology, there's an additional element at work: many journals (the big three especially) overwhelmingly favour papers with new experiments over papers that obtain novel results from existing datasets.

This bias towards new experimental results is highly counterproductive, and puts scientists in a terrible spot. In the constant quest to maximize impact (a necessity in the current funding climate), scientists have to decide how best to spread the results from a big study across the largest possible number of papers, making sure that no paper undercuts the novelty of any of the others. For an example of this, look at the collection of papers from the ENCODE project, where an immense amount of planning went into figuring out how to maximize the number of first-author papers from the study.

What Makes a Paper New?

This emphasis on associating new experiments with novelty is largely a holdover from a previous era, when thinking of the critical experiment and performing it were far more difficult than analyzing the resulting data. With the new data sets trickling in from high-throughput experiments, obtaining insight from data is often more challenging than obtaining the data in the first place, and publishing guidelines should adjust to this new reality.

Although this is a very complex topic, I'd like to offer a few simple recommendations that don't require seismic shifts:

  • Judge papers by how novel and robust their insights are, not how novel the data is.
  • Shift away from the idea of 'first author' and 'last author'. Part of what drives the need to produce multiple papers is that large multi-year projects need to produce enough first-author papers for all the postdocs and PhD students. Especially on papers that require both complex experiments and computational analysis, multiple people often contribute equally.
  • Finally, develop a better mechanism to reward people who make highly sophisticated datasets available online, something in between a citation and co-authorship.

While I was putting the finishing touches on this blog post, I was pointed to this article by Ioannidis on how to make more published research true. It's excellent, and everyone should read it, especially the part about the reward system.