Anecdotal Perceptions of DataSharing

Posted on June 16, 2010 by


Maybe I come from an arcane science, but as I explained this internship to my stream ecologist peers at a conference last week, they responded with varying degrees of “huh?” OK, I understand this project is well outside the scope of insects and flow dynamics, but most of my scientific peers have an admittedly poor understanding of how to share their data with each other. Or even that they should at all. Many of them, even the fresh post-docs, have operated within their tiny institutional spheres and made heaps of Excel spreadsheets, but never thought to share them. The publication, the ever-coveted tenure currency, comes first and foremost, unfortunately to the neglect of good science on many levels. I could go on for days (maybe years) about how the hunger for publishing keeps universities from performing some of their most important theoretical roles, like enhancing their local community and producing well-rounded students. But that’s a discussion for another time and probably another blog entirely. For now, I’ll focus on the neglected aspect of transparent datasharing.

Part of the problem is that the word “dataset” is not colloquial in stream ecology. At the drop of a hat, the group of scientists I associate with can spout off the scientific names of species, complex modelling procedures, and technicalities of protocols. But when you ask them what their dataset looks like, they hem and haw.

First, the basic question: “Dataset? What, my raw data? That doesn’t mean anything alone.”  Well, actually, yes it does.

I prod them, “What if I wanted to redo your work for historical or validation purposes? How would I get your data?”

They say that I have their email and they can send it to me. But I know how this drill goes…I’ve been a lowly grad student for a while now. I email the PI on the project who holds all the data on his computer. If he even responds to my email, he sends the data three weeks after I needed it, in the messiest possible format, nearly unredeemable.

“Ok, so let’s assume I can get the data and it’s usable. Did you document all the steps and settings you used in analysis?”

“Well I didn’t, grad student X ran that data 5 years ago. She’s a whiz at it, talk to her.”

The conversation could go many different ways. The data is buried on a hard drive; it’s backed up but not readily accessible. The primary author’s email expires or he moves institutions. The scientist forgets how to navigate that messy spreadsheet, let alone the analysis protocol. And so on.

But what I found most discouraging as I talked to stream ecologists young and old was simply a lack of awareness (the bleeding heart problem of all worthy causes). There are a few believers, mostly the macro-ecologists interested in meta-analyses, but they don’t know where to turn for data and mostly rely on government databases. The majority understand GenBank, but many don’t use genetics in their work. Others see datasharing as another hoop to jump through; they briefly recalled a journal that encouraged them to do it but couldn’t remember whether they had or not. Still more hadn’t heard of it and were worried about its implications, primarily data stealing/mining that wouldn’t give them appropriate credit. Resolving these issues in the minds of the scientists themselves is perhaps the major philosophical goal of this project.

Maybe the real problem is that I didn’t explain my project well. So, just for the record, here it is (which is really the whole point of this post):

From an article-centric approach, I’m looking at practices of data reuse and sharing. Basically, I spend my days reading articles and noting if and how datasets are cited (or the lack thereof!). I also collect data about whether the article is open access, what discipline it stems from, and where the data is supposedly deposited. All this will be used to look at trends over time, between disciplines, and in relation to journal dataset citation policies. I also make qualitative notes on oddities I encounter (e.g. super bad citation habits), which will contribute to a list of recommended best practices for dataset citation…the dos and don’ts for authors and a baseline measure for journals of how well their policies are followed “on the ground”.

For the long version, see here and here.

And with that, I leave you all to comment away. We’d like to encourage all readers to comment on this blog and/or on our OpenWetWare notebooks. That can be anything from simply leaving your name/affiliation and what brought you here (especially if you are outside of DataONE) to comments on our proposed protocols and outcomes. Thanks for reading!

Posted in: Data Sharing