Why Share Data?

Posted on June 29, 2010 by


Raw data is the currency of science. It “pays” the publication bills and fuels the inquiry. But once the publication is obtained and logged on the tenure checklist, what good is the data? A busy scientist might say the data has served its purpose and bury it in a cemetery of computer files. Another could argue that the real meaning of the data lies in the results and the future work they enable. But to a macro-ecologist, evolutionary biologist, economist, or other long-term and large-scale trend analyst, the raw data is everything. In such fields, there is never enough raw data.

And thus, scientists go about mining data: pilfering published articles, scanning internet resources, pestering their colleagues for dormant data. At times there is a preponderance of data, but it is unstandardized and therefore unusable without running an obstacle course of statistics and normalizations. At other times, there is no accessible data even though numerous publications allude to existing data. So, what is a meta-analyst to do?

Well, it is our hope in this project that we can establish good practices of data reuse and sharing: good metadata for data reusability, common depositories for data access, and proper citation of data to give credit where it is due. It is also our long-term hope that data will be cited and recognized on the same level as publications, both for the benefit of the meta-analyst seeking data and the professor seeking tenure, not to mention the amateur and citizen scientists seeking to solve local issues.

Now, let me step back from the philosophical benefits of data reuse and recount some examples I have encountered that illustrate the need for good practices (and which inspired this post in the first place). Some of the very first articles I analyzed aptly illustrated the shortcomings in current data reuse, but also the sustained interest in utilizing such a vast resource. A number of articles credited data sources haphazardly, and authors predominantly cited their own previous data at the expense of other sources. There are too many of these situations to count. Overall, there is a lack of consistency in how datasets are cited and made retraceable for future use.

A more concrete and poignant example is an instance of an author seeking to utilize Treebase (http://www.treebase.org) datasets. In order to test a proposed method, the author sought the original data matrices of a number of phylogenetic studies. In doing so, the author provided a mini commentary on the arduous search-and-sieve process of finding quality data. First, effective keywords had to be ascertained. This is essential to all literature searches, but as Valerie well knows, it is even more critical in locating datasets. Then, the author sifted through the initial results, only to find that many files posted on Treebase lacked the necessary metadata, which precluded re-analysis. After examining the available data, the author settled on a pitifully small number of case studies to illustrate the proposed method. Ironically, after all the trouble he went through, the author did not provide references or accession numbers for the datasets utilized, nor did he share the compiled dataset.

Often, authors are discouraged from articulating the “dead-ends” in their research because doing so distracts from the intent of the paper. But in this case, and for my purposes, this was the most informative part of the manuscript. It illustrates the struggle of finding, accessing, and sharing data. There is interest in reusing data, and this author expressed frustration that it was not more accessible, yet it feels like there is so far to go. At some level, despite the efforts of depositories, journals, and entities like DataONE, it is ultimately the job of each scientist to take responsibility for meeting this demand by posting their data, utilizing and crediting others' data, and being transparent in the process. Though this author lamented the sad state of Treebase reuse, he did not do a “good turn” in kind by sharing his data or explicitly naming his data sources. I believe that the best remedy to the demand for data is openness and sharing on the part of individual scientists.

Luckily, examples of this exist as well. Two articles that I recently read described the difficult process of extracting various parameters from 50+ publications. I sympathized, having performed the same process myself. In both articles, it at first seemed as though the authors were detailing their extraction methods so that others could perform a similar process. As I read through the lengthy methods, I fully anticipated that since they covered their methodology so thoroughly, they surely wouldn’t post the actual dataset. Yet, in both instances, I was delighted when a simple sentence indicated that the data was available online. I was even more pleased when I downloaded the files and discovered that they were in a usable format that also indicated the original manuscripts. The detailed methods and cross-references were the necessary metadata for replication or re-analysis, and the raw data now sits comfortably on my personal desktop (should I ever decide to investigate the ecology of orchids or bats).

So, there is hope. And much of it stems from the initiative of scientists. Though I have not yet quantitatively assessed trends, it seems that many instances of reuse or sharing occur for personal, unarticulated reasons. Perhaps a scientist benefitted from data sharing in the past or simply believes in the transparency of science. Conversely, a scientist may have had their data preemptively published by a greedy colleague and therefore hoards it. On both sides of the fence, there are risks and benefits. Regardless of individual experience, with each publication there are more calls for data to answer deeper and broader questions. I have hope that scientists will rise to the challenge of meeting this demand, and that our work can aid them in navigating the jungle of raw data that awaits.