IDCC poster submission: Data citation in the wild

Posted on September 13, 2010


The summer 2010 internships have concluded, but the pulling-it-all-together work continues. Here’s a poster abstract we’ve just submitted to the 6th International Digital Curation Conference 2010, 6–8 December 2010, Chicago, USA.

Data citation in the wild
Valerie Enriquez, Sarah Walker Judson, Nicholas M. Weber, Suzie Allard, Robert B. Cook, Heather A. Piwowar, Robert J. Sandusky, Todd J. Vision, Bruce Wilson

Consistent attribution of research data upon reuse is necessary to reward the original data-producing investigators, reconstruct provenance, and inform data sharing policies, tool requirements, and funding decisions. Unfortunately, norms for data attribution are varied and often weak. As part of the DataONE 2010 summer internship program, three interns studied the policies, practices, and implications of current data attribution behavior in the environmental sciences. We found that few policies recommend robust data citation practices: in our preliminary evaluation, only one-third of repositories (n=26), 6% of journals (n=307), and 1 of 53 funders suggested a best practice for data citation. We manually reviewed 500 papers published between 2000 and 2010 across six journals; of the 198 papers that reused datasets, only 14% reported a unique dataset identifier in their dataset attribution, and a partially overlapping 12% mentioned the author name and repository name. Few citations to the datasets themselves appeared in article reference sections. In multivariate analysis, citation patterns were more strongly correlated with repository (citations to GenBank being the most complete) than with journal or data type. Attribution patterns were steady over time. Consistent with these findings, dataset reuse was difficult to track through standard retrieval resources: searching by repository name retrieved many instances of data submission rather than data reuse; combing the citation history of data-creation articles was time-consuming; and searching citation databases for the few early-adopter dataset DOIs and HDLs in reference lists failed due to apparent limitations in database query capabilities and structured extraction of DOIs. We hope these descriptions of the current data attribution environment will highlight outstanding issues and motivate change in policy, tools, and practice. This research was done as open science; ask us about it!

Posted in: oadata, oanew