tracing data through the archives

Posted on July 13, 2010 by


Upon tedious perusal of many articles, I’ve come to form some strong opinions on what makes data traceable.

At a basic level, I simply wish authors were clearer (it sure would make my data extraction easier!). This requires transparency on the part of the scientist: not just a willingness to be open, but an explicit, clear statement of what data is available and where it can be found. There is a general assumption in the scientific community that if someone really wants the data, they will contact the original author. Authors commonly include this caveat, especially when their data is not posted elsewhere.

But the fact of the matter is that accumulating data from multifarious sources is a difficult process. Personally, I have spent many hours trying to track down GIS layers by emailing the original authors or digging through web archives. Granted, I was working in a relatively understudied area, but my literature searches produced a fair number of potential sources, and most led to dead ends: nonexistent URLs, outdated email addresses, etc. Even when working within my established group of collaborators (not DataONE of course!), it could take me weeks of emails and reminders to get my hands on a simple spreadsheet. If I couldn’t get an email response and accompanying dataset from a colleague, what are my chances of getting it from a scientist I’ve never met? So it’s nice if an author hand-wavingly says they’ll provide their data, but it’s much better if they post it, document it, and state in their publication that they did so.

With that said, there are a number of better practices that can be implemented at the journal, editor, and author levels to make data sharing less cumbersome. Here are some of my preliminary suggestions on good citation practices that enable data to be traceable and truly reusable:

  • Accession numbers and authors of each dataset (reused and shared) should be given in the Methods
  • Alternatively, for large datasets posted in multiple places, a Table or Appendix referenced in the Methods should be given
  • Authors should not be charged extra page fees for including full bibliographic citations for all original data authors, or a table of relevant accession numbers and references
  • Proposed Bibliographic Citation Format:
  • Author. Year. Article Title. Journal. Pages. Depository. Accession Number.
  • This is probably old news in this field, but this format seems most intuitive to me and would have been helpful in my data extraction. I think it would be useful to lump the original article and the dataset together, since most instances of reuse give only the Author Year citation when they are really referring to the dataset, yet the article is still needed for context. This would also allow authors to track their dataset reuse through ISI and other aggregators.
  • Editorial enforcement
  • I envision this as a simple checklist for the copy editors (not reviewers) to confirm that accession numbers and/or an author reference are given for all reused and shared datasets.
  • As an initial step, this is most easily done for GenBank and Treebase, which have clear-cut dataset types (sequences and alignments/trees, respectively). Especially in journals like Molecular Ecology and Systematic Biology, nearly every paper should have at least one of these citations.
  • Internal (Journal) Depositories made more accessible
  • Data should be made available in usable formats, not just static PDFs.
  • Data should have a unique and stable URL. This is especially a problem in Systematic Biology where all data is said to be stored at which is not unique for an article. Furthermore, data from before 2008 cannot be found at this site (which is referenced in all pre-2008 articles), but can be found at  under Appendices and data. Both entry points require searching for the data rather than a direct connection such as the URLs provided by Ecological Archives.
  • Separate Supplementary Data Section
  • In general, there is confusion among authors about what “Supplementary data” is. Often, it is treated as a data dump for extra figures and statistical outputs, not raw data.
  • Journals could and should have a separate section for accompanying raw data.
  • For example: Molecular Ecology has a Supplementary data section at the end of each article, with a one-sentence description of what is contained in each appendix. American Naturalist recently added an “Online enhancements” header at the top of each article, which often provides links to shared datasets. Systematic Biology has a separate section at the end of the article, which is a good first step, but as it stands it just contains the same blanket statement along the lines of: ‘data is available at the SysBio website…’
  • I propose that this additional section contain brief summary info and URL/accession for shared datasets, as well as reused data.
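
To make the proposed citation format and the copy-editor checklist a bit more concrete, here is a rough Python sketch of how I imagine them working together. Everything in it is an assumption made for illustration: the class name, the field names, and the example values are hypothetical placeholders, not a real article, depository, or accession number, and no journal actually uses this code.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DatasetCitation:
    """One article-plus-dataset reference (hypothetical structure)."""
    author: str
    year: str
    title: str
    journal: str
    pages: str
    depository: str
    accession: str

    def format(self) -> str:
        # Author. Year. Article Title. Journal. Pages. Depository. Accession Number.
        parts = [getattr(self, f.name).strip().rstrip(".") for f in fields(self)]
        return ". ".join(parts) + "."

    def missing(self) -> List[str]:
        # Checklist helper: names of the fields that were left blank.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Placeholder values only; not a real article or accession number.
example = DatasetCitation(
    author="Doe J",
    year="2009",
    title="An example article",
    journal="Example Journal",
    pages="1-10",
    depository="ExampleRepo",
    accession="EX000001",
)
print(example.format())
print(example.missing())
```

A copy editor's check would then amount to confirming that `missing()` comes back empty for every reused or shared dataset. The point of keeping the article and dataset fields in one record is the same as in the proposal above: the article gives the context, and the depository plus accession number gives the direct route to the data.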