On keeping an open science journal

Posted on June 23, 2010

This certainly isn’t the diary I had in middle school with the broken lock from when I lost the key. For one thing, I wouldn’t mind if anyone actually reads it.

It’s hard for me to imagine a time before the internet. Then again, what I have pictured usually involves Thomas Edison being a jerk and trying to vandalize Nikola Tesla’s OpenWetWare pages. With the rise of Open Notebook Science, does this mean that the days of academic rivalry are done? I would like to think that the new openness encourages collaboration as opposed to competition. I like checking my talk pages to see if anyone’s left any comments or suggestions, considering how much of a n00b I am at this sort of thing.

OK, this is verging on my insecure middle school ramblings and is thus not relevant to science.

I’ve only kept this notebook for about two weeks and it’s interesting to see how notes and casual observations I had made are slowly coming together to form solid ideas about how scientific data is re-used and why it’s so gosh-darn hard to track. Granted, some of my search strings are probably laughable to the more seasoned informaticians out there, but they’re growing and changing every day.

Now that I’m in the collection and analysis stage, what do I do with all these searches (and lists of results, for that matter)? I consulted the chapter on bibliometrics in Practical Research Methods for Librarians and Information Professionals (ISBN: 978-1-55570-591-6) and found that there are various effects and biases I should look out for when interpreting what I’ve found. Some interpretations are problematic because they rest on presumptions that ignore alternate explanations or factors outside of what I have observed. For one thing, is finding data reuse citations really difficult, or am I just doing my searches wrong? A look at my notebook entries will show how thorough each search was and the lengths I went to in order to avoid excluding relevant results, even at the risk of false drops. Yet was I thorough enough? Are the result sets large enough to serve as samples?

Yet that is the wonderful thing about open science and science in general. Anyone else could step up to the plate to either help support my claims or completely refute them with new evidence. Or, as my fellow interns and mentors have done, provide commentary and suggestions throughout my research process.

My methodology for finding citations for three repositories (TreeBASE, Pangaea and ORNL DAAC) tended to go from extremely broad (basic keyword searches for repository names mentioned in articles) to extremely narrow (names of individual data authors, particular DOIs or study accession numbers) in case I missed something important. While I couldn’t exactly read the full text of every article in two weeks when a single search pulled 1000 of them, I could run word searches or skim abstracts for any mention of repositories or data authors. So far, I have been more successful in finding citations for individual data authors than for general mentions of data repositories. Perhaps this is because many datasets supplement studies or articles. Or maybe I’m getting ahead of myself and making the sort of assumptions I’ve been warned against making. For one thing, I readily admit a slight bias against Google; lo and behold, some of my “worst/least helpful” search results have come from Google Scholar.
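
For anyone curious what the "run word searches over abstracts" step could look like in practice, here is a rough Python sketch. It is only an illustration under stated assumptions: the function name `find_mentions`, the record IDs, and the sample abstracts are all made up for this example, and it is not the actual tool or the actual results behind the searches described above.

```python
import re

# Search terms taken from the post: the three repository names (the broad pass).
# Narrower terms such as a data author's name or a DOI could be appended to this list.
REPOSITORY_TERMS = ["TreeBASE", "Pangaea", "ORNL DAAC"]


def find_mentions(records, terms):
    """Map each term to the IDs of records whose abstract mentions it.

    records: dict of {record_id: abstract_text}
    terms:   iterable of strings to look for (case-insensitive, whole words)
    """
    hits = {term: [] for term in terms}
    for term in terms:
        # \b keeps a term from matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for record_id, abstract in records.items():
            if pattern.search(abstract):
                hits[term].append(record_id)
    return hits


# Made-up records purely for illustration; these are NOT real search results.
sample = {
    "a1": "Alignments were deposited in TreeBASE.",
    "a2": "We reanalysed sediment core data archived at PANGAEA.",
    "a3": "No repository is mentioned in this abstract.",
}
mentions = find_mentions(sample, REPOSITORY_TERMS)
```

A simple match like this only flags candidates. Each hit would still need a human skim to tell genuine data reuse apart from a passing mention (the false drops mentioned earlier).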

Ah yes, the demotivational poster, for when you absolutely, positively have to say it in a simple yet pompous manner.

Image Courtesy of: Respectful Insolence @ Science Blogs

I could be doing things wrong or really wrong, but fortunately, I can get called out for it. On another note, the design of ISI Web of Science’s Cited Reference Search is aimed more toward finding individual articles or article authors than datasets. Is it possible that I am demonstrating a potential need for data citation tracking? I don’t imagine I’m the only person out there checking to see how often datasets are (re)used. If scientists and researchers find it useful to track how often their articles are cited, I imagine it could only help the flow of information if data citations are also tracked. For one thing, this sort of tracking could help bring more prestige to the idea of datasets as valuable publications, not to mention bring prestige to the scientists and researchers themselves.

Soon I’ll be quantifying all of this beyond “this was hard to find” and “some were found this way, but more were found this way.” I’ll also have to keep in mind the basic statistical mantra of “correlation does not necessarily mean causation” while writing my interpretations and conclusions. I know there’s certainly a lot more to this than can be covered in a two-month project, so I can’t wait to see what develops.