Information Science Detectives

Posted on June 30, 2010 by


“It’s kind of like trying to find someone on Facebook only knowing their hair color and favorite breakfast cereal.”

This has been my clumsy way of explaining to my friends exactly what it is I’ve been doing lately and why I can’t come out and play in the tree fort with the other kids.

Not to say that work can’t also be fun. If this project didn’t sound fun, I probably wouldn’t have signed up for it in the first place. Being curious about everything drew me to being a science hobbyist, writer, amateur journalist and eventually an Archives grad student. I was the kid who asked the tour guide obnoxious questions on field trips when all anyone else wanted to do was nap on the bus, read a book or watch a movie.

Exactly why is it so hard to find citations of reused scientific data? What is an example of a “good” citation of repository data? So far I’m finding that at least one “good” attribute is being unique and easy to find. Citing by author and year can bring up different articles by that author in the same year, or even the wrong author if the search is only by last name + et al. DOIs are proving their value as far as retrievability is concerned, but when half of the findings are “data for this project is deposited/is accessible at doi: 10..XX.XXX.etc.,” how do we exclude those from the reuse citations? Meanwhile, repositories have their own recommendations: TreeBASE has recommended citations on each page that include the data author and study title (example here); Pangaea recommends including the author, year, title, institution, and DOI; ORNL DAAC has a similar policy with its own formatting rules. This doesn’t even go into the different recommendations each journal/publication has for citing data, which is another story that Nic’s covering.
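To make the deposit-versus-reuse problem concrete, here is a rough Python sketch of how a first pass at sorting DOI mentions might look. It is only an illustration, not the method used in this project: the DOI pattern is a common approximation, the cue phrases are my own guesses based on the wording quoted above, and the function names are made up for this example.

```python
import re

# Assumed pattern: "10.", a 4-9 digit prefix, a slash, then a suffix
# that runs until whitespace or a quote/angle bracket. An approximation,
# not a full DOI validator.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

# Hypothetical cue phrases suggesting the authors are announcing where
# their own data lives, rather than citing someone else's data.
DEPOSIT_CUES = ("deposited", "accessible at", "available at")


def find_dois(text):
    """Return DOI-like strings found in text.

    Trailing sentence punctuation is stripped; a DOI that really ends
    in ')' or '.' would be truncated, so this is only a heuristic.
    """
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


def classify_mention(sentence):
    """Label a sentence as 'deposit', 'possible reuse', or 'no doi'."""
    if not find_dois(sentence):
        return "no doi"
    lowered = sentence.lower()
    if any(cue in lowered for cue in DEPOSIT_CUES):
        return "deposit"
    return "possible reuse"


# The DOI below is an invented placeholder, not a real dataset.
print(classify_mention("Data for this project is deposited at doi:10.1234/example.5678."))
print(classify_mention("We reanalyzed the data of Smith et al. (doi:10.1234/example.5678)."))
```

Anything labeled "possible reuse" would still need a human read; the point is only to push the obvious deposit statements out of the pile.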

It also seems that every question has several other questions that arise with it. If I ask “What is the most common/best way to cite data?” then I also have to consider “Do the best practices/recommendations of each repository actually contribute to making the dataset easy to track and find across other articles?” How much narrower a focus do I gain by using boolean operators and controlled vocabulary? How much do I risk losing by using such narrow searches? For every search I construct, I feel like I’m missing five others. As a visual person, I even created a table to help me keep track as I went through my earlier journal entries. While I might not have all possible leads, I have a record of the ones I previously tracked. Clues lead to clues. Questions lead to more questions. Eventually there are answers… or at least, I hope there will be answers.

That may have just been the wannabe gumshoe in me talking: the kid who always watched Where in the World is Carmen Sandiego on PBS and played all the games, the teenager who wrote amateur detective stories and loved Patricia Highsmith, Dashiell Hammett and film noir a bit too much. I’d put my trenchcoat and fedora on if it weren’t so darn hot outside.

To follow Sarah’s post, I agree that the dead ends are just as important as the big leads. Anyone who’s ever gotten lost in an unfamiliar town knows that they have to get stuck in a few places and pull over into at least one gas station before they get back to the highway. Of course, now there’s GPS and the like, but there aren’t always shortcuts like that in finding information. Will DOIs be the GPS for data citation? So far, I’m finding that they have a high hit count compared to the other methods of sampling I’ve used (author name or repository name).

Back to my opening analogy, DOIs have a lot more structure and standards than something as arbitrary as hair color or favorite breakfast cereal. However, they’re not always used in citation. I’ve found articles that mention only the author or only the name of the repository from which the data came. Yet with all the variation in names (last name only, last name and first initial, full name last and first, etc.), it might as well be like trying to find someone on Facebook or MySpace by breakfast cereal name (Cap’n Crunch? Puffins? Sweetened puffed corn rectangle pillows? Peanut butter puffs?). I feel strongly that if people won’t cite names consistently, the least they could do is use the DOI, which is more or less the equivalent of finding someone by their phone number or email address or class schedule on Facebook (except more helpful and not creepy).
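The contrast between name searching and DOI searching can be sketched in a few lines of Python. This is a toy illustration under my own assumptions: the author, the sample sentences, and the DOI are all invented, and the list of name variants covers only the handful of forms mentioned above.

```python
def name_variants(first, last):
    """Build a few of the forms an author name can take in a citation."""
    initial = first[0]
    return [
        last,                  # last name only
        f"{last} {initial}",   # last name and first initial
        f"{last}, {initial}.",
        f"{last}, {first}",    # full name, last then first
        f"{first} {last}",
    ]


def matched_name_variants(text, first, last):
    """Return the name variants that appear in text (case-insensitive)."""
    lowered = text.lower()
    return [v for v in name_variants(first, last) if v.lower() in lowered]


def mentions_doi(text, doi):
    """DOIs are case-insensitive, so compare in lower case."""
    return doi.lower() in text.lower()


# Invented examples: two different authors who share a last name.
right_author = "Data from Smith et al. 2009 were reanalyzed."
wrong_author = "John Smith and colleagues proposed a new method."

# Searching by name for "Jane Smith" hits both sentences on the bare last name.
print(matched_name_variants(right_author, "Jane", "Smith"))
print(matched_name_variants(wrong_author, "Jane", "Smith"))

# Searching by DOI is a single exact string with no variants to enumerate.
print(mentions_doi("See doi:10.1234/EXAMPLE.5678 for the data.", "10.1234/example.5678"))
```

Five variants for one name is already more than one DOI string, and the loosest variant is the one that pulls in the wrong Smith.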

I feel like the amount of attention I’ve been paying to small details like search strings is just part of an even greater picture outside of data reuse. I’m sure that once Nic, Sarah and I combine and compare data, we’ll have a better look at that bigger picture. Perhaps it won’t be the whole picture, but it’ll certainly be more than the random handful of puzzle pieces I have in hand now.