Recycling in the data world

Hi, I'm Valerie

Unlike Nic, I’m not especially excited about introducing myself or going first. Fortunately, I get to post on Wednesday, the day when a good many of us have shaken off the early lethargy of Monday and Tuesday. Like Sarah, I had an interesting time explaining to colleagues (as well as friends, family and random people who would listen) exactly what it was I would be doing at my computer or in various libraries for the next couple of months. As a current Archives Management student, I have an interest in preservation, whether in the analogue form of acid-free, lignin-free boxes and folders or the digital form of refreshing and migrating data. Thus, this internship, with its examination of the data life cycle (creation, use and reuse), caught my attention. The fact that I would be working with scientific data was just icing on the cupcake, since I’m a science hobbyist who enjoys ScienceBlogs and reading nonfiction science books.

What exactly am I doing with this data? The short version is that I’m trying to find out how much of the data stored in repositories like TreeBASE, Pangaea, and ORNL DAAC is reused by researchers, and how they attribute, or cite, this data. The long version can be found in the entries of my OpenWetware Lab Notebook.

Now, how would I go about finding this reused data? As an information scientist-in-training, I’m supposed to have a well-developed ability to search and find information, something I refer to as “information-fu.” I admit, I still have a long way to go before I reach the super-human strength of the reference librarians I’ve met, but it will be interesting to develop my abilities by learning from my mentors and fellow interns. While some repositories have their own methods of identification, including study numbers (TreeBASE) or DOIs (Pangaea), there is no standard way for researchers to cite these datasets. I started out with a very basic search for mentions of the repositories, then narrowed it down by using terminology like “study accession number” or by using wildcards with DOI prefixes like “doi:10.1594/PANGAEA*”. After a suggestion from one of our web meetings, I’m now looking for citations by author in TreeBASE, as the data is often cited in the form of studies published in other journals.
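For the curious, here is a rough sketch of what that narrowing step could look like if you ran it over article text yourself instead of through a database search box. It only looks for the two search terms mentioned above; the function name, the sample sentence and the DOI suffix in it are made-up placeholders, and this is not the actual workflow recorded in the lab notebook.

```python
import re

# The DOI prefix and the phrase are the search terms described above.
# Everything else (names, sample text, DOI suffix) is hypothetical.
PANGAEA_DOI = re.compile(r"doi:10\.1594/PANGAEA[^\s\"',;)]*", re.IGNORECASE)
ACCESSION_PHRASE = re.compile(r"study accession number", re.IGNORECASE)


def find_repository_mentions(text):
    """Return Pangaea DOIs and a count of accession-number phrases in text."""
    # Strip a trailing period so a DOI at the end of a sentence is kept clean.
    dois = [m.group(0).rstrip(".") for m in PANGAEA_DOI.finditer(text)]
    phrase_count = len(ACCESSION_PHRASE.findall(text))
    return {"pangaea_dois": dois, "accession_phrases": phrase_count}


# Placeholder sentence, not taken from a real article.
sample = (
    "Data were obtained from doi:10.1594/PANGAEA.000000 and "
    "the TreeBASE study accession number is given in the appendix."
)
result = find_repository_mentions(sample)
print(result)
```

A pattern like this only catches authors who cite the repository in exactly this form, which is why the author-based search is still needed.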

For the most part, my project is similar to Sarah’s Project, but concentrating on data reuse at the article level. It will be interesting to compare data with Nic’s Project to see how repository and project funding sources inform how data is used, reused and cited.

It amazes me how much the internet has changed science and research. For one thing, all of the interns and mentors are operating independently in different regions across the country, touching base via email, group chat and phone conference. The OpenWetware notebooks allow me to record every detail so that I can go back to previous searches in case I missed something. It’s also great to see what everyone else is working on. So many articles are available through Open Access. So much data is available in digital repositories.

This brings me to my final point. As a librarian/archivist/information scientist-in-training, I often get the question, “If everything’s online, then why do we still need librarians?” My answer is pretty much in my lab notebook entries. While many advanced search functions include full-text searching, there’s still the chance you’re going to miss something. If “everything is online” (which it sometimes isn’t, but that’s another story), then that’s a lot of information to slog through. It might help to bring a guide, because frankly, it’s a jungle out there.

I look forward to sharing my findings, whether on this blog or in my OWW Notebook.