This is my first blog post about my Smithsonian Libraries internship, although I've actually been working on it since October; it's been an interesting semester (I think fall semester is always interesting, and then everyone settles down for spring semester).
The focus of my internship is preparing and working with large datasets in order to link and match them. I've been working with a dataset of authors whose books have been digitized and are in the Smithsonian Libraries Digital Library. We're trying to attach the authors to records in VIAF, the Virtual International Authority File, so that users of the Digital Library can easily locate more information about the authors, and so those authors can be connected to their other works. My mentor pulled the data, and I've worked on cleaning it (removing organizational authors, etc.) and reconciling it with VIAF via OpenRefine, an open source tool for working with messy data and linking it with external, web-based datasets like VIAF. This has meant spending some quality time with Excel and OpenRefine tutorials, and revisiting how to query databases. My most recent work with the author dataset also explored ways to improve on the automatic matching/reconciliation performed by OpenRefine by coming up with a heuristic that accepts more names based on match similarity scores. This involved a good amount of spot-checking of individual names, which took a lot of time but was also pretty interesting (there are a lot of neat books in the Digital Library).
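To give a sense of what a heuristic like this can look like, here's a minimal Python sketch. It assumes reconciliation candidates have already been exported from OpenRefine as (candidate name, score) pairs with scores normalized to 0–1, which is an assumption about the export format rather than how my actual workflow is set up; the thresholds and the `best_viaf_match` helper are likewise hypothetical, and the string-similarity check uses Python's standard `difflib` rather than OpenRefine's internal scoring.

```python
# Hypothetical sketch: accept a reconciliation candidate only if both the
# reconciliation score and a simple name-similarity ratio clear thresholds.
from difflib import SequenceMatcher

def best_viaf_match(author, candidates, score_threshold=0.85, sim_threshold=0.9):
    """Return the highest-scoring (name, score) candidate that passes both
    thresholds, or None if no candidate qualifies (leave it for spot-checking)."""
    best = None
    for cand_name, score in candidates:
        # Rough similarity between the source name and the candidate name.
        similarity = SequenceMatcher(None, author.lower(), cand_name.lower()).ratio()
        if score >= score_threshold and similarity >= sim_threshold:
            if best is None or score > best[1]:
                best = (cand_name, score)
    return best

# Example: a strong candidate is accepted; a weak one would be rejected.
candidates = [
    ("Darwin, Charles, 1809-1882", 0.97),
    ("Darwin, Charles Galton", 0.62),
]
match = best_viaf_match("Darwin, Charles, 1809-1882", candidates)
```

Names that fall below either threshold simply stay unmatched, which is where the manual spot-checking comes in.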
I've been working with the authors dataset primarily to familiarize myself with OpenRefine, data cleaning, and reconciliation, and will be working with other datasets next semester. The one I am particularly excited about is the art and artist vertical files from the Smithsonian American Art Museum, which is one of my favorite museums. In October, I visited the museum and saw the vertical files. There is so much fascinating material in the files, but they are underused because they are not cataloged. The fact that materials are only findable if they have a record or some sort of searchable representation (which generally means text) is something I've found myself pointing out in more and more of my instruction, both individual and group, because it's something that a lot of students and faculty don't think about. My final project for the previous class in the certificate program focused on pushing students to think about primary sources as having their own histories, and sought to emphasize the creation of collections and records as part of this history. Working with datasets like the authors dataset, which is really pretty straightforward (names and birthdates), and VIAF, which has split authority files and where the best record sometimes comes from the least obvious institution, really points to the historical contingency and inconsistency of data, despite our best efforts. This aspect is also something I've been interested in, since data often takes on the appearance of empirical truth.
That was kind of rambling, but the dehistoricization of libraries, collections, and information systems is pervasive and does political work that I find troubling, so I spend a lot of time thinking about it. On a more practical note, I'm happy to be learning how to clean, prepare, and manipulate data. I'm thinking about working some data science tutorials or an online course in next semester, but I tend to overcommit.