Category Archives: DPH Internship

DPH Internship: Post #5

As befits a history class, for this post I went back and looked at the portfolio I prepared in May 2016 for this internship (I can’t believe that was a year ago). The first items I listed that I wanted to learn were about working with data, and indeed, I did learn a good amount about how to clean and manipulate data in both Excel and (primarily) OpenRefine. I would like to build on what I’ve learned and have been looking at courses on Excel and data science, especially since I keep seeing humanities data curation and management in librarian job descriptions. This internship confirmed that this is an area I would like to do more with professionally, and I very much enjoyed trying to figure out how best to enhance records. As a public services librarian, I am often frustrated by incomplete records and uninformative metadata, so it was fun being on the other side, drawing on my own experiences working with faculty and students to think about what might be most useful to the user.

One of the drawbacks of the internship, and perhaps this would be true with any internship at any organization as large as the Smithsonian Institution, is that the project felt like a very small, very narrow, very specific thing. The work I did still has to go through layers of approval before it’s public, and that can be a little frustrating. Going forward, the program might want to consider internships with smaller institutions that are likely to be underfunded and have fewer volunteers or interns. It’s definitely impressive to work with the Smithsonian, but it can also be limiting in terms of the type of work you’re able to do. As a local, it was great to be able to visit and meet the people I was working with (and whose data I was working with).

The internship filled in a gap I felt in the coursework – preparing all of that lovely data for digital humanities tools – and the coursework helped me figure out how to use the data I had access to in maps and timelines. The coursework on user needs also helped me think through both the possibilities of and problems with linked data. As a librarian, I was already pretty familiar with metadata, and very happy that so much of our coursework emphasized the importance of it, and the internship work reiterated the value of metadata that is as clean and complete as possible. I keep returning to Sam Wineburg’s notion of the “jagged edges” of history, and what also struck me in this project is the fundamental unknowability of some things. Is this author in the Smithsonian Digital Library actually the same person as in this VIAF authority record? Sometimes it is just not possible to tell based on the information we have. Another idea I keep returning to is the labor that is behind digital public humanities work, especially with regard to things that are less visible, like the creation of metadata or linking data. Relying on OpenRefine or a similar technology to automatically match names will, at this moment, likely result in an unacceptable number of errors, unless the data is fairly complete. It takes a lot of time and effort to figure out what can’t be automated and then to do that work manually. It also takes human judgement (and often additional research) to make decisions in a lot of cases.

Playing Around with Timeline JS

I’ve been working on reconciling a dataset of botanists’ names, which, unlike the other datasets I’ve worked with, includes a lot of birth and death dates. I decided to try putting it in Timeline JS, mostly to see if I could figure out how to use it. I ended up using OpenRefine to split columns and transform numbers into dates, and then regular old Excel to get rid of empty rows (here’s the Google Sheet I ended up with). I didn’t go through and add media or links to the timeline, so it’s pretty boring, and I haven’t poked around to see why “January” is above each name. It is interesting to see who is contemporaneous, though, and I’m pleased I was able to make this (mostly) work.
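The wrangling above can be sketched outside of OpenRefine, too. This is a minimal version of the same steps – splitting a combined life-dates column and writing the Year / End Year / Headline columns that Timeline JS reads from its spreadsheet template – with invented botanist rows and assumed column names:

```python
import csv

# Hypothetical input: one row per botanist with a combined life-dates
# column like "1810-1888" (these rows and column names are invented,
# not the actual spreadsheet headers).
rows = [
    {"name": "Asa Gray", "dates": "1810-1888"},
    {"name": "Jane Colden", "dates": "1724-1766"},
]

# Timeline JS's spreadsheet template uses Year / End Year / Headline
# columns; Month and Day can be left out entirely if all you have is years.
with open("timeline.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Year", "End Year", "Headline"])
    writer.writeheader()
    for row in rows:
        birth, _, death = row["dates"].partition("-")
        writer.writerow({"Year": birth, "End Year": death, "Headline": row["name"]})
```

From there it's just a matter of pasting the CSV into the Timeline JS Google Sheet template and publishing it.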

DPH Internship: Post #4

My mentor asked me to write a summary of my project to date, so that is this month’s post.

Project Summary (to date)

This project began with working with a smallish dataset of authors’ names from the Smithsonian Libraries’ Digital Library. Using Refine VIAF, I tried reconciling the authors’ names with VIAF records. Reconciling entailed running the automated reconciliation process, working with probable matches so as to reconcile more records, and spot-checking for accuracy. I also used the authors’ names dataset to try and get a sense of how OpenRefine and Refine VIAF work. While working with the authors’ names in Refine VIAF, I discovered that although probable matches are assigned numeric scores based on similarity of letters, Refine VIAF doesn’t necessarily pick the highest score when you ask it to make the best match. This issue continued throughout the project, and we still haven’t figured out why it happens. Refine VIAF also doesn’t always recognize a score of one (which is the highest possible score, indicating an exact match) as the highest score when it automatically reconciles. This is also a problem because it makes it difficult to automate the reconciliation of suggested matches, and there are almost always more suggested matches than exact ones.
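For what it’s worth, the behavior I expected from “make the best match” is easy to sketch. This is not Refine VIAF’s actual code – the candidate names and scores below are invented – just the pick-the-highest-score logic it seems like the tool should be applying:

```python
# Each reconciliation candidate pairs a VIAF heading with a 0-1
# similarity score (candidates invented for illustration).
candidates = [
    {"name": "Smith, John, 1840-1910", "score": 0.72},
    {"name": "Smith, John", "score": 1.0},
    {"name": "Smith, Jonathan", "score": 0.55},
]

def best_match(candidates, threshold=0.0):
    """Return the highest-scored candidate at or above the threshold,
    which is what "make the best match" should presumably do."""
    eligible = [c for c in candidates if c["score"] >= threshold]
    return max(eligible, key=lambda c: c["score"]) if eligible else None

print(best_match(candidates)["name"])  # Smith, John (the score-1.0 exact match)
```

Given a list like this, an exact match (score of one) should always win; the puzzle is that the tool often picks something else.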

After working with the authors’ names dataset, I moved on to a much larger dataset of artists’ names. This dataset included organizational names, which were reconciled separately and much more successfully (almost 50% were reconciled automatically, and when I spot-checked, they were very accurate). The artists’ names dataset proved to be more complicated. I had to clean the data a bit, as there were records that lacked a first or last name. Within OpenRefine, I had to combine the name and dates columns, as reconciling with dates was much more accurate. In order to discover both of these things, I had to first reconcile the dataset and then figure out what went wrong, and reconciling such a large dataset often took at least a couple of hours. After each iteration of reconciliation, I also spot-checked the results for accuracy. Once I started working with the full name and dates, I began trying to figure out how to work with probable matches. The scores for this dataset were significantly lower than for the authors’ names, and in order to get more matches, we have to tolerate more errors. For the artists’ names dataset, I also did a lot of manual reconciliation, focusing on names that include dates rather than all names. I did this because there were fewer of them, and it was much easier to reconcile accurately with dates. The same problems I noted above – Refine VIAF not recognizing exact matches with a score of one and often choosing the worst, lowest-scored match – continued.
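The column merge itself was a simple concatenation in OpenRefine. A sketch of the idea, with assumed column names and an invented example (the rough GREL equivalent is in a comment):

```python
# Combining "name" and "dates" columns before reconciliation, since the
# combined string reconciles more accurately than the name alone.
# Column names here are assumptions about the export, not the real headers.
def combine(name, dates):
    # Skip missing pieces rather than producing dangling strings like "Smith, "
    if not name or not dates:
        return name or ""
    return f"{name}, {dates}"

# Inside OpenRefine, the equivalent GREL expression would be roughly:
#   cells["name"].value + ", " + cells["dates"].value
print(combine("Homer, Winslow", "1836-1910"))  # Homer, Winslow, 1836-1910
```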

While researching Refine VIAF, I noticed that the same developer released a new version at the very end of 2016, called Conciliator. I repeated the same reconciliation process for artists’ names with Conciliator, including spot-checking, reconciliation based on similarity scores, and manual reconciliation. Overall, it did not seem as though much had changed, and Conciliator had the same problems with similarity scores as Refine VIAF.

Conciliator, however, does offer the option of reconciling with specific data sources in VIAF, so I tried this with both Getty’s Union List of Artist Names (ULAN) and ISNI. ULAN was more successful, but did not automatically match any names. There were about 26,000 suggested matches, and those seemed like the types of names that would have automatically matched using the entire VIAF database. The similarity scores were higher, but Conciliator continues to mysteriously disregard high similarity scores. Also, for the majority of names, there was no suggested match when reconciling with ULAN, which makes manual reconciliation impossible. I was excited about the possibility of working with specific data sources, and it still seems like it might work with more targeted/specific datasets, but it was not appropriate in this case.

During this project, I created two visualizations based on a smaller set of the artists’ names dataset – artists who were associated with specific countries. For one, I created a heat map of the number of artists associated with countries. For the other, I used the birthdate info associated with some of these artists to create an animated map of when and where they were born.

Going forward, if geographic information and life dates are added to this dataset, similar visualizations could be created. It might also be interesting to bring in related names and then attempt some network visualization. Using this data to enhance existing records will also, of course, make reconciliation and linking easier and more accurate. Linking data offers the possibility of connecting disparate pieces of information about an individual or specific thing, but it also will always run into the inherent fuzziness of language and the ways in which some things are not knowable.

DPH Internship: Post #3

This is my third blog post about my Smithsonian Libraries internship. I have continued to work with the artists’ names spreadsheet, which consists of about 85,000 names. The organizational names are ready to be loaded, but we’ve been discovering weird things in the dataset, like dates with no names and random blank lines. The reconciliation program more or less ignores these, fortunately, but they do stand out when I’m scanning the data. I’ve also been using the newer version of Refine VIAF, which is called Conciliator and allows more targeted reconciliation – with ORCID numbers, or just VIAF records from the Library of Congress.

I have also hand-reconciled a fair number of records in this dataset, and it is making me think more about the imprecision of language and how that plays out in text searches – and about the ongoing importance of human labor because of those shortcomings. I am able to quickly reconcile a lot of artists because the birth and death years are in the VIAF record, but not in the title of the record that Conciliator uses to reconcile. People with common names are conflated both within VIAF and in the reconciliation of the artists’ names; sometimes I am not certain whether the VIAF record and the artist are the same person (usually due to the types of works they’re associated with), but the name is good enough for Conciliator. There is a good deal of uncertainty in this process that can’t be removed, and it is inevitable that there will be mistakes, but these aspects are hard to wrap our heads around, I think, because we expect less squishiness and more clarity when we interact with technology. I spent the last year or so reading critiques of technology and technological determinism for a writing project, and when working with these datasets, it’s very apparent that humans have had their grubby little hands all over the data, because there is so much variation, even though the information seems incredibly straightforward: name and life dates.

The next piece I will be working on is writing up best practices/lessons learned from working with OpenRefine, Refine VIAF, and Conciliator. I have also been thinking about ways to get better results, generally by slicing up the data or using a specific set of records, like LC or Getty. And then trying to figure out where to go from here.

I did do a map of artist nationalities (only about 5000 entries, but still neat):



DPH Internship: Post #2

This is my second blog post about my Smithsonian Libraries Internship. I continued to work with the authors’ names dataset in OpenRefine, which was derived from the books in the Smithsonian’s Digital Library, throughout December and January. I discovered that for some reason, the algorithm wasn’t recognizing exact matches. That is, when I asked it to make matches, it would often choose the worst possible match, even if there was an exact match. My mentor recently discovered that the algorithm generates a ratio based on the Levenshtein Distance between the search term and the results from VIAF. The Levenshtein Distance is the number of single-letter changes to get from one word to another. For example:

“cat” and “bat” have a distance of 1. (c changes to b)
“fish” and “fresh” have a distance of 2. (i changes to e, and we add an r)
“frederick” to “bill” is 8. (5 to remove “frede” and 3 to change “rick” to “bill”)

The algorithm generates a score for each match from 0 to 1, with 1 being an exact match. I discovered it was ignoring scores of 1. We’re currently trying to figure out how to get around this, possibly via some sort of custom code. FUN.
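Out of curiosity, both the distance and a 0-to-1 score can be computed in a few lines of Python. The normalization below is one plausible way to turn a Levenshtein distance into a ratio, not necessarily the exact formula the algorithm uses:

```python
def levenshtein(a, b):
    """Single-character edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalize to a 0-1 score, 1 meaning an exact match (one plausible
    normalization; the tool's actual formula may differ)."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest

print(levenshtein("cat", "bat"))         # 1
print(levenshtein("fish", "fresh"))      # 2
print(levenshtein("frederick", "bill"))  # 8
```

The three examples above come out to the same distances as in the list, which is reassuring.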

We’ve also been thinking and talking about how many errors are acceptable. It’s really difficult to match names that are common in English (both American and more broadly) and in at least one case, even the VIAF record conflated what seemed to be two different people. I tend to err on the side of providing as much information as possible, but I also don’t want to be sloppy and provide inaccurate information.

I also recently started working with the artists’ names dataset, which is from the Smithsonian American Art Museum. This dataset is much, much larger than the authors’ names (85,000 names) and also includes other data for some of the artists, such as nationality/country. My initial reconciliation of the data was quite disappointing – a very low percentage of names matched – so I started breaking the dataset into chunks (artists with dates, artists without dates, organizational “artists”) in order to get better results. The organizational artists actually worked really well, and over 50% matched a VIAF record exactly. The artists with dates worked fairly well, but the artists without dates did not. One interesting aspect is that this dataset contains a lot of African, Asian, and Middle Eastern artists, and those tended to match VIAF records exactly. This is likely because VIAF (being populated primarily, but not entirely, by American/European institutions) just has fewer African, Asian, and Middle Eastern records in it, so there are fewer similar names for any one artist to be confused with. I’m thinking that if we don’t link to all of the artists, non-Western artists might be a subset we could separate out based on nationality/country and focus on, since the linking is more accurate. Since there is less likely to be information about these artists floating around online, these links might also be more useful and interesting for users (and would work towards decentering Western art at the same time, which is a win in my book).
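The chunking itself is straightforward; here’s a sketch with invented rows and assumed field names:

```python
# Invented sample rows standing in for the 85,000-name export
# (field names are assumptions about how the spreadsheet is laid out).
rows = [
    {"name": "Homer, Winslow", "dates": "1836-1910", "type": "person"},
    {"name": "Hiroshige", "dates": "", "type": "person"},
    {"name": "Corcoran Gallery of Art", "dates": "", "type": "organization"},
]

# Split into the three chunks that reconcile so differently,
# so each can be run (and spot-checked) on its own.
organizations = [r for r in rows if r["type"] == "organization"]
with_dates = [r for r in rows if r["type"] == "person" and r["dates"]]
without_dates = [r for r in rows if r["type"] == "person" and not r["dates"]]
```

In OpenRefine the same split is just text facets on the type and dates columns, but doing it explicitly makes it easier to track match rates per chunk.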

I’m also toying with the idea of doing some sort of geographic visualization of this dataset, since it would be neat to be able to see the geographic breadth of the collection. That’s not officially part of my project, but I think it might be fun, and it would be nice to also work on something public-facing, even while I’m immersed in spreadsheets.

DPH Internship: Post #1

This is my first blog post about my Smithsonian Libraries internship, although I actually have been working on it since October, because it’s been an interesting semester (I think fall semester is always interesting, and then everyone settles down for spring semester).

The focus of my internship is on preparing and working with large datasets in order to link and match them. I’ve been working with a dataset of authors whose books have been digitized and are in the Smithsonian Libraries Digital Library. We’re trying to attach the authors to records in VIAF, the Virtual International Authority File, so that users of the Digital Library can easily locate more information about the authors, and those authors can be connected to their other works. My mentor has pulled the data, and I’ve worked on cleaning it (removing organizational authors, etc.) and reconciling it with VIAF via OpenRefine, which is an open source tool for working with messy data and linking it with external, web-based datasets like VIAF. This has meant I’ve spent some quality time with Excel and OpenRefine tutorials, and have also been revisiting how to query databases. My most recent work with the author dataset also played around with ways to improve on the automatic matching/reconciliation performed by OpenRefine by coming up with a heuristic that matches more names based on match similarity scores. This involved a good amount of spot-checking of individual names, which took a lot of time but was also pretty interesting (there are a lot of neat books in the Digital Library).
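The heuristic boils down to a score cutoff: accept suggested matches above a threshold automatically, and queue the rest for manual review and spot-checking. A sketch, with invented names, scores, and cutoff:

```python
# Suggested matches as (VIAF heading, similarity score) pairs;
# the names, scores, and cutoff value are all invented for illustration.
suggested = [
    ("Darwin, Charles, 1809-1882", 1.0),
    ("Gray, Asa", 0.91),
    ("Smith, J.", 0.42),
]

CUTOFF = 0.9  # raising this trades fewer matches for fewer errors

accepted = [(name, s) for name, s in suggested if s >= CUTOFF]
to_review = [(name, s) for name, s in suggested if s < CUTOFF]

print(len(accepted), len(to_review))  # 2 1
```

The interesting (and time-consuming) part is choosing the cutoff, since every choice is a trade-off between the number of matches gained and the number of errors tolerated.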

I’ve been working with the authors dataset primarily to familiarize myself with OpenRefine and with data cleaning and reconciliation, and will be working with other datasets next semester. The one I am particularly excited about is the art and artist vertical files from the Smithsonian American Art Museum, which is one of my favorite museums. In October, I visited the museum and saw the vertical files. There is so much fascinating material in the files, but the files are underused because they are not cataloged. The fact that materials are only findable if they have a record or some sort of representation that can be searched for (which generally means text) is something I’ve found myself pointing out in more and more of my instruction, both individual and group, because it’s something that a lot of students and faculty don’t think about. My final project for the previous class in the certificate program focused on pushing students to think about primary sources as having their own histories, and sought to emphasize the creation of collections and records as part of this history. Working with datasets like the authors dataset, which is really pretty straightforward (names and birthdates), and VIAF, where authority files are sometimes split and the best record sometimes comes from the least obvious institution, really points to the historical contingency and inconsistency of data, despite our best efforts. This aspect is also something I’ve been interested in, since data often takes on the appearance of empirical truth.

That was kind of rambling, but the dehistoricization of libraries, collections, and information systems is pervasive and does political work that I find troubling, so I spend a lot of time thinking about it. On a more practical note, I’m happy to be learning how to clean, prepare, and manipulate data. I’m thinking about working in some data science tutorials on an online course next semester, but I tend to overcommit.