DPH Internship: Post #2

This is my second blog post about my Smithsonian Libraries internship. Throughout December and January, I continued to work in OpenRefine with the authors’ names dataset, which was derived from the books in the Smithsonian’s Digital Library. I discovered that, for some reason, the algorithm wasn’t recognizing exact matches. That is, when I asked it to make matches, it would often choose the worst possible match, even when an exact match was available. My mentor recently discovered that the algorithm generates a ratio based on the Levenshtein distance between the search term and the results from VIAF. The Levenshtein distance is the minimum number of single-character changes (insertions, deletions, or substitutions) needed to get from one string to another. For example:

“cat” and “bat” have a distance of 1. (c changes to b)
“fish” and “fresh” have a distance of 2. (i changes to e, and we add an r)
“frederick” and “bill” have a distance of 8. (5 to remove “frede” and 3 to change “rick” to “bill”)
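
To check distances like these by hand, here is a minimal Python sketch of the standard dynamic-programming way to compute Levenshtein distance. This is my own illustration, not the code OpenRefine or the reconciliation service actually runs, and the function name is just something I picked.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] = distance between the part of a seen so far and b[:j].
    # Before reading any of a, reaching b[:j] takes j insertions.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a matching letter)
            ))
        previous = current
    return previous[-1]


print(levenshtein("cat", "bat"))         # 1
print(levenshtein("fish", "fresh"))      # 2
print(levenshtein("frederick", "bill"))  # 8
```

It only keeps two rows of the comparison table in memory at a time, which is plenty for short strings like names.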

The algorithm generates a score from 0 to 1 for each match, with 1 being the closest possible match. I discovered it was ignoring scores of 1. We’re currently trying to figure out how to get around this, possibly via some sort of custom code. FUN.
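
I don’t know the exact formula the algorithm uses for its ratio, so the sketch below is a guess at the general idea rather than a description of the real thing. It assumes the score is 1 minus the distance divided by the length of the longer string, and it shows one possible workaround: look for an exact match first, and only fall back to the highest score if there isn’t one. The function names and the sample names are all made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(query: str, candidate: str) -> float:
    """ASSUMED scoring formula: 1.0 for identical strings, 0.0 when
    every character differs. The real service may compute it differently."""
    longest = max(len(query), len(candidate))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(query, candidate) / longest


def best_match(query: str, candidates: list[str]):
    """Return (candidate, score), preferring an exact match outright."""
    if not candidates:
        return None
    # Workaround: never let an exact match lose to anything else.
    if query in candidates:
        return query, 1.0
    best = max(candidates, key=lambda c: similarity(query, c))
    return best, similarity(query, best)


# Made-up candidate names, not real VIAF results.
print(best_match("Smith, John", ["Smyth, John", "Smith, John", "Smith, Joan"]))
```

With this formula an exact match would score 1.0 anyway, so the explicit check is a belt-and-suspenders step. It matters in a situation like ours, where something downstream is mishandling scores of 1.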

We’ve also been thinking and talking about how many errors are acceptable. It’s really difficult to match names that are common in English (both American and more broadly), and in at least one case even the VIAF record conflated what seemed to be two different people. I tend to err on the side of providing as much information as possible, but I also don’t want to be sloppy and provide inaccurate information.

I also recently started working with the artists’ names dataset, which is from the Smithsonian American Art Museum. This dataset is much, much larger than the authors’ names (85,000 names) and also includes other data for some of the artists, such as nationality/country. My initial reconciliation of the data was quite disappointing (a very low percentage of names matched), so I started breaking the dataset into chunks (artists with dates, artists without dates, organizational “artists”) in order to get better results. The organizational artists actually worked really well, and over 50% matched a VIAF record exactly. The artists with dates worked fairly well, but the artists without dates did not. One interesting aspect is that this dataset contains a lot of African, Asian, and Middle Eastern artists, and those tended to match VIAF records exactly. This is likely because VIAF (being populated primarily, but not entirely, by American/European institutions) just has fewer African, Asian, and Middle Eastern records in it, so there are fewer similar names competing for each match. I’m thinking that if we don’t link to all of the artists, non-Western artists might be a subset we could separate out based on nationality/country and focus on, since the linking is more accurate. Since there is less likely to be information about these artists floating around online, these links might also be more useful and interesting for users (and would work towards decentering Western art at the same time, which is a win in my book).
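
I did the chunking inside OpenRefine, but the logic is simple enough to sketch in Python. The field names here (`is_organization`, `dates`) and the sample records are placeholders I invented for the example, not the actual columns or data in the museum’s dataset.

```python
def chunk_key(record: dict) -> str:
    """Decide which chunk a record belongs to (hypothetical field names)."""
    if record.get("is_organization"):
        return "organizations"
    if record.get("dates"):  # a missing or empty value counts as "no dates"
        return "with_dates"
    return "without_dates"


def split_into_chunks(records: list[dict]) -> dict[str, list[dict]]:
    """Group records so each chunk can be reconciled separately."""
    chunks = {"organizations": [], "with_dates": [], "without_dates": []}
    for record in records:
        chunks[chunk_key(record)].append(record)
    return chunks


# Placeholder records, purely for illustration.
sample = [
    {"name": "Example Artist A", "dates": "1900-1980"},
    {"name": "Example Artist B", "dates": ""},
    {"name": "Example Studio", "is_organization": True},
]

for chunk_name, members in split_into_chunks(sample).items():
    print(chunk_name, len(members))
```

Reconciling each chunk on its own means the extra information (like dates) can be used where it exists, without dragging down the matches for records that don’t have it.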

I’m also toying with the idea of doing some sort of geographic visualization of this dataset, since it would be neat to be able to see the geographic breadth of the collection. That’s not officially part of my project, but I think it might be fun, and it would be nice to also work on something public-facing, even while I’m immersed in spreadsheets.
