DPH Internship: Post #4

My mentor asked me to write a summary of my project to date, so that is this month’s post.

Project Summary (to date)

This project began with a smallish dataset of authors’ names from the Smithsonian Libraries’ Digital Library. Using Refine VIAF, I tried reconciling the authors’ names with VIAF records. Reconciling entailed running the automated reconciliation process, working with probable matches so as to reconcile more records, and spot-checking for accuracy. I also used the authors’ names dataset to try to get a sense of how Open Refine and Refine VIAF work. While working with the authors’ names in Refine VIAF, I discovered that although probable matches are assigned numeric scores based on similarity of letters, Refine VIAF doesn’t necessarily pick the highest score when you ask it to make the best match. This issue continued throughout the project, and we still haven’t figured out why. Refine VIAF also doesn’t always recognize a score of one (the highest possible score, indicating an exact match) as the highest score when it automatically reconciles. This is a problem because it makes it difficult to automate the reconciliation of suggested matches, which almost always outnumber exact matches.
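To make the expected behavior concrete, here is a minimal Python sketch of what "pick the best match" would mean if the highest score always won. It is not Refine VIAF's code. The candidate structure, the field names `name` and `score`, and the sample values are all assumptions made up for illustration.

```python
from typing import Optional


def pick_best_match(candidates: list[dict], min_score: float = 0.0) -> Optional[dict]:
    """Return the highest-scoring candidate, or None if none reaches min_score.

    Each candidate is a hypothetical dict with a "name" and a "score"
    between 0 and 1, where 1 stands for an exact match.
    """
    eligible = [c for c in candidates if c["score"] >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["score"])


# Invented sample candidates for a single name.
sample_candidates = [
    {"name": "Example, Ada, 1850-1920", "score": 0.62},
    {"name": "Example, Ada", "score": 1.0},
    {"name": "Exampel, Ada", "score": 0.48},
]

best = pick_best_match(sample_candidates)
print(best["name"], best["score"])
```

Raising `min_score` would be one way to trade fewer matches for fewer errors when accepting suggested matches in bulk.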

After working with the authors’ names dataset, I moved on to a much larger dataset of artists’ names. This dataset included organizational names, which were reconciled separately and much more successfully (almost 50% were reconciled automatically, and when I spot-checked them, they were very accurate). The artists’ names dataset proved to be more complicated. I had to clean the data a bit, as there were records that lacked a first or last name. Within Open Refine, I had to combine the name and dates columns, as reconciling with dates was much more accurate. To discover both of these things, I first had to reconcile the dataset and then figure out what went wrong, and reconciling such a large dataset often took at least a couple of hours. After each iteration of reconciliation, I also spot-checked the results for accuracy. Once I started working with the full name and dates, I began trying to figure out how to work with probable matches. The scores for this dataset were significantly lower than for the authors’ names, and in order to get more matches, we have to tolerate more errors. For the artists’ names dataset, I also did a lot of manual reconciliation, focusing on names that include dates rather than all names. I did this because there were fewer of them, and it was much easier to reconcile accurately with dates. The same problems I noted above (Refine VIAF not recognizing exact matches with a score of one, and often choosing the lowest-scored, worst match) continued.
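The cleaning and column-combining steps were done inside Open Refine, but the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `first_name`, `last_name`, and `dates`, the "Last, First, dates" output format, and the sample records are all hypothetical.

```python
from typing import Optional


def build_query(record: dict) -> Optional[str]:
    """Combine name and dates into one string to reconcile on.

    Returns None for records missing a first or last name, so they can
    be set aside for cleanup instead of being sent for reconciliation.
    """
    first = (record.get("first_name") or "").strip()
    last = (record.get("last_name") or "").strip()
    dates = (record.get("dates") or "").strip()
    if not first or not last:
        return None
    name = f"{last}, {first}"
    # Append dates only when present; names without dates stay as they are.
    return f"{name}, {dates}" if dates else name


# Invented sample records.
sample_records = [
    {"first_name": "Ada", "last_name": "Example", "dates": "1850-1920"},
    {"first_name": "Ben", "last_name": "Sample", "dates": ""},
    {"first_name": "", "last_name": "Unknown", "dates": "1900-1950"},
]

for rec in sample_records:
    print(build_query(rec))
```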

While researching Refine VIAF, I noticed that the same developer released a different version at the very end of 2016, called Conciliator. I repeated the same reconciliation process for artists’ names with Conciliator, including spot-checking, reconciliation based on similarity scores, and manual reconciliation. Overall, it did not seem as though much had changed, and Conciliator had the same problems with similarity scores as Refine VIAF.

Conciliator, however, does offer the option of reconciling with specific data sources in VIAF, so I tried this with both Getty’s Union List of Artist Names (ULAN) and ISNI. ULAN was more successful, but did not automatically match any names. There were about 26,000 suggested matches, and those seemed like the types of names that would have automatically matched using the entire VIAF database. The similarity scores were higher, but Conciliator continues to mysteriously disregard high similarity scores. Also, for the majority of names, there was no suggested match when reconciling with ULAN, which makes manual reconciliation impossible. I was excited about the possibility of working with specific data sources, and it still seems like it might work with more targeted/specific datasets, but it was not appropriate in this case.

During this project, I created two visualizations based on a smaller set of the artists’ names dataset – artists who were associated with specific countries. For one, I created a heat map of the number of artists associated with countries. For the other, I used the birthdate info associated with some of these artists to create an animated map of when and where they were born.
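A heat map like the first one starts from a simple count of artists per country. The following Python sketch shows that counting step only, not the mapping itself. The `country` field name and the sample records are assumptions for illustration and are not taken from the actual dataset.

```python
from collections import Counter


def artists_per_country(records: list[dict]) -> Counter:
    """Count artists by associated country, skipping records without one."""
    return Counter(r["country"] for r in records if r.get("country"))


# Invented sample records.
sample_artists = [
    {"name": "Example, Ada", "country": "France"},
    {"name": "Sample, Ben", "country": "Japan"},
    {"name": "Person, Cy", "country": "France"},
    {"name": "Unknown, Dee", "country": None},
]

counts = artists_per_country(sample_artists)
for country, n in counts.most_common():
    print(country, n)
```

The resulting counts are what a mapping tool would shade by: one number per country.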

Going forward, if geographic information and life dates are added to this dataset, similar visualizations could be created. It might also be interesting to bring in related names and then attempt some network visualization. Using this data to enhance existing records will also, of course, make reconciliation and linking easier and more accurate. Linking data offers the possibility of connecting disparate pieces of information about an individual or specific thing, but it also will always run into the inherent fuzziness of language and the ways in which some things are not knowable.

Maura Seale
