DPH Internship: Post #5

As befits a history class, for this post I went back and looked at the portfolio I prepared in May 2016 for this internship (I can’t believe that was a year ago). The first items I listed as things I wanted to learn were about working with data, and indeed, I did learn a good amount about how to clean and manipulate data in both Excel and (primarily) OpenRefine. I would like to build on what I’ve learned and have been looking at courses on Excel and data science, especially since I keep seeing humanities data curation and management in librarian job descriptions. This internship confirmed that this is an area I would like to do more with professionally, and I very much enjoyed trying to figure out how best to enhance records. As a public services librarian, I am often frustrated by incomplete records and uninformative metadata, so it was fun being on the other side and drawing on my own experiences working with faculty and students in thinking about what might be most useful to the user.

One of the drawbacks of the internship, and perhaps this would be true with any internship at any organization as large as the Smithsonian Institution, is that the project felt like a very small, very narrow, very specific thing. The work I did still has to go through layers of approval before it’s public, and that can be a little frustrating. Going forward, the program might want to consider internships with smaller institutions that are likely to be underfunded and have fewer volunteers or interns. It’s definitely impressive to work with the Smithsonian, but it can also be limiting in terms of the type of work you’re able to do. As a local, it was great to be able to visit and meet the people I was working with (and whose data I was working with).

The internship filled in a gap I felt in the coursework – preparing all of that lovely data for digital humanities tools – and the coursework helped me figure out how to use the data I had access to in maps and timelines. The coursework on user needs also helped me think through both the possibilities of and problems with linked data. As a librarian, I was already pretty familiar with metadata, and very happy that so much of our coursework emphasized the importance of it, and the internship work reiterated the value of metadata that is as clean and complete as possible. I keep returning to Sam Wineburg’s notion of the “jagged edges” of history, and what also struck me in this project is the fundamental unknowability of some things. Is this author in the Smithsonian Digital Library actually the same person as in this VIAF authority record? Sometimes it is just not possible to tell based on the information we have. Another idea I keep returning to is the labor that is behind digital public humanities work, especially in regards to things that are less visible, like the creation of metadata or linking data. Relying on OpenRefine or a similar technology to automatically match names will, at this moment, likely result in an unacceptable number of errors, unless the data is fairly complete. It takes a lot of time and effort to figure out what can’t be automated and then to do that work manually. It also takes human judgement (and often additional research) to make decisions in a lot of cases.

Playing Around with Timeline JS

I’ve been working on reconciling a dataset of botanists’ names, which, unlike the other datasets I’ve worked with, includes a lot of birth and death dates. I decided to try putting it in Timeline JS, mostly to see if I could figure out how to use it. I ended up using OpenRefine to split columns and transform numbers into dates, and then regular old Excel to get rid of empty rows (here’s the Google Sheet I ended up with). I didn’t go through and add media or links to the timeline, so it’s pretty boring, and I haven’t poked around to see why “January” is above each name. It is interesting to see who is contemporaneous, though, and I’m pleased I was able to make this (mostly) work.
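For reference, Timeline JS reads its data from a spreadsheet with columns like Year, End Year, and Headline, so the transformation mostly amounts to splitting life dates into those columns. Here’s a rough Python sketch of what my OpenRefine/Excel steps boiled down to (the input column names and sample botanists are invented for illustration, not the actual dataset):

```python
import csv
import io

# Invented sample rows in the general shape of the botanists dataset.
raw = """name,life_dates
"Linnaeus, Carl",1707-1778
"Hooker, Joseph Dalton",1817-1911
"""

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Year", "End Year", "Headline"])
writer.writeheader()
for row in csv.DictReader(io.StringIO(raw)):
    # Split "1707-1778" into start and end years for the timeline.
    birth, death = row["life_dates"].split("-")
    writer.writerow({"Year": birth, "End Year": death,
                     "Headline": row["name"]})

print(out.getvalue())
```

The real work, of course, was getting the messy source data into a shape where a split like this works at all.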

DPH Internship: Post #4

My mentor asked me to write a summary of my project to date, so that is this month’s post.

Project Summary (to date)

This project began with working with a smallish dataset of authors’ names from the Smithsonian Libraries’ Digital Library. Using Refine VIAF, I tried reconciling the authors’ names with VIAF records. Reconciling entailed running the automated reconciliation process, working with probable matches so as to reconcile more records, and spot-checking for accuracy. I also used the authors’ names dataset to try and get a sense of how OpenRefine and Refine VIAF work. While working with the authors’ names in Refine VIAF, I discovered that although probable matches are assigned numeric scores based on similarity of letters, Refine VIAF doesn’t necessarily pick the highest score when you ask it to make the best match. This issue continued throughout the project, and we still haven’t figured out why. Refine VIAF also doesn’t always recognize a score of one (the highest possible score, indicating an exact match) as the highest score when it automatically reconciles. This is a problem because it makes it difficult to automate the reconciliation of suggested matches, of which there are almost always more than exact matches.
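What I expected “make the best match” to do is simply take the candidate with the highest score. A toy Python sketch of that expected behavior (the candidate tuples below are invented for illustration and are not Refine VIAF’s actual data structures):

```python
def best_match(candidates):
    """Return the (id, score) candidate with the highest similarity
    score, or None if there are no candidates at all."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])

# Invented example: an exact match (score 1.0) should always win.
candidates = [
    ("viaf/123", 0.62),
    ("viaf/456", 1.0),
    ("viaf/789", 0.31),
]
print(best_match(candidates))  # ('viaf/456', 1.0)
```

The puzzle is that the tool doesn’t reliably behave like this simple argmax.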

After working with the authors’ names dataset, I moved on to a much larger dataset of artists’ names. This dataset included organizational names, which were reconciled separately and much more successfully (almost 50% were reconciled automatically and when I spot checked, they were very accurate). The artists’ names dataset proved to be more complicated. I had to clean the data a bit, as there were records that lacked a first or last name. Within OpenRefine, I had to combine the name and dates columns, as reconciling with dates was much more accurate. In order to discover both of these things, I had to first reconcile the dataset and then figure out what went wrong, and reconciling such a large dataset often took at least a couple of hours. After each iteration of reconciliation, I also spot checked the results for accuracy. Once I started working with the full name and dates, I began trying to figure out how to work with probable matches. The scores for this dataset were significantly lower than for the authors’ names, and in order to get more matches, we have to tolerate more errors. For the artists’ names dataset, I also did a lot of manual reconciliation, focusing on names that include dates rather than all names. I did this because there were fewer of them, and it was much easier to reconcile accurately with dates. The same problems I noted above – Refine VIAF not recognizing exact matches with a score of one and often choosing the lowest-scored, worst match – continued.
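The column-combining step itself was done inside OpenRefine, but the idea is simple enough to show outside it. A rough Python equivalent (the column names and sample rows are made up for illustration):

```python
import csv
import io

# Made-up sample with separate name and dates columns.
raw = """name,dates
"Smith, John",1850-1920
"Doe, Jane",
"""

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    # Append life dates when present: reconciling "Smith, John,
    # 1850-1920" is far less ambiguous than the bare name.
    if row["dates"]:
        row["full"] = f'{row["name"]}, {row["dates"]}'
    else:
        row["full"] = row["name"]

print([r["full"] for r in rows])  # ['Smith, John, 1850-1920', 'Doe, Jane']
```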

While researching Refine VIAF, I noticed that the same developer released a new version at the very end of 2016, called Conciliator. I repeated the same reconciliation process for artists’ names with Conciliator, including spot checking, reconciliation based on similarity scores, and manual reconciliation. Overall, it did not seem as though much had changed, and Conciliator had the same problems with similarity scores as Refine VIAF.

Conciliator, however, does offer the option of reconciling with specific data sources in VIAF, so I tried this with both Getty’s Union List of Artist Names and ISNI. ULAN was more successful, but did not automatically match any names. There were about 26,000 suggested matches, and those seemed like the types of names that would have automatically matched using the entire VIAF database. The similarity scores were higher, but Conciliator continues to mysteriously disregard high similarity scores. Also, for the majority of names, there was no suggested match when reconciling with ULAN, which makes manual reconciliation impossible. I was excited about the possibility of working with specific data sources, and it still seems like it might work with more targeted/specific datasets, but it was not appropriate in this case.

During this project, I created two visualizations based on a smaller set of the artists’ names dataset – artists who were associated with specific countries. For one, I created a heat map of the number of artists associated with countries. For the other, I used the birthdate info associated with some of these artists to create an animated map of when and where they were born.

Going forward, if geographic information and life dates are added to this dataset, similar visualizations could be created. It might also be interesting to bring in related names and then attempt some network visualization. Using this data to enhance existing records will also, of course, make reconciliation and linking easier and more accurate. Linking data offers the possibility of connecting disparate pieces of information about an individual or specific thing, but it also will always run into the inherent fuzziness of language and the ways in which some things are not knowable.

DPH Internship: Post #3

This is my third blog post about my Smithsonian Libraries internship. I have continued to work with the artists’ names spreadsheet, which consists of about 85,000 names. The organizational names are ready to be loaded, but we’ve been discovering weird things in the dataset, like dates with no names and random blank lines. The reconciliation program more or less ignores these, fortunately, but it does stand out when I’m scanning the data. I’ve also been using the newer version of Refine VIAF, which is called Conciliator and allows more targeted reconciliation – with ORCID numbers, or just VIAF records from the Library of Congress.

I have also hand reconciled a fair number of records in this dataset, and it is making me think more about the imprecision of language and how that plays out in text searches. And the ongoing importance of human labor because of those shortcomings. I am able to quickly reconcile a lot of artists because the birth year and death year are in the VIAF record, but not in the title of the record that Conciliator is using to reconcile. People with common names are conflated both within VIAF and in the reconciliation of the artists’ names; sometimes I am not certain whether the VIAF record and the artist are the same person (usually because of the types of works they’re associated with), but the name is good enough for Conciliator. There is a good deal of uncertainty in this process that can’t be removed, and it is inevitable that there will be mistakes, but these aspects are hard to wrap our heads around, I think, because we expect less squishiness and more clarity when we interact with technology. I spent the last year or so reading critiques of technology and technological determinism for a writing project, and when working with these datasets, it’s very apparent that humans have had their grubby little hands all over, because there is so much variation, even though the information seems incredibly straightforward: name and life dates.

The next piece I will be working on is writing up best practices/lessons learned from working with Open Refine, Refine VIAF, and Conciliator. I have also been thinking about ways to get better results, generally by slicing up the data, or using a specific set of records, like LC or Getty. And then trying to figure out where to go from here.

I did do a map of artist nationalities (only about 5000 entries, but still neat):



DPH Internship: Post #2

This is my second blog post about my Smithsonian Libraries internship. I continued to work with the authors’ names dataset in OpenRefine, which was derived from the books in the Smithsonian’s Digital Library, throughout December and January. I discovered that for some reason, the algorithm wasn’t recognizing exact matches. That is, when I asked it to make matches, it would often choose the worst possible match, even if there was an exact match. My mentor recently discovered that the algorithm generates a ratio based on the Levenshtein distance between the search term and the results from VIAF. The Levenshtein distance is the number of single-letter changes needed to get from one word to another. For example:

“cat” and “bat” have a distance of 1. (c changes to b)
“fish” and “fresh” have a distance of 2. (i changes to e, and we add an r)
“frederick” to “bill” is 8. (5 to remove “frede” and 3 to change “rick” to “bill”)

The algorithm generates a score for each match from 0 to 1, with 1 being the closest match. I discovered it was ignoring scores of 1. We’re currently trying to figure out how to get around this, possibly via some sort of custom code. FUN.
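For anyone curious, the distance itself is easy to compute with the standard dynamic-programming algorithm. Here’s a minimal Python sketch; note that the normalization to a 0–1 score at the end is my guess at one plausible formula, not necessarily the one the reconciliation code actually uses:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0-1 score, with 1 meaning an
    exact match (one plausible formula; the tool's may differ)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("cat", "bat"))         # 1
print(levenshtein("fish", "fresh"))      # 2
print(levenshtein("frederick", "bill"))  # 8
```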

We’ve also been thinking and talking about how many errors are acceptable. It’s really difficult to match names that are common in English (both American and more broadly) and in at least one case, even the VIAF record conflated what seemed to be two different people. I tend to err on the side of providing as much information as possible, but I also don’t want to be sloppy and provide inaccurate information.

I also recently started working with the artists’ names dataset, which is from the Smithsonian American Art Museum. This dataset is much, much larger than the authors’ names (85,000 names) and also includes other data for some of the artists, such as nationality/country. My initial reconciliation of the data was quite disappointing – a very low percentage of names matched – so I started breaking the dataset into chunks (artists with dates, artists without dates, organizational “artists”) in order to get better results. The organizational artists actually worked really well, and over 50% matched a VIAF record exactly. The artists with dates worked fairly well, but the artists without dates did not. One interesting aspect is that this dataset contains a lot of African, Asian, and Middle Eastern artists, and those tended to match VIAF records exactly. This is likely because VIAF (being populated primarily, but not entirely, by American/European institutions) just has fewer African, Asian, and Middle Eastern records in it, so there are fewer similar names for a match to be confused with. I’m thinking that if we don’t link to all of the artists, non-Western artists might be a subset we could separate out based on nationality/country and focus on, since the linking is more accurate. Since there is less likely to be information about these artists floating around online, these links might also be more useful and interesting for users (and would work towards decentering Western art at the same time, which is a win in my book).

I’m also toying with the idea of doing some sort of geographic visualization of this dataset, since it would be neat to be able to see the geographic breadth of the collection. That’s not officially part of my project, but I think it might be fun, and it would be nice to also work on something public-facing, even while I’m immersed in spreadsheets.

Five Colleges Innovative Learning Symposium

The Five Colleges invited me to come talk about assessment at their Innovative Learning Symposium, and I did. They were a lovely audience, and the discussions we had after my talk were interesting and useful (I’m very taken with the notion of adding “repeat student” and “referred by professor” checkboxes to our LibAnalytics form, and love the idea of tracking “feelings,” although I think that will be a bit of a harder sell).

At any rate, here are the text and slides for my talk, “Efficiency or Jagged Edges: The Logics and Possibilities of Assessment.”

DPH Internship: Post #1

This is my first blog post about my Smithsonian Libraries internship, although I actually have been working on it since October, because it’s been an interesting semester (I think fall semester is always interesting, and then everyone settles down for spring semester).

The focus of my internship is on preparing and working with large datasets in order to link and match them. I’ve been working with a dataset of authors whose books have been digitized and are in the Smithsonian Libraries Digital Library. We’re trying to attach the authors to records in VIAF, the Virtual International Authority File, so that users of the Digital Library can easily locate more information about the authors, and so those authors can be connected to their other works. My mentor pulled the data, and I’ve worked on cleaning it (removing organizational authors, etc.) and reconciling it with VIAF via OpenRefine, which is an open source tool for working with messy data and linking it with external, web-based datasets like VIAF. This has meant I’ve spent some quality time with Excel and OpenRefine tutorials, and have also been revisiting how to query databases. My most recent work with the author dataset also played around with ways to improve on the automatic matching/reconciliation performed by OpenRefine by coming up with a heuristic that matches more names based on match similarity scores. This involved a good amount of spot-checking of individual names, which took a lot of time but was also pretty interesting (there are a lot of neat books in the Digital Library).
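The heuristic boiled down to accepting probable matches whose similarity score clears some threshold and setting aside the rest for manual spot-checking. A sketch of that idea (the threshold value, names, and VIAF IDs below are all invented for illustration):

```python
def triage(matches, threshold=0.9):
    """Split (name, candidate, score) tuples into auto-accepted
    matches and ones left for manual review."""
    accept, review = [], []
    for name, candidate, score in matches:
        if score >= threshold:
            accept.append((name, candidate))
        else:
            review.append((name, candidate))
    return accept, review

# Invented examples; the IDs are not real VIAF records.
matches = [
    ("Darwin, Charles", "viaf/11111", 1.0),
    ("Smith, J.", "viaf/22222", 0.55),
]
accepted, needs_review = triage(matches)
print(accepted)      # [('Darwin, Charles', 'viaf/11111')]
print(needs_review)  # [('Smith, J.', 'viaf/22222')]
```

Choosing the threshold is exactly the judgment call about how many errors we can tolerate.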

I’ve been working with the authors dataset primarily to familiarize myself with OpenRefine and data cleaning and reconciliation and will be working with other datasets next semester. The one I am particularly excited about is the art and artist vertical files from the Smithsonian American Art Museum, which is one of my favorite museums. In October, I visited the museum and saw the vertical files. There is so much fascinating material in the files, but the files are underused because they are not cataloged. The fact that materials are only findable if they have a record or some sort of representation that can be searched for (which generally means text) is something I’ve found myself pointing out in more and more of my instruction, both individual and group, because it’s something that a lot of students and faculty don’t think about. My final project for the previous class in the certificate program focused on pushing students to think about primary sources as having their own histories, and sought to emphasize the creation of collections and records as part of this history. Working with datasets like the authors dataset, which is really pretty straightforward (names and birthdates), and VIAF, where authority files are sometimes split and the best record sometimes comes from the least obvious institution, really points to the historical contingency and inconsistency of data, despite our best efforts. This aspect is also something I’ve been interested in, since data often takes on the appearance of empirical truth.

That was kind of rambling, but the dehistoricization of libraries, collections, and information systems is pervasive and does political work that I find troubling, so I spend a lot of time thinking about it. On a more practical note, I’m happy to be learning how to clean, prepare, and manipulate data. I’m thinking about working in some data science tutorials on an online course next semester, but I tend to overcommit.

Final Project: Thinking Historically about Primary Sources

Thinking Historically about Primary Sources is my final project for the Teaching & Learning History in the Digital Age course. Below is my essay explaining the process of creating the project.

My final project is a LibGuide that focuses on helping students think about primary sources (and really, any source) historically and contextually. This is something I often talk about in both undergraduate and graduate library research instruction sessions, but I hadn’t really figured out a way to talk about it clearly and systematically. I often would describe how primary sources might not exist, might not be accessible, might not be translated, or might not say what you would like them to say, and suggest that students remain somewhat open in terms of their topic, approach, and argument until finding, accessing, and analyzing primary sources.

The guide provides a framework for thinking about primary sources historically and contextually – what I call the primary source lifecycle. There are moments where the lifecycle can be broken or interrupted, as well as other barriers that can stand in the way of finding and accessing primary sources. I included both because I want students to have some sort of model for thinking about primary sources but also to appreciate the complexity of their histories and to confront the idea that the historical record is fragmentary and incomplete. Wineburg’s idea of the jagged edges of history has stayed with me throughout this semester, and I think the idea that not everything is knowable, that there is always uncertainty when studying history, is something students often struggle with and find uncomfortable. The primary source lifecycle, with all its caveats, is an attempt to “uncover,” in Calder’s terminology, that uncertainty, specifically in regards to primary sources, but also in regards to history more broadly. Highlighting uncertainty might also help students grapple with understanding historical writing as interpretation and argument instead of as an objective description of what happened. Wineburg talks about the strangeness and otherness of history, of “what we cannot see” and “the congenital blurriness of our vision”; getting students to think about what they can and cannot know, and the reasons behind that, helps convey this.

The guide includes three case studies that speak to the complex histories of primary sources: silent film, medieval bestiaries, and early American books. These are all histories I have some familiarity with, although I did spend some time researching each of them in order to find relevant readings and to develop good questions. When I initially conceived of this project, I had hoped to also incorporate some content around how information systems have histories and must be understood contextually, but as I began working on it, this element felt unwieldy and really like a distinct project. However, I did try to bring some aspects of this topic out in the case studies, and made sure to include “record creation” in the primary source lifecycle. This topic is also something I regularly talk about in my instruction sessions, and unlike the topic of my final project, it is something I have repeatedly thought through and could probably talk about while asleep.

Each case study includes readings and websites to explore and a series of questions. I tried to embed the primary sources themselves in the guide to encourage students to explore them as well, because I want them to experience their strangeness for themselves. The first case study is of silent film. The readings include a description of the Edison film collection at the Library of Congress, a record for a specific film, and a couple of short articles on how and why most silent films are gone forever. The description of the collection points to the contingency of what objects survive; the collection exists because of copyright rules at the time. The record highlights the need for record creation and the constructedness of information systems. The articles provide context about what might have happened to similar objects. The questions ask students to sketch the lifecycles of both the Edison films and silent films more broadly, to consider interruptions in those lifecycles, to think about how the incompleteness of the historical record affects what we know, and to think about how what happened to silent film might inform how we approach a perhaps even more ephemeral medium, digital film and video.

The second case study again emphasizes the framework of the primary source lifecycle, but also builds on the first case study. It is a comparison of two medieval bestiaries. The Northumberland bestiary was preserved by the Dukes of Northumberland for 700 years but then sold to a private collector. The Getty recently purchased it from the collector and then digitized it as part of its Open Content Program. The Aberdeen bestiary, in contrast, seems to have been in the same location since its creation, and its digital version is much more extensive (that is, it includes an in-depth history of the object and comparisons to similar bestiaries, and the manuscript has been translated, transcribed, and made searchable). The readings include one article on the Northumberland bestiary that focuses on its history and accessibility, the About page for the Getty Open Content program, and the websites for both bestiaries, which students are to explore on their own. The questions again have students sketch out the lifecycles of both bestiaries and think about the work that a program like Getty Open Content does in regards to findability and accessibility. The final question asks students to pay attention to the functionality of the web versions of both bestiaries. I included this to help get at the different pieces of digitization, the labor involved, and how those decisions can impact the ways in which digital surrogates can be used. This is part of my effort to destabilize information systems particularly and information technology more broadly.

The third and final case study is of the Evans bibliography, which eventually became the Readex database, Early American Imprints, Series 1. The readings include the Wikipedia entry on Charles Evans, marketing materials from Readex, and a review of the Readex database from History Matters. Students are also asked to explore the Readex website, the Text Creation Partnership version of the website, and digitized volumes of the American Bibliography on HathiTrust. This case study also asks students to outline the lifecycles of these materials, but it is more complicated and confusing than the other two case studies. Like the bestiaries case study, it also asks students to carefully examine the different features of information systems – the Readex database, the TCP version, and HathiTrust – in order to denaturalize them, and to think about how projects like the TCP affect the primary source lifecycle. I like this example because it also highlights the history of information systems themselves, as the Evans collection moves from print bibliography to full-text microfilm to full-text online database and almost-full-text online bibliography. Finally, this case study emphasizes the ethical questions around digitizing material, charging for access, and copyright. Some volumes of the American Bibliography are still in copyright and so not available on HathiTrust or the Internet Archive. The History Matters review strongly questions Readex’s position that it is expanding access, when the database is so expensive and out of reach for many institutions. The ethos of the TCP is very different, and the TCP also provides another model for doing these sorts of mass digitization projects. I obviously have opinions about this, but the goal of this case study is merely for students to start becoming familiar with and thinking about these ethical issues.

I also developed six additional prompts focused on primary sources that are more open-ended and require students to research the materials on their own. These do not include readings and the questions are less targeted. Like the case studies, these prompts emphasize the lifecycle and contingency of sources, denaturalizing and destabilizing information systems, the ethics of commercial digitization and copyright, and non-profit models of digitization and access. The case studies and prompts also help familiarize students with some of the primary source resources available online and introduce the many different primary sources they can work with. I would also hope that the guide would spark curiosity and interest in using the sources.

The core themes and questions of my final project are issues that I have been mulling over for some time, but this is the first time I’ve tried to articulate a coherent approach. The case studies are appropriate in scope for my instruction sessions, but given the other material I generally need to cover in those sessions, they might work better as homework. Although this LibGuide does spend some time on information systems, I would like to develop a similar guide that more closely focuses on information systems and concepts such as metadata and the history of indices and databases.

Projects: Other People’s and My Own

The videos were quite helpful in both practical and theoretical ways. Sleeter reminded me that I originally conceived of my final project in terms of “uncovering” the history and context of primary sources and library systems. Wieringa and Sharpe talked about working backwards from the overarching learning objectives, which I think will be helpful, as those are much more clearly defined for me right now. Sleeter also talked about how he wanted his site to model historical thinking, which is also helpful because I keep getting fixated on how my site won’t be as interactive as my library instruction sessions usually are; it’s enough to frame it as modeling and moving students toward historical thinking. My instructional goals are usually fairly modest, but I seem to think this project needs to do way more. Practically, it was good to hear that these students ended up scaling their projects back some, or that things didn’t turn out exactly the way they wanted. I’m trying to focus more on this being a proof of concept, so hearing that other students had to do the same thing was helpful. These videos will probably help me procrastinate less on the final project, more than anything else (maybe we should have watched them earlier?).