OpenRefine Exercise
Working through the Programming Historian's OpenRefine tutorial
Practice data
OpenRefine Tutorial
Tips on following the tutorial:
- Make sure to import the data properly according to the tutorial instructions - they walk you through adjusting some of the default import settings to read the data correctly. If you miss this, then a lot of the following steps will give very different results than are shown in the tutorial.
- Remember to click the “Remove All” button under the “Facet / Filter” menu between steps. This doesn’t remove data, it just removes the facet view, letting you start fresh with the next step in the tutorial.
- While the Split multi-valued cells... command is crucial for many data cleaning operations in OpenRefine, remember to Join multi-valued cells... again before exporting a CSV of your tidied data.
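To make the last tip concrete, here is a minimal Python sketch of what splitting and re-joining a multi-valued column amounts to. It is an illustration only, not how OpenRefine is implemented: the function names, the `id` key, and the sample rows are all hypothetical, and OpenRefine keeps split rows together as a record rather than relying on a key column.

```python
def split_multi_valued(rows, column, sep=";"):
    """Return one row per value, roughly like Split multi-valued cells..."""
    split_rows = []
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)              # copy the other columns
            new_row[column] = value.strip()  # one value per row
            split_rows.append(new_row)
    return split_rows


def join_multi_valued(rows, key, column, sep="; "):
    """Recombine rows sharing the same key, roughly like Join multi-valued cells..."""
    grouped = {}
    for row in rows:
        if row[key] not in grouped:
            grouped[row[key]] = dict(row, **{column: []})
        grouped[row[key]][column].append(row[column])
    return [dict(r, **{column: sep.join(r[column])}) for r in grouped.values()]


# Hypothetical sample rows (not taken from the tutorial data).
rows = [
    {"id": 1, "genre": "Landscape; Portrait"},
    {"id": 2, "genre": "Still Life"},
]
split_rows = split_multi_valued(rows, "genre")           # three rows, one genre each
rejoined = join_multi_valued(split_rows, "id", "genre")  # back to two rows
```

If you export while the data is still split, the CSV has one line per value rather than one line per original record, which is why joining first matters.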
Looking for more?
This tutorial covers only a small portion of what OpenRefine can do. Library Carpentry offers a more complete tutorial that you can start looking through if you have time.
If you have extra time, I recommend experimenting with the Knoedler data we used for the Palladio exercise to reconcile the terms to a controlled vocabulary.
Reconciliation is the process of linking your own internal lists of people, places, and concepts to widely used authorities such as the Library of Congress Subject Headings, the Virtual International Authority File, or the Getty Vocabularies. This makes your data more reusable by others because they can match your exact authority IDs rather than having to scan your text data and hope that you used the exact same formatting and spelling as their own data does.
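As a rough illustration of why shared authority IDs help, the sketch below compares two small datasets first by their text labels and then by ID. Everything in it is hypothetical: the labels, the `authority_id` field name, and the `example:` identifiers are placeholders, not real Getty or Library of Congress IDs.

```python
# Two hypothetical datasets describing the same concept with different spellings.
mine = [
    {"label": "Landscapes", "authority_id": "example:0001"},
    {"label": "Portraits", "authority_id": "example:0002"},
]
theirs = [
    {"label": "landscape (genre)", "authority_id": "example:0001"},
    {"label": "still lifes", "authority_id": "example:0003"},
]


def shared(a, b, field):
    """Return the values of `field` that appear in both datasets."""
    return {record[field] for record in a} & {record[field] for record in b}


by_label = shared(mine, theirs, "label")      # spellings differ, so nothing matches
by_id = shared(mine, theirs, "authority_id")  # the shared ID links the two records
```

Matching on labels finds nothing in common here, while matching on IDs links the two landscape records despite their different wording.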
- Load the Knoedler data into OpenRefine.
- Split the genre column, which uses semicolons (;) as a delimiter.
- Click on the menu for the genre column, and select Reconcile > Start reconciling…
- Click the Add Standard Service… button, and paste this URL into the box: https://services.getty.edu/vocab/reconcile Then click Add Service.
- Select the “Getty Vocabularies Reconciliation Service” that should now appear in the menu.
- Choose to reconcile the cells to the AAT search, and then click Start Reconciling, which will compare the terms in the genre column to the Getty’s Art & Architecture Thesaurus.
- After a short wait, you should see options underneath each value. Mouse over each term to see the full definition from the Getty. Once you find the appropriate term to reconcile to, click the double-check-mark box, which will match that term to every cell that has the same value. If you don’t see an appropriate term, click the double-check-mark box by Create new item, which will mark records with that term as not having a reconcilable match.
(Note: because of some eccentricities with the way that the Getty’s reconciliation service currently works, some of the more general terms like Landscape don’t make it into the top displayed reconciliation results.)
- Continue down the list, and take advantage of the facets that have been created during reconciliation, selecting none from the genre: judgment facet to show only those records that haven’t been reconciled yet.
- Once you have reconciled the terms, you can use the dropdown menu to choose Reconcile > Add entity identifiers column to add a column with the unique identifier for that term from the AAT.
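The matching steps above can be sketched in a few lines of Python. This is a simplified model of the workflow, not OpenRefine's own code: the `apply_judgments` function, the `decisions` mapping, and the `example:` identifiers are hypothetical stand-ins for the choices you make by clicking in the interface, and no real AAT IDs are used.

```python
def apply_judgments(cells, decisions):
    """Apply one decision per term to every cell holding that term.

    `decisions` maps a term to an identifier (a match), or to None
    (the equivalent of choosing Create new item). Terms missing from
    `decisions` have not been reviewed yet.
    """
    results = []
    for term in cells:
        if term not in decisions:
            results.append({"value": term, "judgment": "none", "id": None})
        elif decisions[term] is None:
            results.append({"value": term, "judgment": "new", "id": None})
        else:
            results.append({"value": term, "judgment": "matched", "id": decisions[term]})
    return results


# Hypothetical genre cells after splitting, and hypothetical decisions.
cells = ["Landscape", "Portrait", "Landscape", "Unknown"]
decisions = {"Landscape": "example:landscape", "Portrait": None}

results = apply_judgments(cells, decisions)

# Like selecting "none" in the judgment facet: what still needs review.
unreconciled = [r["value"] for r in results if r["judgment"] == "none"]

# Like Add entity identifiers column: one identifier (or blank) per cell.
id_column = [r["id"] for r in results]
```

One decision per distinct term covers every identical cell, which is why the double-check-mark box saves so much clicking on a column with many repeated values.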