ulanr 0.3

11 Mar 2016

Matching personal names is a perennial difficulty in the study of art history, as researchers must deal not only with variant spellings in collections around the world (looking for “Rembrandt van Rijn”? Don’t forget to check for mentions of “رامبرانت”!) but also with variant spellings that have appeared in historical documents (e.g. “Rembrandt Hermanszoon van Rijn”). The Getty’s Union List of Artist Names (ULAN) was expressly designed to catalog both modern-day spellings and historical variants and occurrences, providing museums with an authority list to aid in describing their own collections. However, unless the ULAN has the exact spelling of an artist’s name in its directory, finding candidate matches still involves a lot of manual work.

This package aims to partially automate the identification of matching candidates in the ULAN by fuzzy matching.

Local and Remote Methods

ulan_match takes a character vector of names, and returns a named list of data frames with the attributes of candidate matches.

method = "local" searches a table of the ULAN’s alternate names (available through the ulanrdata package). It returns a named list with one data.frame per input name, each listing the ULAN entities whose alternate names have a high character-level cosine similarity to that input name. When the input name matches one of the ULAN alternate names exactly, the local method returns only those matches, without running costly string similarity measurements. (Note that two separate individuals, e.g. Vincent van Gogh (1820-1888) and Vincent van Gogh (1853-1890), can match the same variant name spelling.)

library(ulanr)
ulan_match(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "local")
#> $`Rembrandt van Rijn`
#> Source: local data frame [1 x 7]
#>
#>          id          pref_name birth_year death_year gender nationality
#>       (int)              (chr)      (int)      (int)  (chr)       (chr)
#> 1 500011051 Rembrandt van Rijn       1606       1669   male       Dutch
#> Variables not shown: score (dbl)
#>
#> $`Vincent van Gogh`
#> Source: local data frame [2 x 7]
#>
#>          id         pref_name birth_year death_year gender nationality
#>       (int)             (chr)      (int)      (int)  (chr)       (chr)
#> 1 500337743 Gogh, Vincent van       1820       1888   male       Dutch
#> 2 500115588 Gogh, Vincent van       1853       1890   male       Dutch
#> Variables not shown: score (dbl)
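
To make "character-level cosine similarity" a little more concrete, here is a minimal base R sketch. It assumes plain single-character counts and is not ulanr's actual implementation, which may tokenize names differently; char_cosine is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper, not part of ulanr: cosine similarity between the
# single-character count vectors of two strings (an assumed tokenization).
char_cosine <- function(a, b) {
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  alphabet <- union(chars_a, chars_b)
  # Count how often each character of the shared alphabet occurs
  count_a <- as.numeric(table(factor(chars_a, levels = alphabet)))
  count_b <- as.numeric(table(factor(chars_b, levels = alphabet)))
  sum(count_a * count_b) / (sqrt(sum(count_a^2)) * sqrt(sum(count_b^2)))
}

char_cosine("Gogh, Vincent van", "Vincent van Gogh")
char_cosine("Rembrandt van Rijn", "qwerty")
```

Because this sketch only counts characters, it ignores their order, so an inverted form such as "Gogh, Vincent van" still scores close to 1 against "Vincent van Gogh".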

An alternate method, method = "sparql", works by directly querying the Getty’s live SPARQL endpoint, relying on their Lucene index to search for similar matches across the alternate and preferred names for all artists.

ulan_match(c("Rembrandt", "Vincent van Gogh"), method = "sparql")
#> $Rembrandt
#> Source: local data frame [4 x 7]
#>
#>          id           pref_name birth_year death_year gender nationality
#>       (int)               (chr)      (int)      (int)  (chr)       (chr)
#> 1 500006691  Bugatti, Rembrandt       1884       1916   male     Italian
#> 2 500049481 Lockwood, Rembrandt       1815       1889   male    American
#> 3 500019719    Peale, Rembrandt       1778       1860   male    American
#> 4 500011051  Rembrandt van Rijn       1606       1669   male       Dutch
#> Variables not shown: score (dbl)
#>
#> $`Vincent van Gogh`
#> Source: local data frame [5 x 7]
#>
#>          id         pref_name birth_year death_year gender nationality
#>       (int)             (chr)      (int)      (int)  (chr)       (chr)
#> 1 500337743 Gogh, Vincent van       1820       1888   male       Dutch
#> 2 500341187   Gogh, V. W. van       1890       2010   male       Dutch
#> 3 500115588 Gogh, Vincent van       1853       1890   male       Dutch
#> 4 500339434    Gogh, Theo van       1857       1891   male       Dutch
#> 5 500099450   Gogh, Peter van       1900       2050   male       Dutch
#> Variables not shown: score (dbl)

This format plays nicely with dplyr::bind_rows(), whose .id argument will allow you to create a column of these list names in one unified data frame:

suppressPackageStartupMessages(library(dplyr))
ulan_match(c("Rembrandt", "Vincent van Gogh"), method = "sparql") %>%
  bind_rows(.id = "original_name")
#> Source: local data frame [9 x 8]
#>
#>      original_name        id           pref_name birth_year death_year
#>              (chr)     (int)               (chr)      (int)      (int)
#> 1        Rembrandt 500006691  Bugatti, Rembrandt       1884       1916
#> 2        Rembrandt 500049481 Lockwood, Rembrandt       1815       1889
#> 3        Rembrandt 500019719    Peale, Rembrandt       1778       1860
#> 4        Rembrandt 500011051  Rembrandt van Rijn       1606       1669
#> 5 Vincent van Gogh 500337743   Gogh, Vincent van       1820       1888
#> 6 Vincent van Gogh 500341187     Gogh, V. W. van       1890       2010
#> 7 Vincent van Gogh 500115588   Gogh, Vincent van       1853       1890
#> 8 Vincent van Gogh 500339434      Gogh, Theo van       1857       1891
#> 9 Vincent van Gogh 500099450     Gogh, Peter van       1900       2050
#> Variables not shown: gender (chr), nationality (chr), score (dbl)

Date restrictions

You may have more than just a name when searching for an artist - you may also know when they were alive. Use the early_year and late_year arguments to establish bounds for match candidates. When strictly_between = FALSE (the default), a match is allowed whenever the input date range intersects with the lifespan recorded in the ULAN. When strictly_between = TRUE, a ULAN match’s life dates must fall completely within the early_year:late_year range.

ulan_match("Rembrandt", early_year = 1600, late_year = 1800, strictly_between = FALSE, method = "sparql")
#> $Rembrandt
#> Source: local data frame [2 x 7]
#>
#>          id          pref_name birth_year death_year gender nationality
#>       (int)              (chr)      (int)      (int)  (chr)       (chr)
#> 1 500019719   Peale, Rembrandt       1778       1860   male    American
#> 2 500011051 Rembrandt van Rijn       1606       1669   male       Dutch
#> Variables not shown: score (dbl)

ulan_match("Rembrandt", early_year = 1600, late_year = 1800, strictly_between = TRUE, method = "sparql")
#> $Rembrandt
#> Source: local data frame [1 x 7]
#>
#>          id          pref_name birth_year death_year gender nationality
#>       (int)              (chr)      (int)      (int)  (chr)       (chr)
#> 1 500011051 Rembrandt van Rijn       1606       1669   male       Dutch
#> Variables not shown: score (dbl)

You may supply early_year and late_year as vectors of the same length as names, or alternatively give each a single value that will be recycled for all queries.
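
The two date modes boil down to an interval test, which can be sketched in a few lines of base R. This is only an illustration of the logic described above, not the package's code: lifespan_ok is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```r
# Hypothetical helper, not part of ulanr: does a ULAN lifespan satisfy the
# early_year/late_year bounds? Inclusive comparisons are an assumption.
lifespan_ok <- function(birth_year, death_year, early_year, late_year,
                        strictly_between = FALSE) {
  if (strictly_between) {
    # Lifespan must sit entirely inside the requested range
    birth_year >= early_year & death_year <= late_year
  } else {
    # Lifespan only needs to overlap the requested range
    birth_year <= late_year & death_year >= early_year
  }
}

# Life dates taken from the "Rembrandt" results above:
# Peale, Rembrandt (1778-1860) and Rembrandt van Rijn (1606-1669)
birth <- c(1778, 1606)
death <- c(1860, 1669)

lifespan_ok(birth, death, early_year = 1600, late_year = 1800)
lifespan_ok(birth, death, early_year = 1600, late_year = 1800,
            strictly_between = TRUE)
```

Under these assumptions, Peale passes only the overlapping test, because he died after 1800, which agrees with the two result sets shown above. The single early_year and late_year values are recycled across both lifespans, much as a single value is recycled across names.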

`cutoff_score` and `max_results`

The data.frame returned for each name given to ulan_match contains a score column holding the similarity score produced by either the cosine similarity metric (local method) or the Lucene index (sparql method). These scores are not directly comparable across methods. The cosine similarity score ranges between 0 and 1, while the Lucene result score, in this particular query environment, tends to range between 0 and 12 or more. (Exercise care when interpreting the Lucene result scores!)

If cutoff_score is set to NULL, then sane defaults based on the given method are used (0.95 for local and 3 for sparql), which will sift out many false positive matches. If no match is found above the cutoff score, ulan_match will return a data frame with 1 row of NA values.

Set cutoff_score to 0 to return results of any score.

You may also set the maximum number of results to be returned via max_results. This defaults to 5, though for method = "sparql" any number higher than 50 will be ignored, out of politeness towards the Getty’s endpoint.

ulan_match(c("Vincent van Gogh", "qwerty"), cutoff_score = 0, max_results = 10, method = "local")
#> $`Vincent van Gogh`
#> Source: local data frame [2 x 7]
#>
#>          id         pref_name birth_year death_year gender nationality
#>       (int)             (chr)      (int)      (int)  (chr)       (chr)
#> 1 500337743 Gogh, Vincent van       1820       1888   male       Dutch
#> 2 500115588 Gogh, Vincent van       1853       1890   male       Dutch
#> Variables not shown: score (dbl)
#>
#> $qwerty
#> Source: local data frame [10 x 7]
#>
#>           id           pref_name birth_year death_year      gender
#>        (int)               (chr)      (int)      (int)       (chr)
#> 1  500179597       Watty, Werner       1952       2072        male
#> 2  500371234               Query       1200       2080 unavailable
#> 3  500048153        Werner, Woty       1903       1971        male
#> 4  500094738          Reyset, W.       1571       1730 unavailable
#> 5  500196019      Stewart, Kerry       1965       2085        male
#> 6  500189669      Whybrow, Terry       1932       2052        male
#> 7  500000039      Winters, Terry       1949       2069        male
#> 8  500061411         West, Jerry       1933       2033        male
#> 9  500069312 Towry Whyte Painter       -650       -480 unavailable
#> 10 500242784   Stewart, Jerry W.       1850       2080        male
#> Variables not shown: nationality (chr), score (dbl)

ulan_match(c("Vincent van Gogh", "qwerty"), cutoff_score = NULL, max_results = 10, method = "local")
#> Warning in construct_results(results = NA, name = name): No matches found
#> for qwerty
#> $`Vincent van Gogh`
#> Source: local data frame [2 x 7]
#>
#>          id         pref_name birth_year death_year gender nationality
#>       (int)             (chr)      (int)      (int)  (chr)       (chr)
#> 1 500337743 Gogh, Vincent van       1820       1888   male       Dutch
#> 2 500115588 Gogh, Vincent van       1853       1890   male       Dutch
#> Variables not shown: score (dbl)
#>
#> $qwerty
#> Source: local data frame [1 x 7]
#>
#>      id pref_name birth_year death_year gender nationality score
#>   (int)     (chr)      (int)      (int)  (chr)       (chr) (dbl)
#> 1    NA        NA         NA         NA     NA          NA    NA
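
The interplay of cutoff_score and max_results can be pictured as a small filtering step over a table of scored candidates. The sketch below is a guess at that logic, not the package's source: apply_cutoff is a hypothetical helper, and it assumes that scores at or above the cutoff are kept, that results are ordered by descending score, and that "ignored" for sparql means capping max_results at 50.

```r
# Hypothetical helper, not part of ulanr: filter scored candidates the way
# the cutoff_score and max_results arguments are described above.
apply_cutoff <- function(candidates, method = c("local", "sparql"),
                         cutoff_score = NULL, max_results = 5) {
  method <- match.arg(method)
  # NULL falls back to the per-method defaults named in the text
  if (is.null(cutoff_score)) {
    cutoff_score <- if (method == "local") 0.95 else 3
  }
  # Assumption: values above 50 are capped for the sparql method
  if (method == "sparql") max_results <- min(max_results, 50)
  kept <- candidates[candidates$score >= cutoff_score, , drop = FALSE]
  kept <- kept[order(kept$score, decreasing = TRUE), , drop = FALSE]
  if (nrow(kept) == 0) {
    # No match above the cutoff: return one row of NA values
    return(candidates[NA_integer_, , drop = FALSE])
  }
  head(kept, max_results)
}

# Made-up ids and scores, purely to exercise the helper
candidates <- data.frame(id = 1:3, score = c(0.99, 0.50, 0.96))
apply_cutoff(candidates, method = "local")
apply_cutoff(candidates, method = "local", cutoff_score = 0, max_results = 2)
apply_cutoff(candidates, method = "sparql")
```

The last call returns a single row of NA values, since none of the made-up scores reaches the sparql default of 3.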

`ulan_id` and `ulan_data`

These are utility wrapper functions that return only the top match from ulan_match for each name. ulan_id returns a vector of ID numbers, while ulan_data returns a single data frame with all the columns from the regular results of ulan_match, along with a names column matching the original vector of names supplied by the user.

ulan_id(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "sparql")
#> [1] 500011051 500337743

ulan_data(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "sparql")
#> Source: local data frame [2 x 8]
#>
#>                names        id          pref_name birth_year death_year
#>                (chr)     (int)              (chr)      (int)      (int)
#> 1 Rembrandt van Rijn 500011051 Rembrandt van Rijn       1606       1669
#> 2   Vincent van Gogh 500337743  Gogh, Vincent van       1820       1888
#> Variables not shown: gender (chr), nationality (chr), score (dbl)
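
One way to picture what these wrappers do is to reduce a list shaped like the output of ulan_match by hand. The sketch below works on a small mock list rather than calling the package, and it assumes that "top match" simply means the first row of each data frame; it is not the wrappers' actual source.

```r
# Mock of ulan_match() output, reduced to two columns; the values are
# copied from the results shown earlier in this post.
results <- list(
  "Rembrandt van Rijn" = data.frame(
    id = 500011051L, pref_name = "Rembrandt van Rijn",
    stringsAsFactors = FALSE
  ),
  "Vincent van Gogh" = data.frame(
    id = c(500337743L, 500115588L),
    pref_name = c("Gogh, Vincent van", "Gogh, Vincent van"),
    stringsAsFactors = FALSE
  )
)

# Like ulan_id: the first id from each data frame, as a plain vector
top_ids <- unname(vapply(results, function(df) df$id[1], integer(1)))

# Like ulan_data: the first row from each data frame, plus a names column
top_rows <- do.call(rbind, lapply(names(results), function(nm) {
  data.frame(names = nm, results[[nm]][1, , drop = FALSE],
             stringsAsFactors = FALSE)
}))

top_ids
top_rows
```

Taking only the first row is convenient, but as the Vincent van Gogh case shows, it silently picks one of two equally plausible people, so it is worth checking the full ulan_match results when a name is ambiguous.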

Lincoln, Matthew D. "ulanr 0.3." Matthew Lincoln, PhD (blog), 11 Mar 2016, https://matthewlincoln.net/2016/03/11/ulanr-3-0.html.
