ulanr 0.3
Matching personal names is a perennial difficulty in the study of art history, as researchers must deal not only with variant spellings in collections around the world (looking for “Rembrandt van Rijn”? Don’t forget to check for mentions of “رامبرانت”!) but also variant spellings that have appeared in historical documents (e.g. “Rembrandt Hermanszoon van Rijn”). The Getty’s Union List of Artist Names was expressly designed to catalog both modern-day spellings as well as historical variants and occurrences, providing museums an authority list to aid in describing their own collections. However, unless the ULAN has the exact spelling of an artist’s name in their directory, finding candidate matches still involves a lot of manual work.
This package aims to partially automate the identification of matching candidates in the ULAN by fuzzy matching.
Local and Remote Methods
ulan_match
takes a character vector of names, and returns a named list of data frames with the attributes of candidate matches.
method = "local"
will search for results from a table of the ULAN’s alternate names (available through the ulanrdata package), returning a named list with one data.frame per input name, each listing ULAN entities with alternate names that have a high character-level cosine similarity to the input names.
When the input name matches one of the ULAN alternate names exactly, the local
method will return only those matches, without running costly string similarity measurements. (Note that it is possible for two separate individuals, e.g. Vincent van Gogh (1820-1888) and Vincent van Gogh (1853-1890) to match the same variant name spelling.)
library(ulanr)
ulan_match(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "local")
#> $`Rembrandt van Rijn`
#> Source: local data frame [1 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500011051 Rembrandt van Rijn 1606 1669 male Dutch
#> Variables not shown: score (dbl)
#>
#> $`Vincent van Gogh`
#> Source: local data frame [2 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500337743 Gogh, Vincent van 1820 1888 male Dutch
#> 2 500115588 Gogh, Vincent van 1853 1890 male Dutch
#> Variables not shown: score (dbl)
An alternate method, method = "sparql"
, works by directly querying the Getty’s live SPARQL endpoint, relying on their Lucene index to search for similar matches across the alternate and preferred names for all artists.
ulan_match(c("Rembrandt", "Vincent van Gogh"), method = "sparql")
#> $Rembrandt
#> Source: local data frame [4 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500006691 Bugatti, Rembrandt 1884 1916 male Italian
#> 2 500049481 Lockwood, Rembrandt 1815 1889 male American
#> 3 500019719 Peale, Rembrandt 1778 1860 male American
#> 4 500011051 Rembrandt van Rijn 1606 1669 male Dutch
#> Variables not shown: score (dbl)
#>
#> $`Vincent van Gogh`
#> Source: local data frame [5 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500337743 Gogh, Vincent van 1820 1888 male Dutch
#> 2 500341187 Gogh, V. W. van 1890 2010 male Dutch
#> 3 500115588 Gogh, Vincent van 1853 1890 male Dutch
#> 4 500339434 Gogh, Theo van 1857 1891 male Dutch
#> 5 500099450 Gogh, Peter van 1900 2050 male Dutch
#> Variables not shown: score (dbl)
This format plays nicely with dplyr::bind_rows()
, whose .id
argument will allow you to create a column of these list names in one unified data frame:
suppressPackageStartupMessages(library(dplyr))
ulan_match(c("Rembrandt", "Vincent van Gogh"), method = "sparql") %>%
bind_rows(.id = "original_name")
#> Source: local data frame [9 x 8]
#>
#> original_name id pref_name birth_year death_year
#> (chr) (int) (chr) (int) (int)
#> 1 Rembrandt 500006691 Bugatti, Rembrandt 1884 1916
#> 2 Rembrandt 500049481 Lockwood, Rembrandt 1815 1889
#> 3 Rembrandt 500019719 Peale, Rembrandt 1778 1860
#> 4 Rembrandt 500011051 Rembrandt van Rijn 1606 1669
#> 5 Vincent van Gogh 500337743 Gogh, Vincent van 1820 1888
#> 6 Vincent van Gogh 500341187 Gogh, V. W. van 1890 2010
#> 7 Vincent van Gogh 500115588 Gogh, Vincent van 1853 1890
#> 8 Vincent van Gogh 500339434 Gogh, Theo van 1857 1891
#> 9 Vincent van Gogh 500099450 Gogh, Peter van 1900 2050
#> Variables not shown: gender (chr), nationality (chr), score (dbl)
Date restrictions
You may have more than just a name when searching for an artist - you may also know when they were alive.
Use the early_year
and late_year
arguments to establish bounds for match candidates.
When strictly_between = FALSE
(the default), matches will be allowed when the input lifespan intersects with the lifespan defined by the ULAN.
When strictly_between = TRUE
, then ULAN matches’ life dates must fall completely within the early_year:late_year
range.
ulan_match("Rembrandt", early_year = 1600, late_year = 1800, strictly_between = FALSE, method = "sparql")
#> $Rembrandt
#> Source: local data frame [2 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500019719 Peale, Rembrandt 1778 1860 male American
#> 2 500011051 Rembrandt van Rijn 1606 1669 male Dutch
#> Variables not shown: score (dbl)
ulan_match("Rembrandt", early_year = 1600, late_year = 1800, strictly_between = TRUE, method = "sparql")
#> $Rembrandt
#> Source: local data frame [1 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500011051 Rembrandt van Rijn 1606 1669 male Dutch
#> Variables not shown: score (dbl)
You may supply vectors to early_year
and late_year
of the same length as names
, or you can alternately provide them with a single value that will be recycled for all queries.
cutoff_score
and max_results
The data.frame returned for each name given to ulan_match
contains a score
column with the similarity score returned by either the cosine similarity metric used for the local
method, or the Lucene index used by the sparql
method.
These scores are not directly comparable across methods.
The cosine similarity score may range between 0 to 1, while the Lucene result score, in this particular query environment, tends to range between between 0 and 12 or more.
(Exercise care when interpreting the Lucene result scores!)
If cutoff_score
is set to NULL
, then sane defaults are used based on the given method (0.95 for local
and 3 for sparql
) that will sift out many false positive matches.
If not match is found above the cutoff score, ulan_match
will return a data frame with 1 row of NA
values.
Set cutoff_score
to 0 to return results of any score.
You may also set the maximum number of results to be returned via max_results.
This defaults to 5, though for method = "sparql"
any number higher than 50 will be ignored, out of politeness towards the Getty’s endpoint.
ulan_match(c("Vincent van Gogh", "qwerty"), cutoff_score = 0, max_results = 10, method = "local")
#> $`Vincent van Gogh`
#> Source: local data frame [2 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500337743 Gogh, Vincent van 1820 1888 male Dutch
#> 2 500115588 Gogh, Vincent van 1853 1890 male Dutch
#> Variables not shown: score (dbl)
#>
#> $qwerty
#> Source: local data frame [10 x 7]
#>
#> id pref_name birth_year death_year gender
#> (int) (chr) (int) (int) (chr)
#> 1 500179597 Watty, Werner 1952 2072 male
#> 2 500371234 Query 1200 2080 unavailable
#> 3 500048153 Werner, Woty 1903 1971 male
#> 4 500094738 Reyset, W. 1571 1730 unavailable
#> 5 500196019 Stewart, Kerry 1965 2085 male
#> 6 500189669 Whybrow, Terry 1932 2052 male
#> 7 500000039 Winters, Terry 1949 2069 male
#> 8 500061411 West, Jerry 1933 2033 male
#> 9 500069312 Towry Whyte Painter -650 -480 unavailable
#> 10 500242784 Stewart, Jerry W. 1850 2080 male
#> Variables not shown: nationality (chr), score (dbl)
ulan_match(c("Vincent van Gogh", "qwerty"), cutoff_score = NULL, max_results = 10, method = "local")
#> Warning in construct_results(results = NA, name = name): No matches found
#> for qwerty
#> $`Vincent van Gogh`
#> Source: local data frame [2 x 7]
#>
#> id pref_name birth_year death_year gender nationality
#> (int) (chr) (int) (int) (chr) (chr)
#> 1 500337743 Gogh, Vincent van 1820 1888 male Dutch
#> 2 500115588 Gogh, Vincent van 1853 1890 male Dutch
#> Variables not shown: score (dbl)
#>
#> $qwerty
#> Source: local data frame [1 x 7]
#>
#> id pref_name birth_year death_year gender nationality score
#> (int) (chr) (int) (int) (chr) (chr) (dbl)
#> 1 NA NA NA NA NA NA NA
ulan_id
and ulan_data
These are utility wrapper functions that return only the top match of ulan_match
.
ulan_id
returns a vector of ID numbers, while ulan_data
returns a single data frame with all the columns from the regular results of ulan_match
, along with a name
column matching the original vector of names supplied by the user.
ulan_id(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "sparql")
#> [1] 500011051 500337743
ulan_data(c("Rembrandt van Rijn", "Vincent van Gogh"), method = "sparql")
#> Source: local data frame [2 x 8]
#>
#> names id pref_name birth_year death_year
#> (chr) (int) (chr) (int) (int)
#> 1 Rembrandt van Rijn 500011051 Rembrandt van Rijn 1606 1669
#> 2 Vincent van Gogh 500337743 Gogh, Vincent van 1820 1888
#> Variables not shown: gender (chr), nationality (chr), score (dbl)