#MetadataMondays at the Getty
The Getty is posting a series of stories about (meta)data on their Iris blog. I have a little contribution there about dealing with missing data in historical sources:
A core challenge in working with metadata in the digital humanities is that our historical sources are often missing or incomplete. This presents a problem for many quantitative methods, which are built to work with complete cases. We are working on different ways to address this problem. Though it’s impossible to magically recover information that just isn’t there (a dealer records selling a work of art, for example, but doesn’t write down the price), we can try to account for how much uncertainty that “known unknown” should add to the conclusions that we come up with.
So how do we go about visualizing all this uncertainty? In the plot shown above, we wanted to characterize the diversity of genres that art dealer M. Knoedler & Co. was selling at different points in time. Because some entries in our records are missing either a genre, a sale date, or both, we might assign values through randomized (but educated) guessing. Do this fifty times, and we get fifty slightly different timelines, rather than one “definitive” answer.
The fuzziness of this visual is evocative to be sure, but it also helps us understand the range of possible results we might get were we to have complete data. In this case, an overall trend still shines through this added uncertainty: Knoedler began to offer an increasingly heterogeneous array of genres between 1900 and 1925. But smaller, short-lived spikes in some of these timelines (like those around 1940) are not present in every iteration, suggesting that we ought not to build important conclusions on what is likely a spurious blip in our quantitative results.
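For readers who want to see the mechanics, the repeated-guessing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the Getty plot: the toy records are made up, the "educated guess" is simplified to a random draw from the values that were recorded, and genre diversity is measured with Shannon entropy, which the post does not specify. The function names (`impute`, `diversity_timeline`, `simulate_timelines`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def impute(records, rng):
    """Fill each missing genre or year with a random draw from the observed values.

    Assumption: sampling from the recorded values stands in for the
    "randomized (but educated) guessing" described in the post.
    """
    genres = [r["genre"] for r in records if r["genre"] is not None]
    years = [r["year"] for r in records if r["year"] is not None]
    return [
        {
            "genre": r["genre"] if r["genre"] is not None else rng.choice(genres),
            "year": r["year"] if r["year"] is not None else rng.choice(years),
        }
        for r in records
    ]


def shannon_diversity(genres):
    """Shannon entropy of a list of genre labels (an assumed diversity measure)."""
    counts = Counter(genres)
    total = sum(counts.values())
    return sum((n / total) * math.log(total / n) for n in counts.values())


def diversity_timeline(records):
    """Map each year to the genre diversity of the records dated to that year."""
    by_year = defaultdict(list)
    for r in records:
        by_year[r["year"]].append(r["genre"])
    return {year: shannon_diversity(g) for year, g in sorted(by_year.items())}


def simulate_timelines(records, iterations=50, seed=0):
    """Impute the gaps `iterations` times and return one timeline per run."""
    rng = random.Random(seed)
    return [diversity_timeline(impute(records, rng)) for _ in range(iterations)]


# Made-up toy records; None marks a value missing from the source.
records = [
    {"genre": "landscape", "year": 1900},
    {"genre": "portrait", "year": 1900},
    {"genre": None, "year": 1900},
    {"genre": "landscape", "year": None},
    {"genre": "still life", "year": 1925},
    {"genre": "portrait", "year": 1925},
    {"genre": None, "year": None},
]

timelines = simulate_timelines(records)
for year in (1900, 1925):
    values = [t[year] for t in timelines]
    # The spread between min and max is the added uncertainty for that year.
    print(year, round(min(values), 3), round(max(values), 3))
```

Plotting all fifty timelines on top of one another would give the fuzzy band described above. A feature that appears in nearly every run is worth taking seriously, while one that appears in only a few is probably an artifact of the guessing.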