Matthew Lincoln, PhD Cultural Heritage Data & Info Architecture

Collaboratory Data Mining and Visualization Workshop

Setup

In this workshop we will be using plot.ly, an online platform for sharing datasets and collaborative visualization.

You will need to register an account with plot.ly for this workshop. Go to the plot.ly homepage and click on “Sign up” in the upper right corner of the page. If you already have a Gmail, Facebook, Twitter, or GitHub account, you will have the option to register using that account instead, so that you don’t have to create and remember a new password just for plot.ly.

Concepts

The data analysis pipeline

Today we are only doing one small step in data analysis.

  1. Acquire
  2. Store
  3. Clean
  4. Mutate
  5. Visualize <- we are here
  6. Narrate

Depending on the source of the data you want to work with, these steps may be easy or quite difficult. Each of these steps already involves a great deal of interpretation, particularly the cleaning step (in which we coerce “messy” values from the original dataset into regular, computable ones) and the mutating step (in which we generate entirely new variables by programmatically manipulating existing and/or cleaned variables). We’re skipping ahead to what may be the most “fun” step, that of visualization, but bear in mind that this is only part of the process. This is not an entirely linear pipeline, either. You will always iterate on your analyses, going back to change your data cleaning and mutation methods based on visualization results, and trying entirely new analyses as you determine the most effective way to compose your narratives.

Variables, observations, values

It helps to be specific about the different elements of a data table.
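
As a minimal sketch of these terms, here is a tiny table written out in Python. The rows and column names are invented for illustration and are not the workshop data. Each column is a variable, each row is an observation, and each cell is a value.

```python
# Hypothetical rows; the titles, genres, and heights are invented for illustration.
table = [
    {"title": "Painting A", "genre": "landscape", "height": 50.0},
    {"title": "Painting B", "genre": "portrait", "height": 80.5},
    {"title": "Painting C", "genre": "landscape", "height": 42.3},
]

# Variables are the columns: one named attribute recorded for every row.
variables = list(table[0].keys())

# Observations are the rows: one per thing being described (here, a painting).
observations = len(table)

# A value is a single cell: one variable recorded for one observation.
value = table[1]["height"]

print(variables)
print(observations)
print(value)
```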

Data Types

Understanding your two main data types will help you determine the most productive visualizations to try.

Different types of plots are good for looking at ordinal or categorical distribution (histogram), categorical vs. ordinal (bar chart), or ordinal vs. ordinal (scatterplot).

Exercise 1: Dutch Collections at the NGA

Data link: https://plot.ly/~mdlincoln/9

Follow the data link above, and sign in to plot.ly if asked to do so. You should see a preview of the data table. Click on the link that says “Fork and edit”.

[Figure: Fork and edit button in plot.ly]

This will copy the data to your account, and open the editing and plotting interface.

Data Provenance

Where did these data come from?

I scraped these data from the National Gallery of Art’s Online Edition of the 17th Century Dutch Paintings.

Some of the variables in the table are original to the NGA’s website, while some of them I generated during the process of data cleaning and mutation. For example, the NGA lists dimensions as a string (a list of characters) rather than as numeric values. Using a script, I extracted the numeric values from these strings to create the height and width columns. I had to perform similar functions for creation and acquisition dates. Also, I created categories such as set (the collector/curator who added the items to the collection) and genre manually or semi-manually.
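
The kind of script described above might look something like the following Python sketch. It is not the script actually used for this dataset. The format of the dimension string, the function names, and the example value are all assumptions made for illustration. The first function cleans (text to numbers), and the second mutates (derives a new categorical variable from the cleaned ones).

```python
import re

# Assumed format: "<label>: <height> x <width> cm (...)". Real NGA strings may differ.
DIMENSION_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*cm")


def parse_dimensions(text):
    """Cleaning: return (height, width) in centimeters, or None if nothing matches."""
    match = DIMENSION_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def orientation(height, width):
    """Mutating: derive a new categorical variable from two cleaned numeric ones."""
    if height > width:
        return "vertical"
    if width > height:
        return "horizontal"
    return "square"


example = "overall: 85.8 x 66.6 cm (33 3/4 x 26 1/4 in.)"  # invented example string
dims = parse_dimensions(example)
print(dims, orientation(*dims))
```

Note that strings the pattern does not recognize come back as None rather than as a guess, so any rows that need manual attention stay visible.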

Plot exercises

Try to devise plots that could shed light on these questions:

  1. How did the chronology of Dutch art represented in the Dutch galleries change over the course of the twentieth century?
  2. How did the balance of genres in the Dutch galleries change over the course of the twentieth century?
  3. What are the most efficient visualizations for representing the predilections of different collectors and curators for genre? For date? What about scale and orientation? (Don’t forget to try the often-ignored boxplot!)

Exercise 2: British Art Sales from the GPI

Data link: https://plot.ly/~mdlincoln/8

Data Provenance

These data comprise a random sample of about 0.4% (1500 out of 374277) of the British sales records maintained by the Getty Provenance Index. Because plot.ly runs in your browser, it cannot handle very large datasets, so we will be working with a small sample of the database. For larger sets of data, you will want to use a more powerful set of programs, such as R and RStudio.
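
For reference, one way to draw a simple random sample of this size is sketched below in Python. The method actually used to produce the workshop sample is not documented here, so treat this as an assumption. The seed is arbitrary, and row indices stand in for the real records.

```python
import random

TOTAL_RECORDS = 374277  # number of British sales records, as stated above
SAMPLE_SIZE = 1500

rng = random.Random(42)  # arbitrary seed, only so the sketch is repeatable

# Sample row indices without replacement; these stand in for the real records.
sample_indices = rng.sample(range(TOTAL_RECORDS), SAMPLE_SIZE)

share = SAMPLE_SIZE / TOTAL_RECORDS * 100
print(f"{len(sample_indices)} rows, {share:.1f}% of the records")
```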

Many of the original variables in this table are not easily computable in their original form. Scroll to the rightmost columns (starting with the Date column) to see the cleaned/mutated variables that are most easily plotted.

Plot exercises

Try to devise plots that could shed light on these questions:

  1. What was the most popular time of the year to sell artworks? What was the most lucrative?
  2. How do these patterns differ for cheap vs. expensive artworks?
  3. Did these patterns change between 1790 and 1840?

We will quickly run into the limits of the plot.ly interface with these questions! In lieu of coming up with effective plots, we should at least try to conceive of how we might operationalize, or make measurable, these questions. More advanced tools may be necessary to visualize the kinds of patterns we want to detect.
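
To make “operationalize” concrete, here is a minimal Python sketch of how the first question might be turned into measurements. The rows and the column names (month, price) are invented for illustration and are not the GPI data. “Most popular” is measured as the number of sales per month, and “most lucrative” is measured two different ways, as total price and as median price per month.

```python
from collections import defaultdict
from statistics import median

# Invented rows with hypothetical column names; not the GPI data.
sales = [
    {"month": 3, "price": 12.0},
    {"month": 3, "price": 150.0},
    {"month": 5, "price": 40.0},
    {"month": 5, "price": 8.5},
    {"month": 5, "price": 300.0},
    {"month": 6, "price": 95.0},
]

# Group the prices by month of sale.
prices_by_month = defaultdict(list)
for sale in sales:
    prices_by_month[sale["month"]].append(sale["price"])

# "Most popular": number of sales in each month.
sales_count = {m: len(p) for m, p in prices_by_month.items()}

# "Most lucrative": total price and median price in each month.
total_price = {m: sum(p) for m, p in prices_by_month.items()}
median_price = {m: median(p) for m, p in prices_by_month.items()}

busiest = max(sales_count, key=sales_count.get)
highest_total = max(total_price, key=total_price.get)
highest_median = max(median_price, key=median_price.get)
print(busiest, highest_total, highest_median)
```

In this made-up example the month with the highest total is not the month with the highest median, which is exactly the kind of interpretive choice that operationalizing a question forces us to make.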

Advanced analysis

Using R, I have generated a web interface that implements some measurements of these questions. (I will provide the login information during the session.)