Collaboratory Data Mining and Visualization Workshop
In this workshop we will be using plot.ly, an online platform for sharing datasets and collaborative visualization.
You will need to register an account with plot.ly for this workshop. Go to the plot.ly homepage and click on “Sign up” in the upper right corner of the page. If you already have a Gmail, Facebook, Twitter, or GitHub account, you will have the option to register using that account instead, so that you don’t have to create and remember a new password just for plot.ly.
Today we are only doing one small step in data analysis.
Depending on the source of the data you want to work with, these steps may be easy or quite difficult. Each of these steps, particularly the cleaning step (in which we coerce “messy” values from the original dataset into regular, computable ones) and the mutating step (in which we generate entirely new variables by programmatically manipulating existing and/or cleaned variables), already involves a great deal of interpretation. We’re skipping ahead to what may be the most “fun” step, that of visualization, but bear in mind that this is only part of the process. This is not an entirely linear pipeline, either. You will always iterate your analyses, going back to change your data cleaning and mutation methods based on visualization results, and trying entirely new analyses as you determine the most effective way to compose your narratives.
It helps to be specific about the different elements of a data table.
Understanding your two main data types will help you determine the most productive visualizations to try.
Different types of plots suit different combinations of variables: a histogram shows the distribution of a single ordinal or categorical variable, a bar chart compares a categorical variable against an ordinal one, and a scatterplot compares two ordinal variables.
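The rule of thumb above can be written down as a small lookup. This is only an illustrative sketch in Python: the function name `suggest_plot` and the type labels are hypothetical, and the case of two categorical variables is left open because the text does not cover it.

```python
def suggest_plot(x_type, y_type=None):
    """Suggest a plot type from the kinds of variables involved.

    x_type and y_type are "ordinal" or "categorical". Leave y_type as None
    when looking at the distribution of a single variable.
    """
    kinds = {"ordinal", "categorical"}
    if x_type not in kinds or (y_type is not None and y_type not in kinds):
        raise ValueError("variable types must be 'ordinal' or 'categorical'")
    if y_type is None:
        # One variable on its own: look at its distribution.
        return "histogram"
    if x_type == "ordinal" and y_type == "ordinal":
        return "scatterplot"
    if {x_type, y_type} == {"ordinal", "categorical"}:
        return "bar chart"
    # Categorical vs. categorical is not covered by the rule of thumb.
    return None


print(suggest_plot("ordinal"))                 # one variable
print(suggest_plot("categorical", "ordinal"))  # e.g. genre vs. height
print(suggest_plot("ordinal", "ordinal"))      # e.g. height vs. width
```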
Data link: https://plot.ly/~mdlincoln/9
Follow the data link above, and sign in to plot.ly if asked to do so. You should see a preview of the data table. Click on the link that says “Fork and edit”.
This will copy the data to your account, and open the editing and plotting interface.
Where did these data come from?
I scraped these data from the National Gallery of Art’s Online Edition of the 17th Century Dutch Paintings.
Some of the variables in the table are original to the NGA’s website, while some of them I generated during the process of data cleaning and mutation.
For example, the NGA lists dimensions as a string (a list of characters) rather than as numeric values.
Using a script, I extracted the numeric values from these strings to create the height and width columns.
I had to perform similar functions for creation and acquisition dates.
Also, I created categories such as set (the collector/curator who added the items to the collection) and genre manually or semi-manually.
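To give a sense of what extracting numeric values from a dimension string can look like, here is a minimal sketch in Python. It is not the script used to prepare the workshop data. It assumes the string contains a pair of numbers in the form "63.5 x 52.5 cm" with height listed before width; the example string, the pattern, and the function name `parse_dimensions` are all hypothetical.

```python
import re

# Matches the first "<number> x <number> cm" pair in a dimension string.
# Assumption: height is listed first, then width, both in centimeters.
DIMENSIONS_CM = re.compile(r"(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*cm")


def parse_dimensions(text):
    """Return (height, width) as floats, or (None, None) if nothing matches."""
    match = DIMENSIONS_CM.search(text)
    if match is None:
        return (None, None)
    return (float(match.group(1)), float(match.group(2)))


# Hypothetical example value, not copied from the dataset.
example = "overall: 63.5 x 52.5 cm (25 x 20 11/16 in.)"
height, width = parse_dimensions(example)
print(height, width)
```

Strings that do not fit the assumed pattern come back as `(None, None)`, so that they can be reviewed by hand rather than silently guessed at.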
artist
nationality
title
medium
credit
accession
onview
creation_date
room
genre
height
width
creation_date
acc_date
set
Try to devise plots that could shed light on these questions:
Data link: https://plot.ly/~mdlincoln/8
These data comprise a random sample of about 0.4% (1500 out of 374277) of the British sales records maintained by the Getty Provenance Index. Because plot.ly runs in your browser, it cannot handle very large datasets, so we will be working with a small sample of the database. For larger sets of data, you will want to use a more powerful set of programs, such as R and RStudio.
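As a rough illustration of drawing a sample of this size, the sketch below picks 1500 row numbers at random, without replacement, from 374277. It is written in Python and is not how the workshop sample was actually drawn; the seed is arbitrary and is there only so that the same rows are picked on every run.

```python
import random

TOTAL_RECORDS = 374277  # number of records, as given in the text
SAMPLE_SIZE = 1500      # size of the workshop sample, as given in the text

# Arbitrary seed (an assumption), used only to make the draw repeatable.
rng = random.Random(42)

# Pick row numbers without replacement, then sort them for readability.
sampled_rows = sorted(rng.sample(range(TOTAL_RECORDS), SAMPLE_SIZE))

fraction = SAMPLE_SIZE / TOTAL_RECORDS
print(f"{len(sampled_rows)} rows, {fraction:.1%} of the records")
```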
Many of the original variables in this table are not easily computable in their original form.
Scroll to the rightmost columns (starting with the Date column) to see the cleaned/mutated variables that are most easily plotted:
Date (of sale). The following variables are derived from this one: Year, Month, Week, YDay (day of the year), MDay (day of the month), WDay (day of the week)
Transaction.Type
Transaction.Amt
Nationality
Period (This is a categorical variable derived by grouping observations based on the values of the ordinal variable Year.)
Price.Factor (Because inflation makes it difficult to compare the Transaction.Amt value over time, I computed the categorical variable Price.Factor to help group observations into five classes of value. Within each Period, I determined the quintile distribution of every object sold, i.e. both the top 20% most expensive objects sold between 1790–1800 and the top 20% most expensive objects sold between 1830–1840 are in the 5th quintile.)
Try to devise plots that could shed light on these questions:
We will quickly run into the limits of the plot.ly interface with these questions! In lieu of coming up with effective plots, we should at least try to conceive of how we might operationalize, or make measurable, these questions. More advanced tools may be necessary to visualize the kinds of patterns we want to detect.
Using R, I have generated a web interface that implements some measurements of these questions. (I will provide the login information during the session.)
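As an example of the mutating step, the derived variables in the sales table (the date parts, Period, and Price.Factor) could be computed along the following lines. This is a sketch in Python, not the R code used to prepare the data, and it makes several assumptions: weeks and weekdays follow the ISO convention (Monday is day 1), each Period is a decade labeled by its start and end years, and ties in price are broken arbitrarily. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def date_parts(d):
    """Derive the calendar variables from a sale date."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],      # ISO week number (assumption)
        "YDay": d.timetuple().tm_yday,   # day of the year
        "MDay": d.day,                   # day of the month
        "WDay": d.isoweekday(),          # day of the week, Monday = 1 (assumption)
    }


def period_label(year):
    """Group a year into a decade such as '1790-1800' (assumed grouping)."""
    start = year // 10 * 10
    return f"{start}-{start + 10}"


def price_factors(records):
    """Return a quintile (1 to 5) for each record, computed within its Period.

    records is a list of dicts with "Period" and "Transaction.Amt" keys.
    Equal amounts may land in different quintiles, since ties are broken
    by their order in the input.
    """
    by_period = defaultdict(list)
    for i, rec in enumerate(records):
        by_period[rec["Period"]].append(i)

    factors = [None] * len(records)
    for indices in by_period.values():
        ranked = sorted(indices, key=lambda i: records[i]["Transaction.Amt"])
        n = len(ranked)
        for rank, i in enumerate(ranked):
            factors[i] = rank * 5 // n + 1  # cheapest fifth is 1, priciest is 5
    return factors


# Made-up sales: prices are far higher in the later period, yet the most
# expensive object in each period still falls in the 5th quintile.
sales = [{"Period": "1790-1800", "Transaction.Amt": a} for a in (1, 2, 3, 4, 5)]
sales += [{"Period": "1830-1840", "Transaction.Amt": a} for a in (100, 200, 300, 400, 500)]
print(price_factors(sales))
print(date_parts(date(1801, 3, 14)), period_label(1801))
```

Because the quintiles are computed separately for each Period, a price that counts as top-tier in the 1790s can be compared with a top-tier price in the 1830s without correcting for inflation directly.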
https://matthewlincoln.net by Matthew Lincoln is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License