September 23, 2015

Making a tidy table

What is tidy data?

  1. One variable per column
  2. One observation per row
  3. Consistent data types
  4. If you can't do 1 and 2, that means you need an extra table

Why?

  1. Make it easy to analyze at scale
  2. Make it easy to share
  3. Make it easy to recombine with other data

Messy (but well-intentioned!) data

acq_no artist_1 artist_2 date medium tags
1999.32 Studio of Rembrandt possibly Govaert Flinck after 1636 oil on canvas religious, portrait
1908.54 Jan Vermeer NA c. 1650 oil on panel genre, woman
1955.32 Vermeer, Jan NA c. 1655 oil on canvas woman, window, painting
1955.33 Hals, Frans NA 16220 oil on canvas, relined merry company
  • On the plus side, there is a unique ID column!

One variable per column

acq_no artist_1 artist_2 date medium tags
1999.32 Studio of Rembrandt possibly Govaert Flinck after 1636 oil on canvas religious, portrait
1908.54 Jan Vermeer NA c. 1650 oil on panel genre, woman
1955.32 Vermeer, Jan NA c. 1655 oil on canvas woman, window, painting
1955.33 Hals, Frans NA 16220 oil on canvas, relined merry company
  • artist_1: first and last names combined; conditionals
  • date: conditionals
  • medium: medium and support combined; conditionals

One observation per row

acq_no artist_1 artist_2 date medium tags
1999.32 Studio of Rembrandt possibly Govaert Flinck after 1636 oil on canvas religious, portrait
1908.54 Jan Vermeer NA c. 1650 oil on panel genre, woman
1955.32 Vermeer, Jan NA c. 1655 oil on canvas woman, window, painting
1955.33 Hals, Frans NA 16220 oil on canvas, relined merry company
  • tags and artists need to be their own tables

Consistent types and values

acq_no artist_1 artist_2 date medium tags
1999.32 Studio of Rembrandt possibly Govaert Flinck after 1636 oil on canvas religious, portrait
1908.54 Jan Vermeer NA c. 1650 oil on panel genre, woman
1955.32 Vermeer, Jan NA c. 1655 oil on canvas woman, window, painting
1955.33 Hals, Frans NA 16220 oil on canvas, relined merry company
  • date has typos
  • date could potentially be numeric

Tidied data

Objects

Three conceptual tables: objects, artist links, and tags.

acq_no date date_qual medium support cons_note
1999.32 1636 after oil canvas NA
1908.54 1650 circa oil panel NA
1955.32 1655 circa oil canvas NA
1955.33 1620 NA oil canvas relined

Object-Artist

acq_no name qualification
1999.32 Rembrandt studio of
1999.32 Govaert Flinck possibly
1908.54 Jan Vermeer NA
1955.32 Jan Vermeer NA
1955.33 Frans Hals NA

Object-Tags

acq_no tag
1999.32 religious
1999.32 portrait
1908.54 genre
1908.54 woman
1955.32 woman
1955.32 window
1955.32 painting
1955.33 merry_company

By separating out tables, we can be more flexible with the types of questions we can ask.

# What paintings have the tag "woman"?
clean_data_obj %>% 
  inner_join(clean_data_tags, by = "acq_no") %>% 
  filter(tag == "woman") %>% kable()
acq_no date date_qual medium support cons_note tag
1908.54 1650 circa oil panel NA woman
1955.32 1655 circa oil canvas NA woman

# When was each painter working?
clean_data_obj %>% 
  inner_join(clean_data_artist, by = "acq_no") %>%
  filter(is.na(qualification) | qualification != "studio of") %>% 
  ggplot(aes(x = name, y = date, color = date_qual)) + 
  geom_point(size = 5)

Tips for Historical Datasets

  • Use a structured vocabulary when there are a limited number of possible values for a column. This allows easier counting and better consistency across your data.
  • Be consistent with date formats.
    • If you have day-level info for some entries, and only year level for others, think about putting in 1701-01-01, or have separate columns for year, month, day.
    • Do some records have a start and end date? Have a start_date and end_date
  • A loose "notes" column is fine, but if you find yourself frequently making a similar type of note (e.g. citation) then you probably need to make a new column just for that information.

Tips for Historical Datasets II

  • "Uncertain" values
    • [?] is not informative. What about: illegible, unsubstantiated, approximate?
    • Make an uncertainty vocabulary if you have to, so the terms are controlled, countable, and documented
    • Use separate columns liberally, e.g. date, date_uncertainty
  • You can't document everything! If some tricky field is just not relevant enough to your research, then don't kill yourself trying to capture it with perfect specificity.
  • It is easy to recombine columns later. It is very hard to split them out.

Documenting tidy data

Do it for future-you & for others

  • You will forget what you did in a few months. Or even a few days. Docs will remind you.
  • Docs make writing reports/articles easier
  • Docs make your data reusable:
    • others won't have to guess at what a certain column means
    • or what decisions you made when recording it
    • or how to cite it
    • or if/how they may reuse it

Show your work

  • Describe what you made:
    • Keep a plain text doc in the same directory as your tables
    • Have a heading for each table
    • List every column name and describe what it means
      • Incl. list of possible values, relation to other tables as appropriate
  • Document the process
    • Did you adapt this from another dataset? (incl. original data, or link)
    • Describe the transformations you made, including what software you used

Sharing tidy data

Flat Files, Not Live Servers

Omeka and the like are great for interactively communicating stories and selections from your research. They are not a storage and dissemination solution in the long run. Costly, fragile, limiting.

"Flat files" are decoupled from running software, and can be opened by a regular text editor. Someone without your original software has a better chance of recovering the information.

Plaintext

  • Use plain text file types for tables and docs (.txt, .csv, not .xslx)
    • Free
    • Somewhat more future-proof
    • Track-able
  • Creating in Excel/Google Sheets is fine, you can export it
    • When saving in Excel, use UTF-8 so that accents & special characters are preserved
    • Don't rely on meaningful formatting (colored cells, bold, italics, borders), because that won't be preserved
    • Save multiple versions

Archive it

  • Bundle data and documentation in the same directory and zip them.
  • Distribute
    • Institutional repository (upload it with your dissertation)
    • Journal websites
    • Zenodo
    • Git (works great with all-text files - more and more libraries and journals will be moving towards this method for tracking file versions)

Resources