How to triage your data for tidying
Wed, Jun 16

Miro whiteboard on Data Messes

Column validity errors

  • A column has duplicated values when it is supposed to have unique values
  • A column has null/empty values when it is supposed to have a value for every single row

Semantic mismatch errors

  • Values that belong in one column are in another column
  • Values that belong in one row are in another
  • Entering multiple semantically-different pieces of information into the same column. When in doubt, split it out! It’s always easier to have the computer merge data back into one column than to split it out after you’ve entered it.
  • Using an inconsistent strategy for multi-valued cells

Categorical / controlled columns with misspellings or close-matches

  • Values in a column that you want to match up across records (like people’s names, location names) must have identical spelling and formatting. Small typos will be recognized as entirely different values by a computer.

The ANONYMOUS problem

See the missing data notes in Data Types.

Date validity errors

  • Using inconsistent formatting for dates in different rows
  • Using different levels of precision (e.g. year vs. year+month+day) for dates unless you are using EDTF.
  • Having a date that doesn’t exist, such as 2021-06-31, or 1920-02-29.

The Dreaded Notes Field

It’s very useful to have a free-text notes field to capture info for a human reader that isn’t otherwise structured into other columns. But watch out for these bad notes mistakes:

Putting multiple kinds of notes into one single column

  • Mixing in internal editorial notes with other discursive text that you want to be publicly viewable (Those should be in separate columns)
  • Relying on a careful style guide to structure the text in your notes field. As soon as you find yourself using a punctuation style guide in a free text field, it’s likely that you should be using additional columns.

Putting notes into non-notes columns

subject
trade;government
higher education
anatomy;physical therapy
marketing [need to verify this]
environmentalism

Ideally, a column contains just one kind of information - be it a string value with a title, or a categorical column that may have multiple values from a controlled vocabulary list. Notes to yourself or to users of your data are distinct pieces of information that belong in their own column. If you often need to qualify data in a particular column (like this example of a subject column,) you probably need to make an additional column like subject_notes so that you know the text in that field is a note about the subject field for that record, distinguishing it from a note about some other column, or general notes about the record itself that aren’t specific to an existing field.

Using formatting to encode data