Data Cleaning


Revision as of 09:23, 4 September 2014 by Hkrause

Data cleaning, data wrangling and other data "janitor work" are often the key to interesting, reliable analysis. If they are honest, most data scientists will say that these tasks take up about 80% of the time spent on a project. The NYT has published an article that provides some insight into the importance of the process and the various ways of handling it. We're collecting a list of data cleaning resources here to help you get started.

  • UCLA has a good beginner's guide to the concepts involved and some first steps
  • A group of PhD students at Stanford have developed Data Wrangler, a good piece of software that will do a lot of the data wrangling for you
  • Lecture notes from a useR! 2013 presentation on data cleaning in R. Lots of useful links to R packages and scripts.
  • Some data munging tips and scripts for less traditional statistical data
  • The wonderful Tidy Data paper from Hadley Wickham, a prolific writer of R packages for data manipulation. Lots of examples and details.
  • Jared Knowles has a super useful R Bootcamp which includes cleaning, auditing and sorting data. Includes videos if you like visual learning.
  • Columbia University offers some good examples of data prep - although proceed with caution when removing outliers.
  • McMaster has a good tutorial on cleaning epidemiological data. Useful in how it breaks down the ways to audit different types of data.
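
To make the auditing idea and the caution about outliers concrete, here is a minimal sketch in Python (standard library only). It is not taken from any of the resources above: the function name audit_column, the sample ages and the 1.5 x IQR rule are all assumptions chosen for illustration. The point is that suspicious values are flagged for review rather than silently removed.

```python
import statistics


def audit_column(values):
    """Audit one numeric column: count missing entries and flag possible outliers.

    Hypothetical helper for illustration. Missing entries are represented as None.
    Outliers are flagged with the common 1.5 x IQR rule (an assumed choice),
    and nothing is deleted from the data.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Quartiles of the non-missing values (default "exclusive" method).
    q1, _median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Flag values outside the bounds so a person can check them by hand.
    flagged = [v for v in present if v < low or v > high]
    return {"missing": missing, "flagged": flagged, "bounds": (low, high)}


# Made-up sample data: one missing entry and one likely typo (290).
ages = [34, 29, None, 41, 38, 290, 33, 36]
report = audit_column(ages)
print(report)
```

With this sample, the report counts one missing entry and flags 290 as a value to check. Whether a flagged value is a typo or a genuine extreme is a judgement that depends on the data, which is why the sketch only reports it.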