Upgrade to Pro — share decks privately, control downloads, hide ads and more …

UBC STAT545 2015 cm103 Data cleaning via Gapminder

UBC STAT545 2015 cm103 Data cleaning via Gapminder

Data cleaning illustrated through preparation of Gapminder data to become an R data package.
Lecture slides from UBC STAT545 2015 are usually intended as complement to, e.g., a hands-on activity.

Jennifer (Jenny) Bryan

November 03, 2015
Tweet

More Decks by Jennifer (Jenny) Bryan

Other Decks in Programming

Transcript

  1. Dr. Jennifer (Jenny) Bryan Department of Statistics and Michael Smith

    Laboratories University of British Columbia [email protected] https://github.com/jennybc http://www.stat.ubc.ca/~jenny/ @JennyBryan ← personal, professional Twitter https://github.com/STAT545-UBC http://stat545-ubc.github.io @STAT545 ← Twitter as lead instructor of this course
  2. Make it easy to revisit your cleaning. Think: when, not

    if, I need to re-run this.... Yes, I am saying clean your data with R. Yes, that will be uncomfortable. No, in the long run, it will NOT be faster to just use Excel. Stay strong young padawan!
  3. Running example: The Gapminder data you’ve grown to know and

    love! An R data package available on CRAN and GitHub: https://cran.r-project.org/web/packages/gapminder/index.html https://github.com/jennybc/gapminder To use the Gapminder data: install.packages(gapminder) ## OR devtools::install_github("jennybc/gapminder") library(gapminder) If you look at package source, e.g. on GitHub, the data-raw directory holds the raw data and all the code and intermediates necessary to bring you the 1704 rows and 6 variables we work with.
  4. Visit the source of the gapminder package on GitHub: https://github.com/jennybc/gapminder

    Copy the the HTTPS clone URL to clipboard ... Create new RStudio Project > Version Control > Git > and paste the URL This will get you the package source. Note this is not the same as installing the package.
  5. Step 0: Save the raw data! I often revoke my

    own write permission. See the original Excel workbooks I downloaded from Gapminder.org in 2008 - 2009: https://github.com/jennybc/gapminder/tree/master/data-raw/xls
  6. Step 1: Get the data out of Excel. Save as

    a delimited file. Or ... Use, e.g., readxl::read_excel(), to access the Excel file directly from R. Or ... Even better, do both!
  7. Step 1: Get the data out of Excel. See the

    delimited data I extracted “by hand” from Excel in 2008 - 2009. https://github.com/jennybc/gapminder/tree/master/data-raw/xls-manual-extract See also my scripts for direct extraction via R. Note: they were written before readxl existed and thus use a different package. https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r
  8. Excel is great for browsing delimited files. Even OK for

    editing, in certain situations. if the desired file is greyed out when you try to open ... use the “enable” drop-down menu to expand what Excel considers open-able you can even go nuts and enable “All Files” in particular, Excel is unnecessarily squeamish about the *.tsv extension commonly used with tab-delimited files File > Open...
  9. Excel can write delimited files. File > Save As... Format

    Excel will try to freak you out with a warning. For plain ol’ data, ignore and proceed. The sky is falling!
  10. Step 2: Get the data into R. Read the delimited

    export with read.table() or readr substitutes. Or ... Use, e.g., readxl::read_excel(), to access the Excel file directly from R. In either case, don’t settle for a lousy import. USE THE ARGUMENTS! https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r
  11. read.table(file, header = FALSE, sep = "", quote = "\"'",

    dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) These arguments are available for a reason. You will eventually need all of them. Ask me how I know.
  12. Get minimally correct, useful data: Drop variables, create new variables,

    rename variables. Drop rows. Convert lead into gold: - weird character dates --> proper dates - numbers with “helpful” commas (1,000) --> numbers (1000) - etc. Reshape data. Turn character into factor or factor into character. Replace obviously bad data with good data or proper NA.
  13. save all this intermediate data! preferably as text use all

    the wisdom imparted in STAT 545 re: writing data to file http://stat545-ubc.github.io/block026_file-out-in.html https://github.com/jennybc/gapminder/blob/master/data-raw/01_pop.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/02_lifeExp.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/03_gdpPercap.tsv
  14. Integrate data that’s spread out in multiple files/ objects: Use

    proper join / merge functions. Scrutinize the result. Is everything OK? https://github.com/jennybc/gapminder/blob/master/data-raw/04_merge-pop-lifeExp-gdpPercap.r https://github.com/jennybc/gapminder/blob/master/data-raw/04_gap-merged.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/05_smell-test-gap-merged.r
  15. this is what happens with manual data entry for “country”

    in three separate Excel workbooks ... chaos!
  16. ## continent gap_dat$continent %>% levels # 7 levels for continent,

    including "" gap_dat$continent %>% summary # Africa Americas Asia Europe FSU Oceania # 301 613 343 557 1302 122 74 ## 301 rows have no continent data :(
  17. Fix data that can be rescued. Discard the rest. https://github.com/jennybc/gapminder/blob/master/data-raw/06_fill-and-fix-continent.r

    https://github.com/jennybc/gapminder/blob/master/data-raw/06_gap-merged-with-continent.tsv
  18. save your final clean, ready-to-analyze data! use all the wisdom

    imparted in STAT 545 re: writing data to file so, probably, as text AND in an R-specific format https://github.com/jennybc/gapminder/blob/master/data-raw/07_filter-every-five-years.r https://github.com/jennybc/gapminder/blob/master/data-raw/07_gap-every-five-years.tsv https://github.com/jennybc/gapminder/tree/master/data
  19. 2014 in-class activity pick an entry point into the Gapminder

    cleaning and walk through a script come up with at least one - question - observation - improvement! for discussion at the end of class
  20. another possible in-class activity re-visit Gapminder cleaning - with readxl

    and/or readr walk through one of the scripts and modernize it come up with at least one - question - observation - improvement! for discussion at the end of class