UBC STAT545 2015 cm103 Data cleaning via Gapminder
Data cleaning illustrated through preparation of Gapminder data to become an R data package.
Lecture slides from UBC STAT545 2015 are usually intended as complement to, e.g., a hands-on activity.
Laboratories University of British Columbia [email protected] https://github.com/jennybc http://www.stat.ubc.ca/~jenny/ @JennyBryan ← personal, professional Twitter https://github.com/STAT545-UBC http://stat545-ubc.github.io @STAT545 ← Twitter as lead instructor of this course
if, I need to re-run this.... Yes, I am saying clean your data with R. Yes, that will be uncomfortable. No, in the long run, it will NOT be faster to just use Excel. Stay strong young padawan!
love! An R data package available on CRAN and GitHub: https://cran.r-project.org/web/packages/gapminder/index.html https://github.com/jennybc/gapminder To use the Gapminder data: install.packages(gapminder) ## OR devtools::install_github("jennybc/gapminder") library(gapminder) If you look at package source, e.g. on GitHub, the data-raw directory holds the raw data and all the code and intermediates necessary to bring you the 1704 rows and 6 variables we work with.
Copy the the HTTPS clone URL to clipboard ... Create new RStudio Project > Version Control > Git > and paste the URL This will get you the package source. Note this is not the same as installing the package.
own write permission. See the original Excel workbooks I downloaded from Gapminder.org in 2008 - 2009: https://github.com/jennybc/gapminder/tree/master/data-raw/xls
delimited data I extracted “by hand” from Excel in 2008 - 2009. https://github.com/jennybc/gapminder/tree/master/data-raw/xls-manual-extract See also my scripts for direct extraction via R. Note: they were written before readxl existed and thus use a different package. https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r
editing, in certain situations. if the desired file is greyed out when you try to open ... use the “enable” drop-down menu to expand what Excel considers open-able you can even go nuts and enable “All Files” in particular, Excel is unnecessarily squeamish about the *.tsv extension commonly used with tab-delimited files File > Open...
export with read.table() or readr substitutes. Or ... Use, e.g., readxl::read_excel(), to access the Excel file directly from R. In either case, don’t settle for a lousy import. USE THE ARGUMENTS! https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) These arguments are available for a reason. You will eventually need all of them. Ask me how I know.
rename variables. Drop rows. Convert lead into gold: - weird character dates --> proper dates - numbers with “helpful” commas (1,000) --> numbers (1000) - etc. Reshape data. Turn character into factor or factor into character. Replace obviously bad data with good data or proper NA.
the wisdom imparted in STAT 545 re: writing data to file http://stat545-ubc.github.io/block026_file-out-in.html https://github.com/jennybc/gapminder/blob/master/data-raw/01_pop.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/02_lifeExp.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/03_gdpPercap.tsv
including "" gap_dat$continent %>% summary # Africa Americas Asia Europe FSU Oceania # 301 613 343 557 1302 122 74 ## 301 rows have no continent data :(
imparted in STAT 545 re: writing data to file so, probably, as text AND in an R-specific format https://github.com/jennybc/gapminder/blob/master/data-raw/07_filter-every-five-years.r https://github.com/jennybc/gapminder/blob/master/data-raw/07_gap-every-five-years.tsv https://github.com/jennybc/gapminder/tree/master/data
and/or readr walk through one of the scripts and modernize it come up with at least one - question - observation - improvement! for discussion at the end of class