Slide 1

Slide 1 text

STAT 545A Data cleaning

Slide 2

Slide 2 text

Dr. Jennifer (Jenny) Bryan Department of Statistics and Michael Smith Laboratories University of British Columbia jenny@stat.ubc.ca https://github.com/jennybc http://www.stat.ubc.ca/~jenny/ @JennyBryan ← personal, professional Twitter https://github.com/STAT545-UBC http://stat545-ubc.github.io @STAT545 ← Twitter as lead instructor of this course

Slide 3

Slide 3 text

Data cleaning

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Make it easy to revisit your cleaning. Think: when, not if, I need to re-run this.... Yes, I am saying clean your data with R. Yes, that will be uncomfortable. No, in the long run, it will NOT be faster to just use Excel. Stay strong young padawan!

Slide 8

Slide 8 text

Running example: The Gapminder data you’ve grown to know and love! An R data package available on CRAN and GitHub: https://cran.r-project.org/web/packages/gapminder/index.html https://github.com/jennybc/gapminder To use the Gapminder data: install.packages(gapminder) ## OR devtools::install_github("jennybc/gapminder") library(gapminder) If you look at package source, e.g. on GitHub, the data-raw directory holds the raw data and all the code and intermediates necessary to bring you the 1704 rows and 6 variables we work with.

Slide 9

Slide 9 text

Visit the source of the gapminder package on GitHub: https://github.com/jennybc/gapminder Copy the the HTTPS clone URL to clipboard ... Create new RStudio Project > Version Control > Git > and paste the URL This will get you the package source. Note this is not the same as installing the package.

Slide 10

Slide 10 text

Step 0: Save the raw data! I often revoke my own write permission. See the original Excel workbooks I downloaded from Gapminder.org in 2008 - 2009: https://github.com/jennybc/gapminder/tree/master/data-raw/xls

Slide 11

Slide 11 text

Step 1: Get the data out of Excel. Save as a delimited file. Or ... Use, e.g., readxl::read_excel(), to access the Excel file directly from R. Or ... Even better, do both!

Slide 12

Slide 12 text

Step 1: Get the data out of Excel. See the delimited data I extracted “by hand” from Excel in 2008 - 2009. https://github.com/jennybc/gapminder/tree/master/data-raw/xls-manual-extract See also my scripts for direct extraction via R. Note: they were written before readxl existed and thus use a different package. https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r

Slide 13

Slide 13 text

Excel is great for browsing delimited files. Even OK for editing, in certain situations. if the desired file is greyed out when you try to open ... use the “enable” drop-down menu to expand what Excel considers open-able you can even go nuts and enable “All Files” in particular, Excel is unnecessarily squeamish about the *.tsv extension commonly used with tab-delimited files File > Open...

Slide 14

Slide 14 text

Excel can write delimited files. File > Save As... Format Excel will try to freak you out with a warning. For plain ol’ data, ignore and proceed. The sky is falling!

Slide 15

Slide 15 text

Step 2: Get the data into R. Read the delimited export with read.table() or readr substitutes. Or ... Use, e.g., readxl::read_excel(), to access the Excel file directly from R. In either case, don’t settle for a lousy import. USE THE ARGUMENTS! https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r

Slide 16

Slide 16 text

read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) These arguments are available for a reason. You will eventually need all of them. Ask me how I know.

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Get minimally correct, useful data: Drop variables, create new variables, rename variables. Drop rows. Convert lead into gold: - weird character dates --> proper dates - numbers with “helpful” commas (1,000) --> numbers (1000) - etc. Reshape data. Turn character into factor or factor into character. Replace obviously bad data with good data or proper NA.

Slide 19

Slide 19 text

save all this intermediate data! preferably as text use all the wisdom imparted in STAT 545 re: writing data to file http://stat545-ubc.github.io/block026_file-out-in.html https://github.com/jennybc/gapminder/blob/master/data-raw/01_pop.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/02_lifeExp.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/03_gdpPercap.tsv

Slide 20

Slide 20 text

Integrate data that’s spread out in multiple files/ objects: Use proper join / merge functions. Scrutinize the result. Is everything OK? https://github.com/jennybc/gapminder/blob/master/data-raw/04_merge-pop-lifeExp-gdpPercap.r https://github.com/jennybc/gapminder/blob/master/data-raw/04_gap-merged.tsv https://github.com/jennybc/gapminder/blob/master/data-raw/05_smell-test-gap-merged.r

Slide 21

Slide 21 text

this is what happens with manual data entry for “country” in three separate Excel workbooks ... chaos!

Slide 22

Slide 22 text

## continent gap_dat$continent %>% levels # 7 levels for continent, including "" gap_dat$continent %>% summary # Africa Americas Asia Europe FSU Oceania # 301 613 343 557 1302 122 74 ## 301 rows have no continent data :(

Slide 23

Slide 23 text

Fix data that can be rescued. Discard the rest. https://github.com/jennybc/gapminder/blob/master/data-raw/06_fill-and-fix-continent.r https://github.com/jennybc/gapminder/blob/master/data-raw/06_gap-merged-with-continent.tsv

Slide 24

Slide 24 text

save your final clean, ready-to-analyze data! use all the wisdom imparted in STAT 545 re: writing data to file so, probably, as text AND in an R-specific format https://github.com/jennybc/gapminder/blob/master/data-raw/07_filter-every-five-years.r https://github.com/jennybc/gapminder/blob/master/data-raw/07_gap-every-five-years.tsv https://github.com/jennybc/gapminder/tree/master/data

Slide 25

Slide 25 text

2014 in-class activity pick an entry point into the Gapminder cleaning and walk through a script come up with at least one - question - observation - improvement! for discussion at the end of class

Slide 26

Slide 26 text

another possible in-class activity re-visit Gapminder cleaning - with readxl and/or readr walk through one of the scripts and modernize it come up with at least one - question - observation - improvement! for discussion at the end of class