Upgrade to Pro — share decks privately, control downloads, hide ads and more …

UBC STAT545 2015 cm103 Data cleaning via Gapminder

UBC STAT545 2015 cm103 Data cleaning via Gapminder

Data cleaning illustrated through preparation of Gapminder data to become an R data package.
Lecture slides from UBC STAT545 2015 are usually intended as complement to, e.g., a hands-on activity.

Jennifer (Jenny) Bryan

November 03, 2015
Tweet

More Decks by Jennifer (Jenny) Bryan

Other Decks in Programming

Transcript

  1. STAT 545A
    Data cleaning

    View Slide

  2. Dr. Jennifer (Jenny) Bryan
    Department of Statistics and Michael Smith Laboratories
    University of British Columbia
    [email protected]
    https://github.com/jennybc
    http://www.stat.ubc.ca/~jenny/
    @JennyBryan ← personal, professional Twitter
    https://github.com/STAT545-UBC
    http://stat545-ubc.github.io
    @STAT545 ← Twitter as lead instructor of this course

    View Slide

  3. Data cleaning

    View Slide

  4. View Slide

  5. View Slide

  6. View Slide

  7. Make it easy to revisit your cleaning.
    Think: when, not if, I need to re-run this....
    Yes, I am saying clean your data with R.
    Yes, that will be uncomfortable.
    No, in the long run, it will NOT be faster to
    just use Excel.
    Stay strong young padawan!

    View Slide

  8. Running example:
    The Gapminder data you’ve grown to know and love!
    An R data package available on CRAN and GitHub:
    https://cran.r-project.org/web/packages/gapminder/index.html
    https://github.com/jennybc/gapminder
    To use the Gapminder data:
    install.packages(gapminder)
    ## OR
    devtools::install_github("jennybc/gapminder")
    library(gapminder)
    If you look at package source, e.g. on GitHub, the data-raw directory
    holds the raw data and all the code and intermediates necessary to
    bring you the 1704 rows and 6 variables we work with.

    View Slide

  9. Visit the source of the gapminder package on GitHub:
    https://github.com/jennybc/gapminder
    Copy the the HTTPS clone URL to clipboard ...
    Create new RStudio Project >
    Version Control >
    Git >
    and paste the URL
    This will get you the package source. Note this is not
    the same as installing the package.

    View Slide

  10. Step 0:
    Save the raw data!
    I often revoke my own write permission.
    See the original Excel workbooks I downloaded
    from Gapminder.org in 2008 - 2009:
    https://github.com/jennybc/gapminder/tree/master/data-raw/xls

    View Slide

  11. Step 1:
    Get the data out of Excel.
    Save as a delimited file. Or ...
    Use, e.g., readxl::read_excel(), to access the Excel
    file directly from R. Or ...
    Even better, do both!

    View Slide

  12. Step 1:
    Get the data out of Excel.
    See the delimited data I extracted “by hand” from Excel in
    2008 - 2009.
    https://github.com/jennybc/gapminder/tree/master/data-raw/xls-manual-extract
    See also my scripts for direct extraction via R. Note: they
    were written before readxl existed and thus use a
    different package.
    https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r

    View Slide

  13. Excel is great for browsing delimited files.
    Even OK for editing, in certain situations.
    if the desired file is greyed out when you try to open ...
    use the “enable” drop-down menu to expand what Excel considers open-able
    you can even go nuts and enable “All Files”
    in particular, Excel is unnecessarily squeamish about the *.tsv extension commonly used
    with tab-delimited files
    File > Open...

    View Slide

  14. Excel can write delimited files.
    File > Save As... Format
    Excel will try to freak you out with a warning.
    For plain ol’ data, ignore and proceed.
    The sky is falling!

    View Slide

  15. Step 2:
    Get the data into R.
    Read the delimited export with read.table() or readr
    substitutes. Or ...
    Use, e.g., readxl::read_excel(), to access the Excel
    file directly from R.
    In either case, don’t settle for a lousy import.
    USE THE ARGUMENTS!
    https://github.com/jennybc/gapminder/blob/master/data-raw/01_extract-from-excel-pop.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/02_extract-from-excel-lifeExp.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/03_extract-from-excel-gdpPercap.r

    View Slide

  16. read.table(file, header = FALSE, sep = "", quote = "\"'",
    dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
    row.names, col.names, as.is = !stringsAsFactors,
    na.strings = "NA", colClasses = NA, nrows = -1,
    skip = 0, check.names = TRUE, fill = !blank.lines.skip,
    strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#",
    allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = default.stringsAsFactors(),
    fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
    These arguments are available for a reason.
    You will eventually need all of them.
    Ask me how I know.

    View Slide

  17. View Slide

  18. Get minimally correct, useful data:
    Drop variables, create new variables, rename variables.
    Drop rows.
    Convert lead into gold:
    - weird character dates --> proper dates
    - numbers with “helpful” commas (1,000) --> numbers (1000)
    - etc.
    Reshape data.
    Turn character into factor or factor into character.
    Replace obviously bad data with good data or proper NA.

    View Slide

  19. save all this intermediate data!
    preferably as text
    use all the wisdom imparted in STAT 545 re: writing data to file
    http://stat545-ubc.github.io/block026_file-out-in.html
    https://github.com/jennybc/gapminder/blob/master/data-raw/01_pop.tsv
    https://github.com/jennybc/gapminder/blob/master/data-raw/02_lifeExp.tsv
    https://github.com/jennybc/gapminder/blob/master/data-raw/03_gdpPercap.tsv

    View Slide

  20. Integrate data that’s spread out in multiple files/
    objects:
    Use proper join / merge functions.
    Scrutinize the result. Is everything OK?
    https://github.com/jennybc/gapminder/blob/master/data-raw/04_merge-pop-lifeExp-gdpPercap.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/04_gap-merged.tsv
    https://github.com/jennybc/gapminder/blob/master/data-raw/05_smell-test-gap-merged.r

    View Slide

  21. this is what happens with manual data entry for
    “country” in three separate Excel workbooks ... chaos!

    View Slide

  22. ## continent
    gap_dat$continent %>% levels # 7 levels for continent,
    including ""
    gap_dat$continent %>% summary
    # Africa Americas Asia Europe FSU Oceania
    # 301 613 343 557 1302 122 74
    ## 301 rows have no continent data :(

    View Slide

  23. Fix data that can be rescued.
    Discard the rest.
    https://github.com/jennybc/gapminder/blob/master/data-raw/06_fill-and-fix-continent.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/06_gap-merged-with-continent.tsv

    View Slide

  24. save your final clean, ready-to-analyze data!
    use all the wisdom imparted in STAT 545 re:
    writing data to file
    so, probably, as text AND in an R-specific format
    https://github.com/jennybc/gapminder/blob/master/data-raw/07_filter-every-five-years.r
    https://github.com/jennybc/gapminder/blob/master/data-raw/07_gap-every-five-years.tsv
    https://github.com/jennybc/gapminder/tree/master/data

    View Slide

  25. 2014 in-class activity
    pick an entry point into the Gapminder
    cleaning and walk through a script
    come up with at least one
    - question
    - observation
    - improvement!
    for discussion at the end of class

    View Slide

  26. another possible in-class activity
    re-visit Gapminder cleaning
    - with readxl and/or readr
    walk through one of the scripts and
    modernize it
    come up with at least one
    - question
    - observation
    - improvement!
    for discussion at the end of class

    View Slide