Be a Hawk not a Turkey: How a Bird's Eye View of your Data Can Streamline Data Analysis

Data cleaning is widely acknowledged as a major bottleneck in data analysis. Common hurdles include data that are not in the expected state and missing data. This talk presents two R packages currently under development that help widen this bottleneck: ggmissing and visdat. ggmissing extends ggplot2 so that plots can include missing values, and visdat provides simple functions for visualising whole data frames, giving a bird's-eye view of the data. Examples then show how these packages can be incorporated into a workflow.

Nicholas Tierney

February 19, 2016

Transcript

  1. Be a Hawk not a Turkey
    How a Bird’s Eye View of your Data Can Streamline Data Analysis
    Nicholas Tierney
    PhD Candidate QUT
    WOMBAT, Melbourne Zoo
    19/02/2016

  2. The Project

  3. [No text on slide]

  4. [No text on slide]

  5. “Can you have a look at the data?”
    What does that mean?

  6. “Looking” at the data

  7. “…Looking?” at the data?
    ggplot(data = data,
           aes(x = IQ,
               y = income)) +
      geom_point()

  8. “…Looking?” at the data?

  9. So…
    What if the data is all weird, and stuff?

  10. Real data is generally real messy
    Dates are not dates
    Gender is not Categorical
    Rows are supposed to be columns
    Missing data

  11. Data Cleaning…janitorial work...munging...
    Data wrangling: dplyr, plyr, data.table
    Testing data: assertr, testdat

  12. Data inspection: `dplyr::glimpse(dat)`
    Observations: 300
    Variables: 15
    $ date       (date) 2015-03-15, 2015-03-...
    $ name       (chr)  "Bobby", "Trinidad", ...
    $ age        (int)  21, 28, 31, 30, 23, 2...
    $ sex        (fctr) Female, Female, Fema...
    $ grade      (int)  NA, 4, 3, NA, NA, NA,...
    $ height     (dbl)  66, 59, 67, 71, 68, 7...
    $ hair       (fctr) Brown, Red, Blonde, ...
    $ eye        (fctr) Gray, Brown, Blue, H...
    $ smokes     (lgl)  FALSE, FALSE, FALSE, ...
    $ income     (chr)  NA, "36157.98", "17307.35"
    $ education  (fctr) Regular High School ...
    $ IQ         (fctr) 97, 115, 112, 94, 106...
    $ employment (int)  NA, 1, 4, NA, 1, NA, ...
    $ race       (fctr) Hispanic, Black, Bla...
    $ religion   (fctr) Muslim, Christian, N...
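
The glimpse output above shows two typical type problems: income is stored as character and IQ as a factor. A minimal base R sketch of one way to repair them follows. The three-row data frame is a hypothetical stand-in for the talk's data, not the real dataset.

```r
# Hypothetical three-row stand-in for the data shown in the glimpse() output.
dat <- data.frame(
  income = c(NA, "36157.98", "17307.35"),
  IQ     = factor(c("97", "115", "112")),
  stringsAsFactors = FALSE
)

# A character column converts directly; NA stays NA.
dat$income <- as.numeric(dat$income)

# A factor must go through character first: as.numeric() on a factor
# returns the internal level codes, not the printed values.
dat$IQ <- as.numeric(as.character(dat$IQ))

str(dat)
```

Converting the factor through `as.character()` is the important step; skipping it silently produces level codes that look like plausible numbers.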

  13. Pre-exploratory Visualisations?
    Visualisation methods for checking data?

  14. visdat
    Visualise whole data frames at once

  15. vis_dat(data)

  16. vis_dat(data, sort_type = FALSE)

  17. vis_dat … clean … vis_dat … clean

  18. vis_dat … clean … vis_dat … clean

  19. vis_miss

  20. vis_miss(cluster = TRUE)
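
The quantity behind a whole-data-frame missingness plot can be sketched in base R. This is only an illustration of the idea using hypothetical data, not how vis_miss is implemented: a logical matrix marks each missing cell, and its means give the share of missing values.

```r
# Hypothetical data with some missing values.
dat <- data.frame(
  age    = c(21, 28, NA, 30),
  grade  = c(NA, 4, 3, NA),
  income = c(NA, 36157.98, 17307.35, NA)
)

# Logical matrix: TRUE where a cell is missing.
miss <- is.na(dat)

# Percentage of missing cells per variable and overall.
pct_by_var <- colMeans(miss) * 100
pct_total  <- mean(miss) * 100

print(pct_by_var)
print(pct_total)
```

Drawing `miss` as a grid of coloured cells, one column per variable, gives the bird's-eye view of where the gaps are.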

  21. Slide missing
    It’s probably not a big deal

  22. ggmissing
    plotting missing data with ggplot

  23. ggmissing
    ggplot(data = dat,
           aes(x = IQ,
               y = income)) +
      geom_point()

    Warning message:
    Removed 142 rows containing missing values (geom_point).
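
The warning above reports rows dropped because one of the plotted variables is missing. A small base R sketch, using hypothetical data, shows how that count could be checked before plotting:

```r
# Hypothetical data: NA in either variable makes a row unplottable.
dat <- data.frame(
  IQ     = c(97, NA, 112, 94, NA),
  income = c(NA, 36157.98, 17307.35, 52000, NA)
)

# Rows where IQ or income (or both) is missing.
dropped <- !complete.cases(dat[, c("IQ", "income")])
n_dropped <- sum(dropped)

print(n_dropped)
```

Here three of the five rows would be removed from a scatterplot of the two variables.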

  24. ggmissing

  25. ggmissing: how to do it
    dat %>%
      mutate(miss_cat = miss_cat(., "IQ", "income")) %>%
      ggplot(data = .,
             aes(x = shadow_shift(IQ),
                 y = shadow_shift(income),
                 colour = miss_cat)) +
      geom_point()
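
The idea behind the code above can be sketched in base R: give each missing value a stand-in position just below the observed range, and record which rows had something missing so they can be coloured differently. The helpers `shift_missing_below` and `label_missing` are hypothetical names written for this illustration; they are not the ggmissing functions and may behave differently from them.

```r
# Hypothetical helper (not the ggmissing implementation): replace each NA
# with a value just below the observed minimum so it can be drawn.
shift_missing_below <- function(x, prop = 0.1) {
  rng <- range(x, na.rm = TRUE)
  below <- rng[1] - prop * diff(rng)
  ifelse(is.na(x), below, x)
}

# Hypothetical helper: flag rows where either variable is missing.
label_missing <- function(x, y) {
  ifelse(is.na(x) | is.na(y), "Missing", "Not missing")
}

dat <- data.frame(
  IQ     = c(90, NA, 110, 100),
  income = c(20000, 30000, NA, 40000)
)

dat$miss_label   <- label_missing(dat$IQ, dat$income)
dat$IQ_shift     <- shift_missing_below(dat$IQ)
dat$income_shift <- shift_missing_below(dat$income)

print(dat)
```

Plotting `IQ_shift` against `income_shift` with colour mapped to `miss_label` keeps every row on the plot, with the formerly missing values lined up along the lower edge of each axis.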

  26. ggmissing: how we’d like to do it
    ggplot(data = data,
           aes(x = IQ,
               y = income)) +
      geom_point() +
      geom_missing()

    ggplot(data = data,
           aes(x = IQ,
               y = income)) +
      geom_point(show_missing = TRUE)

  27. Future Work
    ggmissing and visdat

  28. Future Work: visdat
    Colour cells intelligently
    Guess what kind each variable is
    Read in horrible messy data
    Include interactivity
    Think about ways to sensibly encode summary / value information
    Pipe in expectations

  29. Future Work: ggmissing
    Early days yet
    Create a philosophy / grammar of missingness
    Don’t re-write ggplot
    Include rug plot to show missing data
    Develop clear/intuitive ways of visualising missing values

  30. Got an idea or want to help?
    Check out our github
    github.com/tierneyn/visdat
    github.com/tierneyn/ggmissing

  31. Thank you
    Di Cook
    Miles McBain
    Jenny Bryan
    Kerrie Mengersen
    Fiona Harden
    Maurice Harden

  32. Thank you

  33. [No text on slide]

  34. Questions?
    I caught a glimpse of happiness,
    And saw it was a bird on a branch,
    Fixing to take wing
    - Richard Peck