
Modeling Social Data, Lecture 3: Data manipulation in R

Jake Hofman
February 08, 2019


Transcript

  1. Data manipulation in R
    APAM E4990
    Modeling Social Data
    Jake Hofman
    Columbia University
    February 8, 2019
    Jake Hofman (Columbia University) Data manipulation in R February 8, 2019 1 / 26


  2. The good, the bad, & the ugly
    • R isn’t the best programming language out there
    • But it happens to be great for data analysis
    • The result is a steep learning curve with a high payoff

  3. For instance . . .
    • You’ll see a mix of camelCase, this.that, and snake_case
    conventions
    • Dots (.) (mostly) don’t mean anything special
    • Likewise, $ gets used in funny ways
    • R is loosely typed, which can lead to unexpected coercions
    and silent failures
    • It also tries to be clever about variable scope, which can
    backfire if you’re not careful
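A few of these quirks can be reproduced at the console. The sketch below uses only base R; all variable names and values are made up for illustration and are not from the lecture.

```r
# Dots in names are ordinary characters, not method calls
my.value <- 3

# $ extracts a named element from a list (or a column from a data frame)
person <- list(name = "Ada", age = 36)
person$age

# Mixing types in a vector silently coerces everything to a common type
mixed <- c(1, "2", TRUE)
class(mixed)            # "character"

# Comparing a number to a string coerces the number to a string first,
# so this is a text comparison of "10" and "9"
odd_comparison <- 10 < "9"

# A failed conversion yields NA; without suppressWarnings() R would
# only emit a warning and carry on
bad <- suppressWarnings(as.numeric("abc"))

# Scope: x is not defined inside f, so R looks it up in the enclosing
# (here, global) environment instead of raising an error
x <- 1
f <- function() x + 1
f()
```

None of these lines raises an error, which is exactly why such slips can go unnoticed in a longer analysis.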

  5. But it will help you . . .
    • Do extremely fast exploratory data analysis
    • Easily generate high-quality data visualizations
    • Fit and evaluate pretty much any statistical model you can
    think of
    This will change the way you do data analysis, because you’ll ask
    questions you wouldn’t have bothered to otherwise

  7. Basic types
    • integer, double: for numbers
    • character: for strings
    • factor: for categorical variables (∼ enum)
    Factors are handy, but take some getting used to
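As a rough illustration of these types, and of why factors take some getting used to, here is a minimal base R sketch. The example values (sizes, years) are invented for illustration.

```r
typeof(1L)         # "integer"
typeof(1)          # "double": plain numeric literals are doubles
typeof("a")        # "character"

# A factor stores integer codes plus a fixed set of labels (levels)
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))
levels(sizes)      # "small" "large"
as.integer(sizes)  # 1 2 1

# Gotcha: as.numeric() on a factor returns the codes, not the labels
years <- factor(c("2019", "2017"))
codes <- as.numeric(years)                 # 2 1
vals  <- as.numeric(as.character(years))   # 2019 2017
```

Going through as.character() first is the usual way to recover the original numbers from a factor.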

  9. Containers
    • vector: for multiple values of the same type (∼ array)
    • list: for multiple values of different types (∼ dictionary)
    • data.frame: for rectangular tables with columns of mixed types
    (∼ matrix, but each column can have its own type)
    We’ll mostly work with data frames, which themselves are lists of
    vectors
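The claim that a data frame is a list of vectors can be checked directly. This is a small base R sketch with invented station names and counts.

```r
v <- c(1.5, 2.5, 3.5)                    # vector: all elements share one type
l <- list(id = 1L, tags = c("a", "b"))   # list: elements can differ in type and length

df <- data.frame(
  station = c("A", "B", "C"),
  trips   = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

is.list(df)         # TRUE: a data frame is a list ...
length(df)          # 2: ... with one element per column
df$trips            # each column is an ordinary vector
df[["trips"]][2]    # 20: so list and vector indexing both apply
```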

  10. The tidyverse
    The tidyverse is a collection of packages that work together to
    make data analysis easier:
    • dplyr for split / apply / combine style data manipulation
    • ggplot2 for making plots
    • tidyr for reshaping and “tidying” data
    • readr for reading and writing files
    • ...

  11. Tidy data
    The core philosophy is that your data should be in a “tidy” table
    with:
    • One variable per column
    • One observation per row
    • One measured value per cell
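To make the three rules concrete, here is a hypothetical example in base R: the same invented ride counts stored in an untidy and a tidy layout. The column names are assumptions chosen for illustration.

```r
# Untidy: the variable "year" is hidden in the column names
wide <- data.frame(
  station    = c("A", "B"),
  rides_2018 = c(100, 150),
  rides_2019 = c(120, 130)
)

# Tidy: one column per variable (station, year, rides),
# one row per observation (a station in a given year)
tidy <- data.frame(
  station = c("A", "B", "A", "B"),
  year    = c(2018, 2018, 2019, 2019),
  rides   = c(100, 150, 120, 130)
)

# Questions about a variable become simple column operations
rides_by_year <- tapply(tidy$rides, tidy$year, sum)
```

With the tidy layout, adding a third year means adding rows rather than a new column, so code written against the table keeps working.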

  12. Tidy data
    • Most of the work goes into getting your data into shape
    • After which descriptive statistics, modeling, and visualization
    are easy

  14. dplyr: a grammar of data manipulation
    dplyr implements the split / apply / combine framework
    discussed in the last lecture
    • Its “grammar” has five main verbs used in the “apply” phase:
      – filter: restrict rows based on a condition (N → N′ rows, N′ ≤ N)
      – arrange: reorder rows by a variable (N → N rows)
      – select: pick out specific columns (K → K′ columns)
      – mutate: create new or change existing columns (K → K′ columns)
      – summarize: collapse a column into one value (N → 1 row)
    • The group_by function creates indices to take care of the
    split and combine phases
    The cost is that you have to think “functionally”, in terms of
    “vectorized” operations
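The effect of each verb on the table's shape can be checked on a toy table. In this sketch the column names follow the slides that come later in the deck, but the rows, the "W 4 St" station, and the 500-second threshold are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the trips table (values are made up)
trips <- data.frame(
  tripduration       = c(300, 600, 1200, 900),
  start_station_name = c("Broadway & E 14 St", "W 4 St",
                         "Broadway & E 14 St", "W 4 St"),
  gender             = c(1, 2, 2, 1),
  stringsAsFactors   = FALSE
)

dim(trips)                                                 # 4 rows, 3 columns
dim(filter(trips, tripduration > 500))                     # 3 rows, 3 columns
dim(arrange(trips, tripduration))                          # 4 rows, 3 columns
dim(select(trips, tripduration, gender))                   # 4 rows, 2 columns
dim(mutate(trips, time_in_min = tripduration / 60))        # 4 rows, 4 columns
dim(summarize(trips, mean_duration = mean(tripduration)))  # 1 row,  1 column
```

Note that `tripduration / 60` and `tripduration > 500` act on the whole column at once; there is no explicit loop over rows, which is what "vectorized" refers to.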

  15. filter
    filter(trips, start_station_name == "Broadway & E 14 St")

  16. arrange
    arrange(trips, starttime)

  17. select
    select(trips, starttime, stoptime,
           start_station_name, end_station_name)

  18. mutate
    mutate(trips, time_in_min = tripduration / 60)

  19. summarize
    summarize(trips, mean_duration = mean(tripduration) / 60,
              sd_duration = sd(tripduration) / 60)

  20. group_by
    trips_by_gender <- group_by(trips, gender)

  22. group_by + summarize
    summarize(trips_by_gender,
              count = n(),
              mean_duration = mean(tripduration) / 60,
              sd_duration = sd(tripduration) / 60)

  23. %>%: the pipe operator
    trips %>%
      group_by(gender) %>%
      summarize(count = n(),
                mean_duration = mean(tripduration) / 60,
                sd_duration = sd(tripduration) / 60)
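The pipe passes its left-hand side as the first argument of the call on its right, so the piped form is just a more readable way of writing nested calls. The sketch below checks this on a toy trips table whose values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the trips table (values are made up)
trips <- data.frame(
  tripduration = c(300, 600, 1200, 900),
  gender       = c(1, 2, 2, 1)
)

# Nested calls: read from the inside out
nested <- summarize(group_by(trips, gender),
                    count = n(),
                    mean_duration = mean(tripduration) / 60)

# Piped calls: read from top to bottom
piped <- trips %>%
  group_by(gender) %>%
  summarize(count = n(),
            mean_duration = mean(tripduration) / 60)

# Both forms give the same table: one row per gender
same <- identical(as.data.frame(nested), as.data.frame(piped))
```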

  24. gather: wide to long
    trips %>%
      gather("variable", "value", starttime, stoptime)

  25. gather: wide to long
    trips %>%
      gather("variable", "value", starttime, stoptime) %>%
      arrange(value)

  26. gather: wide to long
    trips %>%
      gather("variable", "value", starttime, stoptime) %>%
      arrange(value) %>%
      mutate(delta = ifelse(variable == "starttime", 1, -1))
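The transcript does not say what delta is used for. One plausible use, shown here as an assumption, is a running sum that counts how many trips are in progress at each event. The toy table uses plain numbers as times and a hypothetical trip_id column; the active column is also an addition for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the trips table; times are plain numbers, not timestamps
trips <- data.frame(
  trip_id   = 1:3,
  starttime = c(0, 5, 10),
  stoptime  = c(20, 15, 30)
)

trips_long <- trips %>%
  # wide to long: one row per (trip, event) instead of one row per trip
  gather("variable", "value", starttime, stoptime) %>%
  # order all start and stop events in time
  arrange(value) %>%
  # +1 when a trip starts, -1 when it ends; the running sum is the
  # number of trips in progress (assumed use, not stated on the slide)
  mutate(delta  = ifelse(variable == "starttime", 1, -1),
         active = cumsum(delta))
```

In tidyr 1.0.0 and later, pivot_longer() is the recommended replacement for gather(), which is still available but no longer under active development.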

  27. spread: long to wide
    trips_long %>%
      spread(variable, value)

  28. spread: long to wide
    trips_long %>%
      head() %>%
      spread(variable, value)
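spread undoes gather, provided the remaining columns identify each original row. The round trip below is a sketch on a toy table; the trip_id column and the numeric times are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the trips table; trip_id keeps each trip identifiable
trips <- data.frame(
  trip_id   = 1:3,
  starttime = c(0, 5, 10),
  stoptime  = c(20, 15, 30)
)

# wide to long: 3 trips become 6 (trip, event) rows
trips_long <- trips %>%
  gather("variable", "value", starttime, stoptime)

# long to wide: the values of `variable` become column names again
trips_wide <- trips_long %>%
  spread(variable, value)
```

Without a column such as trip_id, spread cannot tell which start belongs to which stop and stops with an error about duplicate identifiers. In tidyr 1.0.0 and later, pivot_wider() is the recommended replacement for spread().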

  29. r4ds.had.co.nz

  30. style.tidyverse.org