Modeling Social Data, Lecture 3: Data manipulation in R

Slide 1

Slide 1 text

Data manipulation in R APAM E4990 Modeling Social Data Jake Hofman Columbia University February 8, 2019 Jake Hofman (Columbia University) Data manipulation in R February 8, 2019 1 / 26

Slide 2

Slide 2 text

The good, the bad, & the ugly
• R isn’t the best programming language out there
• But it happens to be great for data analysis
• The result is a steep learning curve with a high payoff

Slide 3

Slide 3 text

For instance ...
• You’ll see a mix of camelCase, this.that, and snake_case conventions
• Dots (.) (mostly) don’t mean anything special
• Likewise, $ gets used in funny ways
• R is loosely typed, which can lead to unexpected coercions and silent failures
• It also tries to be clever about variable scope, which can backfire if you’re not careful
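A few of these quirks can be seen in a short base R sketch. The variable names are made up for illustration, and the cases shown are just a sample, not a complete list of surprises.

```r
# A dot is just another character in a name; it does not access a member.
my.value <- 3

# Mixing types in one vector silently coerces everything to character.
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)      # "character"

# Logicals are coerced to numbers in arithmetic.
total <- TRUE + TRUE             # 2

# Comparing a number with a string coerces the number to a string,
# so this is a string comparison of "10" and "9".
string_cmp <- 10 < "9"           # TRUE

# `$` on a list does partial name matching ...
lst <- list(alpha = 1)
partial <- lst$al                # 1, matched against "alpha"

# ... and silently returns NULL for a name that does not exist.
missing_field <- lst$beta        # NULL, no error or warning
```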

Slide 5

Slide 5 text

But it will help you ...
• Do extremely fast exploratory data analysis
• Easily generate high-quality data visualizations
• Fit and evaluate pretty much any statistical model you can think of
This will change the way you do data analysis, because you’ll ask questions you wouldn’t have bothered to ask otherwise

Slide 7

Slide 7 text

Basic types
• integer, double: for numbers
• character: for strings
• factor: for categorical variables (∼ enum)
Factors are handy, but take some getting used to
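As a rough illustration of why factors take some getting used to, here is a minimal base R sketch. The example values are invented.

```r
# Basic types
int_type <- typeof(1L)      # "integer"
dbl_type <- typeof(1)       # "double"
chr_type <- typeof("a")     # "character"

# A factor stores integer codes plus a set of labelled levels.
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
size_levels <- levels(sizes)       # "small" "medium" "large"
size_codes  <- as.integer(sizes)   # 1 3 2 1

# Common pitfall: converting a factor of number-like labels to integer
# returns the internal codes, not the numbers on the labels.
years <- factor(c("2017", "2019", "2018"))
wrong <- as.integer(years)                  # 1 3 2
right <- as.integer(as.character(years))    # 2017 2019 2018
```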

Slide 9

Slide 9 text

Containers
• vector: for multiple values of the same type (∼ array)
• list: for multiple values of different types (∼ dictionary)
• data.frame: for rectangular tables of mixed-type data (∼ matrix)
We’ll mostly work with data frames, which themselves are lists of equal-length vectors
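The three containers can be sketched in base R as follows. All names and values here are hypothetical and only meant to show how the pieces relate.

```r
# vector: values of a single type
v <- c(1.5, 2.5, 4.0)

# list: named values of different types
person <- list(name = "Ada", age = 36, languages = c("R", "Python"))

# data.frame: a table whose columns are equal-length vectors
df <- data.frame(id = 1:3,
                 city = c("NYC", "Boston", "Chicago"),
                 stringsAsFactors = FALSE)

df_is_list <- is.list(df)   # TRUE: a data frame is a list of columns
n_columns  <- length(df)    # 2: the length of that list is the column count
n_rows     <- nrow(df)      # 3
cities     <- df$city       # same column as df[["city"]]
```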

Slide 10

Slide 10 text

The tidyverse
The tidyverse is a collection of packages that work together to make data analysis easier:
• dplyr for split / apply / combine style counting
• ggplot2 for making plots
• tidyr for reshaping and “tidying” data
• readr for reading and writing files
• ...

Slide 11

Slide 11 text

Tidy data
The core philosophy is that your data should be in a “tidy” table with:
• One variable per column
• One observation per row
• One measured value per cell
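To make the three rules concrete, here is a small base R sketch with an invented table of trip counts per station and year. The station names and numbers are made up.

```r
# Untidy: the "year" variable is hidden in the column names.
untidy <- data.frame(station = c("A", "B"),
                     trips_2018 = c(10, 20),
                     trips_2019 = c(15, 25))

# Tidy: one column per variable, one row per (station, year) observation.
tidy <- data.frame(station = c("A", "B", "A", "B"),
                   year    = c(2018, 2018, 2019, 2019),
                   trips   = c(10, 20, 15, 25))

# With tidy data, "total trips per year" is a single grouped sum.
per_year <- aggregate(trips ~ year, data = tidy, FUN = sum)
```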

Slide 12

Slide 12 text

Tidy data
• Most of the work goes into getting your data into shape
• After that, descriptive statistics, modeling, and visualization are easy

Slide 14

Slide 14 text

dplyr: a grammar of data manipulation
dplyr implements the split / apply / combine framework discussed in the last lecture
• Its “grammar” has five main verbs used in the “apply” phase:
  • filter: restrict rows based on a condition (N → N′)
  • arrange: reorder rows by a variable (N → N)
  • select: pick out specific columns (K → K′)
  • mutate: create new or change existing columns (K → K′)
  • summarize: collapse a column into one value (N → 1)
• The group_by function creates indices to take care of the split and combine phases
The cost is that you have to think “functionally”, in terms of “vectorized” operations
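The effect of each verb on the number of rows and columns can be checked on a toy table. This is a minimal sketch: the `trips` data frame below is a made-up stand-in that reuses column names from the lecture, and the station "W 4 St" and all values are invented.

```r
library(dplyr)

# Hypothetical toy data: 4 rows (N = 4), 3 columns (K = 3).
trips <- data.frame(
  tripduration = c(300, 1200, 600, 900),   # seconds
  start_station_name = c("Broadway & E 14 St", "W 4 St",
                         "Broadway & E 14 St", "W 4 St"),
  gender = c(1, 2, 2, 1),
  stringsAsFactors = FALSE
)

filtered   <- filter(trips, tripduration > 500)               # 4 rows -> 3 rows
arranged   <- arrange(trips, tripduration)                    # 4 rows, reordered
selected   <- select(trips, tripduration, gender)             # 3 columns -> 2
mutated    <- mutate(trips, time_in_min = tripduration / 60)  # 3 columns -> 4
summarized <- summarize(trips, mean_duration = mean(tripduration))  # 4 rows -> 1
```

Each verb takes a data frame as its first argument and returns a new data frame, leaving `trips` itself unchanged.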

Slide 15

Slide 15 text

filter
    filter(trips, start_station_name == "Broadway & E 14 St")

Slide 16

Slide 16 text

arrange
    arrange(trips, starttime)

Slide 17

Slide 17 text

select
    select(trips, starttime, stoptime, start_station_name, end_station_name)

Slide 18

Slide 18 text

mutate
    mutate(trips, time_in_min = tripduration / 60)

Slide 19

Slide 19 text

summarize
    summarize(trips,
              mean_duration = mean(tripduration) / 60,
              sd_duration = sd(tripduration) / 60)

Slide 20

Slide 20 text

group_by
    trips_by_gender <- group_by(trips, gender)
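On its own, group_by does not change the data. It only records which rows belong to which group, and later verbs respect that. A minimal sketch, using a made-up `trips` table with invented values:

```r
library(dplyr)

# Hypothetical toy data reusing the lecture's column names.
trips <- data.frame(tripduration = c(300, 1200, 600, 900),
                    gender = c(1, 2, 2, 1))

trips_by_gender <- group_by(trips, gender)

same_rows  <- nrow(trips_by_gender) == nrow(trips)  # TRUE: no rows added or dropped
num_groups <- n_groups(trips_by_gender)             # 2: one group per gender value
group_cols <- group_vars(trips_by_gender)           # "gender"

# A summary of the grouped table returns one row per group.
per_group <- summarize(trips_by_gender, count = n())

# ungroup() drops the grouping again.
ungrouped <- ungroup(trips_by_gender)
</gr_insert_placeholder>
```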

Slide 22

Slide 22 text

group_by + summarize
    summarize(trips_by_gender,
              count = n(),
              mean_duration = mean(tripduration) / 60,
              sd_duration = sd(tripduration) / 60)

Slide 23

Slide 23 text

%>%: the pipe operator
    trips %>%
      group_by(gender) %>%
      summarize(count = n(),
                mean_duration = mean(tripduration) / 60,
                sd_duration = sd(tripduration) / 60)
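The pipe passes the value on its left as the first argument of the call on its right, so a piped chain is equivalent to nested function calls. A small sketch on made-up data (the `trips` values are invented) that checks this:

```r
library(dplyr)

# Hypothetical toy data reusing the lecture's column names.
trips <- data.frame(tripduration = c(300, 1200, 600, 900),
                    gender = c(1, 2, 2, 1))

# Nested calls read inside-out.
nested <- summarize(group_by(trips, gender),
                    count = n(),
                    mean_duration = mean(tripduration) / 60)

# The piped version reads top to bottom, in the order the steps happen.
piped <- trips %>%
  group_by(gender) %>%
  summarize(count = n(),
            mean_duration = mean(tripduration) / 60)

same_result <- isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped)))
```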

Slide 24

Slide 24 text

gather: wide to long
    trips %>%
      gather("variable", "value", starttime, stoptime)

Slide 25

Slide 25 text

gather: wide to long
    trips %>%
      gather("variable", "value", starttime, stoptime) %>%
      arrange(value)

Slide 26

Slide 26 text

gather: wide to long
    trips %>%
      gather("variable", "value", starttime, stoptime) %>%
      arrange(value) %>%
      mutate(delta = ifelse(variable == "starttime", 1, -1))
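One plausible use of the +1 / -1 `delta` column is to count how many trips are in progress at each moment by taking a running sum. The slides stop at `delta`, so the `cumsum` step below is an assumption about where this is heading. The `trip_id` column, the `trips_in_progress` name, and all values (times given as plain minutes) are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three trips with start and stop times in minutes.
trips <- data.frame(trip_id   = c(1, 2, 3),
                    starttime = c(0, 5, 20),
                    stoptime  = c(10, 15, 30))

events <- trips %>%
  # wide to long: one row per start or stop event
  gather("variable", "value", starttime, stoptime) %>%
  # put all events in time order
  arrange(value) %>%
  # +1 when a trip starts, -1 when it ends, then a running total (assumed step)
  mutate(delta = ifelse(variable == "starttime", 1, -1),
         trips_in_progress = cumsum(delta))
```

Each trip contributes two rows to `events`, so three trips give six events.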

Slide 27

Slide 27 text

spread: long to wide
    trips_long %>%
      spread(variable, value)

Slide 28

Slide 28 text

spread: long to wide
    trips_long %>%
      head %>%
      spread(variable, value)
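spread undoes gather, provided the remaining columns identify each original row. A minimal round-trip sketch on made-up data, where the hypothetical `trip_id` column plays that identifying role:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with times given as plain minutes.
trips <- data.frame(trip_id   = c(1, 2, 3),
                    starttime = c(0, 5, 20),
                    stoptime  = c(10, 15, 30))

# wide to long: 3 rows become 6, one per (trip, variable) pair
trips_long <- trips %>%
  gather("variable", "value", starttime, stoptime)

# long to wide: each distinct value of `variable` becomes a column again
trips_wide <- trips_long %>%
  spread(variable, value)
```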

Slide 29

Slide 29 text

R for Data Science: r4ds.had.co.nz

Slide 30

Slide 30 text

The tidyverse style guide: style.tidyverse.org