SOC 4650/5650 - Lecture-07 - Data Cleaning

Slide 1

Slide 1 text

DATA CLEANING INTRO TO GISC CHRISTOPHER PRENER, PH.D. SPRING 2020 WEEK 07 LECTURE 07

Slide 2

Slide 2 text

AGENDA INTRO TO GISC / WEEK 07 / LECTURE 07 1. Front Matter 2. Types of Data 3. Data Wrangling 4. Back Matter

Slide 3

Slide 3 text

1 FRONT   MATTER

Slide 4

Slide 4 text

Clarifying PS-02 expectations - two conﬂicting copies were available through the course website. Please submit what you have on Opportunity Zones by Thursday at 5pm. 1. FRONT MATTER ANNOUNCEMENTS Next Week: Lab-05 (from last week), Lab-06 (from today), ﬁnal project check-in Two Weeks: PS-03 (from today), Lab-07 (from next week), video lectures over spring break

Slide 5

Slide 5 text

TYPES OF DATA 2

Slide 6

Slide 6 text

2. TYPES OF DATA WHAT ARE SPATIAL DATA? Tabular Geometric

Slide 7

Slide 7 text

2. TYPES OF DATA TABULAR DATA categorical ordinal continuous Tabular

Slide 8

Slide 8 text

2. TYPES OF DATA TABULAR DATA categorical ordinal continuous crime “ucr crime type”  1. murder 2. rape 3. robbery 4. aggravated assault

Slide 9

Slide 9 text

2. TYPES OF DATA TABULAR DATA categorical ordinal continuous crimeStr “ucr crime type, string”  murder rape robbery aggravated assault

Slide 10

Slide 10 text

2. TYPES OF DATA TABULAR DATA categorical ordinal continuous newCrime “new incident”  1. no 2. yes 0. 1.

Slide 11

Slide 11 text

2. TYPES OF DATA TABULAR DATA qualitative ordinal continuous newCrime “new incident”  1. no 2. yes 0. 1.

Slide 12

Slide 12 text

2. TYPES OF DATA TABULAR DATA categorical ordinal continuous leadLevels “rate of exposure”  1. very low 2. low 3. high 4. very high

Slide 13

Slide 13 text

2. TYPES OF DATA TABULAR DATA categorical ordinal continuous leadCount “number of high tests”  1 2 … 85

Slide 14

Slide 14 text

2. TYPES OF DATA GEOMETRIC DATA Tabular Geometric

Slide 15

Slide 15 text

DATA WRANGLING 3

Slide 16

Slide 16 text

4. DATA WRANGLING HIGH LEVEL WORKFLOW 1. Plan 2. Organize 3. Document 4. Execute FOR EACH  STEP:

Slide 17

Slide 17 text

IT IS OFTEN SAID THAT 80% OF DATA ANALYSIS IS SPENT ON THE PROCESS OF CLEANING AND PREPARING THE DATA Hadley Wickham “Tidy Data”  (2014)

Slide 18

Slide 18 text

HAPPY FAMILIES ARE ALL ALIKE; EVERY UNHAPPY FAMILY IS UNHAPPY IN ITS OWN WAY Leo Tolstoy Anna Karenina  (1878)

Slide 19

Slide 19 text

HAPPY FAMILIES ARE ALL ALIKE; EVERY UNHAPPY FAMILY IS UNHAPPY IN ITS OWN WAY Leo Tolstoy Anna Karenina  (1878)

Slide 20

Slide 20 text

LIKE FAMILIES, TIDY DATASETS ARE ALL ALIKE BUT EVERY MESSY DATASET IS MESSY IN ITS OWN WAY. Hadley Wickham “Tidy Data”  (2014)

Slide 21

Slide 21 text

TIDY DATASETS PROVIDE A STANDARDIZED WAY TO LINK THE STRUCTURE OF A DATASET (ITS PHYSICAL LAYOUT) WITH ITS SEMANTICS (ITS MEANING). Hadley Wickham “Tidy Data”  (2014)

Slide 22

Slide 22 text

4. DATA WRANGLING TIDY DATA Each variable should be saved in its own column. A B C D

Slide 23

Slide 23 text

▸ Are variables named consistently? ▸ Are variable names clear? ▸ Are variables stored in the format that makes the most sense for their data? ▸ Do variables represent one and only one construct? ▸ Is there a unique identiﬁcation variable? ▸ Is there missing or incomplete data? 4. DATA WRANGLING A B VARIABLES

Slide 24

Slide 24 text

4. DATA WRANGLING TIDY DATA Each observation should be saved in its own row. A B C D

Slide 25

Slide 25 text

▸ What is the observational unit? 4. DATA WRANGLING A B OBSERVATIONS

Slide 26

Slide 26 text

4. DATA WRANGLING TIDY DATA Each table (i.e. each ﬁle or data frame) should contain a single observational unit. A B C D Car Dealer Brand Cars Dealers Brands

Slide 27

Slide 27 text

▸ What is the observational unit? • Do the data need to be subset into tables with different observational units? ▸ Are there duplicate observations? ▸ Are there “near” duplicate observations? 4. DATA WRANGLING A B OBSERVATIONS

Slide 28

Slide 28 text

▸ readr for reading tabular data ▸ janitor for its data wrangling functions ▸ dplyr for data wrangling functions ▸ naniar for missing data analyses ▸ stringr for detecting patterns in string data 4. DATA WRANGLING PACKAGES

Slide 29

Slide 29 text

3. DATA WRANGLING VERBS FOR DATA CLEANING janitor::clean_names() is used for renaming all columns id Bad name Very long name Meh name id bad_name very_long_name meh_name

Slide 30

Slide 30 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::rename() is used for renaming some columns id a b c id new_name b c

Slide 31

Slide 31 text

3. DATA WRANGLING VERBS FOR DATA CLEANING naniar::miss_var_summary() is used identifying missing data id a b c NA NA NA miss_var_summary(data) #> # A tibble: 4 x 3 #> variable n_miss pct_miss #> #> 1 c 2 50.0 #> 2 d 1 24.0 #> 3 id 0 0 #> 4 a 0 0

Slide 32

Slide 32 text

3. DATA WRANGLING VERBS FOR DATA CLEANING janitor::get_dupes() is used identifying duplicates id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE > get_dupes(data) No variable names specified - using all columns. # A tibble: 2 x 5 id a b c dupe_count 1 1 high 24 TRUE 2 2 1 high 24 TRUE 2

Slide 33

Slide 33 text

3. DATA WRANGLING VERBS FOR DATA CLEANING janitor::get_dupes() is used identifying duplicates id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE > get_dupes(data, id) # A tibble: 2 x 5 id dupe_count a b c 1 1 2 high 24 TRUE 2 1 2 high 24 TRUE

Slide 34

Slide 34 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::distinct() is used for removing duplicates id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b c 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE

Slide 35

Slide 35 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::select() is used for subsetting columns id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b 1 high 24 1 high 24 2 low 67 3 low 89

Slide 36

Slide 36 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::select() is also used for reordering columns id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id c a b 1 TRUE high 24 1 TRUE high 24 2 FALSE low 67 3 TRUE low 89

Slide 37

Slide 37 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::filter() is used for subsetting observations id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b c 2 low 67 FALSE 3 low 89 TRUE

Slide 38

Slide 38 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::mutate() is used for creating new variables id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b c d 1 high 24 TRUE TRUE 1 high 24 TRUE TRUE 2 low 67 FALSE FALSE 3 low 89 TRUE FALSE

Slide 39

Slide 39 text

3. DATA WRANGLING VERBS FOR DATA CLEANING stringr::str_detect() is used with mutate() to search within strings id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland River 89 TRUE id name b c d 1 Highland Creek 24 TRUE TRUE 1 Highland Creek 24 TRUE TRUE 2 Lost River 67 FALSE FALSE 3 Highland River 89 TRUE TRUE

Slide 40

Slide 40 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::arrange() is used to re-order observations id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland River 89 TRUE id name b c 3 Highland River 89 TRUE 2 Lost River 67 FALSE 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE

Slide 41

Slide 41 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::group() is used to create groups that can be summarized id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland Creek 89 TRUE id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 3 Highland Creek 89 TRUE id name b c 2 Lost River 67 FALSE

Slide 42

Slide 42 text

3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::summarize() is used to then summarize grouped data id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland Creek 89 TRUE name n mean_b Highland Creek 3 77.6 Lost River 1 67

Slide 43

Slide 43 text

4 BACK   MATTER

Slide 44

Slide 44 text

AGENDA REVIEW 4. BACK MATTER 2. Types of Data 3. Data Wrangling

Slide 45

Slide 45 text

REMINDERS 4. BACK MATTER Final project data posted; additional vignettes forthcoming Next Week: PS-02 (from last few weeks), Lab-06 (from today) Two Weeks: PS-03 (from today), Lab-07 (from next week), ﬁnal project check-in, video lectures over spring break