Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SOC 4650/5650 - Lecture-07 - Data Cleaning

SOC 4650/5650 - Lecture-07 - Data Cleaning

Christopher Prener

February 24, 2020
Tweet

More Decks by Christopher Prener

Other Decks in Education

Transcript

  1. AGENDA INTRO TO GISC / WEEK 07 / LECTURE 07

    1. Front Matter 2. Types of Data 3. Data Wrangling 4. Back Matter
  2. Clarifying PS-02 expectations - two conflicting copies were available through

    the course website. Please submit what you have on Opportunity Zones by Thursday at 5pm. 1. FRONT MATTER ANNOUNCEMENTS Next Week: Lab-05 (from last week), Lab-06 (from today), final project check-in Two Weeks: PS-03 (from today), Lab-07 (from next week), video lectures over spring break
  3. 2. TYPES OF DATA TABULAR DATA categorical ordinal continuous crime

    “ucr crime type”
 1. murder 2. rape 3. robbery 4. aggravated assault
  4. 2. TYPES OF DATA TABULAR DATA categorical ordinal continuous crimeStr

    “ucr crime type, string”
 murder rape robbery aggravated assault
  5. 2. TYPES OF DATA TABULAR DATA categorical ordinal continuous leadLevels

    “rate of exposure”
 1. very low 2. low 3. high 4. very high
  6. 4. DATA WRANGLING HIGH LEVEL WORKFLOW 1. Plan 2. Organize

    3. Document 4. Execute FOR EACH
 STEP:
  7. IT IS OFTEN SAID THAT 80% OF DATA ANALYSIS IS

    SPENT ON THE PROCESS OF CLEANING AND PREPARING THE DATA Hadley Wickham “Tidy Data”
 (2014)
  8. HAPPY FAMILIES ARE ALL ALIKE; EVERY UNHAPPY FAMILY IS UNHAPPY

    IN ITS OWN WAY Leo Tolstoy Anna Karenina
 (1878)
  9. HAPPY FAMILIES ARE ALL ALIKE; EVERY UNHAPPY FAMILY IS UNHAPPY

    IN ITS OWN WAY Leo Tolstoy Anna Karenina
 (1878)
  10. LIKE FAMILIES, TIDY DATASETS ARE ALL ALIKE BUT EVERY MESSY

    DATASET IS MESSY IN ITS OWN WAY. Hadley Wickham “Tidy Data”
 (2014)
  11. TIDY DATASETS PROVIDE A STANDARDIZED WAY TO LINK THE STRUCTURE

    OF A DATASET (ITS PHYSICAL LAYOUT) WITH ITS SEMANTICS (ITS MEANING). Hadley Wickham “Tidy Data”
 (2014)
  12. ▸ Are variables named consistently? ▸ Are variable names clear?

    ▸ Are variables stored in the format that makes the most sense for their data? ▸ Do variables represent one and only one construct? ▸ Is there a unique identification variable? ▸ Is there missing or incomplete data? 4. DATA WRANGLING A B VARIABLES
  13. 4. DATA WRANGLING TIDY DATA Each table (i.e. each file

    or data frame) should contain a single observational unit. A B C D Car Dealer Brand Cars Dealers Brands
  14. ▸ What is the observational unit? • Do the data

    need to be subset into tables with different observational units? ▸ Are there duplicate observations? ▸ Are there “near” duplicate observations? 4. DATA WRANGLING A B OBSERVATIONS
  15. ▸ readr for reading tabular data ▸ janitor for its

    data wrangling functions ▸ dplyr for data wrangling functions ▸ naniar for missing data analyses ▸ stringr for detecting patterns in string data 4. DATA WRANGLING PACKAGES
  16. 3. DATA WRANGLING VERBS FOR DATA CLEANING janitor::clean_names() is used

    for renaming all columns id Bad name Very long name Meh name id bad_name very_long_name meh_name
  17. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::rename() is used

    for renaming some columns id a b c id new_name b c
  18. 3. DATA WRANGLING VERBS FOR DATA CLEANING naniar::miss_var_summary() is used

    identifying missing data id a b c NA NA NA miss_var_summary(data) #> # A tibble: 4 x 3 #> variable n_miss pct_miss #> <chr> <int> <dbl> #> 1 c 2 50.0 #> 2 d 1 24.0 #> 3 id 0 0 #> 4 a 0 0
  19. 3. DATA WRANGLING VERBS FOR DATA CLEANING janitor::get_dupes() is used

    identifying duplicates id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE > get_dupes(data) No variable names specified - using all columns. # A tibble: 2 x 5 id a b c dupe_count <dbl> <chr> <dbl> <lgl> <int> 1 1 high 24 TRUE 2 2 1 high 24 TRUE 2
  20. 3. DATA WRANGLING VERBS FOR DATA CLEANING janitor::get_dupes() is used

    identifying duplicates id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE > get_dupes(data, id) # A tibble: 2 x 5 id dupe_count a b c <dbl> <int> <chr> <dbl> <lgl> 1 1 2 high 24 TRUE 2 1 2 high 24 TRUE
  21. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::distinct() is used

    for removing duplicates id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b c 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE
  22. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::select() is used

    for subsetting columns id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b 1 high 24 1 high 24 2 low 67 3 low 89
  23. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::select() is also

    used for reordering columns id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id c a b 1 TRUE high 24 1 TRUE high 24 2 FALSE low 67 3 TRUE low 89
  24. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::filter() is used

    for subsetting observations id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b c 2 low 67 FALSE 3 low 89 TRUE
  25. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::mutate() is used

    for creating new variables id a b c 1 high 24 TRUE 1 high 24 TRUE 2 low 67 FALSE 3 low 89 TRUE id a b c d 1 high 24 TRUE TRUE 1 high 24 TRUE TRUE 2 low 67 FALSE FALSE 3 low 89 TRUE FALSE
  26. 3. DATA WRANGLING VERBS FOR DATA CLEANING stringr::str_detect() is used

    with mutate() to search within strings id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland River 89 TRUE id name b c d 1 Highland Creek 24 TRUE TRUE 1 Highland Creek 24 TRUE TRUE 2 Lost River 67 FALSE FALSE 3 Highland River 89 TRUE TRUE
  27. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::arrange() is used

    to re-order observations id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland River 89 TRUE id name b c 3 Highland River 89 TRUE 2 Lost River 67 FALSE 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE
  28. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::group() is used

    to create groups that can be summarized id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland Creek 89 TRUE id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 3 Highland Creek 89 TRUE id name b c 2 Lost River 67 FALSE
  29. 3. DATA WRANGLING VERBS FOR DATA CLEANING dplyr::summarize() is used

    to then summarize grouped data id name b c 1 Highland Creek 24 TRUE 1 Highland Creek 24 TRUE 2 Lost River 67 FALSE 3 Highland Creek 89 TRUE name n mean_b Highland Creek 3 77.6 Lost River 1 67
  30. REMINDERS 4. BACK MATTER Final project data posted; additional vignettes

    forthcoming Next Week: PS-02 (from last few weeks), Lab-06 (from today) Two Weeks: PS-03 (from today), Lab-07 (from next week), final project check-in, video lectures over spring break