
tidystl


Christopher Prener

March 28, 2019


Transcript

  1. THE TIDYSTL ECOSYSTEM: MAPPING HEALTH AND SOCIAL DATA IN ST. LOUIS

    CHRISTOPHER PRENER, PH.D. / STATLAB SPEAKER SERIES / BROWN SCHOOL OF PUBLIC HEALTH / MARCH 27, 2019 / SAINT LOUIS UNIVERSITY
  2. AGENDA (THE TIDYSTL ECOSYSTEM / STATLAB / MARCH 27, 2019)

    1. Preface
    2. Knotty Problems with Local Data
    3. The TidySTL Ecosystem
    4. Opening Doors
    5. Why Is This Important?
    6. Final Thoughts
  3. 1. PREFACE: RESEARCH ↔ TEACHING ↔ SERVICE

    Started to address data problems identified through neighborhood disorder research.
  4. 2. KNOTTY PROBLEMS WITH LOCAL DATA: NO STANDARD API ACCESS

    City of St. Louis Open Data Site: requires manual download of spatial data, tabular data, and reports.
  5. NO STANDARD API ACCESS

    St. Louis Metropolitan Police Department: requires manual download of tabular data. Each month since 2008 is a separate file. [Figure: timeline from ‘08 to ‘18, one marker = 1 month]
  6. NO STANDARD API ACCESS

    St. Louis Metropolitan Police Department: requires manual download. Each file is downloaded with a .html file extension. [Figure: timeline from ‘08 to ‘18, one marker = 1 month]
  7. 2. KNOTTY PROBLEMS WITH LOCAL DATA: CLOSED FORMATS

    City of St. Louis Open Data Site: .shp (ESRI shapefile), .accdb (Access database), reports (.pdf files)
  8. IT IS OFTEN SAID THAT 80% OF DATA ANALYSIS IS SPENT ON THE PROCESS OF CLEANING AND PREPARING THE DATA

    Hadley Wickham, “Tidy Data” (2014)
  9. HAPPY FAMILIES ARE ALL ALIKE; EVERY UNHAPPY FAMILY IS UNHAPPY IN ITS OWN WAY

    Leo Tolstoy, Anna Karenina (1878)
  10. LIKE FAMILIES, TIDY DATASETS ARE ALL ALIKE BUT EVERY MESSY DATASET IS MESSY IN ITS OWN WAY.

    Hadley Wickham, “Tidy Data” (2014)
  11. SLMPD DATA

    St. Louis Metropolitan Police Department: requires manual download of tabular data. Different months have different numbers of columns. [Figure: timeline from ‘08 to ‘18, months keyed by column count: 18, 20, or 26 columns]
  12. 2. KNOTTY PROBLEMS WITH LOCAL DATA: GIS DATA VS. GIS SKILLS

    So many of the social science and public health questions we are interested in are located:
    ‣ Where do these patients live?
    ‣ What is the neighborhood like?
    ‣ How segregated is this community?
    ‣ How many violent crimes occur in each tract?
    Answering these questions requires specialized skills!
  13. 2. KNOTTY PROBLEMS WITH LOCAL DATA: GEOCODING IS EXPENSIVE

    6335 Forsyth Blvd, St. Louis, MO 63015 → 38.646296, -90.303917
    ‣ ESRI: $500 to $3,800 per license per year plus ~$4 per 1,000 geocodes (can be done without programming)
    ‣ Google: $4 to $5 per 1,000 geocodes depending on volume (also requires programming skills)
  14. 2. KNOTTY PROBLEMS WITH LOCAL DATA: GEOCODING IS EXPENSIVE

    1.4 million addresses in the City CSB data: ~$5,600
    ‣ ESRI: $500 to $3,800 per license per year plus ~$4 per 1,000 geocodes (can be done without programming)
    ‣ Google: $4 to $5 per 1,000 geocodes depending on volume (also requires programming skills)
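
The ~$5,600 figure follows from the per-geocode rates quoted on the slide. A minimal sketch of that arithmetic in Python (the function name is hypothetical, and license fees are deliberately left out):

```python
def geocoding_cost(n_addresses: int, usd_per_thousand: float) -> float:
    """Per-request cost only; any annual license fee is excluded."""
    return n_addresses / 1000 * usd_per_thousand


n_addresses = 1_400_000  # addresses in the City CSB data, per the slide

low = geocoding_cost(n_addresses, 4.0)   # at $4 per 1,000 geocodes
high = geocoding_cost(n_addresses, 5.0)  # at $5 per 1,000 geocodes

print(f"${low:,.0f} to ${high:,.0f}")
```

At the $4 rate this gives the slide's $5,600; at the upper Google rate of $5 the same workload would cost $7,000.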
  15. 3. THE TIDYSTL ECOSYSTEM: TIDYSTL IS…

    A family of St. Louis-specific and more general geospatial software implemented in R.
  16. 3. THE TIDYSTL ECOSYSTEM: UNDERSTANDING R

    ▸ R is…
    ▸ Open source (free!)
    ▸ Originally designed for statistical tests and graphics…
    ▸ … but has rapidly expanded into other domains like data management and scientific communication.
    ▸ Modular in design
  17. 3. THE TIDYSTL ECOSYSTEM: UNDERSTANDING R

    There are currently 13,896 packages available on CRAN (exclusively R). There are currently 3,622 packages available on SSC (mostly Stata and SAS). [Figure: pictogram, one symbol = 200 packages]
  18. 3. THE TIDYSTL ECOSYSTEM: UNDERSTANDING R

    There are currently 13,896 packages available on CRAN (exclusively R). [Figure: pictogram, one symbol = 200 packages]
    ▸ There are dedicated package repositories for:
    ▸ Bioconductor - High-throughput genomics (1,649 packages)
    ▸ Neuroconductor - Computational imaging (51 packages)
    ▸ rOpenSci - Open science infrastructure (266 packages)
  19. OPEN SOURCE GEOCODING

    Address normalization makes all geocoders more accurate.

    id  address_original                         address_norm
    1   1416 Delmar Boulevard                    1416 Delmar Blvd
    2   3803 EAST MLK DRIV                       3803 E Dr Martin Luther King Dr
    3   3700 LINDELL BLVD                        3700 Lindell Blvd
    4   LINDELL BLVD / SOUTH VANDEVENTER AVENUE  Lindell Blvd at S Vandeventer Ave
    5   1420-22 Delmar Boulevard                 1420 Delmar Blvd
    5   1420-22 Delmar Boulevard                 1422 Delmar Blvd
  20. OPEN SOURCE GEOCODING

    Address normalization makes all geocoders more accurate.

    id  address_norm
    1   1416 Delmar Blvd
    2   3803 E Dr Martin Luther King Dr
    3   3700 Lindell Blvd
    4   Lindell Blvd at S Vandeventer Ave
    5   1420 Delmar Blvd
    5   1422 Delmar Blvd
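
The kinds of normalization shown in the table can be sketched in a few lines. What follows is an illustrative Python sketch, not the tidystl packages' R implementation: the function names and the small lookup tables are assumptions made for this example. It covers rows 1, 3, 4, and 5 of the table; it does not attempt the alias and misspelling repair needed for row 2 ("MLK DRIV").

```python
import re
from typing import List

# Illustrative lookup tables (assumptions for this sketch, far from complete).
SUFFIXES = {"BOULEVARD": "Blvd", "BLVD": "Blvd", "AVENUE": "Ave", "AVE": "Ave",
            "STREET": "St", "ST": "St", "DRIVE": "Dr", "DR": "Dr"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def normalize_street(street: str) -> str:
    """Abbreviate a leading direction and a trailing suffix; title-case the rest."""
    words = street.upper().split()
    out = []
    for i, word in enumerate(words):
        if i == 0 and word in DIRECTIONS and len(words) > 2:
            out.append(DIRECTIONS[word])
        elif i == len(words) - 1 and word in SUFFIXES:
            out.append(SUFFIXES[word])
        else:
            out.append(word.capitalize())
    return " ".join(out)


def normalize_address(address: str) -> List[str]:
    """Return one normalized string per address implied by the input."""
    address = address.strip()
    # Intersections: "A / B" becomes "A at B".
    if "/" in address:
        left, right = (normalize_street(part) for part in address.split("/", 1))
        return [f"{left} at {right}"]
    match = re.match(r"^(\d+)(?:-(\d+))?\s+(.+)$", address)
    if match is None:
        return [normalize_street(address)]
    first, second, street = match.groups()
    numbers = [first]
    if second is not None:
        # "1420-22" is read as 1420 and 1422: the short second number
        # replaces the trailing digits of the first.
        numbers.append(first[: len(first) - len(second)] + second)
    street = normalize_street(street)
    return [f"{number} {street}" for number in numbers]


for raw in ["1416 Delmar Boulevard", "3700 LINDELL BLVD",
            "LINDELL BLVD / SOUTH VANDEVENTER AVENUE", "1420-22 Delmar Boulevard"]:
    print(raw, "->", normalize_address(raw))
```

Returning a list lets an address range such as "1420-22" expand into two rows that share an id, as in the table above.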
  21. OPEN SOURCE GEOCODING

    Local geocoding does not use APIs at all, so it can be used with data protected by HIPAA.

    id  address_norm
    1   1416 Delmar Blvd
    2   3803 E Dr Martin Luther King Dr
    3   3700 Lindell Blvd
    4   Lindell Blvd at S Vandeventer Ave
    5   1420 Delmar Blvd
  22. OPEN SOURCE GEOCODING

    Local geocoding does not use APIs at all, so it can be used with data protected by HIPAA.

    id  address_norm
    1   1416 Delmar Blvd
    2   3803 E Dr Martin Luther King Dr
    3   3700 Lindell Blvd
    4   Lindell Blvd at S Vandeventer Ave
    5   1420 Delmar Blvd
    6   3700 Lindell
  23. OPEN SOURCE GEOCODING

    Local geocoding does not use APIs at all, so it can be used with data protected by HIPAA.

    id  address_norm                       x      y     source
    1   1416 Delmar Blvd                   -90.2  38.6  local
    2   3803 E Dr Martin Luther King Dr    -90.2  38.7  local
    3   3700 Lindell Blvd                  -90.2  38.6  local
    4   Lindell Blvd at S Vandeventer Ave  NA     NA    NA
    5   1420 Delmar Blvd                   NA     NA    NA
    6   3700 Lindell                       NA     NA    NA
  24. OPEN SOURCE GEOCODING

    The gateway package also offers a “short” local geocoder that can geocode addresses missing their suffix data.

    id  address_norm                       x      y     source
    1   1416 Delmar Blvd                   NA     NA    NA
    2   3803 E Dr Martin Luther King Dr    NA     NA    NA
    3   3700 Lindell Blvd                  NA     NA    NA
    4   Lindell Blvd at S Vandeventer Ave  NA     NA    NA
    5   1420 Delmar Blvd                   NA     NA    NA
    6   3700 Lindell                       -90.2  38.6  local, short
  25. COMPOSITE GEOCODING

    Composite geocoders “chain” together multiple geocoders to increase the likelihood of a positive match.

    Local geocoder → Local geocoder, short
  26. COMPOSITE GEOCODING

    When we use our local composite geocoder, we get back the results from both the local and short form geocoders in one call.

    id  address_norm                       x      y     source
    1   1416 Delmar Blvd                   -90.2  38.6  local
    2   3803 E Dr Martin Luther King Dr    -90.2  38.7  local
    3   3700 Lindell Blvd                  -90.2  38.6  local
    4   Lindell Blvd at S Vandeventer Ave  NA     NA    NA
    5   1420 Delmar Blvd                   NA     NA    NA
    6   3700 Lindell                       -90.2  38.6  local, short
  27. COMPOSITE GEOCODING

    For data that can be sent to an API, we can increase our likelihood of positive matches by adding additional data sources.

    Local geocoder → Local geocoder, short → City Address Candidate API
  28. OPEN SOURCE GEOCODING

    The gateway package also offers a “short” local geocoder that can geocode addresses missing their suffix data.

    id  address_norm                       x      y     source
    1   1416 Delmar Blvd                   -90.2  38.6  local
    2   3803 E Dr Martin Luther King Dr    -90.2  38.7  local
    3   3700 Lindell Blvd                  -90.2  38.6  local
    4   Lindell Blvd at S Vandeventer Ave  -90.2  38.6  city candidate
    5   1420 Delmar Blvd                   -90.2  38.6  city candidate
    6   3700 Lindell                       -90.2  38.6  local, short
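
The chaining idea from the composite geocoding slides can be sketched as a loop that tries each geocoder in order and records which one produced the match. This is a Python illustration under stated assumptions, not the gateway package's R API: every name below is hypothetical, the reference data are toy values at the slides' rounded precision, and the third stage is a stub that makes no network request.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coordinates = Tuple[float, float]  # (x, y): longitude, then latitude
Geocoder = Callable[[str], Optional[Coordinates]]

# Toy stand-in for a local address dictionary (made-up rounded values).
LOCAL_ADDRESSES: Dict[str, Coordinates] = {
    "1416 Delmar Blvd": (-90.2, 38.6),
    "3700 Lindell Blvd": (-90.2, 38.6),
}


def local(address: str) -> Optional[Coordinates]:
    """Exact match on the full normalized address."""
    return LOCAL_ADDRESSES.get(address)


def local_short(address: str) -> Optional[Coordinates]:
    """Match with the street suffix dropped from the reference data."""
    short = {full.rsplit(" ", 1)[0]: xy for full, xy in LOCAL_ADDRESSES.items()}
    return short.get(address)


def city_candidate_stub(address: str) -> Optional[Coordinates]:
    """Stand-in for a remote API lookup; makes no network request."""
    return {"1420 Delmar Blvd": (-90.2, 38.6)}.get(address)


# Order matters: cheaper, local stages run first.
CHAIN: List[Tuple[str, Geocoder]] = [
    ("local", local),
    ("local, short", local_short),
    ("city candidate", city_candidate_stub),
]


def composite(address: str, chain: List[Tuple[str, Geocoder]] = CHAIN) -> Dict[str, object]:
    """Try each geocoder in turn and stop at the first positive match."""
    for source, geocoder in chain:
        result = geocoder(address)
        if result is not None:
            return {"address": address, "x": result[0], "y": result[1], "source": source}
    return {"address": address, "x": None, "y": None, "source": None}


for address in ["1416 Delmar Blvd", "3700 Lindell", "1420 Delmar Blvd", "999 Nowhere Rd"]:
    print(composite(address))
```

For data that cannot leave the local machine, the same function can be called with only the first two stages, for example `composite(address, CHAIN[:2])`.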
  29. COMPOSITE GEOCODING

    From 2008 through 2018, there were 1,822 homicides in St. Louis. Where were they located?

    matches                     n      pct
    Local geocoder              1,466  80.46%
    Local geocoder, short       50     2.74%
    City Address Candidate API  281    15.42%

    We successfully matched 1,797 (98.62%) of the addresses in the homicide data set (< 2 minutes runtime).
  30. STREAMLINED DATA ACCESS

    Instead of manual downloads, a set of consistent APIs for downloading and managing data.
    A single function for:
    • City of St. Louis open data
    • SLU openGIS data
    • Common Census spatial data
    • Common Census demographic data
    A set of functions for:
    • Downloading 10 years of non-emergency call data
    • Subsetting and wrangling these data
  31. STREAMLINED DATA ACCESS

    Validation, standardization, and wrangling for SLMPD data. Different months have different numbers of columns. [Figure: timeline from ‘08 to ‘18, months keyed by column count: 18, 20, or 26 columns] Result: a single data set of 1,822 homicides (28 seconds of runtime).
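
Standardizing monthly files that disagree on their columns can be sketched as mapping every month onto one target schema before combining. The Python below is an illustration only: the column names, rename rules, and records are invented for the example and are not the actual SLMPD fields or the tidystl packages' R functions.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Hypothetical target schema and rename rules (not real SLMPD column names).
STANDARD_COLUMNS = ["id", "date", "offense"]
RENAMES = {"ID": "id", "Date": "date", "date_occur": "date", "Offense": "offense"}


def standardize(rows: List[Row]) -> List[Row]:
    """Rename known columns, drop extras, and fill missing columns with None."""
    out = []
    for row in rows:
        renamed = {RENAMES.get(key, key): value for key, value in row.items()}
        out.append({column: renamed.get(column) for column in STANDARD_COLUMNS})
    return out


def combine(months: List[List[Row]]) -> List[Row]:
    """Stack standardized months into a single table."""
    combined: List[Row] = []
    for rows in months:
        combined.extend(standardize(rows))
    return combined


# Two made-up months with different layouts.
month_a = [{"ID": "1", "Date": "2008-01-05", "Offense": "10000", "Extra": "x"}]
month_b = [{"id": "2", "date_occur": "2013-06-01"}]

combined = combine([month_a, month_b])
print(combined)
```

Once every month shares the same columns, filtering the combined table to one offense type becomes a single subsetting step.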
  32. 5. WHY IS THIS IMPORTANT? LOWER COSTS

    1.4 million addresses in the City CSB data: ~$5,600
    ‣ ESRI: $500 to $3,800 per license per year plus ~$4 per 1,000 geocodes (can be done without programming)
    ‣ Google: $4 to $5 per 1,000 geocodes depending on volume (also requires programming skills)
  33. 5. WHY IS THIS IMPORTANT? LOWER COSTS

    1.4 million addresses in the City CSB data: $0
    ‣ ESRI: $500 to $3,800 per license per year plus ~$4 per 1,000 geocodes (can be done without programming)
    ‣ Google: $4 to $5 per 1,000 geocodes depending on volume (also requires programming skills)
  34. REPRODUCIBLE WORKFLOWS

    • Standardized data wrangling
    • Predictable address normalization
    • Open source geocoding
    • Straightforward mapping
    • Convert data to well-known geographies
  35. 6. FINAL THOUGHTS: WHERE WE’RE HEADED

    Expanding access to:
    • City of St. Louis open data
    • SLU openGIS data
    • Common Census spatial data
    • Common Census demographic data
  36. COMPOSITE GEOCODING

    • Local geocoder
    • Local geocoder, short
    • City batch geocoding API
    • City address candidate API
    • Census Bureau geocoding API
    • OpenStreetMap geocoding API
  37. LEARN MORE: THANKS FOR COMING!

    Slides available via SpeakerDeck: speakerdeck.com/chrisprener/tidystl18
    [email protected]
    https://chris-prener.github.io
    @chrisprener
    https://github.com/slu-openGIS
    Learn more about tidystl: https://slu-openGIS.github.io