You can't do data science in a GUI

Hadley Wickham

March 07, 2018

Transcript

  1. You can’t do data science in a GUI
    Hadley Wickham (@hadleywickham), Chief Scientist, RStudio
    March 2018
  2. Data science is the process by which data becomes understanding, knowledge and insight
  3. Data science is the process by which data becomes understanding, knowledge and insight
  4. Import -> Tidy (store data consistently)
  5. Import -> Tidy (store data consistently) -> Understand
  6. Import -> Tidy (store data consistently) -> Transform, Visualise, Model
    Transform: create new variables & new summaries
    Visualise: surprises, but doesn't scale
    Model: scales, but doesn't (fundamentally) surprise
  7. Import -> Tidy (store data consistently) -> Transform, Visualise, Model -> Communicate; Automate
    Transform: create new variables & new summaries
    Visualise: surprises, but doesn't scale
    Model: scales, but doesn't (fundamentally) surprise
  8. Stages: Import, Tidy, Transform, Visualise, Model, Program
    Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org
    r4ds.had.co.nz
  9. Why program?

  10. Think it (cognitive) -> Describe it (precisely) -> Do it (computational)

  11. None
  12. None
  13. Programming languages are languages
    table %>%
      rename(player = X1, team = X2, position = X3) %>%
      filter(player != 'PLAYER') %>%
      mutate(
        college = ifelse(player == position, player, NA)
      ) %>%
      fill(college) %>%
      filter(player != college)
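To see what the pipeline on slide 13 does, here is a small sketch that runs it on made-up data. The slide does not show the input table, so its layout is an assumption: header rows repeat the column names, and each college appears as a row whose cells all hold the college name. The player names, teams and colleges are invented, and the sketch assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical scraped table: invented rows in an assumed layout
scraped <- data.frame(
  X1 = c("PLAYER", "Alabama", "A. Smith", "B. Jones", "PLAYER", "Clemson", "C. Lee"),
  X2 = c("TEAM", "Alabama", "Bears", "Lions", "TEAM", "Clemson", "Jets"),
  X3 = c("POSITION", "Alabama", "QB", "WR", "POSITION", "Clemson", "RB"),
  stringsAsFactors = FALSE
)

players <- scraped %>%
  rename(player = X1, team = X2, position = X3) %>%
  # drop the repeated header rows
  filter(player != "PLAYER") %>%
  # a college row has the same text in every cell; mark it, leave the rest NA
  mutate(college = ifelse(player == position, player, NA)) %>%
  # carry each college name down onto the player rows beneath it
  fill(college) %>%
  # drop the college rows themselves, keeping only players
  filter(player != college)

print(players)
```

Each line is one verb, so the script reads as a precise description of the cleaning steps: three player rows remain, each with its college attached.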
  14. It’s just text! And this gives you access to two extremely powerful techniques
  15. ⌘C ⌘V

  16. None
  17. And provides provenance: Reproducible, Readable, Diffable, Open

  18. None
  19. None
  20. None
  21. None
  22. None
  23. None
  24. None
  25. None
  26. I live in fear of clicking the wrong thing

  27. Why program in R?

  28. R is a vector language
    x <- sample(100, 10)
    x > 50
    #> [1]  TRUE FALSE FALSE  TRUE  TRUE
    #> [6]  TRUE  TRUE FALSE FALSE  TRUE
    sum(x > 50)
    #> [1] 6
    # (There are no scalars!)
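A minimal sketch of the "no scalars" point, using fixed made-up values instead of `sample()` so the results are repeatable. It relies only on base R.

```r
x <- c(12, 87, 55, 3, 64)    # made-up values

# A "single number" is just a vector of length 1
len_one <- length(42)         # 1

# Comparison is applied to every element at once, no loop needed
above <- x > 50               # FALSE TRUE TRUE FALSE TRUE

# Logical vectors count as 0/1 in arithmetic
n_above <- sum(above)         # 3
share_above <- mean(above)    # 0.6

print(above)
print(n_above)
print(share_above)
```

Because every value is a vector, counting and computing proportions fall out of the same two functions, `sum()` and `mean()`.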
  29. Missing values are baked in
    y <- sample(c(1:5, NA))
    y
    #> [1]  1 NA  2  3  5  4
    y > 2
    #> [1] FALSE    NA FALSE  TRUE  TRUE  TRUE
    y == NA
    #> [1] NA NA NA NA NA NA
  30. An example makes this clearer
    john_age <- NA
    mary_age <- NA
    john_age == mary_age
    #> [1] NA
  31. Missing values are baked in
    y <- sample(c(1:5, NA))
    y
    #> [1]  1 NA  2  3  5  4
    y > 2
    #> [1] FALSE    NA FALSE  TRUE  TRUE  TRUE
    is.na(y)
    #> [1] FALSE  TRUE FALSE FALSE FALSE FALSE
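A short sketch of how missing values propagate through summaries, using the same values as the slide output but in a fixed order. It relies only on base R.

```r
y <- c(1, NA, 2, 3, 5, 4)

# One unknown element makes the total unknown
n_big_naive <- sum(y > 2)            # NA

# Dropping missing values has to be requested explicitly
n_big <- sum(y > 2, na.rm = TRUE)    # 3

# is.na() is the way to test for missingness; y == NA is always NA
n_missing <- sum(is.na(y))           # 1

# Logic stays known when the answer does not depend on the missing value
known_false <- NA & FALSE            # FALSE
known_true <- NA | TRUE              # TRUE

print(n_big_naive)
print(n_big)
print(n_missing)
```

The design choice is that R never silently discards unknowns: the analyst has to say `na.rm = TRUE` to opt out.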
  32. So are relational tables (aka data frames/tibbles)
    data.frame(
      x = 1:4,
      y = sample(letters[1:4]),
      z = runif(4)
    )
    #>   x y         z
    #> 1 1 c 0.1189635
    #> 2 2 a 0.0518956
    #> 3 3 b 0.4471441
    #> 4 4 d 0.0818547
  33. Functional programming
    # It's well suited to data science but I
    # can't (yet) articulate why
    # Something about having a standard
    # container for 80% of problems, and
    # needing to do something to each element
    # of that container
    # Whole object thinking?
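One possible reading of "doing something to each element of a container", sketched in base R. The data frame is the standard container and its columns are the elements; the values and the `rescale01` helper are illustrative assumptions, not taken from the talk.

```r
df <- data.frame(x = 1:4, z = c(0.5, 1.5, 2.5, 3.5))   # made-up data

# Apply one function to every column; vapply checks each result is a single number
col_means <- vapply(df, mean, numeric(1))

# Hypothetical helper: rescale a vector to the range [0, 1]
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Apply the helper to every column and reassemble the container
scaled <- as.data.frame(lapply(df, rescale01))

print(col_means)
print(scaled)
```

The function is written once for a single column and handed to `lapply()` or `vapply()`, which takes care of the iteration.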
  34. Metaprogramming
    x <- seq(0, 2 * pi, length = 100)
    plot(x, sin(x), type = "l")
    [Plot: line plot of sin(x) against x, with x from 0 to 6 and y from −1.0 to 1.0; the axis labels are "x" and "sin(x)"]
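The axis labels on slide 34 come from the code that was typed, not from the values passed in. A minimal sketch of the mechanism behind that kind of behaviour, using base R's `substitute()`, `deparse()`, `quote()` and `eval()`; the `label_of` helper is hypothetical and exists only for illustration.

```r
# Hypothetical helper: return the code of its argument as text,
# without using the argument's value
label_of <- function(expr) {
  deparse(substitute(expr))
}

x <- seq(0, 2 * pi, length.out = 100)
lab <- label_of(sin(x))    # "sin(x)"

# quote() captures an expression without running it;
# eval() runs it later, here with values supplied in a list
e <- quote(a + b)
val <- eval(e, list(a = 1, b = 2))    # 3

print(lab)
print(val)
```

Being able to capture code and evaluate it in a different context is what makes it practical to build small domain-specific languages on top of R.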
  35. None
  36. Which makes it a great place to write DSLs

  37. Why program in R with the tidyverse?

  38. Solve complex problems by combining simple pieces (photo: https://unsplash.com/photos/tjX_sniNzgQ)

  39. A small example
    library(tidycensus)
    geo <- get_acs(
      geography = "metropolitan statistical area...",
      variables = "DP03_0021PE",
      summary_var = "B01003_001",
      survey = "acs1",
      endyear = 2016
    )
    # Thanks to Kyle Walker (@kyle_e_walker)
    # For package and example
  40. None
  41. Followed by data munging
    big_metro <- geo %>%
      filter(summary_est > 2e6) %>%
      select(-variable) %>%
      mutate(
        NAME = gsub(" Metro Area", "", NAME)
      ) %>%
      separate(NAME, c("city", "state"), ", ") %>%
      mutate(
        city = str_extract(city, "^[A-Za-z ]+"),
        state = str_extract(state, "^[A-Za-z ]+"),
        name = paste0(city, ", ", state),
        summary_moe = na_if(summary_moe, -555555555)
      )
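The name-cleaning steps on slide 41 can be traced by hand on a couple of values. The sketch below swaps in base R equivalents (`strsplit()` for `separate()`, `regmatches()` with `regexpr()` for `str_extract()`, and logical indexing for `na_if()`) so that it runs without extra packages; it is not the code from the talk. The two input strings and the margin-of-error values are illustrative assumptions.

```r
# Illustrative inputs in the "<cities>, <states> Metro Area" shape
NAME <- c(
  "New York-Newark-Jersey City, NY-NJ-PA Metro Area",
  "St. Louis, MO-IL Metro Area"
)

# Step 1: drop the " Metro Area" suffix
trimmed <- gsub(" Metro Area", "", NAME)

# Step 2: split each name into city and state parts at ", "
parts <- strsplit(trimmed, ", ", fixed = TRUE)
city_raw <- vapply(parts, `[`, character(1), 1)
state_raw <- vapply(parts, `[`, character(1), 2)

# Step 3: keep only the leading run of letters and spaces
first_run <- function(s) regmatches(s, regexpr("^[A-Za-z ]+", s))
city <- first_run(city_raw)
state <- first_run(state_raw)
name <- paste0(city, ", ", state)

# Step 4: turn the sentinel value into a proper missing value
summary_moe <- c(1200, -555555555)    # made-up values
summary_moe[summary_moe == -555555555] <- NA

print(name)
print(summary_moe)
```

The pattern stops at the first character that is not a letter or a space, so "New York-Newark-Jersey City" becomes "New York", while "St. Louis" is cut at the full stop and becomes "St".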
  42. None
  43. big_metro %>%
      ggplot(aes(
        x = estimate,
        y = reorder(name, estimate))
      ) +
      geom_errorbarh(
        aes(
          xmin = estimate - moe,
          xmax = estimate + moe
        ),
        width = 0.1
      ) +
      geom_point(color = "navy")
  44. [Plot: dot plot with error bars, one row per metro area from Indianapolis, IN (lowest) to New York, NY (highest); x axis "estimate" from 0 to 30, y axis "reorder(name, estimate)"]
  45. [Plot: the same chart, polished, with x axis from 0% to 30%. Labels: "Residents who take public transportation to work", "2016 1-year ACS estimates", "Source: ACS Data Profile variable DP03_0021P / tidycensus"]
  46. "No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system." (Hal Abelson)
  47. My goal is to make a pit of success http://blog.codinghorror.com/falling-into-the-pit-of-success/

  48. But

  49. [Plot: scatterplot of hwy (y axis, about 20 to 40) against displ (x axis, 2 to 7)]
  50. But this is painful!
    df %>%
      select(
        date = `Date Created`,
        name = Name,
        plays = `Total Plays`,
        loads = `Total Loads`,
        apv = `Average Percent Viewed`
      )
  51. None
  52. What next?
    df %>%
      filter(n > 1e6) %>%
      mutate(x = f(y)) %>%
      ???
    # How predictable is the next step from
    # previous steps?
  53. Can we do more with autocomplete? Where do dialogs and autocomplete intersect?
  54. Learning from examples http://vis.stanford.edu/papers/wrangler

  55. What about deep learning? https://twitter.com/carroll_jono/status/914254139873361920

  56. Conclusion

  57. I believe that:
    1. Huge advantages to code
    2. R provides great environment
    3. DSLs help express your thoughts
    4. Code should be primary artefact (but might be generated other than typing)