You can't do data science in a GUI

Hadley Wickham

March 07, 2018

Transcript

  1. Hadley Wickham 

    @hadleywickham

    Chief Scientist, RStudio
    You can’t do data
    science in a GUI
    March 2018

  2. Data science is the process
    by which data becomes
    understanding, knowledge
    and insight

  3. Data science is the process
    by which data becomes
    understanding, knowledge
    and insight

  4. Import
    Tidy: Store data consistently

  5. Import
    Tidy: Store data consistently
    Understand

  6. Import
    Tidy: Store data consistently
    Transform: Create new variables & new summaries
    Visualise: Surprises, but doesn't scale
    Model: Scales, but doesn't (fundamentally) surprise

  7. Import
    Tidy: Store data consistently
    Transform: Create new variables & new summaries
    Visualise: Surprises, but doesn't scale
    Model: Scales, but doesn't (fundamentally) surprise
    Communicate
    Automate

  8. Import, Tidy, Transform, Visualise, Model, Program
    Packages: tibble, tidyr, purrr, magrittr, dplyr, forcats, hms, ggplot2, broom, modelr, readr, readxl, haven, xml2, lubridate, stringr
    tidyverse.org
    r4ds.had.co.nz

  9. Why program?

  10. Cognitive to computational:
    Think it
    Describe it (precisely)
    Do it

  13. table %>%
    rename(player = X1, team = X2, position = X3) %>%
    filter(player != 'PLAYER') %>%
    mutate(
      college = ifelse(player == position, player, NA)
    ) %>%
    fill(college) %>%
    filter(player != college)
    Programming languages are languages

  14. It’s just text!
    And this gives you access to two extremely
    powerful techniques

  15. ⌘C
    ⌘V

  17. And provides provenance
    Reproducible
    Readable
    Diffable
    Open

  26. I live in fear of clicking the wrong thing

  27. Why program in R?

  28. x <- sample(100, 10)
    x > 50
    #> [1] TRUE FALSE FALSE TRUE TRUE
    #> [6] TRUE TRUE FALSE FALSE TRUE
    sum(x > 50)
    #> [1] 6
    # (There are no scalars!)
    R is a vector language
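
A minimal base R sketch of the "no scalars" point above. The values are made up for illustration and are not from the talk: a single number is simply a vector of length one, and arithmetic and comparisons are applied to every element at once.

```r
# Illustrative values only (not from the slides).
single <- 42
length(single)      # a "scalar" is just a vector of length 1
is.vector(single)

heights_cm <- c(150, 165, 180)
heights_m <- heights_cm / 100   # arithmetic is applied to every element
tall <- heights_cm > 160        # comparisons are too

sum(tall)    # TRUE counts as 1, so this counts the matches
mean(tall)   # and this gives the proportion of matches
```

Because logical vectors coerce to 0 and 1, counting and proportions need no explicit loop, which is the same idiom as `sum(x > 50)` on the slide.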

  29. y <- sample(c(1:5, NA))
    y
    #> [1] 1 NA 2 3 5 4
    y > 2
    #> [1] FALSE NA FALSE TRUE TRUE TRUE
    y == NA
    #> [1] NA NA NA NA NA NA
    Missing values are baked in

  30. john_age <- NA
    mary_age <- NA
    john_age == mary_age
    #> [1] NA
    An example makes this clearer

  31. y <- sample(c(1:5, NA))
    y
    #> [1] 1 NA 2 3 5 4
    y > 2
    #> [1] FALSE NA FALSE TRUE TRUE TRUE
    is.na(y)
    #> [1] FALSE TRUE FALSE FALSE FALSE FALSE
    Missing values are baked in
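
A short base R sketch of how missing values propagate beyond comparisons. The `ages` vector is a made-up example, not data from the talk; the idea is the same as on the slides: an unknown input gives an unknown result unless it is handled explicitly.

```r
# Illustrative values only (not from the slides).
ages <- c(31, NA, 45, 27)

mean(ages)                 # NA: one unknown value makes the mean unknown
mean(ages, na.rm = TRUE)   # drop missing values explicitly
sum(is.na(ages))           # count the missing values

# Logical operators follow three-valued logic:
NA & FALSE   # FALSE, whatever the unknown value is
NA | TRUE    # TRUE, for the same reason
NA & TRUE    # NA: the answer depends on the unknown value
```

This is why `y == NA` returns only `NA` values and `is.na(y)` is the right test.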

  32. data.frame(
      x = 1:4,
      y = sample(letters[1:4]),
      z = runif(4)
    )
    #> x y z
    #> 1 1 c 0.1189635
    #> 2 2 a 0.0518956
    #> 3 3 b 0.4471441
    #> 4 4 d 0.0818547
    So are relational tables (aka data frames/tibbles)

  33. # It's well suited to data science but I
    # can't (yet) articulate why
    # Something about having a standard
    # container for 80% of problems, and
    # needing to do something to each element
    # of that container
    # Whole object thinking?
    Functional programming
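
One way to read the "standard container, do something to each element" comment above is sketched here in base R. The data frame and the `rescale01()` helper are illustrative assumptions, not code from the talk.

```r
# Illustrative data only (not from the slides).
df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(10, 20, 30, 40),
  z = c(0.5, 1.5, 2.5, 3.5)
)

# A data frame is a list of columns, so "do something to each element
# of the container" is a single function call rather than a loop.
col_means <- vapply(df, mean, numeric(1))
col_ranges <- lapply(df, range)

# Functions are ordinary values, so they can be defined once and
# passed to lapply() like any other argument.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
df_scaled <- as.data.frame(lapply(df, rescale01))
```

The purrr package listed earlier in the deck offers a similar style through its `map()` family; base R is used here only to keep the sketch dependency-free.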

  34. x <- seq(0, 2 * pi, length = 100)
    plot(x, sin(x), type = "l")
    Metaprogramming
    [Plot: line plot of sin(x) against x; axis labels "x" and "sin(x)"]

  36. Which makes it a great place to write DSLs
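
A rough base R sketch of the metaprogramming idea behind the plot labels and DSLs. `label_of()` is a hypothetical helper written for this illustration; it is not from the talk or from any package.

```r
# label_of() is a hypothetical helper: it captures the unevaluated
# expression supplied by the caller and turns it into text, which is
# roughly how plot() can derive its default axis labels.
label_of <- function(expr) {
  deparse(substitute(expr))
}

x <- seq(0, 2 * pi, length.out = 100)
label_of(sin(x))

# Expressions can also be captured, inspected and evaluated later,
# optionally with a data frame supplying the variables. Many DSLs
# are built on this.
e <- quote(a + b * 2)
class(e)
eval(e, list(a = 1, b = 10))

df <- data.frame(a = 1:3, b = c(10, 20, 30))
eval(e, df)
```

Evaluating a captured expression against a data frame is the mechanism that lets code refer to column names directly, as in the `filter()` and `mutate()` calls elsewhere in the deck.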

  37. Why program in R
    with the tidyverse?

  38. Solve complex problems by combining simple pieces
    https://unsplash.com/photos/tjX_sniNzgQ

  39. library(tidycensus)
    geo <- get_acs(
      geography = "metropolitan statistical area...",
      variables = "DP03_0021PE",
      summary_var = "B01003_001",
      survey = "acs1",
      endyear = 2016
    )
    # Thanks to Kyle Walker (@kyle_e_walker)
    # For package and example
    A small example

  41. big_metro <- geo %>%
    filter(summary_est > 2e6) %>%
    select(-variable) %>%
    mutate(
      NAME = gsub(" Metro Area", "", NAME)
    ) %>%
    separate(NAME, c("city", "state"), ", ") %>%
    mutate(
      city = str_extract(city, "^[A-Za-z ]+"),
      state = str_extract(state, "^[A-Za-z ]+"),
      name = paste0(city, ", ", state),
      summary_moe = na_if(summary_moe, -555555555)
    )
    Followed by data munging

  43. big_metro %>%
    ggplot(aes(
      x = estimate,
      y = reorder(name, estimate))
    ) +
    geom_errorbarh(
      aes(
        xmin = estimate - moe,
        xmax = estimate + moe
      ),
      width = 0.1
    ) +
    geom_point(color = "navy")

  44. [Plot: estimates with horizontal error bars, one row per metro area from Indianapolis, IN to New York, NY; x-axis "estimate" (0 to 30), y-axis "reorder(name, estimate)"]

  45. Residents who take public transportation to work
    2016 1-year ACS estimates
    [Plot: the same chart as the previous slide, one row per metro area from Indianapolis, IN to New York, NY; x-axis formatted as percentages (0% to 30%)]
    Source: ACS Data Profile variable DP03_0021P / tidycensus

  46. No matter how complex and
    polished the individual operations
    are, it is often the quality of the
    glue that most directly determines
    the power of the system.
    — Hal Abelson

  47. My goal is to make
    a pit of success
    http://blog.codinghorror.com/falling-into-the-pit-of-success/

  48. But

  49. [Plot: scatterplot with "displ" on the x-axis (2 to 7) and "hwy" on the y-axis (20 to 40)]

  50. df %>%
    select(
      date = `Date Created`,
      name = Name,
      plays = `Total Plays`,
      loads = `Total Loads`,
      apv = `Average Percent Viewed`
    )
    But this is painful!

  52. df %>%
    filter(n > 1e6) %>%
    mutate(x = f(y)) %>%
    ???
    # How predictable is next step from
    # previous steps?
    What next?

  53. Can we do more with autocomplete?
    Where do dialogs and autocomplete intersect?

  54. Learning from examples
    http://vis.stanford.edu/papers/wrangler

  55. What about deep learning?
    https://twitter.com/carroll_jono/status/914254139873361920

  56. Conclusion

  57. I believe that:
    1. Huge advantages to code
    2. R provides great environment
    3. DSLs help express your thoughts
    4. Code should be the primary artefact (but
    might be generated by means other than typing)
