Should all statistics students be programmers?

Hadley Wickham
July 12, 2018

A presentation at ICOTS 10 (Kyoto, Japan)

Transcript

  1. Should all statistics students be programmers? No!

    Hadley Wickham (@hadleywickham)
    Chief Scientist, RStudio
    July 2018

  2. Should all statistics students program? Yes!

  3. What should a statistics student be able to do?
    Model

  4. What should a statistics student be able to do?
    Import
    Tidy: Store data consistently
    Understand
      Transform: Create new variables & new summaries
      Visualise: Surprises, but doesn't scale
      Model: Scales, but doesn't (fundamentally) surprise
    Communicate
    Automate
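
The steps named on this slide could be sketched in R roughly as follows. This is an illustration only and is not from the talk: the built-in `mtcars` data stands in for an imported file, and the names `cars`, `kpl`, `by_cyl`, `p` and `fit` are made up for the example.

```r
library(dplyr)
library(ggplot2)

# Import + tidy: mtcars ships with R; move the row names into a real column
cars <- as_tibble(mtcars, rownames = "model")

# Transform: create a new variable (km per litre) and a new summary
cars <- cars %>% mutate(kpl = mpg * 0.425144)
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl))

# Visualise: a plot can surprise you, but someone has to look at it
p <- ggplot(cars, aes(x = wt, y = kpl)) +
  geom_point()

# Model: scales to many fits, but only answers the question you posed
fit <- lm(kpl ~ wt, data = cars)
coef(fit)
```

Because every step is plain text, the whole analysis can be re-run, diffed and shared, which is the argument the following slides make.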

  5. Two primary computer interfaces: point & click GUI

  6. Or: program with a CLI

  7. Why is programming preferable for statistics?
    1. Code is text
    2. Code is read-able
    3. Code is shareable
    4. Code is open

  8. Code is text
    And this provides for two extremely powerful techniques

  9. ⌘C
    ⌘V

  11. By Jenny Bryan
    ⌘⇧R

  12. ⌘⇧K

  13. Code is read-able

  14. Much of my work is developing tools

  15. A small example

    library(tidycensus)

    # Thanks to Kyle Walker (@kyle_e_walker) for the package and example.
    # Assumes a Census API key has already been set with census_api_key().
    geo <- get_acs(
      geography = "metropolitan statistical area...",
      variables = "DP03_0021PE",
      summary_var = "B01003_001",
      survey = "acs1",
      year = 2016  # this argument was called `endyear` in early tidycensus releases
    )

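  17. Followed by data munging

    library(tidyverse)

    big_metro <- geo %>%
      # summary_est holds the summary variable requested in get_acs()
      filter(summary_est > 2e6) %>%
      select(-variable) %>%
      mutate(
        NAME = gsub(" Metro Area", "", NAME)
      ) %>%
      separate(NAME, c("city", "state"), ", ") %>%
      mutate(
        # Keep only the leading run of letters and spaces, so names are
        # cut at the first hyphen or period
        city = str_extract(city, "^[A-Za-z ]+"),
        state = str_extract(state, "^[A-Za-z ]+"),
        name = paste0(city, ", ", state),
        # Turn the -555555555 placeholder into NA
        summary_moe = na_if(summary_moe, -555555555)
      )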
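
To see what the munging step does to a single name, the same calls can be run on a toy table. This is a sketch, not part of the talk: the two `NAME` values are made-up rows that assume the "cities, states Metro Area" shape the `gsub()` call implies, and `toy` and `cleaned` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical input rows in the assumed "<cities>, <states> Metro Area" shape
toy <- tibble(
  NAME = c(
    "New York-Newark-Jersey City, NY-NJ-PA Metro Area",
    "St. Louis, MO-IL Metro Area"
  )
)

cleaned <- toy %>%
  mutate(NAME = gsub(" Metro Area", "", NAME)) %>%
  separate(NAME, c("city", "state"), ", ") %>%
  mutate(
    # Leading run of letters and spaces only: stops at "-" and at "."
    city = str_extract(city, "^[A-Za-z ]+"),
    state = str_extract(state, "^[A-Za-z ]+"),
    name = paste0(city, ", ", state)
  )

cleaned$name
# "New York, NY" "St, MO"
```

The second result shows why the pattern is only a rough cut: the period in "St." ends the match early, which matches the "St, MO" label that appears on the plot later in the deck.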

  19. big_metro %>%
      ggplot(aes(
        x = estimate,
        y = reorder(name, estimate)
      )) +
      geom_errorbarh(
        aes(
          xmin = estimate - moe,
          xmax = estimate + moe
        ),
        height = 0.1  # geom_errorbarh() sizes its end caps with `height`, not `width`
      ) +
      geom_point(color = "navy")

  20. [Plot: one point with a horizontal error bar per metro area. The x-axis is `estimate` (ticks at 0, 10, 20, 30); the y-axis is `reorder(name, estimate)` and lists 35 metro areas, from Indianapolis, IN to New York, NY, including the truncated label "St, MO".]

  21. What does this code do?

    library(tidyverse)
    library(magick)

    # `pattern` is a regular expression, so escape the dot and anchor the extension
    dir(pattern = "\\.png$") %>%
      map(image_read) %>%
      image_join() %>%
      image_animate(fps = 1, loop = 25) %>%
      image_write("my_animation.gif")

    And hence you can read unfamiliar code
    https://twitter.com/ricardokriebel/status/849626401611411458

  22. https://twitter.com/ricardokriebel/status/849626401611411458

  23. I live in fear of clicking the wrong thing

  24. Code is shareable

  32. Why is sharing so important?
    Learn from others
    Open science
    Easily critique

  33. Code is open

  34. All modern programming languages are open source
    Free
      Students can use the same tools as practitioners.
      Anyone can use the best tools regardless of wealth.
      Anyone can re-run your analysis.
    Fluid
      You can fix problems.
      You can build your own tools.

  35. Conclusion

  36. Why is programming preferable for statistics?
    1. Code is text
    2. Code is read-able
    3. Code is shareable
    4. Code is open