Should all statistics students be programmers?

Hadley Wickham
July 12, 2018
A presentation at ICOTS 10 (Kyoto, Japan)


Transcript

  1. Should all statistics students be programmers? No!
    Hadley Wickham
    @hadleywickham
    Chief Scientist, RStudio
    July 2018

  2. Should all statistics students program? Yes!
    Hadley Wickham
    @hadleywickham
    Chief Scientist, RStudio
    July 2018

  3. What should a statistics student be able to do?
    Model

  4. What should a statistics student be able to do?
    Import
    Tidy: store data consistently
    Understand:
      Transform: create new variables & new summaries
      Visualise: surprises, but doesn't scale
      Model: scales, but doesn't (fundamentally) surprise
    Communicate
    Automate
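
As a rough illustration of the Transform and Model steps named on this slide, here is a minimal sketch in R. It is not from the talk: the built-in `mtcars` data stands in for an imported file, and the variable names `kpl`, `by_cyl` and `fit` are hypothetical.

```r
library(dplyr)

# Import: the built-in mtcars data frame stands in for reading a file
cars <- mtcars

# Transform: create a new variable (approximate km per litre) ...
cars <- cars %>% mutate(kpl = mpg * 0.425)

# ... and a new summary (mean efficiency per cylinder count)
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl), n = n())

# Model: a simple linear fit of efficiency on weight
fit <- lm(kpl ~ wt, data = cars)

print(by_cyl)
print(coef(fit))
```

Because every step is text, the whole analysis can be re-run, shared, or adapted by editing a few lines.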

  5. Two primary computer interfaces: point & click GUI

  6. Or: program with a CLI

  7. Why is programming preferable for statistics?
    1. Code is text
    2. Code is read-able
    3. Code is shareable
    4. Code is open

  8. Code is text
    And this provides for two extremely powerful techniques

  9. By Jenny Bryan
    ⌘⇧R

  10. Code is read-able

  11. Much of my work is developing tools

  12. A small example
    library(tidycensus)
    geo <- get_acs(
      geography = "metropolitan statistical area...",
      variables = "DP03_0021PE",
      summary_var = "B01003_001",
      survey = "acs1",
      endyear = 2016  # recent tidycensus releases call this argument `year`
    )
    # Thanks to Kyle Walker (@kyle_e_walker)
    # For package and example

  13. Followed by data munging
    library(tidyverse)  # provides %>%, dplyr, tidyr and stringr
    big_metro <- geo %>%
      filter(summary_est > 2e6) %>%
      select(-variable) %>%
      mutate(
        NAME = gsub(" Metro Area", "", NAME)
      ) %>%
      separate(NAME, c("city", "state"), ", ") %>%
      mutate(
        # keep only the leading run of letters and spaces
        city = str_extract(city, "^[A-Za-z ]+"),
        state = str_extract(state, "^[A-Za-z ]+"),
        name = paste0(city, ", ", state),
        summary_moe = na_if(summary_moe, -555555555)
      )
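
To see what the name-cleaning steps above do without calling the Census API, the following sketch applies the same verbs to a small hand-made table. The two `NAME` values are hypothetical inputs written to resemble the metro-area names the code expects; they are not output from `get_acs()`.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical input resembling the NAME column the slide's code cleans
toy <- tibble(
  NAME = c(
    "New York-Newark-Jersey City, NY-NJ-PA Metro Area",
    "St. Louis, MO-IL Metro Area"
  )
)

cleaned <- toy %>%
  # drop the trailing " Metro Area" label
  mutate(NAME = gsub(" Metro Area", "", NAME)) %>%
  # split "cities, states" on the comma
  separate(NAME, c("city", "state"), ", ") %>%
  mutate(
    # keep only the leading run of letters and spaces,
    # i.e. the first city and the first state
    city = str_extract(city, "^[A-Za-z ]+"),
    state = str_extract(state, "^[A-Za-z ]+"),
    name = paste0(city, ", ", state)
  )

print(cleaned$name)
```

The pattern stops at the first character that is neither a letter nor a space, so a hyphenated list of cities collapses to its first entry, and a name containing a period, such as "St. Louis", is cut at the period.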

  14. big_metro %>%
      ggplot(aes(
        x = estimate,
        y = reorder(name, estimate))
      ) +
      geom_errorbarh(
        aes(
          xmin = estimate - moe,
          xmax = estimate + moe
        ),
        width = 0.1
      ) +
      geom_point(color = "navy")

  15. [Plot: one point with a horizontal error bar for each of 35 metro areas, with labels running from Indianapolis, IN to New York, NY; one label reads "St, MO". x-axis: estimate (ticks at 0, 10, 20, 30); y-axis: reorder(name, estimate).]

  16. What does this code do?
    library(tidyverse)
    library(magick)
    dir(pattern = ".png") %>%
      map(image_read) %>%
      image_join() %>%
      image_animate(fps = 1, loop = 25) %>%
      image_write("my_animation.gif")
    And hence you can read unfamiliar code
    https://twitter.com/ricardokriebel/status/849626401611411458

  17. https://twitter.com/ricardokriebel/status/849626401611411458

  18. I live in fear of clicking the wrong thing

  19. Code is shareable

  20. Why is sharing so important?
    Learn from others
    Open science
    Easily critique

  21. Code is open

  22. All modern programming languages are open source
    Free:
      Students can use same tools as practitioners.
      Anyone can use best tools regardless of wealth.
      Anyone can re-run your analysis.
    Fluid:
      You can fix problems.
      You can build your own tools.

  23. Why is programming preferable for statistics?
    1. Code is text
    2. Code is read-able
    3. Code is shareable
    4. Code is open
