Data Science in a(n Ever-Evolving) Box - MAA

Computation is foundational to every step of the modern data analysis pipeline: importing, tidying, transforming, visualizing, modeling, and communicating data. However, traditional statistics curricula rarely reflect this; they too often assume that students can pick up computing on their own and that courses should focus primarily on theory and applications. Data science curricula, on the other hand, put computation front and center. The rise of data science has given us an opportunity to re-examine our traditional curricula through a new, computational lens. In this talk, I will highlight how computation can support and enhance the teaching and learning of fundamental statistical concepts such as uncertainty quantification and prediction. The talk will place these ideas within a curriculum for an introductory data science and statistical thinking course that emphasizes explicit instruction in computing. Additionally, the talk will situate the course and its learning objectives within an undergraduate statistics program.

Mine Cetinkaya-Rundel

November 15, 2023

Transcript

  1. Data Science
    in a(n Evolving) Box
    Mine Çetinkaya-Rundel
    Duke University
    2023-11-15

  2. It All Starts with Math.

  3. It All Starts with Math.
    The students don’t all start

  4. MANY START WITH DATA SCIENCE
    Diagram labels: Program; Import, Tidy, Transform, Visualize, Model, Communicate; Understand
    Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.

  5. Communicate
    MANY START WITH DATA SCIENCE

  6. Content
    Tooling
    Pedagogy

  7. Content
    Tooling
    Pedagogy

  8. Tooling
    Pedagogy
    Content
    in 3 examples

  9. COURSE EVALUATIONS // INFERENCE
    1
    Image by Andreas Breitling from Pixabay

  10. score = evaluation score (1-5); bty_avg = beauty score (1-10)

          score  rank          ethnicity     gender  bty_avg
      1     4.7  tenure track  minority      female  5
      2     4.1  tenure track  minority      female  5
      3     3.9  tenure track  minority      female  5
      4     4.8  tenure track  minority      female  5
      5     4.6  tenured       not minority  male    3
      6     4.3  tenured       not minority  male    3
      7     2.8  tenured       not minority  male    3
      8     4.1  tenured       not minority  male    3.33
      9     3.4  tenured       not minority  male    3.33
     10     4.5  tenured       not minority  female  3.17
      …       …  …             …             …       …
    463     4.1  tenure track  minority      female  5.33

    Hamermesh, Parker. “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity”, Econ of Ed Review, Vol 24-4.

  11. Estimate the difference in average evaluation scores of male and female faculty.

  12. Approach 1: Using methods based on the Central Limit Theorem.

    t.test(score ~ gender, data = evals)

        Welch Two Sample t-test

    data:  score by gender
    t = -2.7507, df = 398.7, p-value = 0.006218
    alternative hypothesis: true difference in means between group female and group male is not equal to 0
    95 percent confidence interval:
     -0.24264375 -0.04037194
    sample estimates:
    mean in group female   mean in group male
                4.092821             4.234328
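
The interval that `t.test()` reports above can be reproduced by hand, which may help connect the printed output to the formula behind it. The sketch below uses simulated scores rather than the `evals` data: the group sizes, means, and standard deviations are made up for illustration, and `welch_ci()` is a hypothetical helper name, not part of any package.

```r
# Simulated scores standing in for the two groups (all values are made up).
set.seed(1)
female <- rnorm(200, mean = 4.1, sd = 0.55)
male   <- rnorm(260, mean = 4.2, sd = 0.50)

# Welch confidence interval for mean(x) - mean(y), computed step by step.
welch_ci <- function(x, y, level = 0.95) {
  vx <- var(x) / length(x)          # squared standard error of mean(x)
  vy <- var(y) / length(y)          # squared standard error of mean(y)
  se <- sqrt(vx + vy)               # standard error of the difference
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  crit <- qt(1 - (1 - level) / 2, df)
  (mean(x) - mean(y)) + c(-1, 1) * crit * se
}

by_hand  <- welch_ci(female, male)
built_in <- as.numeric(t.test(female, male)$conf.int)

by_hand
built_in
```

The two results should agree, since `t.test()` uses the Welch (unequal variance) procedure by default.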

  13. Approach 2: Using computational methods.

  14. Approach 2: Using computational methods.
    start with data

    library(tidyverse)
    library(tidymodels)

    evals

    # A tibble: 463 × 23
       course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
     1         1       1   4.7 tenure… minority  female english     36          55.8
     2         2       1   4.1 tenure… minority  female english     36          68.8
     3         3       1   3.9 tenure… minority  female english     36          60.8
     4         4       1   4.8 tenure… minority  female english     36          62.6
     5         5       2   4.6 tenured not mino… male   english     59          85
     6         6       2   4.3 tenured not mino… male   english     59          87.5
     7         7       2   2.8 tenured not mino… male   english     59          88.6
     8         8       3   4.1 tenured not mino… male   english     51         100
     9         9       3   3.4 tenured not mino… male   english     51          56.9
    10        10       4   4.5 tenured not mino… female english     40          87.0
    # ℹ 453 more rows
    # ℹ 14 more variables: cls_did_eval, cls_students, cls_level,
    #   cls_profs, cls_credits, bty_f1lower, bty_f1upper,
    #   bty_f2upper, bty_m1lower, bty_m1upper, bty_m2upper,
    #   bty_avg, pic_outfit, pic_color
    # ℹ Use `print(n = ...)` to see more rows

  15. Approach 2: Using computational methods.
    specify the model

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender)

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 463 × 2
       score gender
     1   4.7 female
     2   4.1 female
     3   3.9 female
     4   4.8 female
     5   4.6 male
     6   4.3 male
     7   2.8 male
     8   4.1 male
     9   3.4 male
    10   4.5 female
    # ℹ 453 more rows
    # ℹ Use `print(n = ...)` to see more rows

  16. Approach 2: Using computational methods.
    generate bootstrap samples

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap")

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 6,945,000 × 3
    # Groups:   replicate [15,000]
       replicate score gender
     1         1   4   female
     2         1   3.1 male
     3         1   5   male
     4         1   4.4 male
     5         1   3.5 female
     6         1   4.5 female
     7         1   4.5 male
     8         1   4.9 male
     9         1   4.4 male
    10         1   3.5 male
    # ℹ 6,944,990 more rows
    # ℹ Use `print(n = ...)` to see more rows

  17. Approach 2: Using computational methods.
    calculate sample statistics

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap") |>
      calculate(stat = "diff in means", order = c("male", "female"))

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 15,000 × 2
       replicate     stat
     1         1  0.230
     2         2  0.134
     3         3  0.100
     4         4  0.230
     5         5  0.128
     6         6  0.201
     7         7  0.168
     8         8  0.130
     9         9 -0.00490
    10        10  0.123
    # ℹ 14,990 more rows
    # ℹ Use `print(n = ...)` to see more rows

  18. Approach 2: Using computational methods.
    summarize CI bounds

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap") |>
      calculate(stat = "diff in means", order = c("male", "female")) |>
      summarize(l = quantile(stat, 0.025), u = quantile(stat, 0.975))

    # A tibble: 1 × 2
           l     u
    1 0.0431 0.242
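
To see what the specify / generate / calculate / summarize pipeline above is doing conceptually, here is a minimal base R sketch of a percentile bootstrap interval for a difference in means. It is not the infer implementation: the data are simulated (the group sizes and score distributions are made up), the number of replicates is reduced to keep it quick, and `diff_in_means()` and `one_replicate()` are hypothetical helper names.

```r
# Simulated stand-in for the `evals` data (all values are made up).
set.seed(2023)
toy <- data.frame(
  score  = c(rnorm(100, mean = 4.1, sd = 0.5), rnorm(100, mean = 4.2, sd = 0.5)),
  gender = rep(c("female", "male"), each = 100)
)

# Statistic of interest: mean score for males minus mean score for females,
# mirroring order = c("male", "female") in the pipeline above.
diff_in_means <- function(d) {
  mean(d$score[d$gender == "male"]) - mean(d$score[d$gender == "female"])
}

# One bootstrap replicate: resample rows with replacement, recompute the statistic.
one_replicate <- function(d) {
  rows <- sample(nrow(d), replace = TRUE)
  diff_in_means(d[rows, ])
}

# Bootstrap distribution of the statistic.
boot_stats <- replicate(2000, one_replicate(toy))

# Percentile interval: the middle 95% of the bootstrap distribution.
ci <- quantile(boot_stats, c(0.025, 0.975))
ci
```

Each call to `one_replicate()` plays the role of one `replicate` group in the generated tibble, and the final `quantile()` call corresponds to the `summarize()` step.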

  19. ✴ sampling variability


    ✴ inference via
    bootstrapping and
    randomization


    ✴ interpreting study
    results

  20. FISHERIES OF THE WORLD // EXPLORATORY DATA ANALYSIS
    2
    Image by Nghĩa Đặng from Pixabay

  21. ✴ data joins

    fisheries |> select(country)
    #> # A tibble: 82 × 1
    #>    country
    #>  1 Angola
    #>  2 Argentina
    #>  3 Australia
    #>  4 Bangladesh
    #>  5 Brazil
    #>  6 Cambodia
    #>  7 Cameroon
    #>  8 Canada
    #>  9 Chad
    #> 10 Chile
    #> # ℹ 72 more rows

    continents
    #> # A tibble: 245 × 2
    #>    country           continent
    #>  1 Afghanistan       Asia
    #>  2 Åland Islands     Europe
    #>  3 Albania           Europe
    #>  4 Algeria           Africa
    #>  5 American Samoa    Oceania
    #>  6 Andorra           Europe
    #>  7 Angola            Africa
    #>  8 Anguilla          Americas
    #>  9 Antigua & Barbuda Americas
    #> 10 Argentina         Americas
    #> # ℹ 235 more rows

    fisheries <- left_join(fisheries, continents)
    Joining with `by = join_by(country)`

  22. ✴ data joins
    ✴ data science ethics

    fisheries |>
      filter(is.na(continent))
    #> # A tibble: 3 × 5
    #>   country                          capture aquaculture   total continent
    #> 1 Democratic Republic of the Congo  237372        3161  240533 NA
    #> 2 Hong Kong                         142775        4258  147033 NA
    #> 3 Myanmar                          2072390     1017644 3090034 NA

    fisheries <- fisheries |>
      mutate(
        continent = case_when(
          country == "Democratic Republic of the Congo" ~ "Africa",
          country == "Hong Kong" ~ "Asia",
          country == "Myanmar" ~ "Asia",
          .default = continent
        )
      )
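
A left join keeps every row of the left table and fills in `NA` wherever the key has no exact match in the right table, which is one common way rows like the three above end up without a continent. The base R sketch below illustrates that behaviour with tiny made-up tables; the catch totals and the alternative spellings in `lookup` are hypothetical and are not taken from the `fisheries` or `continents` data.

```r
# Made-up left table: one row per country.
catches <- data.frame(
  country = c("Angola", "Myanmar", "Hong Kong"),
  total   = c(10, 20, 30)
)

# Made-up lookup table in which two names are spelled differently.
lookup <- data.frame(
  country   = c("Angola", "Myanmar (Burma)", "Hong Kong SAR China"),
  continent = c("Africa", "Asia", "Asia")
)

# all.x = TRUE makes merge() behave like a left join on `country`.
joined <- merge(catches, lookup, by = "country", all.x = TRUE)

# Countries that found no match in the lookup table.
unmatched <- as.character(joined$country[is.na(joined$continent)])
unmatched
```

Listing the unmatched keys right after a join is a cheap check before any manual recoding, and the recoding itself is a decision that deserves to be made explicitly.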

  23. ✴ data joins


    ✴ data science ethics


    ✴ critique


    ✴ improving data
    visualisations

  24. ✴ data joins


    ✴ data science ethics


    ✴ critique


    ✴ improving data
    visualisations


    ✴ mapping

  25. Project: Regional differences in average GPA and SAT


    Question: Exploring the regional differences in average GPA and SAT score
    across the US and the factors that could potentially explain them.


    Team: Mine’s Minions

  26. COVID BRIEFINGS
    2

  27. ✴ web scraping


    ✴ text parsing


    ✴ data types


    ✴ regular expressions

  28. ✴ web scraping


    ✴ text parsing


    ✴ data types


    ✴ regular expressions


    ✴ functions


    ✴ iteration

  29. ✴ web scraping


    ✴ text parsing


    ✴ data types


    ✴ regular expressions


    ✴ functions


    ✴ iteration


    ✴ data visualisation


    ✴ interpretation

  30. ✴ web scraping


    ✴ text parsing


    ✴ data types


    ✴ regular expressions


    ✴ functions


    ✴ iteration


    ✴ data visualisation


    ✴ interpretation


    ✴ text analysis

  31. ✴ web scraping


    ✴ text parsing


    ✴ data types


    ✴ regular expressions


    ✴ functions


    ✴ iteration


    ✴ data visualisation


    ✴ interpretation


    ✴ text analysis


    ✴ data science ethics

    robotstxt::paths_allowed("https://www.gov.scot")
    #> www.gov.scot
    #> [1] TRUE

  32. Project: Factors Most Important to University Ranking


    Question: Explore how various metrics (e.g., SAT/ACT scores, admission
    rate, region, Carnegie classification) predict rankings on the Niche College
    Ranking List.


    Team: 2cool4school

  33. SPAM FILTERS // MODELING
    3
    Image by Gerd Altmann from Pixabay

  34. ✴ logistic regression


    ✴ prediction

  35. ✴ logistic regression


    ✴ prediction


    ✴ decision errors


    ✴ sensitivity /
    specificity


    ✴ intuition around
    loss functions
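
The link between logistic regression, decision errors, and sensitivity / specificity can be sketched in a few lines of base R. Everything below is an assumption made for illustration: the data are simulated, `num_links` is a hypothetical predictor, and `rates()` is a hypothetical helper; none of it comes from the course's spam dataset.

```r
# Simulated emails: more links make spam more likely (made-up relationship).
set.seed(42)
n <- 500
num_links <- rpois(n, lambda = 2)
spam <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * num_links))
emails <- data.frame(spam, num_links)

# Logistic regression of spam status on the single predictor.
fit <- glm(spam ~ num_links, data = emails, family = binomial)

# Predicted probability of spam for each email.
p_hat <- predict(fit, type = "response")

# Sensitivity and specificity when emails with p_hat >= threshold are flagged.
rates <- function(threshold) {
  flagged <- p_hat >= threshold
  c(
    sensitivity = sum(flagged & emails$spam == 1) / sum(emails$spam == 1),
    specificity = sum(!flagged & emails$spam == 0) / sum(emails$spam == 0)
  )
}

low  <- rates(0.25)  # lenient cutoff: flags more emails
high <- rates(0.75)  # strict cutoff: flags fewer emails
low
high
```

Raising the cutoff can only lower or hold sensitivity and raise or hold specificity, so choosing a cutoff amounts to deciding which decision error is more costly.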

  36. Project: Predicting League of Legends success


    Question: Ten minutes into the game, is a gold lead or an experience
    lead a better predictor of which team wins?


    Team: Blue Squirrels

  37. Project: A Critique of Hollywood Relationship Stereotypes


    Question: How has the average age difference between two actors in an
    on-screen relationship changed over the years? Furthermore, do on-screen
    same-sex relationships have a different average age gap than on-screen
    heterosexual relationships?


    Team: team300

  38. Content
    Tooling
    Pedagogy

  39. creativity: assignments that make room for creativity
    peer feedback: at various stages of the project
    teams: weekly labs in teams + periodic team evaluations + term project in teams
    “minute paper”: weekly online quizzes ending with a brief reflection of the week’s material
    live coding: in every “lecture”, along with time for students to attempt exercises on their own

  40. Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield.
    "The 5Ws and 1H of term projects in the introductory data science classroom."
    Statistics Education Research Journal 21.2 (2022): 4-4.

  41. Content
    Pedagogy Tooling

  42. +

    Browser-based access to

  43. ‣ Go to [URL to access RStudio in the browser]


    ‣ Start the project titled UN Votes

  44. ‣ Go to [URL to access RStudio in the browser]


    ‣ Start the project titled UN Votes


    ‣ Open the Quarto document called unvotes.qmd

  45. ‣ Go to [URL to access RStudio in the browser]


    ‣ Start the project titled UN Votes


    ‣ Open the Quarto document called unvotes.qmd


    ‣ Render the document and review the data visualization you just produced

  46. ‣ Go to [URL to access RStudio in the browser]


    ‣ Start the project titled UN Votes


    ‣ Open the Quarto document called unvotes.qmd


    ‣ Render the document and review the data visualization you just produced


    ‣ Then, look for the character string “Turkey” in the code and replace it with
    another country of your choice


    ‣ Render again, and review how the voting patterns of the country you picked
    compare to the United States and the United Kingdom

  47. Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M.
    "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses."
    Journal of Statistics and Data Science Education 29. (2021): S132-S144.

  48. Content
    Tooling
    Pedagogy

  49. Openness


    + Scalability

  50. datasciencebox.org

  51. datasciencebox.org

  52. github.com/tidyverse/datascience-box

  53. github.com/tidyverse/dsbox

  54. AUDIENCE
    I have been teaching with R for a while, but I want to update my teaching materials
    I’m new to teaching with R and need to build up my course materials
    This teaching slide deck I came across is pretty cool, but I have no idea what type of course it belongs in

  55. sta199-f22-1.github.io
    EXAMPLE

  56. SCHOLARSHIP
    Çetinkaya-Rundel, Mine, and Victoria Ellison.
    "A fresh look at introductory data science."
    Journal of Statistics and Data Science Education 29.sup1 (2021): S16-S26.

  57. (N EVER-EVOLVING)
    mine-cetinkaya-rundel
    [email protected]
    fosstodon.org/@minecr
    🔗 bit.ly/dsbox-evolving-maa
    thank you!
    minecr.bsky.social

  58. Thank you for joining us for this
    MAA Distinguished Lecture
    Series presentation.
