Data Science in a(n Ever-Evolving) Box - MAA

Computation is foundational to every step of the modern data analysis pipeline: importing, tidying, transforming, visualizing, modeling, and communicating data. However, this is generally not reflected in our traditional statistics curricula, which too often assume that students can pick up computing on their own and that courses should primarily cover theory and applications. Data science curricula, on the other hand, put computation front and center. The rise of data science has given us an opportunity to re-examine our traditional curricula through a new, computational lens. In this talk, I will highlight how computation can support and enhance the teaching and learning of fundamental statistical concepts such as uncertainty quantification and prediction. The talk will place these ideas within the context of a curriculum for an introductory data science and statistical thinking course that emphasizes explicit instruction in computing. Additionally, the talk will situate the course and its learning objectives within the larger context of an undergraduate statistics program.

Mine Cetinkaya-Rundel

November 15, 2023

Transcript

  1. MANY START WITH DATA SCIENCE

    Diagram labels: Program; Import, Tidy, Transform, Visualize, Model, Communicate; Understand

    Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.
  2. Column annotations: score = evaluation score (1-5); bty_avg = beauty score (1-10)

        score rank         ethnicity    gender bty_avg
        <dbl> <chr>        <chr>        <chr>    <dbl>
      1   4.7 tenure track minority     female    5
      2   4.1 tenure track minority     female    5
      3   3.9 tenure track minority     female    5
      4   4.8 tenure track minority     female    5
      5   4.6 tenured      not minority male      3
      6   4.3 tenured      not minority male      3
      7   2.8 tenured      not minority male      3
      8   4.1 tenured      not minority male      3.33
      9   3.4 tenured      not minority male      3.33
     10   4.5 tenured      not minority female    3.17
      …   …   …            …            …         …
    463   4.1 tenure track minority     female    5.33

    Hamermesh, Parker. “Beauty in the classroom: instructors pulchritude and putative pedagogical productivity”, Econ of Ed Review, Vol 24-4.
  3. Approach 1: Using methods based on the Central Limit Theorem.

    t.test(score ~ gender, data = evals)

        Welch Two Sample t-test

    data:  score by gender
    t = -2.7507, df = 398.7, p-value = 0.006218
    alternative hypothesis: true difference in means between group female and group male is not equal to 0
    95 percent confidence interval:
     -0.24264375 -0.04037194
    sample estimates:
    mean in group female   mean in group male
                4.092821             4.234328
  4. Approach 2: Using computational methods. Start with data.

    library(tidyverse)
    library(tidymodels)

    evals

    # A tibble: 463 × 23
       course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
           <int>   <int> <dbl> <fct>   <fct>     <fct>  <fct>    <int>         <dbl>
     1         1       1   4.7 tenure… minority  female english     36          55.8
     2         2       1   4.1 tenure… minority  female english     36          68.8
     3         3       1   3.9 tenure… minority  female english     36          60.8
     4         4       1   4.8 tenure… minority  female english     36          62.6
     5         5       2   4.6 tenured not mino… male   english     59          85
     6         6       2   4.3 tenured not mino… male   english     59          87.5
     7         7       2   2.8 tenured not mino… male   english     59          88.6
     8         8       3   4.1 tenured not mino… male   english     51         100
     9         9       3   3.4 tenured not mino… male   english     51          56.9
    10        10       4   4.5 tenured not mino… female english     40          87.0
    # ℹ 453 more rows
    # ℹ 14 more variables: cls_did_eval <int>, cls_students <int>, cls_level <fct>,
    #   cls_profs <fct>, cls_credits <fct>, bty_f1lower <int>, bty_f1upper <int>,
    #   bty_f2upper <int>, bty_m1lower <int>, bty_m1upper <int>, bty_m2upper <int>,
    #   bty_avg <dbl>, pic_outfit <fct>, pic_color <fct>
    # ℹ Use `print(n = ...)` to see more rows
  5. Approach 2: Using computational methods. Specify the model.

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender)

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 463 × 2
       score gender
       <dbl> <fct>
     1   4.7 female
     2   4.1 female
     3   3.9 female
     4   4.8 female
     5   4.6 male
     6   4.3 male
     7   2.8 male
     8   4.1 male
     9   3.4 male
    10   4.5 female
    # ℹ 453 more rows
    # ℹ Use `print(n = ...)` to see more rows
  6. Approach 2: Using computational methods. Generate bootstrap samples.

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap")

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 6,945,000 × 3
    # Groups:   replicate [15,000]
       replicate score gender
           <int> <dbl> <fct>
     1         1   4   female
     2         1   3.1 male
     3         1   5   male
     4         1   4.4 male
     5         1   3.5 female
     6         1   4.5 female
     7         1   4.5 male
     8         1   4.9 male
     9         1   4.4 male
    10         1   3.5 male
    # ℹ 6,944,990 more rows
    # ℹ Use `print(n = ...)` to see more rows
  7. Approach 2: Using computational methods. Calculate sample statistics.

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap") |>
      calculate(stat = "diff in means", order = c("male", "female"))

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 15,000 × 2
       replicate     stat
           <int>    <dbl>
     1         1  0.230
     2         2  0.134
     3         3  0.100
     4         4  0.230
     5         5  0.128
     6         6  0.201
     7         7  0.168
     8         8  0.130
     9         9 -0.00490
    10        10  0.123
    # ℹ 14,990 more rows
    # ℹ Use `print(n = ...)` to see more rows
  8. Approach 2: Using computational methods. Summarize CI bounds.

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap") |>
      calculate(stat = "diff in means", order = c("male", "female")) |>
      summarize(l = quantile(stat, 0.025), u = quantile(stat, 0.975))

    # A tibble: 1 × 2
           l     u
       <dbl> <dbl>
    1 0.0431 0.242
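The specify, generate, calculate, and summarize steps on the preceding slides amount to a percentile bootstrap interval. A minimal base-R sketch of the same idea follows, assuming simulated scores rather than the evals data. The data frame `sim`, the helper `diff_in_means`, the seed, and the smaller number of replicates are illustrative assumptions and not part of the talk.

```r
# Simulated stand-in for the evals data (values are made up for illustration).
set.seed(1234)
sim <- data.frame(
  score  = c(rnorm(200, mean = 4.1, sd = 0.5), rnorm(260, mean = 4.2, sd = 0.5)),
  gender = rep(c("female", "male"), times = c(200, 260)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: difference in mean score, male minus female.
diff_in_means <- function(d) {
  mean(d$score[d$gender == "male"]) - mean(d$score[d$gender == "female"])
}

# Generate: resample rows with replacement.
# Calculate: recompute the statistic on each resample.
boot_stats <- replicate(2000, {
  idx <- sample(nrow(sim), replace = TRUE)
  diff_in_means(sim[idx, ])
})

# Summarize: the middle 95% of the bootstrap distribution.
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
ci
```

Writing the resampling loop out by hand like this may help students see that the interval is just two quantiles of the recomputed statistic.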
  9. ✴ data joins

    fisheries |>
      select(country)
    #> # A tibble: 82 × 1
    #>    country
    #>    <chr>
    #>  1 Angola
    #>  2 Argentina
    #>  3 Australia
    #>  4 Bangladesh
    #>  5 Brazil
    #>  6 Cambodia
    #>  7 Cameroon
    #>  8 Canada
    #>  9 Chad
    #> 10 Chile
    #> # ℹ 72 more rows

    continents
    #> # A tibble: 245 × 2
    #>    country           continent
    #>    <chr>             <chr>
    #>  1 Afghanistan       Asia
    #>  2 Åland Islands     Europe
    #>  3 Albania           Europe
    #>  4 Algeria           Africa
    #>  5 American Samoa    Oceania
    #>  6 Andorra           Europe
    #>  7 Angola            Africa
    #>  8 Anguilla          Americas
    #>  9 Antigua & Barbuda Americas
    #> 10 Argentina         Americas
    #> # ℹ 235 more rows

    fisheries <- left_join(fisheries, continents)
    #> Joining with `by = join_by(country)`
  10. ✴ data joins ✴ data science ethics

    fisheries |>
      filter(is.na(continent))
    #> # A tibble: 3 × 5
    #>   country                          capture aquaculture   total continent
    #>   <chr>                              <dbl>       <dbl>   <dbl> <chr>
    #> 1 Democratic Republic of the Congo  237372        3161  240533 NA
    #> 2 Hong Kong                         142775        4258  147033 NA
    #> 3 Myanmar                          2072390     1017644 3090034 NA

    fisheries <- fisheries |>
      mutate(
        continent = case_when(
          country == "Democratic Republic of the Congo" ~ "Africa",
          country == "Hong Kong" ~ "Asia",
          country == "Myanmar" ~ "Asia",
          .default = continent
        )
      )
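The check on this slide, finding rows that received no continent from the left join, can also be run on the keys alone. Here is a small base-R sketch on toy tables. The table names `fisheries_toy` and `continents_toy` and the placeholder totals are assumptions for illustration; only the country names and the continent assignments come from the slides.

```r
# Toy stand-ins for the two tables; the totals are placeholders, not real data.
fisheries_toy <- data.frame(
  country = c("Angola", "Hong Kong", "Myanmar"),
  total   = c(1, 2, 3),
  stringsAsFactors = FALSE
)
continents_toy <- data.frame(
  country   = c("Angola", "Argentina"),
  continent = c("Africa", "Americas"),
  stringsAsFactors = FALSE
)

# Countries in fisheries_toy that have no match in continents_toy.
unmatched <- setdiff(fisheries_toy$country, continents_toy$country)
unmatched

# Left join in base R: all.x = TRUE keeps every row of the first table.
joined <- merge(fisheries_toy, continents_toy, by = "country", all.x = TRUE)

# Manual fix-up for the unmatched countries, mirroring the case_when() step.
fixes <- c("Hong Kong" = "Asia", "Myanmar" = "Asia")
needs_fix <- is.na(joined$continent)
joined$continent[needs_fix] <- fixes[joined$country[needs_fix]]
joined
```

Listing the unmatched keys before patching them makes the manual recoding an explicit, reviewable decision rather than a silent one.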
  11. ✴ data joins ✴ data science ethics ✴ critique ✴ improving data visualisations ✴ mapping
  12. Project: Regional differences in average GPA and SAT

    Question: Exploring the regional differences in average GPA and SAT score across the US and the factors that could potentially explain them.

    Team: Mine’s Minions
  13. ✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration
  14. ✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation
  15. ✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation ✴ text analysis
  16. ✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation ✴ text analysis ✴ data science ethics

    robotstxt::paths_allowed("https://www.gov.scot")
    #> www.gov.scot
    #> [1] TRUE
  17. Project: Factors Most Important to University Ranking

    Question: Explore how various metrics (e.g., SAT/ACT scores, admission rate, region, Carnegie classification) predict rankings on the Niche College Ranking List.

    Team: 2cool4school
  18. ✴ logistic regression ✴ prediction ✴ decision errors ✴ sensitivity / specificity ✴ intuition around loss functions
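To make the link between decision errors and sensitivity / specificity concrete, here is a minimal base-R sketch on simulated data. The variables `x` and `y`, the simulated coefficients, and the 0.5 cutoff are all assumptions chosen for illustration; none of them come from the talk.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
# Simulate a binary outcome whose log-odds increase with x.
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

# Fit a logistic regression of y on x.
fit <- glm(y ~ x, family = binomial)

# Predicted probabilities, turned into decisions with a 0.5 cutoff.
p_hat <- predict(fit, type = "response")
pred  <- as.integer(p_hat >= 0.5)

# Counts for the four cells of the confusion matrix.
tp <- sum(pred == 1 & y == 1)  # true positives
fn <- sum(pred == 0 & y == 1)  # false negatives
tn <- sum(pred == 0 & y == 0)  # true negatives
fp <- sum(pred == 1 & y == 0)  # false positives

sensitivity <- tp / (tp + fn)  # share of actual 1s predicted as 1
specificity <- tn / (tn + fp)  # share of actual 0s predicted as 0
c(sensitivity = sensitivity, specificity = specificity)
```

Lowering the cutoff can only keep or increase sensitivity while keeping or decreasing specificity, which is one way to open a discussion about which decision error is more costly.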
  19. Project: Predicting League of Legends success

    Question: Ten minutes into the game, is a gold lead or an experience lead a better predictor of which team wins?

    Team: Blue Squirrels
  20. Project: A Critique of Hollywood Relationship Stereotypes

    Question: How has the average age difference between two actors in an on-screen relationship changed over the years? Furthermore, do on-screen same-sex relationships have a different average age gap than on-screen heterosexual relationships?

    Team: team300
  21. creativity: assignments that make room for creativity

    peer feedback: at various stages of the project

    teams: weekly labs in teams + periodic team evaluations + term project in teams

    “minute paper”: weekly online quizzes ending with a brief reflection on the week’s material

    live coding: in every “lecture”, along with time for students to attempt exercises on their own
  22. Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield. "The 5Ws and 1H of term projects in the introductory data science classroom." Statistics Education Research Journal 21.2 (2022): 4-4.
  23. ‣ Go to [URL to access RStudio in the browser]

    ‣ Start the project titled UN Votes
  24. ‣ Go to [URL to access RStudio in the browser]

    ‣ Start the project titled UN Votes

    ‣ Open the Quarto document called unvotes.qmd
  25. ‣ Go to [URL to access RStudio in the browser]

    ‣ Start the project titled UN Votes

    ‣ Open the Quarto document called unvotes.qmd

    ‣ Render the document and review the data visualization you just produced
  26. ‣ Go to [URL to access RStudio in the browser]

    ‣ Start the project titled UN Votes

    ‣ Open the Quarto document called unvotes.qmd

    ‣ Render the document and review the data visualization you just produced

    ‣ Then, look for the character string “Turkey” in the code and replace it with another country of your choice

    ‣ Render again, and review how the voting patterns of the country you picked compare to the United States and the United Kingdom
  27. Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29 (2021): S132-S144.
  28. AUDIENCE

    I have been teaching with R for a while, but I want to update my teaching materials

    I’m new to teaching with R and need to build up my course materials

    This teaching slide deck I came across is pretty cool, but I have no idea what type of course it belongs in
  29. SCHOLARSHIP

    Çetinkaya-Rundel, Mine, and Victoria Ellison. "A fresh look at introductory data science." Journal of Statistics and Data Science Education 29.sup1 (2021): S16-S26.