Data Science in a(n Ever-Evolving) Box - MAA

Data Science in a(n Evolving) Box
Mine Çetinkaya-Rundel
Duke University
2023-11-15

It All Starts with Math. The students don’t all start

MANY START WITH DATA SCIENCE
[Figure: data science workflow with the stages Import, Tidy, Transform, Visualize, Model, and Communicate, plus the labels Understand and Program]
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.

Content Tooling Pedagogy

Tooling Pedagogy Content in 3 examples

COURSE EVALUATIONS // INFERENCE 1
Image by Andreas Breitling from Pixabay

        score rank         ethnicity    gender bty_avg
        <dbl> <chr>        <chr>        <chr>    <dbl>
      1   4.7 tenure track minority     female    5
      2   4.1 tenure track minority     female    5
      3   3.9 tenure track minority     female    5
      4   4.8 tenure track minority     female    5
      5   4.6 tenured      not minority male      3
      6   4.3 tenured      not minority male      3
      7   2.8 tenured      not minority male      3
      8   4.1 tenured      not minority male      3.33
      9   3.4 tenured      not minority male      3.33
     10   4.5 tenured      not minority female    3.17
      …     … …            …            …         …
    463   4.1 tenure track minority     female    5.33

score: evaluation score (1-5)
bty_avg: beauty score (1-10)

Hamermesh, Parker. “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity”, Econ of Ed Review, Vol 24-4.

Estimate the difference in average evaluation scores of male and female faculty.

Approach 1: Using methods based on the Central Limit Theorem.

    t.test(score ~ gender, data = evals)

        Welch Two Sample t-test

    data:  score by gender
    t = -2.7507, df = 398.7, p-value = 0.006218
    alternative hypothesis: true difference in means between group female and group male is not equal to 0
    95 percent confidence interval:
     -0.24264375 -0.04037194
    sample estimates:
    mean in group female   mean in group male
                4.092821             4.234328
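To make the CLT-based approach less of a black box, the sketch below recomputes the Welch statistic, its approximate degrees of freedom, and the 95% interval by hand, then compares them with `t.test()`. The `toy` data frame is simulated for illustration and is an assumption of this sketch; it is not the course-evaluation data shown on the slides.

```r
# Simulated scores for two groups; these numbers are made up for illustration
# and are NOT the course-evaluation data from the slides.
set.seed(1)
toy <- data.frame(
  score  = c(rnorm(40, mean = 4.1, sd = 0.5), rnorm(60, mean = 4.3, sd = 0.6)),
  gender = rep(c("female", "male"), times = c(40, 60))
)

x <- toy$score[toy$gender == "female"]
y <- toy$score[toy$gender == "male"]

# Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch statistic: difference in means over its estimated standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 95% confidence interval for mean(female) - mean(male)
ci_manual <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df_manual) * sqrt(vx + vy)

# The built-in test, for comparison
fit <- t.test(score ~ gender, data = toy)
```

Comparing `fit$statistic`, `fit$parameter`, and `fit$conf.int` with the hand-computed values shows that the built-in test is doing nothing more than this arithmetic.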

Approach 2: Using computational methods.

start with data

    library(tidyverse)
    library(tidymodels)

    evals

    # A tibble: 463 × 23
       course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
           <int>   <int> <dbl> <fct>   <fct>     <fct>  <fct>    <int>         <dbl>
     1         1       1   4.7 tenure… minority  female english     36          55.8
     2         2       1   4.1 tenure… minority  female english     36          68.8
     3         3       1   3.9 tenure… minority  female english     36          60.8
     4         4       1   4.8 tenure… minority  female english     36          62.6
     5         5       2   4.6 tenured not mino… male   english     59          85
     6         6       2   4.3 tenured not mino… male   english     59          87.5
     7         7       2   2.8 tenured not mino… male   english     59          88.6
     8         8       3   4.1 tenured not mino… male   english     51         100
     9         9       3   3.4 tenured not mino… male   english     51          56.9
    10        10       4   4.5 tenured not mino… female english     40          87.0
    # ℹ 453 more rows
    # ℹ 14 more variables: cls_did_eval <int>, cls_students <int>, cls_level <fct>,
    #   cls_profs <fct>, cls_credits <fct>, bty_f1lower <int>, bty_f1upper <int>,
    #   bty_f2upper <int>, bty_m1lower <int>, bty_m1upper <int>, bty_m2upper <int>,
    #   bty_avg <dbl>, pic_outfit <fct>, pic_color <fct>
    # ℹ Use `print(n = ...)` to see more rows

specify the model

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender)

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 463 × 2
       score gender
       <dbl> <fct>
     1   4.7 female
     2   4.1 female
     3   3.9 female
     4   4.8 female
     5   4.6 male
     6   4.3 male
     7   2.8 male
     8   4.1 male
     9   3.4 male
    10   4.5 female
    # ℹ 453 more rows
    # ℹ Use `print(n = ...)` to see more rows

generate bootstrap samples

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap")

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 6,945,000 × 3
    # Groups:   replicate [15,000]
       replicate score gender
           <int> <dbl> <fct>
     1         1   4   female
     2         1   3.1 male
     3         1   5   male
     4         1   4.4 male
     5         1   3.5 female
     6         1   4.5 female
     7         1   4.5 male
     8         1   4.9 male
     9         1   4.4 male
    10         1   3.5 male
    # ℹ 6,944,990 more rows
    # ℹ Use `print(n = ...)` to see more rows

calculate sample statistics

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap") |>
      calculate(stat = "diff in means", order = c("male", "female"))

    Response: score (numeric)
    Explanatory: gender (factor)
    # A tibble: 15,000 × 2
       replicate     stat
           <int>    <dbl>
     1         1  0.230
     2         2  0.134
     3         3  0.100
     4         4  0.230
     5         5  0.128
     6         6  0.201
     7         7  0.168
     8         8  0.130
     9         9 -0.00490
    10        10  0.123
    # ℹ 14,990 more rows
    # ℹ Use `print(n = ...)` to see more rows

summarize CI bounds

    library(tidyverse)
    library(tidymodels)

    evals |>
      specify(score ~ gender) |>
      generate(reps = 15000, type = "bootstrap") |>
      calculate(stat = "diff in means", order = c("male", "female")) |>
      summarize(l = quantile(stat, 0.025), u = quantile(stat, 0.975))

    # A tibble: 1 × 2
           l     u
       <dbl> <dbl>
    1 0.0431 0.242
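For readers who want to see what the generate, calculate, and summarize steps amount to, here is a minimal base-R sketch of a percentile bootstrap for a difference in means. It assumes whole rows are resampled with replacement, uses simulated data in place of `evals`, and runs fewer replicates than the slides; the helper names `diff_in_means` and `one_replicate` are hypothetical.

```r
# Simulated data standing in for the course evaluations; the values are
# made up and only the column names mirror the slides.
set.seed(2023)
toy <- data.frame(
  score  = c(rnorm(200, mean = 4.1, sd = 0.5), rnorm(260, mean = 4.25, sd = 0.5)),
  gender = rep(c("female", "male"), times = c(200, 260))
)

# Statistic of interest: mean(male) - mean(female)
diff_in_means <- function(d) {
  mean(d$score[d$gender == "male"]) - mean(d$score[d$gender == "female"])
}

# One bootstrap replicate: resample whole rows with replacement, then
# recompute the statistic on the resampled data
one_replicate <- function(d) {
  rows <- sample(nrow(d), size = nrow(d), replace = TRUE)
  diff_in_means(d[rows, ])
}

reps <- 2000  # fewer than the 15,000 on the slides, to keep the sketch quick
boot_stats <- replicate(reps, one_replicate(toy))

# Percentile interval: the middle 95% of the bootstrap distribution
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
ci
```

The interval bounds are simply the 2.5th and 97.5th percentiles of the resampled statistics, which is the same computation as the `summarize()` call with `quantile()`.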

✴ sampling variability ✴ inference via bootstrapping and randomization ✴ interpreting study results

FISHERIES OF THE WORLD // EXPLORATORY DATA ANALYSIS 2
Image by Nghĩa Đặng from Pixabay

✴ data joins

    fisheries |>
      select(country)
    #> # A tibble: 82 × 1
    #>    country
    #>    <chr>
    #>  1 Angola
    #>  2 Argentina
    #>  3 Australia
    #>  4 Bangladesh
    #>  5 Brazil
    #>  6 Cambodia
    #>  7 Cameroon
    #>  8 Canada
    #>  9 Chad
    #> 10 Chile
    #> # ℹ 72 more rows

    continents
    #> # A tibble: 245 × 2
    #>    country           continent
    #>    <chr>             <chr>
    #>  1 Afghanistan       Asia
    #>  2 Åland Islands     Europe
    #>  3 Albania           Europe
    #>  4 Algeria           Africa
    #>  5 American Samoa    Oceania
    #>  6 Andorra           Europe
    #>  7 Angola            Africa
    #>  8 Anguilla          Americas
    #>  9 Antigua & Barbuda Americas
    #> 10 Argentina         Americas
    #> # ℹ 235 more rows

    fisheries <- left_join(fisheries, continents)
    #> Joining with `by = join_by(country)`

✴ data joins ✴ data science ethics

    fisheries |>
      filter(is.na(continent))
    #> # A tibble: 3 × 5
    #>   country                          capture aquaculture   total continent
    #>   <chr>                              <dbl>       <dbl>   <dbl> <chr>
    #> 1 Democratic Republic of the Congo  237372        3161  240533 NA
    #> 2 Hong Kong                         142775        4258  147033 NA
    #> 3 Myanmar                          2072390     1017644 3090034 NA

    fisheries <- fisheries |>
      mutate(
        continent = case_when(
          country == "Democratic Republic of the Congo" ~ "Africa",
          country == "Hong Kong" ~ "Asia",
          country == "Myanmar" ~ "Asia",
          .default = continent
        )
      )
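One way to anticipate rows like these before joining is to compare the key columns directly. The base-R sketch below uses two tiny made-up tables; the alternate spellings in `continents_toy` are assumptions chosen to produce a mismatch, not the actual contents of `continents`.

```r
# Tiny made-up tables; the country names come from the slides, the numbers
# and the exact spellings in `continents_toy` are illustrative assumptions.
fisheries_toy <- data.frame(
  country = c("Angola", "Hong Kong", "Myanmar"),
  total   = c(100, 200, 300)
)
continents_toy <- data.frame(
  country   = c("Angola", "Hong Kong SAR China", "Myanmar (Burma)"),
  continent = c("Africa", "Asia", "Asia")
)

# Keys on the left that have no partner on the right
unmatched <- setdiff(fisheries_toy$country, continents_toy$country)
unmatched

# Base-R equivalent of a left join: keep every row of the left table
joined <- merge(fisheries_toy, continents_toy, by = "country", all.x = TRUE)

# The unmatched keys are exactly the rows with a missing continent
joined$country[is.na(joined$continent)]
```

Checking keys first turns the missing values after the join from a surprise into an expected, countable set of rows to fix.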

✴ data joins ✴ data science ethics ✴ critique ✴ improving data visualisations ✴ mapping

Project: Regional differences in average GPA and SAT
Question: Exploring the regional differences in average GPA and SAT score across the US and the factors that could potentially explain them.
Team: Mine’s Minions

COVID BRIEFINGS 2

✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation ✴ text analysis ✴ data science ethics

    robotstxt::paths_allowed("https://www.gov.scot")
    #> www.gov.scot
    #> [1] TRUE
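As a rough illustration of how regular expressions, functions, and iteration fit together once text has been scraped, the base-R sketch below extracts a case count from each of several strings. The `briefings` vector is invented for this example, and `extract_cases` is a hypothetical helper; neither comes from the talk.

```r
# Made-up snippets in the style of a briefing transcript; the wording and
# numbers are hypothetical, not scraped from any real page.
briefings <- c(
  "2020-10-26" = "I can report 1,122 new cases since yesterday.",
  "2020-10-27" = "Today there were 1,327 new cases confirmed.",
  "2020-10-28" = "No figures were published today."
)

# Function: pull the first number that precedes "new cases" out of one string
extract_cases <- function(text) {
  hit <- regmatches(text, regexpr("[0-9][0-9,]*(?= new cases)", text, perl = TRUE))
  if (length(hit) == 0) return(NA_integer_)   # no match in this text
  as.integer(gsub(",", "", hit))              # "1,122" -> 1122
}

# Iteration: apply the function to every briefing, keeping the dates as names
cases <- vapply(briefings, extract_cases, integer(1))
cases
```

Returning `NA` for texts without a match keeps the result the same length as the input, so it can be lined up with the dates for plotting.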

Project: Factors Most Important to University Ranking
Question: Explore how various metrics (e.g., SAT/ACT scores, admission rate, region, Carnegie classification) predict rankings on the Niche College Ranking List.
Team: 2cool4school

SPAM FILTERS // MODELING 3
Image by Gerd Altmann from Pixabay

✴ logistic regression ✴ prediction ✴ decision errors ✴ sensitivity / specificity ✴ intuition around loss functions
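The link between prediction, decision errors, and sensitivity / specificity can be sketched in a few lines of base R. Everything below is simulated: the `exclaim` feature, the coefficients used to generate the labels, and the cutoffs are assumptions made for illustration and do not come from the data used in the talk.

```r
# Simulated "emails": one made-up numeric feature and a spam label.
# Nothing here comes from the dataset used in the talk.
set.seed(42)
n <- 500
exclaim <- rpois(n, lambda = 3)                       # hypothetical feature
spam    <- rbinom(n, size = 1, prob = plogis(-2 + 0.6 * exclaim))
emails  <- data.frame(spam, exclaim)

# Logistic regression: model the log-odds of spam as linear in the feature
fit <- glm(spam ~ exclaim, data = emails, family = binomial)

# Predicted probability of spam for each email
p_hat <- predict(fit, type = "response")

# Turn probabilities into decisions at a cutoff, then score the decisions
rates <- function(cutoff) {
  flagged <- p_hat >= cutoff
  c(
    sensitivity = mean(flagged[emails$spam == 1]),   # spam correctly flagged
    specificity = mean(!flagged[emails$spam == 0])   # non-spam correctly passed
  )
}

rates(0.5)
rates(0.9)
```

Raising the cutoff can only shrink the set of flagged emails, so sensitivity cannot go up and specificity cannot go down; which trade-off is acceptable depends on how costly each kind of decision error is.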

Project: Predicting League of Legends success
Question: Ten minutes into the game, is a gold lead or an experience lead a better predictor of which team wins?
Team: Blue Squirrels

Project: A Critique of Hollywood Relationship Stereotypes
Question: How has the average age difference between two actors in an on-screen relationship changed over the years? Furthermore, do on-screen same-sex relationships have a different average age gap than on-screen heterosexual relationships?
Team: team300

creativity: assignments that make room for creativity
peer feedback: at various stages of the project
teams: weekly labs in teams + periodic team evaluations + term project in teams
“minute paper”: weekly online quizzes ending with a brief reflection on the week’s material
live coding: in every “lecture”, along with time for students to attempt exercises on their own

Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield. "The 5Ws and 1H of term projects in the introductory data science classroom." Statistics Education Research Journal 21.2 (2022): 4-4.

Content Pedagogy Tooling

+ … Browser-based access to

‣ Go to [URL to access RStudio in the browser]
‣ Start the project titled UN Votes
‣ Open the Quarto document called unvotes.qmd
‣ Render the document and review the data visualization you just produced
‣ Then, look for the character string “Turkey” in the code and replace it with another country of your choice
‣ Render again, and review how the voting patterns of the country you picked compare to the United States and the United Kingdom

Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29.sup1 (2021): S132-S144.

Openness + Scalability

datasciencebox.org

github.com/tidyverse/datascience-box

github.com/tidyverse/dsbox

AUDIENCE
I have been teaching with R for a while, but I want to update my teaching materials
I’m new to teaching with R and need to build up my course materials
This teaching slide deck I came across is pretty cool, but I have no idea what type of course it belongs in

on COMMUNITY

sta199-f22-1.github.io EXAMPLE

Çetinkaya-Rundel, Mine, and Victoria Ellison. "A fresh look at introductory data science." Journal of Statistics and Data Science Education 29.sup1 (2021): S16-S26. SCHOLARSHIP

thank you!
🔗 bit.ly/dsbox-evolving-maa
mine-cetinkaya-rundel
fosstodon.org/@minecr
minecr.bsky.social

Thank you for joining us for this MAA Distinguished Lecture Series presentation.
