Slide 1

Slide 1 text

Data Science in a(n Evolving) Box
Mine Çetinkaya-Rundel
Duke University
2023-11-15

Slide 2

Slide 2 text

It All Starts with Math.

Slide 3

Slide 3 text

It All Starts with Math. The students don’t all start

Slide 4

Slide 4 text

MANY START WITH DATA SCIENCE
[Figure: data science workflow with the labels Program, Import, Tidy, Transform, Visualize, Model, Communicate, Understand]
Source: Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.

Slide 5

Slide 5 text

Communicate MANY START WITH DATA SCIENCE

Slide 6

Slide 6 text

Content Tooling Pedagogy

Slide 7

Slide 7 text

Content Tooling Pedagogy

Slide 8

Slide 8 text

Tooling Pedagogy Content in 3 examples

Slide 9

Slide 9 text

COURSE EVALUATIONS // INFERENCE 1 Image by Andreas Breitling from Pixabay

Slide 10

Slide 10 text

      score  rank          ethnicity     gender  bty_avg
  1   4.7    tenure track  minority      female  5
  2   4.1    tenure track  minority      female  5
  3   3.9    tenure track  minority      female  5
  4   4.8    tenure track  minority      female  5
  5   4.6    tenured       not minority  male    3
  6   4.3    tenured       not minority  male    3
  7   2.8    tenured       not minority  male    3
  8   4.1    tenured       not minority  male    3.33
  9   3.4    tenured       not minority  male    3.33
 10   4.5    tenured       not minority  female  3.17
  …   …      …             …             …       …
463   4.1    tenure track  minority      female  5.33

score: evaluation score (1-5)
bty_avg: beauty score (1-10)

Hamermesh, Parker. “Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity”, Econ of Ed Review, Vol 24-4.

Slide 11

Slide 11 text

Estimate the difference in average evaluation scores of male and female faculty.

Slide 12

Slide 12 text

Approach 1: Using methods based on the Central Limit Theorem.

t.test(score ~ gender, data = evals)

    Welch Two Sample t-test

data:  score by gender
t = -2.7507, df = 398.7, p-value = 0.006218
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.24264375 -0.04037194
sample estimates:
mean in group female   mean in group male
            4.092821             4.234328
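To make the t.test() output less of a black box, here is a minimal sketch of the Welch interval computed by hand. The data are simulated stand-ins (the group sizes, means, and standard deviations are assumptions, not the evals data), and welch_ci() is a hypothetical helper name.

```r
# Simulated stand-in scores; NOT the real course evaluation data
set.seed(1)
female <- rnorm(40, mean = 4.1, sd = 0.5)
male   <- rnorm(60, mean = 4.2, sd = 0.5)

# Hypothetical helper: Welch two-sample interval for mean(x) - mean(y)
welch_ci <- function(x, y, level = 0.95) {
  vx <- var(x) / length(x)          # squared standard error of mean(x)
  vy <- var(y) / length(y)          # squared standard error of mean(y)
  se <- sqrt(vx + vy)               # standard error of the difference
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  diff <- mean(x) - mean(y)
  margin <- qt(1 - (1 - level) / 2, df) * se
  list(statistic = diff / se, df = df,
       conf_int = c(diff - margin, diff + margin))
}

by_hand  <- welch_ci(female, male)
built_in <- t.test(female, male)    # same comparison via the built-in test

by_hand$conf_int
built_in$conf.int
```

The two intervals should agree up to floating-point error, which is a quick way to check the formula against the built-in function.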

Slide 13

Slide 13 text

Approach 2: Using computational methods.

Slide 14

Slide 14 text

Approach 2: Using computational methods.
start with data

library(tidyverse)
library(tidymodels)

evals

# A tibble: 463 × 23
   course_id prof_id score rank    ethnicity gender language   age cls_perc_eval
 1         1       1   4.7 tenure… minority  female english     36          55.8
 2         2       1   4.1 tenure… minority  female english     36          68.8
 3         3       1   3.9 tenure… minority  female english     36          60.8
 4         4       1   4.8 tenure… minority  female english     36          62.6
 5         5       2   4.6 tenured not mino… male   english     59          85
 6         6       2   4.3 tenured not mino… male   english     59          87.5
 7         7       2   2.8 tenured not mino… male   english     59          88.6
 8         8       3   4.1 tenured not mino… male   english     51         100
 9         9       3   3.4 tenured not mino… male   english     51          56.9
10        10       4   4.5 tenured not mino… female english     40          87.0
# ℹ 453 more rows
# ℹ 14 more variables: cls_did_eval, cls_students, cls_level,
#   cls_profs, cls_credits, bty_f1lower, bty_f1upper,
#   bty_f2upper, bty_m1lower, bty_m1upper, bty_m2upper,
#   bty_avg, pic_outfit, pic_color
# ℹ Use `print(n = ...)` to see more rows

Slide 15

Slide 15 text

Approach 2: Using computational methods.
specify the model

library(tidyverse)
library(tidymodels)

evals |>
  specify(score ~ gender)

Response: score (numeric)
Explanatory: gender (factor)
# A tibble: 463 × 2
   score gender
 1   4.7 female
 2   4.1 female
 3   3.9 female
 4   4.8 female
 5   4.6 male
 6   4.3 male
 7   2.8 male
 8   4.1 male
 9   3.4 male
10   4.5 female
# ℹ 453 more rows
# ℹ Use `print(n = ...)` to see more rows

Slide 16

Slide 16 text

Approach 2: Using computational methods.
generate bootstrap samples

library(tidyverse)
library(tidymodels)

evals |>
  specify(score ~ gender) |>
  generate(reps = 15000, type = "bootstrap")

Response: score (numeric)
Explanatory: gender (factor)
# A tibble: 6,945,000 × 3
# Groups:   replicate [15,000]
   replicate score gender
 1         1   4   female
 2         1   3.1 male
 3         1   5   male
 4         1   4.4 male
 5         1   3.5 female
 6         1   4.5 female
 7         1   4.5 male
 8         1   4.9 male
 9         1   4.4 male
10         1   3.5 male
# ℹ 6,944,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Slide 17

Slide 17 text

Approach 2: Using computational methods.
calculate sample statistics

library(tidyverse)
library(tidymodels)

evals |>
  specify(score ~ gender) |>
  generate(reps = 15000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("male", "female"))

Response: score (numeric)
Explanatory: gender (factor)
# A tibble: 15,000 × 2
   replicate     stat
 1         1  0.230
 2         2  0.134
 3         3  0.100
 4         4  0.230
 5         5  0.128
 6         6  0.201
 7         7  0.168
 8         8  0.130
 9         9 -0.00490
10        10  0.123
# ℹ 14,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Slide 18

Slide 18 text

Approach 2: Using computational methods.
summarize CI bounds

library(tidyverse)
library(tidymodels)

evals |>
  specify(score ~ gender) |>
  generate(reps = 15000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("male", "female")) |>
  summarize(l = quantile(stat, 0.025), u = quantile(stat, 0.975))

# A tibble: 1 × 2
       l     u
1 0.0431 0.242
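The generate / calculate / summarize steps above can also be sketched in base R, which may help show what each verb is doing. This is an illustration under assumptions, not the package's implementation: the data frame is a simulated stand-in for evals, diff_in_means() is a hypothetical helper, and the resampling draws whole rows with replacement, which is how I understand the "bootstrap" type to work.

```r
set.seed(2023)

# Simulated stand-in for `evals`; group sizes and means are assumptions
scores <- data.frame(
  score  = c(rnorm(60, mean = 4.2, sd = 0.5), rnorm(40, mean = 4.1, sd = 0.5)),
  gender = rep(c("male", "female"), times = c(60, 40))
)

# Hypothetical helper: difference in mean score, male minus female
diff_in_means <- function(d) {
  mean(d$score[d$gender == "male"]) - mean(d$score[d$gender == "female"])
}

# "generate" + "calculate": resample rows with replacement, then
# recompute the statistic on each bootstrap sample
boot_stats <- replicate(2000, {
  resampled <- scores[sample(nrow(scores), replace = TRUE), ]
  diff_in_means(resampled)
})

# "summarize": percentile interval from the bootstrap distribution
ci <- quantile(boot_stats, c(0.025, 0.975))
ci
```

The number of replicates (2000 here, 15000 on the slide) only affects how smooth the bootstrap distribution is, not the logic of the procedure.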

Slide 19

Slide 19 text

✴ sampling variability ✴ inference via bootstrapping and randomization ✴ interpreting study results

Slide 20

Slide 20 text

FISHERIES OF THE WORLD // EXPLORATORY DATA ANALYSIS 2 Image by Nghĩa Đặng from Pixabay

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

✴ data joins

fisheries |>
  select(country)
#> # A tibble: 82 × 1
#>    country
#>  1 Angola
#>  2 Argentina
#>  3 Australia
#>  4 Bangladesh
#>  5 Brazil
#>  6 Cambodia
#>  7 Cameroon
#>  8 Canada
#>  9 Chad
#> 10 Chile
#> # ℹ 72 more rows

continents
#> # A tibble: 245 × 2
#>    country           continent
#>  1 Afghanistan       Asia
#>  2 Åland Islands     Europe
#>  3 Albania           Europe
#>  4 Algeria           Africa
#>  5 American Samoa    Oceania
#>  6 Andorra           Europe
#>  7 Angola            Africa
#>  8 Anguilla          Americas
#>  9 Antigua & Barbuda Americas
#> 10 Argentina         Americas
#> # ℹ 235 more rows

fisheries <- left_join(fisheries, continents)
Joining with `by = join_by(country)`

Slide 23

Slide 23 text

✴ data joins ✴ data science ethics

fisheries |>
  filter(is.na(continent))
#> # A tibble: 3 × 5
#>   country                          capture aquaculture   total continent
#> 1 Democratic Republic of the Congo  237372        3161  240533 NA
#> 2 Hong Kong                         142775        4258  147033 NA
#> 3 Myanmar                          2072390     1017644 3090034 NA

fisheries <- fisheries |>
  mutate(
    continent = case_when(
      country == "Democratic Republic of the Congo" ~ "Africa",
      country == "Hong Kong" ~ "Asia",
      country == "Myanmar" ~ "Asia",
      .default = continent
    )
  )
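One way to spot rows like these before patching them is to look for keys that have no match in the lookup table. The sketch below uses small made-up tables (the country spellings and totals are hypothetical, chosen only to produce a mismatch) and dplyr's anti_join().

```r
library(dplyr)

# Hypothetical toy tables; values are made up for illustration
catch <- tibble(
  country = c("Angola", "Chile", "Myanmar"),
  total   = c(10, 20, 30)
)
regions <- tibble(
  country   = c("Angola", "Chile", "Myanmar (Burma)"),
  continent = c("Africa", "Americas", "Asia")
)

# Rows of `catch` whose country has no match in `regions`
unmatched <- anti_join(catch, regions, by = "country")

# A left join keeps every row of `catch`; unmatched rows get NA
joined <- left_join(catch, regions, by = "country")

unmatched
joined
```

Listing the unmatched keys first makes it explicit which names need a decision, which is where the ethics discussion on this slide comes in.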

Slide 24

Slide 24 text

✴ data joins ✴ data science ethics ✴ critique ✴ improving data visualisations

Slide 25

Slide 25 text

✴ data joins ✴ data science ethics ✴ critique ✴ improving data visualisations ✴ mapping

Slide 26

Slide 26 text

Project: Regional differences in average GPA and SAT Question: Exploring the regional differences in average GPA and SAT score across the US and the factors that could potentially explain them. Team: Mine’s Minions

Slide 27

Slide 27 text

COVID BRIEFINGS 2

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions

Slide 30

Slide 30 text

✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration

Slide 31

Slide 31 text

✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation

Slide 32

Slide 32 text

✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation ✴ text analysis

Slide 33

Slide 33 text

✴ web scraping ✴ text parsing ✴ data types ✴ regular expressions ✴ functions ✴ iteration ✴ data visualisation ✴ interpretation ✴ text analysis ✴ data science ethics

robotstxt::paths_allowed("https://www.gov.scot")
#>  www.gov.scot
#> [1] TRUE
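As a small illustration of how text parsing, regular expressions, functions, and iteration fit together, here is a base R sketch that pulls a date out of a title string. The titles are hypothetical examples (not scraped from any site), and extract_date() is a made-up helper that assumes each title ends in "day month year" with an English month name.

```r
# Hypothetical titles; NOT taken from any scraped page
titles <- c(
  "Coronavirus (COVID-19) update: First Minister's speech 26 October 2020",
  "Coronavirus (COVID-19) update: First Minister's speech 3 November 2020"
)

# Hypothetical helper: parse a trailing "day month year" into a Date
extract_date <- function(title) {
  pattern <- "([0-9]{1,2}) ([A-Za-z]+) ([0-9]{4})$"
  parts <- regmatches(title, regexec(pattern, title))[[1]]
  if (length(parts) == 0) return(as.Date(NA))   # no date found
  month <- match(parts[3], month.name)          # month name to number
  if (is.na(month)) return(as.Date(NA))         # unrecognised month name
  as.Date(sprintf("%s-%02d-%02d", parts[4], month, as.integer(parts[2])))
}

# Iterate over all titles and combine the results into one Date vector
dates <- do.call(c, lapply(titles, extract_date))
dates
```

Matching the month against month.name rather than parsing with "%B" keeps the function independent of the system locale.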

Slide 34

Slide 34 text

Project: Factors Most Important to University Ranking Question: Explore how various metrics (e.g., SAT/ACT scores, admission rate, region, Carnegie classification) predict rankings on the Niche College Ranking List. Team: 2cool4school

Slide 35

Slide 35 text

SPAM FILTERS // MODELING 3 Image by Gerd Altmann from Pixabay

Slide 36

Slide 36 text

✴ logistic regression ✴ prediction

Slide 37

Slide 37 text

✴ logistic regression ✴ prediction ✴ decision errors ✴ sensitivity / specificity ✴ intuition around loss functions
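The trade-off between the two kinds of decision error can be sketched with a handful of made-up predicted probabilities. Everything below is hypothetical: the labels, the probabilities, and the helper classification_rates(). In practice the probabilities would come from a fitted logistic regression.

```r
# Hypothetical labels (1 = spam) and predicted spam probabilities
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1, 0.05)

# Hypothetical helper: sensitivity and specificity at a given cutoff
classification_rates <- function(truth, prob, threshold) {
  predicted <- prob >= threshold   # TRUE = flagged as spam
  actual    <- truth == 1
  c(
    sensitivity = sum(predicted & actual) / sum(actual),     # spam caught
    specificity = sum(!predicted & !actual) / sum(!actual)   # non-spam kept
  )
}

at_half   <- classification_rates(truth, prob, threshold = 0.5)
at_strict <- classification_rates(truth, prob, threshold = 0.75)

at_half    # sensitivity 0.75, specificity about 0.83
at_strict  # sensitivity 0.50, specificity 1.00
```

Raising the cutoff lets less legitimate mail land in the spam folder at the cost of missing more spam. Which error is worse depends on the application, and that is the intuition behind choosing a loss function.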

Slide 38

Slide 38 text

Project: Predicting League of Legends success Question: Ten minutes into the game, is a gold lead or an experience lead a better predictor of which team wins? Team: Blue Squirrels

Slide 39

Slide 39 text

Project: A Critique of Hollywood Relationship Stereotypes Question: How has the average age difference between two actors in an on-screen relationship changed over the years? Furthermore, do on-screen same-sex relationships have a different average age gap than on-screen heterosexual relationships? Team: team300

Slide 40

Slide 40 text

Content Tooling Pedagogy

Slide 41

Slide 41 text

creativity: assignments that make room for creativity
peer feedback: at various stages of the project
teams: weekly labs in teams + periodic team evaluations + term project in teams
“minute paper”: weekly online quizzes ending with a brief reflection on the week’s material
live coding: in every “lecture”, along with time for students to attempt exercises on their own

Slide 42

Slide 42 text

Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield. "The 5Ws and 1H of term projects in the introductory data science classroom." Statistics Education Research Journal 21.2 (2022): 4-4.

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Content Pedagogy Tooling

Slide 45

Slide 45 text

+ … Browser-based access to

Slide 46

Slide 46 text

‣ Go to [URL to access RStudio in the browser] ‣ Start the project titled UN Votes

Slide 47

Slide 47 text

‣ Go to [URL to access RStudio in the browser] ‣ Start the project titled UN Votes ‣ Open the Quarto document called unvotes.qmd

Slide 48

Slide 48 text

‣ Go to [URL to access RStudio in the browser] ‣ Start the project titled UN Votes ‣ Open the Quarto document called unvotes.qmd ‣ Render the document and review the data visualization you just produced

Slide 49

Slide 49 text

‣ Go to [URL to access RStudio in the browser] ‣ Start the project titled UN Votes ‣ Open the Quarto document called unvotes.qmd ‣ Render the document and review the data visualization you just produced ‣ Then, look for the character string “Turkey” in the code and replace it with another country of your choice ‣ Render again, and review how the voting patterns of the country you picked compare to the United States and the United Kingdom

Slide 50

Slide 50 text

Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29.sup1 (2021): S132-S144.

Slide 51

Slide 51 text

Content Tooling Pedagogy

Slide 52

Slide 52 text

Openness + Scalability

Slide 53

Slide 53 text

datasciencebox.org

Slide 54

Slide 54 text

datasciencebox.org

Slide 55

Slide 55 text

github.com/tidyverse/datascience-box

Slide 56

Slide 56 text

github.com/tidyverse/dsbox

Slide 57

Slide 57 text

AUDIENCE
I have been teaching with R for a while, but I want to update my teaching materials
I’m new to teaching with R and need to build up my course materials
This teaching slide deck I came across is pretty cool, but I have no idea what type of course it belongs in

Slide 58

Slide 58 text

on COMMUNITY

Slide 59

Slide 59 text

sta199-f22-1.github.io EXAMPLE

Slide 60

Slide 60 text

Çetinkaya-Rundel, Mine, and Victoria Ellison. "A fresh look at introductory data science." Journal of Statistics and Data Science Education 29.sup1 (2021): S16-S26. SCHOLARSHIP

Slide 61

Slide 61 text

(N EVER-EVOLVING)
thank you!
🔗 bit.ly/dsbox-evolving-maa
mine-cetinkaya-rundel
fosstodon.org/@minecr
minecr.bsky.social

Slide 62

Slide 62 text

Thank you for joining us for this MAA Distinguished Lecture Series presentation.