Slide 1

Slide 1 text

Statistics in the age of Data Science
Mine Çetinkaya-Rundel
Duke University
🏡 mine-cr.com

Slide 2

Slide 2 text

Let’s read the health section of the New York Times…

Slide 3

Slide 3 text

Indeed, even Haidt, who has also emphasized broader changes to the culture of childhood, estimated that social media use is responsible for only about 10 percent to 15 percent of the variation in teenage well-being — which would be a significant correlation, given the complexities of adolescent life and of social science, but is also a much more measured estimate than you tend to see in headlines trumpeting the connection. … In Britain, the share of young people who reported “feeling down” or experiencing depression grew from 31 percent in 2012 to 38 percent on the eve of the pandemic and to 41 percent in 2021. That is significant, though by other measures British teenagers appear, if more depressed than they were in the 2000s, not much more depressed than they were in the 1990s. …

Slide 4

Slide 4 text

Now let’s do something a lot less “scientific”

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Now let’s generate even more text!

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Doing data science

[Diagram: Import, Tidy, then Understand (Transform, Visualize, Model), then Communicate, all within Program]

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.

Slide 10

Slide 10 text

Communicate Teaching data science

Slide 11

Slide 11 text

focus on:
  data visualisation
  data wrangling, tidying, acquisition
  exploratory data analysis
  predictive modeling + uncertainty quantification
  effective communication of results

foray into:
  interactive visualizations
  text analysis
  machine learning
  Bayesian inference
  …

emphasize:
  consistent syntax | R + tidyverse
  reproducibility | Quarto
  version control and collaboration | Git + GitHub

Slide 12

Slide 12 text

Example 1 - Country populations
Visualize, Wrangle
Photo by Louis Hansel on Unsplash

Slide 13

Slide 13 text

population

# A tibble: 217 × 3
   country              year population
 1 Afghanistan          2022    41129.
 2 Albania              2022     2778.
 3 Algeria              2022    44903.
 4 American Samoa       2022       44.3
 5 Andorra              2022       79.8
 6 Angola               2022    35589.
 7 Antigua and Barbuda  2022       93.8
 8 Argentina            2022    46235.
 9 Armenia              2022     2780.
10 Aruba                2022      106.
# ℹ 207 more rows

continents

# A tibble: 285 × 4
   entity                code      year continent
 1 Abkhazia              OWID_ABK  2015 Asia
 2 Afghanistan           AFG       2015 Asia
 3 Akrotiri and Dhekelia OWID_AKD  2015 Asia
 4 Aland Islands         ALA       2015 Europe
 5 Albania               ALB       2015 Europe
 6 Algeria               DZA       2015 Africa
 7 American Samoa        ASM       2015 Oceania
 8 Andorra               AND       2015 Europe
 9 Angola                AGO       2015 Africa
10 Anguilla              AIA       2015 North America
# ℹ 275 more rows

population_continents <- left_join(population, continents, join_by(country == entity))

✓ data joins

Slide 14

Slide 14 text

population_continents |>
  filter(is.na(continent))

# A tibble: 6 × 6
  country                   year.x population code  year.y continent
1 Congo, Dem. Rep.            2022     99010. NA        NA NA
2 Congo, Rep.                 2022      5970. NA        NA NA
3 Hong Kong SAR, China        2022      7346. NA        NA NA
4 Korea, Dem. People's Rep.   2022     26069. NA        NA NA
5 Korea, Rep.                 2022     51628. NA        NA NA
6 Kyrgyz Republic             2022      6975. NA        NA NA

✓ data joins ✓ data wrangling
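As a side note, the unmatched rows can also be found before joining. The following is a minimal sketch, assuming dplyr 1.1.0 or later is installed; it uses tiny toy stand-ins for the two tables (three rows each, not the real data) and `anti_join()`, which is a swapped-in alternative to the `filter(is.na(continent))` check shown on the slide.

```r
library(dplyr)

# Toy stand-ins for the slide's tables (illustrative rows only, not the full data).
population <- tibble(
  country = c("Afghanistan", "Albania", "Korea, Rep."),
  year = 2022,
  population = c(41129, 2778, 51628)
)

continents <- tibble(
  entity = c("Afghanistan", "Albania", "South Korea"),
  code = c("AFG", "ALB", "KOR"),
  continent = c("Asia", "Europe", "Asia")
)

# anti_join() keeps the rows of `population` that have no match in `continents`,
# i.e. the country names that would end up with NA continents after a left join.
unmatched <- anti_join(population, continents, by = join_by(country == entity))
unmatched
```

In this toy version only "Korea, Rep." is returned, since the other table spells it "South Korea".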

Slide 15

Slide 15 text

population_continent <- population |>
  mutate(country = case_when(
    country == "Congo, Dem. Rep." ~ "Democratic Republic of Congo",
    country == "Congo, Rep." ~ "Congo",
    country == "Hong Kong SAR, China" ~ "Hong Kong",
    country == "Korea, Dem. People's Rep." ~ "North Korea",
    country == "Korea, Rep." ~ "South Korea",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    .default = country
    )
  ) |>
  left_join(continents, by = join_by(country == entity))

✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics

Slide 16

Slide 16 text

✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations

Slide 17

Slide 17 text

✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations ✓ mapping ✓ iteration

Slide 18

Slide 18 text

David Spiegelhalter Professor, University of Cambridge “There is no substitute for simply looking at data properly.” from “The Art of Statistics”

Slide 19

Slide 19 text

Example 2 - Opinion pieces in The Chronicle
Import

Slide 20

Slide 20 text

✓ web scraping

chronicle

# A tibble: 500 × 6
   title                                           author date       abstract column url
 1 All the world’s a stage                         Anna … 2024-02-22 If we a… STUDE… http…
 2 Words that matter: For Alexei Navalny           Carol… 2024-02-22 In some… STUDE… http…
 3 Which would you save: Friend or romantic partn… Jess … 2024-02-22 Love sh… STUDE… http…
 4 Happiness is not what you’re looking for        Paul … 2024-02-21 We hing… STUDE… http…
 5 Closing Duke's Herbarium: A fear of long-term … Matth… 2024-02-21 Without… LETTE… http…
 6 CS Majors launch 'ambiguous and labelless rela… Monda… 2024-02-20 Unlike … STUDE… http…
 7 The fear of being single                        Heidi… 2024-02-20 But it … STUDE… http…
 8 Save the Duke Herbarium                         Henry… 2024-02-17 The Duk… LETTE… http…
 9 What Duke can learn from retiring ex-president… Rober… 2024-02-17 In Duke… GUEST… http…
10 Love, love                                      Gabri… 2024-02-16 Somehow… STUDE… http…
# ℹ 490 more rows

Slide 21

Slide 21 text

✓ web scraping ✓ terms of use ✓ ethics

robotstxt::paths_allowed("https://www.dukechronicle.com")

 www.dukechronicle.com
[1] TRUE

Slide 22

Slide 22 text

✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization

Slide 23

Slide 23 text

✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization ✓ sentiment analysis

Slide 24

Slide 24 text

Sarah Jarvis Director of Applied Machine Learning and Data Science at Secondmind “Data science is all about asking interesting questions based on the data you have—or often the data you don’t have.”

Slide 25

Slide 25 text

Example 3 - Spam filter
Predict
Photo by Alexander Grey on Unsplash

Slide 26

Slide 26 text

✓ logistic regression ✓ classification
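The slide itself shows no code, so here is a rough sketch of what a logistic regression spam classifier could look like in base R. Everything in it is an assumption for illustration: the data are simulated, and the names `exclaim` (number of exclamation marks), `spam`, and the 0.5 cutoff are hypothetical rather than taken from the course materials.

```r
# Simulated emails: spam becomes more likely as the count of exclamation marks grows.
set.seed(199)
n <- 500
exclaim <- rpois(n, lambda = 2)
spam <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * exclaim))
emails <- data.frame(exclaim, spam)

# Logistic regression: models the log-odds of spam as a linear function of exclaim.
fit <- glm(spam ~ exclaim, data = emails, family = binomial)

# Predicted probabilities, then a yes/no classification at a 0.5 cutoff.
emails$prob_spam <- predict(fit, type = "response")
emails$pred_spam <- as.integer(emails$prob_spam >= 0.5)

head(emails)
```

The cutoff is a choice, not part of the model; moving it trades one kind of decision error for the other.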

Slide 27

Slide 27 text

✓ logistic regression ✓ classification ✓ decision errors ✓ sensitivity / specificity ✓ intuition around loss functions
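To make sensitivity and specificity concrete, a small sketch in base R follows. The labels and predicted probabilities are made up for illustration (eight hypothetical emails), and `classify()` and `rates()` are helper names introduced here, not functions from any package.

```r
# Hypothetical true labels (1 = spam) and predicted spam probabilities.
actual <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob   <- c(0.9, 0.6, 0.3, 0.2, 0.7, 0.1, 0.4, 0.8)

# Label an email as spam when its probability reaches the cutoff.
classify <- function(prob, cutoff) as.integer(prob >= cutoff)

# Sensitivity: share of spam caught. Specificity: share of non-spam left alone.
rates <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)
  fn <- sum(actual == 1 & predicted == 0)
  tn <- sum(actual == 0 & predicted == 0)
  fp <- sum(actual == 0 & predicted == 1)
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

rates_50 <- rates(actual, classify(prob, 0.5))   # sensitivity 0.75, specificity 0.75
rates_75 <- rates(actual, classify(prob, 0.75))  # sensitivity 0.50, specificity 1.00
```

Raising the cutoff from 0.5 to 0.75 removes the one false positive (a real email sent to spam) at the cost of missing one more spam message, which is the kind of trade-off behind the intuition for loss functions.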

Slide 28

Slide 28 text

George Box Statistician “All models are wrong but some are useful” from “Robustness in the strategy of scientific model building”

Slide 29

Slide 29 text

and we could keep going on with examples… but let’s talk resources instead!

Slide 30

Slide 30 text

datasciencebox.org

Slide 31

Slide 31 text

datasciencebox.org

Slide 32

Slide 32 text

datasciencebox.org sta199-s24.github.io

Slide 33

Slide 33 text

sta199-s24.github.io

Slide 34

Slide 34 text

sta199-f24.github.io

Slide 35

Slide 35 text

datasciencebox.org sta199-s24.github.io openintro.org

Slide 36

Slide 36 text

openintro.org

Slide 37

Slide 37 text

coursera.org

Slide 38

Slide 38 text

and a bit on… course structure + pedagogy

Slide 39

Slide 39 text

Lecture x 2 per week: 300 students
Lab x 1 per week: 10 sections of 30 students

Slide 40

Slide 40 text

teams: weekly labs in teams + periodic team evaluations + term project in teams
peer feedback: at various stages of the project
live coding: in every “lecture”, along with time for students to attempt exercises on their own
“minute paper”: weekly online quizzes ending with a brief reflection on the week’s material
creativity: assignments that make room for creativity
nudges: periodically throughout the semester

Slide 41

Slide 41 text

Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield. "The 5Ws and 1H of term projects in the introductory data science classroom." Statistics Education Research Journal 21.2 (2022): 4-4.

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

thank you!
🔗 bit.ly/stats-ds-uwaterloo
🏡 mine-cr.com