Statistics in the Age of Data Science

Slide 1

Slide 1 text

Mine Çetinkaya-Rundel
Duke University
Statistics in the age of Data Science
AP Statistics Reading 2024 Professional Night

Slide 2

Slide 2 text

Let’s read the health section of the New York Times…

Slide 3

Slide 3 text

Indeed, even Haidt, who has also emphasized broader changes to the culture of childhood, estimated that social media use is responsible for only about 10 percent to 15 percent of the variation in teenage well-being, which would be a significant correlation, given the complexities of adolescent life and of social science, but is also a much more measured estimate than you tend to see in headlines trumpeting the connection. … In Britain, the share of young people who reported “feeling down” or experiencing depression grew from 31 percent in 2012 to 38 percent on the eve of the pandemic and to 41 percent in 2021. That is significant, though by other measures British teenagers appear, if more depressed than they were in the 2000s, not much more depressed than they were in the 1990s. …
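A rough back-of-the-envelope reading of the figures quoted above, sketched in R. It assumes "variation" is meant in the usual share-of-variance (R-squared) sense for a single predictor; that is an assumption for interpretation, not a calculation taken from the article. The names `y2012`, `pre_pandemic`, and `y2021` are just labels for the three quoted shares.

```r
# Share of variance explained, as quoted (assumed to be an R-squared).
r_squared <- c(low = 0.10, high = 0.15)

# For a single linear predictor, |r| is the square root of R-squared.
abs_r <- sqrt(r_squared)
round(abs_r, 2)       # roughly 0.32 to 0.39

# Share reporting "feeling down" or depression in Britain, as quoted.
share <- c(y2012 = 31, pre_pandemic = 38, y2021 = 41)

# Change since 2012 in percentage points...
pp_change <- share[-1] - share["y2012"]

# ...and relative to the 2012 level, in percent.
rel_change <- 100 * pp_change / share["y2012"]
round(rel_change, 1)  # roughly 22.6 and 32.3
```

A 7 percentage point rise and a 23 percent relative rise describe the same change, which is one reason the headline framing matters.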

Slide 4

Slide 4 text

Now let’s do something a lot less “scientific”

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Now let’s generate even more text!

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Professor of the Practice + Director of Undergraduate Studies @ Duke University, Department of Statistical Science
Developer Educator @ Posit, PBC

Slide 10

Slide 10 text

Doing data science
Import → Tidy → Understand (Transform, Visualize, Model) → Communicate, with Program surrounding every step
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.

Slide 11

Slide 11 text

Communicate Teaching data science

Slide 12

Slide 12 text

focus on:
  data visualisation
  data wrangling, tidying, acquisition
  exploratory data analysis
  predictive modeling + uncertainty quantification
  effective communication of results

emphasize:
  consistent syntax | R + tidyverse
  reproducibility | Quarto
  version control and collaboration | Git + GitHub

foray into:
  interactive visualizations
  text analysis
  machine learning
  Bayesian inference
  …

Slide 13

Slide 13 text

Example 1 - Country populations
Visualize, Wrangle
Photo by Louis Hansel on Unsplash

Slide 14

Slide 14 text

population

# A tibble: 217 × 3
   country              year population
 1 Afghanistan          2022    41129.
 2 Albania              2022     2778.
 3 Algeria              2022    44903.
 4 American Samoa       2022       44.3
 5 Andorra              2022       79.8
 6 Angola               2022    35589.
 7 Antigua and Barbuda  2022       93.8
 8 Argentina            2022    46235.
 9 Armenia              2022     2780.
10 Aruba                2022      106.
# ℹ 207 more rows

continents

# A tibble: 285 × 4
   entity                code      year continent
 1 Abkhazia              OWID_ABK  2015 Asia
 2 Afghanistan           AFG       2015 Asia
 3 Akrotiri and Dhekelia OWID_AKD  2015 Asia
 4 Aland Islands         ALA       2015 Europe
 5 Albania               ALB       2015 Europe
 6 Algeria               DZA       2015 Africa
 7 American Samoa        ASM       2015 Oceania
 8 Andorra               AND       2015 Europe
 9 Angola                AGO       2015 Africa
10 Anguilla              AIA       2015 North America
# ℹ 275 more rows

population_continents <- left_join(population, continents, join_by(country == entity))

✓ data joins
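To make the join step concrete, here is a minimal sketch with dplyr on a four-row toy subset. The population values are copied from the printed output above; the two toy tables (`population_toy`, `continents_toy`) are made up for illustration, and the "South Korea" label in `continents_toy` is an assumption about how that country is named in the continents data.

```r
library(dplyr)

# Toy left table: country names as they appear in the population data.
population_toy <- tibble(
  country    = c("Afghanistan", "Albania", "Algeria", "Korea, Rep."),
  population = c(41129, 2778, 44903, 51628)
)

# Toy right table: one country is spelled differently on purpose.
continents_toy <- tibble(
  entity    = c("Afghanistan", "Albania", "Algeria", "South Korea"),
  continent = c("Asia", "Europe", "Africa", "Asia")
)

# left_join() keeps every row of the left table; keys with no match
# in the right table get NA in the columns that come from the right.
joined <- left_join(population_toy, continents_toy,
                    by = join_by(country == entity))

joined
```

The row for "Korea, Rep." comes back with `continent` set to NA, which is the kind of silent mismatch worth looking for after any join.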

Slide 15

Slide 15 text

population_continents |>
  filter(is.na(continent))

# A tibble: 6 × 6
  country                   year.x population code  year.y continent
1 Congo, Dem. Rep.            2022     99010. NA        NA NA
2 Congo, Rep.                 2022      5970. NA        NA NA
3 Hong Kong SAR, China        2022      7346. NA        NA NA
4 Korea, Dem. People's Rep.   2022     26069. NA        NA NA
5 Korea, Rep.                 2022     51628. NA        NA NA
6 Kyrgyz Republic             2022      6975. NA        NA NA

✓ data joins ✓ data wrangling

Slide 16

Slide 16 text

population_continent <- population |>
  mutate(country = case_when(
    country == "Congo, Dem. Rep."          ~ "Democratic Republic of Congo",
    country == "Congo, Rep."               ~ "Congo",
    country == "Hong Kong SAR, China"      ~ "Hong Kong",
    country == "Korea, Dem. People's Rep." ~ "North Korea",
    country == "Korea, Rep."               ~ "South Korea",
    country == "Kyrgyz Republic"           ~ "Kyrgyzstan",
    .default = country
    )
  ) |>
  left_join(continents, by = join_by(country == entity))

✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics
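One way to check a recode like this is to ask which keys still fail to match. The sketch below uses `anti_join()` before and after recoding; the tables `population_toy` and `continents_toy` are made up for illustration and are not the data from the talk.

```r
library(dplyr)

population_toy <- tibble(
  country = c("Albania", "Korea, Rep.", "Kyrgyz Republic")
)
continents_toy <- tibble(
  entity    = c("Albania", "South Korea", "Kyrgyzstan"),
  continent = c("Europe", "Asia", "Asia")
)

# Rows of the left table whose key has no match on the right.
unmatched_before <- anti_join(population_toy, continents_toy,
                              by = join_by(country == entity))

# Recode the mismatched names, leaving all other names unchanged.
recoded <- population_toy |>
  mutate(country = case_when(
    country == "Korea, Rep."     ~ "South Korea",
    country == "Kyrgyz Republic" ~ "Kyrgyzstan",
    .default = country
  ))

unmatched_after <- anti_join(recoded, continents_toy,
                             by = join_by(country == entity))

nrow(unmatched_before)  # 2
nrow(unmatched_after)   # 0
```

An empty `anti_join()` result says the keys now line up; it does not say the recoding choices themselves were the right ones, which is where the ethics checkmark comes in.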

Slide 17

Slide 17 text

✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations

Slide 18

Slide 18 text

✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations ✓ mapping ✓ iteration

Slide 19

Slide 19 text

“There is no substitute for simply looking at data properly.”
David Spiegelhalter, Professor, University of Cambridge
from “The Art of Statistics”

Slide 20

Slide 20 text

Example 2 - Opinion pieces in The Chronicle
Import

Slide 21

Slide 21 text

✓ web scraping

chronicle

# A tibble: 500 × 6
   title                                           author date       abstract column url
 1 All the world’s a stage                         Anna … 2024-02-22 If we a… STUDE… http…
 2 Words that matter: For Alexei Navalny           Carol… 2024-02-22 In some… STUDE… http…
 3 Which would you save: Friend or romantic partn… Jess … 2024-02-22 Love sh… STUDE… http…
 4 Happiness is not what you’re looking for        Paul … 2024-02-21 We hing… STUDE… http…
 5 Closing Duke's Herbarium: A fear of long-term … Matth… 2024-02-21 Without… LETTE… http…
 6 CS Majors launch 'ambiguous and labelless rela… Monda… 2024-02-20 Unlike … STUDE… http…
 7 The fear of being single                        Heidi… 2024-02-20 But it … STUDE… http…
 8 Save the Duke Herbarium                         Henry… 2024-02-17 The Duk… LETTE… http…
 9 What Duke can learn from retiring ex-president… Rober… 2024-02-17 In Duke… GUEST… http…
10 Love, love                                      Gabri… 2024-02-16 Somehow… STUDE… http…
# ℹ 490 more rows

Slide 22

Slide 22 text

✓ web scraping ✓ terms of use ✓ ethics

robotstxt::paths_allowed("https://www.dukechronicle.com")

 www.dukechronicle.com
[1] TRUE
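For intuition about what a robots.txt check involves, here is a deliberately simplified base R sketch. It is not how the robotstxt package is implemented: the helper `path_allowed_simple()` and the example rules are hypothetical, and the logic only handles plain `Disallow:` prefixes in the `User-agent: *` group (no wildcards, no `Allow:` overrides).

```r
# Hypothetical robots.txt content, made up for this example.
robots_txt <- c(
  "User-agent: *",
  "Disallow: /admin/",
  "Disallow: /search",
  "",
  "User-agent: examplebot",
  "Disallow: /"
)

path_allowed_simple <- function(lines, path) {
  agent <- NA_character_
  disallowed <- character(0)
  for (line in trimws(lines)) {
    if (grepl("^user-agent:", line, ignore.case = TRUE)) {
      # Remember which user agent the following rules apply to.
      agent <- trimws(sub("^[^:]*:", "", line))
    } else if (grepl("^disallow:", line, ignore.case = TRUE) &&
               identical(agent, "*")) {
      rule <- trimws(sub("^[^:]*:", "", line))
      # An empty Disallow means "nothing is disallowed".
      if (nzchar(rule)) disallowed <- c(disallowed, rule)
    }
  }
  # Allowed unless the path starts with one of the disallowed prefixes.
  !any(startsWith(path, disallowed))
}

path_allowed_simple(robots_txt, "/opinion/")     # TRUE
path_allowed_simple(robots_txt, "/admin/users")  # FALSE
```

A check like this is only one input. The site's terms of use still need a human read before any scraping starts.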

Slide 23

Slide 23 text

✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization

Slide 24

Slide 24 text

✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization ✓ sentiment analysis
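As a small, self-contained illustration of the text-analysis steps listed above, the base R sketch below tokenizes three of the scraped titles and scores them against a toy sentiment lexicon. The lexicon (`toy_lexicon`) and its scores are invented for the example; a real analysis would use an established lexicon and a proper tokenizer.

```r
titles <- c(
  "The fear of being single",
  "Save the Duke Herbarium",
  "Love, love"
)

# Lowercase, then split on anything that is not a letter or apostrophe.
tokens <- strsplit(tolower(titles), "[^a-z']+")

# Toy lexicon: word -> score. Values are made up for illustration.
toy_lexicon <- c(fear = -1, single = 0, save = 1, love = 1)

# Sum the scores of the words found in the lexicon; other words count as 0.
score_title <- function(words) {
  scores <- toy_lexicon[words]   # NA for words not in the lexicon
  sum(scores, na.rm = TRUE)
}

sentiment <- vapply(tokens, score_title, numeric(1))
data.frame(title = titles, sentiment = sentiment)
```

Even this toy version raises the questions worth discussing with students: who decided that "single" is neutral, and what happens to sarcasm?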

Slide 25

Slide 25 text

“Data science is all about asking interesting questions based on the data you have, or often the data you don’t have.”
Sarah Jarvis, Director of Applied Machine Learning and Data Science at Secondmind

Slide 26

Slide 26 text

Example 3 - Spam filter
Predict
Photo by Alexander Grey on Unsplash

Slide 27

Slide 27 text

✓ logistic regression ✓ classification

Slide 28

Slide 28 text

✓ logistic regression ✓ classification ✓ decision errors ✓ sensitivity / specificity ✓ intuition around loss functions
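A minimal sketch of how these pieces fit together, using simulated data in base R. Nothing here comes from the talk's actual spam dataset: the predictor `exclaim` (a count of exclamation marks), the coefficients used to simulate labels, and the helper `rates()` are all hypothetical.

```r
set.seed(199)

# Simulated emails: a made-up predictor and a spam label generated
# from a known logistic model.
n <- 500
exclaim <- rpois(n, lambda = 3)
spam <- rbinom(n, size = 1, prob = plogis(-2 + 0.6 * exclaim))
emails <- data.frame(spam, exclaim)

# Logistic regression, then predicted probabilities of spam.
fit <- glm(spam ~ exclaim, data = emails, family = binomial)
p_hat <- predict(fit, type = "response")

# Sensitivity: share of spam flagged as spam.
# Specificity: share of non-spam left alone.
rates <- function(threshold) {
  flagged <- p_hat >= threshold
  c(sensitivity = mean(flagged[emails$spam == 1]),
    specificity = mean(!flagged[emails$spam == 0]))
}

rbind(`0.25` = rates(0.25), `0.50` = rates(0.50), `0.75` = rates(0.75))
```

Lowering the threshold catches more spam (higher sensitivity) at the cost of flagging more legitimate email (lower specificity). Deciding which of those errors is worse is the intuition behind a loss function.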

Slide 29

Slide 29 text

“All models are wrong but some are useful”
George Box, Statistician
from “Robustness in the strategy of scientific model building”

Slide 30

Slide 30 text

and we could keep going on with examples… but let’s talk resources instead!

Slide 31

Slide 31 text

datasciencebox.org

Slide 32

Slide 32 text

datasciencebox.org

Slide 33

Slide 33 text

datasciencebox.org sta199-s24.github.io

Slide 34

Slide 34 text

sta199-s24.github.io

Slide 35

Slide 35 text

datasciencebox.org sta199-s24.github.io openintro.org

Slide 36

Slide 36 text

openintro.org

Slide 37

Slide 37 text

coursera.org

Slide 38

Slide 38 text

and before we wrap up…

Slide 39

Slide 39 text

Who remembers this question? Where is it from?

Slide 40

Slide 40 text

Let’s take a step back…

Measurement  Oxygen concentration     Water temperature
1            11.3                     9
2            7 mg/l                   10 C
3            2.53                     [unable to measure]
4            12.5                     8.2
5            5.3 miligrams per liter  -
…

Where are these measurements recorded? What do the raw measurements look like?

Measurement  Oxygen concentration  Water temperature
1            11.3                  9
2            7                     10
3            2.53
4            12.5                  8.2
5            5.3
…
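Getting from the raw entries to the tidy table takes real decisions. The base R sketch below shows one possible path; the helper `first_number()` is hypothetical, and it assumes every oxygen value is in mg/l and every temperature is in degrees Celsius, which the raw entries do not guarantee.

```r
# Raw entries as they appear on the slide, kept as text.
raw <- data.frame(
  measurement = 1:5,
  oxygen      = c("11.3", "7 mg/l", "2.53", "12.5", "5.3 miligrams per liter"),
  temperature = c("9", "10 C", "[unable to measure]", "8.2", "-"),
  stringsAsFactors = FALSE
)

# Pull out the first number in each entry; entries with no number become NA.
first_number <- function(x) {
  pattern <- "-?[0-9]+(\\.[0-9]+)?"
  out <- rep(NA_real_, length(x))
  has_number <- grepl(pattern, x)
  out[has_number] <- as.numeric(regmatches(x, regexpr(pattern, x)))
  out
}

clean <- data.frame(
  measurement = raw$measurement,
  oxygen      = first_number(raw$oxygen),
  temperature = first_number(raw$temperature)
)

clean
```

Every line of that helper hides a judgment call: dropping units, treating "-" as missing, trusting the first number found. Those are the conversations the raw data makes possible.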

Slide 41

Slide 41 text

Parting thoughts…

Can we get our students to work with the real, raw data?

Can we at least get them to see the raw data and think through what it takes to get to the first histogram, summary table, etc.?

Slide 42

Slide 42 text

thank you!
🔗 bit.ly/stats-ds-ap24
🏡 mine-cr.com