Statistics in the Age of Data Science

Professional Night Talk at the 2024 AP Statistics Grading in Tampa, FL

Mine Cetinkaya-Rundel

June 02, 2024

Transcript

  1. Mine Çetinkaya-Rundel, Duke University. Statistics in the age of Data Science. AP Statistics Reading 2024 Professional Night
  2. Indeed, even Haidt, who has also emphasized broader changes to the culture of childhood, estimated that social media use is responsible for only about 10 percent to 15 percent of the variation in teenage well-being, which would be a significant correlation, given the complexities of adolescent life and of social science, but is also a much more measured estimate than you tend to see in headlines trumpeting the connection. … In Britain, the share of young people who reported “feeling down” or experiencing depression grew from 31 percent in 2012 to 38 percent on the eve of the pandemic and to 41 percent in 2021. That is significant, though by other measures British teenagers appear, if more depressed than they were in the 2000s, not much more depressed than they were in the 1990s. …
  3. Professor of the Practice + Director of Undergraduate Studies @ Duke University, Department of Statistical Science. Developer Educator @ Posit, PBC
  4. Doing data science
     [Diagram of the data science workflow, with stages labelled Import, Tidy, Transform, Visualize, Model, Communicate, Understand, and Program]
     Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.
  5. Topics: data visualisation; data wrangling, tidying, acquisition; exploratory data analysis; predictive modeling + uncertainty quantification; effective communication of results; interactive visualizations; text analysis; machine learning; Bayesian inference; …
     Tools: consistent syntax | R + tidyverse; reproducibility | Quarto; version control and collaboration | Git + GitHub
     Labels on the slide: focus on, emphasize, foray into
  6. population

       # A tibble: 217 × 3
          country              year population
          <chr>               <dbl>      <dbl>
        1 Afghanistan          2022    41129.
        2 Albania              2022     2778.
        3 Algeria              2022    44903.
        4 American Samoa       2022       44.3
        5 Andorra              2022       79.8
        6 Angola               2022    35589.
        7 Antigua and Barbuda  2022       93.8
        8 Argentina            2022    46235.
        9 Armenia              2022     2780.
       10 Aruba                2022      106.
       # ℹ 207 more rows

     continents

       # A tibble: 285 × 4
          entity                code      year continent
          <chr>                 <chr>    <dbl> <chr>
        1 Abkhazia              OWID_ABK  2015 Asia
        2 Afghanistan           AFG       2015 Asia
        3 Akrotiri and Dhekelia OWID_AKD  2015 Asia
        4 Aland Islands         ALA       2015 Europe
        5 Albania               ALB       2015 Europe
        6 Algeria               DZA       2015 Africa
        7 American Samoa        ASM       2015 Oceania
        8 Andorra               AND       2015 Europe
        9 Angola                AGO       2015 Africa
       10 Anguilla              AIA       2015 North America
       # ℹ 275 more rows

     population_continents <- left_join(population, continents, join_by(country == entity))

     ✓ data joins
  7. population_continents |> filter(is.na(continent))

       # A tibble: 6 × 6
         country                   year.x population code  year.y continent
         <chr>                      <dbl>      <dbl> <chr>  <dbl> <chr>
       1 Congo, Dem. Rep.            2022     99010. NA        NA NA
       2 Congo, Rep.                 2022      5970. NA        NA NA
       3 Hong Kong SAR, China        2022      7346. NA        NA NA
       4 Korea, Dem. People's Rep.   2022     26069. NA        NA NA
       5 Korea, Rep.                 2022     51628. NA        NA NA
       6 Kyrgyz Republic             2022      6975. NA        NA NA

     ✓ data joins ✓ data wrangling
  8. population_continent <- population |>
       mutate(country = case_when(
         country == "Congo, Dem. Rep." ~ "Democratic Republic of Congo",
         country == "Congo, Rep." ~ "Congo",
         country == "Hong Kong SAR, China" ~ "Hong Kong",
         country == "Korea, Dem. People's Rep." ~ "North Korea",
         country == "Korea, Rep." ~ "South Korea",
         country == "Kyrgyz Republic" ~ "Kyrgyzstan",
         .default = country
       )) |>
       left_join(continents, by = join_by(country == entity))

     ✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics
  9. ✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations
  10. ✓ data joins ✓ data wrangling ✓ data cleaning ✓ ethics ✓ critique ✓ improving visualizations ✓ mapping ✓ iteration
  11. David Spiegelhalter, Professor, University of Cambridge: “There is no substitute for simply looking at data properly.” From “The Art of Statistics”
  12. ✓ web scraping

      chronicle

        # A tibble: 500 × 6
           title                                             author date       abstract column url
           <chr>                                             <chr>  <date>     <chr>    <chr>  <chr>
         1 All the world’s a stage                           Anna … 2024-02-22 If we a… STUDE… http…
         2 Words that matter: For Alexei Navalny             Carol… 2024-02-22 In some… STUDE… http…
         3 Which would you save: Friend or romantic partn…   Jess … 2024-02-22 Love sh… STUDE… http…
         4 Happiness is not what you’re looking for          Paul … 2024-02-21 We hing… STUDE… http…
         5 Closing Duke's Herbarium: A fear of long-term …   Matth… 2024-02-21 Without… LETTE… http…
         6 CS Majors launch 'ambiguous and labelless rela…   Monda… 2024-02-20 Unlike … STUDE… http…
         7 The fear of being single                          Heidi… 2024-02-20 But it … STUDE… http…
         8 Save the Duke Herbarium                           Henry… 2024-02-17 The Duk… LETTE… http…
         9 What Duke can learn from retiring ex-president…   Rober… 2024-02-17 In Duke… GUEST… http…
        10 Love, love                                        Gabri… 2024-02-16 Somehow… STUDE… http…
        # ℹ 490 more rows
  13. ✓ web scraping ✓ terms of use ✓ ethics

      robotstxt::paths_allowed("https://www.dukechronicle.com")

        www.dukechronicle.com
        [1] TRUE
  14. ✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization
  15. ✓ web scraping ✓ terms of use ✓ ethics ✓ text analysis ✓ data wrangling ✓ data visualization ✓ sentiment analysis
  16. Sarah Jarvis, Director of Applied Machine Learning and Data Science at Secondmind: “Data science is all about asking interesting questions based on the data you have, or often the data you don’t have.”
  17. ✓ logistic regression ✓ classification ✓ decision errors ✓ sensitivity / specificity ✓ intuition around loss functions
  18. George Box, Statistician: “All models are wrong but some are useful.” From “Robustness in the strategy of scientific model building”
  19. Let’s take a step back…

        Measurement  Oxygen concentration     Water temperature
        1            11.3                     9
        2            7 mg/l                   10 C
        3            2.53                     [unable to measure]
        4            12.5                     8.2
        5            5.3 miligrams per liter  -
        …

      Where are these measurements recorded? What do the raw measurements look like?

        Measurement  Oxygen concentration  Water temperature
        1            11.3                  9
        2            7                     10
        3            2.53
        4            12.5                  8.2
        5            5.3
        …
  20. Parting thoughts… Can we get our students to work with the real, raw data? Can we at least get them to see the raw data and think through what it takes to get to the first histogram, summary table, etc.?
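
One way to make the raw-to-clean step from slide 19 concrete is a short base R sketch. This is not code from the talk: the data frame `raw` simply retypes the slide's example values, and `extract_number()` is a hypothetical helper that assumes the first number in each entry is the measurement and that entries without a number should become `NA`.

```r
# Raw field notes, retyped from the slide's example (illustrative only)
raw <- data.frame(
  measurement = 1:5,
  oxygen      = c("11.3", "7 mg/l", "2.53", "12.5", "5.3 miligrams per liter"),
  temperature = c("9", "10 C", "[unable to measure]", "8.2", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the first number found in each entry,
# return NA where the entry contains no number at all
extract_number <- function(x) {
  m <- regexpr("-?[0-9]+(\\.[0-9]+)?", x)  # position of first number, -1 if none
  out <- rep(NA_real_, length(x))
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

# Cleaned version: units and free-text notes are dropped, gaps become NA
clean <- data.frame(
  measurement = raw$measurement,
  oxygen      = extract_number(raw$oxygen),
  temperature = extract_number(raw$temperature)
)

print(clean)
```

Dropping the units silently is itself a decision worth discussing with students: the sketch assumes every oxygen value is in mg/l and every temperature is in degrees Celsius, which the raw notes only partly confirm.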