
Data Science in a(n Ever-Evolving) Box


What should a first course in data science look like for students who have limited or no experience with statistics and programming? How do we teach it in a way that lends itself to iteration as the landscape of data science evolves and that scales to more students and more instructors? In this talk I aim to accomplish two goals in answering these questions: (1) introduce a semester-long, modern introductory data science curriculum, along with its design philosophy, implementation details (particularly as class sizes increase), technical infrastructure, and real examples from course content as well as from student projects; and (2) discuss how I've open-sourced this curriculum at datasciencebox.org for sharing with, and reuse and adaptation by, other instructors, and what it takes to maintain this open-source project as the landscape of data science, data science education curriculum guidelines, and data science tooling evolve.

Mine Cetinkaya-Rundel

June 01, 2023


Transcript

  1. mine-cetinkaya-rundel
    [email protected]
    @minebocek
    MINE ÇETINKAYA-RUNDEL
    DUKE UNIVERSITY + POSIT
    fosstodon.org/@minecr
    (N EVER-EVOLVING)
    🔗 bit.ly/dsbox-evolving


  2. Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science, 2nd Edition.
    Program
    Import Tidy Transform
    Visualize
    Model
    Communicate
    Understand
    DOING DATA SCIENCE

  3. Communicate
    TEACHING DATA SCIENCE


  4. Welcome to the first day of class!

  5. ‣ Go to Posit Cloud
    ‣ Start the project titled UN Votes

  6. ‣ Go to Posit Cloud
    ‣ Start the project titled UN Votes
    ‣ Open the Quarto document called unvotes.qmd

  7. ‣ Go to Posit Cloud
    ‣ Start the project titled UN Votes
    ‣ Open the Quarto document called unvotes.qmd
    ‣ Render the document and review the data visualization you just produced

  8. ‣ Go to Posit Cloud
    ‣ Start the project titled UN Votes
    ‣ Open the Quarto document called unvotes.qmd
    ‣ Render the document and review the data visualization you just produced
    ‣ Then, look for the character string “Turkey” in the code and replace it with another country of your choice
    ‣ Render again, and review how the voting patterns of the country you picked compare to the United States and the United Kingdom

  9. Let’s take a look at the rest of the semester!

  10. Content
    Tooling
    Pedagogy


  11. Content
    Tooling
    Pedagogy


  12. Tooling
    Pedagogy
    Content
    in 3 examples


  13. FISHERIES OF THE WORLD
    1


  14. (no text on slide)

  15. ✴ data joins

    fisheries |> select(country)
    #> # A tibble: 82 × 1
    #>    country
    #>
    #>  1 Angola
    #>  2 Argentina
    #>  3 Australia
    #>  4 Bangladesh
    #>  5 Brazil
    #>  6 Cambodia
    #>  7 Cameroon
    #>  8 Canada
    #>  9 Chad
    #> 10 Chile
    #> # ℹ 72 more rows

    continents
    #> # A tibble: 245 × 2
    #>    country           continent
    #>
    #>  1 Afghanistan       Asia
    #>  2 Åland Islands     Europe
    #>  3 Albania           Europe
    #>  4 Algeria           Africa
    #>  5 American Samoa    Oceania
    #>  6 Andorra           Europe
    #>  7 Angola            Africa
    #>  8 Anguilla          Americas
    #>  9 Antigua & Barbuda Americas
    #> 10 Argentina         Americas
    #> # ℹ 235 more rows

    fisheries <- left_join(fisheries, continents)
    Joining with `by = join_by(country)`


  16. ✴ data joins
    ✴ data science ethics

    fisheries |>
      filter(is.na(continent))
    #> # A tibble: 3 × 5
    #>   country                          capture aquaculture   total continent
    #>
    #> 1 Democratic Republic of the Congo  237372        3161  240533 NA
    #> 2 Hong Kong                         142775        4258  147033 NA
    #> 3 Myanmar                          2072390     1017644 3090034 NA

    fisheries <- fisheries |>
      mutate(
        continent = case_when(
          country == "Democratic Republic of the Congo" ~ "Africa",
          country == "Hong Kong" ~ "Asia",
          country == "Myanmar" ~ "Asia",
          .default = continent
        )
      )
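
The check above finds countries without a continent after the join has run. As a minimal sketch, the same mismatch could also be surfaced before joining with `anti_join()`. The two small tables below are made-up stand-ins for `fisheries` and `continents`; their names, values, and country spellings are hypothetical and are not the course data.

```r
library(dplyr)

# Hypothetical stand-ins for the `fisheries` and `continents` tables.
fisheries_demo <- tibble(
  country = c("Angola", "Hong Kong", "Myanmar"),
  total   = c(1, 2, 3)
)
continents_demo <- tibble(
  country   = c("Angola", "Hong Kong SAR China", "Myanmar (Burma)"),
  continent = c("Africa", "Asia", "Asia")
)

# Rows of the left table whose key has no match in the right table.
unmatched <- anti_join(fisheries_demo, continents_demo, by = join_by(country))

# The left join keeps every row of the left table and fills
# `continent` with NA wherever the key did not match.
joined <- left_join(fisheries_demo, continents_demo, by = join_by(country))
```

Here `unmatched` holds the rows that would end up with a missing `continent` in `joined`, which is the set that then needs a manual decision like the `case_when()` recode above.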


  17. ✴ data joins
    ✴ data science ethics
    ✴ critique
    ✴ improving data visualisations

  18. ✴ data joins
    ✴ data science ethics
    ✴ critique
    ✴ improving data visualisations
    ✴ mapping

  19. Project: Regional differences in average GPA and SAT
    Question: Exploring the regional differences in average GPA and SAT score across the US and the factors that could potentially explain them.
    Team: Mine’s Minions

  20. COVID BRIEFINGS
    2


  21. (no text on slide)

  22. ✴ web scraping
    ✴ text parsing
    ✴ data types
    ✴ regular expressions

  23. ✴ web scraping
    ✴ text parsing
    ✴ data types
    ✴ regular expressions
    ✴ functions
    ✴ iteration

  24. ✴ web scraping
    ✴ text parsing
    ✴ data types
    ✴ regular expressions
    ✴ functions
    ✴ iteration
    ✴ data visualisation
    ✴ interpretation

  25. ✴ web scraping
    ✴ text parsing
    ✴ data types
    ✴ regular expressions
    ✴ functions
    ✴ iteration
    ✴ data visualisation
    ✴ interpretation
    ✴ text analysis

  26. ✴ web scraping
    ✴ text parsing
    ✴ data types
    ✴ regular expressions
    ✴ functions
    ✴ iteration
    ✴ data visualisation
    ✴ interpretation
    ✴ text analysis
    ✴ data science ethics

    robotstxt::paths_allowed("https://www.gov.scot")
    #> www.gov.scot
    #> [1] TRUE

  27. Project: Factors Most Important to University Ranking
    Question: Explore how various metrics (e.g., SAT/ACT scores, admission rate, region, Carnegie classification) predict rankings on the Niche College Ranking List.
    Team: 2cool4school

  28. SPAM FILTERS
    3


  29. ✴ logistic regression
    ✴ prediction

  30. ✴ logistic regression
    ✴ prediction
    ✴ decision errors
    ✴ sensitivity / specificity
    ✴ intuition around loss functions
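
The link between prediction, decision errors, and sensitivity / specificity can be illustrated with a small worked example. This is only a sketch: the predicted probabilities, the labels, and the 0.5 cutoff are all made up for illustration and are not the course's spam data or model.

```r
# Hypothetical predicted probabilities of "spam" for ten emails,
# as a fitted logistic regression might produce (values are made up).
prob   <- c(0.9, 0.8, 0.6, 0.3, 0.1, 0.2, 0.4, 0.05, 0.7, 0.3)
# Hypothetical true labels: 1 = spam, 0 = not spam.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)

# Turn probabilities into decisions with an assumed cutoff of 0.5.
cutoff    <- 0.5
predicted <- as.integer(prob >= cutoff)

# The four possible outcomes of a decision.
true_pos  <- sum(predicted == 1 & actual == 1)  # spam correctly flagged
false_neg <- sum(predicted == 0 & actual == 1)  # spam that reached the inbox
true_neg  <- sum(predicted == 0 & actual == 0)  # legitimate email kept
false_pos <- sum(predicted == 1 & actual == 0)  # legitimate email flagged

# Sensitivity: share of actual spam that was flagged.
sensitivity <- true_pos / (true_pos + false_neg)
# Specificity: share of legitimate email that was kept.
specificity <- true_neg / (true_neg + false_pos)
```

Changing `cutoff` trades one kind of decision error for the other, which is one way to build intuition for why the two errors might be weighted differently.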


  31. Project: Predicting League of Legends success
    Question: Ten minutes into the game, is a gold lead or an experience lead a better predictor of which team wins?
    Team: Blue Squirrels

  32. Project: A Critique of Hollywood Relationship Stereotypes
    Question: How has the average age difference between two actors in an on-screen relationship changed over the years? Furthermore, do on-screen same-sex relationships have a different average age gap than on-screen heterosexual relationships?
    Team: team300

  33. Content
    Tooling
    Pedagogy


  34. live coding: in every “lecture”, along with time for students to attempt exercises on their own
    “minute paper”: weekly online quizzes ending with a brief reflection of the week’s material
    creativity: assignments that make room for creativity
    peer feedback: at various stages of the project
    teams: weekly labs in teams + periodic team evaluations + term project in teams

  35. Çetinkaya-Rundel, Mine, Mine Dogucu, and Wendy Rummerfield. "The 5Ws and 1H of term projects in the introductory data science classroom." Statistics Education Research Journal 21.2 (2022): 4-4.

  36. (no text on slide)

  37. Content
    Pedagogy Tooling


  38. student-facing
    +
    📦 ghclass
    +
    instructor-facing
    📦 checklist
    +
    +
    📦 learnr
    +
    📦 gradethis
    📦 learnrhash
    or another browser/server-based solution

  39. course → organization
    students → members
    assignments → repos

  40. course → organization
    teams → teams
    projects → repos

  41. (no text on slide)

  42. (no text on slide)

  43. Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M. "Implementing version control with Git and GitHub as a learning objective in statistics and data science courses." Journal of Statistics and Data Science Education 29. (2021): S132-S144.

  44. Content
    Tooling
    Pedagogy


  45. Openness + Scalability

  46. datasciencebox.org


  47. datasciencebox.org


  48. github.com/tidyverse/datascience-box


  49. github.com/tidyverse/dsbox


  50. AUDIENCE
    I have been teaching with R for a while, but I want to update my teaching materials
    I’m new to teaching with R and need to build up my course materials
    This teaching slide deck I came across on Twitter is pretty cool, but I have no idea what type of course it belongs in

  51. on
    COMMUNITY


  52. sta199-f22-1.github.io
    EXAMPLE


  53. pos.it/conf
    TRAINING


  54. SCHOLARSHIP
    Çetinkaya-Rundel, Mine, and Victoria Ellison. "A fresh look at introductory data science." Journal of Statistics and Data Science Education 29.sup1 (2021): S16-S26.

  55. (N EVER-EVOLVING)
    mine-cetinkaya-rundel
    [email protected]
    @minebocek
    fosstodon.org/@minecr
    thank you!
    🔗 bit.ly/dsbox-evolving
