Addressing Open Challenges in Data Science Education

Presentation given at Smith College on April 18, 2019.

Abstract: In this talk, I will address some open challenges in data science education. Despite unprecedented and growing interest in data science education on campuses, few courses and course materials give students a meaningful opportunity to learn about real-world challenges. Most courses provide unrealistically clean data sets that fit the assumptions of the methods in an unrealistic way. As a result, students are left unable to effectively analyze data and solve real-world challenges outside of the classroom. To address this problem, I build on the idea from Nolan and Speed (1999), who argued that the solution is to teach courses through in-depth case studies derived from interesting scientific questions with nontrivial solutions that leave room for different analyses of the data. I will share a set of general principles and offer a detailed guide derived from my successful experience developing and teaching graduate-level, introductory data science courses centered entirely on case studies. Furthermore, I will present Open Case Studies, an educational resource of case studies that educators can use in the classroom to teach students how to effectively derive knowledge from data arising from real-world challenges.

Stephanie Hicks

April 18, 2019

Transcript

  1. Addressing Open Challenges in
    Data Science Education
    Stephanie Hicks
    Assistant Professor, Biostatistics
    Johns Hopkins Bloomberg School of Public Health
    Faculty Member
    Johns Hopkins Data Science Lab
    @stephaniehicks

  2. About me
    Teaching: Data Science
    Research: Genomics (analyzing what genes are expressed in individual cells)
    • R/Bioconductor user and developer (since 2009/2010)
    Other fun things about me:
    • Co-founded Baltimore
    • Creating a children’s book featuring women statisticians and data scientists

  3. https://jhudatascience.org

  4. Who are we?
    The “OG”s: Roger, Brian, Jeff
    Joined in 2018: Stephanie

  5. Why data science?
    Data science is the number
    one rated job by Glassdoor
    and there are more than
    350,000 new data science
    jobs expected by 2020.

  6. https://analytics.ncsu.edu/?page_id=4184

  7. So….
    what is data science?

  8. “We hold a broad view
    of data science – we see it
    as the science of extracting
    meaningful information
    from data.”

  10. What if we define
    data science based on what
    a data scientist does?

  11. “Data science” defined by Michael Hochster

  12. “Data science” defined by Michael Hochster

  13. Data science is the science and design of
    1. Actively creating a question to investigate a hypothesis with data
    2. Connecting that question with the collection of appropriate data
    and the application of appropriate methods, algorithms,
    computational tools or languages in a data analysis
    3. Communicating and making decisions based on new or already
    established knowledge derived from the data and data analysis
    Hicks and Peng (2019) arXiv
    (https://arxiv.org/abs/1903.07639)
    …taking this one step further
    (in production or not)
    (with or without user interaction)
    (weak or strong coders)
    (in particular domains or not)
    (one-way communication or feedback loop)

  15. Data science is the science and design of
    1. Actively creating a question to investigate a hypothesis with data
    2. Connecting that question with the collection of appropriate data
    and the application of appropriate methods, algorithms,
    computational tools or languages in a data analysis
    3. Communicating and making decisions based on new or already
    established knowledge derived from the data and data analysis
    Hicks and Peng (2019) arXiv
    (https://arxiv.org/abs/1903.07639)

  16. Rest of the talk
    1. Lessons learned when teaching intro data science courses
    Put the problem first; teach with case studies with
    non-trivial solutions solving real-world challenges with data
    2. Some open challenges in data science education
    - How to describe variation across data analyses?
    - How to evaluate quality of data analyses?

  17. http://cs109.github.io/2014/
    (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python)
    http://datasciencelabs.github.io/2016/
    (Harvard SPH – Biostats – 150 students – online, in person – 8 TAs – R)
    https://jhu-advdatasci.github.io/2018/
    (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)

  18. My poster presented at the Women in Statistics and Data Science Conference in Fall 2016
    Transforming the Classroom to Teach Statistics and
    Data Science with Active Learning
    Stephanie Hicks
    Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health
    Goal
    Develop a curriculum for an applied statistics and
    data science course using active learning techniques
    What is Active Learning?
    “Anything course-related that all students in a class
    session are called upon to do other than simply
    watching, listening and taking notes.”
    - Felder & Brent (2009)
    Using Active Learning to teach courses in data science and statistics
    Oct 1, 2015
    Feb 3, 2015
    •  Teach R software/tools needed for a complete data analysis
    •  Use git/GitHub for assignments to learn version control
    •  Teach collaborative practices with group projects
    •  Final Project: analyze dataset of choice & create website
    and 2 min screencast summarizing results
    •  Focus on key statistical concepts (and less math details)
    •  Minimize “traditional” slides/lectures, note-taking
    •  Maximize hands-on code in class using RStudio & R Markdown
    •  Use “mini assessments” & Google Polls to get live feedback
    •  Motivate concepts with real world data problems
    Course websites
    1.  Data Science (Harvard CS 109) (http://cs109.github.io/2014/)
    2.  Introduction to Data Science (HSPH BIO 260) (http://datasciencelabs.github.io)
    http://r4ds.had.co.nz/intro.html
    ggplot2, dplyr, tidyr, readr, stringr, lubridate, broom, h5r, rvest, jsonlite
    Contact information:
    @stephaniehicks
    [email protected]

  20. Data Science in Academia?
    • Statistics was born directly from developing solutions to practical
    data analysis problems
    • Galton, Ronald Fisher
    • Wild and Pfannkuch (1999) describe applied statistics as:
    “part of the information gathering and learning process which, in an
    ideal world, is undertaken to inform decisions and actions. With industry,
    medicine and many other sectors of society increasingly relying on data
    for decision making, statistics should be an integral part of the emerging
    information era.”
    • A department that embraces applied statistics as defined above is a natural
    home for data science in academia

  21. Got it, but what’s
    missing in the current
    statistics curriculum?

  22. What is missing in the current statistics
    curriculum?
    Wild and Pfannkuch (1999) complained that:
    “Large parts of the investigative process, such as problem analysis and
    measurement, have been largely abandoned by statisticians and
    statistics educators to the realm of the particular, perhaps to be
    developed separately within other disciplines.”
    They add that “[t]he arid, context-free landscape on which so many
    examples used in statistics teaching are built ensures that large
    numbers of students never even see, let alone engage in, statistical
    thinking.”

  23. What is missing in the current statistics
    curriculum? Computing
    • Need more computing in the curriculum

  24. What is missing in the current statistics
    curriculum? Computing, Connecting
    • Need more computing in the curriculum
    • Need to teach how to connect the subject matter question to
    appropriate dataset and analysis tools

  25. What is missing in the current statistics
    curriculum? Computing, Connecting, Creating
    • Need more computing in the curriculum
    • Need to teach how to connect the subject matter question to
    appropriate dataset and analysis tools
    • Instead of being passive, teach students to be active and how to create
    and formulate questions to investigate hypotheses with data

  26. Bridging the gap in the classroom to teach
    introductory data science courses

  27. Bridging the gap in the classroom to teach
    introductory data science courses
    • Educators need to be experienced themselves in creating, connecting
    and computing
    • Encourage applied statisticians experienced in creating, connecting,
    and computing to become involved in the development of courses
    • Encourage statistics departments to reach out to practicing data
    analysts, perhaps in other departments or from other disciplines, to
    collaborate in developing these courses

  28. Principles of Teaching Data Science
    • Organize the course around a set of diverse case studies
    • Integrate computing into every aspect of the course
    • Teach abstraction, but minimize reliance on mathematical notation
    • Structure course activities to realistically mimic a data scientist’s
    experience
    • Demonstrate the importance of critical thinking / skepticism
    through examples

  29. So you want to teach with case
    studies too but not sure where to
    start?

  30. https://opencasestudies.github.io

  31. http://cs109.github.io/2014/
    (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python)
    http://datasciencelabs.github.io/2016/
    (Harvard SPH – Biostats – 150 students – online, in person – 5 TAs – R)
    https://jhu-advdatasci.github.io/2018/
    (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)

  32. This is me
    literally all of fall
    2018

  33. Me in Sept-Dec 2018:
    “Students are actually
    learning how to
    analyze data with case
    studies!”

  34. Me in Jan 2019: after grading
    final projects, realizing it’s not
    sufficient to just teach with
    case studies…
    Me: sad and struggling to put
    into words why I’m frustrated
    when evaluating data analyses

  35. Me in Jan 2019: searching the web and literature for how
    to evaluate quality of data analyses (or even more simply
    describe differences or variation across data analyses)

  36. Can we define a language to
    describe differences or
    variation across data analyses?

  37. Data science is the science and design of
    1. Actively creating a question to investigate a hypothesis with data
    2. Connecting that question with the collection of appropriate data
    and the application of appropriate methods, algorithms,
    computational tools or languages in a data analysis
    3. Communicating and making decisions based on new or already
    established knowledge derived from the data and data analysis
    Hicks and Peng (2019) arXiv
    (https://arxiv.org/abs/1903.07639)

  39. What are the elements of a
    data analysis?

  40. Elements of data analysis

  41. Principles of data analysis

  48. What can the elements and
    principles be used for?

  49. How to select informative
    elements?

  50. How to evaluate quality of
    a data analysis?
    - Success?
    - Validity?
    - Honesty?

  51. Feel free to send comments/questions:
    Twitter: @stephaniehicks
    Email: [email protected]
    #rladies
    Thank you!
    Normal distribution
    Weibull distribution
    Poisson distribution
