Addressing Open Challenges in Data Science Education

Presentation given at Smith College on April 18, 2019.

Abstract: In this talk, I will address some open challenges in data science education. Despite unprecedented and growing interest in data science education on campuses, few courses and course materials give students a meaningful opportunity to learn about real-world challenges. Most courses provide unrealistically clean data sets that neatly fit the assumptions of the methods being taught. As a result, students are left unable to effectively analyze data and solve real-world challenges outside of the classroom. To address this problem, I build on an idea from Nolan and Speed (1999), who argued that the solution is to teach courses through in-depth case studies derived from interesting scientific questions with nontrivial solutions that leave room for different analyses of the data. I will share a set of general principles and offer a detailed guide derived from my successful experience developing and teaching graduate-level, introductory data science courses centered entirely on case studies. Furthermore, I will present Open Case Studies, an educational resource of case studies that educators can use in the classroom to teach students how to effectively derive knowledge from real-world data.

Stephanie Hicks

April 18, 2019

Transcript

  1. Addressing Open Challenges in Data Science Education. Stephanie Hicks, Assistant Professor, Biostatistics, Johns Hopkins Bloomberg School of Public Health; Faculty Member, Johns Hopkins Data Science Lab. @stephaniehicks
  2. ABOUT ME. Teaching: Data Science. Research: Genomics (analyzing what genes are expressed in individual cells) • R/Bioconductor user and developer (since 2009/2010). Other fun things about me: • Co-founded Baltimore • Creating a children’s book featuring women statisticians and data scientists. JOHNS HOPKINS BLOOMBERG SCHOOL OF PUBLIC HEALTH
  3. https://jhudatascience.org

  4. Who are we? The “OG”s: ROGER, BRIAN, JEFF. Joined in 2018: STEPHANIE
  5. Why data science? Data science is the number one rated job by Glassdoor, and more than 350,000 new data science jobs are expected by 2020.
  6. https://analytics.ncsu.edu/?page_id=4184

  7. So…. what is data science?

  8. “We hold a broad view of data science – we see it as the science of extracting meaningful information from data.”
  9. None
  10. What if we define data science based on what a data scientist does?
  11. “Data science” defined by Michael Hochster

  12. “Data science” defined by Michael Hochster

  13. Data science is the science and design of 1. Actively creating a question to investigate a hypothesis with data 2. Connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis 3. Communicating and making decisions based on new or already established knowledge derived from the data and data analysis. Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639). …taking this one step further: (in production or not) (with or without user interaction) (weak or strong coders) (in particular domains or not) (one-way communication or feedback loop)
  14. None
  15. Data science is the science and design of 1. Actively creating a question to investigate a hypothesis with data 2. Connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis 3. Communicating and making decisions based on new or already established knowledge derived from the data and data analysis. Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)
  16. Rest of the talk: 1. Lessons learned when teaching intro data science courses: put the problem first; teach with case studies with non-trivial solutions solving real-world challenges with data. 2. Some open challenges in data science education: - How to describe variation across data analyses? - How to evaluate quality of data analyses?
  17. http://cs109.github.io/2014/ (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python); http://datasciencelabs.github.io/2016/ (Harvard SPH – Biostats – 150 students – online, in person – 8 TAs – R); https://jhu-advdatasci.github.io/2018/ (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)
  18. My poster presented at the Women in Statistics and Data Science Conference in Fall 2016: “Transforming the Classroom to Teach Statistics and Data Science with Active Learning.” Stephanie Hicks, Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health. Contact information: @stephaniehicks, shicks@jimmy.harvard.edu. Goal: Develop a curriculum for an applied statistics and data science course using active learning techniques. What is Active Learning? “Anything course-related that all students in a class session are called upon to do other than simply watching, listening and taking notes.” - Felder & Brent (2009). Using Active Learning to teach courses in data science and statistics: • Teach R software/tools needed for a complete data analysis • Use git/GitHub for assignments to learn version control • Teach collaborative practices with group projects • Final Project: analyze dataset of choice & create website and 2 min screencast summarizing results • Focus on key statistical concepts (and less math details) • Minimize “traditional” slides/lectures, note-taking • Maximize hands-on code in class using RStudio & RMarkdown • Use “mini assessments” & Google Polls to get live feedback • Motivate concepts with real world data problems. Course websites: 1. Data Science (Harvard CS 109) (http://cs109.github.io/2014/) 2. Introduction to Data Science (HSPH BIO 260) (http://datasciencelabs.github.io). R packages shown: ggplot2, dplyr, tidyr, readr, stringr, lubridate, broom, httr, rvest, jsonlite (http://r4ds.had.co.nz/intro.html). Dates shown: Oct 1, 2015; Feb 3, 2015.
  19. None
  20. Data Science in Academia? • Statistics was born directly from developing solutions to practical data analysis problems • Galton, Ronald Fisher • Wild and Pfannkuch (1999) describe applied statistics as: “part of the information gathering and learning process which, in an ideal world, is undertaken to inform decisions and actions. With industry, medicine and many other sectors of society increasingly relying on data for decision making, statistics should be an integral part of the emerging information era.” • A department that embraces applied statistics as defined above is a natural home for data science in academia
  21. Got it, but what’s missing in current statistics curriculum?

  22. What is missing in the current statistics curriculum? Wild and Pfannkuch (1999) complained that: “Large parts of the investigative process, such as problem analysis and measurement, have been largely abandoned by statisticians and statistics educators to the realm of the particular, perhaps to be developed separately within other disciplines.” They add that “[t]he arid, context-free landscape on which so many examples used in statistics teaching are built ensures that large numbers of students never even see, let alone engage in, statistical thinking.”
  23. What is missing in the current statistics curriculum? Computing • Need more computing in the curriculum
  24. What is missing in the current statistics curriculum? Computing, Connecting • Need more computing in the curriculum • Need to teach how to connect the subject matter question to appropriate dataset and analysis tools
  25. What is missing in the current statistics curriculum? Computing, Connecting, Creating • Need more computing in the curriculum • Need to teach how to connect the subject matter question to appropriate dataset and analysis tools • Instead of being passive, teach students to be active and how to create and formulate questions to investigate hypotheses with data
  26. Bridging the gap in the classroom to teach introductory data science courses
  27. Bridging the gap in the classroom to teach introductory data science courses • Educators need to be experienced themselves in creating, connecting and computing • Encourage applied statisticians experienced in creating, connecting, and computing to become involved in the development of courses • Encourage statistics departments to reach out to practicing data analysts, perhaps in other departments or from other disciplines, to collaborate in developing these courses
  28. Principles of Teaching Data Science • Organize the course around a set of diverse case studies • Integrate computing into every aspect of the course • Teach abstraction, but minimize reliance on mathematical notation • Structure course activities to realistically mimic a data scientist’s experience • Demonstrate the importance of critical thinking / skepticism through examples
  29. So you want to teach with case studies too but are not sure where to start?
  30. https://opencasestudies.github.io

  31. http://cs109.github.io/2014/ (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python); http://datasciencelabs.github.io/2016/ (Harvard SPH – Biostats – 150 students – online, in person – 5 TAs – R); https://jhu-advdatasci.github.io/2018/ (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)
  32. This is me literally all of fall 2018

  33. Me in Sept-Dec 2018: “Students are actually learning how to analyze data with case studies!”
  34. Me in Jan 2019: after grading final projects, realizing it’s not sufficient to just teach with case studies… Me: sad and struggling to put into words why I’m frustrated when evaluating data analyses
  35. Me in Jan 2019: searching the web and literature for how to evaluate the quality of data analyses (or, even more simply, how to describe differences or variation across data analyses)
  36. Can we define a language to describe differences or variation across data analyses?
  37. Data science is the science and design of 1. Actively creating a question to investigate a hypothesis with data 2. Connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis 3. Communicating and making decisions based on new or already established knowledge derived from the data and data analysis. Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)
  38. None
  39. What are the elements of a data analysis?

  40. Elements of data analysis

  41. Principles of data analysis

  42. None
  43. None
  44. None
  45. None
  46. None
  47. None
  48. What can the elements and principles be used for?

  49. How to select informative elements?

  50. How to evaluate quality of a data analysis? - Success? - Validity? - Honesty?
  51. Thank you! Feel free to send comments/questions: Twitter: @stephaniehicks Email: shicks19@jhu.edu #rladies [Figure labels: Normal distribution, Weibull distribution, Poisson distribution]