Addressing Open Challenges in Data Science Education

Presentation given at Smith College on April 18, 2019.

Abstract: In this talk, I will address some open challenges in data science education. Despite unprecedented and growing interest in data science education on campuses, few courses and course materials give students a meaningful opportunity to learn about real-world challenges. Most courses provide unrealistically clean data sets that neatly fit the assumptions of the methods being taught. As a result, students are left unable to effectively analyze data and solve real-world challenges outside of the classroom. To address this problem, I build on the idea from Nolan and Speed (1999), who argued that the solution is to teach courses through in-depth case studies derived from interesting scientific questions with nontrivial solutions that leave room for different analyses of the data. I will share a set of general principles and offer a detailed guide derived from my successful experience developing and teaching graduate-level, introductory data science courses centered entirely on case studies. Furthermore, I will present Open Case Studies, an educational resource of case studies that educators can use in the classroom to teach students how to effectively derive knowledge from data arising from real-world challenges.

Stephanie Hicks

April 18, 2019

Transcript

  1. Addressing Open Challenges in Data Science Education. Stephanie Hicks, Assistant Professor, Biostatistics, Johns Hopkins Bloomberg School of Public Health; Faculty Member, Johns Hopkins Data Science Lab. @stephaniehicks
  2. About me. Teaching: Data Science. Research: Genomics (analyzing what genes are expressed in individual cells) • R/Bioconductor user and developer (since 2009/2010). Other fun things about me: • Co-founded Baltimore • Creating a children’s book featuring women statisticians and data scientists. Johns Hopkins Bloomberg School of Public Health
  3. Why data science? Data science is rated the number one job by Glassdoor, and more than 350,000 new data science jobs are expected by 2020.
  4. “We hold a broad view of data science – we see it as the science of extracting meaningful information from data.”
  5. Data science is the science and design of (1) actively creating a question to investigate a hypothesis with data; (2) connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis; (3) communicating and making decisions based on new or already established knowledge derived from the data and data analysis. Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639). …taking this one step further: (in production or not) (with or without user interaction) (weak or strong coders) (in particular domains or not) (one-way communication or feedback loop)
  6. Data science is the science and design of (1) actively creating a question to investigate a hypothesis with data; (2) connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis; (3) communicating and making decisions based on new or already established knowledge derived from the data and data analysis. Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)
  7. Rest of the talk: (1) Lessons learned when teaching intro data science courses: put the problem first; teach with case studies that have non-trivial solutions and solve real-world challenges with data. (2) Some open challenges in data science education: How to describe variation across data analyses? How to evaluate the quality of data analyses?
  8. http://cs109.github.io/2014/ (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python); http://datasciencelabs.github.io/2016/ (Harvard SPH – Biostats – 150 students – online, in person – 8 TAs – R); https://jhu-advdatasci.github.io/2018/ (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)
  9. • Teach R software/tools needed for a complete data analysis • Use git/GitHub for assignments to learn version control • Teach collaborative practices with group projects • Final Project: analyze dataset of choice & create website and 2 min screencast summarizing results • Focus on key statistical concepts (and less math details) • Minimize “traditional” slides/lectures, note-taking • Maximize hands-on code in class using RStudio & RMarkdown • Use “mini assessments” & Google Polls to get live feedback • Motivate concepts with real world data problems. Transforming the Classroom to Teach Statistics and Data Science with Active Learning. What is Active Learning? “Anything course-related that all students in a class session are called upon to do other than simply watching, listening and taking notes.” - Felder & Brent (2009). http://r4ds.had.co.nz/intro.html Course websites: 1. Data Science (Harvard CS 109) (http://cs109.github.io/2014/) 2. Introduction to Data Science (HSPH BIO 260) (http://datasciencelabs.github.io). Using Active Learning to teach courses in data science and statistics. Oct 1, 2015. Feb 3, 2015. Goal: Develop a curriculum for an applied statistics and data science course using active learning techniques. ggplot2 dplyr tidyr readr stringr lubridate broom h5r rvest jsonlite. Stephanie Hicks, Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health. Contact information: @stephaniehicks [email protected] My poster presented at the Women in Statistics and Data Science Conference in Fall 2016
  10. Data Science in Academia? • Statistics was born directly from developing solutions to practical problems by data analysis • Galton, Ronald Fisher • Wild and Pfannkuch (1999) describe applied statistics as: “part of the information gathering and learning process which, in an ideal world, is undertaken to inform decisions and actions. With industry, medicine and many other sectors of society increasingly relying on data for decision making, statistics should be an integral part of the emerging information era.” • A department that embraces applied statistics as defined above is a natural home for data science in academia
  11. What is missing in the current statistics curriculum? Wild and Pfannkuch (1999) complained that: “Large parts of the investigative process, such as problem analysis and measurement, have been largely abandoned by statisticians and statistics educators to the realm of the particular, perhaps to be developed separately within other disciplines.” They add that “[t]he arid, context-free landscape on which so many examples used in statistics teaching are built ensures that large numbers of students never even see, let alone engage in, statistical thinking.”
  12. What is missing in the current statistics curriculum? Computing, Connecting • Need more computing in the curriculum • Need to teach how to connect the subject matter question to appropriate dataset and analysis tools
  13. What is missing in the current statistics curriculum? Computing, Connecting, Creating • Need more computing in the curriculum • Need to teach how to connect the subject matter question to appropriate dataset and analysis tools • Instead of being passive, teach students to be active and how to create and formulate questions to investigate hypotheses with data
  14. Bridging the gap in the classroom to teach introductory data science courses • Educators need to be experienced themselves in creating, connecting and computing • Encourage applied statisticians experienced in creating, connecting, and computing to become involved in the development of courses • Encourage statistics departments to reach out to practicing data analysts, perhaps in other departments or from other disciplines, to collaborate in developing these courses
  15. Principles of Teaching Data Science • Organize the course around a set of diverse case studies • Integrate computing into every aspect of the course • Teach abstraction, but minimize reliance on mathematical notation • Structure course activities to realistically mimic a data scientist’s experience • Demonstrate the importance of critical thinking / skepticism through examples
  16. http://cs109.github.io/2014/ (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python); http://datasciencelabs.github.io/2016/ (Harvard SPH – Biostats – 150 students – online, in person – 5 TAs – R); https://jhu-advdatasci.github.io/2018/ (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)
  17. Me in Jan 2019: after grading final projects, realizing it’s not sufficient to just teach with case studies… Me: sad and struggling to put into words why I’m frustrated when evaluating data analyses
  18. Me in Jan 2019: searching the web and literature for how to evaluate the quality of data analyses (or, even more simply, how to describe differences or variation across data analyses)
  19. Data science is the science and design of (1) actively creating a question to investigate a hypothesis with data; (2) connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis; (3) communicating and making decisions based on new or already established knowledge derived from the data and data analysis. Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)
  20. Feel free to send comments/questions: Twitter: @stephaniehicks Email: [email protected] #rladies Thank you! Normal distribution, Weibull distribution, Poisson distribution