Slide 1

Slide 1 text

Addressing Open Challenges in Data Science Education
Stephanie Hicks
Assistant Professor, Biostatistics, Johns Hopkins Bloomberg School of Public Health
Faculty Member, Johns Hopkins Data Science Lab
@stephaniehicks

Slide 2

Slide 2 text

ABOUT ME
Teaching: Data Science
Research: Genomics (analyzing what genes are expressed in individual cells)
• R/Bioconductor user and developer (since 2009/2010)
Other fun things about me:
• Co-founded Baltimore
• Creating a children’s book featuring women statisticians and data scientists
JOHNS HOPKINS BLOOMBERG SCHOOL OF PUBLIC HEALTH

Slide 3

Slide 3 text

https://jhudatascience.org

Slide 4

Slide 4 text

Who are we?
The “OG”s: Roger, Brian, Jeff
Joined in 2018: Stephanie

Slide 5

Slide 5 text

Why data science? Data science is the number one rated job by Glassdoor and there are more than 350,000 new data science jobs expected by 2020.

Slide 6

Slide 6 text

https://analytics.ncsu.edu/?page_id=4184

Slide 7

Slide 7 text

So… what is data science?

Slide 8

Slide 8 text

“We hold a broad view of data science – we see it as the science of extracting meaningful information from data.”

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

What if we define data science based on what a data scientist does?

Slide 11

Slide 11 text

“Data science” defined by Michael Hochster

Slide 12

Slide 12 text

“Data science” defined by Michael Hochster

Slide 13

Slide 13 text

…taking this one step further
Data science is the science and design of
1. Actively creating a question to investigate a hypothesis with data
2. Connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis
3. Communicating and making decisions based on new or already established knowledge derived from the data and data analysis
Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)
(in production or not) (with or without user interaction) (weak or strong coders) (in particular domains or not) (one-way communication or feedback loop)

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Data science is the science and design of
1. Actively creating a question to investigate a hypothesis with data
2. Connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis
3. Communicating and making decisions based on new or already established knowledge derived from the data and data analysis
Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)

Slide 16

Slide 16 text

Rest of the talk
1. Lessons learned when teaching intro data science courses
Put the problem first; teach with case studies with non-trivial solutions solving real-world challenges with data
2. Some open challenges in data science education
- How to describe variation across data analyses?
- How to evaluate quality of data analyses?

Slide 17

Slide 17 text

http://cs109.github.io/2014/ (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python)
http://datasciencelabs.github.io/2016/ (Harvard SPH – Biostats – 150 students – online, in person – 8 TAs – R)
https://jhu-advdatasci.github.io/2018/ (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)

Slide 18

Slide 18 text

Transforming the Classroom to Teach Statistics and Data Science with Active Learning
Using Active Learning to teach courses in data science and statistics
Stephanie Hicks, Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health
Contact information: @stephaniehicks [email protected]
Goal
Develop a curriculum for an applied statistics and data science course using active learning techniques
What is Active Learning?
“Anything course-related that all students in a class session are called upon to do other than simply watching, listening and taking notes.” - Felder & Brent (2009)
• Teach R software/tools needed for a complete data analysis
• Use git/GitHub for assignments to learn version control
• Teach collaborative practices with group projects
• Final Project: analyze dataset of choice & create website and 2 min screencast summarizing results
• Focus on key statistical concepts (and less math details)
• Minimize “traditional” slides/lectures, note-taking
• Maximize hands-on code in class using RStudio & RMarkdown
• Use “mini assessments” & Google Polls to get live feedback
• Motivate concepts with real world data problems
Course websites
1. Data Science (Harvard CS 109) (http://cs109.github.io/2014/)
2. Introduction to Data Science (HSPH BIO 260) (http://datasciencelabs.github.io)
Oct 1, 2015 Feb 3, 2015
http://r4ds.had.co.nz/intro.html
ggplot2 dplyr tidyr readr stringr lubridate broom httr rvest jsonlite
My poster presented at the Women in Statistics and Data Science Conference in Fall 2016

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Data Science in Academia?
• Statistics was born directly from developing solutions to practical data analysis problems
• Galton, Ronald Fisher
• Wild and Pfannkuch (1999) describe applied statistics as: “part of the information gathering and learning process which, in an ideal world, is undertaken to inform decisions and actions. With industry, medicine and many other sectors of society increasingly relying on data for decision making, statistics should be an integral part of the emerging information era.”
• A department that embraces applied statistics as defined above is a natural home for data science in academia

Slide 21

Slide 21 text

Got it, but what’s missing in the current statistics curriculum?

Slide 22

Slide 22 text

What is missing in the current statistics curriculum? Wild and Pfannkuch (1999) complained that: “Large parts of the investigative process, such as problem analysis and measurement, have been largely abandoned by statisticians and statistics educators to the realm of the particular, perhaps to be developed separately within other disciplines.” They add that “[t]he arid, context-free landscape on which so many examples used in statistics teaching are built ensures that large numbers of students never even see, let alone engage in, statistical thinking.”

Slide 23

Slide 23 text

What is missing in the current statistics curriculum?
Computing
• Need more computing in the curriculum

Slide 24

Slide 24 text

What is missing in the current statistics curriculum?
Computing, Connecting
• Need more computing in the curriculum
• Need to teach how to connect the subject matter question to appropriate dataset and analysis tools

Slide 25

Slide 25 text

What is missing in the current statistics curriculum?
Computing, Connecting, Creating
• Need more computing in the curriculum
• Need to teach how to connect the subject matter question to appropriate dataset and analysis tools
• Instead of being passive, teach students to be active and how to create and formulate questions to investigate hypotheses with data

Slide 26

Slide 26 text

Bridging the gap in the classroom to teach introductory data science courses

Slide 27

Slide 27 text

Bridging the gap in the classroom to teach introductory data science courses
• Educators need to be experienced themselves in creating, connecting and computing
• Encourage applied statisticians experienced in creating, connecting, and computing to become involved in the development of courses
• Encourage statistics departments to reach out to practicing data analysts, perhaps in other departments or from other disciplines, to collaborate in developing these courses

Slide 28

Slide 28 text

Principles of Teaching Data Science
• Organize the course around a set of diverse case studies
• Integrate computing into every aspect of the course
• Teach abstraction, but minimize reliance on mathematical notation
• Structure course activities to realistically mimic a data scientist’s experience
• Demonstrate the importance of critical thinking / skepticism through examples

Slide 29

Slide 29 text

So you want to teach with case studies too, but are not sure where to start?

Slide 30

Slide 30 text

https://opencasestudies.github.io

Slide 31

Slide 31 text

http://cs109.github.io/2014/ (Harvard University – CS – over 400 students – online, in person – 25 TAs – Python)
http://datasciencelabs.github.io/2016/ (Harvard SPH – Biostats – 150 students – online, in person – 5 TAs – R)
https://jhu-advdatasci.github.io/2018/ (Johns Hopkins SPH – 25 students – in person – PhD Biostats – 2 TAs – R)

Slide 32

Slide 32 text

This is me literally all of fall 2018

Slide 33

Slide 33 text

Me in Sept-Dec 2018: “Students are actually learning how to analyze data with case studies!”

Slide 34

Slide 34 text

Me in Jan 2019: after grading final projects, realizing it’s not sufficient to just teach with case studies…
Me: sad and struggling to put into words why I’m frustrated when evaluating data analyses

Slide 35

Slide 35 text

Me in Jan 2019: searching the web and literature for how to evaluate quality of data analyses (or even more simply describe differences or variation across data analyses)

Slide 36

Slide 36 text

Can we define a language to describe differences or variation across data analyses?

Slide 37

Slide 37 text

Data science is the science and design of
1. Actively creating a question to investigate a hypothesis with data
2. Connecting that question with the collection of appropriate data and the application of appropriate methods, algorithms, computational tools or languages in a data analysis
3. Communicating and making decisions based on new or already established knowledge derived from the data and data analysis
Hicks and Peng (2019) arXiv (https://arxiv.org/abs/1903.07639)

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

What are the elements of a data analysis?

Slide 40

Slide 40 text

Elements of data analysis

Slide 41

Slide 41 text

Principles of data analysis

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

What can the elements and principles be used for?

Slide 49

Slide 49 text

How to select informative elements?

Slide 50

Slide 50 text

How to evaluate quality of a data analysis?
- Success?
- Validity?
- Honesty?

Slide 51

Slide 51 text

Thank you!
Feel free to send comments/questions:
Twitter: @stephaniehicks
Email: [email protected]
#rladies
Normal distribution, Weibull distribution, Poisson distribution