Mine Cetinkaya-Rundel
November 02, 2020

# The art and science of teaching data science

Abstract: Modern statistics is fundamentally a computational discipline, but too often this fact is not reflected in our statistics curricula. With the rise of data science it has become increasingly clear that students want, expect, and need explicit training in this area of the discipline. Additionally, recent curricular guidelines clearly state that working with data requires extensive computing skills and that statistics students should be fluent in accessing, manipulating, analyzing, and modeling with professional statistical analysis software. In this talk, we introduce the design philosophy behind an introductory data science course and discuss in-progress and future research on student learning, as well as new directions in assessment and tooling as we scale up the course.

## Transcript

1. Image credit: Thomas Pedersen, data-imaginist.com/art
the art and science
of teaching data science
mine çetinkaya-rundel
bit.ly/ds-art-sci-ares
mine-cetinkaya-rundel
@minebocek

2. 2016 GAISE
1. Teach statistical thinking.
‣ Teach statistics as an investigative process of problem-solving and decision making.
Students should not leave their introductory statistics course with the mistaken impression
that statistics consists of an unrelated collection of formulas and methods. Rather,
students should understand that statistics is a problem-solving and decision making
process that is fundamental to scientific inquiry and essential for making sound decisions.
‣ Give students experience with multivariable thinking. We live in a complex world in
which the answer to a question often depends on many factors. Students will encounter
such situations within their own fields of study and everyday lives. We must prepare our
students to answer challenging questions that require them to investigate and explore
relationships among many variables. Doing so will help them to appreciate the value of
statistical thinking and methods.
2. Focus on conceptual understanding.
3. Integrate real data with a context and purpose.
4. Foster active learning.
5. Use technology to explore concepts and analyse data.
6. Use assessments to improve and evaluate student learning.
amstat.org/asa/files/pdfs/GAISE/GaiseCollege_Full.pdf

3. 2016 GAISE
1 NOT a commonly used subset of tests and intervals and produce them with hand calculations

4. 2016 GAISE
2 Multivariate analysis requires the use of computing

5. 2016 GAISE
3 NOT use technology that is only applicable in the intro course or that doesn’t follow good science principles

6. 2016 GAISE
4 Data analysis isn’t just inference and modelling, it’s also data importing, cleaning, preparation, exploration, and visualisation

7. a course that satisfies these four points is looking more like today’s intro data science courses than (most) intro stats courses
but this is not because intro stats is inherently […]
instead it is because it’s time to visit intro stats in light of the emergence of data science

8. fundamentals of
data & data viz,
confounding variables,
+
R / RStudio,
R Markdown, simple Git
tidy data, data frames
vs. summary tables,
recoding &
transforming,
web scraping & iteration
+
collaboration on GitHub

9. building & selecting
models,
visualising interactions,
prediction & validation,
inference via simulation

10. data science ethics,
text analysis,
Bayesian inference
+
communication &
dissemination

12. ‣ Go to RStudio Cloud
‣ Start the project titled UN Votes

13. ‣ Open the R Markdown document called unvotes.Rmd

14. ‣ Knit the document and review the data visualisation you just produced

15. ‣ Then, look for the character string “Turkey” in the code and replace it with another country
‣ Knit again, and review how the voting patterns of the country you picked
compare to those of the United States and United Kingdom & Northern Ireland

16. three questions that keep me up at night…
1 what should students learn?
2 how will students learn best?
3 what tools will enhance student learning?

17. three questions that keep me up at night…
1 what should students learn? → content
2 how will students learn best? → pedagogy
3 what tools will enhance student learning? → infrastructure

18. content

19. ex. 1
fisheries of the world

20. ✴ data joins

21. ✴ data science ethics

22. ✴ critique
✴ improving data visualisations

23. ✴ mapping

24. Project: 2016 US Election Redux
Question: Would the outcome of the 2016 US Presidential Election have been
different had Bernie Sanders been the Democratic candidate?
Team: 4 Squared

25. ex. 2
First Minister’s COVID briefings

26. ✴ web scraping
✴ text parsing
✴ data types
✴ regular expressions

27. ✴ functions
✴ iteration

28. ✴ data visualisation
✴ interpretation

29. ✴ text analysis

30. ✴ data science ethics
robotstxt::paths_allowed("https://www.gov.scot")
#> www.gov.scot
#> [1] TRUE

31. Project: The North South Divide: University Edition
Question: Does the geographical location of a UK university affect its
university score?
Team: Fried Egg Jelly Fish

32. ex. 3
spam filters

33. ✴ logistic regression
✴ prediction

34. ✴ decision errors
✴ sensitivity / specificity
✴ intuition around loss functions
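
The decision-error ideas on this slide can be illustrated with a small base R sketch. The slides do not show the course's own code or data for the spam example, so everything here is an assumption made for illustration: the labels, the predicted probabilities, and the helper names `classify` and `decision_summary` are all hypothetical.

```r
# Toy labels, invented purely for illustration: 1 = spam, 0 = not spam.
truth <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)

# Hypothetical predicted spam probabilities, standing in for the output
# of a fitted logistic regression model.
pred_prob <- c(0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.3, 0.7, 0.05, 0.55)

# Turn probabilities into spam / not-spam decisions at a given cutoff.
classify <- function(prob, cutoff) as.integer(prob >= cutoff)

# Sensitivity: share of actual spam that is flagged.
# Specificity: share of legitimate email that is let through.
decision_summary <- function(truth, predicted) {
  tp <- sum(predicted == 1 & truth == 1)
  fn <- sum(predicted == 0 & truth == 1)
  tn <- sum(predicted == 0 & truth == 0)
  fp <- sum(predicted == 1 & truth == 0)
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

at_050 <- decision_summary(truth, classify(pred_prob, 0.50))
at_075 <- decision_summary(truth, classify(pred_prob, 0.75))

print(at_050)
print(at_075)
```

On this toy data, raising the cutoff from 0.5 to 0.75 lowers sensitivity from 0.75 to 0.5 while raising specificity from about 0.67 to 1. That trade-off is one way into the loss-function intuition: which cutoff is preferable depends on how costly a missed spam message is relative to a legitimate email sent to the spam folder.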

35. Project: Spotify Top 100 Tracks of 2017/18
Question: Is it possible to predict the year a song made the Top Tracks
Team: weR20
year ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_s
2017
name | artists
I'm the One | DJ Khaled
Redbone | Childish Gambino
Sign of the Times | Harry Styles
2018
name | artists
Everybody Dies In Their Nightmares | XXXTENTACION
Jocelyn Flores | XXXTENTACION
Plug Walk | Rich The Kid
Moonlight | XXXTENTACION
Nevermind | Dennis Lloyd
In My Mind | Dynoro
changes | XXXTENTACION

36. pedagogy

37. teams: weekly labs in teams + periodic team evaluations + term project in teams
peer feedback: used minimally so far, but positive experience
“minute paper”: weekly online quizzes ending with a brief reflection of the week’s material

38. # A tibble: 19 x 2
   bigram                  n
 1 question 7             19
 2 question 8             16
 3 questions 7            12
 4 join function           9
 5 question 2              9
 6 choice questions        7
 7 first question          7
 8 multiple choice         7
10 necessarily improve     6
11 join functions          5
12 question 1              5
13 7 8                     4
14 airline names           4
15 data frames             4
16 feel like               4
17 many options            4
19 x axis                  4
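
A bigram count table like the one on this slide could be produced along the following lines. This is a minimal base R sketch, not the code behind the slide: the `reflections` text is made up, `count_bigrams` is a hypothetical helper, and no stop words are removed, which the original analysis may well have done. In a tidyverse course, a package such as tidytext would be the more usual route; base R is used here only to keep the sketch dependency-free.

```r
# Made-up reflections, standing in for real "minute paper" responses.
reflections <- c(
  "I struggled with question 7 on the join function",
  "Question 7 was confusing and so was the join function",
  "The multiple choice questions had too many options"
)

# Count adjacent word pairs (bigrams) within each response.
count_bigrams <- function(texts) {
  bigrams <- unlist(lapply(texts, function(txt) {
    # Lowercase, then split on anything that is not a letter or digit.
    words <- strsplit(tolower(txt), "[^a-z0-9]+")[[1]]
    words <- words[words != ""]
    if (length(words) < 2) return(character(0))
    # Pair each word with the word that follows it.
    paste(head(words, -1), tail(words, -1))
  }))
  counts <- sort(table(bigrams), decreasing = TRUE)
  data.frame(
    bigram = names(counts),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

bigram_counts <- count_bigrams(reflections)
print(head(bigram_counts, 5))
```

Bigrams are built within each response rather than across the whole text, so the last word of one student's reflection is never paired with the first word of the next.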

39. creativity: assignments that make room for creativity

40. infrastructure & tooling

41. student-facing
+

ghclass
+
instructor-facing

checklist
+
+

learnr
+

parsermd

learnrhash

42. ghclass

43. ghclass

44. openness

45. on

46. Image credit:
Thomas Pedersen, data-imaginist.com/art
the art and science
of teaching data science
mine çetinkaya-rundel
mine-cetinkaya-rundel
@minebocek
bit.ly/ds-art-sci-ares