Slide 1

Slide 1 text

1 WHAT IS DATA SCIENCE? Jeff Goldsmith, PhD Department of Biostatistics

Slide 2

Slide 2 text

2 Data science is pretty new

Slide 4

Slide 4 text

3 Some not-great definitions
• Data science = statistics
• Data science = computer science
• Data science = machine learning
• Data science = statistics + computer science + machine learning
• Data scientists are big data wranglers
• “A data scientist is just a sexier word for statistician.” – Nate Silver
• “A data scientist is a better computer scientist than a statistician and is a better statistician than a computer scientist.”
• “A data scientist is a statistician who is useful” – Hadley Wickham
• A data scientist is a good statistical analyst
• A data scientist is a statistician who codes in Python

Slide 6

Slide 6 text

4 Maybe pictures will help? Image from Drew Conway

Slide 7

Slide 7 text

5 Maybe pictures will help? https://blog.zhaw.ch/datascience/the-data-science-skill-set/

Slide 8

Slide 8 text

6 Why these definitions are bad
• “Data science is just …” definitions miss the point
  – If data science were just statistics (or machine learning, or computer science, or engineering), we wouldn’t need a new term, let alone a new discipline
  – The popularity of “data science” suggests that there’s a newly recognized need
• “A data scientist is a good [whatever]” definitions aren’t helpful
  – They’re almost deliberately judgmental
  – A good definition doesn’t depend on opinions
  – There are “data scientists” in each discipline, but some very good statisticians / computer scientists / etc. aren’t “data scientists”

Slide 9

Slide 9 text

7 Why these definitions are bad
• “Data science is the combination of these 40 skills …” definitions are unrealistic
https://www.youtube.com/watch?v=b9ZLXwAuUyw&app=desktop

Slide 10

Slide 10 text

8 Why these definitions are good
• Kinda like the blind men and the elephant – no one perspective is completely right or completely wrong, but piling them all up isn’t right either
• They give a sense of what is valued by the data science community – using data in a principled way and coding well

Slide 11

Slide 11 text

9 Why these definitions are good
• Data science is interdisciplinary
  – You do need a breadth of skills
  – You also need a particular mindset – curiosity and engagement are critical
  – You need some domain knowledge to be successful
https://www.xkcd.com/1831/

Slide 12

Slide 12 text

10 For the purpose of this class:
Data science is the study of formulating and rigorously answering questions using a data-centric process that emphasizes clarity, reproducibility, effective communication, and ethical practices.
• We’ll focus mostly on process; how to formulate and answer questions through analyses is the focus of other courses
• This is also a “bad” definition, in that it doesn’t explain where data science came from

Slide 13

Slide 13 text

11 ISI 2017

Slide 15

Slide 15 text

12 “What is the point of ‘data science’? Aren’t we already data scientists?” First question from the audience

Slide 16

Slide 16 text

12 “What is the point of ‘data science’? Aren’t we already data scientists?” First question from the audience
[Slide fills with emoji: laughing, clapping, and celebrating faces, followed by facepalms, eye-rolls, and angry faces]

Slide 17

Slide 17 text

13 “A data scientist is a statistician who’s useful” Response from Hadley Wickham (roughly)

Slide 18

Slide 18 text

13 “A data scientist is a statistician who’s useful” Response from Hadley Wickham (roughly)
[Slide fills with emoji: laughing, clapping, and celebrating faces, followed by facepalms, eye-rolls, and angry faces]

Slide 19

Slide 19 text

14 That question is understandable
• It’s easy, in 2024, to forget what the statistical identity crisis phase was like
• But that was a whole thing, for quite a while

Slide 21

Slide 21 text

15 What made “data science” happen
• Data science emerged in parallel to (at least) six broad trends:
  – Big data
  – Emphasis on prediction
  – Reproducibility crisis in science
  – Interdisciplinary research
  – Diversity, equity, and inclusion
  – Everything should be on the internet
• These weren’t new in 2012 and aren’t unique to data science
• … but they had a big impact on the “data science” perspective

Slide 22

Slide 22 text

16 Connotation >> definition
• Core data science values aren’t built into the definition, but were critical to the valence of “data science”

Slide 23

Slide 23 text

17 Public Health Data Science [Public health] data science is the study of formulating and rigorously answering questions [in order to advance health and well-being] using a data-centric process that emphasizes clarity, reproducibility, effective communication, and ethical practices.

Slide 24

Slide 24 text

18 “Public Health” is the important part
• Public health training emphasizes some elements that are critical to data science thinking and work:
  – Study design
  – Sampling process
  – Measurement process
  – Desire vs ability to infer causation
  – Cross-disciplinary collaboration
  – Engagement with data ethics
  – Public dissemination and dialog

Slide 25

Slide 25 text

18 “Public Health” is the important part
• Public health training emphasizes some elements that are critical to data science thinking and work:
  – Study design
  – Sampling process
  – Measurement process
  – Desire vs ability to infer causation
  – Cross-disciplinary collaboration
  – Engagement with data ethics
  – Public dissemination and dialog
From “Total Survey Error: Past, Present, and Future” (Groves and Lyberg) via “Data Alone Isn’t Ground Truth” by Angela Bassa

Slide 26

Slide 26 text

19 How to learn data science
• Build a broad knowledge base
• Don’t be embarrassed by what you don’t know
  – Corollary: don’t be a jerk to people who don’t know what you know
• Ask questions (well) and keep learning
• Pretty much the same as learning anything, but hard because people don’t like to show their code

Slide 28

Slide 28 text

20 How to learn data science
• All questions are good questions, but sometimes good questions aren’t asked well
• Think through what you’re trying to ask
• If your code is broken, create a simple example that illustrates what’s broken
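As one hypothetical illustration of the last point (not taken from the slides), suppose a summary over a large dataset fails. A simple example shrinks the problem to a few hard-coded values that anyone can run. The function name `average` and the data below are invented for this sketch.

```python
# Hypothetical minimal example: the smallest input that still shows the problem.
# Nothing here comes from the course materials; names and values are invented.

def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Three hard-coded values stand in for the real dataset.
# One is stored as text, which is what breaks the calculation.
small_input = [1.0, 2.5, "3.0"]

result = None
error_message = None
try:
    result = average(small_input)
except TypeError as err:
    # Recording the exact error makes the question easier to answer.
    error_message = f"{type(err).__name__}: {err}"
    print(error_message)

# Once the cause is visible, the fix is small: convert text to numbers first.
fixed = average([float(v) for v in small_input])
print(fixed)
```

A question built around something this small, along with the exact error message, is usually much easier for someone else to answer than one that requires the full dataset and script.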

Slide 29

Slide 29 text

21 How to learn data science
• Build up your “known knowns”
• Recognize your “known unknowns”
• Avoid “unknown unknowns”

Slide 30

Slide 30 text

22 Real talk about AI (as part of data science)

Slide 31

Slide 31 text

23 What about LLMs? https://twitter.com/dsmerdon/status/1618816703923912704

Slide 33

Slide 33 text

24 Large Language Models https://www.nature.com/articles/d41586-023-00107-z

Slide 34

Slide 34 text

24 Large Language Models https://www.nature.com/articles/d41586-023-00107-z From Kathy McKeown, Vishal Misra, Zhou Yu

Slide 35

Slide 35 text

25 ChatGPT is “AI”
• It learns, in ways that are difficult to interrogate, from input data
• Even with curation, this can go badly
https://twitter.com/spiantado/status/1599462375887114240

Slide 36

Slide 36 text

26 ChatGPT can program for you

Slide 37

Slide 37 text

27 ChatGPT can program for you … sort of

Slide 40

Slide 40 text

28 Where is the value in being a “programmer” now?

Slide 41

Slide 41 text

28 Where is the value in being a “programmer” now?
Ben has some professional coding experience of his own, but it was brief, shallow, and now about twenty years out of date. GPT-4 on its own is, for the moment, a worse programmer than I am. Ben is much worse. But Ben plus GPT-4 is a dangerous thing.

Slide 42

Slide 42 text

28 Where is the value in being a “programmer” now?
Ben has some professional coding experience of his own, but it was brief, shallow, and now about twenty years out of date. GPT-4 on its own is, for the moment, a worse programmer than I am. Ben is much worse. But Ben plus GPT-4 is a dangerous thing.
I still feel secure in my profession. In fact, I feel somewhat more secure than before. ... The thing I’m relatively good at is knowing what’s worth building, what users like, how to communicate both technically and humanely.
I suspect that ... we will think of “the programmer” the way we now look back on “the computer,” when that phrase referred to a person who did calculations by hand.

Slide 43

Slide 43 text

29 A data science analogy 1910s

Slide 44

Slide 44 text

29 A data science analogy 1910s 1969 / 1970

Slide 45

Slide 45 text

30 Reproducibility
• One concrete emphasis of data science is reproducibility
• Given the same data and the same code, anyone should be able to produce the same results
  – Code is an important means of communication
  – New tools encourage reproducibility, but the concept is not platform-dependent
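A minimal sketch of "same data, same code, same results", assuming an analysis that involves random resampling. The function `run_analysis`, the data values, and the choice of a bootstrap are all illustrative assumptions, not something the slides specify. The point is only that fixing the random seed inside the code lets anyone rerun it and get identical numbers.

```python
import random
import statistics

def run_analysis(data, seed):
    """Bootstrap the mean of `data`; the seed fixes the random resampling."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    boot_means = []
    for _ in range(200):
        # Resample the data with replacement and record the mean.
        resample = rng.choices(data, k=len(data))
        boot_means.append(statistics.mean(resample))
    # Return the bootstrap estimate and its spread.
    return statistics.mean(boot_means), statistics.stdev(boot_means)

# Invented data for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]

first = run_analysis(data, seed=1)
second = run_analysis(data, seed=1)

# Same data + same code + same seed gives exactly the same result.
print(first == second)
```

Passing the seed as an argument, rather than relying on a global random state, keeps the source of randomness visible in the code that gets shared.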

Slide 46

Slide 46 text

31 Sharing code
• Openness is valuable – identify errors early and fix them quickly
• Try to think of sharing code as a gesture of confidence and humility
  – You’ve done your best, and you should feel good about that
  – Everyone makes mistakes sometimes; when you do, that’s fine – fix it and move on
• Lack of transparency can reflect a lot of things
• Of these, arrogance is the most dangerous

Slide 47

Slide 47 text

32 Choosing data science tools

Slide 49

Slide 49 text

33 Time to code!!