Slide 1

Slide 1 text

Am I a data scientist? Alyssa Frazee, Stripe @acfrazee

Slide 2

Slide 2 text

© 2009 Lisa Slavid

Slide 3

Slide 3 text

statistician a data scientist © 2009 Lisa Slavid

Slide 4

Slide 4 text

Where I’m coming from Math undergrad Biostatistics PhD “Machine Learning Engineer” today Recurse Center (née Hacker School) 2010

Slide 5

Slide 5 text

Am I a data scientist?

Slide 6

Slide 6 text

Am I a data scientist? What do I really mean by this question?

Slide 7

Slide 7 text

Am I a data scientist? What do I really mean by this question? Could I get a job offer with a title of “data scientist?”

Slide 8

Slide 8 text

Am I a data scientist? What do I really mean by this question? Am I preparing my students to be able to get job offers with a title of “data scientist?”

Slide 9

Slide 9 text

Am I a data scientist? What do I really mean by this question? Could I get a job offer with a title of “data scientist?” → sometimes implicitly industry → and sometimes specifically tech

Slide 10

Slide 10 text

What’s “data science”?

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

data skills spectrum

Slide 13

Slide 13 text

theoretical statistics software engineering

Slide 14

Slide 14 text

theoretical statistics software engineering data science

Slide 15

Slide 15 text

understanding quantitative data building a product data science

Slide 16

Slide 16 text

output: numerical results output: usable software data science

Slide 17

Slide 17 text

Am I a statistician?
points for:
● Am in a grad program called [bio]statistics
● Know things about martingales and the delta method
● Can explain what a p-value is and interpret linear regression coefficients
points against:
● Haven’t proved a theorem since 2011
● Spend more time writing bash scripts than inventing estimators
● No publications in statistics journals

Slide 18

Slide 18 text

Or am I a data scientist?
points for:
● Can program in more than one language
● Actively use git & GitHub
● Have written R packages and reproducible reports
● Once made a web app and also a D3.js graph
points against:
● Not working in industry
● Have never written a SQL query more complicated than select * from table
● Understanding of Hadoop, Spark, and AWS is vague at best
● Have never written production code

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Idea! I will listen to what experts in our field say!
Camp #1: Data science is just a rebranding of applied statistics.
Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
Camp #3: Statistics is irrelevant to data science.

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

First: do I want to be a data scientist?

Slide 23

Slide 23 text

Second: Does it matter?

Slide 24

Slide 24 text

Am I on the job market? Am I hiring?

Slide 25

Slide 25 text

If you decide it matters: some distinguishing features

Slide 26

Slide 26 text

Intentionality about programming

Slide 27

Slide 27 text

Intentionality about programming
Spending time thinking primarily about:
● code efficiency
● version control
● code quality (cleanliness, modularity)
● documentation / usability
● unit testing
● systematic debugging
● giving and receiving code review
● and other principles of software engineering

Slide 28

Slide 28 text

Interest in schleppy-but-practical projects

Slide 29

Slide 29 text

Interest in schleppy-but-practical projects
● figuring out how to get the data you need
● combining existing tools/methods in new ways
● finding the simplest solution that works in practice

Slide 30

Slide 30 text

Focus on concrete decision-making

Slide 31

Slide 31 text

Focus on concrete decision-making
Less about inference and parameter estimation, more about what action should be taken

Slide 32

Slide 32 text

Camp #1: Data science is just a rebranding of applied statistics.
Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
Camp #3: Statistics is irrelevant to data science.

Slide 33

Slide 33 text

Perspective from the other side Camp #1: Data science is just a rebranding of applied statistics.

Slide 34

Slide 34 text

Perspective from the other side Camp #1: Data science is just a rebranding of applied statistics. Intentionality about programming

Slide 35

Slide 35 text

Perspective from the other side Camp #1: Data science is just a rebranding of applied statistics. The day-to-day work is different!

Slide 36

Slide 36 text

Perspective from the other side
Last month I:
● wrote Ruby, Scala, CoffeeScript, and Python
● fought with Maven
● backfilled some busted tables in our databases
● investigated the mystery of why some of our cluster boxes are overworked
● learned how to be on call (so I can fix some of Stripe if it breaks at 3am)
● helped teach a SQL class
● and did some statistics

Slide 37

Slide 37 text

Perspective from the other side
Camp #3: Statistics is irrelevant to data science.

Slide 38

Slide 38 text

Perspective from the other side
Camp #3: Statistics is irrelevant to data science.

Slide 39

Slide 39 text

Perspective from the other side
Statistics and data science are overlapping. Neither is a subset of the other.

Slide 40

Slide 40 text

About that identity crisis: Program intentionally and be a data scientist, if you want!

Slide 41

Slide 41 text

About that identity crisis: Or don’t! Statistics is hugely important and relevant in its own right!

Slide 42

Slide 42 text

Further reading:
● http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/
● http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/
● https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
● http://datascopeanalytics.com/blog/what-is-a-data-scientist/