Slide 1

Slide 1 text

Why Data Scientists Stink at Software Engineering
Alex K Gold, RStudio Workweek 2019

Slide 2

Slide 2 text

Hello RStudio!
● Background in Math and Econ
  ○ Proud econ PhD dropout
● Think tank work
  ○ Economic policy
  ○ Microsimulation modeling
● “Data Science”
  ○ Ran voter outreach experiments for progressive causes and candidates
  ○ Ran small data science consulting team

Slide 3

Slide 3 text

Am I talking to you?

Slide 4

Slide 4 text

Data Scientists Stink at:
● Version control.
● Testing (esp test-driven development).
They should do these things...
But they often don’t...

Slide 5

Slide 5 text

Version Control

Slide 6

Slide 6 text

The difference...
Delivery:
● Data Scientist: Client Report
● Software Engineer: Merge to master

Slide 7

Slide 7 text

The Result...
Alex: Please use version control.
Team: Nah.

Slide 8

Slide 8 text

Testing

Slide 9

Slide 9 text

All of Data Science in 1 Slide*
Data: call, firm_size, it_sophistication
● Prediction: Pr(call) ~ f(firm_size, it_sophistication)
● Inference: call = β0 + β1 * firm_size + β2 * it_sophistication
● Clustering: Group A, Group B, ....
*Well, mostly.
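A minimal sketch of the prediction case, to make Pr(call) concrete. The slide leaves f unspecified; the logistic form used here is an assumption, and the function name and coefficient values are hypothetical, chosen only for illustration.

```python
import math


def predict_call_probability(firm_size, it_sophistication, b0, b1, b2):
    """Return Pr(call), ASSUMING f is a logistic function of a linear score."""
    linear = b0 + b1 * firm_size + b2 * it_sophistication
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and inputs, not estimated from any real data.
p = predict_call_probability(firm_size=50, it_sophistication=3,
                             b0=-2.0, b1=0.01, b2=0.5)
print(round(p, 3))  # linear score is 0 here, so this prints 0.5
```

Prediction cares about the output p. Inference would instead care about the values of b0, b1 and b2 themselves.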

Slide 10

Slide 10 text

Let’s Do Test-Driven Development...
Don’t know answer...
● If I knew, I wouldn’t need data.
Unit tests?
● Just algorithmic validity...
Model metrics? R², AUC, F1, Gini, RMSE
Pr(call) or β0, β1, β2 or Group A, B
QA and testing should be done…
● Can’t really specify ahead
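The "just algorithmic validity" point can be sketched as a unit test. The example below assumes a hand-rolled least-squares fit (the name fit_linear and the synthetic rows are hypothetical, not from the talk). The test can confirm that the fitting code recovers coefficients we planted ourselves, but it cannot say what the coefficients should be on real client data.

```python
def fit_linear(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of (firm_size, it_sophistication) pairs; y: list of outcomes.
    Returns [b0, b1, b2].
    """
    X = [[1.0, float(a), float(b)] for a, b in rows]
    k = 3
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def test_fit_recovers_known_coefficients():
    # Synthetic, noiseless data generated from coefficients we chose ourselves.
    true_b = [1.0, 2.0, -3.0]
    rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
    y = [true_b[0] + true_b[1] * a + true_b[2] * b for a, b in rows]
    est = fit_linear(rows, y)
    assert all(abs(e - t) < 1e-8 for e, t in zip(est, true_b))


test_fit_recovers_known_coefficients()
```

The test passes only because the right answer was known in advance. With real data, there is no true_b to assert against.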

Slide 11

Slide 11 text

Bottom Line

Slide 12

Slide 12 text

The difference...

Deliverable
● Software Engineers: Working Code
● Data Scientist: Code? Paper, Slide deck, Dashboard, Model

Best Practices
● Software Engineers: Version Control, Automated Testing/TDD
● Data Scientist: ?????, ?????*

Thank you!
➔ Empathy?
➔ It’s not hopeless
➔ Talk to me about “agile data science”