Am I a data scientist? What do I really mean by this question?
Am I preparing my students to be able to get job offers with a title of “data scientist”?
Am I a data scientist? What do I really mean by this question?
Could I get a job offer with a title of “data scientist”?
→ sometimes implicitly industry
→ and sometimes specifically tech
Am I a statistician?
points for:
● Am in a grad program called [bio]statistics
● Know things about martingales and the delta method
● Can explain what a p-value is and interpret linear regression coefficients
points against:
● Haven’t proved a theorem since 2011
● Spend more time writing bash scripts than inventing estimators
● No publications in statistics journals
Or am I a data scientist?
points for:
● Can program in more than one language
● Actively use git & GitHub
● Have written R packages and reproducible reports
● Once made a web app and also a D3.js graph
points against:
● Not working in industry
● Have never written a SQL query more complicated than select * from table
● Understanding of Hadoop, Spark, and AWS is vague at best
● Have never written production code
Idea! I will listen to what experts in our field say!
Camp #1: Data science is just a rebranding of applied statistics.
Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
Camp #3: Statistics is irrelevant to data science.
Intentionality about programming
Spending time thinking primarily about:
● code efficiency
● version control
● code quality (cleanliness, modularity)
● documentation / usability
● unit testing
● systematic debugging
● giving and receiving code review
● and other principles of software engineering
Interest in schleppy-but-practical projects
● figuring out how to get the data you need
● combining existing tools/methods in new ways
● finding the simplest solution that works in practice
Camp #1: Data science is just a rebranding of applied statistics.
Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
Camp #3: Statistics is irrelevant to data science.
Perspective from the other side
Last month I:
● wrote Ruby, Scala, CoffeeScript, and Python
● fought with Maven
● backfilled some busted tables in our databases
● investigated the mystery of why some of our cluster boxes are overworked
● learned how to be on call (so I can fix some of Stripe if it breaks at 3am)
● helped teach a SQL class
● and did some statistics