Am I a data scientist?

Slides from my talk at JSM 2015, in the session: "The Statistics Identity Crisis: Are We Really Data Scientists?" https://www.amstat.org/meetings/jsm/2015/onlineprogram/ActivityDetails.cfm?SessionID=211266

Alyssa Frazee

August 11, 2015

Transcript

  1. 4.

    Where I’m coming from
    Math undergrad
    Biostatistics PhD
    “Machine Learning Engineer” today
    Recurse Center (née Hacker School)
    2010
  2. 7.

    Am I a data scientist?
    What do I really mean by this question?
    Could I get a job offer with a title of “data scientist?”
  3. 8.

    Am I a data scientist?
    What do I really mean by this question?
    Am I preparing my students to be able to get job offers with a title of “data scientist?”
  4. 9.

    Am I a data scientist?
    What do I really mean by this question?
    Could I get a job offer with a title of “data scientist?”
    → sometimes implicitly industry
    → and sometimes specifically tech
  5. 11.
  6. 17.

    Am I a statistician?
    points for:
    • Am in a grad program called [bio]statistics
    • Know things about martingales and the delta method
    • Can explain what a p-value is and interpret linear regression coefficients
    points against:
    • Haven’t proved a theorem since 2011
    • Spend more time writing bash scripts than inventing estimators
    • No publications in statistics journals
  7. 18.

    Or am I a data scientist?
    points for:
    • Can program in more than one language
    • Actively use git & GitHub
    • Have written R packages and reproducible reports
    • Once made a web app and also a D3.js graph
    points against:
    • Not working in industry
    • Have never written a SQL query more complicated than select * from table
    • Understanding of Hadoop, Spark, and AWS is vague at best
    • Have never written production code
  8. 19.
  9. 20.

    Idea! I will listen to what experts in our field say!
    Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.
  10. 21.
  11. 27.

    Intentionality about programming
    Spending time thinking primarily about:
    • code efficiency
    • version control
    • code quality (cleanliness, modularity)
    • documentation / usability
    • unit testing
    • systematic debugging
    • giving and receiving code review
    • and other principles of software engineering
  12. 29.

    Interest in schleppy-but-practical projects
    • figuring out how to get the data you need
    • combining existing tools/methods in new ways
    • finding the simplest solution that works in practice
  13. 32.

    Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.
  14. 33.

    Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
  15. 34.

    Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    Intentionality about programming
  16. 35.

    Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    The day-to-day work is different!
  17. 36.

    Perspective from the other side
    Last month I:
    • wrote Ruby, Scala, Coffeescript, and Python
    • fought with maven
    • backfilled some busted tables in our databases
    • investigated the mystery of why some of our cluster boxes are overworked
    • learned how to be on call (so I can fix some of Stripe if it breaks at 3am)
    • helped teach a SQL class
    • and did some statistics
  18. 39.

    Statistics and data science are overlapping. Neither is a subset of the other.
    Perspective from the other side