Am I a data scientist?

Slides from my talk at JSM 2015, in the session: "The Statistics Identity Crisis: Are We Really Data Scientists?" https://www.amstat.org/meetings/jsm/2015/onlineprogram/ActivityDetails.cfm?SessionID=211266

Alyssa Frazee

August 11, 2015

Transcript

  1. Am I a data scientist? Alyssa Frazee, Stripe @acfrazee

  2. © 2009 Lisa Slavid

  3. statistician a data scientist © 2009 Lisa Slavid

  4. Where I’m coming from: Math undergrad, Biostatistics PhD, “Machine Learning Engineer” today, Recurse Center (née Hacker School), 2010

  5. Am I a data scientist?

  6. Am I a data scientist? What do I really mean by this question?

  7. Am I a data scientist? What do I really mean by this question?
    Could I get a job offer with a title of “data scientist”?

  8. Am I a data scientist? What do I really mean by this question?
    Am I preparing my students to be able to get job offers with a title of “data scientist”?

  9. Am I a data scientist? What do I really mean by this question?
    Could I get a job offer with a title of “data scientist”?
    → sometimes implicitly industry
    → and sometimes specifically tech

  10. What’s “data science”?

  11. None
  12. data skills spectrum

  13. theoretical statistics | software engineering

  14. theoretical statistics | software engineering | data science

  15. understanding quantitative data | building a product | data science

  16. output: numerical results | output: usable software | data science

  17. Am I a statistician?
    points for:
    • Am in a grad program called [bio]statistics
    • Know things about martingales and the delta method
    • Can explain what a p-value is and interpret linear regression coefficients
    points against:
    • Haven’t proved a theorem since 2011
    • Spend more time writing bash scripts than inventing estimators
    • No publications in statistics journals

  18. Or am I a data scientist?
    points for:
    • Can program in more than one language
    • Actively use git & GitHub
    • Have written R packages and reproducible reports
    • Once made a web app and also a D3.js graph
    points against:
    • Not working in industry
    • Have never written a SQL query more complicated than select * from table
    • Understanding of Hadoop, Spark, and AWS is vague at best
    • Have never written production code

  19. None
  20. Idea! I will listen to what experts in our field say!
    Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.

  21. None
  22. First: do I want to be a data scientist?

  23. Second: Does it matter?

  24. Am I on the job market? Am I hiring?

  25. If you decide it matters: some distinguishing features

  26. Intentionality about programming

  27. Intentionality about programming
    Spending time thinking primarily about:
    • code efficiency
    • version control
    • code quality (cleanliness, modularity)
    • documentation / usability
    • unit testing
    • systematic debugging
    • giving and receiving code review
    • and other principles of software engineering

  28. Interest in schleppy-but-practical projects

  29. Interest in schleppy-but-practical projects
    • figuring out how to get the data you need
    • combining existing tools/methods in new ways
    • finding the simplest solution that works in practice

  30. Focus on concrete decision-making

  31. Focus on concrete decision-making: less about inference and parameter estimation, more about what action should be taken

  32. Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.

  33. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.

  34. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    Intentionality about programming

  35. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    The day-to-day work is different!

  36. Perspective from the other side
    Last month I:
    • wrote Ruby, Scala, CoffeeScript, and Python
    • fought with Maven
    • backfilled some busted tables in our databases
    • investigated the mystery of why some of our cluster boxes are overworked
    • learned how to be on call (so I can fix some of Stripe if it breaks at 3am)
    • helped teach a SQL class
    • and did some statistics

  37. Perspective from the other side
    Camp #3: Statistics is irrelevant to data science.

  38. Perspective from the other side
    Camp #3: Statistics is irrelevant to data science.

  39. Perspective from the other side
    Statistics and data science are overlapping. Neither is a subset of the other.

  40. About that identity crisis: Program intentionally and be a data scientist, if you want!

  41. About that identity crisis: Or don’t! Statistics is hugely important and relevant in its own right!

  42. Further reading:
    • http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/
    • http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/
    • https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
    • http://datascopeanalytics.com/blog/what-is-a-data-scientist/