Am I a data scientist?

Slides from my talk at JSM 2015, in the session: "The Statistics Identity Crisis: Are We Really Data Scientists?" https://www.amstat.org/meetings/jsm/2015/onlineprogram/ActivityDetails.cfm?SessionID=211266

Alyssa Frazee

August 11, 2015

Transcript

  1. Am I a data scientist?
    Alyssa Frazee, Stripe
    @acfrazee

  2. © 2009 Lisa Slavid

  3. statistician
    a data scientist
    © 2009 Lisa Slavid

  4. Where I’m coming from
    Math undergrad
    Biostatistics PhD
    “Machine Learning Engineer”
    Recurse Center (née Hacker School)
    (timeline: 2010 to today)

  5. Am I a data scientist?

  6. Am I a data scientist?
    What do I really mean by this question?

  7. Am I a data scientist?
    What do I really mean by this question?
    Could I get a job offer with a title of “data scientist”?

  8. Am I a data scientist?
    What do I really mean by this question?
    Am I preparing my students to be able to get job offers with a title of “data scientist”?

  9. Am I a data scientist?
    What do I really mean by this question?
    Could I get a job offer with a title of “data scientist”?
    → sometimes implicitly industry
    → and sometimes specifically tech

  10. What’s “data science”?

  11. (no transcribed text)

  12. data skills spectrum

  13. theoretical statistics
    software engineering

  14. theoretical statistics
    software engineering
    data science

  15. understanding quantitative data
    building a product
    data science

  16. output: numerical results
    output: usable software
    data science

  17. Am I a statistician?
    points for:
    ● Am in a grad program called [bio]statistics
    ● Know things about martingales and the delta method
    ● Can explain what a p-value is and interpret linear regression coefficients
    points against:
    ● Haven’t proved a theorem since 2011
    ● Spend more time writing bash scripts than inventing estimators
    ● No publications in statistics journals

  18. Or am I a data scientist?
    points for:
    ● Can program in more than one language
    ● Actively use git & GitHub
    ● Have written R packages and reproducible reports
    ● Once made a web app and also a D3.js graph
    points against:
    ● Not working in industry
    ● Have never written a SQL query more complicated than select * from table
    ● Understanding of Hadoop, Spark, and AWS is vague at best
    ● Have never written production code

  19. (no transcribed text)

  20. Idea! I will listen to what experts in our field say!
    Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.

  21. (no transcribed text)

  22. First: Do I want to be a data scientist?

  23. Second: Does it matter?

  24. Am I on the job market?
    Am I hiring?

  25. If you decide it matters:
    some distinguishing features

  26. Intentionality about programming

  27. Intentionality about programming
    Spending time thinking primarily about:
    ● code efficiency
    ● version control
    ● code quality (cleanliness, modularity)
    ● documentation / usability
    ● unit testing
    ● systematic debugging
    ● giving and receiving code review
    ● and other principles of software engineering

  28. Interest in schleppy-but-practical projects

  29. Interest in schleppy-but-practical projects
    ● figuring out how to get the data you need
    ● combining existing tools/methods in new ways
    ● finding the simplest solution that works in practice

  30. Focus on concrete decision-making

  31. Focus on concrete decision-making
    less about inference and parameter estimation, more about what action should be taken

  32. Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.

  33. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.

  34. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    Intentionality about programming

  35. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    The day-to-day work is different!

  36. Perspective from the other side
    Last month I:
    ● wrote Ruby, Scala, CoffeeScript, and Python
    ● fought with Maven
    ● backfilled some busted tables in our databases
    ● investigated the mystery of why some of our cluster boxes are overworked
    ● learned how to be on call (so I can fix some of Stripe if it breaks at 3am)
    ● helped teach a SQL class
    ● and did some statistics

  37. Perspective from the other side
    Camp #3: Statistics is irrelevant to data science.

  38. Perspective from the other side
    Camp #3: Statistics is irrelevant to data science.

  39. Perspective from the other side
    Statistics and data science are overlapping. Neither is a subset of the other.

  40. About that identity crisis:
    Program intentionally and be a data scientist, if you want!

  41. About that identity crisis:
    Or don’t! Statistics is hugely important and relevant in its own right!

  42. Further reading:
    ● http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/
    ● http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/
    ● https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
    ● http://datascopeanalytics.com/blog/what-is-a-data-scientist/
