Am I a data scientist?

Slides from my talk at JSM 2015, in the session: "The Statistics Identity Crisis: Are We Really Data Scientists?" https://www.amstat.org/meetings/jsm/2015/onlineprogram/ActivityDetails.cfm?SessionID=211266

Alyssa Frazee

August 11, 2015

Transcript

  1. Am I a data scientist?
    Alyssa Frazee, Stripe
    @acfrazee

  2. © 2009 Lisa Slavid

  3. statistician
    a data scientist
    © 2009 Lisa Slavid

  4. Where I’m coming from
    Math undergrad
    Biostatistics PhD
    “Machine Learning Engineer”
    Recurse Center (née Hacker School)
    2010 → today

  5. Am I a data scientist?

  6. Am I a data scientist?
    What do I really mean by this question?

  7. Am I a data scientist?
    What do I really mean by this question?
    Could I get a job offer with a title of “data scientist”?

  8. Am I a data scientist?
    What do I really mean by this question?
    Am I preparing my students to be able to get job offers with a title of “data scientist”?

  9. Am I a data scientist?
    What do I really mean by this question?
    Could I get a job offer with a title of “data scientist”?
    → sometimes implicitly industry
    → and sometimes specifically tech

  10. What’s “data science”?

  11. data skills spectrum

  12. theoretical statistics
    software engineering

  13. theoretical statistics
    software engineering
    data science

  14. understanding quantitative data
    building a product
    data science

  15. output: numerical results
    output: usable software
    data science

  16. Am I a statistician?
    points for:
    ● Am in a grad program called [bio]statistics
    ● Know things about martingales and the delta method
    ● Can explain what a p-value is and interpret linear regression coefficients
    points against:
    ● Haven’t proved a theorem since 2011
    ● Spend more time writing bash scripts than inventing estimators
    ● No publications in statistics journals

  17. Or am I a data scientist?
    points for:
    ● Can program in more than one language
    ● Actively use git & GitHub
    ● Have written R packages and reproducible reports
    ● Once made a web app and also a D3.js graph
    points against:
    ● Not working in industry
    ● Have never written a SQL query more complicated than select * from table
    ● Understanding of Hadoop, Spark, and AWS is vague at best
    ● Have never written production code

  18. Idea! I will listen to what experts in our field say!
    Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.

  19. First: do I want to be a data scientist?

  20. Second: does it matter?

  21. Am I on the job market?
    Am I hiring?

  22. If you decide it matters: some distinguishing features

  23. Intentionality about programming

  24. Intentionality about programming
    Spending time thinking primarily about:
    ● code efficiency
    ● version control
    ● code quality (cleanliness, modularity)
    ● documentation / usability
    ● unit testing
    ● systematic debugging
    ● giving and receiving code review
    ● and other principles of software engineering

  25. Interest in schleppy-but-practical projects

  26. Interest in schleppy-but-practical projects
    ● figuring out how to get the data you need
    ● combining existing tools/methods in new ways
    ● finding the simplest solution that works in practice

  27. Focus on concrete decision-making

  28. Focus on concrete decision-making
    less about inference and parameter estimation, more about what action should be taken

  29. Camp #1: Data science is just a rebranding of applied statistics.
    Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.
    Camp #3: Statistics is irrelevant to data science.

  30. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.

  31. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    Intentionality about programming

  32. Perspective from the other side
    Camp #1: Data science is just a rebranding of applied statistics.
    The day-to-day work is different!

  33. Perspective from the other side
    Last month I:
    ● wrote Ruby, Scala, CoffeeScript, and Python
    ● fought with Maven
    ● backfilled some busted tables in our databases
    ● investigated the mystery of why some of our cluster boxes are overworked
    ● learned how to be on call (so I can fix some of Stripe if it breaks at 3am)
    ● helped teach a SQL class
    ● and did some statistics

  34. Perspective from the other side
    Camp #3: Statistics is irrelevant to data science.

  35. Perspective from the other side
    Camp #3: Statistics is irrelevant to data science.

  36. Perspective from the other side
    Statistics and data science are overlapping. Neither is a subset of the other.

  37. About that identity crisis:
    Program intentionally and be a data scientist, if you want!

  38. About that identity crisis:
    Or don’t! Statistics is hugely important and relevant in its own right!

  39. Further reading:
    ● http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/
    ● http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/
    ● https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
    ● http://datascopeanalytics.com/blog/what-is-a-data-scientist/
