Having an Impact as a Modern Statistician

How adopting software engineering best practices can make you a more effective data scientist.

Skipper Seabold

July 31, 2016

Transcript

  1. Building a Data-Driven World™
    Having an Impact as a Modern Statistician
    Lessons from Software Engineering
    Skipper Seabold @jseabold
    Joint Statistical Meetings, July 31, 2016

  2. About Me
    ● Economist by Training
    ● Open Source Software Contributor
    ● Lead Data Scientist at Civis Analytics
    ○ (Requisite: We’re Hiring)

  3. Role of a Modern Statistician
    ● Modern statisticians write a lot of code
    ● Increasingly, this is software rather than scripts
    ● Academics
    ○ Teaching
    ○ Writing papers
    ○ Implementations
    ● Tool Builders
    ○ Maintaining popular open source tools
    ● Industry
    ○ Deploying models that drive business decisions
    ○ Maintaining code
    ○ Knowledge transfer

  4. What does it mean to be impactful?
    ● Across Fields
    ○ Engendering trust
    ○ Useful and widely felt contributions
    ○ Be a good collaborator or colleague
    ● Academics
    ○ Trust through reproducibility
    ● Tool Building
    ○ Trust through working, easy-to-use, documented code
    ● Industry
    ○ Trust through demonstrated ability to deliver

  5. How to Be Impactful
    ● Borrow from the best practices of Software Engineers

  6. Statistics Development Life Cycle
    Publish
    Distribute
    Deploy

  7. Version Control
    ● What is it?
    ○ Tracking changes for projects
    ○ Most often code
    ○ Also works for collaborative paper writing
    ● How does this make you impactful?
    ○ Backup
    ○ Collaboration
    ○ Distribution and integration
    ○ Legacy
    ○ GitHub as a resume
    ● What tools are available?
    ○ Git
    ○ GitHub

  8. Unit Testing
    ● Prove that a piece of code does exactly what it is supposed to do
    ● Unit tests should test the smallest possible unit
    ● How does this make you impactful?
    ○ Documentation (not a substitute for real documentation!)
    ○ Improves code quality through modularity
    ○ Refactor code quickly
    ○ Maintainable code base
    ○ Reduces risks of bugs and even retractions
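
The point about refactoring quickly can be illustrated with a minimal sketch. The function names and data below are hypothetical and not from the talk. A test pins the refactored function to the known result, so the implementation can change without changing behavior.

```python
import math
import statistics


def sample_variance_loop(values):
    """Original hand-written implementation of the sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_variance(values):
    """Refactored version that delegates to the standard library."""
    return statistics.variance(values)


def test_refactor_preserves_results():
    # Hypothetical data; the squared deviations sum to 32, so variance is 32/7.
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    assert math.isclose(sample_variance_loop(data), 32 / 7)
    assert math.isclose(sample_variance(data), sample_variance_loop(data))
```

If the refactored function ever drifts from the original, the test fails before the change reaches anyone else.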

  9. Unit Testing Example
    import numpy as np

    def calculate_gradient(func, x):
        ...
        return gradient

    def test_gradient():
        # log_loss and minimum are defined elsewhere (not shown on the slide)
        result = calculate_gradient(log_loss, minimum)
        # atol is needed: with the default atol=0, only an exact 0 would pass
        np.testing.assert_allclose(result, 0, atol=1e-8)

    $ py.test gradient.py
    ============================ test session starts =============================
    platform darwin -- Python 3.4.4, pytest-2.9.2, py-1.4.31, pluggy-0.3.1
    rootdir: /Users/user/project/path, inifile:
    collected 1 items
    gradient.py .
    ========================== 1 passed in 0.11 seconds ==========================
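
The slide elides the body of `calculate_gradient` and does not define `log_loss` or `minimum`. The following is one self-contained way such a test could look, not the talk's implementation. It assumes a central finite-difference gradient and uses a hypothetical `quadratic_loss` with a known minimum in place of `log_loss`.

```python
import numpy as np


def calculate_gradient(func, x, eps=1e-6):
    """Approximate the gradient of func at x by central finite differences."""
    x = np.asarray(x, dtype=float)
    gradient = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        gradient[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return gradient


def quadratic_loss(x):
    """Hypothetical stand-in for log_loss; minimized at (1, -2)."""
    target = np.array([1.0, -2.0])
    return np.sum((x - target) ** 2)


# Known minimizer of quadratic_loss (an assumption of this sketch).
minimum = np.array([1.0, -2.0])


def test_gradient():
    result = calculate_gradient(quadratic_loss, minimum)
    # The expected value is zero, so an absolute tolerance is required.
    np.testing.assert_allclose(result, 0, atol=1e-8)
```

Saved as `gradient.py`, this file can be run with `py.test gradient.py` as on the slide.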

  10. Test Driven Development
    ● When to write a test?
    ○ Many advocate writing the test first
    ○ When you discover a bug
    ● How does this make you impactful?
    ○ Makes you think about code design
    ○ Easy for you and others to change and improve your code
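
As a minimal sketch of the test-first workflow, the tests below are written before the function they exercise. The `median` function and the bug scenario are hypothetical illustrations, not from the talk. The even-length test plays the role of a regression test added when a bug is discovered.

```python
def test_median_odd_length():
    assert median([3, 1, 2]) == 2


def test_median_even_length():
    # Regression test: an earlier (hypothetical) version returned the upper
    # middle value instead of averaging the two middle values.
    assert median([1, 2, 3, 4]) == 2.5


def median(values):
    """Return the median, averaging the two middle values for even lengths."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Writing the tests first forces a decision about the function's interface and edge cases before any implementation exists.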

  11. Continuous Integration
    ● Making changes and getting them back into your “master” branch
    ○ Coupled with unit tests
    ■ Always have code that works
    ■ Always have a paper that compiles
    ● How does this make you impactful?
    ○ Trust
    ○ Makes iteration fast

  12. Continuous Integration Example
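    language: python
    python:
      - "3.4"
    before_install:
      - sudo apt-get -qq update
      # pdflatex is provided by the texlive-latex-base package
      - sudo apt-get install -y texlive-latex-base
    install:
      # pip's -r flag expects a requirements file, so list the packages directly
      - pip install numpy scipy scikit-learn
    script: py.test your-project && make pdf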
    ● An example using Travis CI
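
What a CI service does with a `script` line such as `py.test your-project && make pdf` can be sketched in a few lines of Python. This is an illustration of the idea, not how Travis CI is implemented. The helper `run_pipeline` and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each command in order; return the first nonzero exit code, else 0."""
    for command in steps:
        completed = subprocess.run(command)
        if completed.returncode != 0:
            # Stop at the first failure, like `&&` in a shell.
            return completed.returncode
    return 0


# Stand-in commands so the sketch runs anywhere. A real project might use
# ["py.test", "your-project"] and ["make", "pdf"] instead.
passing = [sys.executable, "-c", "print('step ok')"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

if __name__ == "__main__":
    sys.exit(run_pipeline([passing, passing]))
```

A nonzero return value marks the build as failed, which is what keeps broken code or a paper that does not compile out of the main branch.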

  13. Containers
    ● Enabled by features of the Linux kernel
    ● Resource isolation and limits for groups of processes
    ● Reliably run software from one computational environment in another
    ● What problems can containers solve?
    ○ Your colleague has a different setup than you
    ○ Your development environment looks different than your test environment
    ○ You need to limit the computational resources of an application
    ● Running groups of containers with tools like Kubernetes
    ○ Spark
    ○ dask.distributed
    ● How does this make you impactful?
    ○ Makes CI easier
    ○ Makes distribution of your code easier
    ○ Easy to try things out
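
Limiting the computational resources of an application can be sketched as follows, assuming Docker is the container tool. The helper `docker_run_command` and the image name `my-analysis:latest` are hypothetical. The sketch only builds the argument list, using the `--memory` and `--cpus` options of `docker run`, and does not start a container.

```python
def docker_run_command(image, memory=None, cpus=None, command=()):
    """Build an argv list for `docker run` with optional resource limits."""
    argv = ["docker", "run", "--rm"]
    if memory is not None:
        # Upper bound on memory, for example "512m" or "2g".
        argv += ["--memory", memory]
    if cpus is not None:
        # Upper bound on CPU usage, for example 1.5 CPUs.
        argv += ["--cpus", str(cpus)]
    argv.append(image)
    argv.extend(command)
    return argv


# Hypothetical image name and command, for illustration only.
argv = docker_run_command(
    "my-analysis:latest", memory="512m", cpus=1.5, command=["R", "--version"]
)
```

On a machine with Docker installed, the resulting list could be passed to `subprocess.run(argv)`.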

  14. Container Example
    ● Docker is a popular container technology
    ● Environment is defined by a Dockerfile
    FROM ubuntu:14.04
    MAINTAINER "Open Source Dev Team [email protected]"
    RUN apt-get update && \
        apt-get install -y git gcc r-base
    CMD ["R"]
    ● Large user community around Docker Hub
    $ docker run -d -p 8000:8000 --name jupyter jupyter/datascience-notebook
    Unable to find image 'jupyter/datascience-notebook:latest' locally
    latest: Pulling from jupyter/datascience-notebook
    ...

  15. Further Resources
    ● Craft
    ○ Effective Computation in Physics
    ○ The Pragmatic Programmer
    ○ The Senior Software Engineer
    ● Tools
    ○ The Git Book
    ○ The Docker Book
    ● Testing
    ○ xUnit Test Patterns
    ● General Programming
    ○ Structure and Interpretation of Computer Programs (aka The Wizard Book)
