Having an Impact as a Modern Statistician

How adopting software engineering best practices can make you a more effective data scientist.


Skipper Seabold

July 31, 2016


  1. Building a Data-Driven World™

    Having an Impact as a Modern Statistician: Lessons from Software Engineering
    Skipper Seabold (@jseabold)
    Joint Statistical Meetings, July 31, 2016
  2. About Me

    • Economist by Training
    • Open Source Software Contributor
    • Lead Data Scientist at Civis Analytics
      ◦ (Requisite: We’re Hiring)
  3. Role of a Modern Statistician

    • Modern statisticians write a lot of code
    • Increasingly, this is software rather than scripts
    • Academics
      ◦ Teaching
      ◦ Writing papers
      ◦ Implementations
    • Tool Builders
      ◦ Maintaining popular open source tools
    • Industry
      ◦ Deploying models that drive business decisions
      ◦ Maintaining code
      ◦ Knowledge transfer
  4. What does it mean to be impactful?

    • Across Fields
      ◦ Engendering trust
      ◦ Making useful and widely felt contributions
      ◦ Being a good collaborator or colleague
    • Academics
      ◦ Trust through reproducibility
    • Tool Building
      ◦ Trust through working, easy-to-use, and documented code
    • Industry
      ◦ Trust through demonstrated ability to deliver
  5. How to Be Impactful

    • Borrow from the best practices of software engineers
  6. Statistics Development Life Cycle

    Stages shown: Publish, Distribute, Deploy

  7. Languages

  8. Version Control

    • What is it?
      ◦ Tracking changes for projects
      ◦ Most often code
      ◦ Also works for collaborative paper writing
    • How does this make you impactful?
      ◦ Backup
      ◦ Collaboration
      ◦ Distribution and integration
      ◦ Legacy
      ◦ GitHub as a resume
    • What tools are available?
      ◦ Git
      ◦ GitHub
  9. Git Example

  10. Unit Testing

    • Prove that a piece of code does exactly what it is supposed to do
    • Unit tests should test the smallest possible unit
    • How does this make you impactful?
      ◦ Documentation (not a substitute for real documentation!)
      ◦ Improves code quality through modularity
      ◦ Refactor code quickly
      ◦ Maintainable code base
      ◦ Reduces the risk of bugs and even retractions
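One way to read "smallest possible unit" is sketched below in Python, assuming NumPy is available. The functions `standardize` and `correlation` are hypothetical examples, not code from the talk. Each function gets its own test, so a failure points at one small piece rather than at a whole analysis script.

```python
import numpy as np


def standardize(x):
    """Center a 1-D array to mean 0 and scale it to unit population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def correlation(x, y):
    """Pearson correlation, computed as the mean product of standardized values."""
    return float(np.mean(standardize(x) * standardize(y)))


def test_standardize_centers_and_scales():
    z = standardize([1.0, 2.0, 3.0, 4.0])
    np.testing.assert_allclose(z.mean(), 0.0, atol=1e-12)
    np.testing.assert_allclose(z.std(), 1.0)


def test_correlation_of_exact_line_is_one():
    x = [1.0, 2.0, 3.0, 4.0]
    y = [3.0, 5.0, 7.0, 9.0]  # y = 2x + 1, so the correlation should be 1
    np.testing.assert_allclose(correlation(x, y), 1.0)
```

Because `correlation` is built from `standardize`, a bug in the scaling step would fail the first test directly, which is the modularity benefit the slide describes.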
  11. Unit Testing Example

        import numpy as np

        def calculate_gradient(func, x):
            ...
            return gradient

        def test_gradient():
            # log_loss and minimum are not shown on the slide;
            # they are assumed to be defined elsewhere.
            result = calculate_gradient(log_loss, minimum)
            np.testing.assert_allclose(result, 0)

        $ py.test gradient.py
        ============================ test session starts =============================
        platform darwin -- Python 3.4.4, pytest-2.9.2, py-1.4.31, pluggy-0.3.1
        rootdir: /Users/user/project/path, inifile:
        collected 1 items

        gradient.py .

        ========================== 1 passed in 0.11 seconds ==========================
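The slide elides the body of `calculate_gradient` and does not define `log_loss` or `minimum`. A self-contained sketch of the same test pattern follows, under stated assumptions: central finite differences stand in for the elided gradient code, and a hypothetical `quadratic_loss` with a known minimum replaces `log_loss`. It is an illustration, not the talk's implementation.

```python
import numpy as np


def calculate_gradient(func, x, step=1e-6):
    """Approximate the gradient of func at a 1-D point x with central differences.

    Assumption: this numerical scheme stands in for the body elided on the slide.
    """
    x = np.asarray(x, dtype=float)
    gradient = np.zeros_like(x)
    for i in range(x.size):
        offset = np.zeros_like(x)
        offset[i] = step
        gradient[i] = (func(x + offset) - func(x - offset)) / (2 * step)
    return gradient


def quadratic_loss(x):
    """Hypothetical stand-in for log_loss; its minimum is at (1, -2)."""
    target = np.array([1.0, -2.0])
    return np.sum((x - target) ** 2)


def test_gradient():
    # At a minimum, every component of the gradient should be (numerically) zero.
    minimum = np.array([1.0, -2.0])
    result = calculate_gradient(quadratic_loss, minimum)
    np.testing.assert_allclose(result, 0, atol=1e-8)
```

The explicit `atol` is a design choice: by default `assert_allclose` uses only a relative tolerance, so a comparison against zero passes only when the result is exactly zero, and any rounding error from a numerical gradient would fail it.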
  12. Test Driven Development

    • When to write a test?
    • Many advocate writing the test first
    • When you discover a bug
    • How does this make you impactful?
      ◦ Makes you think about code design
      ◦ Easy for you and others to change and improve your code
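As a sketch of the "when you discover a bug" case, the example below imagines a hypothetical bug report: a `weighted_mean` helper crashed with an unhelpful `ZeroDivisionError` when every weight was zero. The function name, the bug, and the chosen fix are all assumptions made for illustration. The regression test is written first and fails, then the guard is added to make it pass.

```python
def weighted_mean(values, weights):
    """Return the weighted mean, rejecting weights that sum to zero."""
    total = sum(weights)
    if total == 0:
        # Guard added only after the regression test below was written and seen to fail.
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def test_weighted_mean_matches_hand_calculation():
    # (1*1 + 2*1 + 3*2) / (1 + 1 + 2) = 9 / 4
    assert weighted_mean([1, 2, 3], [1, 1, 2]) == 2.25


def test_weighted_mean_rejects_zero_weights():
    # Regression test for the hypothetical bug report: written before the fix.
    try:
        weighted_mean([1, 2, 3], [0, 0, 0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for all-zero weights")
```

Keeping the regression test in the suite means the bug cannot quietly return when someone later refactors the function.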
  13. Continuous Integration

    • Making changes and getting them back into your “master” branch
      ◦ Coupled with unit tests
        ▪ Always have code that works
        ▪ Always have a paper that compiles
    • How does this make you impactful?
      ◦ Trust
      ◦ Makes iteration fast
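  14. Continuous Integration Example

    • An example using Travis CI

        language: python
        python:
          - "3.4"
        before_install:
          - sudo apt-get -qq update
          # pdflatex is provided by the texlive-latex-base package on Ubuntu
          - sudo apt-get install -y texlive-latex-base
        install:
          # -r expects a requirements file, so packages are listed directly
          - pip install numpy scipy scikit-learn
        script: py.test your-project && make pdf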
  15. Containers

    • Enabled by features of the Linux kernel
    • Resource isolation and limits for groups of processes
    • Reliably run software from one computational environment in another
    • What problems can containers solve?
      ◦ Your colleague has a different setup than you
      ◦ Your development environment looks different than your test environment
      ◦ You need to limit the computational resources of an application
    • Running groups of containers with tools like Kubernetes
      ◦ Spark
      ◦ dask.distributed
    • How does this make you impactful?
      ◦ Makes CI easier
      ◦ Makes distribution of your code easier
      ◦ Easy to try things out
  16. Container Example

    • Docker is a popular container technology
    • Environment is defined by a Dockerfile

        FROM ubuntu:14.04
        MAINTAINER "Open Source Dev Team maintainers@maintenance.org"
        RUN apt-get update && \
            apt-get install -y git gcc r-base
        CMD ["R"]

    • Large user community around Docker Hub

        $ docker run -d -p 8000:8000 --name jupyter jupyter/datascience-notebook
        Unable to find image 'jupyter/datascience-notebook:latest' locally
        latest: Pulling from jupyter/datascience-notebook
        ...
  17. Further Resources

    • Craft
      ◦ Effective Computation in Physics
      ◦ The Pragmatic Programmer
      ◦ The Senior Software Engineer
    • Tools
      ◦ The Git Book
      ◦ The Docker Book
    • Testing
      ◦ xUnit Test Patterns
    • General Programming
      ◦ The Structure and Interpretation of Computer Programs (aka The Wizard Book)