
Reproducible Data Science with Docker by Richard Ackon

Pycon ZA
October 12, 2018


Collaboration is a major part of doing Data Science. Data Scientists are always sharing their work with colleagues, whether to continue the next step of the Data Science process or for review. A problem commonly encountered along the way is "it works on my machine".

Docker is a tool for packaging and running applications, together with all their dependencies, in an isolated environment.
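As an illustration of that packaging idea, a minimal Dockerfile for a notebook-based analysis might look like the sketch below (the base image tag, file names, and port are illustrative assumptions, not taken from the talk):

```dockerfile
# Pin a versioned base image so everyone builds the same environment
FROM python:3.7-slim

WORKDIR /app

# Install pinned dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the notebooks and data into the image
COPY . .

# Expose Jupyter's default port and start the notebook server
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```

A colleague can then rebuild the exact same environment with `docker build` and run it with `docker run`, regardless of what happens to be installed on their own machine.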

In this talk, I'll use Python to analyse some data in Jupyter notebooks and show how Docker can be used to ensure reproducibility of that analysis in a different environment.

This talk will cover:

The basics of the data science workflow
The basics of Docker
A demonstration of sharing and reproducing data analysis work in a Jupyter notebook.


Transcript

  1. Who Am I? • Machine Learning Engineer, Kudobuzz • Co-organizer, Accra Artificial Intelligence Meetup • Writer for Analytics Vidhya, Divo.com
  2. Overview • Reproducible Data Science? • Why is it important? • Where do we need reproducibility? • How do we achieve reproducibility? • Demo • Conclusion
  3. What is Reproducible Data Science? The ability to replicate the same results for a data science experiment using the same data and code running in the same environment.
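Part of "same results from the same data and code" is controlling randomness in the code itself; Docker handles the environment half, but seeding is still up to the analysis. A minimal sketch (the experiment and its function name are illustrative, not from the talk):

```python
import random

def run_experiment(seed=42):
    """Toy 'experiment': shuffle a dataset and take a sample.

    With the seed fixed, the same code and data always yield the
    same result -- the code-level half of reproducibility
    (Docker provides the environment-level half).
    """
    rng = random.Random(seed)  # isolated, explicitly seeded RNG
    data = list(range(10))
    rng.shuffle(data)
    return data[:3]

# Two runs with the same seed produce identical results
assert run_experiment(seed=0) == run_experiment(seed=0)
```

Using a dedicated `random.Random(seed)` instance rather than the global RNG keeps the experiment deterministic even if other code also draws random numbers.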
  4. Why is it important? "Non-reproducible single occurrences are of no significance to science." - Karl Popper • Proof of phenomenon • Facilitates peer review • Basis for decision making
  5. Docker • Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. • Containers allow you to package an application with everything it needs to run, such as libraries and other dependencies.
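In practice, sharing a containerised analysis comes down to a few commands (these assume Docker is installed; the image tag "analysis" and the registry name are illustrative, not from the talk):

```shell
# Build an image from the Dockerfile in the current directory
docker build -t analysis .

# Run it, mapping Jupyter's port to the host
docker run -p 8888:8888 analysis

# Optionally publish it so colleagues can pull the identical environment
docker tag analysis yourname/analysis
docker push yourname/analysis
```

A colleague then runs `docker pull yourname/analysis` followed by the same `docker run`, sidestepping the "it works on my machine" problem entirely.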