Slide 1

Slide 1 text

Reproducible Data Science with Docker
By: Richard Ackon @esquire_gh

Slide 2

Slide 2 text

Who Am I?
● Machine Learning Engineer, Kudobuzz
● Co-organizer, Accra Artificial Intelligence Meetup
● Writer for Analytics Vidhya, Divo.com

Slide 3

Slide 3 text

Overview
● What is reproducible data science?
● Why is it important?
● Where do we need reproducibility?
● How do we achieve reproducibility?
● Demo
● Conclusion

Slide 4

Slide 4 text

What is Reproducible Data Science?
It is the ability to obtain the same results from a data science experiment when the same code is run on the same data in the same environment.
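
A minimal sketch of this definition in Python. The toy "experiment" (a random sample and its mean), the function name `run_experiment`, and the seed value are all assumptions made for illustration and do not come from the talk.

```python
import random
import statistics


def run_experiment(data, seed):
    """Toy experiment: draw a random sample and return its mean.

    The seed makes the sampling deterministic, so the same data,
    the same code and the same seed give the same result.
    """
    rng = random.Random(seed)        # local generator, no hidden global state
    sample = rng.sample(data, k=5)   # random, but repeatable for a fixed seed
    return statistics.mean(sample)


data = list(range(100))
first = run_experiment(data, seed=42)
second = run_experiment(data, seed=42)
print(first == second)  # prints True: identical inputs give identical output
```

Fixing the seed only covers the code and data parts of the definition. The sampled values can still differ between Python versions, which is where the environment comes in.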

Slide 5

Slide 5 text

Why is it important?
“non-reproducible single occurrences are of no significance to science.” - Karl Popper
● Proof of phenomenon
● Facilitates peer review
● Basis for decision making

Slide 6

Slide 6 text

Where do we need it?
● Data
● Environment
● Code
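
One way to make these three parts checkable is to record a fingerprint of each alongside the results. The sketch below is an assumption about how that could look, using only the Python standard library; `fingerprint` and the example byte strings are hypothetical.

```python
import hashlib
import json
import platform


def sha256_of(content: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(content).hexdigest()


def fingerprint(data: bytes, code: bytes) -> dict:
    """Summarise the data, the code and the environment of a run."""
    return {
        "data_sha256": sha256_of(data),                # data
        "code_sha256": sha256_of(code),                # code
        "python_version": platform.python_version(),   # environment
        "platform": platform.platform(),               # environment
    }


# Hypothetical inputs: a tiny CSV and a one-line script.
example = fingerprint(b"x,y\n1,2\n", b"print('train model')\n")
print(json.dumps(example, indent=2))
```

If two runs produce different fingerprints, the differing field shows whether the data, the code or the environment changed.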

Slide 7

Slide 7 text

So, how do we achieve reproducibility?

Slide 8

Slide 8 text

Common Data Science Workflow

Slide 9

Slide 9 text

Common Reproducibility Errors

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Docker
● Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
● Containers allow you to package an application with everything it needs to run, such as libraries and other dependencies.
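
As a rough illustration of "packaging an application with everything it needs", the Python sketch below renders the text of a small Dockerfile. The base image tag `python:3.11-slim`, the file names `requirements.txt` and `train.py`, and the helper `render_dockerfile` are assumptions for the example, not details from the talk.

```python
def render_dockerfile(base_image: str, script: str) -> str:
    """Return Dockerfile text for a simple Python project.

    Assumes the project directory contains a requirements.txt with
    pinned versions and the script to run.
    """
    lines = [
        f"FROM {base_image}",                                   # fixed base environment
        "WORKDIR /app",
        "COPY requirements.txt .",
        "RUN pip install --no-cache-dir -r requirements.txt",   # libraries and dependencies
        "COPY . .",                                             # code (and any bundled data)
        f'CMD ["python", "{script}"]',                          # what the container runs
    ]
    return "\n".join(lines) + "\n"


print(render_dockerfile("python:3.11-slim", "train.py"))
```

Pinning both the base image tag and the package versions in `requirements.txt` is what keeps the environment the same from one build to the next.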

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Demo Using Docker to ensure reproducibility
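
The demo itself is not part of this transcript. As a hedged sketch of the usual build-then-run cycle, the Python below assembles the two Docker CLI commands involved; the image tag `repro-demo:1.0` and the helper names are hypothetical.

```python
import shlex


def docker_build_cmd(tag: str, context: str = ".") -> list:
    """Command that builds an image from the Dockerfile in `context`."""
    return ["docker", "build", "-t", tag, context]


def docker_run_cmd(tag: str) -> list:
    """Command that runs the image and removes the container afterwards."""
    return ["docker", "run", "--rm", tag]


# Print the commands instead of executing them, so this runs without Docker.
for cmd in (docker_build_cmd("repro-demo:1.0"), docker_run_cmd("repro-demo:1.0")):
    print(shlex.join(cmd))
```

On a machine with Docker installed, each list could be passed to `subprocess.run(cmd, check=True)` to execute it.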

Slide 14

Slide 14 text

Thank you!