Slide 1

Robust Data Pipelines with drake and Docker
@tudosgar | tamaszilagyi.com

Slide 2

The low-down
The end goal for most data analytics projects is to build a data product, whether that is a weekly dashboard or a deployed ML model. From getting and cleaning the data to generating plots or fitting models, each project can be split up into individual tasks. Afterwards, we want to make sure that the environment we wrote our code in can be replicated.

Slide 3

Automation. Reproducibility.

Slide 4

Building pipelines
drake is a workflow management tool that lets you build managed workflows. It is a first serious attempt to create a tool like Airflow or Luigi, but in R.

Slide 5

Your analysis is a sequence of transformations.

Slide 6

Write a plan
A plan can be simple or very complex. It all depends on how you define the tasks' dependencies.
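
To make this concrete, here is a minimal sketch of a three-task plan written with drake's `drake_plan()`. The target names (`raw_data`, `clean_data`, `fit`), the helper `clean()` and the use of the built-in `mtcars` dataset are assumptions made for illustration and do not come from the talk. Each command is ordinary R code, and drake infers the dependencies from the names a command mentions.

```r
library(drake)

# Hypothetical helper: keep only cars with more than four cylinders.
clean <- function(df) {
  df[df$cyl > 4, ]
}

# Each argument is a target, its value is the command that builds it.
# clean_data depends on raw_data and clean(); fit depends on clean_data.
my_plan <- drake_plan(
  raw_data   = mtcars,
  clean_data = clean(raw_data),
  fit        = lm(mpg ~ wt, data = clean_data)
)

# A plan is a data frame with one row per target.
print(my_plan)
```

Nothing is computed at this point. The plan only describes the tasks and how they relate to each other.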

Slide 7

Not only targets are kept track of
By default we can visualise all imports, functions and transformations in our DAG, not only completed tasks. As a nice bonus, we can also monitor progress for longer jobs.
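
A rough sketch of how this inspection might look, assuming a recent drake release (7.11 or later) in which `outdated()` and `vis_drake_graph()` accept the plan directly. Older releases expect a `drake_config()` object instead. The plan is a made-up example, and the graph needs the visNetwork package installed.

```r
library(drake)

# Hypothetical plan, used only for illustration.
my_plan <- drake_plan(
  raw_data   = mtcars,
  clean_data = raw_data[raw_data$cyl > 4, ],
  fit        = lm(mpg ~ wt, data = clean_data)
)

# Targets that make() would (re)build right now.
print(outdated(my_plan))

# Interactive DAG of targets and imports (requires visNetwork).
# Print `graph` in an interactive session to render it.
graph <- vis_drake_graph(my_plan)

# Status of each target recorded in the cache; useful while a long
# make() is running in another session.
print(drake_progress())
```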

Slide 8

make(my_plan)
Checks dependencies and the cache before building the targets in the plan. This means that on subsequent runs, only the changed tasks and the tasks that depend on them will rerun, leaving the rest intact.
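
The caching behaviour can be sketched as follows. The plan and its target names are assumptions made for the example, and `outdated(my_plan)` assumes a recent drake release that accepts the plan directly.

```r
library(drake)

# Hypothetical plan, used only for illustration.
my_plan <- drake_plan(
  raw_data   = mtcars,
  clean_data = raw_data[raw_data$cyl > 4, ],
  fit        = lm(mpg ~ wt, data = clean_data)
)

make(my_plan)  # first run: builds all three targets
make(my_plan)  # second run: everything is up to date, nothing is rebuilt

# Change the command of one target. Nothing depends on `fit`,
# so it is the only target that becomes outdated.
my_plan <- drake_plan(
  raw_data   = mtcars,
  clean_data = raw_data[raw_data$cyl > 4, ],
  fit        = lm(mpg ~ wt + hp, data = clean_data)
)
print(outdated(my_plan))

make(my_plan)  # rebuilds `fit`; the other targets come from the cache

# Read a finished target back from the cache.
print(coef(readd(fit)))
```

Had the command for `clean_data` changed instead, `fit` would be rebuilt as well, because it sits downstream of `clean_data` in the DAG.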

Slide 9

A container is kinda like a VM but... different.

Slide 10

Benefits of containers
- Easily reproduce your infrastructure
- Run independently of the host OS
- Consistency in production

Slide 11

Step 1: Dockerfile
A Dockerfile includes, in reverse order:
- Your script
- Package dependencies of your script
- System-level dependencies of these packages

Slide 12

Step 2: build image
This is the stage where we actually build our mini computer, ready for deployment.

Slide 13

Step 3: run container
Instantiates the container, mounts the host folder where the data resides, and runs our executable as defined in the Dockerfile.

Slide 14

Find me
http://tamaszilagyi.com/
@tudosgar

Slide 15

Thank you.