
New Workflows for Building Data Pipelines

Full post and video here: Hakkalabs.co/articles/new-workflows-for-building-data-pipelines

Hakka Labs

May 14, 2015

Transcript

  1. Scripts are Easy
     • They run locally
     • You know what data goes in
     • You know what libraries are installed
     • Easy to share
  2. Doing the same thing in a cluster is hard
     1. code dependency management - libraries, conflicts
     2. data dependency management - data locality, job scheduling, etc.
     3. collaboration - building on each other’s work, sharing insights
  3. Pipelines simplify clusters
     • So what is a pipeline? A simple way to describe computation in a cluster
     • Pipelines are useful because…
       – they’re reproducible
       – test locally before you run on the cluster
       – collaborate on them as easily as code
  4. /image
     A Docker image to specify the runtime environment and job logic.
     All of your library and code dependencies are neatly packaged into a Docker container (see the job-logic sketch after the transcript).
  5. /job
     A DAG of data manipulations to be run in the container.
     A complete, text-based expression of the data dependencies (see the DAG sketch after the transcript).
  6. /data and /install
     Sample data to test locally; install scripts to run the pipeline (see the local-test sketch after the transcript).
     Pipelines run the same anywhere. Collaboration becomes easy.
  7. What’s next?
     Having a concrete spec for pipelines gives a foundation to build on. What can we build on it?
  8. Incremental processing
     • Pachyderm has built a diff-aware file system which tells us what data has changed
     • Our pipelines tell us how those changes affect our results
     • Combined, this gets us efficient incremental processing (see the scheduling sketch after the transcript)
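
A minimal sketch of the kind of job logic slide 4 describes as packaged into the Docker image. The /in and /out paths and the word-count task are assumptions for illustration, not a documented interface; the point is that the script and the libraries it needs travel together in the container.

    # word_count.py - hypothetical job logic baked into the pipeline's Docker image.
    # Assumes the runner mounts input data at /in and collects results from /out;
    # both paths are illustrative, not a published spec.
    import os
    from collections import Counter

    def main(in_dir="/in", out_dir="/out"):
        counts = Counter()
        for name in os.listdir(in_dir):
            with open(os.path.join(in_dir, name)) as f:
                for line in f:
                    counts.update(line.split())
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "word_counts.tsv"), "w") as out:
            for word, n in counts.most_common():
                out.write(f"{word}\t{n}\n")

    if __name__ == "__main__":
        main()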
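
Slide 5 calls /job a DAG of data manipulations expressed as text. A toy sketch of that idea, assuming Python 3.9+ for graphlib; the step names and placeholder work are made up. Because the whole job is plain data, it can be versioned, diffed, and shared like code.

    # dag_sketch.py - a toy, text-expressible DAG of data manipulations (names are made up).
    from graphlib import TopologicalSorter

    # Each step lists the upstream steps whose output it consumes.
    JOB = {
        "clean":    {"deps": [],                       "work": lambda: print("clean raw logs")},
        "sessions": {"deps": ["clean"],                "work": lambda: print("build sessions")},
        "features": {"deps": ["clean"],                "work": lambda: print("extract features")},
        "report":   {"deps": ["sessions", "features"], "work": lambda: print("write report")},
    }

    def run(job):
        # Resolve the declared data dependencies into an execution order, then run each step.
        for step in TopologicalSorter({k: v["deps"] for k, v in job.items()}).static_order():
            job[step]["work"]()

    if __name__ == "__main__":
        run(JOB)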
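
For slide 6’s /data and “pipelines run the same anywhere”: a sketch of a local test that runs the same container the cluster would run, mounted over the small checked-in sample data. The image tag and the /in and /out mount points are assumptions for illustration.

    # local_test.py - run the pipeline's container against local sample data before
    # submitting it to the cluster. Image name and mount points are hypothetical.
    import os
    import subprocess
    import tempfile

    SAMPLE_DATA = "./data"         # small sample inputs kept next to the pipeline
    JOB_IMAGE = "my-pipeline:dev"  # image built from the pipeline's /image spec

    with tempfile.TemporaryDirectory() as out_dir:
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{os.path.abspath(SAMPLE_DATA)}:/in",
             "-v", f"{out_dir}:/out",
             JOB_IMAGE],
            check=True,
        )
        print("local outputs:", os.listdir(out_dir))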
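
Slide 8 combines a data diff with the pipeline’s declared dependencies. A sketch of the resulting scheduling logic with a made-up graph; the diff itself would come from Pachyderm’s diff-aware file system, which is not modeled here.

    # incremental_sketch.py - given which inputs changed, decide which steps must rerun.
    # The graph and input names are hypothetical.

    READS = {"clean": {"raw_logs"}, "geo": {"geo_db"}, "report": set()}  # external inputs read by each step
    DEPS = {"clean": [], "geo": [], "report": ["clean", "geo"]}          # upstream steps feeding each step

    def dirty_steps(changed_inputs):
        """Every step whose inputs changed, directly or through an upstream step."""
        dirty = {step for step, reads in READS.items() if reads & changed_inputs}
        grew = True
        while grew:  # propagate downstream until a fixed point
            grew = False
            for step, deps in DEPS.items():
                if step not in dirty and any(d in dirty for d in deps):
                    dirty.add(step)
                    grew = True
        return dirty

    print(sorted(dirty_steps({"geo_db"})))  # ['geo', 'report'] - the cached 'clean' output is reused
    print(sorted(dirty_steps(set())))       # []                - nothing changed, nothing reruns

Only the dirty steps need to run; everything else can be served from previously computed results.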