
New Workflows for Building Data Pipelines

Full post and video here: Hakkalabs.co/articles/new-workflows-for-building-data-pipelines

Hakka Labs

May 14, 2015

Transcript

  1. Scripts are Easy
     • They run locally
     • You know what data goes in
     • You know what libraries are installed
     • Easy to share
  2. Doing the same thing in a cluster is hard
     1. code dependency management - libraries, conflicts
     2. data dependency management - data locality, job scheduling, etc.
     3. collaboration - building on each other’s work, sharing insights
  3. Pipelines simplify clusters
     • So what is a pipeline? A simple way to describe computation in a cluster
     • Pipelines are useful because…
       – they’re reproducible
       – test locally before you run on the cluster
       – collaborate on them as easily as code
  4. /image
     A Docker image to specify the runtime environment and job logic.
     All of your library and code dependencies are neatly packaged into a Docker container (see the job-logic sketch after the transcript).
  5. /job
     A DAG of data manipulations to be run in the container.
     A complete, text-based expression of the data dependencies (see the DAG sketch after the transcript).
  6. /data and /install
     Sample data to test locally; install scripts to run the pipeline (see the local-test sketch after the transcript).
     Pipelines run the same anywhere. Collaboration becomes easy.
  7. What’s next?
     Having a concrete spec for pipelines gives a foundation to build on. What can we build on it?
  8. Incremental processing
     • Pachyderm has built a diff-aware file system which tells us what data has changed
     • Our pipelines tell us how those changes affect our results
     • Combined, this gets us efficient incremental processing (see the scheduling sketch after the transcript)
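
A minimal sketch of the kind of job logic slide 4 describes as packaged into the Docker image. The /in and /out paths and the word-count task are assumptions for illustration, not a documented interface; the point is that the script and the libraries it needs travel together in the container.

    # word_count.py - hypothetical job logic baked into the pipeline's Docker image.
    # Assumes the runner mounts input data at /in and collects results from /out;
    # both paths are illustrative, not a published spec.
    import os
    from collections import Counter

    def main(in_dir="/in", out_dir="/out"):
        counts = Counter()
        for name in os.listdir(in_dir):
            with open(os.path.join(in_dir, name)) as f:
                for line in f:
                    counts.update(line.split())
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "word_counts.tsv"), "w") as out:
            for word, n in counts.most_common():
                out.write(f"{word}\t{n}\n")

    if __name__ == "__main__":
        main()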
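
Slide 5 calls /job a DAG of data manipulations expressed as text. A toy sketch of that idea, assuming Python 3.9+ for graphlib; the step names and placeholder work are made up. Because the whole job is plain data, it can be versioned, diffed, and shared like code.

    # dag_sketch.py - a toy, text-expressible DAG of data manipulations (names are made up).
    from graphlib import TopologicalSorter

    # Each step lists the upstream steps whose output it consumes.
    JOB = {
        "clean":    {"deps": [],                       "work": lambda: print("clean raw logs")},
        "sessions": {"deps": ["clean"],                "work": lambda: print("build sessions")},
        "features": {"deps": ["clean"],                "work": lambda: print("extract features")},
        "report":   {"deps": ["sessions", "features"], "work": lambda: print("write report")},
    }

    def run(job):
        # Resolve the declared data dependencies into an execution order, then run each step.
        for step in TopologicalSorter({k: v["deps"] for k, v in job.items()}).static_order():
            job[step]["work"]()

    if __name__ == "__main__":
        run(JOB)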
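
For slide 6’s /data and “pipelines run the same anywhere”: a sketch of a local test that runs the same container the cluster would run, mounted over the small checked-in sample data. The image tag and the /in and /out mount points are assumptions for illustration.

    # local_test.py - run the pipeline's container against local sample data before
    # submitting it to the cluster. Image name and mount points are hypothetical.
    import os
    import subprocess
    import tempfile

    SAMPLE_DATA = "./data"         # small sample inputs kept next to the pipeline
    JOB_IMAGE = "my-pipeline:dev"  # image built from the pipeline's /image spec

    with tempfile.TemporaryDirectory() as out_dir:
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{os.path.abspath(SAMPLE_DATA)}:/in",
             "-v", f"{out_dir}:/out",
             JOB_IMAGE],
            check=True,
        )
        print("local outputs:", os.listdir(out_dir))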
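
Slide 8 combines a data diff with the pipeline’s declared dependencies. A sketch of the resulting scheduling logic with a made-up graph; the diff itself would come from Pachyderm’s diff-aware file system, which is not modeled here.

    # incremental_sketch.py - given which inputs changed, decide which steps must rerun.
    # The graph and input names are hypothetical.

    READS = {"clean": {"raw_logs"}, "geo": {"geo_db"}, "report": set()}  # external inputs read by each step
    DEPS = {"clean": [], "geo": [], "report": ["clean", "geo"]}          # upstream steps feeding each step

    def dirty_steps(changed_inputs):
        """Every step whose inputs changed, directly or through an upstream step."""
        dirty = {step for step, reads in READS.items() if reads & changed_inputs}
        grew = True
        while grew:  # propagate downstream until a fixed point
            grew = False
            for step, deps in DEPS.items():
                if step not in dirty and any(d in dirty for d in deps):
                    dirty.add(step)
                    grew = True
        return dirty

    print(sorted(dirty_steps({"geo_db"})))  # ['geo', 'report'] - the cached 'clean' output is reused
    print(sorted(dirty_steps(set())))       # []                - nothing changed, nothing reruns

Only the dirty steps need to run; everything else can be served from previously computed results.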