Reproducible Data Science with Pachyderm

Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.

As data scientists, we know the best ideas come through collaboration. We also know reproducibility matters. Pachyderm makes both a reality, and this talk shows how. We’ll talk about data containers and analysis pipelines, and how they combine to make Pachyderm a Git for Data Science.

Najib Ninaba

March 16, 2017

Transcript

  1. About me
    • Co-founder & Principal Consultant @ REAL Analytics.
    • Loves to geek out about HPC, Clusters, Clouds, Containers, Orchestration, APIs & Backends.
    • Formerly Co-Founder of Scalable Systems Pte Ltd, Singapore Engineering Manager for Platform Computing Inc, and Singapore Development Manager for Revolution Analytics Inc.
  2. What is Reproducible Data Science?
    • The fundamental principle that enables Data Science
    • Being able to consistently reconstruct any previous state of your data and analysis
    • Feedback loop that relies on testing hypotheses
    • Must be able to 100% reproduce the input data, the output data and the analysis in exactly the same way
  3. Challenges faced by Data Science community
    "There was only one problem — all of my work was done in my local machine in R. People appreciate my efforts but they don’t know how to consume my model because it was not 'productionized' and the infrastructure cannot talk to my local model. Hard lesson learned!"
    - Robert Chang, data scientist at Twitter
  4. Challenges faced by Data Science community
    "Data engineers are often frustrated that data scientists produce inefficient and poorly written code, have little consideration for the maintenance cost of productionizing ideas, demand unrealistic features that skew implementation effort for little gain... The list goes on, but you get the point."
    - Jeff Magnusson, director of data platform at Stitch Fix
  5. Challenges faced by Data Science community
    "Data analysis is incredibly easy to get wrong, and it's just as hard to know when you're getting it right, which makes reproducible research all the more important!"
    - "Reproducibility is not just for researchers", Data School
  6. Challenges faced by Data Science community
    "Six months later, someone asks you a question you didn't cover so you need to reproduce your analysis. But you can't remember where the hell you saved the damn thing on your computer. If you're a data scientist (especially the decision sciences/analysis focused kind), this has happened to you."
    - "The Most Boring/Valuable Data Science Advice", by Justin Bozonier
  7. How do you achieve reproducibility?
    • Aim for simple and interpretable solutions
    • Strive to be testable and deployable
    • Version your data
    • Know your provenance
    • Documentation
  8. Introducing Pachyderm
    • "Git for Data Science"
    • Runs on Kubernetes (k8s)
    • Language and tooling agnostic
    • Alternative to Hadoop
    • Focuses on reproducibility, data provenance and collaboration
    • Website: http://pachyderm.io
  9. Pachyderm, the alternative to Hadoop
    • Replaces HDFS with its Pachyderm File System
    • Replaces MapReduce with its Pachyderm Pipeline System
    • Handles your distributed processing and automatically shards data
    • "Data Versioning" + "Data Pipelining"
  10. Pachyderm File System (PFS)
    • Same version control semantics as for code, but for massive data sets. A Time Machine for your data.
    • Backed by the object store of your choice (S3, Google Cloud Storage, Azure Storage, Minio, etc.).
    • Data does not live in Pachyderm itself; it stays in object storage and therefore keeps all the safety guarantees of the underlying system (e.g., replication and persistence).
  11. Getting data into PFS
    • Use the pachctl CLI (recommended)
    • Mount Pachyderm locally and add files directly through the FUSE interface (Linux and Mac OS X only)
    • Use the protobuf API; currently only Golang is supported. Other languages will be supported soon (come help!)
  12. Getting data out of PFS
    • Use the pachctl CLI (recommended)
    • Mount Pachyderm locally and read files directly through the FUSE interface (Linux and Mac OS X only)
    • Use the protobuf API; currently only Golang is supported. Other languages will be supported soon (come help!)
    • Use the Pachyderm Pipeline System (PPS)
  13. Pachyderm Pipeline System (PPS)
    • PPS is the containerized processing engine for PFS.
    • Data is exposed via the PFS interface as a local file system in the container (in the /pfs dir).
    • Runs jobs over PFS via k8s.
    • To get started, build your analysis code into a Docker container.
    • Use any language, libraries, tooling you want.
  14. How to create your analysis code for Pachyderm via PPS
    • Your analysis code just needs to read and write data on the local filesystem in the container.
    • Input data is read from an input directory under /pfs. The input directory has the same name as your repo in PFS.
    • Results are written to the output directory, which is hardcoded to /pfs/out.
    • Describe your PPS job in a JSON pipeline spec file.
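The read-from-/pfs, write-to-/pfs/out contract on slide 14 can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the repo name `sales`, the CSV columns `region` and `amount`, the output file name `totals.csv`, and the `INPUT_DIR`/`OUTPUT_DIR` environment overrides (added so the script can be tried outside a container) are all assumptions.

```python
import csv
import os
from pathlib import Path

# Assumption: the input repo is called "sales", so its files appear under
# /pfs/sales inside the container. /pfs/out is the output directory named
# on the slide. The environment overrides are only for local testing.
INPUT_DIR = Path(os.environ.get("INPUT_DIR", "/pfs/sales"))
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/pfs/out"))


def total_by_region(input_dir: Path) -> dict:
    """Sum the hypothetical 'amount' column per 'region' over all CSV files."""
    totals = {}
    for path in sorted(input_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                region = row["region"]
                totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


def run(input_dir: Path, output_dir: Path) -> None:
    """Read every input file, then write one result file to the output dir."""
    totals = total_by_region(input_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with (output_dir / "totals.csv").open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["region", "total"])
        for region in sorted(totals):
            writer.writerow([region, totals[region]])


if __name__ == "__main__":
    if INPUT_DIR.is_dir():
        run(INPUT_DIR, OUTPUT_DIR)
    else:
        print(f"input directory {INPUT_DIR} not found; nothing to do")
```

Because the script only touches ordinary files, the same code runs unchanged on a laptop (with the two directories overridden) and inside the container that the pipeline spec points at.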
  15. PPS is a data-aware container scheduler
    • PPS jobs are attached to Pachyderm Data Repos (PDR).
    • Jobs are triggered by commits to the PDR.
    • Understands job dependencies.
    • Supports multiple pipelines for a Pachyderm workflow.
    • Jobs are resilient and can be restarted.
    • Efficient and supports incremental processing, whether in batched mode or streaming.
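The idea of commit-triggered, incremental processing on slide 15 can be illustrated with a toy Python sketch. It is not Pachyderm's implementation: commits are modelled as plain dictionaries mapping a path to file bytes, and the function names `snapshot`, `changed_paths` and `run_job` are hypothetical. The point is only that a job triggered by a new commit needs to reprocess just the files that differ from the parent commit.

```python
import hashlib
from typing import Callable, Dict, List


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each path in a commit to a SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_paths(parent: Dict[str, str], child: Dict[str, str]) -> List[str]:
    """Paths that are new in the child commit or whose content changed."""
    return sorted(path for path, digest in child.items() if parent.get(path) != digest)


def run_job(
    parent_files: Dict[str, bytes],
    child_files: Dict[str, bytes],
    process: Callable[[bytes], object],
    previous_results: Dict[str, object],
) -> Dict[str, object]:
    """Reuse earlier results and call `process` only on new or changed files."""
    parent, child = snapshot(parent_files), snapshot(child_files)
    # Keep results for files that still exist; drop results for deleted files.
    results = {p: r for p, r in previous_results.items() if p in child}
    for path in changed_paths(parent, child):
        results[path] = process(child_files[path])
    return results
```

For example, if a second commit changes one file and adds another, `run_job` calls `process` twice and carries the untouched file's result over from the previous run.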
  16. Use Case: General Fusion
    • Had outgrown its existing data infrastructure and needed a solution that could meet its requirements.
    • Needed to augment (not "rip and replace") existing experimental & analysis workflows.
    • Needed to facilitate seamless collaboration with external scientific partners and ad hoc sharing of large data sets.
    • With Pachyderm, the General Fusion team can stay focused on plasma physics instead of designing and maintaining big data systems. The combination of language-agnostic infrastructure and version-controlled data allows them to efficiently develop and iterate on their data analysis.
    General Fusion is developing fusion energy: a clean, safe, abundant and cost-competitive form of power. The company aims to design the world’s first full-scale demonstration fusion power plant based on commercially viable technology.
    "The true tipping point in our decision to use Pachyderm was its version control features for managing our data."
    - Jonathan Fraser, Engineer at General Fusion
  17. Use Case: Fogger
    • Evaluated building their own solution in-house or using something like Hadoop/Spark.
    • The learning curve and infrastructure overhead of Hadoop/Spark led them to Pachyderm, as they were already using containers for their stack.
    • Containers allow Fogger to build data-processing algorithms in any programming language, and they did not have to learn any new technology other than the Pachyderm CLI itself.
    Fogger makes a software platform for processing sensor data on industrial machinery such as solar farms and wind turbines. Its Fog Computing platform allows data processing on small Linux boxes close to the machines and pushes it over a peer-to-peer network to a central cloud hub. It uses Pachyderm for local data processing on its way to the cloud.
    "Pachyderm has a very well-designed technological stack. We love the idea of map/reduce pipelines built with containers and a simple Git-like triggering system."
    - Kamil Kozak, CEO
  18. Other Use Cases
    • Prodigy Finance, a FinTech company that provides a platform offering loans to international postgraduate students attending top universities, has started to use Pachyderm for its data archive and ML pipeline.
    • Video and image processing, Fraud Analysis, ETL pipelines, Sales Funnel Analysis.
    • Pachyderm also has a lot of users working on genomics and bioinformatics, where reproducibility and dynamically scaling data pipelines are really important.
  19. Pachyderm Resources
    • Website: http://pachyderm.io/
    • GitHub Repo: https://github.com/pachyderm/pachyderm
    • Twitter: https://twitter.com/pachydermIO
    • Pachyderm Slack: http://slack.pachyderm.io/
  20. Summary
    • Challenges faced and ideas/tips on solving them
    • Pachyderm architecture, benefits and use cases