Reproducible Data Science with Pachyderm

Najib Ninaba
Principal Consultant @ REAL Analytics
[email protected]

About me
• Co-founder & Principal Consultant @ REAL Analytics.
• Loves to geek out about HPC, clusters, clouds, containers, orchestration, APIs & backends.
• Formerly Co-Founder of Scalable Systems Pte Ltd, Singapore Engineering Manager for Platform Computing Inc, and Singapore Development Manager for Revolution Analytics Inc.

What is Reproducible Data Science?
• The fundamental principle that enables Data Science.
• Being able to consistently reconstruct any previous state of your data and analysis.
• A feedback loop that relies on testing hypotheses.
• You must be able to reproduce the input data, the output data, and the analysis exactly, every time.

Challenges faced by Data Science community
"There was only one problem - all of my work was done in my local machine in R. People appreciate my efforts but they don’t know how to consume my model because it was not 'productionized' and the infrastructure cannot talk to my local model. Hard lesson learned!"
- Robert Chang, data scientist at Twitter

Challenges faced by Data Science community
"Data engineers are often frustrated that data scientists produce inefficient and poorly written code, have little consideration for the maintenance cost of productionizing ideas, demand unrealistic features that skew implementation effort for little gain... The list goes on, but you get the point."
- Jeff Magnusson, director of data platform at Stitch Fix

Challenges faced by Data Science community
"Data analysis is incredibly easy to get wrong, and it's just as hard to know when you're getting it right, which makes reproducible research all the more important!"
- Reproducibility is not just for researchers, Data School

Challenges faced by Data Science community
"Six months later, someone asks you a question you didn't cover so you need to reproduce your analysis. But you can't remember where the hell you saved the damn thing on your computer. If you're a data scientist (especially the decision sciences/analysis focused kind), this has happened to you."
- The Most Boring/Valuable Data Science Advice, by Justin Bozonier

Lack of Reproducibility

Why should you care about reproducibility?
• Collaboration
• Creativity and innovation
• Compliance

How do you achieve reproducibility?
• Aim for simple and interpretable solutions
• Strive to be testable and deployable
• Version your data
• Know your provenance
• Document your work

Introducing Pachyderm
• "Git for Data Science"
• Runs on Kubernetes (k8s)
• Language and tooling agnostic
• Alternative to Hadoop
• Focuses on reproducibility, data provenance and collaboration
• Website: http://pachyderm.io

Pachyderm, the alternative to Hadoop
• Replaces HDFS with its Pachyderm File System
• Replaces MapReduce with its Pachyderm Pipeline System
• Handles your distributed processing and automatically shards data
• "Data Versioning" + "Data Pipelining"

Pachyderm File System (PFS)
• Same version control semantics as for code, but for massive data sets. Time Machine for your data.
• Backed by the object store of your choice (S3, Google Cloud Storage, Azure Storage, Minio, etc.).
• Data does not live in Pachyderm. It stays in object storage and therefore has all the safety guarantees of those underlying systems (e.g., replication and persistence).
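To make "version control semantics for data" concrete, here is a deliberately tiny in-memory model of a repo whose every commit is a readable snapshot. It is an illustration only: the `ToyRepo` class and its methods are hypothetical names invented for this sketch and say nothing about how PFS is actually implemented.

```python
import hashlib


class ToyRepo:
    """Toy model of a versioned data repo; NOT Pachyderm's implementation."""

    def __init__(self):
        self.objects = {}   # content hash -> bytes (stands in for the object store)
        self.commits = []   # each commit is a full {path: content hash} snapshot

    def commit(self, changes):
        """Record a new snapshot that applies `changes` ({path: bytes}) to the latest one."""
        snapshot = dict(self.commits[-1]) if self.commits else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects[digest] = data   # identical content is stored only once
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1      # the commit id is just its index here

    def read(self, commit_id, path):
        """Return the file contents exactly as they were at `commit_id`."""
        return self.objects[self.commits[commit_id][path]]


repo = ToyRepo()
first = repo.commit({"sales.csv": b"day,total\n1,10\n"})
second = repo.commit({"sales.csv": b"day,total\n1,10\n2,25\n"})

# The earlier state is still readable after later commits.
assert repo.read(first, "sales.csv") == b"day,total\n1,10\n"
assert repo.read(second, "sales.csv").endswith(b"2,25\n")
```

The point of the sketch is the "Time Machine" property: a later commit never overwrites an earlier one, so any previous state of the data can be read back by its commit id.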

Getting data into PFS
• Use the pachctl CLI (recommended)
• Mount Pachyderm locally and add files directly through the FUSE interface (Linux and Mac OS X only)
• Use the protobuf API. Currently only Golang is supported; other languages will be supported soon (come help!)

Getting data out of PFS
• Use the pachctl CLI (recommended)
• Mount Pachyderm locally and read files directly through the FUSE interface (Linux and Mac OS X only)
• Use the protobuf API. Currently only Golang is supported; other languages will be supported soon (come help!)
• Use the Pachyderm Pipeline System (PPS)
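As a rough sketch of the CLI route for moving data in and out, the snippet below only assembles and prints pachctl command lines; it does not run them. The subcommand spelling is an assumption that depends on the installed release: the `create repo`, `put file` and `get file` forms with `repo@branch:path` addressing are used here, while older 1.x releases used hyphenated forms such as `create-repo` and `put-file`. The repo name `sensor-data`, the file names, and the two helper functions are hypothetical.

```python
import shlex


def put_file_cmd(repo, branch, dest_path, local_path):
    """Command that would upload local_path to repo@branch:dest_path."""
    return ["pachctl", "put", "file", f"{repo}@{branch}:{dest_path}", "-f", local_path]


def get_file_cmd(repo, branch, path):
    """Command that would print the file at repo@branch:path to stdout."""
    return ["pachctl", "get", "file", f"{repo}@{branch}:{path}"]


# Hypothetical repo and file names, chosen only for illustration.
commands = [
    ["pachctl", "create", "repo", "sensor-data"],
    put_file_cmd("sensor-data", "master", "/readings.csv", "readings.csv"),
    get_file_cmd("sensor-data", "master", "/readings.csv"),
]

# Print the commands for review instead of executing them.
for cmd in commands:
    print(shlex.join(cmd))
```

Check `pachctl --help` on your cluster's version before running any of these; passing the reviewed lists to `subprocess.run(cmd, check=True)` is one way to execute them from a script.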

Pachyderm Pipeline System (PPS)
• PPS is the containerized processing engine for PFS.
• Data is exposed via the PFS interface as a local file system in the container (in the /pfs directory).
• Runs jobs over PFS via k8s.
• To get started, build your analysis code into a Docker container.
• Use any language, libraries, and tooling you want.

How to create your analysis code for Pachyderm via PPS
• Your analysis code just needs to read and write data on the local filesystem in the container.
• Input data is read from an input directory under /pfs. The input directory has the same name as your repo in PFS.
• Results are written to the output directory, which is hardcoded to /pfs/out.
• Describe your PPS job in a JSON pipeline spec file.
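The filesystem contract above can be sketched as a small Python script: read every file from the input directory, write results to the output directory. This is a minimal sketch, not a Pachyderm API; the repo name `sensor-data`, the `count_lines` function, and the `.count` output naming are all hypothetical. Directories are passed as parameters so the same function can be tried locally on any folder.

```python
from pathlib import Path


def count_lines(input_dir, output_dir):
    """Write one '<name>.count' file per top-level input file, holding its line count."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(p for p in input_dir.iterdir() if p.is_file()):
        with src.open("rb") as f:
            n = sum(1 for _ in f)          # count lines without loading the whole file
        dest = output_dir / (src.name + ".count")
        dest.write_text(f"{n}\n")
        written.append(dest)
    return written


PFS_ROOT = Path("/pfs")

# Only runs inside a pipeline container, where /pfs exists.
# "sensor-data" is a hypothetical repo name; /pfs/out is the output directory.
if __name__ == "__main__" and PFS_ROOT.is_dir():
    count_lines(PFS_ROOT / "sensor-data", PFS_ROOT / "out")
```

Because the code only touches ordinary paths, nothing in it is Pachyderm-specific, which is what lets the same approach work in any language.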

PPS is a data-aware container scheduler
• PPS jobs are attached to Pachyderm Data Repos (PDR).
• Jobs are triggered by commits to the PDR.
• Understands job dependencies.
• Supports multiple pipelines for a Pachyderm workflow.
• Jobs are resilient and can be restarted.
• Efficient, with support for incremental processing in both batch and streaming modes.
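A two-stage workflow might be described with JSON pipeline specs along the following lines, generated here from Python. Treat the field names as assumptions: the `pipeline`, `transform` and `input.pfs` layout with a `glob` pattern matches newer pipeline specs, while older releases spelled the input section differently, so check the spec reference for your version. The pipeline names, image names, commands, and the `pipeline_spec` helper are hypothetical. The second stage takes the first stage's output as its input by naming it as a repo, which is how one pipeline comes to depend on another.

```python
import json


def pipeline_spec(name, image, cmd, input_repo, glob="/*"):
    """Build a pipeline spec as a dict; field names are assumed and vary by version."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# Stage 1 is attached to a (hypothetical) data repo and runs on new commits to it.
clean = pipeline_spec(
    "clean", "example/clean:1.0", ["python3", "/code/clean.py"], "sensor-data"
)

# Stage 2 reads stage 1's output, referenced by the upstream pipeline's name.
report = pipeline_spec(
    "report",
    "example/report:1.0",
    ["python3", "/code/report.py"],
    input_repo=clean["pipeline"]["name"],
    glob="/",
)

for spec in (clean, report):
    print(json.dumps(spec, indent=2))
```

Each printed document would be saved to its own file and submitted with the CLI. The glob pattern is the knob that controls how input is split across workers: `/*` treats each top-level file as a separate unit, while `/` processes the whole repo as one.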

Use Case: General Fusion
• Outgrew its existing data infrastructure and needed a solution that could meet its requirements.
• Needed to augment (not "rip and replace") existing experimental & analysis workflows.
• Needed to facilitate seamless collaboration with external scientific partners and ad hoc sharing of large data sets.
• With Pachyderm, the General Fusion team can stay focused on plasma physics instead of designing and maintaining big data systems. The combination of language-agnostic infrastructure and version-controlled data allows them to efficiently develop and iterate on their data analysis.
General Fusion is developing fusion energy: a clean, safe, abundant and cost-competitive form of power. The company aims to design the world’s first full-scale demonstration fusion power plant based on commercially viable technology.
“The true tipping point in our decision to use Pachyderm was its version control features for managing our data.”
- Jonathan Fraser, Engineer at General Fusion

Use Case: Fogger
• Evaluated building their own solution in-house or using something like Hadoop/Spark.
• The learning curve and infrastructure overhead of Hadoop/Spark led them to Pachyderm, as they were already using containers for their stack.
• Containers allow Fogger to build data-processing algorithms in any programming language, and they did not have to learn any new technology other than the Pachyderm CLI itself.
Fogger makes a software platform for processing sensor data on industrial machinery such as solar farms and wind turbines. Its Fog Computing platform allows data processing on small Linux boxes close to the machines and pushes it over a peer-to-peer network to a central cloud hub. It uses Pachyderm for local data processing on its way to the cloud.
“Pachyderm has a very well-designed technological stack. We love the idea of map/reduce pipelines built with containers and a simple Git-like triggering system.”
- Kamil Kozak, CEO

Other Use Cases
• Prodigy Finance, a FinTech company whose platform offers loans to international postgraduate students attending top universities, has started to use Pachyderm for its data archive and ML pipeline.
• Video and image processing, fraud analysis, ETL pipelines, sales funnel analysis.
• Pachyderm also has a lot of users working on genomics and bioinformatics, where reproducibility and dynamically scaling data pipelines are really important.

Pachyderm Resources
• Website: http://pachyderm.io/
• GitHub Repo: https://github.com/pachyderm/pachyderm
• Twitter: https://twitter.com/pachydermIO
• Pachyderm Slack: http://slack.pachyderm.io/

Credits & Attributions
• Daniel Whitenack, Joe Doliner and the Pachyderm Team
• Pachyderm Slack Team

Summary
• Challenges faced and ideas/tips on solving them
• Pachyderm architecture, benefits and use cases

Demo Time!

Endtro
• Any questions?
• Reach me via Twitter: @najibninaba or email: [email protected]
• Enjoy the rest of the session!
