Slide 1

Slide 1 text

Building Scalable Data Science Pipeline
Luigi | Spark | Flask
www.unnati.xyz
Raghotham S, Nischal HP

Slide 2

Slide 2 text

Agenda
● Introduction
● Data engineering
● Machine learning
● Data pipelines
● API
● Hands-on

Slide 3

Slide 3 text

Introduction: applying software engineering principles to data science

Slide 4

Slide 4 text

Data Engineering: the process of acquiring, cleaning, transforming, and persisting data
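
The four steps named on this slide can be sketched with the Python standard library alone. This is a minimal illustration, not the presenters' code: the column names (`trip_id`, `duration_sec`, `start_station`), the sample rows, and the `trips` table are hypothetical, chosen only to resemble a bike share trip log.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; a real pipeline would read a file or call an API.
RAW_CSV = """trip_id,duration_sec,start_station
1,600,Market St
2,,Townsend St
3,1800,Market St
4,abc,Embarcadero
"""

def acquire(text):
    """Acquire: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean: drop rows whose duration is missing or not a whole number."""
    return [r for r in rows if r["duration_sec"].strip().isdigit()]

def transform(rows):
    """Transform: cast types and derive the duration in minutes."""
    return [
        (int(r["trip_id"]), r["start_station"], int(r["duration_sec"]) / 60)
        for r in rows
    ]

def persist(records, conn):
    """Persist: write the transformed records to a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(trip_id INTEGER PRIMARY KEY, start_station TEXT, duration_min REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
persist(transform(clean(acquire(RAW_CSV))), conn)
stored = conn.execute(
    "SELECT trip_id, start_station, duration_min FROM trips ORDER BY trip_id"
).fetchall()
print(stored)
```

Keeping each step a separate function means any one of them can later be swapped for a distributed equivalent without touching the others.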

Slide 5

Slide 5 text

Machine Learning: the art and science of choosing a model and scaling it
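
The "choosing a model" half of this slide can be illustrated by scoring candidate models on held-out data and keeping the one with the lowest error. The sketch below is an assumption-laden toy: the (hour, trips) pairs are invented, and the two candidates (a mean baseline and a least-squares line) are stand-ins for whatever models a real project would compare. Scaling is not covered here.

```python
from statistics import mean

# Hypothetical data: (hour_of_day, trips_started); not from the real dataset.
train = [(6, 12), (7, 20), (8, 31), (9, 39), (10, 52)]
holdout = [(11, 60), (12, 71)]

def fit_mean(data):
    """Baseline model: always predict the average target seen in training."""
    avg = mean(y for _, y in data)
    return lambda x: avg

def fit_line(data):
    """Least-squares straight line y = a + b * x."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum(
        (x - x_bar) ** 2 for x, _ in data
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def mae(model, data):
    """Mean absolute error of a fitted model on (x, y) pairs."""
    return mean(abs(model(x) - y) for x, y in data)

candidates = {"mean_baseline": fit_mean(train), "linear": fit_line(train)}
scores = {name: mae(model, holdout) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Scoring on data the models never saw during fitting is what makes the comparison meaningful; scoring on the training set would favour whichever model memorises best.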

Slide 6

Slide 6 text

Data Pipelines: plumbing that connects data engineering and machine learning tasks

Slide 7

Slide 7 text

API: expose data science as a service
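
As a rough sketch of what "data science as a service" can look like, the block below wraps a prediction function in a small WSGI application using only the standard library. The deck itself uses Flask; Flask application objects are WSGI callables as well, so the shape is comparable, but this is not Flask code. The `/predict` route, the `hour` parameter, and the `predict_trips` function with its made-up formula are all hypothetical.

```python
import json
from urllib.parse import parse_qs

def predict_trips(hour):
    """Hypothetical stand-in for a trained model's predict function."""
    return 5 * hour

def app(environ, start_response):
    """WSGI application exposing GET /predict?hour=<int> as JSON."""
    headers = [("Content-Type", "application/json")]
    if environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", headers)
        return [json.dumps({"error": "not found"}).encode()]
    query = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        hour = int(query["hour"][0])
    except (KeyError, ValueError):
        start_response("400 Bad Request", headers)
        return [json.dumps({"error": "hour must be an integer"}).encode()]
    body = json.dumps({"hour": hour, "predicted_trips": predict_trips(hour)})
    start_response("200 OK", headers)
    return [body.encode()]

def call(path, query=""):
    """Invoke the app in-process, without opening a socket."""
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path, "QUERY_STRING": query}
    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))

print(call("/predict", "hour=8"))
```

To serve it over HTTP for local experiments, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library should be enough; a production deployment would sit behind a proper WSGI server.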

Slide 8

Slide 8 text

Project Structure

Slide 9

Slide 9 text

Hands-on: Bay Area Bike Share dataset; hypothesis-based solution

Slide 10

Slide 10 text

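Apache Spark
● Distributed in-memory computing
● Distributed machine learning framework
● Up to 100x faster than Hadoop MapReduce for in-memory workloads (the Spark project's own claim)
● Resilient Distributed Datasets (RDDs)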
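
The computation model behind RDD operations (aggregate each partition independently, then merge the partial results) can be shown in plain Python. This sketch does not use Spark and runs the partitions sequentially; the station names and the three-way partitioning are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical start stations, pre-split into partitions as a cluster would hold them.
partitions = [
    ["Market St", "Townsend St", "Market St"],
    ["Embarcadero", "Market St"],
    ["Townsend St"],
]

def count_partition(stations):
    """Map side: aggregate one partition without looking at the others."""
    return Counter(stations)

def merge(left, right):
    """Reduce side: combine two partial counts; order of merging does not matter."""
    return left + right

# On a cluster, each call to count_partition would run on a separate worker.
partials = [count_partition(p) for p in partitions]
totals = reduce(merge, partials, Counter())
print(totals.most_common())
```

In PySpark, roughly the same result would come from `rdd.map(lambda s: (s, 1)).reduceByKey(lambda a, b: a + b)`, with Spark handling the partitioning and the shuffle.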

Slide 11

Slide 11 text

Luigi
● Complex pipelines
● Dependency resolution
● Workflow management
● Visualization
● Exception handling
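
Dependency resolution, the second bullet above, can be sketched without Luigi itself. The toy below borrows the method names `requires`, `complete`, and `run` from Luigi's `Task` class, but everything else is an assumption made for illustration: completion is tracked in an in-memory set rather than by checking output targets, and the four task classes are hypothetical. Cycle detection, visualization, and exception handling are left out.

```python
class Task:
    """Minimal task: subclasses declare upstream tasks and a unit of work."""

    done = set()  # names of tasks whose output already exists
    log = []      # execution order, kept for inspection

    def requires(self):
        return []

    def complete(self):
        return type(self).__name__ in Task.done

    def run(self):
        raise NotImplementedError

def build(task):
    """Run upstream tasks depth-first, skipping any that are already complete."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream)
    task.run()
    Task.done.add(type(task).__name__)
    Task.log.append(type(task).__name__)

class Extract(Task):
    def run(self):
        pass  # e.g. download raw trip data

class Clean(Task):
    def requires(self):
        return [Extract()]

    def run(self):
        pass  # e.g. drop malformed rows

class Train(Task):
    def requires(self):
        return [Clean()]

    def run(self):
        pass  # e.g. fit a model

class Report(Task):
    def requires(self):
        return [Clean(), Train()]

    def run(self):
        pass  # e.g. write evaluation metrics

build(Report())
print(Task.log)
```

`Clean` is required by both `Train` and `Report` yet runs only once, because the completeness check short-circuits the second request. Re-running `build(Report())` does nothing for the same reason.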

Slide 12

Slide 12 text

What did we learn today?

Slide 13

Slide 13 text

What did we learn today? Building a scalable data science platform is easy.

Slide 15

Slide 15 text

Thank You @unnati_xyz