Building Scalable Data Science Pipeline

July 31, 2016


"In theory, there is no difference between theory and practice. But in practice, there is." - Yogi Berra

Once a data science solution has been prototyped on a local machine, the real challenge begins: making it work in production. Ensuring that the plumbing of the data pipeline works in production at scale is both an art and a science. The science lies in understanding the tools and technologies needed to connect the stages of the pipeline, while the art lies in making the trade-offs needed to tune the pipeline so that it flows.

In this workshop, you will learn how to build a scalable data science platform: set up and conduct data engineering using Pandas and Luigi, build a machine learning model with Apache Spark, and deploy it as a predictive API with Flask.

unnati_xyz


Transcript

Building Scalable Data Science Pipeline luigi | spark | flask
www.unnati.xyz Raghotham S Nischal HP
Agenda • Introduction • Data engineering • Machine Learning • Data Pipelines • API • Hands on
Introduction Applying software engineering principles to Data Science
Data Engineering Process of acquiring, cleaning, transforming & persisting data
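The four steps named on this slide (acquire, clean, transform, persist) can be sketched as below. This is a minimal illustration using only the Python standard library as a stand-in for Pandas; the sample rows, column names and the `trips` table are hypothetical and are not taken from the actual Bay Area Bike Share dataset.

```python
import csv
import io
import sqlite3

# Hypothetical raw records; the columns are illustrative, not the real dataset schema.
RAW_CSV = """trip_id,duration_sec,start_station
1,600,Market St
2,,Townsend St
3,1800,market st
"""


def acquire(text):
    """Acquire: read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows with a missing duration and normalise station names."""
    return [
        {**row, "start_station": row["start_station"].strip().title()}
        for row in rows
        if row["duration_sec"].strip()
    ]


def transform(rows):
    """Transform: convert types and express the duration in minutes."""
    return [
        (int(row["trip_id"]), int(row["duration_sec"]) / 60, row["start_station"])
        for row in rows
    ]


def persist(records, conn):
    """Persist: write the records to a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(trip_id INTEGER, duration_min REAL, start_station TEXT)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", records)
    conn.commit()


def run(conn):
    """Chain the four steps end to end."""
    persist(transform(clean(acquire(RAW_CSV))), conn)
```

Calling `run(sqlite3.connect(":memory:"))` loads the two valid rows into an in-memory database; keeping each step a separate function is what later lets a workflow tool schedule them as separate tasks.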
Machine Learning Art & science of choosing a model & scaling it
Data Pipelines Plumbing data engineering & machine learning tasks
API Expose data science as a service
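One way to picture "data science as a service" is a small HTTP endpoint that accepts features as JSON and returns a prediction. The sketch below is a plain WSGI callable written with the standard library only, not Flask itself (Flask applications are also WSGI applications); the `/predict` route, the `hour` feature and the fixed linear `predict` function are assumptions made for illustration, standing in for a trained model.

```python
import io
import json


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 2.0 * features.get("hour", 0) + 1.0


def app(environ, start_response):
    """WSGI callable exposing POST /predict with a JSON body."""
    if environ.get("PATH_INFO") != "/predict" or environ.get("REQUEST_METHOD") != "POST":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"prediction": predict(payload)}).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call(path, method="POST", data=None):
    """Invoke the app in-process, without starting a server."""
    raw = json.dumps(data or {}).encode("utf-8")
    environ = {
        "PATH_INFO": path,
        "REQUEST_METHOD": method,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

For a quick local trial, `app` can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; the `call` helper exercises the same code path without opening a socket.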
Project Structure
Hands on • Dataset: Bay Area Bike Share • Hypothesis-based solution
Apache Spark • Distributed in-memory computing • Distributed machine learning framework • Up to 100x faster than Hadoop MapReduce for in-memory workloads • RDDs (Resilient Distributed Datasets)
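The idea behind distributed computation over partitioned data can be shown with a single-process analogy: split the data into partitions, compute a partial result per partition, then merge the partial results. The sketch below uses only the standard library and is not Spark's API; the function names and the station labels in the usage note are hypothetical.

```python
from collections import Counter
from functools import reduce


def partition(items, n):
    """Split items into n chunks, as a cluster would spread them over workers."""
    return [items[i::n] for i in range(n)]


def count_partition(stations):
    """Map step: count trips per station within one partition."""
    return Counter(stations)


def merge(left, right):
    """Reduce step: combine two partial counts."""
    return left + right


def trips_per_station(stations, n_partitions=2):
    """Count trips per station by mapping over partitions and merging the results."""
    partials = [count_partition(p) for p in partition(stations, n_partitions)]
    return reduce(merge, partials, Counter())
```

For example, `trips_per_station(["A", "B", "A", "C", "A"])` counts three trips for station `A` and one each for `B` and `C`. In a real cluster the map step would run on separate workers and only the small partial counts would be sent over the network.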
Luigi • Complex pipelines • Dependency resolution • Workflow management • Visualization • Exception handling
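Dependency resolution is the core idea here: each task declares what it requires, and the runner works out an order in which every dependency finishes before the task that needs it. The toy runner below mimics that pattern in plain Python. It is not Luigi's API (Luigi tasks also declare output targets and parameters), and the task names `Extract`, `Clean`, `Train` and `Report` are hypothetical.

```python
class Task:
    """Toy task: subclasses declare their dependencies via requires()."""

    def requires(self):
        return []

    def run(self, log):
        # Real work would go here; this sketch only records the execution order.
        log.append(type(self).__name__)


class Extract(Task):
    pass


class Clean(Task):
    def requires(self):
        return [Extract()]


class Train(Task):
    def requires(self):
        return [Clean()]


class Report(Task):
    def requires(self):
        return [Clean(), Train()]


def build(task, log=None, done=None):
    """Run dependencies depth-first, then the task; skip tasks already completed.

    This sketch assumes the dependency graph has no cycles.
    """
    log = [] if log is None else log
    done = set() if done is None else done
    name = type(task).__name__
    if name in done:
        return log
    for dep in task.requires():
        build(dep, log, done)
    task.run(log)
    done.add(name)
    return log
```

`build(Report())` runs `Extract`, `Clean`, `Train` and `Report` in that order, and `Clean` runs only once even though two tasks require it.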
What did we learn today? Building a scalable data science platform is easy
Thank You @unnati_xyz