Future Patterns in Data Ecosystem

Siddhartha Reddy, Architect, Flipkart

Pattern #1: Data as a 1st class citizen in SDLC

Data availability for Analytics & Data Science
✤ Product teams
  ☛ develop applications
  ☛ store data in databases
  ☛ log information for debugging
✤ A central data team
  ☛ pulls data from the databases
  ☛ parses log files
  ☛ loads into a data warehouse

This is broken
✤ No contract between product and data teams
  ☛ frequent breakages
✤ Data team spends most time
  ☛ fixing issues & catching up to changes
✤ Availability of data in logs
  ☛ is a matter of luck
✤ Tight coupling, no cohesiveness
✤ Does not scale
✤ Data is an afterthought

“Data is the only true IP of an internet company.”
–Amod Malviya, CTO, Flipkart

Data availability for Analytics & Data Science
✤ Product teams
  ☛ develop applications
  ☛ store data in databases
  ☛ log information for debugging
  ☛ push data to a central data repository
✤ The central data team
  ☛ can take a nice long vacation
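A minimal sketch of what "push with a contract" could look like, assuming a contract is just a mapping from field name to required type. The names `ORDER_PLACED_V1`, `InMemoryRepository` and `ContractViolation` are hypothetical and only illustrate the idea; they are not part of any Flipkart system.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical contract agreed between a product team and the data team:
# field name -> required Python type.
ORDER_PLACED_V1 = {"order_id": str, "amount_paise": int, "placed_at": str}


class ContractViolation(ValueError):
    """Raised when a pushed event does not satisfy its contract."""


def validate(event: dict[str, Any], contract: dict[str, type]) -> None:
    # Reject events with missing fields first, then events with wrong types.
    missing = [name for name in contract if name not in event]
    if missing:
        raise ContractViolation(f"missing fields: {missing}")
    wrong = [n for n, t in contract.items() if not isinstance(event[n], t)]
    if wrong:
        raise ContractViolation(f"wrong types: {wrong}")


@dataclass
class InMemoryRepository:
    """Stand-in for a central data repository (assumption for this sketch)."""
    events: list = field(default_factory=list)

    def push(self, event: dict[str, Any], contract: dict[str, type]) -> None:
        # The producer pushes; a bad event fails loudly on the producer side.
        validate(event, contract)
        self.events.append(dict(event))


repo = InMemoryRepository()
repo.push(
    {"order_id": "OD1", "amount_paise": 49900, "placed_at": "2015-01-01T10:00:00Z"},
    ORDER_PLACED_V1,
)
```

Because validation happens at push time, a breaking change surfaces in the product team's own code path instead of in a downstream warehouse load.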

At Flipkart
✤ Dart: Central ingestion service
✤ Multiple modes of ingestion: HTTP service, daemon running alongside applications, bulk ingestion etc.
✤ Push, not pull
✤ Responsibility of ensuring ingestion pipeline is healthy is with product teams, not Dart team
✤ Audit: Ability to match ingested data with data in databases
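The audit idea, matching ingested data against the source database, can be sketched as a set reconciliation over record identifiers. This is only an illustration under that assumption; the function name `audit` and the sample IDs are hypothetical, and how Dart actually performs audits is not described in the slides.

```python
def audit(source_ids, ingested_ids):
    """Compare IDs in the source database with IDs seen by ingestion."""
    source, ingested = set(source_ids), set(ingested_ids)
    return {
        # Present in the database but never ingested.
        "missing": sorted(source - ingested),
        # Ingested but not found in the database.
        "unexpected": sorted(ingested - source),
    }


report = audit(["OD1", "OD2", "OD3"], ["OD1", "OD3", "OD9"])
print(report)
```

An empty `missing` list is the signal that the ingestion pipeline is keeping up with the database.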

Data as a 1st class citizen in SDLC
✤ Apache
✤ (Facebook)
✤ Apache

Pattern #2: Primary users will be machines, not humans

At Flipkart
✤ Analytics
✤ Systemic Consumption
✤ Recommendations / Personalisation
✤ Inventory Planning
✤ Fraud Detection
✤ Pricing Engine
✤ …

Pattern #3: Real-time processing takes centre stage

Real-time
✤ Daily (or weekly, or monthly)
✤ Hourly
✤ A few minutes
✤ A few seconds
✤ Instantaneous

✤ Apache
✤ Apache — Streaming
✤ Apache
✤ AWS
✤ Google Cloud Dataflow
✤ Druid

At Flipkart
✤ Challenges
✤ windowing
✤ out-of-order data
✤ out-of-phase streams
✤ state mutations in streams
✤ Storm based “recipes”
✤ Joins (symmetric, asymmetric)
✤ Aggregations
✤ Bootstrapping
✤ Multiple
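To make the windowing and out-of-order challenges concrete, here is a small sketch of event-time tumbling windows with a watermark. It is plain Python, not Storm code and not one of the "recipes" named above; the class name `TumblingWindowCounter` and the allowed-lateness policy are assumptions chosen for illustration.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed event-time window, tolerating some lateness."""

    def __init__(self, window_seconds, allowed_lateness_seconds):
        self.window = window_seconds
        self.lateness = allowed_lateness_seconds
        self.open_counts = defaultdict(int)  # window start -> running count
        self.closed_counts = {}              # window start -> final count
        self.dropped = 0                     # events that arrived too late
        self.max_event_time = None

    def _watermark(self):
        # Assumption: the watermark trails the largest event time seen so far.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lateness

    def add(self, event_time):
        start = event_time - (event_time % self.window)
        watermark = self._watermark()
        if watermark is not None and start + self.window <= watermark:
            self.dropped += 1  # the window was already finalised
            return
        self.open_counts[start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self._watermark()
        # Finalise every open window whose end is at or before the watermark.
        for window_start in sorted(self.open_counts):
            if window_start + self.window <= watermark:
                self.closed_counts[window_start] = self.open_counts.pop(window_start)


counter = TumblingWindowCounter(window_seconds=60, allowed_lateness_seconds=30)
for t in [5, 70, 50, 130, 20]:  # event times in seconds, arriving out of order
    counter.add(t)
```

In this run the event at `t=50` arrives after `t=70` but is still counted, while `t=20` arrives after its window has been finalised and is dropped. Choosing the lateness bound is a trade-off between freshness and completeness.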

Pattern #4: Convergence between Batch and Stream processing

✤ Batch Processing: Scale & power
✤ Stream Processing: Freshness

Lambda Architecture

At Flipkart
✤ Batch Processing: Hadoop + Vertica
✤ Stream Processing: Storm + ElasticSearch
✤ Query: Apache
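The Lambda Architecture query path can be sketched as merging a batch view with a real-time view at read time. The sketch below uses in-memory dictionaries in place of the batch and stream stores, and `BATCH_CUTOFF`, `build_view` and `query` are hypothetical names introduced only for this example.

```python
# (product_id, event_time) pairs; event_time is an abstract integer timestamp.
events = [("p1", 100), ("p2", 150), ("p1", 250), ("p1", 320)]

# Assumption: the last batch run covered everything before this timestamp.
BATCH_CUTOFF = 200


def build_view(all_events, include):
    """Count events per key, keeping only timestamps accepted by `include`."""
    view = {}
    for key, ts in all_events:
        if include(ts):
            view[key] = view.get(key, 0) + 1
    return view


# Batch layer: complete but stale. Speed layer: fresh but covers recent data only.
batch_view = build_view(events, lambda ts: ts < BATCH_CUTOFF)
realtime_view = build_view(events, lambda ts: ts >= BATCH_CUTOFF)


def query(key):
    # Serving layer: combine both views so the answer is complete and fresh.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


print(query("p1"), query("p2"))
```

The merge is only this simple because counts are additive; the two views must also partition time without overlap, otherwise events are double counted.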

But the workflow is… broken
✤ Write batch-processing pipelines
✤ Then write stream-processing pipelines
✤ When logic needs to be updated? Update both!

✤ Twitter’s Summingbird
✤ Apache — Streaming
✤ Google Cloud Dataflow
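The common idea behind these tools is to write the logic once and run it in both modes. A rough sketch of that idea in plain Python follows; it is not the API of Summingbird or Cloud Dataflow, and `to_pair`, `combine`, `run_batch` and `StreamRunner` are hypothetical names. The assumption is that the logic can be expressed as a key/value extraction plus an associative combine.

```python
# The business logic, defined exactly once.
def to_pair(event):
    return (event["product_id"], event["quantity"])


def combine(a, b):
    # Must be associative so batch and stream agree on the result.
    return a + b


def run_batch(events):
    """Batch runner: fold over a complete, bounded dataset."""
    totals = {}
    for key, value in map(to_pair, events):
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals


class StreamRunner:
    """Stream runner: apply the same logic one event at a time."""

    def __init__(self):
        self.totals = {}

    def on_event(self, event):
        key, value = to_pair(event)
        self.totals[key] = (
            combine(self.totals[key], value) if key in self.totals else value
        )


events = [
    {"product_id": "p1", "quantity": 2},
    {"product_id": "p2", "quantity": 1},
    {"product_id": "p1", "quantity": 3},
]

batch_result = run_batch(events)
stream = StreamRunner()
for e in events:
    stream.on_event(e)
```

When the logic changes, only `to_pair` or `combine` is edited, and both runners pick up the change.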

Thank you
Siddhartha Reddy
sid@flipkart.com
