Future Patterns in Data Ecosystem

Slide 1

Slide 1 text

Future Patterns in Data Ecosystem
Siddhartha Reddy
Architect, Flipkart

Slide 2

Slide 2 text

Pattern #1: Data as a 1st class citizen in SDLC

Slide 3

Slide 3 text

Data availability for Analytics & Data Science
✤ Product teams develop applications
  ☛ store data in databases
  ☛ log information for debugging
✤ A central data team
  ☛ pulls data from the databases
  ☛ parses log files
  ☛ loads into a data warehouse

Slide 4

Slide 4 text

✤ This is broken
✤ No contract between product and data teams
  ☛ frequent breakages
✤ Data team spends most time
  ☛ fixing issues & catching up to changes
✤ Availability of data in logs
  ☛ is a matter of luck
✤ Tight coupling, no cohesiveness
✤ Does not scale
✤ Data is an afterthought

Slide 5

Slide 5 text

“Data is the only true IP of an internet company.” –Amod Malviya, CTO, Flipkart

Slide 7

Slide 7 text

Data availability for Analytics & Data Science
✤ Product teams develop applications
  ☛ store data in databases
  ☛ log information for debugging
  ☛ push data to a central data repository
✤ The central data team
  ☛ can take a nice long vacation
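One way to picture the contract that a push model makes possible is a schema the product team validates against before pushing. The sketch below is only an illustration: the schema, field names, and `validate` helper are hypothetical and are not taken from the talk.

```python
# Hypothetical contract for an "order placed" event; field names are assumptions.
ORDER_PLACED_SCHEMA = {
    "order_id": str,
    "amount_paise": int,
    "placed_at": str,
}


def validate(event, schema):
    """Return a list of contract violations; an empty list means the event conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            actual = type(event[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return errors


good_event = {"order_id": "OD123", "amount_paise": 49900, "placed_at": "2015-01-01T10:00:00Z"}
bad_event = {"order_id": 123, "placed_at": "2015-01-01T10:00:00Z"}

good_errors = validate(good_event, ORDER_PLACED_SCHEMA)
bad_errors = validate(bad_event, ORDER_PLACED_SCHEMA)
```

With a check like this on the producing side, a schema change breaks the product team's own build or push rather than the data team's downstream load.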

Slide 8

Slide 8 text

At Flipkart
✤ Dart: Central ingestion service
✤ Multiple modes of ingestion: HTTP service, daemon running alongside applications, bulk ingestion etc.
✤ Push, not pull
✤ Responsibility of ensuring ingestion pipeline is healthy is with product teams, not Dart team
✤ Audit: Ability to match ingested data with data in databases
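The audit idea above can be sketched as a set comparison between record identifiers in the source database and identifiers seen by the ingestion service. This is a minimal illustration under that assumption; `AuditReport` and `audit` are hypothetical names and do not describe how Dart implements auditing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditReport:
    missing: frozenset      # present in the database but never ingested
    unexpected: frozenset   # ingested but absent from the database


def audit(db_ids, ingested_ids):
    """Compare record ids from the source database with ids that were ingested."""
    db, ingested = frozenset(db_ids), frozenset(ingested_ids)
    return AuditReport(missing=db - ingested, unexpected=ingested - db)


report = audit(db_ids=[1, 2, 3, 4], ingested_ids=[1, 2, 4, 5])
```

A non-empty `missing` set would point at a gap in the pipeline that the owning product team needs to look at.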

Slide 10

Slide 10 text

Data as a 1st class citizen in SDLC
✤ Apache
✤ (Facebook)
✤ Apache

Slide 11

Slide 11 text

Pattern #2: Primary users will be machines, not humans

Slide 12

Slide 12 text

At Flipkart
✤ Analytics
✤ Systemic Consumption
✤ Recommendations / Personalisation
✤ Inventory Planning
✤ Fraud Detection
✤ Pricing Engine
✤ …

Slide 13

Slide 13 text

Pattern #3: Real-time processing takes centre stage

Slide 14

Slide 14 text

Real-time
✤ Daily (or weekly, or monthly)
✤ Hourly
✤ A few minutes
✤ A few seconds
✤ Instantaneous

Slide 15

Slide 15 text

✤ Apache
✤ Apache — Streaming
✤ Apache
✤ AWS
✤ Google Cloud Dataflow
✤ Druid

Slide 16

Slide 16 text

At Flipkart
✤ Challenges
✤ windowing
✤ out-of-order data
✤ out-of-phase streams
✤ state mutations in streams
✤ Storm based “recipes”
✤ Joins (symmetric, asymmetric)
✤ Aggregations
✤ Bootstrapping
✤ Multiple
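To make the windowing and out-of-order challenges concrete, here is a small sketch of a tumbling-window counter that tolerates late events up to a fixed lateness. It is a generic illustration in plain Python, not one of the Storm recipes from the talk; the class name and the `window_size` and `allowed_lateness` parameters are assumptions.

```python
class TumblingWindowCounter:
    """Counts events per fixed-size window, keyed by event time (integer seconds)."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")  # windows ending at or before this are final
        self.open_windows = {}          # window start -> running count
        self.closed = {}                # window start -> final count
        self.dropped = 0                # events that arrived after their window closed

    def add(self, event_time):
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark:
            self.dropped += 1           # too late: the window is already finalised
            return
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        # The watermark trails the largest event time seen by the allowed lateness.
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        ready = [s for s in self.open_windows if s + self.window_size <= self.watermark]
        for s in ready:
            self.closed[s] = self.open_windows.pop(s)


counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
for t in [1, 3, 12, 8, 27, 4]:   # 8 arrives out of order but in time; 4 arrives too late
    counter.add(t)
```

The trade-off is visible in `allowed_lateness`: a larger value admits more out-of-order events but delays when a window's count becomes final.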

Slide 17

Slide 17 text

Pattern #4: Convergence between Batch and Stream processing

Slide 18

Slide 18 text

✤ Batch Processing: Scale & power
✤ Stream Processing: Freshness

Slide 19

Slide 19 text

Lambda Architecture
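In the Lambda Architecture, a query is answered by combining a precomputed batch view with a real-time view that covers data the last batch run has not seen yet. The sketch below shows only that merge step for additive counts; the view contents and the `serve` function are hypothetical.

```python
def serve(key, batch_view, speed_view):
    """Answer a count query by merging the batch view with the real-time (speed) view."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Batch view: recomputed periodically over all historical data.
batch_view = {"product-1": 120, "product-2": 45}
# Speed view: incremental counts for events since the last batch run.
speed_view = {"product-1": 3, "product-3": 1}

total_p1 = serve("product-1", batch_view, speed_view)
total_p3 = serve("product-3", batch_view, speed_view)
total_p4 = serve("product-4", batch_view, speed_view)
```

Summing works here because counts are additive; other aggregates need a merge function suited to them.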

Slide 20

Slide 20 text

At Flipkart
✤ Batch Processing: Hadoop + Vertica
✤ Stream Processing: Storm + ElasticSearch
✤ Query: Apache

Slide 21

Slide 21 text

But the workflow is… broken
✤ Write batch-processing pipelines
✤ Then write stream-processing pipelines
✤ When logic needs to be updated? Update both!
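One way to soften the "update both" problem is to define the business logic once and have thin batch and stream runners call it. The sketch below assumes a simple per-category count of delivered orders; the event fields, `extract_key`, `run_batch`, and `StreamRunner` are all hypothetical and are not what Flipkart runs.

```python
from collections import Counter


def extract_key(event):
    """Business logic, defined once: count only delivered orders, by category."""
    return event["category"] if event["status"] == "delivered" else None


def run_batch(events):
    """Batch runner: process a complete dataset in one pass."""
    return Counter(k for k in map(extract_key, events) if k is not None)


class StreamRunner:
    """Stream runner: process events one at a time, keeping running counts."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        key = extract_key(event)
        if key is not None:
            self.counts[key] += 1


events = [
    {"category": "books", "status": "delivered"},
    {"category": "books", "status": "cancelled"},
    {"category": "mobiles", "status": "delivered"},
    {"category": "books", "status": "delivered"},
]

batch_result = run_batch(events)
stream = StreamRunner()
for e in events:
    stream.on_event(e)
```

A change to `extract_key` now reaches both paths at once, though the two runners themselves still have to be maintained separately.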