Future Patterns in Data Ecosystem

It is important for anyone building a software platform to have a good pulse on the patterns of use that the system would see; this is one of the most important aspects that determine the value it would generate. The world of Big Data, Analytics & Data Science is fast-moving, with the "in vogue" tools changing almost every year. While most of us have grown accustomed to adapting to the changes in tooling, what often catch us unawares are the paradigm shifts within the ecosystem. This talk will walk through some such emerging patterns in the world of Big Data and go over how they're manifested at Flipkart, which has been an early adopter of many of these patterns.

Siddhartha Reddy

August 07, 2015

Transcript

  1. Data availability for Analytics & Data Science
     ✤ Product teams develop applications
       ☛ store data in databases
       ☛ log information for debugging
     ✤ A central data team
       ☛ pulls data from the databases
       ☛ parses log files
       ☛ loads into a data warehouse
  2. This is broken
     ✤ No contract between product and data teams
       ☛ frequent breakages
     ✤ Data team spends most time
       ☛ fixing issues & catching up to changes
     ✤ Availability of data in logs
       ☛ is a matter of luck
     ✤ Tight coupling, no cohesiveness
     ✤ Does not scale
     ✤ Data is an afterthought
  3. Data availability for Analytics & Data Science
     ✤ Product teams develop applications
       ☛ store data in databases
       ☛ log information for debugging
       ☛ push data to a central data repository
     ✤ The central data team
       ☛ can take a nice long vacation
  4. At Flipkart
     ✤ Dart: Central ingestion service
     ✤ Multiple modes of ingestion: HTTP service, daemon running alongside applications, bulk ingestion etc.
     ✤ Push, not pull
     ✤ Responsibility of ensuring ingestion pipeline is healthy is with product teams, not Dart team
     ✤ Audit: Ability to match ingested data with data in databases
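
The push model and the audit step on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Dart's actual API: `IngestionService`, `required_fields`, `push` and `audit` are hypothetical names, and the contract is assumed to be a simple field-name-to-type mapping that always includes an `id` field.

```python
from dataclasses import dataclass, field


@dataclass
class IngestionService:
    """Hypothetical stand-in for a central ingestion service (not Dart's real API)."""
    required_fields: dict                      # the contract: field name -> expected type
    store: dict = field(default_factory=dict)  # ingested records keyed by record id

    def push(self, record: dict) -> None:
        # Producers push records; a record that violates the contract is
        # rejected at push time, so the breakage surfaces on the producer side.
        for name, expected_type in self.required_fields.items():
            if name not in record:
                raise ValueError(f"missing field: {name}")
            if not isinstance(record[name], expected_type):
                raise TypeError(f"field {name} should be {expected_type.__name__}")
        # Assumption: the contract always includes an "id" field.
        self.store[record["id"]] = record


def audit(source_rows: dict, ingested: dict) -> dict:
    """Compare source-of-truth rows with ingested records, both keyed by id."""
    missing = sorted(set(source_rows) - set(ingested))
    mismatched = sorted(
        key for key in set(source_rows) & set(ingested)
        if source_rows[key] != ingested[key]
    )
    return {"missing": missing, "mismatched": mismatched}
```

A product team would call `push` from its own code path, and a periodic job could run `audit` against a snapshot of the team's database to find records that never arrived or that differ from the source.
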
  5. At Flipkart
     ✤ Analytics
     ✤ Systemic Consumption
     ✤ Recommendations / Personalisation
     ✤ Inventory Planning
     ✤ Fraud Detection
     ✤ Pricing Engine
     ✤ …
  6. Real-time
     ✤ Daily (or weekly, or monthly)
     ✤ Hourly
     ✤ A few minutes
     ✤ A few seconds
     ✤ Instantaneous
  7. ✤ Apache ✤ Apache — Streaming ✤ Apache ✤ AWS

    ✤ Google Cloud Dataflow ✤ Druid
  8. At Flipkart
     ✤ Challenges
     ✤ windowing
     ✤ out-of-order data
     ✤ out-of-phase streams
     ✤ state mutations in streams
     ✤ Storm based “recipes”
     ✤ Joins (symmetric, asymmetric)
     ✤ Aggregations
     ✤ Bootstrapping
     ✤ Multiple
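
To make the windowing and out-of-order challenges concrete, here is a minimal sketch of an event-time tumbling window counter in plain Python. It is not one of the Storm "recipes" mentioned on the slide and does not use Storm's API; the class name, the integer timestamps and the `allowed_lateness` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window, tolerating bounded out-of-order arrival."""

    def __init__(self, window_size: int, allowed_lateness: int):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)  # window start -> running count
        self.max_event_time = None            # highest event time seen so far
        self.dropped = 0                      # events that arrived after their window closed

    def add(self, event_time: int) -> list:
        """Process one event; return the (window_start, count) pairs it finalises."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # The watermark trails the newest event by the allowed lateness.
        watermark = self.max_event_time - self.allowed_lateness

        window_start = event_time - event_time % self.window_size
        if window_start + self.window_size <= watermark:
            self.dropped += 1  # too late: this window was already finalised
        else:
            self.open_windows[window_start] += 1

        # Finalise every open window that ends at or before the watermark.
        closed = sorted(s for s in self.open_windows
                        if s + self.window_size <= watermark)
        return [(s, self.open_windows.pop(s)) for s in closed]
```

With `window_size=10` and `allowed_lateness=5`, an event stamped 3 that arrives after an event stamped 12 is still counted in window 0, while an event stamped 4 that arrives after an event stamped 27 is dropped. The trade-off is the usual one: a larger lateness bound admits more stragglers but delays results.
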
  9. At Flipkart
     ✤ Batch Processing: Hadoop + Vertica
     ✤ Stream Processing: Storm + ElasticSearch
     ✤ Query: Apache
  10. But the workflow is… broken
     ✤ Write batch-processing pipelines
     ✤ Then write stream-processing pipelines
     ✤ When logic needs to be updated?
       ☛ Update both!
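
The duplication this slide complains about can be illustrated with a small sketch. The business rule (only completed orders count towards revenue), the record fields and both function names are hypothetical; the point is only that the same rule has to live in two code paths.

```python
from collections import defaultdict


def batch_revenue(orders: list) -> dict:
    """Batch pipeline: recompute revenue per category from the full data set."""
    totals = defaultdict(float)
    for order in orders:
        if order["status"] == "completed":  # the business rule, copy 1
            totals[order["category"]] += order["amount"]
    return dict(totals)


class StreamingRevenue:
    """Stream pipeline: the same rule, re-implemented as incremental updates."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, order: dict) -> None:
        if order["status"] == "completed":  # the business rule, copy 2
            self.totals[order["category"]] += order["amount"]
```

If the rule changes, say to exclude returned orders, both copies must be edited in step; otherwise the batch and streaming numbers silently diverge.
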