Real time Data Pipeline example

Sam Bessalah

April 29, 2014

  1. DATA PIPELINES at SCALE Sam Bessalah @samklr

  2. Haz big data problem?

  3. Throw in some Hadoop …

  4. Then …

  5. In reality

  6. In reality
    • Volume keeps growing
    • Real-life environments are always complicated
    • Privacy, compliance, etc.
    • ETL is a pain, not always feasible
    • Data is always messy, incoherent, incomplete
    • E.g. dates: “Sat Mar 1 10:12:53 PST”, “2014-03-01 18:12:53 +00:00”, “1393697578”
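
To make the date example concrete, here is a minimal sketch that normalizes the three formats from the slide to epoch seconds. The function name `to_epoch`, the tiny `TZ_OFFSETS` table, and the `default_year` parameter are assumptions made for illustration (the first sample on the slide carries no year); a real pipeline would need a fuller zone table and a better source for the year.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a hand-written table of zone abbreviations, only large enough
# for the slide's example. Abbreviations such as "PST" are ambiguous in general.
TZ_OFFSETS = {
    "PST": timezone(timedelta(hours=-8)),
    "UTC": timezone.utc,
}


def to_epoch(raw, default_year=2014):
    """Normalize one of the three slide formats to integer epoch seconds."""
    raw = raw.strip()

    # Case 1: already epoch seconds, e.g. "1393697578".
    if raw.isdigit():
        return int(raw)

    # Case 2: ISO-like with numeric offset, e.g. "2014-03-01 18:12:53 +00:00".
    try:
        return int(datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z").timestamp())
    except ValueError:
        pass

    # Case 3: "Sat Mar 1 10:12:53 PST" (weekday, month, day, time, zone name).
    parts = raw.split()
    zone = TZ_OFFSETS[parts[-1]]       # KeyError for zones not in the table
    body = " ".join(parts[1:-1])       # drop weekday and zone: "Mar 1 10:12:53"
    naive = datetime.strptime(f"{default_year} {body}", "%Y %b %d %H:%M:%S")
    return int(naive.replace(tzinfo=zone).timestamp())
```

With the assumed year, the first two samples map to the same instant, which is the point of picking one canonical representation early in the pipeline.
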
  7. No silver bullet, but
    • Tackle the scalability problem upfront
    • Build a resilient, reliable data processing pipeline
    • Enforce auditing, verification, and testing of your system
  8. Logs Users $$$ ETL Batches

  9. Problems with this
    • Costly $$$
    • High latency
    • Mostly batch oriented
    • Hard to evolve
  10. Properties of an efficient pipeline:
    • Keep data close to the source
    • Work on hot data
    • Avoid sampling; instead summarize or hash
    • Determine a common format, logical and physical
    • Make access to the data easy for analysis
    • Let the business drive the questions
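
One possible reading of "summarize or hash" is sketched below: every event is folded into a small running summary in a single pass, and identifiers are replaced by a stable hash. The event shape `(user_id, latency_ms)` and the names `pseudonym`, `Summary`, and `summarize` are hypothetical, chosen only for this illustration.

```python
import hashlib
from collections import defaultdict


def pseudonym(user_id):
    """Stable, fixed-width stand-in for an identifier (truncated SHA-256).

    Note: an unsalted hash is only a placeholder here, not real anonymization.
    """
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]


class Summary:
    """Running count / total / min / max, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def summarize(events):
    """Fold every (user_id, latency_ms) event into a per-user summary."""
    summaries = defaultdict(Summary)
    for user_id, latency_ms in events:
        summaries[pseudonym(user_id)].add(latency_ms)
    return dict(summaries)
```

Unlike sampling, no event is skipped: each one contributes to its summary, and memory grows with the number of keys rather than the number of events.
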
  11. Logs Users $$$ ETL Batches

  12. An example architecture: App App App App

  13. KAFKA
    • High-throughput distributed messaging
    • Publish/subscribe model
    • Categorizes messages into topics
    • Persists messages to disk. Allows message retention for a specified amount of time
    • Can have multiple producers and consumers
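
The properties on this slide can be illustrated with a toy in-memory model. This is not Kafka's API and does not talk to a broker; `ToyLog` and `Consumer` are hypothetical names, and the `now` arguments exist only so the retention behaviour can be shown deterministically.

```python
import time
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.topics = defaultdict(list)      # topic -> [(offset, timestamp, message)]
        self.next_offset = defaultdict(int)  # topic -> next offset to assign

    def publish(self, topic, message, now=None):
        """Append a message to a topic and return its offset."""
        now = time.time() if now is None else now
        offset = self.next_offset[topic]
        self.topics[topic].append((offset, now, message))
        self.next_offset[topic] += 1
        return offset

    def expire(self, now=None):
        """Drop messages older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        for topic, entries in self.topics.items():
            self.topics[topic] = [e for e in entries if e[1] >= cutoff]

    def read(self, topic, from_offset):
        """Return (offset, message) pairs at or after from_offset."""
        return [(o, m) for o, _, m in self.topics[topic] if o >= from_offset]


class Consumer:
    """Each consumer tracks its own position, so consumers do not interfere."""

    def __init__(self, log, topic):
        self.log, self.topic, self.offset = log, topic, 0

    def poll(self):
        batch = self.log.read(self.topic, self.offset)
        if batch:
            self.offset = batch[-1][0] + 1
        return [m for _, m in batch]
```

Because messages stay in the log until retention expires them, a second consumer can read the same topic later without the producer doing anything extra.
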
  14. STORM
    • Distributed real-time data computation engine
    • Uses a graph of computations called topologies. E.g. we can run many topologies to prepare data and to run real-time machine learning models.
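
The idea of a topology as a graph of computations can be sketched with plain Python generators. This is not the Storm API: "spout" (a source) and "bolt" (a processing step) are Storm's terms, but the functions below are hypothetical and run in a single process, as a linear chain rather than a general graph.

```python
from collections import Counter


def sentence_spout(lines):
    """Source node: emits raw lines into the topology."""
    for line in lines:
        yield line


def split_bolt(stream):
    """Processing node: splits each line into lowercase words."""
    for line in stream:
        for word in line.lower().split():
            yield word


def count_bolt(stream):
    """Processing node: emits (word, running count) as words arrive."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(lines):
    """Wire spout -> split -> count and keep the latest count per word."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout(lines))):
        final[word] = count
    return final
```

Each node handles one tuple at a time and passes results downstream, so counts are available while data is still arriving rather than after a batch finishes.
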
  15. App App App App Offline Processing Real Time

  16. What breaks at scale
    • Serialisation: an important part of your pipeline. Take your pick: Protocol Buffers, Thrift, Avro. Stay away from schemaless JSON.
    • Compression: Snappy, LZO, …
    • Storage format: look at columnar storage formats like Parquet. Easier for OLAP operations.
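
A rough sketch of why a fixed schema and compression matter, using only the Python standard library. The assumptions here are significant: `struct` stands in for a schema-based format such as Protocol Buffers, Thrift, or Avro, `zlib` stands in for Snappy or LZO, and the three-field event layout is hypothetical.

```python
import json
import struct
import zlib

# Hypothetical schema: user_id (uint32), ts (uint32 epoch seconds),
# amount_cents (int32), little-endian, 12 bytes per record.
EVENT = struct.Struct("<IIi")


def encode_binary(event):
    """Pack one event according to the fixed schema."""
    return EVENT.pack(event["user_id"], event["ts"], event["amount_cents"])


def decode_binary(payload):
    """Unpack one 12-byte record back into a dict."""
    user_id, ts, amount_cents = EVENT.unpack(payload)
    return {"user_id": user_id, "ts": ts, "amount_cents": amount_cents}


def encode_json(event):
    """Schemaless alternative: field names are repeated in every record."""
    return json.dumps(event).encode("utf-8")


events = [
    {"user_id": i, "ts": 1393697578 + i, "amount_cents": 100 * i}
    for i in range(1000)
]
binary_blob = b"".join(encode_binary(e) for e in events)
json_blob = b"\n".join(encode_json(e) for e in events)

print("binary bytes:", len(binary_blob))
print("json bytes:", len(json_blob))
print("json bytes after zlib:", len(zlib.compress(json_blob)))
```

The schema-based encoding also rejects malformed records at write time: `EVENT.pack` raises `struct.error` when a field has the wrong type or is out of range, whereas `json.dumps` accepts whatever it is given.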
