Real time Data Pipeline example

Sam Bessalah

April 29, 2014

Transcript

  1. In reality
    • Volume keeps growing
    • Real-life environments are always complicated
    • Privacy, compliance, etc.
    • ETL is a pain, and not always feasible
    • Data is always messy, incoherent, incomplete
    • E.g. dates: “Sat Mar 1 10:12:53 PST”, “2014-03-01 18:12:53 +00:00”, “1393697578”
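
The three date strings on this slide can be brought to one representation with a small normalizer. This is a minimal sketch, not the author's code: the function name `normalize_timestamp`, the tiny `TZ_ABBREVIATIONS` table, and the `default_year` fallback (the first sample carries no year) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a hand-written abbreviation table; real data needs a fuller one.
TZ_ABBREVIATIONS = {
    "PST": timezone(timedelta(hours=-8)),
    "UTC": timezone.utc,
}


def normalize_timestamp(raw, default_year=2014):
    """Return a timezone-aware UTC datetime for the three sample formats."""
    text = raw.strip()

    # Format 1: Unix epoch seconds, e.g. "1393697578".
    if text.isdigit():
        return datetime.fromtimestamp(int(text), tz=timezone.utc)

    # Format 2: date, time and numeric offset, e.g. "2014-03-01 18:12:53 +00:00".
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")
        return parsed.astimezone(timezone.utc)
    except ValueError:
        pass

    # Format 3: "Sat Mar 1 10:12:53 PST". The year is missing from the sample,
    # so default_year is an assumption supplied by the caller.
    _day_name, month, day, clock, abbreviation = text.split()
    naive = datetime.strptime(
        f"{default_year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    aware = naive.replace(tzinfo=TZ_ABBREVIATIONS[abbreviation])
    return aware.astimezone(timezone.utc)
```

With `default_year=2014`, the first two samples resolve to 2014-03-01 18:12:53 UTC and the epoch value to 2014-03-01 18:12:58 UTC, so all three describe nearly the same instant in different clothing.
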
  2. No silver bullet, but
    • Tackle the scalability problem upfront
    • Build a resilient, reliable data processing pipeline
    • Enforce auditing, verification, testing of your system
  3. Problem with this
    • Costly $$$
    • High latency
    • Mostly batch oriented
    • Hard to evolve
  4. Properties of an efficient pipeline:
    • Keep data close to the source
    • Work on hot data
    • Avoid sampling; instead summarize or hash
    • Determine a common format, logical and physical
    • Make access to the data easy for analysis
    • Let the business drive the questions
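
One possible reading of "summarize or hash" is sketched below: every event contributes to a small running summary instead of being sampled away, and raw identifiers are replaced by a salted hash. The slide does not say which technique the author had in mind, so `hash_id`, `RunningSummary`, `summarize` and the salt value are hypothetical names chosen for illustration.

```python
import hashlib
from collections import defaultdict


def hash_id(user_id, salt="example-salt"):
    """Replace a raw identifier with a short, stable, salted hash."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return digest[:16]


class RunningSummary:
    """Constant-size summary that sees every value, so nothing is sampled out."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        return self.total / self.count


def summarize(events):
    """Fold (user_id, value) pairs into one summary per hashed user."""
    summaries = defaultdict(RunningSummary)
    for user_id, value in events:
        summaries[hash_id(user_id)].add(value)
    return dict(summaries)
```

The summary stays the same size no matter how many events arrive, which is what makes it workable on hot data close to the source.
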
  5. KAFKA
    • High-throughput distributed messaging
    • Publish/subscribe model
    • Categorizes messages into topics
    • Persists messages to disk. Allows message retention for a specified amount of time
    • Can have multiple producers and consumers
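
The ideas on this slide (topics, retained messages, several independent consumers) can be illustrated with a toy in-memory model. This is deliberately not Kafka's client API: `ToyBroker` and `ToyConsumer` are hypothetical classes, nothing is written to disk, and retention is triggered by hand.

```python
import time
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.logs = defaultdict(list)        # topic -> [(offset, timestamp, message)]
        self.next_offset = defaultdict(int)  # topic -> next offset to assign

    def publish(self, topic, message, now=None):
        now = time.time() if now is None else now
        offset = self.next_offset[topic]
        self.logs[topic].append((offset, now, message))
        self.next_offset[topic] += 1
        return offset

    def expire(self, now=None):
        """Drop messages older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        for topic in list(self.logs):
            self.logs[topic] = [e for e in self.logs[topic] if e[1] >= cutoff]

    def read(self, topic, from_offset):
        return [(o, m) for o, _, m in self.logs[topic] if o >= from_offset]


class ToyConsumer:
    """Each consumer tracks its own offset, so consumers do not interfere."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        records = self.broker.read(self.topic, self.offset)
        if records:
            self.offset = records[-1][0] + 1
        return [message for _, message in records]
```

Because messages stay in the log until they expire, a consumer that joins late can still read everything inside the retention window, and two consumers polling the same topic each receive the full stream.
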
  6. STORM
    • Distributed real-time data computation engine
    • Uses a graph of computations called topologies. E.g. we can run many topologies to prepare data and run real-time machine learning models.
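
A "graph of computations" can be pictured as named processing steps wired together, with each item flowing through the graph as soon as it arrives. The sketch below is a single-process toy, not Storm's API: `ToyTopology` and `add_bolt` are hypothetical, and the words "spout" and "bolt" are borrowed from Storm's vocabulary only to label the source and the processing steps.

```python
class ToyTopology:
    """Single-process toy: named steps (bolts) connected in a directed graph."""

    def __init__(self):
        self.bolts = {}       # bolt name -> function(item) -> iterable of results
        self.downstream = {}  # source name -> names of bolts fed by it

    def add_bolt(self, name, func, inputs=()):
        self.bolts[name] = func
        self.downstream.setdefault(name, [])
        for source in inputs:
            self.downstream.setdefault(source, []).append(name)

    def run(self, spout_name, stream):
        """Push each item from the source through the graph, one at a time."""
        outputs = {name: [] for name in self.bolts}
        for item in stream:
            self._emit(spout_name, item, outputs)
        return outputs

    def _emit(self, source, item, outputs):
        for name in self.downstream.get(source, []):
            for result in self.bolts[name](item):
                outputs[name].append(result)
                self._emit(name, result, outputs)


# Usage: a word count wired as sentences -> split -> count.
counts = {}


def split_words(sentence):
    return sentence.split()


def count_word(word):
    counts[word] = counts.get(word, 0) + 1
    return [(word, counts[word])]


topology = ToyTopology()
topology.add_bolt("split", split_words, inputs=["sentences"])
topology.add_bolt("count", count_word, inputs=["split"])
outputs = topology.run("sentences", ["a b", "b c"])
```

The running counts are updated per item rather than per batch, which is the property that separates this style from the batch-oriented pipelines criticised earlier in the deck.
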
  7. What breaks at scale
    • Serialisation: an important part of your pipeline. Take your pick: Protocol Buffers, Thrift, Avro. Stay away from schemaless JSON.
    • Compression: Snappy, LZO, …
    • Storage format: look at columnar storage formats like Parquet. Easier for OLAP operations.
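
The serialisation point can be made concrete by encoding the same records twice, once as schemaless JSON and once against a fixed schema. This is a sketch using only the Python standard library: `struct` stands in for a schema-based format such as Protocol Buffers, Thrift or Avro, `zlib` stands in for Snappy or LZO, and the record layout (`user_id`, `timestamp`, `amount`) is a made-up example.

```python
import json
import struct
import zlib

# Hypothetical schema: user_id (uint32), timestamp (uint32), amount (float64).
FIELDS = ("user_id", "timestamp", "amount")
RECORD = struct.Struct("<IId")  # 4 + 4 + 8 = 16 bytes per record


def encode_with_schema(records):
    """Field names live in the schema, so only the values are written."""
    return b"".join(RECORD.pack(*(r[name] for name in FIELDS)) for r in records)


def decode_with_schema(payload):
    return [dict(zip(FIELDS, values)) for values in RECORD.iter_unpack(payload)]


def encode_as_json(records):
    """Schemaless: every record repeats every field name as text."""
    return json.dumps(records).encode("utf-8")


records = [
    {"user_id": i, "timestamp": 1393697578 + i, "amount": i * 1.5}
    for i in range(1000)
]
binary_payload = encode_with_schema(records)
json_payload = encode_as_json(records)

# Compression applies to either encoding and round-trips losslessly.
compressed_payload = zlib.compress(binary_payload)
```

The schema-based payload is a fixed 16 bytes per record, while the JSON payload repeats the three field names in each one. A reader of the binary payload also knows the type of every field in advance, which schemaless JSON cannot promise.
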