
Real-time Data Pipeline example

Sam Bessalah

April 29, 2014

Transcript

  1. DATA PIPELINES at SCALE
    Sam Bessalah
    @samklr

  2. Haz big data problem?

  3. Throw in some Hadoop …

  4. Then …

  5. In reality

  6. In reality
    • Volume keeps growing
    • Real-life environments are always complicated
    • Privacy, compliance, etc.
    • ETL is a pain, and not always feasible
    • Data is always messy, incoherent, incomplete
    • E.g. the same date may arrive as (see the normalization sketch below):
    “Sat Mar 1 10:12:53 PST”
    “2014-03-01 18:12:53 +00:00”
    “1393697578”
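    A minimal Java sketch (not from the original deck) of normalizing such inconsistent
    timestamps to epoch seconds. The class name and patterns are assumptions for
    illustration, and the first pattern assumes the truncated “Sat Mar 1 …” example
    also carries a year:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Locale;

    public class TimestampNormalizer {
        // "Sat Mar 1 10:12:53 PST 2014" -- java.util.Date.toString() style (year assumed)
        private static final SimpleDateFormat TEXT_STYLE =
                new SimpleDateFormat("EEE MMM d HH:mm:ss zzz yyyy", Locale.US);
        // "2014-03-01 18:12:53 +00:00" -- ISO-like date with a zone offset
        private static final SimpleDateFormat ISO_STYLE =
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss XXX", Locale.US);

        public static long toEpochSeconds(String raw) {
            String s = raw.trim();
            if (s.matches("\\d+")) {                 // already epoch seconds: "1393697578"
                return Long.parseLong(s);
            }
            for (SimpleDateFormat fmt : new SimpleDateFormat[]{TEXT_STYLE, ISO_STYLE}) {
                try {
                    return fmt.parse(s).getTime() / 1000L;
                } catch (ParseException ignored) { /* try the next pattern */ }
            }
            throw new IllegalArgumentException("Unrecognized timestamp: " + raw);
        }
    }

    Note that SimpleDateFormat is not thread-safe, so in a real pipeline each worker
    thread would need its own formatter instances.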

  7. No silver bullet, but
    • Tackle the scalability problem upfront
    • Build a resilient, reliable data processing pipeline
    • Enforce auditing, verification, testing of your system

  8. [Diagram: Logs, Users, $$$ → ETL → Batches]

  9. Problem with this
    • Costly $$$
    • High latency
    • Mostly batch-oriented
    • Hard to evolve

  10. Properties of an efficient pipeline:
    • Keep data close to the source
    • Work on hot data
    • Avoid sampling; instead, summarize or hash
    • Determine a common format, both logical and physical
    • Make access to the data easy for analysis
    • Let the business drive the questions

  11. [Diagram: Logs, Users, $$$ → ETL → Batches, revisited]

  12. An example of architecture
    [Diagram: multiple App instances as data sources]

  13. KAFKA
    • High-throughput distributed messaging
    • Publish/subscribe model
    • Categorizes messages into topics
    • Persists messages to disk, allowing message retention for a specified amount of time
    • Can have multiple producers and consumers (see the producer sketch below)
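    As a rough illustration of the producer side (not in the original deck), a minimal
    Java producer using the Kafka client API; the topic name "app-events", the key/value
    payloads, and the broker address are assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // Messages go to a topic; any number of consumers can subscribe to it.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("app-events", "user-42", "page_view"));
            }
        }
    }

    Retention is a broker/topic-level setting (e.g. log.retention.hours), which is what
    lets consumers replay messages within the configured window.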

  14. STORM
    • Distributed real-time data computation engine
    • Structures computation as a graph called a topology. E.g. we can run many
    topologies to prepare data and to run real-time machine learning models
    (a minimal topology sketch follows).
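    A minimal topology sketch (not in the original deck), wiring one spout to one bolt
    with the classic backtype.storm API of that era. The spout is a made-up stand-in for
    a Kafka spout, and the sample data and class names are assumptions:

    import java.util.Map;
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Hypothetical spout standing in for a Kafka spout: emits raw log lines.
    class RawEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // A real pipeline would pull from Kafka here; we emit a fixed sample line.
            collector.emit(new Values("user-42 page_view 1393697578"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // One "data preparation" bolt: split the raw line into typed fields.
    class ParseBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String[] parts = input.getStringByField("line").split(" ");
            collector.emit(new Values(parts[0], parts[1], Long.parseLong(parts[2])));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("user", "event", "ts"));
        }
    }

    public class PipelineTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("events", new RawEventSpout(), 1);
            builder.setBolt("parse", new ParseBolt(), 2).shuffleGrouping("events");

            // Local run; StormSubmitter.submitTopology(...) would deploy it to a cluster.
            new LocalCluster().submitTopology("pipeline", new Config(), builder.createTopology());
        }
    }

    On a real cluster the spout would typically be the storm-kafka KafkaSpout reading the
    topic from the previous slide, and further bolts (or separate topologies) would host
    the real-time model scoring.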

  15. [Diagram: Apps feeding both an offline processing path and a real-time path]

  16. What breaks at scale
    • Serialisation: an important part of your pipeline.
    Take your pick: Protocol Buffers, Thrift, Avro.
    Stay away from schemaless JSON (see the Avro sketch below).
    • Compression: Snappy, LZO, …
    • Storage format: look at columnar storage formats like
    Parquet. Easier for OLAP operations.
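    To make the serialization point concrete, a small Avro sketch (not in the original
    deck); the record layout and field names are assumptions, but the point is that every
    message carries an explicit, evolvable schema instead of ad-hoc JSON:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class EventSchema {
        public static void main(String[] args) {
            // Explicit schema that producers and consumers agree on up front.
            Schema schema = SchemaBuilder.record("Event").namespace("example.pipeline")
                    .fields()
                    .requiredString("user")
                    .requiredString("event")
                    .requiredLong("ts")          // epoch seconds, normalized upstream
                    .endRecord();

            GenericRecord record = new GenericData.Record(schema);
            record.put("user", "user-42");
            record.put("event", "page_view");
            record.put("ts", 1393697578L);

            System.out.println(schema.toString(true));  // the JSON form Avro ships with the data
        }
    }

    The same schema can also drive the columnar path, e.g. writing the events as Parquet
    through parquet-avro for the OLAP-style queries mentioned above.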

  17. • B
