
Real-time Data Pipeline example

Sam Bessalah

April 29, 2014

Transcript

  1. DATA PIPELINES at SCALE
    Sam Bessalah
    @samklr

  2. Haz big data problem?

  3. Throw in some Hadoop …

  4. Then …

  5. In reality

  6. In reality
    • Volume keeps growing
    • Real-life environments are always complicated
    • Privacy, compliance, etc.
    • ETL is a pain, and not always feasible
    • Data is always messy, incoherent, incomplete
    • E.g. the same date may arrive as (see the normalization sketch below):
    “Sat Mar 1 10:12:53 PST”
    “2014-03-01 18:12:53 +00:00”
    “1393697578”
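    A minimal Java sketch (not from the original deck) of normalizing such inconsistent
    timestamps to epoch seconds. The class name and patterns are assumptions for
    illustration, and the first pattern assumes the truncated “Sat Mar 1 …” example
    also carries a year:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Locale;

    public class TimestampNormalizer {
        // "Sat Mar 1 10:12:53 PST 2014" -- java.util.Date.toString() style (year assumed)
        private static final SimpleDateFormat TEXT_STYLE =
                new SimpleDateFormat("EEE MMM d HH:mm:ss zzz yyyy", Locale.US);
        // "2014-03-01 18:12:53 +00:00" -- ISO-like date with a zone offset
        private static final SimpleDateFormat ISO_STYLE =
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss XXX", Locale.US);

        public static long toEpochSeconds(String raw) {
            String s = raw.trim();
            if (s.matches("\\d+")) {                 // already epoch seconds: "1393697578"
                return Long.parseLong(s);
            }
            for (SimpleDateFormat fmt : new SimpleDateFormat[]{TEXT_STYLE, ISO_STYLE}) {
                try {
                    return fmt.parse(s).getTime() / 1000L;
                } catch (ParseException ignored) { /* try the next pattern */ }
            }
            throw new IllegalArgumentException("Unrecognized timestamp: " + raw);
        }
    }

    Note that SimpleDateFormat is not thread-safe, so in a real pipeline each worker
    thread would need its own formatter instances.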

  7. No silver bullet, but
    • Tackle the scalability problem upfront
    • Build a resilient, reliable data processing pipeline
    • Enforce auditing, verification, testing of your system

  8. [Diagram: Logs, Users, $$$ → ETL → Batches]

  9. Problem with this
    • Costly $$$
    • High latency
    • Mostly batch-oriented
    • Hard to evolve

  10. Properties of an efficient pipeline:
    • Keep data close to the source
    • Work on hot data
    • Avoid sampling; instead, summarize or hash
    • Determine a common format, both logical and physical
    • Make access to the data easy for analysis
    • Let the business drive the questions

  11. [Diagram: Logs, Users, $$$ → ETL → Batches, revisited]

  12. An example of architecture
    [Diagram: multiple App instances as data sources]

  13. KAFKA
    • High-throughput distributed messaging
    • Publish/subscribe model
    • Categorizes messages into topics
    • Persists messages to disk, allowing message retention for a specified amount of time
    • Can have multiple producers and consumers (see the producer sketch below)
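    As a rough illustration of the producer side (not in the original deck), a minimal
    Java producer using the Kafka client API; the topic name "app-events", the key/value
    payloads, and the broker address are assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // Messages go to a topic; any number of consumers can subscribe to it.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("app-events", "user-42", "page_view"));
            }
        }
    }

    Retention is a broker/topic-level setting (e.g. log.retention.hours), which is what
    lets consumers replay messages within the configured window.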

  14. STORM
    • Distributed real-time data computation engine
    • Structures computation as a graph called a topology. E.g. we can run many
    topologies to prepare data and to run real-time machine learning models
    (a minimal topology sketch follows).
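    A minimal topology sketch (not in the original deck), wiring one spout to one bolt
    with the classic backtype.storm API of that era. The spout is a made-up stand-in for
    a Kafka spout, and the sample data and class names are assumptions:

    import java.util.Map;
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Hypothetical spout standing in for a Kafka spout: emits raw log lines.
    class RawEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // A real pipeline would pull from Kafka here; we emit a fixed sample line.
            collector.emit(new Values("user-42 page_view 1393697578"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // One "data preparation" bolt: split the raw line into typed fields.
    class ParseBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String[] parts = input.getStringByField("line").split(" ");
            collector.emit(new Values(parts[0], parts[1], Long.parseLong(parts[2])));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("user", "event", "ts"));
        }
    }

    public class PipelineTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("events", new RawEventSpout(), 1);
            builder.setBolt("parse", new ParseBolt(), 2).shuffleGrouping("events");

            // Local run; StormSubmitter.submitTopology(...) would deploy it to a cluster.
            new LocalCluster().submitTopology("pipeline", new Config(), builder.createTopology());
        }
    }

    On a real cluster the spout would typically be the storm-kafka KafkaSpout reading the
    topic from the previous slide, and further bolts (or separate topologies) would host
    the real-time model scoring.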

  15. [Diagram: Apps feeding both an offline processing path and a real-time path]

  16. What breaks at scale
    • Serialisation: an important part of your pipeline.
    Take your pick: Protocol Buffers, Thrift, Avro.
    Stay away from schemaless JSON (see the Avro sketch below).
    • Compression: Snappy, LZO, …
    • Storage format: look at columnar storage formats like
    Parquet. Easier for OLAP operations.
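    To make the serialization point concrete, a small Avro sketch (not in the original
    deck); the record layout and field names are assumptions, but the point is that every
    message carries an explicit, evolvable schema instead of ad-hoc JSON:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class EventSchema {
        public static void main(String[] args) {
            // Explicit schema that producers and consumers agree on up front.
            Schema schema = SchemaBuilder.record("Event").namespace("example.pipeline")
                    .fields()
                    .requiredString("user")
                    .requiredString("event")
                    .requiredLong("ts")          // epoch seconds, normalized upstream
                    .endRecord();

            GenericRecord record = new GenericData.Record(schema);
            record.put("user", "user-42");
            record.put("event", "page_view");
            record.put("ts", 1393697578L);

            System.out.println(schema.toString(true));  // the JSON form Avro ships with the data
        }
    }

    The same schema can also drive the columnar path, e.g. writing the events as Parquet
    through parquet-avro for the OLAP-style queries mentioned above.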

  17. • B
