Slide 1

DATA PIPELINES at SCALE Sam Bessalah @samklr

Slide 2

Haz big data problem?

Slide 3

Throw in some Hadoop …

Slide 4

Then …

Slide 5

In reality

Slide 6

In reality:
• Volume keeps growing
• Real-life environments are always complicated
• Privacy, compliance, etc.
• ETL is a pain, and not always feasible
• Data is always messy, incoherent, incomplete
• E.g. dates: "Sat Mar 1 10:12:53 PST", "2014-03-01 18:12:53 +00:00", "1393697578" (see the sketch after this list)
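As an illustration of normalizing such values, here is a minimal sketch (not from the talk) that maps two of the shapes above, an offset timestamp and epoch seconds, onto a single java.time.Instant. The class and helper names are made up; the syslog-style form ("Sat Mar 1 …") carries no year, so it would additionally need a default year before it could be parsed.

import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class Timestamps {
    // Matches strings like "2014-03-01 18:12:53 +00:00"
    private static final DateTimeFormatter OFFSET_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss XXX");

    // Hypothetical helper: normalize a raw timestamp string to an Instant.
    static Instant normalize(String raw) {
        if (raw.matches("\\d+")) {                      // bare digits -> epoch seconds
            return Instant.ofEpochSecond(Long.parseLong(raw));
        }
        return OffsetDateTime.parse(raw, OFFSET_FORMAT).toInstant();
    }

    public static void main(String[] args) {
        System.out.println(normalize("1393697578"));
        System.out.println(normalize("2014-03-01 18:12:53 +00:00"));
    }
}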

Slide 7

No silver bullet, but:
• Tackle the scalability problem upfront
• Build a resilient, reliable data processing pipeline
• Enforce auditing, verification, and testing of your system

Slide 8

[Diagram: Logs, Users, $$$ → ETL → Batches]

Slide 9

Problems with this:
• Costly ($$$)
• High latency
• Mostly batch oriented
• Hard to evolve

Slide 10

Properties of an efficient pipeline:
• Keep data close to the source
• Work on hot data
• Avoid sampling; instead summarize or hash
• Determine a common format, both logical and physical
• Make access to the data easy for analysis
• Let the business drive the questions

Slide 11

[Diagram: Logs, Users, $$$ → ETL → Batches]

Slide 12

An example architecture
[Diagram: App, App, App, App]

Slide 13

KAFKA
• High-throughput distributed messaging
• Publish/subscribe model
• Categorizes messages into topics
• Persists messages to disk; allows message retention for a specified amount of time
• Can have multiple producers and consumers
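To make the producer side concrete, here is a minimal sketch (not from the talk) that publishes one event with the Kafka Java producer client. The broker address, the "app-events" topic, and the key/value payload are all illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The message is appended to the "app-events" topic and retained on disk
            // for the configured retention period, so several consumers can read or replay it.
            producer.send(new ProducerRecord<>("app-events", "user-42", "page_view"));
        }
    }
}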

Slide 14

STORM
• Distributed real-time data computation engine
• Structures computations as graphs called topologies
• E.g. we can run many topologies to prepare data and run real-time machine learning models
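To make the topology idea concrete, here is a minimal sketch (not from the talk) of a Storm topology in Java, run in local mode: a built-in test spout feeds a bolt that prepares each tuple. The class and component names ("PipelineTopology", "PrepareBolt", "words", "prepare") and the trivial uppercasing logic are illustrative; a real pipeline would typically read from Kafka and feed model-scoring bolts.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class PipelineTopology {

    // Bolt that "prepares" each tuple; here it just uppercases the incoming word.
    public static class PrepareBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("word").toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("prepared"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);                         // data source
        builder.setBolt("prepare", new PrepareBolt(), 4).shuffleGrouping("words"); // preparation step

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo-pipeline", new Config(), builder.createTopology());
        Thread.sleep(10_000);   // let the topology run briefly in local mode
        cluster.shutdown();
    }
}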

Slide 15

[Diagram: App, App, App, App → Offline Processing / Real Time]

Slide 16

What breaks at scale:
• Serialization: an important part of your pipeline. Take your pick: Protocol Buffers, Thrift, Avro. Stay away from schemaless JSON (see the Avro sketch after this list).
• Compression: Snappy, LZO, …
• Storage format: look at columnar storage formats like Parquet. Easier for OLAP operations.
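As a concrete contrast to schemaless JSON, here is a minimal sketch (not from the talk) that binary-serializes a record with Avro's generic Java API. The "PageView" schema and its fields are made up for illustration.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    // Illustrative schema: every message must conform to it, unlike free-form JSON.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
          + "{\"name\":\"user\",\"type\":\"string\"},"
          + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord view = new GenericData.Record(schema);
        view.put("user", "user-42");
        view.put("timestamp", 1393697578L);

        // Binary-encode the record; the schema is shared across the pipeline
        // instead of being guessed from each message.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(view, encoder);
        encoder.flush();

        System.out.println("Serialized " + out.size() + " bytes");
    }
}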

Slide 17

• B