
DATA PIPELINE AND KAFKA

I think Kafka is the Unix pipe of the 21st century. I will explain why and give some examples of common patterns used at my company (ContentSquare).

Raphael Mazelier

November 18, 2019


Transcript

  1. DATA PIPELINE : ME, MYSELF AND I && CONTENTSQUARE
    • ut0mt8 in 2019: a third boy, more white hair and a new job (again)
  2. DATA PIPELINE : A SIMPLE ONE (really)
    $ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -n 5
    access.log (input):
    46.229.168.133 - - [18/Oct/2019:12:22:24 +0200] "GET /myhomepage.php HTTP/1.1" 200 11544 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
    185.246.210.153 - - [18/Oct/2019:12:23:39 +0200] "GET /css/beautiful-color.css HTTP/1.0" 200 25003 "http:/www.mysupersite.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    output (top 5 requested paths):
    4189 /favicon.ico
    3631 /myhomepage.php
    2124 /category/electric-something/
    1369 /products/electric-showers/deadmachine101.php
     915 /css/beautiful-color.css
  3. DATA PIPELINE : UNIX PHILOSOPHY → COMPOSABILITY
    • Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.”
    • Expect the output of every program to become the input to another, as yet unknown, program.
  4. DATA PIPELINE : PIPE AND DATA FORMAT
    Good:
    - Composability / do one thing well
    - Streams
    - Simple, powerful interface
    Problems:
    - Single machine only
    - One-to-one communication only
    - Input parsing, output escaping
    - No fault tolerance
    The simplest possible interface:
    • ordered sequence of bytes
    • maybe with EOF
    • often ASCII
    • \n = record separator
    • [ \t] = field separator
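    For illustration, a minimal sketch of a program that respects this interface, playing the awk stage of slide 2 in Python; the field index and file name are assumptions, not from the talk:

    #!/usr/bin/env python3
    # print-field.py - a well-behaved pipe citizen: reads '\n'-separated records
    # on stdin, splits fields on whitespace, writes one record per line on stdout
    import sys

    FIELD = 6  # 0-based index of the request path in a combined access log line

    for line in sys.stdin:          # '\n' = record separator
        fields = line.split()       # [ \t] = field separator
        if len(fields) > FIELD:
            print(fields[FIELD])    # output becomes the input of the next program

    # usage: ./print-field.py < access.log | sort | uniq -c | sort -rn | head -n 5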
  5. DATA PIPELINE : KAFKA, PIPE FOR THE 21st CENTURY?
    Kafka:
    - Messages
    - Durable
    - Buffering
    - Multi-subscriber pub/sub pattern
    - Distributed processing
    - Replication, auto recovery
    - Schema management & encoding
    vs Unix pipes:
    - Byte stream
    - In-memory
    - Blocking / backpressure
    - One to one
    - Single machine only
    - No fault tolerance
    - Input parsing / output escaping
    > otherwise quite similar :)
    Fun fact: Command-line Tools can be 235x Faster than your Hadoop Cluster
    https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
  6. DATA PIPELINE : (FAST REWIND) WHAT IS KAFKA?
    • Kafka is a distributed streaming platform used to publish and subscribe to streams of records
    • Kafka is most often used for streaming data in real time into other systems
    • Why Kafka?
      ◦ Kafka has operational simplicity
      ◦ Kafka is polyglot
      ◦ Kafka is scalable
      ◦ Kafka has record retention
      ◦ Kafka handles back pressure
      ◦ and last but not least, Kafka is fast! (zero copy)
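    As a taste of the publish/subscribe API, a minimal round trip with the kafka-python client; broker address, topic name and payload are placeholders, not ContentSquare's:

    # producer side: publish a record to a topic (kafka-python client)
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("pageviews", value=b'{"url": "/myhomepage.php", "status": 200}')
    producer.flush()   # make sure the record really left the client buffer

    # consumer side: subscribe to the topic and read the stream of records
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("pageviews",
                             bootstrap_servers="localhost:9092",
                             group_id="demo",
                             auto_offset_reset="earliest")
    for record in consumer:
        print(record.partition, record.offset, record.value)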
  7. DATA PIPELINE : KAFKA CONCEPTS
    Producers – consume the incoming data feed and send it to Kafka for distribution to consumers.
    Consumers – applications that subscribe to topics and read the data from Kafka.
    Brokers – workers that take data from the producers and serve it to the consumers. They handle replication as well.
    Topics – categories for messages.
    Partitions – the physical divisions of a topic. They are used for redundancy and are spread over different storage servers.
    Zookeeper – Apache Zookeeper coordinates services in distributed systems.
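    To make topics, partitions and replication concrete, a hedged sketch creating a topic with kafka-python's admin client; the topic name and the counts are illustrative, not the talk's settings:

    # create a topic split into several partitions, each replicated across brokers
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(name="sessions",
                 num_partitions=6,       # unit of parallelism, spread over brokers
                 replication_factor=3)   # each partition kept on 3 brokers
    ])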
  8. DATA PIPELINE : KAFKA@ContentSquare
    Numbers
    • 12 environments (prod / staging / X feature-env)
    • 2 main regions (EU / US)
    • ~ 20 Kafka clusters / ~ 80 instances / brokers
    • ~ 10 Zookeeper clusters / ~ 30 Zookeeper instances
    • ~ 5 TB of data collected per day
    • ~ 120k msg/s at peak for our first entry-point cluster
    Topology
    • Kafka for data-eng: streaming mode, intensive use, high retention
      ◦ a typical cluster is composed of 12 brokers (8 cores, 32 GB RAM, 2 TB disks)
    • Kafka for app: pub/sub, event based
  9. DATA PIPELINE : KAFKA OPERATION@ContentSquare
    Monitor everything
    • Monitor brokers / topics / partitions / Zookeeper
    • Alert on lag, producer rate, consumer rate, staleness
      ◦ => Prometheus / Alertmanager / PD
    • Lag is still problematic to monitor (the existing exporters are all broken :/)
      ◦ => so we wrote our own exporter (https://github.com/ut0mt8/yakle)
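    Consumer lag is essentially a partition's latest offset minus the group's committed offset; a rough sketch of that check with kafka-python (group and topic names are placeholders, and a real exporter such as yakle does much more):

    # rough consumer-lag check: log-end offset minus committed offset, per partition
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             group_id="my-consumer-group",   # group to inspect
                             enable_auto_commit=False)

    partitions = [TopicPartition("pageviews", p)
                  for p in consumer.partitions_for_topic("pageviews")]
    end_offsets = consumer.end_offsets(partitions)           # latest offset per partition

    for tp in partitions:
        committed = consumer.committed(tp) or 0              # None if never committed
        print(f"partition={tp.partition} lag={end_offsets[tp] - committed}")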
  10. DATA PIPELINE : TOPICS SIZING
    https://medium.com/contentsquare-engineering-blog/kafka-topics-sizing-how-much-messages-do-i-store-9b3d904a053e
    • retention calculation is hard
      ◦ size-based retention vs
      ◦ time-based retention
    • we made some spreadsheet magic to forecast
    • and then you need to check it!
      ◦ we wrote a custom tool to find the oldest message in a topic (https://github.com/ut0mt8/kafka-oldest-message)
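    The spreadsheet magic boils down to simple arithmetic: at a given ingest rate and message size, how many bytes must each partition keep to cover the time-based retention? A back-of-the-envelope sketch, with every number made up for the example:

    # back-of-the-envelope topic sizing: does size-based retention cover the
    # time-based retention we actually want? (all inputs are examples)
    msg_per_sec     = 120_000      # peak ingest rate for the topic
    avg_msg_bytes   = 1_000        # average serialized message size
    partitions      = 12
    replication     = 3
    retention_hours = 24           # time-based retention target

    bytes_per_partition = msg_per_sec * avg_msg_bytes * 3600 * retention_hours / partitions
    total_on_disk = bytes_per_partition * partitions * replication

    print(f"per partition : {bytes_per_partition / 1e9:.1f} GB")
    print(f"whole cluster : {total_on_disk / 1e12:.2f} TB")
    # compare bytes_per_partition against the topic's retention.bytes setting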
  11. DATA PIPELINE : (KAFKA PATTERN) BASIC PATTERNS
    - single producer / single partition / single consumer
      => just to decouple and handle back pressure
    - single producer / single partition / multiple consumers
      => publish/subscribe, the common pattern
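    The publish/subscribe variant relies on consumer groups: each group keeps its own offsets, so every group receives the full stream. A minimal sketch, assuming kafka-python and placeholder topic/group names:

    # two independent subscribers to the same topic: different consumer groups,
    # so each one sees every message (run them as two separate processes)
    from kafka import KafkaConsumer

    def subscribe(group_name):
        return KafkaConsumer("events",
                             bootstrap_servers="localhost:9092",
                             group_id=group_name,
                             auto_offset_reset="earliest")

    billing = subscribe("billing")       # gets the full stream
    analytics = subscribe("analytics")   # gets the full stream too, independently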
  12. DATA PIPELINE : (KAFKA PATTERN) TEE BACKUP
    After a transformation of the data, send it to a Kafka topic.
    This topic is read twice (or more):
    - by the next data processor
    - by something that writes a “backup” of the data (to S3, for example)
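    The “tee” branch is simply another consumer group on the same topic; a sketch where the backup side appends raw records to a local file as a stand-in for the real S3 upload (topic, group and path are assumptions):

    # backup branch of the tee: a dedicated consumer group archives every raw record
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("transformed-data",
                             bootstrap_servers="localhost:9092",
                             group_id="backup-writer",   # independent from the processor group
                             auto_offset_reset="earliest")

    with open("/var/backup/transformed-data.jsonl", "ab") as archive:
        for record in consumer:
            archive.write(record.value + b"\n")          # the real thing batches and ships to S3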
  13. DATA PIPELINE : (KAFKA PATTERN) ENRICHMENT
    Read an event from a topic, then enrich this event with some external data and re-post it to the same topic.
    The aggregator reads both events, but as it performs an aggregation operation this doesn't matter.
    This allows us to enrich data without putting a component in between => better resilience.
    The message/event must be idempotent.
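    A hedged sketch of that loop: consume the topic, skip events that are already enriched, attach the external data and re-post the result to the same topic (the enriched flag, topic name and lookup function are illustrative):

    # enrichment loop: read an event, add external data, re-post it to the same topic
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("sessions",
                             bootstrap_servers="localhost:9092",
                             group_id="enricher")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def lookup_geo(ip):
        """Placeholder for the external lookup (geo database, CRM, ...)."""
        return {"country": "FR"}

    for record in consumer:
        event = json.loads(record.value)
        if event.get("enriched"):            # second pass of the same event: nothing to do
            continue
        event.update(lookup_geo(event["ip"]))
        event["enriched"] = True
        producer.send("sessions", json.dumps(event).encode())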
  14. DATA PIPELINE : (KAFKA PATTERN) CUSTOM PARTITIONING
    The producer computes the sharding key, based on something unique (for example a session key, so each session lands in only one partition).
    It then contacts Kafka, which tells it where to insert / which partition.
    Consumers choose a partition at start (and periodically).
    => again, pro tip: choose your number of partitions wisely
    => allows us to scale (scale up producers / partitions / brokers / consumers)
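    With most clients the default partitioner hashes the message key, so using the session key as the Kafka key is enough to pin every event of a session to one partition; a minimal sketch with kafka-python (topic and field names are assumptions):

    # key-based partitioning: all events of one session carry the same key,
    # so the default partitioner (a hash of the key) routes them to one partition
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def publish(event):
        session_key = event["session_id"].encode()   # the "something unique"
        producer.send("sessions",
                      key=session_key,               # drives the partition choice
                      value=json.dumps(event).encode())

    publish({"session_id": "abc-123", "page": "/myhomepage.php"})
    producer.flush()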
  15. DATA PIPELINE : (KAFKA PATTERN) STATIC PARTITIONING
    The producer computes the sharding key of the DB, then inserts directly into the right partition.
    Consumers are statically bound to the correct partitions, and then insert directly into the right DB instance / shard.
    => benchmarks showed us a 6x performance gain on inserts
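    Here both sides name the partition explicitly: the producer passes a partition number derived from the DB sharding key, and each consumer assigns itself to its partition instead of joining a group; a sketch in which the sharding function, topic and keys are assumptions:

    # static partitioning: partition number == DB shard number, fixed on both sides
    import zlib
    from kafka import KafkaProducer, KafkaConsumer, TopicPartition

    NUM_SHARDS = 4

    def shard_of(key: bytes) -> int:
        # stand-in for the DB's real sharding function; must be stable across processes
        return zlib.crc32(key) % NUM_SHARDS

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    shard = shard_of(b"customer-42")
    producer.send("db-inserts", value=b'{"customer": 42}', partition=shard)
    producer.flush()

    # one consumer process per DB shard, statically bound to "its" partition
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    consumer.assign([TopicPartition("db-inserts", shard)])
    for record in consumer:
        pass   # insert record.value into DB shard number `shard`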
  16. DATA PIPELINE : CONCLUSION
    Kafka is cool, drink it!
    Kafka is the pipe you need for your data pipeline.