
DATA PIPELINE AND KAFKA

I think Kafka is the Unix pipe of the 21st century. I will explain why and give some examples of common patterns used at my company (ContentSquare).

Raphael Mazelier

November 18, 2019


Transcript

  1. DATA PIPELINE : ME, MYSELF AND I && CONTENTSQUARE
    • ut0mt8 in 2019: a third boy, more white hair and a new job (again)
  2. DATA PIPELINE : A SIMPLE ONE (really)
    $ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -n 5
    access.log (input):
    46.229.168.133 - - [18/Oct/2019:12:22:24 +0200] "GET /myhomepage.php HTTP/1.1" 200 11544 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
    185.246.210.153 - - [18/Oct/2019:12:23:39 +0200] "GET /css/beautiful-color.css HTTP/1.0" 200 25003 "http:/www.mysupersite.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    output (top 5 requested paths):
    4189 /favicon.ico
    3631 /myhomepage.php
    2124 /category/electric-something/
    1369 /products/electric-showers/deadmachine101.php
     915 /css/beautiful-color.css
  3. DATA PIPELINE : UNIX PHILOSOPHY → COMPOSABILITY
    • Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.”
    • Expect the output of every program to become the input to another, as yet unknown, program.
  4. DATA PIPELINE : PIPE AND DATA FORMAT
    Good:
    - Composability / do one thing well
    - Streams
    - Simple, powerful interface
    Problems:
    - Single machine only
    - One-to-one communication only
    - Input parsing, output escaping
    - No fault tolerance
    The simplest possible interface:
    • ordered sequence of bytes
    • maybe with EOF
    • often ASCII
    • \n = record separator
    • [ \t] = field separator
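    For illustration, a minimal sketch of a program that respects this interface, playing the awk stage of slide 2 in Python; the field index and file name are assumptions, not from the talk:

    #!/usr/bin/env python3
    # print-field.py - a well-behaved pipe citizen: reads '\n'-separated records
    # on stdin, splits fields on whitespace, writes one record per line on stdout
    import sys

    FIELD = 6  # 0-based index of the request path in a combined access log line

    for line in sys.stdin:          # '\n' = record separator
        fields = line.split()       # [ \t] = field separator
        if len(fields) > FIELD:
            print(fields[FIELD])    # output becomes the input of the next program

    # usage: ./print-field.py < access.log | sort | uniq -c | sort -rn | head -n 5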
  5. DATA PIPELINE : KAFKA, PIPE FOR THE 21st CENTURY?
    Kafka:
    - Messages
    - Durable
    - Buffering
    - Multi-subscriber pub/sub pattern
    - Distributed processing
    - Replication, auto recovery
    - Schema management & encoding
    vs Unix pipes:
    - Byte stream
    - In-memory
    - Blocking / backpressure
    - One to one
    - Single machine only
    - No fault tolerance
    - Input parsing / output escaping
    > otherwise quite similar :)
    Fun fact: Command-line Tools can be 235x Faster than your Hadoop Cluster
    https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
  6. DATA PIPELINE : (FAST REWIND) WHAT IS KAFKA?
    • Kafka is a distributed streaming platform used to publish and subscribe to streams of records
    • Kafka is most often used for streaming data in real time into other systems
    • Why Kafka?
      ◦ Kafka has operational simplicity
      ◦ Kafka is polyglot
      ◦ Kafka is scalable
      ◦ Kafka has record retention
      ◦ Kafka handles back pressure
      ◦ and last but not least, Kafka is fast! (zero copy)
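    As a taste of the publish/subscribe API, a minimal round trip with the kafka-python client; broker address, topic name and payload are placeholders, not ContentSquare's:

    # producer side: publish a record to a topic (kafka-python client)
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("pageviews", value=b'{"url": "/myhomepage.php", "status": 200}')
    producer.flush()   # make sure the record really left the client buffer

    # consumer side: subscribe to the topic and read the stream of records
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("pageviews",
                             bootstrap_servers="localhost:9092",
                             group_id="demo",
                             auto_offset_reset="earliest")
    for record in consumer:
        print(record.partition, record.offset, record.value)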
  7. DATA PIPELINE : KAFKA CONCEPTS
    Producers – consume the incoming data feed and send it to Kafka for distribution to consumers.
    Consumers – applications that subscribe to topics and read the data from Kafka.
    Brokers – workers that take data from the producers and serve it to the consumers. They handle replication as well.
    Topics – categories for messages.
    Partitions – the physical divisions of a topic. They are used for redundancy and are spread over different storage servers.
    Zookeeper – Apache Zookeeper coordinates services in distributed systems.
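    To make topics, partitions and replication concrete, a hedged sketch creating a topic with kafka-python's admin client; the topic name and the counts are illustrative, not the talk's settings:

    # create a topic split into several partitions, each replicated across brokers
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(name="sessions",
                 num_partitions=6,       # unit of parallelism, spread over brokers
                 replication_factor=3)   # each partition kept on 3 brokers
    ])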
  8. DATA PIPELINE : KAFKA@ContentSquare
    Numbers
    • 12 environments (prod / staging / X feature-env)
    • 2 main regions (EU / US)
    • ~ 20 Kafka clusters / ~ 80 instances / brokers
    • ~ 10 Zookeeper clusters / ~ 30 Zookeeper instances
    • ~ 5 TB of data collected per day
    • ~ 120k msg/s at peak for our first entry-point cluster
    Topology
    • Kafka for data-eng: streaming mode, intensive use, high retention
      ◦ a typical cluster is composed of 12 brokers (8 cores, 32 GB RAM, 2 TB disks)
    • Kafka for app: pub/sub, event based
  9. DATA PIPELINE : KAFKA OPERATION@ContentSquare
    Monitor everything
    • Monitor brokers / topics / partitions / Zookeeper
    • Alert on lag, producer rate, consumer rate, staleness
      ◦ => Prometheus / Alertmanager / PD
    • Lag is still problematic to monitor (the existing exporters are all broken :/)
      ◦ => so we wrote our own exporter (https://github.com/ut0mt8/yakle)
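    Consumer lag is essentially a partition's latest offset minus the group's committed offset; a rough sketch of that check with kafka-python (group and topic names are placeholders, and a real exporter such as yakle does much more):

    # rough consumer-lag check: log-end offset minus committed offset, per partition
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             group_id="my-consumer-group",   # group to inspect
                             enable_auto_commit=False)

    partitions = [TopicPartition("pageviews", p)
                  for p in consumer.partitions_for_topic("pageviews")]
    end_offsets = consumer.end_offsets(partitions)           # latest offset per partition

    for tp in partitions:
        committed = consumer.committed(tp) or 0              # None if never committed
        print(f"partition={tp.partition} lag={end_offsets[tp] - committed}")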
  10. DATA PIPELINE : TOPICS SIZING
    https://medium.com/contentsquare-engineering-blog/kafka-topics-sizing-how-much-messages-do-i-store-9b3d904a053e
    • retention calculation is hard
      ◦ size-based retention vs
      ◦ time-based retention
    • we made some spreadsheet magic to forecast
    • and then you need to check it!
      ◦ we wrote a custom tool to find the oldest message in a topic (https://github.com/ut0mt8/kafka-oldest-message)
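    The spreadsheet magic boils down to simple arithmetic: at a given ingest rate and message size, how many bytes must each partition keep to cover the time-based retention? A back-of-the-envelope sketch, with every number made up for the example:

    # back-of-the-envelope topic sizing: does size-based retention cover the
    # time-based retention we actually want? (all inputs are examples)
    msg_per_sec     = 120_000      # peak ingest rate for the topic
    avg_msg_bytes   = 1_000        # average serialized message size
    partitions      = 12
    replication     = 3
    retention_hours = 24           # time-based retention target

    bytes_per_partition = msg_per_sec * avg_msg_bytes * 3600 * retention_hours / partitions
    total_on_disk = bytes_per_partition * partitions * replication

    print(f"per partition : {bytes_per_partition / 1e9:.1f} GB")
    print(f"whole cluster : {total_on_disk / 1e12:.2f} TB")
    # compare bytes_per_partition against the topic's retention.bytes setting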
  11. DATA PIPELINE : (KAFKA PATTERN) BASIC PATTERNS
    - single producer / single partition / single consumer
      => just to decouple and handle back pressure
    - single producer / single partition / multiple consumers
      => publish/subscribe, the common pattern
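    The publish/subscribe variant relies on consumer groups: each group keeps its own offsets, so every group receives the full stream. A minimal sketch, assuming kafka-python and placeholder topic/group names:

    # two independent subscribers to the same topic: different consumer groups,
    # so each one sees every message (run them as two separate processes)
    from kafka import KafkaConsumer

    def subscribe(group_name):
        return KafkaConsumer("events",
                             bootstrap_servers="localhost:9092",
                             group_id=group_name,
                             auto_offset_reset="earliest")

    billing = subscribe("billing")       # gets the full stream
    analytics = subscribe("analytics")   # gets the full stream too, independently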
  12. DATA PIPELINE : (KAFKA PATTERN) TEE BACKUP
    After a transformation of the data, send it to a Kafka topic.
    This topic is read twice (or more):
    - by the next data processor
    - by something that writes a “backup” of the data (to S3, for example)
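    The “tee” branch is simply another consumer group on the same topic; a sketch where the backup side appends raw records to a local file as a stand-in for the real S3 upload (topic, group and path are assumptions):

    # backup branch of the tee: a dedicated consumer group archives every raw record
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("transformed-data",
                             bootstrap_servers="localhost:9092",
                             group_id="backup-writer",   # independent from the processor group
                             auto_offset_reset="earliest")

    with open("/var/backup/transformed-data.jsonl", "ab") as archive:
        for record in consumer:
            archive.write(record.value + b"\n")          # the real thing batches and ships to S3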
  13. DATA PIPELINE : (KAFKA PATTERN) ENRICHMENT
    Read an event from a topic, then enrich this event with some external data and re-post it to the same topic.
    The aggregator reads both events, but as it performs an aggregation operation this doesn't matter.
    This allows us to enrich data without putting a component in between => better resilience.
    The message/event must be idempotent.
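    A hedged sketch of that loop: consume the topic, skip events that are already enriched, attach the external data and re-post the result to the same topic (the enriched flag, topic name and lookup function are illustrative):

    # enrichment loop: read an event, add external data, re-post it to the same topic
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("sessions",
                             bootstrap_servers="localhost:9092",
                             group_id="enricher")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def lookup_geo(ip):
        """Placeholder for the external lookup (geo database, CRM, ...)."""
        return {"country": "FR"}

    for record in consumer:
        event = json.loads(record.value)
        if event.get("enriched"):            # second pass of the same event: nothing to do
            continue
        event.update(lookup_geo(event["ip"]))
        event["enriched"] = True
        producer.send("sessions", json.dumps(event).encode())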
  14. DATA PIPELINE : (KAFKA PATTERN) CUSTOM PARTITIONING
    The producer computes the sharding key, based on something unique (for example a session key, so each session lands in only one partition).
    It then contacts Kafka, which tells it where to insert / which partition.
    Consumers choose a partition at start (and periodically).
    => again, pro tip: choose your number of partitions wisely
    => allows us to scale (scale up producers / partitions / brokers / consumers)
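    With most clients the default partitioner hashes the message key, so using the session key as the Kafka key is enough to pin every event of a session to one partition; a minimal sketch with kafka-python (topic and field names are assumptions):

    # key-based partitioning: all events of one session carry the same key,
    # so the default partitioner (a hash of the key) routes them to one partition
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def publish(event):
        session_key = event["session_id"].encode()   # the "something unique"
        producer.send("sessions",
                      key=session_key,               # drives the partition choice
                      value=json.dumps(event).encode())

    publish({"session_id": "abc-123", "page": "/myhomepage.php"})
    producer.flush()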
  15. DATA PIPELINE : (KAFKA PATTERN) STATIC PARTITIONING
    The producer computes the sharding key of the DB, then inserts directly into the right partition.
    Consumers are statically bound to the correct partitions, and then insert directly into the right DB instance / shard.
    => benchmarks showed us a 6x performance gain on inserts
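    Here both sides name the partition explicitly: the producer passes a partition number derived from the DB sharding key, and each consumer assigns itself to its partition instead of joining a group; a sketch in which the sharding function, topic and keys are assumptions:

    # static partitioning: partition number == DB shard number, fixed on both sides
    import zlib
    from kafka import KafkaProducer, KafkaConsumer, TopicPartition

    NUM_SHARDS = 4

    def shard_of(key: bytes) -> int:
        # stand-in for the DB's real sharding function; must be stable across processes
        return zlib.crc32(key) % NUM_SHARDS

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    shard = shard_of(b"customer-42")
    producer.send("db-inserts", value=b'{"customer": 42}', partition=shard)
    producer.flush()

    # one consumer process per DB shard, statically bound to "its" partition
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    consumer.assign([TopicPartition("db-inserts", shard)])
    for record in consumer:
        pass   # insert record.value into DB shard number `shard`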
  16. DATA PIPELINE : CONCLUSION
    Kafka is cool, drink it!
    Kafka is the pipe you need for your data pipeline.