DATA PIPELINE AND KAFKA

I think Kafka is the Unix pipe of the 21st century. I'll explain why and give some examples of common patterns used at my company (ContentSquare).

Raphael Mazelier

November 18, 2019

Transcript

  1. DATA PIPELINE | KAFKA
    Raphaël Mazelier - 2019

  2. DATA PIPELINE : ME, MYSELF AND I && CONTENTSQUARE
    ● ut0mt8 in 2019 : a third boy, more white hair and a new job (again)

  3. DATA PIPELINE : WHAT IS A DATA PIPELINE ? a simple example

  4. DATA PIPELINE : A SIMPLE ONE (really)
    Sample access.log lines:
    46.229.168.133 - - [18/Oct/2019:12:22:24 +0200] "GET /myhomepage.php HTTP/1.1" 200 11544 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
    185.246.210.153 - - [18/Oct/2019:12:23:39 +0200] "GET /css/beautiful-color.css HTTP/1.0" 200 25003 "http:/www.mysupersite.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"

    Top 5 requested paths:
    $ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -n 5
    4189 /favicon.ico
    3631 /myhomepage.php
    2124 /category/electric-something/
    1369 /products/electric-showers/deadmachine101.php
    915 /css/beautiful-color.css

  5. DATA PIPELINE : UNIX PHILOSOPHY
    → COMPOSABILITY
    ● Make each program do one thing well. To do a new job, build afresh rather
    than complicate old programs by adding new “features.”
    ● Expect the output of every program to become the input to another, as yet
    unknown, program.

  6. DATA PIPELINE : ANOTHER EXAMPLE (enrichment)

  7. DATA PIPELINE : PIPE AND DATA FORMAT
    Good:
    - Composability/do one thing well
    - Streams
    - Simple, powerful interface
    Problems:
    - Single machine only
    - One to one communication only
    - Input parsing, output escaping
    - No fault tolerance
    The simplest possible interface :
    ● ordered sequence of bytes
    ● maybe with EOF
    ● often ASCII
    ● \n = record separator
    ● [ \t] = field separator
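
    To make the "input parsing, output escaping" problem concrete, here is a minimal sketch (mine, not from the deck) of what every program in such a pipeline re-implements, assuming newline-separated records and tab-separated fields:

    import sys

    # Read the byte stream coming from the previous program in the pipeline:
    # one record per line, fields separated by tabs.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # Escaping tabs/newlines inside a field is each program's own problem:
        # the pipe itself only carries an ordered sequence of bytes.
        print("\t".join(fields))  # re-emit for the next, as yet unknown, program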

  8. DATA PIPELINE : REAL EXAMPLE (guess what’s in blue?)

  9. DATA PIPELINE : KAFKA, PIPE FOR THE 21st CENTURY?
    Kafka VS Unix pipes
    Kafka:
    - Messages
    - Durable
    - Buffering
    - Multi-subscriber pub/sub pattern
    - Distributed processing
    - Replication, auto recovery
    - Schema management & encoding
    Unix pipes:
    - Byte stream
    - In-memory
    - Blocking / backpressure
    - One to one
    - Single machine only
    - No fault tolerance
    - Input parsing / output escaping
    > otherwise quite similar :)
    Fun fact : Command-line Tools can be 235x Faster than your Hadoop Cluster
    https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

  10. DATA PIPELINE : (FAST REWIND) WHAT IS KAFKA ?
    ● Kafka is a distributed streaming platform used to publish and subscribe to
    streams of records
    ● Kafka is most often used for streaming data in real time into other systems
    ● Why Kafka ?
    ○ Kafka has operational simplicity
    ○ Kafka is polyglot
    ○ Kafka is scalable
    ○ Kafka has record retention
    ○ Kafka handles back pressure
    ○ and last but not least, Kafka is fast! (zero copy)
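
    To make the publish/subscribe wording concrete, a minimal producer sketch (mine, not from the deck), assuming a broker on localhost:9092, the kafka-python client and a made-up topic name "pageviews":

    import json
    from kafka import KafkaProducer

    # Publish a stream of records to a topic; any number of as-yet-unknown
    # consumers can subscribe to that topic later.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("pageviews", {"url": "/myhomepage.php", "status": 200})
    producer.flush()  # block until the broker has acknowledged the record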

  11. DATA PIPELINE : KAFKA CONCEPTS
    Producers – read the source data feed and send it to Kafka for distribution to consumers.
    Consumers – applications that subscribe to topics and read the data from Kafka.
    Brokers – workers that take data from the producers and send it to the consumers. They handle replication as well.
    Topics – categories for messages.
    Partitions – the physical divisions of a topic. They provide parallelism and redundancy, and are spread over different storage servers (brokers).
    Zookeeper – Apache Zookeeper coordinates services in distributed systems.
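
    A small sketch of how topics, partitions and replication show up when creating a topic (assuming kafka-python's admin client; the topic name and the numbers are illustrative):

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    # 6 partitions spread the topic over the brokers (parallelism),
    # replication_factor=3 keeps a copy of each partition on 3 different brokers (redundancy).
    admin.create_topics([NewTopic(name="pageviews", num_partitions=6, replication_factor=3)])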

  12. DATA PIPELINE : KAFKA CONCEPTS

  13. DATA PIPELINE : KAFKA CONCEPTS (consumer group)
    > Think carefully about your number of partitions
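
    Why the number of partitions matters: consumers sharing a group_id split the topic's partitions among themselves, so the partition count caps consumer parallelism. A sketch with kafka-python (topic and group names are made up):

    from kafka import KafkaConsumer

    # Run several copies of this process with the same group_id:
    # Kafka assigns each copy a subset of the topic's partitions.
    # More consumers than partitions => the extra consumers sit idle.
    consumer = KafkaConsumer(
        "pageviews",
        bootstrap_servers="localhost:9092",
        group_id="pageview-aggregator",
        auto_offset_reset="earliest",
    )
    for record in consumer:
        print(record.partition, record.offset, record.value)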

  14. DATA PIPELINE : KAFKA CONCEPTS (rebalancing)
    > Take care of your client implementation...
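
    One concrete example of "take care of your client implementation": a consumer that takes too long between polls is considered dead and triggers a rebalance of the whole group. A hedged sketch of the kafka-python settings involved (values are illustrative, not a recommendation):

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "pageviews",
        bootstrap_servers="localhost:9092",
        group_id="pageview-aggregator",
        max_poll_records=100,           # don't fetch more than you can process in time
        max_poll_interval_ms=300000,    # max time between polls before a rebalance is triggered
        session_timeout_ms=10000,       # how fast a silent consumer is declared dead
    )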

  15. DATA PIPELINE : KAFKA CONCEPTS (Fault tolerance)

  16. DATA PIPELINE : KAFKA@ContentSquare
    Numbers
    ● 12 environments (prod / staging / X feature-env)
    ● 2 main regions (EU / US)
    ● ~ 20 kafka clusters / ~ 80 instances / brokers
    ● ~ 10 zookeeper clusters / ~ 30 zookeeper instances
    ● ~ 5TB data collected per day
    ● ~ 120k msg/s at peak for our first entry-point cluster
    Topology
    ● kafka for data-eng : streaming mode, intensive use, high retention
    ○ typical cluster is composed of 12 brokers (8 cores, 32 GB RAM, 2 TB disks)
    ● kafka for app : pub-sub event based.

  17. DATA PIPELINE : KAFKA OPERATION@ContentSquare
    [Diagram: Multi-AZ deployment, brokers spread across availability zones A, B and C]

  18. DATA PIPELINE : KAFKA OPERATION@ContentSquare

  19. DATA PIPELINE : KAFKA OPERATION@ContentSquare
    Monitor Everything
    ● Monitor brokers / topics / partitions / zookeeper
    ● Alert on lag, producer rate, consumer rate, staleness
    ○ => Prometheus / Alert Manager / PD
    ● Lag is still problematic to monitor (exporters are all broken :/)
    ○ => so we wrote our own exporter (https://github.com/ut0mt8/yakle)
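
    For reference, consumer lag is just "latest offset in the partition minus the offset committed by the group". A rough sketch of that computation with kafka-python (group and topic names are made up):

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="pageview-aggregator")
    tp = TopicPartition("pageviews", 0)
    consumer.assign([tp])

    end = consumer.end_offsets([tp])[tp]      # newest offset in the partition
    committed = consumer.committed(tp) or 0   # last offset committed by the group
    print("lag on partition 0:", end - committed)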

  20. DATA PIPELINE : TOPICS SIZING
    https://medium.com/contentsquare-engineering-blog/kafka-topics-sizing-how-much-messages-do-i-store-9b3d904a053e
    ● retention calculation is hard
    ○ size based retention vs
    ○ time based retention
    ● we made some spreadsheet magic to forecast
    ● and then you need to check it !
    ○ we wrote a custom tool to find the oldest message in a
    topic (https://github.com/ut0mt8/kafka-oldest-message)
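
    The retention arithmetic itself is simple; getting realistic inputs is the hard part. A back-of-the-envelope sketch (the numbers are illustrative, only the 120k msg/s peak comes from the slides):

    # Rough size-based forecast for one topic.
    msg_per_sec = 120_000          # peak ingest rate
    avg_msg_bytes = 1_500          # assumed average record size
    retention_hours = 24           # time-based retention target
    replication_factor = 3

    bytes_on_disk = msg_per_sec * avg_msg_bytes * retention_hours * 3600 * replication_factor
    print(f"~{bytes_on_disk / 1e12:.1f} TB across the cluster")
    # Then check the forecast against reality, e.g. by finding the oldest
    # message actually present in the topic (cf. kafka-oldest-message above).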

  21. DATA PIPELINE : (KAFKA PATTERN) BASIC PATTERNS
    - single producer / single partition / single consumer
    => just to decouple and handle back pressure
    - single producer / single partition / multiple consumers
    => publish / subscribe common pattern

  22. DATA PIPELINE : (KAFKA PATTERN) TEE BACKUP
    After a transformation of the data, send it to a Kafka topic.
    This topic is read twice (or more):
    - by the next data processor
    - by something that writes a “backup” of the data (to S3 for example)
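
    The "tee" comes for free with Kafka: two consumer groups reading the same topic each get the full stream, with their own committed offsets. A sketch (topic, group names and the backup destination are made up):

    from kafka import KafkaConsumer

    # Consumer group 1: the next data processor.
    processor = KafkaConsumer("enriched-events", group_id="processor",
                              bootstrap_servers="localhost:9092")

    # Consumer group 2: an independent reader that archives every record,
    # e.g. batching them and writing the batches to S3.
    backup = KafkaConsumer("enriched-events", group_id="s3-backup",
                           bootstrap_servers="localhost:9092")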

  23. DATA PIPELINE : (KAFKA PATTERN) ENRICHMENT
    Read an event from a topic.
    Then enrich this event with some external data and re-post it to the same topic.
    The aggregator reads both events, but since it performs an aggregation this doesn't matter.
    Allows us to enrich data without a component in between => more resilient.
    The message/event must be idempotent.
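
    A sketch of the enrichment loop (not ContentSquare's actual code; the topic name, the “enriched” flag and the geo lookup are illustrative):

    import json
    from kafka import KafkaConsumer, KafkaProducer

    def lookup_geo(ip):
        return "FR"  # placeholder for the real external lookup

    consumer = KafkaConsumer("events", group_id="enricher",
                             bootstrap_servers="localhost:9092",
                             value_deserializer=lambda v: json.loads(v))
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    for record in consumer:
        event = record.value
        if event.get("enriched"):                    # already re-posted, skip it
            continue
        event["country"] = lookup_geo(event["ip"])   # add the external data
        event["enriched"] = True
        producer.send("events", event)               # re-post to the same topic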

  24. DATA PIPELINE : (KAFKA PATTERN) CUSTOM PARTITIONING
    The producer computes the resulting sharding key, based on something unique
    (for example a session key, so each session ends up in only one partition).
    Then it contacts Kafka, which tells it which partition to write to.
    Consumers pick up partitions at start (and periodically) => again, pro tip: choose your number of partitions wisely.
    => Allows us to scale (scale out producers / partitions / brokers / consumers)
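
    In its simplest form this is just key-based partitioning: the client hashes the key to pick the partition, so every event of a session lands in the same partition. A sketch (topic and field names are made up):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    def publish(event):
        # Same session key => same hash => same partition,
        # whatever the number of producers or consumers.
        producer.send("sessions", key=event["session_id"].encode("utf-8"), value=event)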

  25. DATA PIPELINE : (KAFKA PATTERN) STATIC PARTITIONING
    The producer computes the DB sharding key.
    Then it inserts directly into the right partition.
    Consumers are statically bound to the correct partitions,
    and then insert directly into the right DB instance / shard.
    => Benchmarks showed us a 6x performance improvement on inserts.
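
    A sketch of the static variant: the producer targets a partition explicitly, and each consumer binds itself to fixed partitions with assign() instead of joining a consumer group (the shard count, names and sharding function are illustrative):

    import zlib
    from kafka import KafkaProducer, KafkaConsumer, TopicPartition

    NB_SHARDS = 4  # one topic partition per DB shard

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def publish(key: bytes, payload: bytes):
        shard = zlib.crc32(key) % NB_SHARDS          # same sharding function as the DB
        producer.send("to-db", value=payload, partition=shard)

    # Consumer for shard 2: statically assigned, no group rebalancing involved;
    # every record read here goes straight into DB instance / shard 2.
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    consumer.assign([TopicPartition("to-db", 2)])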

  26. DATA PIPELINE : CONCLUSION
    Is Cool, drink it!
    Kafka is the pipe you need for your
    data pipeline.

  27. DATA PIPELINE : QUESTIONS ? (or beers ?)
