Kafka JVM adventures

KAFKA AND A DEVELOPER WALK INTO A BAR
Shlomi Shemesh, R&D Group Manager

Mobile Marketing Analytics
• 2014: team of 10
• 2015: team of 35 in 2 offices
• 2016: team of 130 in 8 offices, expected to double next year
• Product partnership with [partner logos]
• 7 billion messages per day, which translates to several times that number of Kafka messages
• Hundreds of nodes using a few Kafka clusters holding 40+ topics

The Mobile Attribution Eco-System
[Diagram: Ad Networks, Agencies, Advertisers]

QUEUES

Pub-Sub vs. Synchronous Invocation

Apache Kafka
Publish-subscribe messaging rethought as a distributed commit log.
• Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
• Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
• Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
• Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

Apache Kafka
Publish-subscribe messaging rethought as a distributed commit log.
• More than one client can read the same message: messages are kept until Kafka discards them (retention policy)
• Great support for concurrency using Kafka partitions
• Kafka does not hold the queue state; tracking the read position is the responsibility of the consumer
• Guarantees message delivery "at least once"
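The last two bullets can be illustrated with a small in-memory sketch. It does not use a Kafka client; `PartitionLog`, `Consumer`, and the `crash_before_commit` flag are hypothetical names made up for this illustration. It assumes the consumer advances its own position only after processing a batch, so a failure before that point causes the same batch to be delivered again.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Append-only list standing in for one Kafka partition (illustrative only)."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)

    def read_from(self, offset, max_records=2):
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Tracks its own position; the log never records who has read what."""
    log: PartitionLog
    committed: int = 0  # next offset to read after a restart

    def poll_and_process(self, handler, crash_before_commit=False):
        batch = self.log.read_from(self.committed)
        for record in batch:
            handler(record)
        if crash_before_commit:
            return  # position is not advanced, so the batch is redelivered
        self.committed += len(batch)


log = PartitionLog()
for event in ["install", "launch", "purchase"]:
    log.append(event)

seen = []
consumer = Consumer(log)
consumer.poll_and_process(seen.append, crash_before_commit=True)  # simulated failure
consumer.poll_and_process(seen.append)  # same batch again: duplicates, no loss
consumer.poll_and_process(seen.append)
```

After the simulated failure the first two events are processed twice, which is why handlers downstream of an at-least-once queue usually need to tolerate duplicates.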

AppsFlyer Real-Time Architecture
[Architecture diagram. Labels: RealTime http messages; ELBs; Web Handlers; Raw Reports; Data River; Matches; Launches; Aggr. Reports; IO Cluster; Deep Analytics; Secor; Metadata; Loyals]

AppsFlyer River ® Kafka in Production

AppsFlyer Data Flows
[Diagram: Media Sources, App Developers, App Users]

The River Microservice
• 120 c3.large nodes in the River cluster
• ~1000 partners and many thousands of configured customers
• Processing a high volume of events per second
• Multiple Topics
[Diagram: Http Post, Parse and normalize, Postback generator, Postback Http Sender, Monitoring and Analytics]

When Services Go Pop
• Because we use queues, there is no data loss if (or when) one of the River machines fails
• Load is always distributed among all machines in the cluster

Kafka Multiple Partitions = Elastic Microservice Design
* more about monitoring and scaling in the next session
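A rough sketch of why multiple partitions make the service elastic: partitions are divided among the live machines of a consumer group, and when a machine drops out its partitions move to the survivors. The function below is a simplified round-robin assignment written for illustration, not Kafka's own assignor implementation; the machine names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to the live consumers of one group."""
    if not consumers:
        raise ValueError("at least one live consumer is required")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))
before = assign_partitions(partitions, ["river-1", "river-2", "river-3"])
# river-2 fails: its partitions are redistributed among the survivors
after = assign_partitions(partitions, ["river-1", "river-3"])
```

The number of partitions caps how many machines in one group can share the work, so it is usually chosen with future growth in mind.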

Why Queues?
• “Shock absorber”
• Machine maintenance
• Rush hours vs. off-peak traffic time
• Quickly spot and fix bottlenecks

BACK-PRESSURE

Queue as a Back-Pressure
• “Back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it… will ensure that the system is resilient under load” (http://www.reactivemanifesto.org/glossary#Back-Pressure)
• AppsFlyer Attribution microservice
• River blocked by partner server timeouts
• Pull vs. Push
• Queue Lag = Log size - Offset
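The lag formula on this slide can be computed per partition as in the sketch below. The offsets are made-up numbers, and `queue_lag` is a hypothetical helper; in practice both values would come from the Kafka cluster or a monitoring tool.

```python
def queue_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log size minus the consumer's committed offset."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }


log_end = {0: 1500, 1: 1480, 2: 1510}    # hypothetical log sizes per partition
committed = {0: 1500, 1: 1200, 2: 1505}  # hypothetical consumer offsets
lag = queue_lag(log_end, committed)
total_lag = sum(lag.values())
```

A lag that keeps growing on one partition while the others stay near zero points at a slow or stuck consumer rather than a general traffic spike.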

AppsFlyer River ® Kafka in Production

Constant Traffic Growth

Troubles in Paradise
[Diagram: The River Microservice. Http Post, Parse and normalize, Postback generator, Postback Http Sender, Monitoring and Analytics; Multiple Topics]

Troubles in Paradise
• Commit or not commit?!
• Major issues on destination timeouts => unstable number of threads per JVM while HTTP calls block
• Unstable memory and CPU use
• Services committing suicide

QUEUES EVERYWHERE!

Refactor - Introduce Internal Queues
[Diagram: Consumer (single topic per mode), Parse and Normalize, Postback Generator, Postback Http Sender, Monitoring and Analytics]
• Mode per topic
• Queues and back-pressure within our microservice using blocking channels
• Spot the bottlenecks by monitoring the buffers
• Efficiently utilize the CPU cores
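The idea of internal queues with back-pressure can be sketched as follows. This is not the River code: Python's `queue.Queue` with a `maxsize` stands in for the blocking channels mentioned on the slide, and the stage functions are placeholders. A stage that produces faster than its neighbour consumes simply blocks on `put`, so buffers stay bounded.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox; a full outbox blocks this stage."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(item))  # blocks while the downstream buffer is full


# Small buffers make the back-pressure easy to observe.
raw = queue.Queue(maxsize=2)
parsed = queue.Queue(maxsize=2)
postbacks = queue.Queue(maxsize=2)


def feed():
    for event in [" install ", " launch ", " purchase "]:
        raw.put(event)
    raw.put(STOP)


threads = [
    threading.Thread(target=feed),
    threading.Thread(target=stage, args=(str.strip, raw, parsed)),
    threading.Thread(target=stage, args=(lambda e: f"postback:{e}", parsed, postbacks)),
]
for t in threads:
    t.start()

sent = []
while (item := postbacks.get()) is not STOP:
    sent.append(item)
for t in threads:
    t.join()
```

Sampling `qsize()` on each buffer is one simple way to see which stage is the bottleneck: the queue in front of the slow stage stays full while the ones after it stay empty.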

Just Pop-Up Another Consumer Group
• Easily add additional microservices to consume from the same topic
• Use Kafka consumer groups to test against production data during development or on test servers
• Data migration from U.S.-located to E.U.-located servers
• Replace a legacy microservice with new code by running both in parallel
• Replay data in case of a bug
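A minimal in-memory sketch of why an extra consumer group is cheap: each group keeps its own position in the same log, so a new service can read everything without affecting the existing one, and can rewind to replay. `GroupReader` and the group names are hypothetical and only mimic the behaviour; no Kafka client is involved.

```python
class GroupReader:
    """One consumer group's view of a shared log, with its own offset."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.offset = 0  # independent of every other group

    def poll(self):
        batch = self.log[self.offset:]
        self.offset = len(self.log)
        return batch

    def seek(self, offset):
        """Rewind, for example to replay data after fixing a bug."""
        self.offset = offset


topic = ["e1", "e2", "e3"]
legacy = GroupReader("river-legacy", topic)
candidate = GroupReader("river-new", topic)

legacy_seen = legacy.poll()
candidate_seen = candidate.poll()  # same messages, read independently

topic.append("e4")
legacy_tail = legacy.poll()        # legacy continues from where it stopped

candidate.seek(0)                  # replay from the start
replayed = candidate.poll()
```

Replay only works for messages still inside the topic's retention window.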

Quickly Spot Where the Bottleneck is

IO Cluster to the Rescue
• A microservice that uses Java's event-driven, non-blocking I/O model to make the HTTP calls (via the http-kit library)
• Separates the complexity of the HTTP calls from the complexity of the business logic that produces them
• ~1000 concurrent HTTP calls on a single node, up to 110k per minute, using “enhanced networking” machines
• Generic: receives the output topic as an input message parameter
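The shape of such a service can be sketched in Python with `asyncio` in place of http-kit. Everything here is an assumption made for illustration: `fake_http_post` replaces a real network call, the URLs and the `audit` topic are invented, and plain lists stand in for Kafka topics. The sketch shows two ideas from the bullets: a cap on concurrent in-flight calls, and the result being routed to whatever output topic the request names.

```python
import asyncio


async def fake_http_post(url, body):
    """Stand-in for a real non-blocking HTTP call (no network involved)."""
    await asyncio.sleep(0)  # yields to the event loop, as real I/O would
    return {"url": url, "status": 200}


async def io_worker(requests, topics, max_in_flight=1000):
    """Run all requests concurrently, capped at max_in_flight, and route replies."""
    limit = asyncio.Semaphore(max_in_flight)

    async def handle(request):
        async with limit:
            response = await fake_http_post(request["url"], request["body"])
        # The caller chose where the result goes, so this service stays generic.
        topics.setdefault(request["output_topic"], []).append(response)

    await asyncio.gather(*(handle(r) for r in requests))


requests = [
    {"url": "https://partner.example/postback", "body": "a", "output_topic": "river-analytics"},
    {"url": "https://partner.example/postback", "body": "b", "output_topic": "river-analytics"},
    {"url": "https://other.example/callback", "body": "c", "output_topic": "audit"},
]
topics = {}
asyncio.run(io_worker(requests, topics, max_in_flight=2))
```

Because no thread is parked while a call is waiting, a slow partner server ties up a semaphore slot rather than a JVM thread, which is the instability described on the earlier "Troubles in Paradise" slide.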

Refactor - Introduce Internal Queues
[Diagram. River: Kafka Consumer (single topic per mode), Postback Generator, Rule Engine, Kafka Producer. River-out topic. River IO Cluster: Kafka Consumer, Http Sender, Kafka Producer. River-analytics topic. River Analytics: Kafka Consumer, Monitoring and Analytics, Kafka Producer]

Tuning the Kafka Consumer

# of River machines reduced by 90%!
And we sleep much better at night.

Simple Made Easy (Rich Hickey, creator of Clojure)
• Our microservices are stateless
• We aim to keep our microservices simple, meaning Single Responsibility

Thank You. WE ARE HIRING!! Email: [email protected]
