PyConZA 2015: "Distributed pub-sub infrastructure with Apache Kafka" by Carl Scheffler

Distributed pub–sub infrastructure with Apache Kafka
Carl Scheffler
takealot.com

Apache Kafka
• Low latency, high throughput, publish–subscribe messaging
• Developed by LinkedIn (still used in production)
• Open-sourced in 2011
• Top-level Apache project in 2012

The Log
• append only
• sequential / strictly ordered
• record what happened and “when” (index is primary, not timestamp)
• (a queue is just a log that still has to happen)
• provides
  • audit trail
  • replication (reproduce state)
  • notification (react to specific messages)
  • aggregation (produce new data streams / logs)
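The log idea above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Kafka stores data; the class name `AppendOnlyLog` and its methods are made up for the example.

```python
class AppendOnlyLog:
    """In-memory stand-in for a log: records are only ever appended."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (its index in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset=0):
        """Return all records from `offset` onwards, in append order."""
        return list(self._records[offset:])


log = AppendOnlyLog()
for event in ("order placed", "payment received", "order shipped"):
    log.append(event)

# Replication: a new reader rebuilds the full state by replaying from offset 0.
replica = log.read(0)

# Notification: a reader that has already handled offset 0 asks only for what is new.
unseen = log.read(1)
```

The offset, not a timestamp, is what identifies a record, which is why each reader only needs to remember one integer to resume.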

Partitioned Logs
• split up message stream into independent message streams
• strict ordering within a partition
• no ordering between partitions (but you can add timestamps if you’d like)
• provides:
  • horizontal scaling
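A rough sketch of partitioning by key, assuming each message carries a key and that a CRC32 hash of the key picks the partition. The hash choice and the function names are assumptions for illustration; Kafka clients ship their own partitioners.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands in the same partition."""
    return zlib.crc32(key) % num_partitions


def partition_messages(messages, num_partitions):
    """Split one stream of (key, value) pairs into independent per-partition lists."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append(value)
    return partitions


messages = [
    (b"user-1", "u1:view"),
    (b"user-2", "u2:view"),
    (b"user-1", "u1:add to cart"),
    (b"user-2", "u2:checkout"),
    (b"user-1", "u1:checkout"),
]
partitions = partition_messages(messages, num_partitions=3)

# Everything for user-1 sits in one partition, in the order it was produced.
user_1_partition = partitions[choose_partition(b"user-1", 3)]
```

Ordering is only guaranteed inside `user_1_partition`; nothing can be said about how its messages interleave with those in another partition.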

Don’t reinvent…
• use the operating system and the file system
• persistence: supports real time and batch jobs. retention is configurable (default: 7 days), after which messages are deleted or the log is compacted
• speed: don’t reinvent read ahead, caching, swapping. logs are linear append-only files, and reading is also linear. this access pattern is very well optimised by the OS
• replication is just the same file on another machine
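The two clean-up policies named above, deletion after a retention period and log compaction, can be contrasted with a small sketch. Records are modelled here as hypothetical `(timestamp, key, value)` tuples in a list; this illustrates the idea only and is not Kafka's on-disk implementation.

```python
SEVEN_DAYS = 7 * 24 * 60 * 60  # the default retention period, in seconds


def delete_expired(log, now, retention_seconds=SEVEN_DAYS):
    """Time-based retention: drop records older than the retention window."""
    return [record for record in log if now - record[0] <= retention_seconds]


def compact(log):
    """Log compaction: keep only the latest record per key, preserving order."""
    last_index = {key: i for i, (_, key, _) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[1]] == i]


log = [
    (0, "order-1", "placed"),
    (200_000, "order-2", "placed"),
    (650_000, "order-1", "paid"),
    (700_000, "order-1", "shipped"),
]

after_retention = delete_expired(log, now=700_100)
after_compaction = compact(log)
```

Retention throws away the oldest record regardless of key, while compaction keeps the final state of every key regardless of age.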

At Takealot
• User tracking
• checkout
• recommendations
• common search terms
• AB testing
• Real time dashboards
• Order audit trail
• SQL trace

At Takealot
• 3-node Kafka cluster + 3-node Zookeeper cluster (replicated in EC2)
• Peak: 200 messages / second
• Average: 20 messages / second (about 1.75 million / day)
• Typical real time dashboard lag: 250 ms
• Index into ElasticSearch for search and more dashboards

Python Client Libraries
• pykafka
  • github.com/Parsely/pykafka
  • v2.0.0 released on Monday
  • Apache licence
• kafka-python
  • github.com/mumrah/kafka-python
  • v0.9.4 released in June
  • Apache licence

Python Client Libraries
• pykafka
  • more fully featured: balanced consumers, manage topics, offsets
  • supports threading and greenlets for async
  • requires Kafka 0.8.2
• kafka-python
  • more stable
  • threading only for async
  • supports Kafka 0.8.0, 0.8.1, 0.8.2
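For orientation, a minimal sketch of producing with pykafka. It assumes pykafka 2.x is installed and uses `KafkaClient`, `client.topics[...]` and `topic.get_sync_producer()` as described in the pykafka README; check the version you install. The helper names, the `RecordingProducer` stand-in, and the host and topic values are made up so the sketch runs without a broker.

```python
def open_pykafka_producer(hosts, topic_name):
    """Connect with pykafka and return a synchronous producer for one topic.

    Assumes pykafka is installed and a broker is reachable at `hosts`.
    """
    from pykafka import KafkaClient  # imported lazily so the rest runs without pykafka

    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    return topic.get_sync_producer()


def publish_all(producer, messages):
    """Send each text message through any object exposing produce(bytes)."""
    sent = 0
    for message in messages:
        producer.produce(message.encode("utf-8"))
        sent += 1
    return sent


class RecordingProducer:
    """Hypothetical stand-in that records messages instead of sending them."""

    def __init__(self):
        self.produced = []

    def produce(self, message):
        self.produced.append(message)


fake = RecordingProducer()
count = publish_all(fake, ["checkout started", "checkout completed"])
```

Against a real cluster the stand-in would be replaced by something like `open_pykafka_producer("127.0.0.1:9092", b"events")`, where both arguments are placeholders.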

Python Client Libraries
• pykafka
  • producer: 41 400 – 46 500 – 50 200 Hz
  • consumer: 12 100 – 14 400 – 23 700 Hz
  • still a bit flaky: wait for 2.0.1
• kafka-python
  • producer: 26 500 – 27 700 – 29 500 Hz
  • consumer: 35 000 – 37 300 – 39 100 Hz
  • less active community

Python Client Libraries
• don’t forget Hadoop
• consumer and producer integration into Spark

When you might not want to use Kafka
• Persistence not needed → maintaining shared global state (consumer and producer offsets) in Zookeeper is extra overhead
• Don’t care so much about resilience → resilience requires extra disk and network traffic
• ØMQ is great for client–server or other common messaging patterns, including straightforward pub–sub → zeromq.org

Other things to think about
• designing topics
• metadata
• message types
• schema and validation
• monitoring consumers and producers
• hide all this from your application developers
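One way to read the last few points together is a thin publishing layer that adds metadata, checks each message type against a schema, and keeps the Kafka client out of application code. The sketch below is an assumption about how that could look, not the setup used at Takealot; the schemas, field names and the `EventPublisher` class are all hypothetical.

```python
import json
import time
import uuid

# Hypothetical schemas: required field name -> expected Python type, per message type.
SCHEMAS = {
    "checkout.completed": {"order_id": int, "total_cents": int},
    "search.performed": {"query": str},
}


def build_message(message_type, payload, source, now=None):
    """Validate the payload and wrap it in an envelope carrying metadata."""
    schema = SCHEMAS.get(message_type)
    if schema is None:
        raise ValueError(f"unknown message type: {message_type}")
    for field, expected in schema.items():
        if not isinstance(payload.get(field), expected):
            raise ValueError(f"{message_type}: field {field!r} must be {expected.__name__}")
    envelope = {
        "id": str(uuid.uuid4()),
        "type": message_type,
        "source": source,
        "timestamp": time.time() if now is None else now,
        "payload": payload,
    }
    return json.dumps(envelope).encode("utf-8")


class EventPublisher:
    """Application-facing wrapper: callers never touch the Kafka client directly."""

    def __init__(self, producer, source):
        self._producer = producer  # any object exposing produce(bytes)
        self._source = source

    def publish(self, message_type, **payload):
        self._producer.produce(build_message(message_type, payload, self._source))
```

Application code would then call something like `publisher.publish("search.performed", query="kettle")` and never see topics, offsets or serialisation.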
