Slide 1

Distributed pub–sub infrastructure with Apache Kafka • Carl Scheffler • takealot.com

Slide 2

Apache Kafka • Low latency, high throughput, publish–subscribe messaging • Developed at LinkedIn (still used there in production) • Open-sourced in 2011 • Top-level Apache project since 2012

Slide 3

The Log • append only • sequential / strictly ordered • record what happened and “when” (index is primary, not timestamp) • (a queue is just a log that still has to happen) • provides • audit trail • replication (reproduce state) • notification (react to specific messages) • aggregation (produce new data streams / logs)
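To make the idea concrete, here is a purely illustrative Python sketch of a log as an append-only, strictly ordered sequence; the class and field names are made up for this example.

    import time

    class Log:
        """Purely illustrative: an append-only, strictly ordered log."""

        def __init__(self):
            self._entries = []  # append only, never modified in place

        def append(self, message):
            # The offset (index) is the primary notion of "when";
            # the timestamp is just extra payload.
            offset = len(self._entries)
            self._entries.append({"offset": offset,
                                  "timestamp": time.time(),
                                  "message": message})
            return offset

        def read(self, from_offset=0):
            # Sequential reads: replaying from offset 0 reproduces state
            # (replication); tailing the end gives notifications.
            return self._entries[from_offset:]

    log = Log()
    log.append("order 1001 placed")
    log.append("order 1001 paid")
    for entry in log.read(0):
        print(entry["offset"], entry["message"])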

Slide 4

Partitioned Logs • split up message stream into independent message streams • strict ordering within a partition • no ordering between partitions (but you can add timestamps if you’d like) • provides: • horizontal scaling
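A rough sketch, again purely illustrative, of how partitioning by key keeps per-key ordering while allowing horizontal scaling; the hash function and partition count are arbitrary choices for the example, not Kafka's actual partitioner.

    import zlib

    NUM_PARTITIONS = 3
    partitions = [[] for _ in range(NUM_PARTITIONS)]

    def partition_for(key):
        # Any stable hash works for the illustration; Kafka's real
        # partitioner differs in detail.
        return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

    def produce(key, message):
        p = partition_for(key)
        partitions[p].append((key, message))  # strict order within a partition
        return p

    produce("user-42", "viewed product")
    produce("user-42", "added to cart")   # same key, same partition, same order
    produce("user-7", "searched for headphones")
    print([len(p) for p in partitions])   # no ordering guarantee across partitions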

Slide 5

Don’t reinvent… • use the operating system and the file system • persistence: supports both real-time and batch consumers. retention is configurable (default: 7 days), after which messages are deleted or the log is compacted (see the sketch below) • speed: don’t reinvent read-ahead, caching, swapping. logs are linear append-only files and reads are also linear, which the OS optimises very well • replication is just the same file on another machine
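The difference between the two retention modes mentioned above can be sketched in a few lines; this only illustrates the semantics, not how Kafka actually stores segments.

    import time

    SEVEN_DAYS = 7 * 24 * 3600
    now = time.time()
    log = [  # (timestamp, key, value)
        (now - 10 * 24 * 3600, "user-42", "address A"),
        (now - 1 * 24 * 3600,  "user-42", "address B"),
        (now - 3600,           "user-7",  "address C"),
    ]

    # "delete" policy: entries older than the retention window are dropped.
    retained = [entry for entry in log if entry[0] >= now - SEVEN_DAYS]

    # "compact" policy: only the latest value per key is kept, whatever its age.
    compacted = {}
    for timestamp, key, value in log:
        compacted[key] = value

    print(len(retained), compacted)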

Slide 6

At Takealot • User tracking • checkout • recommendations • common search terms • AB testing • Real time dashboards • Order audit trail • SQL trace

Slide 7

At Takealot • 3-node Kafka cluster + 3-node Zookeeper cluster (replicated in EC2) • Peak: 200 messages / second • Average: 20 messages / second (about 1.75 million / day) • Typical real time dashboard lag: 250ms • Index into Elasticsearch for search and more dashboards

Slide 8

Demo

Slide 9

Python Client Libraries • pykafka • github.com/Parsely/pykafka • v2.0.0 released on Monday • Apache licence • kafka-python • github.com/mumrah/kafka-python • v0.9.4 released in June • Apache licence

Slide 10

Python Client Libraries • pykafka • more fully featured: balanced consumers, manage topics, offsets • supports threading and greenlets for async • requires Kafka 0.8.2 • kafka-python • more stable • threading only for async • supports Kafka 0.8.0, 0.8.1, 0.8.2
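For orientation, a rough pykafka sketch along the lines of its documented API at the time; the broker address, topic name, consumer group, and Zookeeper address are placeholders, so treat this as a starting point rather than production code.

    from pykafka import KafkaClient

    client = KafkaClient(hosts="127.0.0.1:9092")           # placeholder broker
    topic = client.topics[b"test.topic"]                   # placeholder topic

    # Producer: the synchronous producer blocks until the broker acknowledges.
    with topic.get_sync_producer() as producer:
        producer.produce(b"hello from pykafka")

    # Balanced consumer: coordinates partition assignment via Zookeeper.
    consumer = topic.get_balanced_consumer(
        consumer_group=b"demo-group",
        zookeeper_connect="127.0.0.1:2181")                # placeholder Zookeeper
    for message in consumer:
        if message is not None:
            print(message.offset, message.value)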

Slide 11

Python Client Libraries • pykafka • producer: 41 400 – 46 500 – 50 200 Hz • consumer: 12 100 – 14 400 – 23 700 Hz • still a bit flaky: wait for 2.0.1 • kafka-python • producer: 26 500 – 27 700 – 29 500 Hz • consumer: 35 000 – 37 300 – 39 100 Hz • less active community

Slide 12

Python Client Libraries

Slide 13

Python Client Libraries • don’t forget Hadoop • consumer and producer integration into Spark
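As a rough sketch of the Spark side, assuming the Spark Streaming Kafka integration package is available on the classpath; the Zookeeper address, group name, and topic below are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-demo")
    ssc = StreamingContext(sc, batchDuration=5)             # 5-second micro-batches

    # Receiver-based stream: yields (key, value) pairs from the given topic.
    stream = KafkaUtils.createStream(ssc,
                                     "127.0.0.1:2181",      # placeholder Zookeeper
                                     "spark-demo-group",    # placeholder group
                                     {"user.tracking": 1})  # placeholder topic

    stream.map(lambda kv: kv[1]).count().pprint()           # messages per batch

    ssc.start()
    ssc.awaitTermination()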

Slide 14

When you might not want to use Kafka • Persistence not needed: maintaining shared global state (consumer and producer offsets) in Zookeeper is extra overhead • Don’t care so much about resilience: resilience requires extra disk and network traffic • ØMQ is great for client–server or other common messaging patterns, including straightforward pub–sub (see the sketch below) • zeromq.org
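A rough pyzmq sketch of that straightforward pub–sub case, with publisher and subscriber meant to run as separate processes; the address and topic prefix are placeholders. Note there is no broker and nothing is persisted: a subscriber that is down simply misses messages, which is exactly the trade-off described above.

    import zmq

    def publisher():
        ctx = zmq.Context()
        pub = ctx.socket(zmq.PUB)
        pub.bind("tcp://*:5556")                         # placeholder address
        while True:
            # The leading word acts as the topic; subscribers filter on it.
            pub.send_string("orders order-1001 placed")

    def subscriber():
        ctx = zmq.Context()
        sub = ctx.socket(zmq.SUB)
        sub.connect("tcp://localhost:5556")
        sub.setsockopt_string(zmq.SUBSCRIBE, "orders")   # prefix subscription
        while True:
            print(sub.recv_string())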

Slide 15

Other things to think about • designing topics • metadata • message types • schema and validation • monitoring consumers and producers • hide all this from your application developers
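One way to hide those concerns from application developers is a thin publishing helper that stamps metadata on every message and validates it before it ever reaches Kafka; this is an illustrative sketch, and all field names are made up.

    import json
    import time
    import uuid

    REQUIRED_FIELDS = {"message_type", "payload"}

    def build_envelope(message_type, payload):
        # Metadata stamped on every message, whatever the topic.
        return {
            "message_id": str(uuid.uuid4()),
            "message_type": message_type,      # e.g. "order.created"
            "produced_at": time.time(),
            "payload": payload,
        }

    def validate(envelope):
        missing = REQUIRED_FIELDS - set(envelope)
        if missing:
            raise ValueError("missing fields: " + ", ".join(sorted(missing)))
        return envelope

    def publish(producer, message_type, payload):
        # producer is assumed to expose produce(bytes), e.g. a pykafka producer.
        envelope = validate(build_envelope(message_type, payload))
        producer.produce(json.dumps(envelope).encode("utf-8"))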