Solid Data Infrastructure with Logs

How to build data infrastructure based on logs.

Talk originally presented at PyCon Finland 2015

Jyrki Pulliainen

October 19, 2015

Transcript

  1. Talk overview ‣ How we do things today ‣ Why are there problems ‣ How logs help ‣ What is Kafka? ‣ Kafka with Python
  2. So, let’s talk about a program written in Scala that runs on JVM at a Python conference. (Keynoter, just before the mob with pitchforks)
  3. Complex Nightmare System ‣ Complicated data flows ‣ Inconsistencies creeping all over the place ‣ Price to introduce new systems
  4. Is the Data Consistent? [Diagram: the same writes (A, then B) are sent separately to the Database and the Cache; each store answers "OK!", yet they can end up holding different values.]
  5. Think of the Database You Have ‣ Data gets inserted into a transaction log first ‣ Data structures get mutated after ‣ If a crash happens, it can recover from the log ‣ The log is the source of truth ‣ Your database is a log-based data structure! (sketched below)
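To make that idea concrete (a toy sketch, not from the talk; the LogBackedStore class is invented for illustration): a key-value store that appends every write to a log file before mutating its in-memory structure can rebuild that structure by replaying the log after a crash.

```python
import json


class LogBackedStore:
    """Toy key-value store where the append-only log is the source of truth."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}      # derived, in-memory structure
        self._recover()     # rebuild state by replaying the log

    def _recover(self):
        try:
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]
        except FileNotFoundError:
            pass            # no log yet, nothing to replay

    def set(self, key, value):
        # 1) Append to the transaction log first...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2) ...then mutate the in-memory data structure.
        self.data[key] = value

    def get(self, key):
        return self.data[key]
```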
  6. What’s Kafka? ‣ Developed by LinkedIn ‣ Written in Scala, runs on JVM ‣ Distributed, partitioned, replicated commit log service ‣ Maintains feeds of messages in categories (topics)
  7. Terminology: Broker ‣ A single server in a Kafka cluster ‣ Broker metadata is stored in ZooKeeper ‣ Automatic leader election and failover
  8. Terminology: Topics and Logs ‣ A topic is a single logical binder for logs, a feed of messages ‣ A topic maintains a partitioned log ‣ Log size is controlled by a time-based retention policy ‣ Logs can be compacted by key (configuration sketched below)
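As an illustration of those retention options (a hedged sketch assuming the kafka-python package and its admin client, which postdates this 2015 talk; the topic names and values are invented): `retention.ms` caps how long records are kept, and `cleanup.policy=compact` keeps only the latest record per key.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Hypothetical topic with a 7-day time-based retention policy.
events = NewTopic(
    name="events",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)

# Hypothetical compacted topic: only the latest value per key is kept.
profiles = NewTopic(
    name="user-profiles",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"cleanup.policy": "compact"},
)

admin.create_topics([events, profiles])
```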
  9. Terminology: Partition ‣ Allows scaling a topic past a single server ‣ Kafka guarantees order within the partition ‣ Allows parallel consumption of a log
  10. Terminology: Message ‣ A single binary entry in the log ‣ Free format, though usually some encoding (JSON, Avro) is used ‣ Belongs to a partition ‣ Is keyed for compaction (producing such a message is sketched below)
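A minimal sketch of producing such a message from Python, assuming the kafka-python package and a broker on localhost (the topic name and payload are invented): the value is JSON-encoded, and the key is what compaction and partition assignment use.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Messages sharing a key land in the same partition, and compaction
# keeps only the latest value seen for each key.
producer.send("user-profiles", key="user-42", value={"name": "Ada", "plan": "pro"})
producer.flush()
```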
  11. Terminology: Consumer Group ‣ Consumes a single topic ‣ Multiple groups can consume the same topic ‣ Kafka assigns the partitions within the topic to the consumers within the consumer group ‣ Guarantees in-order consumption of each partition within the consumer group, something most other MQs don’t! (sketched below)
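Roughly what that looks like with kafka-python (group and topic names are invented): every process started with the same group_id shares the topic's partitions, and each partition is read in order by exactly one member of the group.

```python
from kafka import KafkaConsumer

# Run several copies of this script: Kafka splits the topic's partitions
# among all consumers that share the same group_id.
consumer = KafkaConsumer(
    "user-profiles",
    group_id="search-indexer",
    bootstrap_servers="localhost:9092",
)

for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```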
  12. Terminology: Consumer ‣ Consumes a single partition in a topic ‣ Can consume in any order ‣ Can reset the offset ‣ Consumers are grouped into consumer groups (offset handling sketched below)
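A hedged sketch of resetting the offset with kafka-python, assuming a consumer that is manually assigned one partition rather than joining a group (topic name and offset are invented): seek_to_beginning rewinds to the oldest retained message, and seek jumps to an arbitrary offset.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Take over partition 0 of the topic manually instead of joining a group.
partition = TopicPartition("user-profiles", 0)
consumer.assign([partition])

# Rewind to the start of the retained log...
consumer.seek_to_beginning(partition)

# ...or jump straight to an arbitrary offset.
consumer.seek(partition, 42)

for message in consumer:
    print(message.offset, message.value)
```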
  13. Life of a Single Topic in Kafka [Diagram: a topic with Partition 0, Partition 1 and Partition 2, each an ordered log of messages (#0, #1, #2, …) growing from old to new; the producer writes by appending to the end of a partition.]
  14. Nice stuff ‣ Easy to scale horizontally ‣ Guaranteed order of processing ‣ Producer has full control over the partitioning within a topic ‣ Data is retained forever if wanted ‣ …or limited by time ‣ …or compacted ‣ …or all combined!
  15. Why Kafka? Why not <another MQ>? ‣ Because I like it. ‣ Seriously, it really fits this use case. ‣ Super good at scaling up (if you need to care about that)
  16. Introducing new data sources [Diagram: a producer writes to Kafka; Search v1 and Search v2 both consume from it.] 1. Consume from the beginning 2. Start streaming 3. Switch over (sketched below)
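A sketch of step 1 under the same kafka-python assumption (topic, group, and the indexing function are invented): the new component gets a fresh consumer group and starts with auto_offset_reset="earliest", so it replays the whole retained log to rebuild its state before the switch-over.

```python
from kafka import KafkaConsumer

# The new component ("search v2") gets its own consumer group and
# replays history from the start of the retained log.
consumer = KafkaConsumer(
    "user-profiles",
    group_id="search-v2",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # begin at the oldest retained message
)

for message in consumer:
    index_into_search_v2(message.key, message.value)  # hypothetical indexing step
```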
  17. Introducing new data sources becomes a cheap operation! ‣ Cache reload / warm up? No prob! ‣ New DB schema? No prob! ‣ New component? No prob!