Solid Data Infrastructure with Logs

How to build data infrastructure based on logs.

Talk originally presented at PyCon Finland 2015

Jyrki Pulliainen

October 19, 2015

Transcript

  1. Talk overview ‣ How we do things today ‣ Why are there problems ‣ How logs help ‣ What is Kafka? ‣ Kafka with Python
  2. So, let’s talk about a program written in Scala that runs on JVM at a Python conference. (Keynoter, just before the mob with pitchforks)
  3. Complex Nightmare System ‣ Complicated data flows ‣ Inconsistencies creeping all over the place ‣ Price to introduce new systems
  4. Is the Data Consistent? [Diagram: the same writes (A, then B) are sent separately to the Database and the Cache; each store answers "OK!", yet they can end up holding different values.]
  5. Think of the Database You Have ‣ Data gets inserted into a transaction log first ‣ Data structures get mutated after ‣ If a crash happens, it can recover from the log ‣ The log is the source of truth ‣ Your database is a log-based data structure! (sketched below)
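To make that idea concrete (a toy sketch, not from the talk; the LogBackedStore class is invented for illustration): a key-value store that appends every write to a log file before mutating its in-memory structure can rebuild that structure by replaying the log after a crash.

```python
import json


class LogBackedStore:
    """Toy key-value store where the append-only log is the source of truth."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}      # derived, in-memory structure
        self._recover()     # rebuild state by replaying the log

    def _recover(self):
        try:
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]
        except FileNotFoundError:
            pass            # no log yet, nothing to replay

    def set(self, key, value):
        # 1) Append to the transaction log first...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2) ...then mutate the in-memory data structure.
        self.data[key] = value

    def get(self, key):
        return self.data[key]
```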
  6. What’s Kafka? ‣ Developed by LinkedIn ‣ Written in Scala, runs on JVM ‣ Distributed, partitioned, replicated commit log service ‣ Maintains feeds of messages in categories (topics)
  7. Terminology: Broker ‣ A single server in a Kafka cluster ‣ Broker metadata is stored in ZooKeeper ‣ Automatic leader election and failover
  8. Terminology: Topics and Logs ‣ A topic is a single logical binder for logs, a feed of messages ‣ A topic maintains a partitioned log ‣ Log size is controlled by a time-based retention policy ‣ Logs can be compacted by key (configuration sketched below)
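As an illustration of those retention options (a hedged sketch assuming the kafka-python package and its admin client, which postdates this 2015 talk; the topic names and values are invented): `retention.ms` caps how long records are kept, and `cleanup.policy=compact` keeps only the latest record per key.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Hypothetical topic with a 7-day time-based retention policy.
events = NewTopic(
    name="events",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)

# Hypothetical compacted topic: only the latest value per key is kept.
profiles = NewTopic(
    name="user-profiles",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"cleanup.policy": "compact"},
)

admin.create_topics([events, profiles])
```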
  9. Terminology: Partition ‣ Allows scaling a topic past a single server ‣ Kafka guarantees order within the partition ‣ Allows parallel consumption of a log
  10. Terminology: Message ‣ A single binary entry in the log ‣ Free format, though usually some encoding (JSON, Avro) is used ‣ Belongs to a partition ‣ Is keyed for compaction (producing such a message is sketched below)
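A minimal sketch of producing such a message from Python, assuming the kafka-python package and a broker on localhost (the topic name and payload are invented): the value is JSON-encoded, and the key is what compaction and partition assignment use.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Messages sharing a key land in the same partition, and compaction
# keeps only the latest value seen for each key.
producer.send("user-profiles", key="user-42", value={"name": "Ada", "plan": "pro"})
producer.flush()
```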
  11. Terminology: Consumer Group ‣ Consumes a single topic ‣ Multiple groups can consume the same topic ‣ Kafka assigns the partitions within the topic to the consumers within the consumer group ‣ Guarantees in-order consumption of each partition within the consumer group, something most other MQs don’t! (sketched below)
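Roughly what that looks like with kafka-python (group and topic names are invented): every process started with the same group_id shares the topic's partitions, and each partition is read in order by exactly one member of the group.

```python
from kafka import KafkaConsumer

# Run several copies of this script: Kafka splits the topic's partitions
# among all consumers that share the same group_id.
consumer = KafkaConsumer(
    "user-profiles",
    group_id="search-indexer",
    bootstrap_servers="localhost:9092",
)

for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```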
  12. Terminology: Consumer ‣ Consumes a single partition in a topic ‣ Can consume in any order ‣ Can reset the offset ‣ Consumers are grouped into consumer groups (offset handling sketched below)
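A hedged sketch of resetting the offset with kafka-python, assuming a consumer that is manually assigned one partition rather than joining a group (topic name and offset are invented): seek_to_beginning rewinds to the oldest retained message, and seek jumps to an arbitrary offset.

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Take over partition 0 of the topic manually instead of joining a group.
partition = TopicPartition("user-profiles", 0)
consumer.assign([partition])

# Rewind to the start of the retained log...
consumer.seek_to_beginning(partition)

# ...or jump straight to an arbitrary offset.
consumer.seek(partition, 42)

for message in consumer:
    print(message.offset, message.value)
```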
  13. Life of a Single Topic in Kafka [Diagram: a topic with Partition 0, Partition 1 and Partition 2, each an ordered log of messages (#0, #1, #2, …) growing from old to new; the producer writes by appending to the end of a partition.]
  14. Nice stuff ‣ Easy to scale horizontally ‣ Guaranteed order of processing ‣ Producer has full control over the partitioning within a topic ‣ Data is retained forever if wanted ‣ …or limited by time ‣ …or compacted ‣ …or all combined!
  15. Why Kafka? Why not <another MQ>? ‣ Because I like it. ‣ Seriously, it really fits this use case. ‣ Super good at scaling up (if you need to care about that)
  16. Introducing new data sources [Diagram: a producer writes to Kafka; Search v1 and Search v2 both consume from it.] 1. Consume from the beginning 2. Start streaming 3. Switch over (sketched below)
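A sketch of step 1 under the same kafka-python assumption (topic, group, and the indexing function are invented): the new component gets a fresh consumer group and starts with auto_offset_reset="earliest", so it replays the whole retained log to rebuild its state before the switch-over.

```python
from kafka import KafkaConsumer

# The new component ("search v2") gets its own consumer group and
# replays history from the start of the retained log.
consumer = KafkaConsumer(
    "user-profiles",
    group_id="search-v2",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # begin at the oldest retained message
)

for message in consumer:
    index_into_search_v2(message.key, message.value)  # hypothetical indexing step
```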
  17. Introducing new data sources becomes a cheap operation! ‣ Cache reload / warm up? No prob! ‣ New DB schema? No prob! ‣ New component? No prob!