
Decaton: High performance task processing with Kafka

LINE DevDay 2020

November 26, 2020

Transcript

  2. Speaker › Software engineer at LINE Z-Part team › Developing

    company-wide Apache Kafka platform › Decaton maintainer › Haruki Okada
  3. Background task processing is crucial for web apps › Workload separation › Quick response time == Better UX › Failure isolation (Diagram: Web API offloads push notifications, web hooks, search indexing, …)
  4. Background task processing at scale is challenging › Process massive background tasks › At maximal resource utilization efficiency › Without spoiling task processing reliability
  5. Decaton is: › A Kafka consumer framework › Aimed at implementing reliable task processing on Kafka › Battle tested in LINE service development for years › “High performance” is the most important feature Decaton provides
  6. Agenda › Why we developed Decaton › How Decaton works › Various engineering efforts we made to make it highly performant
  7. Apache Kafka at LINE › Distributed streaming platform › One of the most popular pieces of middleware at LINE › Over 90 services use the company-wide Kafka platform developed by our team
  8. Typical use case of Kafka at LINE › Background task processing (Diagram: Web API produces tasks involving external I/O to Kafka; a task processor processes them in the background against storage)
  9. Example: Notify profile updates (Diagram: API server handles “update profile”, produces a NotifyUpdateProfile task to Kafka; TaskProcessor consumes it and notifies friends)
  10. Background task processing at LINE › LINE’s scale › Over 1 million tasks / sec in the most high-traffic use case › At-least-once delivery › All produced tasks should be processed without loss › Usually I/O intensive › DB access, Web API calls, etc.
  11. Why Decaton?

  12. Kafka basics (Diagram: a producer such as a Web API writes keyed records (key: foo, bar, baz) to topic partitions … | 5 | 4 | 3 | 2 | 1 and … | 8 | 7 | 6 | 5 | 4; Consumers A and B in a consumer group read the partitions and commit processed offsets)
  13. Consumption model › Each consumer in a group gets assigned >= 0 partitions › Some consumers could get 0 partitions if num consumers > num partitions › => Max consumer concurrency is limited by the number of partitions (Diagram: Consumer A assigned 1, Consumer B assigned 1, Consumer C assigned 0)
  14. Basic consumer impl (Diagram: a consumer thread runs a poll() loop against partition … | 5 | 4 | 3 | 2 | 1)
  15. Processing model in major frameworks (Kafka Streams, Spring Kafka): sequential processing per partition (Diagram: a single consumer thread runs doProcess(1), doProcess(2), doProcess(3), … against partition … | 5 | 4 | 3 | 2 | 1, paying the full process latency per record)
  16. Consumer throughput is a problem › Per-partition throughput = 1 / (process latency per record) › If a process includes I/O with 10 ms latency => 100 tasks / second at max
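The cap stated on this slide follows directly from the reciprocal relationship between latency and sequential throughput. A tiny plain-Java illustration of the arithmetic (numbers taken from the slides):

```java
public class SequentialThroughput {
    // With sequential per-partition processing, throughput (tasks/sec)
    // is bounded by 1 / (process latency per record).
    static double maxThroughputPerPartition(double latencyMillis) {
        return 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        // One 10 ms I/O call per task caps a partition at 100 tasks/sec.
        System.out.println(maxThroughputPerPartition(10)); // 100.0
        // With 3 partitions (as in the benchmark slide), the topic tops out at 300 msg/sec.
        System.out.println(3 * maxThroughputPerPartition(10)); // 300.0
    }
}
```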
  17. Why not add more partitions? › It’s difficult to estimate required concurrency from the beginning › However: › Adding partitions often requires contacting the cluster administrator › Adding partitions has side effects › Message ordering breaks temporarily
  18. Why not add more partitions? › Partitions are the unit of consumer concurrency. At the same time: › The unit of producer batches => tends to generate smaller batches › Affects number of open file descriptors & mmapped files, etc. › Not preferable in LINE’s circumstances › Single, multi-tenant shared Kafka cluster
  19. Why not just process async?

  20. At-least-once is broken › 1. Fetch 1 to 5 › 2. Submit tasks to async executor › 3. Commit latest offset “5” as already submitted › 4. Process async › 5. Consumer crash!!! › 6. Other instance takes over from offset “5” › => LOST offsets 3, 4
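The failure sequence above can be reproduced in miniature. This sketch (plain Java, no real Kafka involved; `lostOffsets` is a hypothetical helper for illustration) commits the highest fetched offset as soon as tasks are submitted, then shows which records a restarted consumer would silently skip:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class NaiveAsyncCommit {
    // Offsets a restarted consumer skips: covered by the commit (because the
    // highest fetched offset was committed right after submission) but never
    // actually processed before the crash.
    static List<Long> lostOffsets(List<Long> fetched, Set<Long> completedBeforeCrash) {
        long resumeFrom = fetched.get(fetched.size() - 1) + 1; // committed "5" => resume at 6
        List<Long> lost = new ArrayList<>();
        for (long offset : fetched) {
            if (offset < resumeFrom && !completedBeforeCrash.contains(offset)) {
                lost.add(offset);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Fetch 1..5, submit async, commit "5", crash while 3 and 4 are in flight:
        System.out.println(lostOffsets(List.of(1L, 2L, 3L, 4L, 5L), Set.of(1L, 2L, 5L)));
        // => [3, 4]: the new instance resumes from 6 and never reprocesses them
    }
}
```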
  21. Why not just batch?

  22. Head-of-line blocking (Diagram: partition | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 …; current batch blocked, waiting for the single outlier “4” to complete)
  23. Ideal behavior (Diagram: continue processing without waiting for “4”; once 1, 2, 3 are all done, commit “3”)
  24. Ideal behavior › Let’s call this offset the “watermark” == the highest offset up to which all preceding offsets have been processed
  25. What we need is: › A mechanism to track the watermark under the condition: › Any offset could be completed from several threads in arbitrary order (Diagram: multiple processor threads completing offsets of partition | 5 | 4 | 3 | 2 | 1)
  26. How? › We can calculate the watermark by iterating over completed offsets in ascending order and checking the corresponding fetched offset › => Sort out-of-order arriving tasks by their offsets => Priority queue? (Diagram: completed offsets 4 | 2 | 1 against fetched offsets … | 5 | 4 | 3 | 2 | 1; watermark)
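The priority-queue idea can be sketched in a few lines: completed offsets arrive out of order, and the watermark advances only while the smallest completed offset is contiguous with it. This is a simplified sketch of the approach described on the slides, not Decaton's actual code; it assumes offsets start at 1:

```java
import java.util.PriorityQueue;

public class PriorityQueueWatermark {
    private final PriorityQueue<Long> completed = new PriorityQueue<>();
    private long watermark = 0; // highest offset with all predecessors processed

    // Called by processor threads when a task finishes; synchronized because
    // the queue is shared (exactly the lock that became a bottleneck later).
    synchronized void complete(long offset) {
        completed.add(offset);
        // Pop offsets while they are contiguous with the watermark.
        while (!completed.isEmpty() && completed.peek() == watermark + 1) {
            watermark = completed.poll();
        }
    }

    synchronized long watermark() {
        return watermark;
    }

    public static void main(String[] args) {
        PriorityQueueWatermark w = new PriorityQueueWatermark();
        for (long offset : new long[] {4, 2, 1}) { // completion order from the slide
            w.complete(offset);
        }
        System.out.println(w.watermark()); // 2: offset 3 is still in flight
    }
}
```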
  27. Initial approach: Priority queue › 1. Consumer thread registers fetched offsets as “pending” › 2. Submit tasks (5, 3, 1 and 4, 2) to processor threads › 3. Processor threads put completed offsets into the priority queue concurrently › 4. Zip offsets => watermark = 2
  28. Deployed to production (Diagram: API servers produce 1 million tasks / sec via Web API to Kafka; Decaton processors put to HBase)
  29. Watermark management became a bottleneck (profile: ObjectMonitor)

  30. Lock was necessary › To mutate the priority queue concurrently › Append to the queue › Remove offsets up to the watermark to keep queue length sane (Diagram: processor threads calling lock() around the shared queue 4 | 2 | 1)
  31. How can we improve? › Problem: mutating shared object (queue)

    by multiple processor threads 4 | 2 | 1 Processor thread Processor thread lock() lock()
  32. Get rid of mutating shared object › Each processor thread

    only mutates the state of offsets it’s interested in Processor thread Processor thread | 1 | | 2 | | 4 | | 5 | | 3 | markComplete() markComplete() Pending offsets
  33. Revised: Lock-free approach Consumer thread Processor thread Processor thread Fixed

    length ring buffer 1. Initialize states for fetched offsets (1..5) as “not completed” & set watermark pointer 0 1 2 3 4 5 6 7
  34. Revised: Lock-free approach Consumer thread 0 1 2 3 4

    5 6 7 2. Submit tasks 5,3,1 4,2 4. Advance watermark pointer 3. Mark offsets as “complete” Processor thread Processor thread
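The revised design on slides 33-34 can be sketched with a fixed-length ring buffer of per-offset completion flags: each processor thread only flips its own slot, and only the consumer thread advances the watermark pointer, so no lock is needed. A simplified sketch of the idea (not Decaton's actual implementation):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class OffsetRingBuffer {
    private final AtomicIntegerArray completed; // slot == 1 means "offset processed"
    private final int capacity;
    private long watermark; // highest offset with all predecessors processed

    // Assumes the consumer never fetches more than `capacity` offsets ahead
    // of the watermark (the real implementation bounds pending records).
    OffsetRingBuffer(long firstOffset, int capacity) {
        this.completed = new AtomicIntegerArray(capacity);
        this.capacity = capacity;
        this.watermark = firstOffset - 1;
    }

    // Called from any processor thread; each thread only flips its own
    // offset's slot, so no shared structure is mutated under a lock.
    void markComplete(long offset) {
        completed.set((int) (offset % capacity), 1);
    }

    // Called only from the consumer thread before committing offsets.
    long advanceWatermark() {
        int next = (int) ((watermark + 1) % capacity);
        while (completed.get(next) == 1) {
            completed.set(next, 0); // reclaim the slot for a future offset
            watermark++;
            next = (int) ((watermark + 1) % capacity);
        }
        return watermark;
    }

    public static void main(String[] args) {
        OffsetRingBuffer ring = new OffsetRingBuffer(1, 8); // fetched offsets start at 1
        ring.markComplete(4);
        ring.markComplete(2);
        System.out.println(ring.advanceWatermark()); // 0: offset 1 still in flight
        ring.markComplete(1);
        System.out.println(ring.advanceWatermark()); // 2: offsets 1,2 done; 3 pending
    }
}
```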
  35. No longer a bottleneck

  36. Now we have: › A mechanism to track the watermark in an optimized way › => The core of Decaton’s high-performance asynchronous task processing
  37. Decaton’s ordering semantics › Process ordering guarantee is relaxed from “per partition” to “per key” (Diagram: partition … | 5 | 4 | 3 | 2 | 1 with keys a, b, a, b, a is split into internal queues | 5 | 3 | 1 (key a) and | 4 | 2 | (key b), each handled by its own processor thread)
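Per-key ordering like this can be achieved by routing each record to one of N internal queues based on its key, so records with the same key always land in the same queue (and hence process in order), while different keys run in parallel. A minimal sketch of such routing, assuming simple hash-based assignment (not Decaton's actual routing code):

```java
import java.util.Arrays;

public class KeyRouter {
    // Records with the same key always map to the same internal queue,
    // preserving per-key order while distinct keys process concurrently.
    static int queueFor(byte[] key, int numQueues) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(Arrays.hashCode(key), numQueues);
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes();
        // Same key -> same queue, deterministically:
        System.out.println(queueFor(a, 2) == queueFor(a, 2)); // true
    }
}
```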
  38. Summary: Decaton’s processing model › Can process a single partition in multiple threads › Which isn’t possible in many other consumer frameworks › While preserving at-least-once processing semantics › At stable delivery latency › Because it’s not a “batching” model › Per-key process ordering guarantee
  39. Benchmark comparison › Process latency: 10 ms per task › Partition count: 3 › => In the sequential model, throughput is capped at 300 msg/sec (Chart: throughput (msg/sec), 0–3000, for Kafka Streams, Spring Kafka, Decaton, Decaton (10 threads / partition)) https://github.com/ocadaruma/decaton-benchmark-comparison
  40. How Decaton works

  41. Example: Notify profile updates TaskProcessor Web API API Server Kafka

    1. Define task protocol 2. Implement task producer 3. Implement task processor
  42. 1. Define task protocol (protobuf as example)
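The code on this slide is not captured in the transcript. For the profile-update example, the protobuf task definition would look something like the following sketch (field names are hypothetical; any protobuf message can serve as a Decaton task):

```protobuf
syntax = "proto3";

// Hypothetical task for the "notify profile updates" example.
message NotifyUpdateProfile {
  string user_id = 1;
  int64 updated_at = 2;
}
```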

  43. 2. Implement task producer
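The producer slide's code is likewise missing from the transcript. Based on Decaton's documented producer API, a sketch might look like this (topic name, application id, servers, and the task type are placeholders; exact signatures may differ between Decaton versions):

```java
// Sketch based on Decaton's documented DecatonClient API; names are placeholders.
DecatonClient<NotifyUpdateProfile> client =
        DecatonClient.producing("notify-update-profile",
                                new ProtocolBuffersSerializer<NotifyUpdateProfile>())
                     .applicationId("profile-api")
                     .bootstrapServers("localhost:9092")
                     .build();

// The key chooses the partition and the internal queue,
// so per-key ordering is preserved downstream.
client.put(userId, NotifyUpdateProfile.newBuilder()
                                      .setUserId(userId)
                                      .setUpdatedAt(System.currentTimeMillis())
                                      .build());
```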

  44. 3. Implement task processor
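The processor slide is also uncaptured. Based on Decaton's documented subscription API, wiring a processor could be sketched as below (topic and names are placeholders; CONFIG_PARTITION_CONCURRENCY is the property that lets one partition be processed by multiple threads, as described earlier):

```java
// Sketch based on Decaton's documented API; names are placeholders.
ProcessorSubscription subscription =
        SubscriptionBuilder.newBuilder("notify-profile-processor")
            .processorsBuilder(
                ProcessorsBuilder.consuming(
                        "notify-update-profile",
                        new ProtocolBuffersDeserializer<>(NotifyUpdateProfile.parser()))
                    .thenProcess((context, task) -> {
                        // I/O-heavy work: notify the user's friends.
                        notifyFriends(task.getUserId());
                    }))
            // Process a single partition with 10 threads (per-key order preserved).
            .properties(StaticPropertySupplier.of(
                Property.ofStatic(ProcessorProperties.CONFIG_PARTITION_CONCURRENCY, 10)))
            .consumerConfig(consumerConfig)
            .buildAndStart();
```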

  45. Decaton features › Decaton has various features useful for task processing › Driven by requirements from real-world LINE service development › Rate limiting › Retry queueing › And more…
  46. Rate limiting › Sudden traffic spike (abuse, big campaigns) hits the Web API › Tasks are buffered on Kafka › The Decaton processor throttles the processing rate when invoking the external web service
  47. Rate limiting
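The rate-limiting slide's code isn't in the transcript. Decaton exposes the processing rate as a processor property; a sketch of throttling a subscription to, say, 100 tasks/sec (the limit value and names are illustrative):

```java
// Sketch: cap processing at 100 tasks/sec via Decaton's processing-rate property.
SubscriptionBuilder.newBuilder("throttled-processor")
    .properties(StaticPropertySupplier.of(
        Property.ofStatic(ProcessorProperties.CONFIG_PROCESSING_RATE, 100L)))
    // ... processorsBuilder, consumerConfig as usual ...
    .buildAndStart();
```

Because tasks are buffered on Kafka, throttling here simply lets the backlog absorb the spike instead of overwhelming the downstream service.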

  48. Retry Queueing (Diagram: Decaton processor against storage under high load, failing intermittently) › Retry until it succeeds? › => Can block subsequent tasks (though Decaton’s processing model mitigates the impact) › Just give up the task? › => Not preferable
  49. Retry Queueing (Diagram: on intermittent failure, the Decaton processor produces a retry task back to Kafka and retries after a backoff)
  50. Retry Queueing 1. Prepare retry topic. Then:
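The code for this slide is missing from the transcript. Based on Decaton's documented retry support, the wiring could be sketched as follows: enable retry on the subscription with a backoff, and call context.retry() from the processor when a task fails transiently (names and the backoff value are placeholders):

```java
// Sketch based on Decaton's documented retry API; names are placeholders.
SubscriptionBuilder.newBuilder("notify-profile-processor")
    .enableRetry(RetryConfig.withBackoff(Duration.ofMillis(100)))
    // ... processorsBuilder, consumerConfig as usual ...
    .buildAndStart();

// Inside the DecatonProcessor:
// public void process(ProcessingContext<NotifyUpdateProfile> context,
//                     NotifyUpdateProfile task) throws InterruptedException {
//     try {
//         storage.put(task);
//     } catch (TransientException e) {
//         context.retry(); // re-produce the task to the retry topic with backoff
//     }
// }
```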

  51. Decaton adopters (excerpt) › Blog article: https://engineering.linecorp.com/ja/blog/decaton-case-studies/

  52. Ambition: Experimental WebAssembly support › As of now, Decaton only supports JVM-based languages › If Decaton can support a WebAssembly binary as a DecatonProcessor implementation, it unlocks support for many programming languages! › Blog: “Adding experimental WebAssembly support to Decaton – Part 1, 2” https://engineering.linecorp.com/en/blog/
  53. Maintaining Decaton

  54. Performance can be affected by: › Several factors: › Adding new features › Refactoring › Default configuration value changes › Updating dependencies › kafka-clients › …
  55. What we need is: › An integrated benchmarking environment › Measures end-to-end processing performance › Rather than micro-benchmarks for small portions of the code › Along with profiles to investigate bottlenecks › async-profiler › Linux taskstats › Measures performance transitions per code change
  56. Integrated benchmarking › 1. Benchmarker process starts an embedded Kafka cluster › 2. Forks a JVM running the Decaton processor › 3. Starts profiling › 4. Produces tasks, which the processor consumes › 5. Ends profiling › The processes communicate via RMI (“ready to process tasks”, “end processing”)
  57. Continuous benchmarking › Run the integrated benchmark for every code change › As part of continuous integration › Store benchmark results › Visualize the performance trend on a web UI
  58. Continuous benchmarking › 1. New commit to the Decaton repository › 2. CI server runs the integrated benchmark › 3. Benchmark result is stored to gh-pages › Performance dashboard
  59. Visualize performance transition › https://line.github.io/decaton/

  60. Conclusion › Decaton is a battle-tested Kafka consumer framework › Suits I/O-intensive workloads the most › We made various efforts to make it highly performant › Highly optimized commit management › Continuous integrated benchmarking › Decaton is open-sourced: https://github.com/line/decaton › Try it and give us your feedback!
  61. Thank you