The application of formal methods in Kafka reliability engineering

Haruki Okada / LINE Corporation

「Apache Kafka Meetup Japan #11」
https://kafka-apache-jp.connpass.com/event/249546/

LINE Developers

June 27, 2022

Transcript

  1. The application of formal methods in Kafka reliability engineering
     Haruki Okada, 2022.06.24
  2. Speaker
     • Haruki Okada (Twitter/GitHub: ocadaruma)
     • Lead of LINE IMF team
     • Kafka contributor
       • KIP-764: Configurable backlog size for creating Acceptor
     • Author of tlaplus-intellij-plugin
       • https://github.com/ocadaruma/tlaplus-intellij-plugin
  3. Kafka at LINE
     • One of the most popular pieces of middleware at LINE
     • We provide a multi-tenant shared Kafka cluster
     • Data in/out: 1.75PB / day
     • Peak message inflow: 23.5M messages / sec
  4. Reliability engineering
     • We put a lot of engineering effort into making our cluster highly reliable
     • e.g.
       • "Reliability Engineering Behind The Most Trusted Kafka Platform" (LINE DEVELOPER DAY 2019)
       • "Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP" (LINE DEVELOPER DAY 2021)
  5. Agenda
     • Summary of the phenomenon we encountered
     • Solution
     • Verifying the solution using a formal method (TLA+)
     • Conclusion
  6. Phenomenon
     • Brokers 1, 2 and 3 hosted the replicas of partition X
     • Broker 1 was the leader
  7. Phenomenon
     • Broker 1 went down due to a hardware failure
     • Usually, a new leader is elected from the other replicas and the producer continues working
  8. Phenomenon
     • This time, however, a new leader wasn't elected, so producing to the partition became impossible :(
  9. Phenomenon
     • In Kafka, such a partition is called an "offline partition"
       • The partition leader is absent
       • Also, a new leader cannot be elected
  10. What happened?
      • An in-sync replica (ISR) is a replica that is keeping up with the leader fast enough
      • Only an ISR can be elected as a leader
  11. What happened?
      • Before broker 1 died completely, it became unable to handle fetch requests
        => it marked the other replicas as "not in-sync"
  12. What happened?
      • Finally, broker 1 died completely
        => There were no eligible replicas to be elected as the new leader
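
The election rule on slides 10 to 12 can be sketched in a few lines. This is a minimal illustration rather than Kafka's controller code: `elect_leader` and its parameters are hypothetical names, and the sketch assumes the first replica in assignment order that is both alive and in the ISR is chosen.

```python
from typing import Optional, Sequence, Set


def elect_leader(replicas: Sequence[int], isr: Set[int], alive: Set[int]) -> Optional[int]:
    """Return the first replica (in assignment order) that is alive and in-sync.

    Returns None when no replica qualifies, i.e. the partition goes offline.
    Hypothetical helper for illustration only.
    """
    for replica in replicas:
        if replica in isr and replica in alive:
            return replica
    return None


# Normal failover: broker 1 dies while brokers 2 and 3 are still in-sync.
normal = elect_leader([1, 2, 3], isr={1, 2, 3}, alive={2, 3})

# The incident: broker 1 had shrunk the ISR to itself before it died.
incident = elect_leader([1, 2, 3], isr={1}, alive={2, 3})
```

In the second call no alive replica is in the ISR, so the helper returns `None`, which corresponds to the offline partition described above.
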
  13. Handling offline partition situation
      • Requires a tough decision
      • We have to choose between "fast recovery" and "data durability"
  14. Possible handlings for offline partition: Unclean leader election
      • Elect a non-in-sync (called "out-of-sync") replica as the new leader
      • Could cause data loss
  15. Possible handlings for offline partition: Start up the failed leader (our choice)
      • e.g. by replacing the failed hardware
      • Could take a long time to recover
      • Meanwhile, producing to and consuming from the partition is impossible
      => We decided to do this to avoid data loss
  16. How can we improve?
      • We succeeded in recovering, but it took a long time because we had to wait for the hardware replacement
        => We want to recover offline partitions faster while avoiding data loss
  17. Deep dive into data durability in Kafka
      [Figure: logs of Replica1 (Leader), Replica2 (Follower) and Replica3 (Follower) by offset, plus the producer's buffer]
  18. Deep dive into data durability in Kafka
      • Producer produces a message
      [Figure: the producer sends a request; offset 0 is written to Replica1 (Leader)]
  19. Deep dive into data durability in Kafka
      • Replica2 replicates offset 0
      [Figure: offset 0 is on Replica1 and Replica2; the request is still pending]
  20. Deep dive into data durability in Kafka
      • Replica3 replicates offset 0
      • Produce response successfully returns
      [Figure: offset 0 is on all three replicas; the response returns to the producer]
  21. Deep dive into data durability in Kafka
      • Kafka ensures messages are replicated to all in-sync replicas upon successful return of produce requests
        • (Assumes the producer is configured with `acks = all`)
      • Messages replicated to all in-sync replicas are called "committed messages"
        • ≠ "committed offset" (which means the checkpoint of a consumer's position)
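
The commit rule on slides 18 to 21 can be expressed as a small function. This is a simplified sketch, not Kafka's implementation: `committed_messages`, the replica names and the message name `m0` are hypothetical, and it assumes every follower's log is a prefix of the leader's log.

```python
from typing import Dict, List, Set


def committed_messages(logs: Dict[str, List[str]], leader: str, isr: Set[str]) -> List[str]:
    """Prefix of the leader's log that every in-sync replica already holds.

    Simplification: assumes each follower's log is a prefix of the leader's log.
    """
    replicated_up_to = min(len(logs[replica]) for replica in isr)
    return logs[leader][:replicated_up_to]


isr = {"replica1", "replica2", "replica3"}

# Slide 18: the leader has appended the message, followers have not fetched it yet.
logs = {"replica1": ["m0"], "replica2": [], "replica3": []}
after_produce = committed_messages(logs, "replica1", isr)

# Slide 19: Replica2 replicates offset 0, Replica3 is still behind.
logs["replica2"].append("m0")
after_replica2 = committed_messages(logs, "replica1", isr)

# Slide 20: Replica3 replicates offset 0, so the message is now committed.
logs["replica3"].append("m0")
after_replica3 = committed_messages(logs, "replica1", isr)
```

Only after the last step does the committed prefix contain `m0`, which is the point at which a producer using `acks = all` would get its response.
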
  22. Deep dive into availability in Kafka
      • Producer produces a message again
      [Figure: offset 0 is on all three replicas; offset 1 is written to Replica1 (Leader); the request is pending]
  23. Deep dive into availability in Kafka
      • Replica3 replicates offset 1
      [Figure: offset 1 is on Replica1 and Replica3, but not on Replica2]
  24. Deep dive into availability in Kafka
      • Replica2 became out-of-sync
      • Produce request returns successfully, as the message is replicated to all "in-sync" replicas
      [Figure: offset 1 is on Replica1 and Replica3; the response returns to the producer]
  25. Deep dive into availability in Kafka
      • Kafka is flexible about the minimum number of in-sync replicas required to "commit" messages
        • `min.insync.replicas` config
      • e.g. (assuming there are 3 replicas)
        • `min.insync.replicas = 2`
        • => Can tolerate the failure of 1 replica and keep working, while ensuring that at least 2 replicas have the full set of committed messages
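
The arithmetic behind `min.insync.replicas` is small enough to write down. The helpers below are hypothetical names used only for illustration, assuming a producer configured with `acks = all`.

```python
def can_commit(isr_size: int, min_insync_replicas: int) -> bool:
    """With acks=all, messages can only be committed while the ISR is large enough."""
    return isr_size >= min_insync_replicas


def tolerated_failures(replica_count: int, min_insync_replicas: int) -> int:
    """How many replicas may leave the ISR before messages stop being committed."""
    return replica_count - min_insync_replicas


# 3 replicas with min.insync.replicas = 2, as in the slide's example.
results = {isr_size: can_commit(isr_size, min_insync_replicas=2) for isr_size in (3, 2, 1)}
spare = tolerated_failures(replica_count=3, min_insync_replicas=2)
```

With these values, commits keep working at ISR sizes 3 and 2 and stop at 1, so exactly one replica can fail without blocking producers.
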
  26. What if in-sync replicas shrunk to 1
      • Producer produces a message again
      [Figure: offset 2 is written to Replica1 (Leader); the request is pending in the producer's buffer]
  27. What if in-sync replicas shrunk to 1
      • Replica3 becomes out-of-sync
  28. What if in-sync replicas shrunk to 1
      • The message cannot be committed when `min.insync.replicas = 2`
      • Meanwhile, the message keeps sitting in the producer's buffer
      [Figure: Replica1 (Leader) holds offsets 0, 1, 2; Replica2 holds offset 0; Replica3 holds offsets 0, 1]
  29. How unclean leader election can lead to data loss
      • Let's say the leader dies, as in the phenomenon we encountered
      [Figure: Replica1 (Leader) holds offsets 0, 1, 2; Replica2 holds offset 0; Replica3 holds offsets 0, 1]
  30. How unclean leader election can lead to data loss
      • By default, the first alive replica in the replica list is elected as the new leader upon unclean leader election
      [Figure: Replica2 becomes the new leader]
  31. How unclean leader election can lead to data loss
      • Replica3 will truncate offset 1 because the new leader doesn't have it
        => offset 1 is lost
      [Figure: Replica2 (New leader) and Replica3 both hold only offset 0]
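
The data loss on slides 29 to 31 can be replayed with plain lists. This is a deliberately simplified sketch: the function names and message names are hypothetical, and truncation is modelled as cutting a follower's log to its common prefix with the new leader, which is not how Kafka's leader-epoch based truncation is implemented.

```python
from typing import Dict, List, Optional, Sequence, Set


def first_alive(replica_list: Sequence[str], alive: Set[str]) -> Optional[str]:
    """Default unclean choice per the slide: first alive replica in the replica list."""
    for replica in replica_list:
        if replica in alive:
            return replica
    return None


def truncate_followers(logs: Dict[str, List[str]], leader: str, alive: Set[str]) -> None:
    """Cut each alive follower's log to its common prefix with the new leader's log."""
    for replica in alive:
        if replica == leader:
            continue
        prefix = 0
        for mine, theirs in zip(logs[replica], logs[leader]):
            if mine != theirs:
                break
            prefix += 1
        logs[replica] = logs[replica][:prefix]


# State on slide 29: m0 and m1 are committed, m2 is still pending.
logs = {"replica1": ["m0", "m1", "m2"], "replica2": ["m0"], "replica3": ["m0", "m1"]}
committed = ["m0", "m1"]
alive = {"replica2", "replica3"}  # replica1 (the leader) has died

new_leader = first_alive(["replica1", "replica2", "replica3"], alive)
truncate_followers(logs, new_leader, alive)
lost = [m for m in committed if m not in logs[new_leader]]
```

After the replay, `replica2` is the leader, `replica3` has dropped `m1`, and `lost` contains the committed message that no surviving replica holds any more.
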
  32. Idea: Avoid data loss even on unclean leader election
      • With `min.insync.replicas = 2`, at least 1 replica other than the failed leader should have the full set of committed messages
        • i.e. Replica3 in the previous example
      • At LINE, all topics are configured with `min.insync.replicas = 2`
      • => If we can choose such a replica as the new "unclean" leader, we can expect no data loss even on unclean leader election
  33. How can we identify such an "eligible" replica?
      • We have to know which replica has the full set of committed messages without inspecting the failed leader's log
        • Because we likely cannot log in to the failed leader's machine
  34. How can we identify such an "eligible" replica?
      • Can "the alive replica that has the latest offset" be the criterion for "the replica that has the full set of committed messages"?
      [Figure: Replica1 (Leader) holds offsets 0, 1, 2; Replica2 holds offset 0; Replica3 (New leader) holds offsets 0, 1]
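
The candidate criterion from slide 34 can be sketched as follows. `pick_latest_offset_replica` is a hypothetical helper, and breaking ties by replica name is an arbitrary assumption of this sketch. The rest of the talk checks whether this criterion is actually safe.

```python
from typing import Dict, List, Set


def pick_latest_offset_replica(logs: Dict[str, List[str]], alive: Set[str]) -> str:
    """Candidate criterion: the alive replica with the longest log.

    Ties are broken by replica name, which is an arbitrary choice for this sketch.
    """
    return max(sorted(alive), key=lambda replica: len(logs[replica]))


# Same state as the simple example: m0 and m1 are committed, replica1 has died.
logs = {"replica1": ["m0", "m1", "m2"], "replica2": ["m0"], "replica3": ["m0", "m1"]}
committed = ["m0", "m1"]
alive = {"replica2", "replica3"}

chosen = pick_latest_offset_replica(logs, alive)
keeps_all_committed = logs[chosen][: len(committed)] == committed
```

In this simple scenario the criterion picks `replica3`, which does hold every committed message.
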
  35. Sounds like it works... but we are not 100% sure
      • In a simple scenario like the previous example, it should work
      • However, there are an infinite number of possible scenarios
      • Checking all of them is beyond human capability
  36. Formal methods to the rescue!
      • Using formal methods, we describe the system's possible behavior in a rigorous way
        • In dedicated specification languages
      • We can run an exhaustive check against the system and find any path that leads to an undesired situation
      • Specification languages:
        • VDM++, Z, Alloy, ..., TLA+ (our choice)
  37. What is TLA+
      • TLA+ is a formal specification language developed by Leslie Lamport
        • http://lamport.azurewebsites.net/tla/tla.html
      • Many real-world applications:
        • Amazon Web Services
          • https://lamport.azurewebsites.net/tla/amazon-excerpt.html
        • Kafka
          • https://www.confluent.io/kafka-summit-sf18/hardening-kafka-replication/
        • Raft
          • https://raft.github.io/
  38. Basic ideas of TLA+
      • In TLA+, we describe the system as a state machine
      • Terminology:
        • Variables
          • Represent the system's state
        • Actions
          • State transitions in the system
  39. TLC model checker
      • The program that executes state exploration against a TLA+ specification
      • We specify an "invariant" of the system when running TLC
        • e.g. "Committed messages are never lost"
      • As soon as TLC finds a state that doesn't satisfy the invariant, it stops and dumps the state transitions that led to it
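
To make the idea of state exploration concrete, here is a toy explicit-state checker in Python. It is only a sketch of the general technique (breadth-first search over reachable states with an invariant check) and says nothing about how TLC is implemented; `check` and the counter example are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, List, Optional


def check(
    init_states: Iterable[Hashable],
    next_states: Callable[[Hashable], Iterable[Hashable]],
    invariant: Callable[[Hashable], bool],
) -> Optional[List[Hashable]]:
    """Explore all reachable states breadth-first.

    Returns the shortest trace from an initial state to a state that violates
    the invariant, or None if every reachable state satisfies it.
    """
    parent: Dict[Hashable, Optional[Hashable]] = {}
    queue = deque()
    for state in init_states:
        if state not in parent:
            parent[state] = None
            queue.append(state)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for successor in next_states(state):
            if successor not in parent:
                parent[successor] = state
                queue.append(successor)
    return None


# Toy system: a counter that starts at 0 and is incremented modulo 4.
def step(x):
    return [(x + 1) % 4]


violation = check([0], step, invariant=lambda x: x != 3)
no_violation = check([0], step, invariant=lambda x: x < 4)
```

The first call returns the trace `[0, 1, 2, 3]`, the sequence of transitions that reaches the violating state; the second returns `None` because all four reachable states satisfy the invariant.
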
  40. Formal specification of the phenomenon
      • Full specification is available here:
        • https://github.com/ocadaruma/kafka-spec/blob/main/UncleanLeaderElectionSafety.tla
  41. Variables
      • commitedMessages
        • Tracks committed messages
      • zkState
        • Tracks the information stored in ZooKeeper, like the current leader and ISRs
      • replicaStates
        • Each replica's state, like its local log
      • inflightProducers
        • Tracks pending produce requests
  42. Initial state
  43. Possible state transitions
  44. Invariant that ensures no data loss
  45. Let's run
      • Indeed, we found a sequence of state transitions that leads to data loss
  46. Found scenario
      • Producer produces a message
      [Figure: Replica1 (Leader) holds offsets 0 and 1; Replica2 and Replica3 (Followers) hold offset 0; the request is pending in the producer's buffer]
  47. Found scenario
      • Replica2 is marked as out-of-sync
  48. Found scenario
      • Replica3 is elected as the leader
        • e.g. by an automatic workload balancer (like LinkedIn's Cruise Control)
      • Replica1 is becoming a follower, but its offset 1 has not been truncated yet
  49. Found scenario
      • Another producer (Producer2) produces a message to the new leader (Replica3)
  50. Found scenario
      • Replica2 replicates offset 1 from the new leader
  51. Found scenario
      • Replica1 is elected as the leader again
        • e.g. by an automatic workload balancer (like LinkedIn's Cruise Control)
      • Replica3 truncates its offset 1, as Replica1 doesn't have that message
  52. Found scenario
      • Replica3 replicates offset 1 from Replica1, then the pending produce request returns (i.e. the message is committed)
  53. Found scenario
      • Producer produces a message again
  54. Found scenario
      • Replica3 is marked as out-of-sync
  55. Found scenario
      • Replica1 died
  56. Found scenario
      • Elect Replica2 uncleanly
        => offset 1 of Replica1 is lost even though it was already "committed"
      [Figure: Replica2 is the new leader]
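
The found scenario can be replayed by hand to see why the "latest offset" criterion is not enough. The sketch below is one reading of slides 46 to 56, not output of the TLA+ specification: the message names (`m0`, `p1`, `q1`, `p2`) are hypothetical, and it assumes offset 0 was committed before the scenario starts.

```python
from typing import Dict, List, Tuple


def replay_found_scenario() -> Tuple[Dict[str, List[str]], List[str], List[str]]:
    """Replay slides 46-55 on plain lists and return (logs, committed, alive)."""
    logs = {"replica1": ["m0"], "replica2": ["m0"], "replica3": ["m0"]}
    committed = ["m0"]  # assumption: offset 0 was committed earlier

    logs["replica1"].append("p1")   # 46: Producer writes p1 to leader replica1
    # 47: replica2 is marked out-of-sync
    # 48: replica3 becomes leader; replica1 has not truncated p1 yet
    logs["replica3"].append("q1")   # 49: Producer2 writes q1 to leader replica3
    logs["replica2"].append("q1")   # 50: replica2 replicates offset 1 (q1)
    logs["replica3"].pop()          # 51: replica1 leads again; replica3 truncates q1
    logs["replica3"].append("p1")   # 52: replica3 replicates p1 from replica1
    committed.append("p1")          # 52: the pending produce returns, p1 is committed
    logs["replica1"].append("p2")   # 53: Producer writes p2 to replica1
    # 54: replica3 is marked out-of-sync
    alive = ["replica2", "replica3"]  # 55: replica1 died
    return logs, committed, alive


logs, committed, alive = replay_found_scenario()

# Apply the "latest offset" criterion to the surviving replicas.
latest = max(len(logs[r]) for r in alive)
candidates = [r for r in alive if len(logs[r]) == latest]

# Which candidates really hold every committed message?
safe = [r for r in candidates if logs[r][: len(committed)] == committed]
```

Both survivors end with the same log length, so the criterion cannot tell them apart, yet only `replica3` holds the committed `p1`. Electing `replica2`, as on slide 56, loses it.
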
  57. Conclusion
      • We found an edge-case scenario that leads to data loss even if we choose the replica that has the latest offset as the new leader
        • Future work: figure out a more appropriate criterion to detect the "eligible" replica
      • Formal methods are very useful for debugging ideas about distributed systems like Kafka