The application of formal methods in Kafka reliability engineering

Haruki Okada
2022.06.24

Speaker
•Haruki Okada (Twitter/GitHub: ocadaruma)
•Lead of LINE IMF team
•Kafka contributor
•KIP-764: Configurable backlog size for creating Acceptor
•Author of tlaplus-intellij-plugin: https://github.com/ocadaruma/tlaplus-intellij-plugin

Kafka at LINE
•One of the most widely used middleware systems at LINE
•We provide a multi-tenant shared Kafka cluster
•Data in/out: 1.75PB / day
•Peak message inflow: 23.5M messages / sec

Reliability engineering
•We put a lot of engineering effort into making our cluster highly reliable
•e.g. “Reliability Engineering Behind The Most Trusted Kafka Platform” (LINE DEVELOPER DAY 2019)
•“Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP” (LINE DEVELOPER DAY 2021)

Agenda
•Summary of the phenomenon we encountered
•Solution
•Verifying the solution using a formal method (TLA+)
•Conclusion

Phenomenon
•Brokers 1, 2, and 3 hosted the replicas of partition X
•Broker 1 was the leader

Phenomenon
•Broker 1 went down due to a hardware failure
•Usually, a new leader is elected from the other replicas and producers continue working

Phenomenon
•This time, however, no new leader was elected, so the partition could not accept produce requests at all :(

Phenomenon
•In Kafka, such a partition is called an “offline partition”
•The partition leader is absent
•Also, a new leader cannot be elected

What happened?
•An in-sync replica (ISR) is a replica that is keeping up with the leader fast enough
•Only an in-sync replica can be elected as the leader

What happened?
•Before broker 1 died completely, it became unable to handle fetch requests
•=> It marked the other replicas as “not in-sync”

What happened?
•Finally, broker 1 died completely
•=> There were no eligible replicas to be elected as the new leader
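The three "What happened?" slides can be condensed into a few lines of Python. This is only a simplified sketch of clean leader election, not Kafka's controller code: the function name `elect_leader` and the plain sets standing in for the ISR and for the alive brokers are assumptions made for illustration.

```python
from typing import Optional


def elect_leader(replicas: list[int], isr: set[int], alive: set[int]) -> Optional[int]:
    """Clean election: pick the first replica that is both alive and in-sync."""
    for broker in replicas:
        if broker in alive and broker in isr:
            return broker
    return None  # no eligible replica: the partition becomes offline


replicas = [1, 2, 3]

# Ordinary failure: broker 1 dies while brokers 2 and 3 are still in-sync.
print(elect_leader(replicas, isr={1, 2, 3}, alive={2, 3}))  # 2

# The incident: broker 1 shrank the ISR to itself before dying.
print(elect_leader(replicas, isr={1}, alive={2, 3}))  # None
```

The second call shows why the partition went offline: the alive brokers were no longer in the ISR, so nothing was eligible.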

Handling offline partition situation
•Requires a tough decision
•We have to choose either “fast recovery” or “data durability”

Possible handlings for offline partition: Unclean leader election
•Elect a replica that is not in-sync (called “out-of-sync”) as the new leader
•Could cause data loss

Possible handlings for offline partition: Start up the failed leader (our choice)
•e.g. by replacing the failed hardware
•Could take a long time to recover
•Meanwhile, producing to and consuming from the partition is impossible
•=> We decided to do this to avoid data loss

How can we improve?
•We succeeded in recovering, but it took a long time because we had to wait for the hardware replacement
•=> We want to recover offline partitions faster while still avoiding data loss

Deep dive into data durability in Kafka
[Diagram: logs of Replica1 (Leader), Replica2 (Follower), and Replica3 (Follower) laid out by offset, plus a producer with its buffer]

Deep dive into data durability in Kafka
•Producer produces a message
[Diagram: the producer sends a request from its buffer; offset 0 appears in the log]

Deep dive into data durability in Kafka
•Replica2 replicates offset 0
[Diagram: replica logs by offset; the produce request is still pending]

Deep dive into data durability in Kafka
•The produce response returns successfully
[Diagram: all three replicas hold offset 0; the response is returned to the producer]

Deep dive into data durability in Kafka
•Kafka ensures that a message has been replicated to all in-sync replicas by the time its produce request returns successfully
•(Assumes the producer is configured with `acks = all`)
•Messages replicated to all in-sync replicas are called “committed messages”
•≠ “committed offset” (which means the checkpoint of a consumer’s position)

Deep dive into availability in Kafka
•Producer produces a message again
[Diagram: all three replicas hold offset 0; the leader appends offset 1; the produce request is pending]

Deep dive into availability in Kafka
[Diagram: offset 1 is present on two of the three replicas; the produce request is still pending]

Deep dive into availability in Kafka
•Replica2 became out-of-sync
•The produce request returns successfully, as the message is replicated to all “in-sync” replicas
[Diagram: Replica1 (Leader) and Replica3 (Follower) hold offsets 0 and 1; Replica2 (Follower) holds only offset 0]

Deep dive into availability in Kafka
•Kafka lets us configure the minimum number of in-sync replicas required to “commit” a message
•`min.insync.replicas` config
•e.g. (assuming there are 3 replicas) `min.insync.replicas = 2`
•=> The partition can tolerate 1 replica failure and keep working, while ensuring that at least 2 replicas have the full set of committed messages
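The commit rule described on these slides can be written down as a small predicate. This is a rough model under stated assumptions, not broker code: `can_commit` is a hypothetical helper, and `log_end` maps each replica to the number of messages in its log (so a replica holding offsets 0 and 1 has `log_end` 2).

```python
def can_commit(offset: int, log_end: dict[str, int], isr: set[str], min_insync_replicas: int) -> bool:
    """A message at `offset` is committed only if the ISR is large enough
    and every in-sync replica already holds that offset."""
    if len(isr) < min_insync_replicas:
        return False
    return all(log_end[replica] > offset for replica in isr)


# Replica2 fell out of sync; Replica1 and Replica3 both hold offsets 0 and 1.
log_end = {"replica1": 2, "replica2": 1, "replica3": 2}
print(can_commit(1, log_end, isr={"replica1", "replica3"}, min_insync_replicas=2))  # True

# The ISR has shrunk to the leader alone: offset 2 cannot be committed.
log_end = {"replica1": 3, "replica2": 1, "replica3": 2}
print(can_commit(2, log_end, isr={"replica1"}, min_insync_replicas=2))  # False
```

In the second case the message is not lost, it simply stays uncommitted and the producer keeps it in its buffer.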

What if the in-sync replicas shrink to 1?
•Producer produces a message again
[Diagram: the leader appends offset 2; the produce request is pending]

What if the in-sync replicas shrink to 1?
•Replica3 becomes out-of-sync
[Diagram: replica logs by offset; the produce request for offset 2 is still pending]

What if the in-sync replicas shrink to 1?
•The message cannot be committed when `min.insync.replicas = 2`
•Meanwhile, the message stays in the producer’s buffer
[Diagram: Replica1 (Leader) holds offsets 0, 1, 2; Replica2 (Follower) holds offset 0; Replica3 (Follower) holds offsets 0, 1]

How can unclean leader election lead to data loss?
•Let’s say the leader dies, as in the phenomenon we encountered
[Diagram: Replica1 (Leader) holds offsets 0, 1, 2; Replica2 (Follower) holds offset 0; Replica3 (Follower) holds offsets 0, 1]

How can unclean leader election lead to data loss?
•By default, the first alive replica in the replica list is elected as the new leader upon unclean leader election
[Diagram: Replica2, which holds only offset 0, becomes the new leader; Replica3 (Follower) holds offsets 0, 1]

How can unclean leader election lead to data loss?
•Replica3 will truncate offset 1 because the new leader doesn’t have it
•=> offset 1 is lost
[Diagram: Replica2 (New leader) and Replica3 (Follower) both hold only offset 0]
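The data-loss path on these three slides can be replayed with plain lists. This is a toy model, assuming that each log is a list of message labels and that a follower truncates to the prefix it shares with the new leader; `unclean_elect` and `truncate_to_leader` are hypothetical helpers, not Kafka APIs.

```python
from typing import Optional


def unclean_elect(replicas: list[int], alive: set[int]) -> Optional[int]:
    """Unclean election as described above: first alive replica in the replica list."""
    for replica in replicas:
        if replica in alive:
            return replica
    return None


def truncate_to_leader(follower_log: list[str], leader_log: list[str]) -> list[str]:
    """Keep only the prefix that the follower shares with the new leader."""
    shared = 0
    for mine, theirs in zip(follower_log, leader_log):
        if mine != theirs:
            break
        shared += 1
    return follower_log[:shared]


logs = {1: ["m0", "m1", "m2"], 2: ["m0"], 3: ["m0", "m1"]}
committed = ["m0", "m1"]  # m2 was never committed (ISR had shrunk to the leader)

leader = unclean_elect([1, 2, 3], alive={2, 3})   # replica 1 is dead
logs[3] = truncate_to_leader(logs[3], logs[leader])
lost = [m for m in committed if m not in logs[leader]]

print(leader, logs[3], lost)  # 2 ['m0'] ['m1']
```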

Idea: Avoid data loss even on unclean leader election
•With `min.insync.replicas = 2`, at least 1 replica besides the leader should have the full set of committed messages
•i.e. Replica3 in the previous example
•At LINE, all topics are configured with `min.insync.replicas = 2`
•=> If we can choose such a replica as the new “unclean” leader, we can expect no data loss even on unclean leader election

How can we identify such an “eligible” replica?
•We have to know which replica has the full set of committed messages without inspecting the failed leader’s log
•Because we likely cannot log in to the failed leader’s machine

How can we identify such an “eligible” replica?
•Can “the alive replica that has the latest offset” be the criterion for “the replica that has the full set of committed messages”?
[Diagram: Replica1 (Leader) holds offsets 0, 1, 2; Replica2 (Follower) holds offset 0; Replica3, holding offsets 0 and 1, is chosen as the new leader]

Sounds like it works... but we are not 100% sure
•In a simple scenario like the previous example, it should work
•However, there is an infinite number of possible scenarios
•Checking all of them is beyond human capability

Formal methods to the rescue!
•Using formal methods, we describe the system’s possible behavior rigorously, in dedicated specification languages
•We can run an exhaustive check against the system and find any path that leads to an undesired situation
•Specification languages: VDM++, Z, Alloy, ..., TLA+ (our choice)

What is TLA+
•TLA+ is a formal specification language developed by Leslie Lamport
•http://lamport.azurewebsites.net/tla/tla.html
•Many real-world applications:
•Amazon Web Services: https://lamport.azurewebsites.net/tla/amazon-excerpt.html
•Kafka: https://www.confluent.io/kafka-summit-sf18/hardening-kafka-replication/
•Raft: https://raft.github.io/

Basic ideas of TLA+
•In TLA+, we describe the system as a state machine
•Terminology:
•Variables: represent the system’s state
•Actions: state transitions in the system

TLC model checker
•The program that performs state exploration against a TLA+ specification
•We specify an “invariant” of the system when running TLC
•e.g. “Committed messages are never lost”
•As soon as TLC finds a state that doesn’t satisfy the invariant, it stops and dumps the state transitions that led to it
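The idea of exploring states until an invariant breaks can be illustrated in a few lines of Python. This is only a toy explicit-state checker in the spirit of what is described above, not how TLC is implemented; `check`, `next_states`, and the counter example are all made up for illustration.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Optional


def check(init: Hashable,
          next_states: Callable[[Hashable], Iterable[Hashable]],
          invariant: Callable[[Hashable], bool]) -> Optional[list]:
    """Breadth-first exploration. Returns the shortest trace from `init`
    to a state violating `invariant`, or None if no such state is reachable."""
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:      # walk back to the initial state
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:         # visit each state only once
                parent[nxt] = state
                queue.append(nxt)
    return None


# Toy system: a counter with two actions (increment, double), bounded at 10.
def next_states(x: int):
    for y in (x + 1, x * 2):
        if y <= 10:
            yield y


print(check(1, next_states, invariant=lambda x: x != 6))  # [1, 2, 3, 6]
```

The returned list plays the role of the dumped state transitions: a concrete sequence of steps from the initial state to the violation.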

Formal specification of the phenomenon
•Full specification is available here: https://github.com/ocadaruma/kafka-spec/blob/main/UncleanLeaderElectionSafety.tla

Variables
•commitedMessages: tracks committed messages
•zkState: tracks the information stored in ZooKeeper, such as the current leader and ISRs
•replicaStates: each replica’s state, such as its local log
•inflightProducers: tracks pending produce requests

Initial state

Possible state transitions

Invariant that ensures no data loss

Let's run
•Indeed, we found a sequence of state transitions that leads to data loss

Found scenario
•Producer produces a message
[Diagram: Replica1 (Leader) holds offsets 0, 1; Replica2 (Follower) and Replica3 (Follower) hold offset 0; the produce request is pending]

Found scenario
•Replica2 is marked as out-of-sync

Found scenario
•Replica3 is elected as the leader
•e.g. by an automatic workload balancer (like LinkedIn’s Cruise Control)
•Replica1 is becoming a follower, but its offset 1 has not been truncated yet

Found scenario
•Another producer (Producer2) produces a message to the new leader (replica3)

Found scenario
•Replica2 replicates offset 1 from the new leader

Found scenario
•Replica1 is elected as the leader again
•e.g. by an automatic workload balancer (like LinkedIn’s Cruise Control)
•Replica3 truncates its offset 1, as replica1 doesn’t have it

Found scenario
•Replica3 replicates offset 1 from replica1, then the pending produce request returns (i.e. the message is committed)

Found scenario
•Producer produces a message again

Found scenario
•Replica3 is marked as out-of-sync

Found scenario
•Replica1 died

Found scenario
•Elect replica2 uncleanly
•=> offset 1 of replica1 is lost even though it was already “committed”
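The whole found scenario can be replayed step by step with plain Python lists. This is an interpretation of the slide sequence, not the TLA+ specification itself: the message labels `m0`, `A`, `B`, `C` are invented to tell apart messages that share an offset, and breaking the "latest offset" tie by replica order is an assumption (the slides only state that replica2 was elected).

```python
# Each log is a list of message labels; the list index is the offset.
logs = {1: ["m0"], 2: ["m0"], 3: ["m0"]}
committed = ["m0"]

logs[1].append("A")   # producer 1 -> leader replica1 (request pending)
# replica2 is marked out-of-sync; replica3 is elected leader;
# replica1 is becoming a follower but has not truncated "A" yet
logs[3].append("B")   # producer 2 -> new leader replica3
logs[2].append("B")   # replica2 replicates offset 1 from replica3
# replica1 is elected leader again; replica3 truncates its offset 1
# and then replicates offset 1 from replica1
logs[3] = logs[3][:1] + logs[1][1:]
committed.append("A") # ISR {1, 3} both hold "A": the pending produce returns
logs[1].append("C")   # producer produces again; replica3 goes out-of-sync
alive = [2, 3]        # replica1 dies

# Candidate criterion: alive replica with the latest offset.
latest = max(len(logs[r]) for r in alive)
candidates = [r for r in alive if len(logs[r]) == latest]
new_leader = candidates[0]
lost = [m for m in committed if m not in logs[new_leader]]

print(candidates, new_leader, lost)  # [2, 3] 2 ['A']
```

Both alive replicas end at the same offset but hold different messages there, so the "latest offset" criterion cannot tell them apart, and picking replica2 drops the committed message.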

Conclusion
•We found an edge-case scenario that leads to data loss even if we choose a replica that has the latest offset as the new leader
•Future work: figure out more appropriate criteria to detect an “eligible” replica
•Formal methods are very useful for debugging ideas about distributed systems like Kafka
