Split-brain free online Zookeeper migration

Speaker: Arun Mathew
LINE Messaging Platform Development Department IMF Part Team

※This material was presented at the following event.
https://kafka-apache-jp.connpass.com/event/222711/

LINE Developers

September 24, 2021

Transcript

  1. Split-brain free online Zookeeper migration for Kafka’s Zookeeper cluster v3.4.x

    LINE Corporation Arun Mathew
  2. Background

    Our Requirement
    • All ZK nodes need to be migrated due to Datacenter renovation
    • Running a 3 node ZK cluster, which was planned to expand to 5 nodes
    Our Problem
    • We run ZK version 3.4.6, which doesn’t support the Dynamic reconfiguration feature
    • Couldn’t find online references to “safe” online Zookeeper migration, except those which
      • Use network partition based approaches [ref]
      • Have potential for split-brains (two Zookeeper clusters with independent quorums)
    • Critical Primary Kafka Cluster with 130 nodes, which can’t afford downtime
    • Kafka brokers need a restart to update the Zookeeper config
    • Need to minimize rolling restarts (RR), as they disrupt Produce latency
    2021-09-24 Apache Kafka Meetup Japan #9
  3. Important Basics

    • A Zookeeper process, upon starting, reads its config and instantiates its View of the ensemble members
      • View is basically a list of Peers as seen by each node, and may include Peers that are observers.
      • Voting View is the subset of View with only non-observers.
      • QuorumSize is the count of Peers in Voting View.
    • The Zookeeper internals are explained in (ref)
      • This describes the voting and tie-breaking strategies
    • When a Peer starts up
      • It starts leader election in ServerState.LOOKING mode (ref)
      • Proposing itself as the leader first (ref).
    • When there is a change in a process's vote
      • a notification is sent to all Peers in its Voting View (ref)
      • This happens when a Peer responds with a notification conveying the existence of another superior Peer.
    • During leader election a process only considers votes coming from Peers in its Voting View (ref).
    • When a Peer receives notifications voting for it from more than half the QuorumSize
      • it puts itself into ServerState.LEADING.
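
The counting rule above can be sketched in a few lines of Python. This is an illustrative model, not Zookeeper's actual code: `Peer`, `voting_view`, and `has_quorum` are hypothetical names, and only the "more than half of the Voting View" rule from the slide is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    sid: int                 # server id (the N in server.N)
    host: str
    observer: bool = False   # observers never vote


def voting_view(view):
    """Voting View: the subset of the View made up of non-observers."""
    return [p for p in view if not p.observer]


def has_quorum(votes_from_sids, view):
    """True if strictly more than half of the Voting View voted for the candidate.

    Votes from sids outside the Voting View are ignored.
    """
    voters = {p.sid for p in voting_view(view)}
    counted = voters & set(votes_from_sids)
    return len(counted) > len(voters) / 2


three_node_view = [Peer(1, "P"), Peer(2, "Q"), Peer(3, "R")]
print(has_quorum({1, 2}, three_node_view))     # P and Q vote: 2 of 3 -> True
print(has_quorum({1, 4, 5}, three_node_view))  # sids 4 and 5 are ignored -> False
```

The second call shows why a node that still has the 3-node config cannot be elected with the help of newly added nodes: their votes are simply not counted.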
  4. 2 node addition to a 3 node Ensemble

    As per https://gist.github.com/miketheman/6057930
  5. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark | Doubts

    Initial Configuration
    • Configuration: P, Q, R: server.1=P server.2=Q server.3=R
    • Quorum(s): P, Q, R

    S & T up
    • Configuration:
      • P, Q, R: server.1=P server.2=Q server.3=R
      • S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R
    • Remark: S and T will start catching up

    Follower RR of R (assuming P is leader)
    • Configuration before RR:
      • P, Q, R: server.1=P server.2=Q server.3=R
      • S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Final Configuration:
      • P, Q: server.1=P server.2=Q server.3=R
      • R, S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R; after RR of R: P, Q, R? OR R, S, T
    • Remark: Various guides suggest RR of followers as a method after addition.
    • Doubts: Since P, Q can form a 2/3 quorum, and if R votes for R, S, T with a 3/5 quorum, isn't there a possibility of split brain? Or is it that R will vote for P again, as P being the current leader would have a newer zxid, or, all nodes having the same state, the lower myid 1 is preferred?

    Follower RR of Q (assuming P is leader)
    • Configuration before RR:
      • P, Q: server.1=P server.2=Q server.3=R
      • R, S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Final Configuration:
      • P: server.1=P server.2=Q server.3=R
      • Q, R, S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R; after RR of Q: Q, R, S, T OR P, Q, R?
    • Doubts: Same concern as above

    Leader RR of P
    • Configuration before RR:
      • P: server.1=P server.2=Q server.3=R
      • Q, R, S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Final Configuration:
      • P, Q, R, S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): Q, R, S, T; after RR of P: P, Q, R, S, T
    • Remark: No confusion, as the final stage config is consistent across nodes.
  6. 2 node addition to a 3 node Ensemble

    A split-brain free approach
  7. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark

    Initial Configuration
    • Configuration: P, Q, R: server.1=P server.2=Q server.3=R
    • Quorum(s): P, Q, R

    S Up
    • Configuration:
      • P, Q, R: server.1=P server.2=Q server.3=R
      • S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R
    • Remark: S will start catching up. Vote of S would be ignored by others.

    Follower RR of R (assuming P is leader)
    • Configuration before RR:
      • P, Q, R: server.1=P server.2=Q server.3=R
      • S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Final Configuration:
      • P, Q: server.1=P server.2=Q server.3=R
      • R, S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R; after RR of R, either of: P, Q, R | P, Q, R | P, Q, R, S (repeated sets presumably differ only in which node is leader)
    • Remark:
      - P and Q can form a 2/3 quorum.
      - S can't form a Quorum, as it can only get 2/5 votes (R & S).
      - R can form a 3/5 or 4/5 quorum, in which case at least 1 of P and Q has to vote for R, ruling out a 2/3 quorum formed by just P and Q.

    Follower RR of Q (assuming P is leader)
    • Configuration before RR:
      • P, Q: server.1=P server.2=Q server.3=R
      • R, S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Final Configuration:
      • P: server.1=P server.2=Q server.3=R
      • Q, R, S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R, [S]; after RR of Q, either of: P, Q, R | P, Q, R, S | P, Q, R, S | P, Q, R, S (repeated sets presumably differ only in which node is leader)
    • Remark: Now S can form a 3/5 quorum with Q, R, S, but then P won't get an additional vote to form a 2/3 quorum.

    Leader RR of P
    • Configuration before RR:
      • P: server.1=P server.2=Q server.3=R
      • Q, R, S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Final Configuration:
      • P, Q, R, S: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): [P], Q, R, S, T; after RR of P: P, Q, R, S
    • Remark: Now all 4 nodes have the same configuration and form a 4/5 quorum.

    T Up
    • Configuration: P, Q, R, S, T: server.1=P server.2=Q server.3=R server.4=S server.5=T
    • Quorum(s): P, Q, R, S; after T up: P, Q, R, S, T
    • Remark: With T up the quorum becomes 5/5
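
The difference between the two approaches can be checked by brute force. The sketch below is a deliberately simplified model and not Zookeeper's election algorithm: every live node casts one vote, and a candidate counts as elected when the votes it receives from nodes in its own view exceed half the size of that view. zxid comparison, tie breaking, and follower-side checks are ignored, so the check errs on the side of reporting a split brain. The function names are hypothetical.

```python
from itertools import product


def elected_leaders(votes, views):
    """Candidates that see a majority of their own view voting for them."""
    leaders = []
    for candidate, view in views.items():
        if votes[candidate] != candidate:
            continue  # a leader always votes for itself
        support = sum(
            1 for voter, choice in votes.items()
            if choice == candidate and voter in view  # votes from outside the view are ignored
        )
        if support > len(view) / 2:
            leaders.append(candidate)
    return leaders


def split_brain_possible(views):
    """True if some assignment of votes among live nodes elects two leaders."""
    live = sorted(views)
    for choice in product(live, repeat=len(live)):
        votes = dict(zip(live, choice))
        if len(elected_leaders(votes, views)) >= 2:
            return True
    return False


three = frozenset("PQR")     # server.1=P server.2=Q server.3=R
five = frozenset("PQRST")    # ... plus server.4=S server.5=T

# S and T both up, R restarted with the 5-node config (gist-style sequence)
both_added = {"P": three, "Q": three, "R": five, "S": five, "T": five}
# Only S up, R restarted with the 5-node config (T is not running yet)
only_s_added = {"P": three, "Q": three, "R": five, "S": five}

print(split_brain_possible(both_added))    # True: P,Q (2/3) and R,S,T (3/5)
print(split_brain_possible(only_s_added))  # False
```

With T not running, a 3/5 quorum needs three of the four live nodes, which leaves at most one vote for a 2/3 quorum; this matches the reasoning in the Remark column.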
  8. 2 node addition to a 3 node Ensemble – with migration

    In reality (P, Q, R) -> (A, B, C, D, E)
  9. Preparation

    • Kafka Cluster (v2.4) will continue to process Produce/Consume Requests even if the ZK cluster goes down
      • Operations needing Zookeeper, such as Topic Create via the AdminClient API, will time out
    • Our migration strategy maintains Zookeeper Quorum most of the time, except
      • During Zookeeper Leader restart, causing a Leader Re-election
      • During Follower RR in the (P, Q, R) -> (P, Q, C) migration, after the ‘C up’ step
    • However, as a precaution, disable services calling AdminClient APIs that depend on Zookeeper
      • Services creating/manipulating topics
      • Services triggering Preferred Replica Election
      • Basically, anything that could cause ZK metadata modification
  10. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark

    Initial Configuration
    • Configuration: P, Q, R: server.1=P server.2=Q server.3=R
    • Quorum(s): P, Q, R

    First Replacement – R Down
    • Configuration: P, Q, R: server.1=P server.2=Q server.3=R
    • Quorum(s): P, Q

    First Replacement – C Up [ Follower RR of Q, Leader RR of P ]
    • Configuration before RR:
      • P, Q: server.1=P server.2=Q server.3=R
      • C: server.1=P server.2=Q server.3=C
    • Final Configuration:
      • P, Q, C: server.1=P server.2=Q server.3=C
    • Quorum(s): P, Q; after RR of Q: P, Q OR Q, C; after RR of P: P, Q, C
    • Remark:
      - We start C first while the quorum is still P, Q, so that C syncs up.
      - When we RR one of P|Q, let's say Q, the old quorum is lost momentarily during the RR. After the RR, Q has both P and C in its config, so based on the timing of votes the quorum could be either P, Q or Q, C.
      - After the final RR all nodes have the same config, so no confusion.

    D Up
    • Configuration:
      • P, Q, C: server.1=P server.2=Q server.3=C
      • D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, Q, C
    • Remark: D will start catching up. Vote of D would be ignored by others.
  11. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark

    Follower RR of C (assuming P is leader)
    • Configuration before RR:
      • P, Q, C: server.1=P server.2=Q server.3=C
      • D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Final Configuration:
      • P, Q: server.1=P server.2=Q server.3=C
      • C, D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, Q, C; after RR of C, either of: P, Q, C | P, Q, C | P, Q, C, D (repeated sets presumably differ only in which node is leader)
    • Remark: Leadership would usually not be disturbed unless P loses quorum, which happens if there is a disruption in Q while C is being restarted.

    Follower RR of Q (assuming P is leader)
    • Configuration before RR:
      • P, Q: server.1=P server.2=Q server.3=C
      • C, D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Final Configuration:
      • P: server.1=P server.2=Q server.3=C
      • Q, C, D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, Q, C; after RR of Q, either of: P, Q, C | P, Q, C, D | P, Q, C, D | P, Q, C, D (repeated sets presumably differ only in which node is leader)
    • Remark: Similar explanation as above. Even if there is a disruption, the configs and the absence of live ZK node E ensure that only a 2/3 quorum with P as leader, or a 3/5 quorum with Q, C, or D as leader, can happen, and not both.

    Leader RR of P
    • Configuration before RR:
      • P: server.1=P server.2=Q server.3=C
      • Q, C, D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Final Configuration:
      • P, Q, C, D: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, Q, C; after RR of P: P, Q, C, D
  12. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark

    E Up
    • Configuration: P, Q, C, D, E: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, Q, C, D; after E up: P, Q, C, D, E
    • Remark: Since all other nodes already have E in their config
  13. Update zookeeper.connect and RR Kafka Cluster

    • We are midway through the Zookeeper migration
      • P, Q, out of the P, Q, R that the current broker config has, are still active
    • Update zookeeper.connect to A, B, C, D, E
      • C, D, E are already active
      • Connection attempts to A, B will fail until P, Q are replaced with them
    • Identify the controller Broker Node and push its restart to the end
      • to minimize controller movement
    • RR Kafka cluster again at the end of the Zookeeper migration
      • Only if the client connection count/load needs a rebalance
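
A minimal sketch of the two mechanical parts of this step: building the new `zookeeper.connect` value and ordering the broker rolling restart so that the controller goes last. The hostnames, the port 2181, the broker ids, and both function names are assumptions for illustration; how the current controller id is obtained is left out.

```python
def zookeeper_connect(hosts, port=2181, chroot=""):
    """Build a zookeeper.connect value: host:port pairs, comma separated, optional chroot suffix."""
    return ",".join(f"{host}:{port}" for host in hosts) + chroot


def rolling_restart_order(broker_ids, controller_id):
    """Restart order with the controller broker pushed to the end."""
    order = sorted(b for b in broker_ids if b != controller_id)
    if controller_id in broker_ids:
        order.append(controller_id)  # controller restarts last to minimize controller movement
    return order


# Hypothetical hostnames standing in for A, B, C, D, E
new_ensemble = ["zk-a.example.com", "zk-b.example.com", "zk-c.example.com",
                "zk-d.example.com", "zk-e.example.com"]
print("zookeeper.connect=" + zookeeper_connect(new_ensemble))
print(rolling_restart_order([1, 2, 3, 4, 5], controller_id=3))
```

Because the controller can move if an unrelated failure occurs during the rolling restart, it may be worth re-checking the controller id before the last few restarts.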
  14. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark

    Second Replacement – Q Down
    • Configuration: P, Q, C, D, E: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, Q, C, D, E; after Q Down: P, C, D, E

    Follower RR of P, D, E After Config Update (assuming C as Leader)
    • Configuration before RR:
      • P, C, D, E: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Final Configuration:
      • P, D, E: server.1=P server.2=B server.3=C server.4=D server.5=E
      • C: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Quorum(s): P, C, D, E; after all RRs: P, C, D, E

    Leader RR of C After Config Update (assuming D as new Leader)
    • Configuration before RR:
      • P, D, E: server.1=P server.2=B server.3=C server.4=D server.5=E
      • C: server.1=P server.2=Q server.3=C server.4=D server.5=E
    • Final Configuration:
      • P, C, D, E: server.1=P server.2=B server.3=C server.4=D server.5=E
    • Quorum(s): P, C, D, E; after RR of C: P, C, D, E

    Second Replacement – B Up
    • Configuration: P, B, C, D, E: server.1=P server.2=B server.3=C server.4=D server.5=E
    • Quorum(s): P, C, D, E; after B Up: P, B, C, D, E
  15. Table columns: Configuration before RR or start | Final Configuration | Quorum(s) | Remark

    Third Replacement – P Down
    • Configuration: P, B, C, D, E: server.1=P server.2=B server.3=C server.4=D server.5=E
    • Quorum(s): P, B, C, D, E; after P Down: B, C, D, E

    Follower RR of B, C, E After Config Update (assuming D as Leader)
    • Configuration before RR:
      • B, C, D, E: server.1=P server.2=B server.3=C server.4=D server.5=E
    • Final Configuration:
      • B, C, E: server.1=A server.2=B server.3=C server.4=D server.5=E
      • D: server.1=P server.2=B server.3=C server.4=D server.5=E
    • Quorum(s): B, C, D, E; after all RRs: B, C, D, E

    Leader RR of D After Config Update (assuming E as new Leader)
    • Configuration before RR:
      • B, C, E: server.1=A server.2=B server.3=C server.4=D server.5=E
      • D: server.1=P server.2=B server.3=C server.4=D server.5=E
    • Final Configuration:
      • B, C, D, E: server.1=A server.2=B server.3=C server.4=D server.5=E
    • Quorum(s): B, C, D, E; after RR of D: B, C, D, E

    Third Replacement – A Up
    • Configuration: A, B, C, D, E: server.1=A server.2=B server.3=C server.4=D server.5=E
    • Quorum(s): B, C, D, E; after A Up: A, B, C, D, E
  16. Replacement of Lower zkid nodes

    • Reason for the Operation Sequence applied for the Second/Third replacement
    • If we start the replacement node before first updating the config on higher zkid nodes
      • Smaller sid tries to connect to larger sid and drops the connection after losing the challenge. [ref]
      • Larger sid, after winning the challenge, initiates a connection back to the smaller sid node (after closing any existing connections) [ref]
      • Since the larger sid node has an older config with the older hostname for the smaller sid, the connection fails, as zk is stopped on the older hostname [ref]
      • This leads to connection error logs similar to those listed in https://issues.apache.org/jira/browse/ZOOKEEPER-2938
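
The connection challenge described above can be mimicked with a small simulation. It is a sketch of the behaviour as the slide describes it, not Zookeeper's connection manager: `connect` is a hypothetical function, `configs` maps each sid to that node's own server.N list, and a connection is treated as established only if the dialled hostname is running.

```python
def connect(initiator_sid, target_sid, configs, live_hosts):
    """Return True if a lasting connection between the two peers is established."""
    target_host = configs[initiator_sid][target_sid]
    if target_host not in live_hosts:
        return False  # initiator dials a hostname where zk is not running
    if initiator_sid > target_sid:
        return True   # larger sid wins the challenge and keeps its connection
    # Smaller sid loses the challenge and drops the connection; the larger
    # sid dials back using the hostname from ITS OWN config.
    dial_back_host = configs[target_sid][initiator_sid]
    return dial_back_host in live_hosts


old = {1: "P", 2: "Q", 3: "C", 4: "D", 5: "E"}
new = {1: "P", 2: "B", 3: "C", 4: "D", 5: "E"}
live = {"P", "B", "C", "D", "E"}  # Q is down, B has been started

# B started before the higher sid nodes were updated
stale = {1: old, 2: new, 3: old, 4: old, 5: old}
print(connect(2, 3, stale, live))   # False: C dials back to Q, which is stopped
print(connect(2, 1, stale, live))   # True: B has the larger sid and keeps the connection

# Higher sid nodes updated and restarted first
updated = {1: old, 2: new, 3: new, 4: new, 5: new}
print(all(connect(2, s, updated, live) for s in (1, 3, 4, 5)))  # True
```

In this model only the nodes with a higher sid than the replaced node need the new hostname before the replacement starts, which is consistent with the "at least all higher zkid nodes" wording in the conclusion.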
  17. Conclusion

    • Zookeeper v3.4.x online migration/expansion is not a piece of cake
      • needs a carefully planned ‘config change and restart’ sequence
    • Kafka cluster can tolerate a brief unavailability of Zookeeper
      • Unless the brokers themselves face issues simultaneously
      • So we can work with a minimum active quorum based flow to avoid split-brains
    • When replacing a lower zkid ZK host(name), i.e. zkid < 5 for a 5 node ensemble, follow
      • Down Old Host
      • Update config with new Hostname and restart (at least) all higher zkid nodes
      • Up New Host
      • Update config and restart any pending lower zkid hosts
    • Upgrade Zookeeper to v3.5.x or higher, which has dynamic configuration, to avoid carefully sequenced manual operations
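
The four-step replacement sequence can be written down as a small plan generator. This is only a sketch for illustration: `replacement_plan` is a hypothetical helper that prints the order of operations and does not touch any host, and it does not model the follower-before-leader ordering used in the tables.

```python
def replacement_plan(ensemble, zkid, new_host):
    """List the steps for replacing the host behind server.<zkid>.

    ensemble: dict mapping zkid -> current hostname.
    """
    old_host = ensemble[zkid]
    higher = [ensemble[s] for s in sorted(ensemble) if s > zkid]
    lower = [ensemble[s] for s in sorted(ensemble) if s < zkid]
    update = f"set server.{zkid}={new_host} on {{host}} and restart it"

    steps = [f"stop Zookeeper on {old_host}"]
    steps += [update.format(host=h) for h in higher]    # higher zkid nodes first
    steps.append(f"start Zookeeper on {new_host} with the updated config")
    steps += [update.format(host=h) for h in lower]     # pending lower zkid nodes
    return steps


for step in replacement_plan({1: "P", 2: "Q", 3: "C", 4: "D", 5: "E"}, zkid=2, new_host="B"):
    print(step)
```

Within the group of higher zkid nodes, the tables restart followers before the current leader; a real runbook would add that ordering on top of this list.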
  18. Thank You