Slide 1

Slide 1 text

akka cluster management & split brain resolution Niko Will, innoQ @n1ko_w1ll

Slide 2

Slide 2 text

"One Actor is no Actor - they come in systems" (Carl Hewitt)

Slide 3

Slide 3 text

akka-cluster
> Gossip
> Convergence
> Failure Detector
> Leader
> Seed Nodes

Slide 4

Slide 4 text

membership lifecycle (happy path)
join -> joining -(leader action)-> up -leave-> leaving -(leader action)-> exiting -(leader action)-> removed
Source: http://doc.akka.io/docs/akka/current/common/cluster.html
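The happy path can be written down as a small state machine. This is a minimal sketch in plain Scala, not Akka's own `MemberStatus` type; the names `Status`, `HappyPath`, `next` and `path` are made up for illustration.

```scala
// Simplified model of the happy-path member states from the slide.
// Illustrative only: this is NOT akka.cluster.MemberStatus.
sealed trait Status
case object Joining extends Status
case object Up extends Status
case object Leaving extends Status
case object Exiting extends Status
case object Removed extends Status

object HappyPath {
  // For each state: what triggers the next step, and where it leads.
  def next(s: Status): Option[(String, Status)] = s match {
    case Joining => Some("leader action" -> Up)
    case Up      => Some("leave" -> Leaving)
    case Leaving => Some("leader action" -> Exiting)
    case Exiting => Some("leader action" -> Removed)
    case Removed => None // terminal state
  }

  // Follow the transitions from a start state until no step is left.
  def path(from: Status): List[Status] =
    from :: next(from).map { case (_, to) => path(to) }.getOrElse(Nil)
}
```

Calling `HappyPath.path(Joining)` walks a member from joining through to removed.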

Slide 5

Slide 5 text

joining

Slide 6

Slide 6 text

seed nodes
> config
  > akka.cluster.seed-nodes
> manually
  > JMX
  > cluster management tool
> programmatically
  Cluster(system).joinSeedNodes(seedNodes)
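For the config variant, a fragment along these lines would go into application.conf. It is a sketch only: the system name ClusterSystem and the akka.tcp protocol are taken from the log output later in this talk, while the host and ports are assumptions.

```hocon
# Sketch: host and ports are assumed, adjust to your deployment.
akka.cluster.seed-nodes = [
  "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "akka.tcp://ClusterSystem@127.0.0.1:2552"
]
```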

Slide 7

Slide 7 text

wait for cluster
> leader changes member from "Joining" to "Up"
> akka.cluster.min-nr-of-members
> akka.cluster.role.<role-name>.min-nr-of-members
> start processing jobs / messages when "Up"

Cluster(system).registerOnMemberUp {
  system.actorOf(Props(classOf[FactorialFrontend], upToN, true),
    name = "factorialFrontend")
}
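The two settings could be combined as follows. The numbers and the role names frontend and backend are assumptions for illustration; with this fragment the leader would keep members in "Joining" until the thresholds are met.

```hocon
akka.cluster {
  # total number of members required before the leader moves anyone to "Up"
  min-nr-of-members = 3

  # per-role thresholds (role names are assumed)
  role {
    frontend.min-nr-of-members = 1
    backend.min-nr-of-members = 2
  }
}
```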

Slide 8

Slide 8 text

leaving

Slide 9

Slide 9 text

stopping actor system
> leave cluster gracefully
> manually with JMX or cluster management tool
> programmatically with

  val cluster = Cluster(system)
  cluster.leave(cluster.selfAddress)

  Runtime.getRuntime().addShutdownHook(...)

> (with SBR, others will down unreachable members)

Slide 10

Slide 10 text

cleanup

Cluster(system).registerOnMemberRemoved {
  // exit the JVM once the actor system has terminated
  system.registerOnTermination(System.exit(0))
  system.terminate()

  // fallback: force the exit if termination takes longer than 10 seconds
  new Thread {
    override def run(): Unit = {
      if (Try(Await.ready(system.whenTerminated, 10.seconds)).isFailure)
        System.exit(-1)
    }
  }.start()
}

> included in akka 2.5 (Coordinated Shutdown)

Slide 11

Slide 11 text


Slide 12

Slide 12 text

…and then things go south

[info] [INFO] [04/05/2017 10:42:22.753] [ClusterSystem-akka.actor.default-dispatcher-14] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://[email protected]:2551] - Leader can currently not perform its duties, reachability status: [
  akka.tcp://[email protected]:2551 -> akka.tcp://[email protected]:64768: Unreachable [Unreachable] (1)
], member status: [
  akka.tcp://[email protected]:2551 Up seen=true,
  akka.tcp://[email protected]:64768 Up seen=false
]

Slide 13

Slide 13 text


Slide 14

Slide 14 text

network partitions
> CAP theorem
  > Consistency
  > Availability
  > Partition tolerance
[diagram: nodes a, b, c, d, e]

Slide 15

Slide 15 text

split brain resolution

Slide 16

Slide 16 text

split brain resolution
> Split Brain Resolver available with Lightbend Subscription
  > Static Quorum
  > Keep Majority
  > Keep Oldest
  > Keep Referee
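The idea behind the four strategies can be sketched as pure decision functions, each evaluated by every side of a partition on its own. This is a simplified illustration in plain Scala, not the Lightbend implementation; all names are made up, and tie-breaking and other edge cases of the real resolver are not modelled.

```scala
object SbrSketch {
  sealed trait Decision
  case object DownUnreachable extends Decision // this side survives
  case object DownReachable extends Decision   // this side shuts itself down

  // Static Quorum: survive only if at least `quorumSize` members are reachable.
  def staticQuorum(reachable: Int, quorumSize: Int): Decision =
    if (reachable >= quorumSize) DownUnreachable else DownReachable

  // Keep Majority: survive if this side holds more members than the other.
  // Assumption of this sketch: a tie is treated as "shut down".
  def keepMajority(reachable: Int, unreachable: Int): Decision =
    if (reachable > unreachable) DownUnreachable else DownReachable

  // Keep Oldest / Keep Referee: survive if this side still sees one
  // designated member (the oldest member, or a configured referee node).
  def keepDesignated(designatedIsReachable: Boolean): Decision =
    if (designatedIsReachable) DownUnreachable else DownReachable
}
```

For a five-node cluster split 3/2 with a quorum size of 3, the side with three nodes downs the other two, and the side with two nodes downs itself.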

Slide 17

Slide 17 text

BYOSBR
bring your own split brain resolver

Slide 18

Slide 18 text

membership lifecycle
[diagram: joining -> up -> leaving -> exiting -> removed as on the happy path, plus unreachable* (reached via fd* from joining, up, leaving and exiting) and down]
Source: http://doc.akka.io/docs/akka/current/common/cluster.html

Slide 19

Slide 19 text

register for cluster events

class SimpleClusterListener extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit = {
    // subscribe to cluster changes
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[ClusterDomainEvent])
  }

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = ???
}

Slide 20

Slide 20 text

register for cluster events

def receive = {
  case state: CurrentClusterState =>
    log.info("Cluster state is: {}", state)
  case MemberUp(member) =>
    log.info("Member is Up: {}", member.address)
  case MemberWeaklyUp(member) =>
    log.info("Member is weakly Up: {}", member.address)
  case UnreachableMember(member) =>
    log.info("Member detected as unreachable: {}", member)
  case MemberRemoved(member, prevStatus) =>
    log.info("Member is Removed: {} after {}", member.address, prevStatus)
  case _: ClusterDomainEvent => // ignore
}

Slide 21

Slide 21 text

current cluster state

Cluster(system).state

final case class CurrentClusterState(
  members: immutable.SortedSet[Member] = immutable.SortedSet.empty,
  unreachable: Set[Member] = Set.empty,
  seenBy: Set[Address] = Set.empty,
  leader: Option[Address] = None,
  roleLeaderMap: Map[String, Option[Address]] = Map.empty) { ... }

Slide 22

Slide 22 text

health indicator
> expose cluster health state
> as addition to akka cluster HTTP management (since version 2.5)
> helps monitoring cluster
> healthy if no unreachable members

Cluster(system).state.unreachable.isEmpty

Slide 23

Slide 23 text

unreachable members
> detected by akka cluster failure detector
  > akka.cluster.failure-detector
> "leader can not perform its duties"
  > no new members can join
  > does not influence running members
> unreachable members are still part of the cluster
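The failure detector is tuned under the akka.cluster.failure-detector section. The fragment below is a sketch; the values shown are, to the best of my knowledge, the shipped defaults and should be verified against the reference.conf of the Akka version in use.

```hocon
akka.cluster.failure-detector {
  # how often heartbeats are sent to monitored nodes
  heartbeat-interval = 1 s
  # phi threshold; lower reacts faster but gives more false positives
  threshold = 8.0
  # tolerated gap in heartbeats, e.g. for GC pauses or network hiccups
  acceptable-heartbeat-pause = 3 s
}
```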

Slide 24

Slide 24 text

downing

Slide 25

Slide 25 text

downing
[diagram: membership lifecycle joining -> up -> leaving -> exiting -> removed, plus unreachable* (reached via fd* from joining, up, leaving and exiting) and down]
Source: http://doc.akka.io/docs/akka/current/common/cluster.html

Slide 26

Slide 26 text

downing
> will remove member from cluster
> cluster will take over their responsibilities
> auto-downing (development only)
  > mark all unreachable as DOWN
  > network partition will lead to several clusters
> every member can mark itself and others as DOWN

Cluster(system).down(nodeAddress)

Slide 27

Slide 27 text

split brain resolution
> write a custom DowningProvider
  > akka.cluster.downing-provider-class
  > akka.cluster.auto-down-unreachable-after
> custom strategies using cluster state
  > reachable vs. unreachable members
  > member state WEAKLY_UP
    > is member known?
    > is known member reachable?
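Wiring in a custom provider is a configuration matter. In this sketch the class name com.example.MajorityDowningProvider is hypothetical; it stands for your own subclass of akka.cluster.DowningProvider. Keeping timed auto-downing switched off is one reading of the slide, so that only the custom strategy decides who gets downed.

```hocon
akka.cluster {
  # hypothetical class name; must extend akka.cluster.DowningProvider
  downing-provider-class = "com.example.MajorityDowningProvider"

  # leave the built-in timed auto-downing disabled
  auto-down-unreachable-after = off
}
```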

Slide 28

Slide 28 text

membership lifecycle
[diagram: the lifecycle with unreachable* and down, extended by weakly up: joining -> weakly up and weakly up -> up are leader actions; weakly up can also become unreachable* via fd*]
Source: http://doc.akka.io/docs/akka/current/common/cluster.html

Slide 29

Slide 29 text

custom SBR
[diagram: node 1 to node 4 spread across two data centers, DC 1 and DC 2]

Slide 30

Slide 30 text

def scheduleMajorityCheck() = {
  check.foreach(_.cancel())
  check = Some(scheduler.scheduleOnce(7 seconds, self, CheckForMajority))
}

def receive = {
  case CheckForMajority if cluster.state.unreachable.isEmpty =>
    check = None
  case CheckForMajority =>
    check = None
    val unreachable = cluster.state.unreachable
    val reachable = cluster.state.members.diff(unreachable)
    if (reachable.size > unreachable.size)
      unreachable.map(_.address).foreach(cluster.down)
    else if (reachable.size < unreachable.size)
      reachable.map(_.address).foreach(cluster.down)

Slide 31

Slide 31 text

  case MemberUp(member) =>
    knownAddresses += getHostname(member) -> member
    scheduleMajorityCheck()
  case MemberWeaklyUp(member) =>
    knownAddresses.get(getHostname(member))
      .filter(cluster.state.unreachable.contains)
      .map(_.address).foreach(cluster.down)
    scheduleMajorityCheck()
  case MemberRemoved(member, _) =>
    knownAddresses -= getHostname(member)
    scheduleMajorityCheck()
  case UnreachableMember(member) =>
    scheduleMajorityCheck()
}
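The core of the CheckForMajority branch can be isolated as a pure function, which makes the decision easy to test without a running cluster. This is a sketch in plain Scala: addresses are modelled as strings, and the names `MajorityCheck` and `toDown` are made up for illustration.

```scala
object MajorityCheck {
  // Returns the addresses to mark as DOWN, mirroring the slide's logic.
  // `members` is expected to contain the unreachable members as well.
  def toDown(members: Set[String], unreachable: Set[String]): Set[String] = {
    val reachable = members.diff(unreachable)
    if (reachable.size > unreachable.size) unreachable    // we are the majority
    else if (reachable.size < unreachable.size) reachable // we are the minority
    else Set.empty                                        // tie: no action, as on the slide
  }
}
```

Note that on an even split neither side downs anything, so with this strategy the partition stays unresolved until reachability changes or someone intervenes.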

Slide 32

Slide 32 text


Slide 33

Slide 33 text

Thank you. Questions? Comments?

@n1ko_w1ll
Niko Will
[email protected]

innoQ Deutschland GmbH
Krischerstr. 100, 40789 Monheim am Rhein, Germany
Phone: +49 2173 3366-0

innoQ Schweiz GmbH
Gewerbestr. 11, CH-6330 Cham, Switzerland
Phone: +41 41 743 0116

www.innoq.com

Ohlauer Straße 43, 10999 Berlin, Germany
Phone: +49 2173 3366-0

Ludwigstr. 180E, 63067 Offenbach, Germany
Phone: +49 2173 3366-0

Kreuzstraße 16, 80331 München, Germany
Phone: +49 2173 3366-0