akka cluster management and split brain resolution

Scala Swarm 22.06.2017 in Porto

Akka is a toolkit that brings the actor model to the JVM and helps developers build scalable, resilient, and responsive applications. With location transparency and asynchronous message passing, it is designed for distributed operation from the ground up.

While distributed systems help solve problems such as availability, they introduce a new set of problems. For example, how do we scale the cluster up or down? What happens if the network is partially unavailable? Akka provides a comprehensive set of cluster features that let developers monitor and manage the cluster manually or, in most cases, automatically.

In this talk I will introduce some of these features and explain what you need to be aware of. You will learn how to start a cluster correctly, add nodes to a running cluster, and remove them gracefully. Additionally, I will show how to handle failure scenarios such as network partitions by using an existing split brain resolver or implementing a custom one.

Niko Will

June 23, 2017

Transcript

  1. @n1ko_w1ll about me
    > Living in southern Germany (Lake Constance)
    > Developer since 2005
    > Consultant at innoQ since 2017
    > follow me on Twitter: @n1ko_w1ll
  2. agenda
    > why a cluster
    > akka-cluster
    > setting the ground
    > membership lifecycle (joining / leaving)
    > split brain resolution
    > cluster events & cluster state
    > membership lifecycle (unreachable / weakly up / down)
  3. why a cluster
    > compute power
    > state does not fit in memory
    > fault-tolerance
  4. akka-cluster
    > set of member nodes
    > membership state is a CRDT
    > communicated via Gossip
    > Convergence
    > Failure Detector
    > Leader
    > Seed Nodes
  5. membership lifecycle (happy path)
    [Diagram: join → joining → (leader action) → up → leave → leaving → (leader action) → exiting → (leader action) → removed]
    Source: http://doc.akka.io/docs/akka/current/common/cluster.html
  6. seed nodes
    > config
      > akka.cluster.seed-nodes
    > manually
      > JMX
      > HTTP API (or command line tool before akka 2.5)
    > programmatically
        Cluster(system).joinSeedNodes(seedNodes)
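
As an illustration of the config option on slide 6, a seed node configuration for Akka 2.5 with classic remoting might look like the sketch below. The system name `ClusterSystem` follows the log output shown later in the talk; the host names and ports are assumptions chosen for a local test setup.

```hocon
akka {
  actor.provider = "cluster"
  remote.netty.tcp {
    hostname = "127.0.0.1"   # assumed host for a local test
    port = 2551              # assumed port
  }
  # Every node lists the same seed nodes; the first entry is the one
  # that may join itself to form a new cluster.
  cluster.seed-nodes = [
    "akka.tcp://ClusterSystem@127.0.0.1:2551",
    "akka.tcp://ClusterSystem@127.0.0.1:2552"
  ]
}
```
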
  7. wait for cluster
    > leader changes member from "Joining" to "Up"
    > akka.cluster.min-nr-of-members
    > akka.cluster.role.<role>.min-nr-of-members
    > start processing jobs / messages when "Up"
        Cluster(system).registerOnMemberUp {
          system.actorOf(Props(classOf[FactorialFrontend], upToN, true),
            name = "factorialFrontend")
        }
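
The two settings named on slide 7 could be combined as in the following sketch. The numbers and the role names `frontend` and `backend` are assumptions for illustration; they are not taken from the talk.

```hocon
# The leader moves joining members to "Up" only once this many
# members have joined (assumed value).
akka.cluster.min-nr-of-members = 3

# The same requirement per role (assumed role names and values).
akka.cluster.role {
  frontend.min-nr-of-members = 1
  backend.min-nr-of-members = 2
}
```

With such a configuration, callbacks registered via `registerOnMemberUp` would not run before the required number of members is present.
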
  8. stopping actor system
    > leave cluster gracefully
    > manually with JMX or HTTP API (cluster management tool)
    > programmatically with
        val cluster = Cluster(system)
        cluster.leave(cluster.selfAddress)
        Runtime.getRuntime().addShutdownHook(...)
    > (with SBR, others will down unreachable members)
  9. cleanup
        Cluster(system).registerOnMemberRemoved {
          // exit the JVM once the actor system has terminated
          system.registerOnTermination(System.exit(0))
          system.terminate()
          // force the exit if termination takes longer than 10 seconds
          new Thread {
            override def run(): Unit = {
              if (Try(Await.ready(system.whenTerminated, 10.seconds)).isFailure)
                System.exit(-1)
            }
          }.start()
        }
    > included in akka 2.5 (Coordinated Shutdown)
  10. …and then things go south
        [info] [INFO] [04/05/2017 10:42:22.753] [ClusterSystem-akka.actor.default-dispatcher-14]
          [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@<host>:2551] -
          Leader can currently not perform its duties,
          reachability status: [
            akka.tcp://ClusterSystem@<host>:2551 -> akka.tcp://ClusterSystem@<host>:64768: Unreachable [Unreachable] (1)
          ],
          member status: [
            akka.tcp://ClusterSystem@<host>:2551 Up seen=true,
            akka.tcp://ClusterSystem@<host>:64768 Up seen=false
          ]
  11. split brain resolution
    > Split Brain Resolver available with Lightbend Subscription
    > Static Quorum
    > Keep Majority
    > Keep Oldest
    > Keep Referee
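
To give a rough idea of how two of the strategies on slide 11 decide which side of a partition survives, here is a simplified, Akka-free sketch based on their publicly documented behavior. It is not the Lightbend implementation; the object and method names are hypothetical, and plain strings with lexicographic ordering stand in for node addresses.

```scala
// Simplified model of two split brain strategies (illustration only).
// All names here are hypothetical and not part of any Akka API.
object SplitBrainStrategies {

  /** Static Quorum: this side survives if it still sees at least
    * `quorumSize` reachable members; otherwise it downs itself. */
  def staticQuorumKeepsThisSide(reachableCount: Int, quorumSize: Int): Boolean =
    reachableCount >= quorumSize

  /** Keep Majority: the larger side survives. On a tie, the side that
    * contains the member with the lowest address survives. */
  def keepMajorityKeepsThisSide(reachable: Set[String], unreachable: Set[String]): Boolean =
    if (reachable.size != unreachable.size) reachable.size > unreachable.size
    else reachable.nonEmpty && reachable.min < unreachable.min
}
```

Each side of the partition evaluates the same rule with its own view, so for a clean split into two parts only one of them decides to stay up.
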
  12. register for cluster events
        class SimpleClusterListener extends Actor with ActorLogging {
          val cluster = Cluster(context.system)
          override def preStart(): Unit = {
            // subscribe to cluster changes
            cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
              classOf[ClusterDomainEvent])
          }
          override def postStop(): Unit = cluster.unsubscribe(self)
          def receive = ???
        }
  13. register for cluster events
        def receive = {
          case state: CurrentClusterState =>
            log.info("Cluster state is: {}", state)
          case MemberUp(member) =>
            log.info("Member is up: {}", member.address)
          case MemberWeaklyUp(member) =>
            log.info("Member is weakly up: {}", member.address)
          case UnreachableMember(member) =>
            log.info("Member detected as unreachable: {}", member)
          case MemberRemoved(member, prevStatus) =>
            log.info("Member is removed: {} after {}", member.address, prevStatus)
          case _: ClusterDomainEvent => // ignore
        }
  14. current cluster state
        Cluster(system).state
        final case class CurrentClusterState(
          members: immutable.SortedSet[Member] = immutable.SortedSet.empty,
          unreachable: Set[Member] = Set.empty,
          seenBy: Set[Address] = Set.empty,
          leader: Option[Address] = None,
          roleLeaderMap: Map[String, Option[Address]] = Map.empty) { ... }
  15. health indicator
    > expose cluster health state
    > in addition to akka cluster HTTP management (since version 2.5)
    > helps monitor the cluster
    > healthy if no unreachable members
        Cluster(system).state.unreachable.isEmpty
  16. membership lifecycle
    [Diagram: states joining, up, leaving, exiting, removed, down and unreachable*; transitions labeled join, leave, (leader action) and (fd*)]
    Source: http://doc.akka.io/docs/akka/current/common/cluster.html
  17. unreachable members
    > detected by akka cluster failure detector
    > akka.cluster.failure-detector
    > "leader can not perform its duties"
    > no new members can join
    > does not influence running members
    > unreachable members are still part of the cluster
    > their responsibilities will not fail over (e.g. singletons or shards)
  18. membership lifecycle
    [Diagram: states joining, up, leaving, exiting, removed, down and unreachable*; transitions labeled join, leave, (leader action) and (fd*)]
    Source: http://doc.akka.io/docs/akka/current/common/cluster.html
  19. downing
    > will remove member from cluster
    > cluster will take over their responsibilities
    > auto-downing (development only)
    > mark all unreachable as DOWN
    > network partition will lead to several clusters
    > every member can mark itself and others as DOWN
        Cluster(system).down(nodeAddress)
  20. membership lifecycle
    [Diagram: states joining, weakly up, up, leaving, exiting, removed, down and unreachable*; transitions labeled join, leave, (leader action) and (fd*)]
    Source: http://doc.akka.io/docs/akka/current/common/cluster.html
  21. split brain resolution
    > write a custom DowningProvider
    > akka.cluster.downing-provider-class
    > akka.cluster.auto-down-unreachable-after
    > custom strategies using cluster state
    > reachable vs. unreachable members
    > member state WEAKLY_UP
    > is member known?
    > is known member reachable?
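
The two settings named on slide 21 might be wired up as in this sketch. The class name `com.example.cluster.MajorityDowningProvider` is hypothetical; it stands for a user-written class that extends `akka.cluster.DowningProvider`.

```hocon
akka.cluster {
  # Hypothetical fully qualified class name of a custom DowningProvider.
  downing-provider-class = "com.example.cluster.MajorityDowningProvider"

  # Keep timeout-based auto-downing disabled so that only the custom
  # strategy decides which members are downed.
  auto-down-unreachable-after = off
}
```
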
  22. custom SBR
    [Diagram: two data centers, DC 1 and DC 2, with node 1 to node 4; node 3 and node 4 appear twice]
  23. custom SBR (code)
        def scheduleMajorityCheck() = {
          check.foreach(_.cancel())
          check = Some(scheduler.scheduleOnce(7 seconds, self, CheckForMajority))
        }
        def receive = {
          case CheckForMajority if cluster.state.unreachable.isEmpty =>
            check = None
          case CheckForMajority =>
            check = None
            val unreachable = cluster.state.unreachable
            val reachable = cluster.state.members.diff(unreachable)
            if (reachable.size > unreachable.size)
              unreachable.map(_.address).foreach(cluster.down)
            else if (reachable.size < unreachable.size)
              reachable.map(_.address).foreach(cluster.down)
          // receive continues on slide 24
  24. custom SBR (code, continued)
          case MemberUp(member) =>
            knownAddresses += getHostname(member) -> member
            scheduleMajorityCheck()
          case MemberWeaklyUp(member) =>
            knownAddresses.get(getHostname(member))
              .filter(cluster.state.unreachable.contains)
              .map(_.address).foreach(cluster.down)
            scheduleMajorityCheck()
          case MemberRemoved(member, _) =>
            knownAddresses -= getHostname(member)
            scheduleMajorityCheck()
          case UnreachableMember(member) =>
            scheduleMajorityCheck()
        }
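
The decision made in the `CheckForMajority` handler on slides 23 and 24 can be isolated as a pure function, which makes it easy to test without a running cluster. The following is a minimal sketch under that assumption; `Decision`, `NoAction`, `Down` and `MajorityCheck.decide` are hypothetical names, and plain strings stand in for member addresses.

```scala
// Akka-free model of the majority check from the slides (illustration only).
sealed trait Decision
case object NoAction extends Decision
final case class Down(addresses: Set[String]) extends Decision

object MajorityCheck {
  /** `members` contains all known members, including the unreachable ones,
    * mirroring `cluster.state.members` and `cluster.state.unreachable`. */
  def decide(members: Set[String], unreachable: Set[String]): Decision = {
    val reachable = members.diff(unreachable)
    if (unreachable.isEmpty) NoAction                            // healthy cluster
    else if (reachable.size > unreachable.size) Down(unreachable) // we are the majority
    else if (reachable.size < unreachable.size) Down(reachable)   // we are the minority
    else NoAction                                                 // tie: the slide code takes no action
  }
}
```

Note that, as in the slide code, an even split leaves both sides running; a production strategy would need an explicit tie-breaker.
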
  25. Thank you. Questions? Comments?
    @n1ko_w1ll
    Niko Will, [email protected]
    www.innoq.com
    innoQ Deutschland GmbH, Krischerstr. 100, 40789 Monheim am Rhein, Germany, Phone: +49 2173 3366-0
    innoQ Schweiz GmbH, Gewerbestr. 11, CH-6330 Cham, Switzerland, Phone: +41 41 743 0116
    Ohlauer Straße 43, 10999 Berlin, Germany, Phone: +49 2173 3366-0
    Ludwigstr. 180E, 63067 Offenbach, Germany, Phone: +49 2173 3366-0
    Kreuzstraße 16, 80331 München, Germany, Phone: +49 2173 3366-0