Lithium: a split-brain resolver for Akka-Cluster

In Akka-Cluster, when some nodes become unreachable, no node can join or even leave the cluster. To bring the cluster back to a fully working state, the unreachable nodes must be downed. However, there is no way of knowing whether an unreachable node has crashed or is the victim of a network partition, so downing done incorrectly can lead to data corruption, a split-brain, and a headache fixing it.

To recover from unreachable nodes automatically and correctly, Lightbend provides a resolver through its subscription. For individuals and companies that cannot afford the subscription, some open-source solutions exist, but they do not come near it in terms of features and correctness. To close that gap, I developed an open-source split-brain resolver called Lithium as part of my EPFL master's project.

In this talk I will introduce Lithium, explain how it helps recover the cluster from unreachable nodes, describe its internals, and cover everything you need to know to set it up.


Dennis van der Bij

October 09, 2019


  1. Lithium: A split-brain resolver for Akka-Cluster • Dennis van der Bij • @MrDnx DennisVDB
  2. OMS • SwissBorg’s OMS (order management system) • Aggregates the prices of 4 crypto-exchanges • Best-execution
  3. OMS’ objectives • Best-execution • High availability
  4. OMS cluster • Persistent actors • Singleton actors • … [Diagram: Node-1 to Node-5; “You are here”; S = super-important singleton]
  5. Unreachable nodes • S cannot be reached • Need to start S on a reachable node • Singleton actors are not migrated when nodes are unreachable [Diagram: Node-1 to Node-5 split into Partition A and Partition B; S; “Dead or alive?”]
  6. Membership state • Leader chosen deterministically • Leader manages state transitions on convergence • Convergence cannot be reached with unreachable nodes • Eventually-perfect FD* • Nodes cannot become fully-fledged members or gracefully leave the cluster • *Hayashibara, Naohiro, et al. "The φ accrual failure detector." Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004 [Diagram: member states Joining, Up, Leaving, Exiting, Removed, Down; four transitions labelled “Leader”]
  7. Remove from membership state [Diagram: Node-1, Node-2, Node-3, shown twice; S]
  8. Remove from membership state [Diagram: Node-4, Node-5; S]
  9. Split-brain • Network partition • One cluster becomes two clusters [Diagram: Node-1 to Node-5; S on both sides]
  10. Split-brain resolver • Prevent split-brains from happening in the first place • Pick only one partition that will survive - Survivor will down the unreachable nodes - Non-survivors will down themselves
  11. Existing solutions • Lightbend SBR - Multiple strategies - Multi-DC - Starting at $50’000 per year • Four OSS SBRs - Two used in production, single strategy (MOIA) - Two others, multiple strategies (fail my tests)
  12. Lithium • Strategies - Static-quorum, keep-majority, keep-oldest, and keep-referee • Multi-datacenter support • Tests, tests, tests
  13. Static-quorum • Pick partition with at least N nodes • Downs the cluster: more than 2N − 1 nodes, or no partition with at least N nodes
  14. Static-quorum [Diagram: Node-1 to Node-5; N = 3]
  15. Static-quorum [Diagram: Node-1 to Node-5, followed by Node-1, Node-2, Node-3; N = 3]
  16. Keep-majority • Pick partition with a majority of nodes (or lowest address) • Downs the cluster: no partition with a majority
  17. Keep-majority [Diagram: Node-1 to Node-5]
  18. Keep-majority [Diagram: Node-1 to Node-5, followed by Node-1, Node-2, Node-3]
  19. Keep-oldest • Pick partition containing the oldest member • Oldest member hosts the singleton instance • Nearly entire cluster is downed when oldest is alone
  20. Keep-oldest [Diagram: Node-1 to Node-5; “Oldest”]
  21. Keep-oldest [Diagram: Node-1 to Node-5, followed by Node-4, Node-5; “Oldest”]
  22. Keep-referee • Pick the partition containing the “referee” node • Downs most of the cluster when the referee is alone
  23. Keep-referee [Diagram: Node-1 to Node-5; “Referee”]
  24. Keep-referee [Diagram: Node-1 to Node-5, followed by Node-4, Node-5; “Referee”]
  25. Choosing a strategy • Use “role” to take only certain members into account
  26. How it works • Provide instance of DowningProvider • Each cluster member runs an instance of Lithium
  27. Demo
  28. Tests, tests, tests • ~70% of LOC are tests • Unit tests + property-based tests • “multi-jvm” tests
  29. Scenarios • Use property-based tests to detect edge-cases • Splits during membership changes
  30. Multi-JVM tests • Simulate a cluster locally • Split links between members programmatically • Observe how it gets resolved
  31. Demo
  32. Comparison [Table: columns Static-quorum, Keep-majority, Keep-oldest, Keep-referee, Multi-DC; rows Lithium, ARD; only the cell values “N/A”, “N/A” survive]
  33. @MrDnx DennisVDB
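
The decision rules of the four strategies on slides 13 to 24 can be sketched in a few lines. The Python below is an illustrative model only and not Lithium's implementation (Lithium itself runs inside Akka-Cluster on the JVM). The names `View`, `Decision`, `static_quorum`, `keep_majority`, and `keep_node` are hypothetical, plain node names stand in for member addresses, and picking Node-4 as the oldest member is an assumption made for the example. Each function answers the question one side of a partition has to ask: do we survive and down the others, or do we down ourselves?

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    DOWN_UNREACHABLE = "down-unreachable"  # this side survives
    DOWN_REACHABLE = "down-reachable"      # this side downs itself
    DOWN_ALL = "down-all"                  # the whole cluster is downed


@dataclass(frozen=True)
class View:
    """Membership as seen from one side of a partition (hypothetical model)."""
    reachable: frozenset    # node names this side can reach, itself included
    unreachable: frozenset  # node names this side cannot reach

    @property
    def total(self):
        return len(self.reachable) + len(self.unreachable)


def static_quorum(view, quorum_size):
    # With more than 2N - 1 members two sides could both hold a quorum,
    # so the cluster is downed (slide 13).
    if view.total > 2 * quorum_size - 1:
        return Decision.DOWN_ALL
    if len(view.reachable) >= quorum_size:
        return Decision.DOWN_UNREACHABLE
    # If no side has a quorum, every side ends up here and the cluster goes down.
    return Decision.DOWN_REACHABLE


def keep_node(view, node):
    # Shared by keep-oldest and keep-referee: the side holding `node` survives.
    if node in view.reachable:
        return Decision.DOWN_UNREACHABLE
    return Decision.DOWN_REACHABLE


def keep_majority(view):
    if 2 * len(view.reachable) > view.total:
        return Decision.DOWN_UNREACHABLE
    if 2 * len(view.reachable) == view.total:
        # Tie: the side containing the lowest address survives.
        return keep_node(view, min(view.reachable | view.unreachable))
    return Decision.DOWN_REACHABLE


# The 3/2 split from the slides, seen from each side.
side_a = View(frozenset({"Node-1", "Node-2", "Node-3"}), frozenset({"Node-4", "Node-5"}))
side_b = View(side_a.unreachable, side_a.reachable)

for name, side in (("A", side_a), ("B", side_b)):
    print(name, "static-quorum (N = 3):", static_quorum(side, 3).value)
    print(name, "keep-majority:", keep_majority(side).value)
    print(name, "keep-oldest (Node-4 assumed oldest):", keep_node(side, "Node-4").value)
```

Because every side evaluates the same deterministic rule on the same membership, at most one side decides to survive, which is what prevents the split-brain. Keep-referee is the same call as keep-oldest with the referee's name passed in place of the oldest member's.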