Without Resilience, Nothing Else Matters

It doesn’t matter how beautiful, loosely coupled, scalable, highly concurrent, non-blocking, responsive and performant your application is—if it isn't running, then it's 100% useless. Without resilience, nothing else matters.

Most developers understand what the word resilience means, at least superficially, but way too many lack a deeper understanding of what it really means in the context of the system that they are working on now. I find it really sad to see, since understanding and managing failure is more important today than ever. Outages are incredibly costly—for many definitions of cost—and can sometimes take down whole businesses.

In this talk we will explore the essence of resilience. What does it really mean? What are its mechanics and characterizing traits? How do other sciences and industries manage it, and what can we learn from that? We will see that everything hints at the same conclusion: failure is inevitable and needs to be embraced, and resilience is by design.

Jonas Bonér

June 12, 2017

Transcript

  1. Without Resilience
    Nothing Else Matters
    Jonas Bonér
    CTO Lightbend
    @jboner

  3. This Is Fault Tolerance
    “But it ain’t how hard you’re hit;
    it’s about how hard you can get
    hit, and keep moving forward.
    How much you can take, and
    keep moving forward. That’s
    how winning is done.”
    - Rocky Balboa

  4. Resilience
    Is Beyond
    Fault
    Tolerance

  5. Resilience
    “The ability of a substance or
    object to spring back into shape.
    The capacity to recover quickly
    from difficulties.”
    - Merriam-Webster

  6. Antifragility
    “Antifragility is beyond resilience and
    robustness. The resilient resists shock and
    stays the same; the antifragile gets better.”
    - Nassim Nicholas Taleb
    Antifragile: Things That Gain from Disorder - Nassim Nicholas Taleb

  7. “We can model and understand in isolation.
    But, when released into competitive nominally
    regulated societies, their connections proliferate,
    their interactions and interdependencies multiply,
    their complexities mushroom.
    And we are caught short.”
    - Sidney Dekker
    Drift into Failure - Sidney Dekker

  8. Software Systems Today Are
    Incredibly Complex
    Netflix Twitter

  9. We need to study
    Resilience in
    Complex
    Systems

  10. Complicated System

  11. Complex System

  12. Complicated ≠ Complex

  13. “Complex systems run in degraded mode.”
    “Complex systems run as broken systems.”
    - Richard Cook
    How Complex Systems Fail - Richard Cook

  14. “Counterintuitive. That’s [Jay] Forrester’s
    word to describe complex systems.
    Leverage points are not intuitive. Or if they
    are, we intuitively use them backward,
    systematically worsening whatever
    problems we are trying to solve.”
    - Donella Meadows
    Leverage Points: Places to Intervene in a System - Donella Meadows

  15. “Humans should not be
    involved in setting timeouts.”
    “Human involvement in
    complex systems is the biggest
    source of trouble.”
    - Ben Christensen, Netflix

  16. Humans Generally
    Make Things Worse

  17. Operating at the Edge of Failure
    Diagram labels: Economic Failure Boundary, Unacceptable Workload Boundary, Accident Boundary, Operating Point, FAILURE
    ‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
    Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013

  18. Operating at the Edge of Failure
    Diagram labels: Economic Failure Boundary, Unacceptable Workload Boundary, Accident Boundary, Management Pressure Towards Economic Efficiency, Gradient Towards Least Effort, Counter Gradient For More Resilience
    ‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
    Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013

  19. Operating at the Edge of Failure
    Diagram labels: Economic Failure Boundary, Unacceptable Workload Boundary, Accident Boundary, Error Margin, Marginal Boundary
    ‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
    Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013

  20. Operating at the Edge of Failure
    Diagram labels: Accident Boundary, Marginal Boundary, ?
    ‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
    Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013

  21. Operating at the Edge of Failure
    Diagram labels: Accident Boundary, Marginal Boundary
    ‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
    Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013

  22. Embrace
    Failure

  23. Resilience
    is by
    Design
    Photo courtesy of FEMA/Joselyne Augustino

  24. “Autonomy makes information local,
    leading to greater certainty and stability.”
    - Mark Burgess
    In Search of Certainty - Mark Burgess

  25. Promise Theory
    Think in
    Promises
    Not
    Commands

  26. Promise Theory
    Promises converge towards
    a definite outcome from
    unpredictable beginnings
    ⇒ improved Stability
    Commands diverge into
    unpredictable outcomes from
    definite beginnings
    ⇒ decreased Stability

  27. Resilience in
    Biological
    Systems

  28. Meerkats
    Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk

  29. “In three words, in the animal kingdom,
    simplicity leads to complexity
    which leads to resilience.”
    - Nicolas Perony
    Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk

  30. Resilience in
    Social
    Systems

  31. Dealing in Security
    Understanding vital services, and how they keep you safe
    1 INDIVIDUAL (you)
    6 ways to die
    3 sets of essential services
    7 layers of PROTECTION
    Dealing in Security - Mike Bennet, Vinay Gupta

  32. What we can learn from
    Resilience in
    Biological and
    Social Systems
    1. Feature Diversity and redundancy
    2. Inter-Connected network structure
    3. Wide distribution across all scales
    4. Capacity to self-adapt & self-organize
    Toward Resilient Architectures 1: Biology Lessons - Michael Mehaffy, Nikos A. Salingaros
    Applying resilience thinking: Seven principles for building resilience in social-ecological systems - Reinette Biggs et al.

  33. Resilience in
    Computer
    Systems

  35. We Need To
    Manage
    Failure
    Not Try To Avoid It

  36. Let It
    Crash

  37. Crash
    Only
    Software
    Crash-Only Software - George Candea, Armando Fox
    Stop = Crash Safely
    Start = Recover Fast

  38. Recursive Restartability
    Turning the Crash-Only Sledgehammer into a Scalpel
    Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox

  39. Traditional
    State Management Object
    Critical state
    that needs protection
    Client
    Thread boundary
    Synchronous dispatch Thread boundary
    ?
    Utterly broken

  40. “Accidents come from relationships
    not broken parts.”
    - Sidney Dekker
    Drift into Failure - Sidney Dekker

  41. Requirements for a
    Sane Failure Model
    Failures need to be
    1. Contained—Avoid cascading failures
    2. Reified—as messages
    3. Signalled—Asynchronously
    4. Observed—by 1-N
    5. Managed—Outside failed Context
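A minimal sketch of three of these requirements (contained, reified, managed outside the failed context), written in plain Scala rather than Akka. All names here (`FailureSignal`, `decide`, `supervise`) are hypothetical, and the sketch is synchronous, so it does not cover asynchronous signalling or multiple observers.

```scala
import scala.util.{Failure, Success, Try}

object FailureModelSketch {
  // The failure reified as a plain value that can be passed around like a message
  final case class FailureSignal(component: String, cause: Throwable)

  sealed trait Directive
  case object Restart extends Directive
  case object Escalate extends Directive

  // The decision is taken by the supervisor, outside the failed component
  def decide(signal: FailureSignal): Directive = signal.cause match {
    case _: IllegalStateException => Restart
    case _                        => Escalate
  }

  // Contain any non-fatal exception and turn it into a FailureSignal
  def contained[A](component: String)(work: => A): Either[FailureSignal, A] =
    Try(work) match {
      case Success(a) => Right(a)
      case Failure(e) => Left(FailureSignal(component, e))
    }

  // Re-run the work while the directive is Restart and retries remain
  def supervise[A](component: String, maxRetries: Int)(work: () => A): Either[FailureSignal, A] = {
    @annotation.tailrec
    def loop(attempt: Int): Either[FailureSignal, A] =
      contained(component)(work()) match {
        case Left(signal) if decide(signal) == Restart && attempt < maxRetries =>
          loop(attempt + 1)
        case other => other
      }
    loop(0)
  }
}
```

The caller only ever sees a `Right` with the result or a `Left` carrying the escalated failure; it never has to catch an exception itself.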

  42. Bulkhead
    Pattern

  43. Enter Supervision

  44. We need a way out of the
    State Tar Pit
    Critical
    • Input Data
    • Derived Data
    Out of the Tar Pit - Ben Moseley, Peter Marks

  45. We need a way out of the
    State Tar Pit
    Essential State
    Essential Logic
    Accidental State and Control
    Out of the Tar Pit - Ben Moseley, Peter Marks

  46. The
    Vending
    Machine
    Pattern

  47. Think Vending Machine
    Coffee
    Machine
    Programmer
    Inserts coins
    Gets coffee
    Add more coins

  48. Think Vending Machine
    Programmer
    Service
    Guy
    Inserts coins
    Gets coffee
    Out of
    coffee beans
    failure
    Adds
    more
    beans
    Out of coffee beans error
    WRONG Coffee
    Machine

  49. Think Vending Machine
    Service
    Client
    Supervisor
    Request
    Response
    Validation Error
    Application
    Failure
    Manages
    Failure

  50. Error
    Kernel
    Pattern
    Onion-layered state & Failure management
    Making reliable distributed systems in the presence of software errors - Joe Armstrong
    On Erlang, State and Crashes - Jesper Louis Andersen

  51. Onion Layered
    State Management
    Error Kernel
    Object
    Critical state
    that needs protection
    Client
    Supervision
    Supervision
    Thread boundary
    Supervision

  52. Demo
    Time
    Let’s model a resilient vending machine, in Akka

  53. Demo Runner
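    // Imports assumed for Akka classic actors; they are not shown on the slide
    import akka.actor.{ActorSystem, Inbox, Props}
    import scala.concurrent.duration._

    object VendingMachineDemo extends App {
      val system = ActorSystem("vendingMachineDemo")
      val coffeeMachine = system.actorOf(Props[CoffeeMachineManager], "coffeeMachineManager")
      val customer = Inbox.create(system) // emulates the customer

      … // test runs

      system.shutdown() // on Akka 2.4 and later, use system.terminate() instead
    }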
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  54. Test Happy Path
    // Insert 2 coins and get an Espresso
    customer.send(coffeeMachine, Coins(2))
    customer.send(coffeeMachine, Selection(Espresso))
    val Beverage(coffee1) = customer.receive(5.seconds)
    println(s"Got myself an $coffee1")
    assert(coffee1 == Espresso)
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  55. Test User Error
    // Insert 1 coin (one short of the price) and expect a validation error
    customer.send(coffeeMachine, Coins(1))
    customer.send(coffeeMachine, Selection(Latte))
    val NotEnoughCoinsError(message) = customer.receive(5.seconds)
    println(s"Got myself a validation error: $message")
    assert(message == "Please insert [1] coins")
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  56. Test System Failure
    // Insert 1 coin (had 1 before) and try to get my Latte
    // Machine should:
    // 1. Fail
    // 2. Restart
    // 3. Resubmit my order
    // 4. Give me my coffee
    customer.send(coffeeMachine, Coins(1))
    customer.send(coffeeMachine, TriggerOutOfCoffeeBeansFailure)
    customer.send(coffeeMachine, Selection(Latte))
    val Beverage(coffee2) = customer.receive(5.seconds)
    println(s"Got myself a $coffee2")
    assert(coffee2 == Latte)
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  57. Protocol
    // Coffee types
    trait CoffeeType
    case object BlackCoffee extends CoffeeType
    case object Latte extends CoffeeType
    case object Espresso extends CoffeeType

    // Commands
    case class Coins(number: Int)
    case class Selection(coffee: CoffeeType)
    case object TriggerOutOfCoffeeBeansFailure

    // Events
    case class CoinsReceived(number: Int)

    // Replies
    case class Beverage(coffee: CoffeeType)

    // Errors
    case class NotEnoughCoinsError(message: String)

    // Failures
    case class OutOfCoffeeBeansFailure(customer: ActorRef,
                                       pendingOrder: Selection,
                                       nrOfInsertedCoins: Int) extends Exception
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  58. CoffeeMachine
    class CoffeeMachine extends Actor {
      val price = 2
      var nrOfInsertedCoins = 0
      var outOfCoffeeBeans = false
      var totalNrOfCoins = 0

      def receive = { … }

      override def postRestart(failure: Throwable): Unit = { … }
    }
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  59. CoffeeMachine
    def receive = {
      case Coins(nr) =>
        nrOfInsertedCoins += nr
        totalNrOfCoins += nr
        println(s"Inserted [$nr] coins")
        println(s"Total number of coins in machine is [$totalNrOfCoins]")

      case selection @ Selection(coffeeType) =>
        // User error: reply with a validation error instead of failing
        // (wording matches the assertion in the user error test)
        if (nrOfInsertedCoins < price)
          sender.tell(NotEnoughCoinsError(
            s"Please insert [${price - nrOfInsertedCoins}] coins"), self)
        else {
          // System failure: throw, so that the supervisor handles it
          if (outOfCoffeeBeans)
            throw new OutOfCoffeeBeansFailure(sender, selection, nrOfInsertedCoins)
          println(s"Brewing your $coffeeType")
          sender.tell(Beverage(coffeeType), self)
          nrOfInsertedCoins = 0
        }

      case TriggerOutOfCoffeeBeansFailure =>
        outOfCoffeeBeans = true
    }
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  60. CoffeeMachine
    override def postRestart(failure: Throwable): Unit = {
      println(s"Restarting coffee machine...")
      failure match {
        case OutOfCoffeeBeansFailure(customer, pendingOrder, coins) =>
          // Restore the state carried by the failure, then replay the order
          nrOfInsertedCoins = coins
          outOfCoffeeBeans = false
          println(s"Resubmitting pending order $pendingOrder")
          context.self.tell(pendingOrder, customer)
        case _ => // any other failure: no pending order to resubmit
      }
    }
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  61. Supervisor
    class CoffeeMachineManager extends Actor {
      override val supervisorStrategy =
        OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
          case e: OutOfCoffeeBeansFailure =>
            println(s"ServiceGuy notified: $e")
            Restart
          case _: Exception =>
            Escalate
        }

      // to simplify things he is only managing 1 single machine
      val machine = context.actorOf(
        Props[CoffeeMachine], name = "coffeeMachine")

      def receive = {
        case request => machine.forward(request)
      }
    }
    https://gist.github.com/jboner/d24c0eb91417a5ec10a6

  62. So.........
    Are We Done?
    Sorry...but Not really.

  63. We can not
    keep putting
    all eggs in the
    same basket

  64. We need to
    Maintain Diversity
    and Redundancy

  65. The Network
    is Reliable
    NOT
    Really

  66. Here, We are living in the
    Looming
    Shadow of
    Impossibility
    Theorems
    CAP: Consistency is impossible
    FLP: Consensus is impossible

  67. Towards Resilient Distributed Systems
    Isolation
    • Autonomous Microservices
    • Resilient Protocols
    • Virtualization
    Data Resilience
    • Eventual & Causal Consistency
    • Event Logging
    • Flow Control / Feedback Control
    Self-healing
    • Decentralized Architectures
    • Gossip Protocols
    • Failure Detection
    Embrace the Network
    • Asynchronicity
    • Location Transparency

  68. Microservices
    1. Autonomy
    2. Isolation
    3. Mobility
    4. Single Responsibility
    5. Exclusive State

  69. An autonomous Service
    can only promise
    its own behavior
    Apply Promise Theory

  70. We need to decompose the system using
    Consistency Boundaries

  71. Inside Data
    Our current present—state
    Outside Data
    Blast from the past—facts
    Between Services
    Hope for the future—commands
    Data on the inside vs Data on the outside - Pat Helland

  72. WITHIN the Consistency Boundary
    we can have STRONG CONSISTENCY

  73. BETWEEN
    Consistency
    Boundaries
    it is a
    ZOO

  74. We need Systems that are Decoupled in
    Time and Space

  75. Embrace the Network
    • Go Asynchronous
    • Make distribution first class
    • Learn from the mistakes of RPC, EJB & CORBA
    • Leverage Location Transparency
    • Actor Model does it right

  76. Location Transparency
    One communication abstraction
    across all dimensions of scale
    Core ⇒ Socket ⇒ CPU ⇒
    Container ⇒ Server ⇒ Rack ⇒
    Data Center ⇒ Global

  77. Resilient Protocols
    are tolerant to
    • Message loss
    • Message reordering
    • Message duplication
    Embrace ACID 2.0
    • Associative
    • Commutative
    • Idempotent
    • Distributed
    Depend on
    • Asynchronous Communication
    • Eventual Consistency
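A small sketch of why associative, commutative and idempotent operations tolerate reordering and duplication. The shopping-cart example and the `AddItem` message are hypothetical; set union stands in for any merge operation with those three properties. Message loss is not addressed here and would still need redelivery.

```scala
object Acid2Sketch {
  // Hypothetical message: add an item to a shopping cart
  final case class AddItem(item: String)

  // Adding to a set is associative, commutative and idempotent, so the
  // final state does not depend on delivery order or on duplicates
  def applyAll(messages: Seq[AddItem]): Set[String] =
    messages.foldLeft(Set.empty[String])((cart, msg) => cart + msg.item)
}
```

Delivering the same messages in a different order, or delivering some of them twice, produces the same cart.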

  78. “To make a system of interconnected components
    crash-only, it must be designed so that components
    can tolerate the crashes and temporary unavailability
    of their peers. This means we require: [1] strong
    modularity with relatively impermeable component
    boundaries, [2] timeout-based communication and
    lease-based resource allocation, and [3] self-
    describing requests that carry a time-to-live and
    information on whether they are idempotent.”
    - George Candea, Armando Fox
    Crash-Only Software - George Candea, Armando Fox
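One way to read the third requirement is that a request carries enough information for the caller to decide what to do after a timeout. The sketch below is an illustration only: the `Request` shape, the use of an absolute deadline as the time-to-live, and the `afterTimeout` rule are assumptions, not taken from the paper.

```scala
object SelfDescribingRequests {
  // Hypothetical request: the time-to-live is modelled as an absolute deadline
  final case class Request(payload: String, deadlineMs: Long, idempotent: Boolean)

  sealed trait Decision
  case object Retry extends Decision
  case object GiveUp extends Decision

  // After a timeout or a peer crash: retry only if the request is still
  // alive and repeating it cannot cause harm
  def afterTimeout(req: Request, nowMs: Long): Decision =
    if (nowMs < req.deadlineMs && req.idempotent) Retry else GiveUp
}
```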

  79. “Software components should be designed such
    that they can deny service for any request or call.
    Then, if an underlying component can say No,
    apps must be designed to take No for an answer
    and decide how to proceed: give up, wait and
    retry, reduce fidelity, etc.”
    - George Candea, Armando Fox
    Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox

  80. Services need to learn to accept
    NO for an answer
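A sketch of a caller that takes No for an answer, assuming a hypothetical `Reply` type in which a service can either answer or deny the request. It retries a bounded number of times and then reduces fidelity by using a fallback value. A real client would normally wait between retries, which is left out here.

```scala
object TakeNoForAnAnswer {
  sealed trait Reply[+A]
  final case class Ok[A](value: A) extends Reply[A]
  case object No extends Reply[Nothing] // the service denies the request

  // Call `primary`; on No, retry up to `retries` more times, then fall
  // back to a reduced-fidelity answer (for example a cached value)
  def callWithFallback[A](retries: Int)(primary: () => Reply[A])(fallback: => A): A = {
    @annotation.tailrec
    def loop(left: Int): A = primary() match {
      case Ok(value)      => value
      case No if left > 0 => loop(left - 1)
      case No             => fallback
    }
    loop(retries)
  }
}
```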

  81. Decentralized
    Epidemic Gossip Protocols
    Diagram: ten Member Nodes
    Gossip of membership, Data & Meta Data
    Failure detection heartbeat
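A toy sketch of heartbeat-based failure detection combined with a gossip-style merge of membership views. The `Membership` type and the fixed timeout are assumptions made for illustration; production systems such as Akka Cluster use more elaborate detectors.

```scala
object HeartbeatSketch {
  // Each node's view: the last time a heartbeat was seen from every member
  final case class Membership(lastSeen: Map[String, Long]) {
    def heartbeat(node: String, nowMs: Long): Membership =
      copy(lastSeen = lastSeen.updated(node, nowMs))

    // Gossip merge: keep the freshest heartbeat known for each node
    def merge(other: Membership): Membership = Membership(
      (lastSeen.keySet ++ other.lastSeen.keySet).map { node =>
        node -> math.max(
          lastSeen.getOrElse(node, Long.MinValue),
          other.lastSeen.getOrElse(node, Long.MinValue))
      }.toMap)

    // Nodes whose last heartbeat is older than the timeout are suspected
    def suspected(nowMs: Long, timeoutMs: Long): Set[String] =
      lastSeen.collect { case (node, seen) if nowMs - seen > timeoutMs => node }.toSet
  }
}
```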

  82. STRONG
    Consistency
    Is the wrong default

  83. “Two-phase commit is the
    anti-availability protocol.”
    - Pat Helland
    Standing on Distributed Shoulders of Giants - Pat Helland

  84. We have to rely on
    Eventual
    Consistency
    But relax, it’s how the world works

  85. But I really need
    Transactions

  86. “In general, application developers
    simply do not implement large
    scalable applications assuming
    distributed transactions.”
    - Pat Helland
    Life Beyond Distributed Transactions - Pat Helland

  87. Use a protocol of
    Guess.
    Apologize.
    Compensate.

  88. “The truth is the log. The database is a
    cache of a subset of the log.”
    - Pat Helland
    Immutability Changes Everything - Pat Helland

  89. CRUD is DEAD

  90. Event Logging
    • Work with Facts—immutable values
    • Event Sourcing
    • DB of Facts—Keep all history
    • Just replay on failure
    • Free Auditing, Debugging, Replication
    • Single Writer Principle
    • Avoids OO-Relational impedance mismatch
    • CQRS—Separate the Read & Write Model
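The "just replay on failure" idea can be sketched without any persistence library: state is never stored directly, it is rebuilt by folding over the log of immutable facts. `CoinsReceived` mirrors the event used in the demo; `CoffeeSold` and `MachineState` are hypothetical additions for illustration.

```scala
object EventLogSketch {
  sealed trait Event
  final case class CoinsReceived(number: Int) extends Event
  final case class CoffeeSold(price: Int) extends Event // hypothetical event

  final case class MachineState(totalCoins: Int, cupsSold: Int)
  val empty: MachineState = MachineState(0, 0)

  // Pure state transition: one immutable fact applied to the current state
  def applyEvent(state: MachineState, event: Event): MachineState = event match {
    case CoinsReceived(n) => state.copy(totalCoins = state.totalCoins + n)
    case CoffeeSold(_)    => state.copy(cupsSold = state.cupsSold + 1)
  }

  // Recovery is a replay of the whole log from the empty state
  def replay(log: Seq[Event]): MachineState = log.foldLeft(empty)(applyEvent)
}
```

Because the log keeps all history, replaying any prefix of it gives the state as it was at that point, which is where the auditing and debugging benefits come from.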

  91. Demo
    Time
    Let’s model a resilient & Event Logged vending machine, in Akka

  92. Event Logged CoffeeMachine
    // Events
    case class CoinsReceived(number: Int)

    class CoffeeMachine extends PersistentActor {
      val price = 2
      var nrOfInsertedCoins = 0
      var outOfCoffeeBeans = false
      var totalNrOfCoins = 0

      override def persistenceId = "CoffeeMachine"

      // Commands: persist the event first, then update state in the callback
      override def receiveCommand: Receive = {
        case Coins(nr) =>
          nrOfInsertedCoins += nr
          println(s"Inserted [$nr] coins")
          persist(CoinsReceived(nr)) { evt =>
            totalNrOfCoins += nr
            println(s"Total number of coins in machine is [$totalNrOfCoins]")
          }
      }

      // Recovery: rebuild state by replaying the persisted events
      override def receiveRecover: Receive = {
        case CoinsReceived(coins) =>
          totalNrOfCoins += coins
          println(s"Total number of coins in machine is [$totalNrOfCoins]")
      }
    }
    https://gist.github.com/jboner/1db37eeee3ed3c9422e4

  93. “An escalator can never break: it can only
    become stairs. You should never see an
    Escalator Temporarily Out Of Order sign, just
    Escalator Temporarily Stairs. Sorry for the
    convenience.”
    - Mitch Hedberg

  94. Graceful
    Degradation

  95. Circuit Breaker
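A minimal, single-threaded sketch of the circuit breaker state machine (Closed, Open, HalfOpen). It is not Akka's `CircuitBreaker`; the class, its parameters and the injected clock `now` are hypothetical, and the breaker returns `None` both when a call fails and when it fails fast.

```scala
object CircuitBreakerSketch {
  sealed trait State
  case object Closed extends State
  case object Open extends State
  case object HalfOpen extends State

  // Not thread-safe; `now` is injected so that time can be controlled in tests
  final class CircuitBreaker(maxFailures: Int, resetTimeoutMs: Long, now: () => Long) {
    private var state: State = Closed
    private var failures = 0
    private var openedAt = 0L

    def currentState: State = state

    // Runs `call` unless the breaker is open; None means failed or rejected
    def protect[A](call: => A): Option[A] = {
      // After the reset timeout, let one trial call through
      if (state == Open && now() - openedAt >= resetTimeoutMs) state = HalfOpen
      if (state == Open) None // fail fast without touching the protected resource
      else
        try {
          val result = call
          state = Closed
          failures = 0
          Some(result)
        } catch {
          case scala.util.control.NonFatal(_) =>
            failures += 1
            if (state == HalfOpen || failures >= maxFailures) {
              state = Open
              openedAt = now()
            }
            None
        }
    }
  }
}
```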

  96. Little’s Law
    L = λW
    Queue Length = Arrival Rate * Response Time
    W = L/λ
    Response Time = Queue Length / Arrival Rate
    L: Queue Length
    λ: Arrival Rate
    W: Response Time
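A worked example of the two forms of the law, with made-up numbers: at an arrival rate of 100 requests per second and a response time of 0.5 seconds, 50 requests are in the system on average. The function and parameter names are hypothetical.

```scala
object LittlesLaw {
  // L = λ * W
  def queueLength(arrivalRatePerSec: Double, responseTimeSec: Double): Double =
    arrivalRatePerSec * responseTimeSec

  // W = L / λ
  def responseTime(queueLength: Double, arrivalRatePerSec: Double): Double = {
    require(arrivalRatePerSec > 0, "arrival rate must be positive")
    queueLength / arrivalRatePerSec
  }
}
```

Read the other way round, a queue bounded at 50 entries with 100 arrivals per second caps the response time at 0.5 seconds, which is one argument for bounded queues.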

  97. Flow Control
    Always Apply BackPressure
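A minimal sketch of back pressure using a bounded buffer from the Java standard library: when the buffer is full, the producer is told so instead of the queue growing without bound. `BoundedMailbox` and the `Accepted`/`Rejected` replies are hypothetical names; Reactive Streams implementations signal demand in a more refined way.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackPressureSketch {
  sealed trait Offer
  case object Accepted extends Offer
  case object Rejected extends Offer // signal to the producer: slow down

  final class BoundedMailbox[A <: AnyRef](capacity: Int) {
    private val queue = new ArrayBlockingQueue[A](capacity)

    // Non-blocking: a full buffer rejects the element instead of growing
    def offer(element: A): Offer =
      if (queue.offer(element)) Accepted else Rejected

    // Taking an element frees capacity for the producer
    def poll(): Option[A] = Option(queue.poll())
  }
}
```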

  98. Feedback
    Control

  99. The Feedback Principle
    “Continuously compare the actual
    output to its desired reference value;
    then apply a change to the system
    inputs that counteracts any deviation of
    the actual output from the reference.”
    - Philipp K. Janert
    Feedback Control for Computer Systems - Philipp K. Janert
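The principle can be sketched as a proportional controller: each step compares the actual output with the reference and changes the input in proportion to the error. The toy system below, in which the output simply equals the input, and the gain value used in the test are assumptions chosen to keep the example short.

```scala
object FeedbackSketch {
  // Change to apply to the input: proportional to the deviation and
  // directed so that it counteracts it
  def correction(reference: Double, actual: Double, gain: Double): Double =
    gain * (reference - actual)

  // Toy closed loop, assuming a system whose output equals its input
  def run(reference: Double, initialInput: Double, gain: Double, steps: Int): Double =
    (1 to steps).foldLeft(initialInput) { (input, _) =>
      val actual = input // measure the output of the toy system
      input + correction(reference, actual, gain)
    }
}
```

With a gain between 0 and 1 the error shrinks by a constant factor on every step, so the input converges on the reference.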

  100. Feedback Control

  101. Influencing a
    Complex System

  102. Places to Intervene
    in a Complex System
    1. The constants, parameters or numbers
    2. The sizes of buffers relative to their flows
    3. The structure of material stocks and flows
    4. The lengths of delays, relative to the rate of system change
    5. The strength of negative feedback loops
    6. The gain around driving positive feedback loops
    7. The structure of information flows
    8. The rules of the system
    9. The power to add, change, evolve, or self-organize structure
    10. The goals of the system
    11. The mindset or paradigm out of which the system arises
    12. The power to transcend paradigms
    Leverage Points: Places to Intervene in a System - Donella Meadows

  103. Triple Loop Learning
    Loop 1: Follow the rules
    Loop 2: Change the rules
    Loop 3: Learn how to learn
    Triple Loop Learning - Chris Argyris

  104. Testing

  105. What can we learn from Arnold?
    Blow things up

  106. Shoot
    Your App
    Down

  107. Pull the Plug
    …and see what happens

  109. Executive
    Summary

  110. “Complex systems run as broken systems.”
    - Richard Cook
    How Complex Systems Fail - Richard Cook

  111. Resilience
    is by
    Design
    Photo courtesy of FEMA/Joselyne Augustino

  112. Without Resilience
    Nothing Else Matters

  113. References
    Drift into Failure - http://www.amazon.com/Drift-into-Failure-Components-Understanding-ebook/dp/B009KOKXKY

    How Complex Systems Fail - http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

    Leverage Points: Places to Intervene in a System - http://www.donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/
    Going Solid: A Model of System Dynamics and Consequences for Patient Safety - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1743994/

    Resilience in Complex Adaptive Systems: Operating at the Edge of Failure - https://www.youtube.com/watch?v=PGLYEDpNu60

    Puppies! Now that I’ve got your attention, Complexity Theory - https://www.ted.com/talks/nicolas_perony_puppies_now_that_i_ve_got_your_attention_complexity_theory

    How Bacteria Becomes Resistant - http://www.abc.net.au/science/slab/antibiotics/resistance.htm

    Toward Resilient Architectures 1: Biology Lessons - http://www.metropolismag.com/Point-of-View/March-2013/Toward-Resilient-Architectures-1-Biology-Lessons/

    Dealing in Security - http://resiliencemaps.org/files/Dealing_in_Security.July2010.en.pdf

    What is resilience? An introduction to social-ecological research - http://www.stockholmresilience.org/download/18.10119fc11455d3c557d6d21/1398172490555/SU_SRC_whatisresilience_sidaApril2014.pdf

    Applying resilience thinking: Seven principles for building resilience in social-ecological systems - http://www.stockholmresilience.org/download/18.10119fc11455d3c557d6928/1398150799790/SRC+Applying+Resilience+final.pdf

    Crash-Only Software - https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf

    Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - http://roc.cs.berkeley.edu/papers/recursive_restartability.pdf

    Out of the Tar Pit - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928

    Bulkhead Pattern - http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html

    Making Reliable Distributed Systems in the Presence of Software Errors - http://www.erlang.org/download/armstrong_thesis_2003.pdf

    On Erlang, State and Crashes - http://jlouisramblings.blogspot.be/2010/11/on-erlang-state-and-crashes.html

    Akka Supervision - http://doc.akka.io/docs/akka/snapshot/general/supervision.html

    Release It!: Design and Deploy Production-Ready Software - https://pragprog.com/book/mnee/release-it

    Feedback Control for Computer Systems - http://www.amazon.com/Feedback-Control-Computer-Systems-Philipp/dp/1449361692

    The Network is Reliable - http://queue.acm.org/detail.cfm?id=2655736

    Data on the Outside vs Data on the Inside - https://msdn.microsoft.com/en-us/library/ms954587.aspx

    Life Beyond Distributed Transactions - http://adrianmarriott.net/logosroot/papers/LifeBeyondTxns.pdf

    Immutability Changes Everything - http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

    Standing on Distributed Shoulders of Giants - https://queue.acm.org/detail.cfm?id=2953944

    Thinking in Promises - http://shop.oreilly.com/product/0636920036289.do

    In Search Of Certainty - http://shop.oreilly.com/product/0636920038542.do

    Reactive Microservices Architecture - http://www.oreilly.com/programming/free/reactive-microservices-architecture-orm.csp
    Reactive Streams - http://reactive-streams.org

    Vending Machine Akka Supervision Demo - https://gist.github.com/jboner/d24c0eb91417a5ec10a6

    Persistent Vending Machine Akka Supervision Demo - https://gist.github.com/jboner/1db37eeee3ed3c9422e4

  114. Thank
    You

  115. Without Resilience
    Nothing Else Matters
    Jonas Bonér
    CTO Lightbend
    @jboner
