The Reboot, Reloaded

"Have you tried turning it off and on again?" is one of the most common jokes in our industry. However, behind the joke lies a fundamental architectural pattern: state encapsulation and recovery by clearing state data. This pattern is mostly used implicitly, but when used as a deliberate design principle it unlocks a new paradigm of programming: Recovery Oriented Computing (ROC).
This talk introduces ROC and the fundamental techniques of this paradigm, which can also be applied individually to great effect: separation of state, crash-only software, micro-reboots, a single recovery path, redo logs and checkpoints, preemptive reboots, and more.

Avishai Ish-Shalom

December 26, 2021

Transcript

  1. The Reboot, reloaded
    @nukemberg (Avishai Ish-Shalom)

  2. Recovery Oriented Computing

  3. Why do we do this???

  4. Because it works
    But how??

  5. Resetting state

  8. Implicit state
    ● Threads state machine
    ● Memory allocations
    ● Database/backend connections
    ● Client connections
    ● Client sessions
    ● Telemetry counters
    ● Throttling/rate limiting
    ● Caches
    ● JIT
    ● File descriptors
    You get the point

  10. The initialization phase
    Prepare assets for the main service loop
    ● But what if it fails?
    ● New failure modes!
    ○ E.g. different behaviors on DNS, DB failure
    ● Multiple, incompatible recovery paths
    ● Zombie processes
    ● Hard to detect dependencies
    ○ Circular dependencies, anyone?
    ○ (Looking at you DNS)

  12. Recovery oriented computing (ROC)
    Problems will always happen
    ● Make it easy to identify problems
    ● Limit the impact of problems
    ● Built in recovery mechanisms
    Modularity is key!!!

  13. The software aging problem
    ● State/configuration drift and corruption
    ● Counters overflow
    ● Disks, memory fill up
    ● Caches become stale
    ● Data structures become fragmented/sparse
    Hard to track and monitor
    A + B + C - A != B + C
    $ python hashmap.py
    a = {'test': 1}, size(a) = 41943136
    b = {'test': 1}, size(b) = 232
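
The `hashmap.py` script itself is not shown on the slide. A minimal sketch that produces output of this shape, assuming CPython (whose dicts do not shrink when keys are deleted) and assuming the effect comes from a dict that grew and was then emptied again, might look like this. The key count `N` is an assumption, and exact sizes vary by Python version:

```python
import sys

# Hypothetical reconstruction; the original hashmap.py is not shown.
N = 1_000_000  # assumed number of temporary keys

a = {"test": 1}
for i in range(N):
    a[i] = i          # grow the dict (A + B + C)
for i in range(N):
    del a[i]          # remove the extra keys again (- A)

b = {"test": 1}       # a fresh dict with the same logical content

print(f"a = {a}, size(a) = {sys.getsizeof(a)}")
print(f"b = {b}, size(b) = {sys.getsizeof(b)}")
```

Both dicts compare equal, yet `a` keeps the memory it acquired while it was large. Nothing in the visible state reveals the difference, which is what makes this kind of aging hard to monitor.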

  14. Reboot as a software recovery mechanism
    ● Return to known good config/state
    ● Initialize connections, etc
    ● Nuke unknown state from orbit
    ● Make those hidden dependencies visible
    But, a reboot is expensive!

  15. State segregation
    ● Each “state ring” owns some resources and data
    ● Each “state ring” has its own reset procedure
    ● Reset clears data and reinitializes resources
    ● After reset, known good state or hard failure
    ○ Makes the state ring easy to monitor
    ● No impact on state in other state rings
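
One way to picture a state ring in code is an object that owns its data and resources and exposes a single reset procedure. The sketch below is illustrative only: `StateRing`, `HardFailure`, and the `open_resource` factory are hypothetical names, not something taken from the talk.

```python
class HardFailure(Exception):
    """Raised when a ring cannot return to a known good state."""


class StateRing:
    """Hypothetical state ring: owns its data and resources, and can reset them."""

    def __init__(self, name, open_resource):
        self.name = name
        self._open_resource = open_resource  # factory for the owned resource
        self.data = {}
        self.resource = None
        self.healthy = False

    def reset(self):
        """Clear data and reinitialize resources; end healthy or fail hard."""
        self.data.clear()
        self.resource = None
        self.healthy = False
        try:
            self.resource = self._open_resource()
        except Exception as exc:
            raise HardFailure(f"ring {self.name!r} failed to reset") from exc
        self.healthy = True


cache = StateRing("cache", open_resource=dict)
sessions = StateRing("sessions", open_resource=list)
cache.reset()
sessions.reset()

sessions.data["alice"] = "logged-in"
cache.data["page"] = "<html>"
cache.reset()  # wipes only the cache ring
print(cache.data, sessions.data)
```

After `reset()` a ring is either healthy or has raised, so a monitor only needs to watch one flag per ring, and resetting the cache leaves the session ring untouched.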

  17. Micro-reboot and other magical creatures
    Different state rings with different reboot costs
    E.g. 50ms to micro reboot
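
Because rings differ in reboot cost, a recovery routine can try the cheapest reset first and escalate only if the fault persists. The following is a rough sketch of that idea under assumed names (`recover`, the ring list, and the health check are all hypothetical), not the speaker's implementation:

```python
def recover(rings, is_healthy):
    """Reset rings from cheapest to most expensive until the check passes.

    rings: list of (name, reset_fn) ordered by increasing reboot cost.
    is_healthy: callable returning True once the fault is gone.
    Returns the name of the ring whose reset fixed the problem.
    """
    for name, reset in rings:
        reset()
        if is_healthy():
            return name
    raise RuntimeError("all reboots failed; escalate to a full restart")


# Toy state: the fault lives in the connection pool, not in the cache.
state = {"cache": "stale", "pool": "broken"}
rings = [
    ("cache", lambda: state.update(cache="fresh")),
    ("connection pool", lambda: state.update(pool="ok")),
    ("whole process", lambda: state.update(cache="fresh", pool="ok")),
]
fixed_by = recover(rings, lambda: state["pool"] == "ok")
print(fixed_by)
```

Here the cheap cache reset does not help, the connection pool reset does, and the expensive whole-process reboot is never needed.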

  19. Macro reboot
    ● How do you reboot a service? A whole system?
    ● Some state must be preserved
    ○ In particular, business state
    ○ All useful systems are stateful to some degree
    ● Some state is in transit
    ○ The network itself has state
    ● Invalidate client sessions, backend sessions, etc
    ○ Generation ID is your friend
    Rebooting a system does not mean rebooting its parts!

  20. Generation ID
    ● UUID or counter
    ● Reset on reboot
    ● Invalidate sessions, connections
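
A small sketch of how a generation ID can invalidate sessions without tracking them individually. The `Server` class and its method names are hypothetical, and the UUID variant is shown; a counter would work the same way as long as it changes on every boot.

```python
import uuid


class Server:
    """Hypothetical server whose sessions are tied to a boot generation."""

    def __init__(self):
        self.generation = None
        self.boot()

    def boot(self):
        # A new generation ID on every (re)boot implicitly invalidates
        # everything issued under the previous one.
        self.generation = uuid.uuid4().hex

    def open_session(self, user):
        return {"user": user, "generation": self.generation}

    def is_valid(self, session):
        return session["generation"] == self.generation


server = Server()
session = server.open_session("alice")
before = server.is_valid(session)
server.boot()  # reboot: no need to find and delete individual sessions
after = server.is_valid(session)
print(before, after)
```

The session is valid before the reboot and rejected afterwards, so clients holding stale sessions or connections find out on their next request and can reconnect.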

  21. “Crash only” software
    ● Significant part of most software dedicated to error handling
    ○ Complex
    ○ Scattered
    ○ Buggy
    ● Single error handling mechanism, single exit point: a crash
    ● Single recovery path, single entry point: boot
    ● Crash report (state dump) for post-mortem debugging
    Crash, dump, reboot
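
The "crash, dump, reboot" cycle can be sketched as a loop with exactly one error handler. This is an in-process illustration with hypothetical names (`supervise`, `boot`, `serve`, `write_report`); in a real deployment the crash would usually end the process and an external supervisor would restart it.

```python
import json
import traceback


def supervise(boot, serve, write_report, max_boots=3):
    """Run boot() then serve(state); on any error: dump, then boot again."""
    for attempt in range(1, max_boots + 1):
        state = boot()              # single entry point
        try:
            return serve(state)     # no local error handling inside
        except Exception:
            # Single exit point: record a crash report, discard all state.
            write_report(json.dumps({
                "attempt": attempt,
                "state": repr(state),
                "traceback": traceback.format_exc(),
            }))
    raise RuntimeError("gave up after repeated crashes")


reports = []
calls = {"n": 0}


def boot():
    return {"requests_served": 0}


def serve(state):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ValueError("corrupted in-memory state")
    return "served"


result = supervise(boot, serve, reports.append)
print(result, len(reports))
```

The first run crashes and leaves one report behind; the second run starts from a fresh `boot()` and succeeds. The serving code contains no error handling of its own.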

  22. Single recovery path ⇒ Single optimization path
    ● Routine reboots make boot profiling easier
    ● Parallelization
    ● Reducing the “initialization” phase
    ● Lazy/optional dependencies

  23. Business flows need love too
    ● Design business flows to support reset
    ○ Sometimes customer consent is needed
    ○ Not every flow can be reset
    ● Clear pending state and return to last known good state
    ● Triggered by support or automation
    ○ Very handy during crisis
    ● Don’t forget crash reports
    Organizational processes anyone?

  24. Preemptive reboots
    Everything breaks, eventually. Better it happens on your terms
    ● Trigger reboot based on max lifetime, resource usage, errors or just random
    ● Reboot becomes integral part of software lifecycle
    ● Keeps components within known boundaries of “healthy” operation
    ● Makes upgrades, deployments, etc easier
    ● Flush out problems early; if it won’t boot, better you know now
    ● Yes, there’s overhead - just like GC

  26. Reboot like a boss
    ● Orchestrated reboots - preserve overall capacity
    ○ Leases, random backoffs or turn based
    ● Use soft dependencies
    ○ Best way to avoid dependency cycles and deadlocks
    ● Learn the art of stateful reboots
    ○ Checkpoints, redo logs, generational memory
    ○ Deserves an entire sequel (The Reboot, Resurrections)
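
As a taste of stateful reboots, here is a toy sketch of recovery from a checkpoint plus a redo log. The `Counter` class is hypothetical and keeps its "durable" snapshot and log in memory purely for illustration; a real system would persist both.

```python
class Counter:
    """Hypothetical stateful component recovered from checkpoint + redo log."""

    def __init__(self):
        self.checkpoint = {}   # last durable snapshot (stands in for disk)
        self.redo_log = []     # operations applied since the snapshot
        self.state = {}        # volatile in-memory state

    def add(self, key, amount):
        self.redo_log.append((key, amount))   # log first, then apply
        self._apply(key, amount)

    def _apply(self, key, amount):
        self.state[key] = self.state.get(key, 0) + amount

    def take_checkpoint(self):
        self.checkpoint = dict(self.state)
        self.redo_log.clear()

    def reboot(self):
        # Drop volatile state, restore the snapshot, replay the log.
        self.state = dict(self.checkpoint)
        for key, amount in self.redo_log:
            self._apply(key, amount)


c = Counter()
c.add("orders", 2)
c.take_checkpoint()
c.add("orders", 3)
c.state["orders"] = -999   # simulate in-memory corruption
c.reboot()
print(c.state)
```

The reboot discards the corrupted in-memory value and rebuilds the business state (`orders` equal to 5) from the snapshot and the logged operation.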

  27. Beyond reboots
    ● Continuous repair/Anti entropy
    ● System wide undo
    ● System replay
    And lots more

  28. Everything dies
    * Except LISP. LISP is eternal

  29. And a new generation is born

  30. The circle of life must continue
    ● A reliable system is built of less reliable parts
    ● The essential properties of the system survive, the parts don’t
    Are you still you when all your cells have been replaced?

  31. References
    ● Berkeley/Stanford ROC project homepage
    ● Computer Immunology
    ● Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel
    ● Microreboot – A Technique for Cheap Recovery
    ● Crash-Only Software

  32. ~$ sudo reboot
