Slide 1

Slide 1 text

The Reboot, reloaded @nukemberg (Avishai Ish-Shalom)

Slide 2

Slide 2 text

Recovery Oriented Computing

Slide 3

Slide 3 text

Why do we do this???

Slide 4

Slide 4 text

Because it works. But how??

Slide 5

Slide 5 text

Resetting state

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Implicit state
● Threads state machine
● Memory allocations
● Database/backend connections
● Client connections
● Client sessions
● Telemetry counters
● Throttling/rate limiting
● Caches
● JIT
● File descriptors
You get the point

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

The initialization phase
Prepare assets for the main service loop
● But what if it fails?
● New failure modes!
  ○ E.g. different behaviors on DNS, DB failure
● Multiple, incompatible recovery paths
● Zombie processes
● Hard to detect dependencies
  ○ Circular dependencies, anyone?
  ○ (Looking at you, DNS)

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Recovery oriented computing (ROC)
Problems will always happen
● Make it easy to identify problems
● Limit the impact of problems
● Built-in recovery mechanisms
Modularity is key!!!

Slide 13

Slide 13 text

The software aging problem
● State/configuration drift and corruption
● Counters overflow
● Disks, memory fill up
● Caches become stale
● Data structures become fragmented/sparse
Hard to track and monitor
A + B + C - A != B + C
$ python hashmap.py
a = {'test': 1}, size(a) = 41943136
b = {'test': 1}, size(b) = 232
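
The hashmap.py output above is consistent with a hash table that grew and was never shrunk. The script itself is not shown in the slides, so the following is only a guess at what it might look like. It assumes CPython, where a dict keeps its large table after keys are deleted. The function name and the entry count are illustrative, and the exact sizes vary by Python version.

```python
import sys


def aged_dict(n=100_000):
    """Fill a dict with n entries, delete them all, then add one key."""
    d = {}
    for i in range(n):
        d[i] = i
    for i in range(n):
        del d[i]  # CPython does not shrink the table on delete
    d["test"] = 1
    return d


a = aged_dict()      # "aged": same contents as b, but a bloated table
b = {"test": 1}      # fresh

print(f"a = {a}, size(a) = {sys.getsizeof(a)}")
print(f"b = {b}, size(b) = {sys.getsizeof(b)}")
```

The two dicts compare equal, yet the aged one holds far more memory, which is the "A + B + C - A != B + C" point. In CPython, rebuilding it (for example `a = dict(a)`) should give back a compact table: a reset in miniature.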

Slide 14

Slide 14 text

Reboot as a software recovery mechanism
● Return to known good config/state
● Initialize connections, etc.
● Nuke unknown state from orbit
● Make those hidden dependencies visible
But, a reboot is expensive!

Slide 15

Slide 15 text

State segregation
● Each “state ring” owns some resources and data
● Each “state ring” has its own reset procedure
● Reset clears data, reinitializes resources
● After reset, known good state or hard failure
  ○ Makes the state ring easy to monitor
● No impact on state in other state rings
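
One way to picture a state ring in code is an object that owns its data and knows how to rebuild it from scratch. This is a minimal sketch, not something taken from the talk: `StateRing`, `ResetFailed` and the `init` callback are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


class ResetFailed(Exception):
    """Raised when a ring cannot get back to a known good state."""


@dataclass
class StateRing:
    name: str
    init: Callable[[], dict]  # builds this ring's resources from scratch
    state: dict = field(default_factory=dict)
    healthy: bool = False

    def reset(self) -> None:
        """Clear data and reinitialize; end up healthy or fail hard."""
        self.state.clear()
        self.healthy = False
        try:
            self.state.update(self.init())
        except Exception as exc:
            raise ResetFailed(self.name) from exc
        self.healthy = True


cache = StateRing("cache", init=dict)
sessions = StateRing("sessions", init=lambda: {"active": []})
cache.reset()
sessions.reset()

cache.state["user:1"] = "stale value"
sessions.state["active"].append("alice")

cache.reset()  # only the cache ring is touched; sessions keep their data
```

Because a reset ends in exactly one of two outcomes (`healthy` is true, or `ResetFailed` is raised), monitoring a ring comes down to watching one flag.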

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Micro-reboot and other magical creatures
Different state rings with different reboot costs
E.g. 50ms to micro reboot

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Macro reboot
● How do you reboot a service? A whole system?
● Some state must be preserved
  ○ In particular, business state
  ○ All useful systems are stateful to some degree
● Some state is in transit
  ○ The network itself has state
● Invalidate client sessions, backend sessions, etc.
  ○ Generation ID is your friend
Rebooting a system does not mean rebooting its parts!

Slide 20

Slide 20 text

Generation ID
● UUID or counter
● Reset on reboot
● Invalidate sessions, connections
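
A rough sketch of how a generation ID could invalidate sessions: every session is stamped with the ID of the boot that created it and is rejected once the ID changes. The `Service` class and its methods are hypothetical, invented here for illustration. The UUID variant is shown; a persisted counter would work the same way.

```python
import uuid


class Service:
    """Hypothetical service that stamps sessions with its generation ID."""

    def __init__(self):
        self.generation = None
        self.boot()

    def boot(self):
        # A fresh ID on every boot.
        self.generation = uuid.uuid4().hex

    def open_session(self, user):
        return {"user": user, "generation": self.generation}

    def is_valid(self, session):
        # Sessions from an earlier generation are rejected after a reboot.
        return session["generation"] == self.generation


svc = Service()
session = svc.open_session("alice")
valid_before = svc.is_valid(session)

svc.boot()  # simulate a reboot
valid_after = svc.is_valid(session)
```

Clients holding a stale generation ID learn they must reconnect, without the service having to remember which sessions existed before the reboot.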

Slide 21

Slide 21 text

“Crash only” software
● A significant part of most software is dedicated to error handling
  ○ Complex
  ○ Scattered
  ○ Buggy
● Single error handling mechanism, single exit point: a crash
● Single recovery path, single entry point: boot
● Crash report (state dump) for post-mortem debugging
Crash, dump, reboot
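
The "crash, dump, reboot" cycle can be sketched as a small supervisor loop. This is an in-process approximation with assumed names (`run_crash_only`, `boot`, `serve`). In a real crash-only design the crash would normally end the process and an external supervisor would restart it.

```python
import json
import traceback


def run_crash_only(boot, serve, max_boots=3):
    """Run serve(state); on any error, dump a report and go back through boot()."""
    reports = []
    for attempt in range(1, max_boots + 1):
        state = boot()  # single entry point, also the only recovery path
        try:
            serve(state)
            return reports  # clean finish exists only to end the demo
        except Exception as exc:  # single exit point: a crash
            reports.append(json.dumps({
                "attempt": attempt,
                "error": repr(exc),
                "state": repr(state),  # state dump for the post-mortem
                "trace": traceback.format_exc(),
            }))
    raise RuntimeError(f"gave up after {max_boots} boots")


calls = {"n": 0}


def boot():
    return {"requests_served": 0}


def serve(state):
    calls["n"] += 1
    if calls["n"] == 1:  # fail on the first run only
        state["requests_served"] = 7
        raise ValueError("corrupted counter")


reports = run_crash_only(boot, serve)
```

Note that `serve` contains no error handling of its own: anything unexpected takes the same exit, and recovery always re-enters through `boot`.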

Slide 22

Slide 22 text

Single recovery path ⇒ Single optimization path
● Routine reboots make boot profiling easier
● Parallelization
● Reducing the “initialization” phase
● Lazy/optional dependencies
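
As an illustration of the "lazy dependencies" bullet, a dependency can be connected on first use instead of during boot, which shortens the initialization phase. `LazyDependency` and `connect_db` are hypothetical names, and the dict stands in for a real connection object.

```python
class LazyDependency:
    """Connects on first use instead of during boot (hypothetical helper)."""

    def __init__(self, connect):
        self._connect = connect
        self._conn = None

    def get(self):
        if self._conn is None:
            self._conn = self._connect()
        return self._conn

    def reset(self):
        # Drop the connection; the next get() reconnects.
        self._conn = None


attempts = []


def connect_db():
    attempts.append("connect")
    return {"connected": True}


db = LazyDependency(connect_db)  # boot does not touch the database
boot_attempts = len(attempts)

db.get()
db.get()  # reuses the existing connection
```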

Slide 23

Slide 23 text

Business flows need love too
● Design business flows to support reset
  ○ Sometimes customer consent is needed
  ○ Not every flow can be reset
● Clear pending state and return to last known good state
● Triggered by support or automation
  ○ Very handy during crisis
● Don’t forget crash reports
Organizational processes, anyone?

Slide 24

Slide 24 text

Preemptive reboots
Everything breaks, eventually. Better it happens on your terms
● Trigger reboots based on max lifetime, resource usage, errors, or just at random
● Reboot becomes an integral part of the software lifecycle
● Keeps components within known boundaries of “healthy” operation
● Makes upgrades, deployments, etc. easier
● Flush out problems early; if it won’t boot, better you know now
● Yes, there’s overhead - just like GC
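
The four triggers listed above (lifetime, resource usage, errors, random) could be combined into a single check, sketched below. The function name, parameters and default thresholds are all assumptions made for the example, not values from the talk.

```python
import random
import time


def should_reboot(started_at, rss_bytes, error_count, *,
                  now=None, rng=random.random,
                  max_lifetime_s=24 * 3600,
                  max_rss_bytes=2 * 1024**3,
                  max_errors=100,
                  random_rate=0.0):
    """Return the reason to reboot now, or None to keep running."""
    now = time.time() if now is None else now
    if now - started_at >= max_lifetime_s:
        return "max lifetime"
    if rss_bytes >= max_rss_bytes:
        return "resource usage"
    if error_count >= max_errors:
        return "errors"
    if rng() < random_rate:  # occasional random reboot, off by default
        return "random"
    return None
```

A component would call this periodically from its main loop and, on a non-None result, log the reason and go through its normal reboot path.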

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Reboot like a boss
● Orchestrated reboots - preserve overall capacity
  ○ Leases, random backoffs or turn based
● Use soft dependencies
  ○ Best way to avoid dependency cycles and deadlocks
● Learn the art of stateful reboots
  ○ Checkpoints, redo logs, generational memory
  ○ Deserves an entire sequel (The Reboot, Resurrections)
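
To make the lease idea concrete, here is a toy lease table that caps how many nodes may reboot at the same time. `RebootLeases` is a hypothetical, in-memory stand-in; in practice the leases would have to live in a store shared by all nodes.

```python
class RebootLeases:
    """Hypothetical in-memory lease table limiting concurrent reboots."""

    def __init__(self, max_concurrent, ttl_s):
        self.max_concurrent = max_concurrent
        self.ttl_s = ttl_s
        self._leases = {}  # node -> expiry time

    def acquire(self, node, now):
        # Drop expired leases so a node that died mid-reboot cannot block others.
        self._leases = {n: t for n, t in self._leases.items() if t > now}
        if node in self._leases or len(self._leases) < self.max_concurrent:
            self._leases[node] = now + self.ttl_s
            return True
        return False

    def release(self, node):
        self._leases.pop(node, None)


leases = RebootLeases(max_concurrent=1, ttl_s=60)
a_first = leases.acquire("a", now=0)      # granted
b_blocked = leases.acquire("b", now=1)    # denied: "a" is rebooting
leases.release("a")
b_granted = leases.acquire("b", now=2)    # granted after "a" finishes
c_after_expiry = leases.acquire("c", now=100)  # granted: "b"'s lease expired
```

A node that is denied a lease would wait a random backoff and try again, so overall capacity never drops by more than `max_concurrent` nodes.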

Slide 27

Slide 27 text

Beyond reboots
● Continuous repair/Anti entropy
● System wide undo
● System replay
And lots more

Slide 28

Slide 28 text

Everything dies*
* Except LISP. LISP is eternal

Slide 29

Slide 29 text

And a new generation is born

Slide 30

Slide 30 text

The circle of life must continue
● A reliable system is built of less reliable parts
● The essential properties of the system survive, the parts don’t
Are you still you when all your cells have been replaced?

Slide 31

Slide 31 text

References
● Berkeley/Stanford ROC project homepage
● Computer Immunology
● Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel
● Microreboot – A Technique for Cheap Recovery
● Crash-Only Software

Slide 32

Slide 32 text

~$ sudo reboot