Problems hiding in plain sight
Small-scale users simply take longer to notice problems that arise from, e.g., randomness.
[Figure: two panels plotting servers against time. Small-scale users: 3 servers over 100 days. Large-scale users: 100 servers over 3 days.]
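The arithmetic behind the two panels can be sketched as follows. This is an illustration only: it assumes that each server independently hits a given problem with a small, fixed probability per day, and the rate `P` is a made-up value that does not come from the slides. Under that assumption, 3 servers need 100 days to accumulate the exposure that 100 servers accumulate in 3 days.

```python
def prob_problem_seen(servers: int, days: int, p_per_server_day: float) -> float:
    """Probability that at least one server shows the problem within `days`.

    Assumes independent, equally likely failures per server per day,
    which is a simplification made for this illustration.
    """
    server_days = servers * days
    return 1.0 - (1.0 - p_per_server_day) ** server_days


# Hypothetical failure rate: 1 in 300 per server per day (an assumption).
P = 1 / 300

small = prob_problem_seen(servers=3, days=100, p_per_server_day=P)
large = prob_problem_seen(servers=100, days=3, p_per_server_day=P)
small_after_3_days = prob_problem_seen(servers=3, days=3, p_per_server_day=P)

print(f"3 servers, 100 days: {small:.2f}")
print(f"100 servers, 3 days: {large:.2f}")
print(f"3 servers, 3 days:   {small_after_3_days:.2f}")
```

Both fleets reach the same probability of seeing the problem, because both have accumulated 300 server-days. The small fleet just needs far more calendar time to get there.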
Google Finding: “Failure is the Norm”
“deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
Source: (Verma et al., 2015)
What does this mean for server systems?
[Diagram: each server is described by its operating system (v1, v2, v3), configuration (A, B, C), and power state (On, Off, Out).]
Server 1: OS v1, Config A, Power On
Server 2: OS v1, Config A, Power On
Server 3: OS v1, Config A, Power On
Example: Sysadmin A gets three new servers and installs the same operating system on all of them, with exactly the same configuration. In the beginning, the system is completely ordered: all instances are identically configured.
Server 1: OS v1, Config A, Power On
Server 2: OS v2, Config A, Power On
Server 3: OS v2, Config A, Power On
After some time, a critical “v2” security upgrade to the operating system becomes available. Sysadmin A upgrades servers 2 and 3, but not server 1: it is running a critical database service, and A is afraid to disturb it.
Server 1: OS v1, Config B, Power On (reported slow disk access time)
Server 2: OS v2, Config A, Power On
Server 3: OS v2, Config A, Power On
Server 1 complains about slow disk access time, caused by a misconfiguration in the operating system. Sysadmin A fixes it imperatively on the server that complains, until the complaint stops, but on none of the other servers.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v2, Config A, Power On
Sysadmin A has noticed that the number of users has dropped because of a seasonal trend, so A decides to turn server 2 off to save on energy costs.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v2, Config C, Power On (reported slow disk access time)
The next week, while sysadmin A is on vacation, server 3 complains about the same error that server 1 reported earlier. Sysadmin B “solves” the issue, but differently from how A fixed server 1, and does nothing to the other servers.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v3, Config C, Power On (desired state change)
Now, a new version of the operating system is released with a very cool feature that would be useful to the sysadmins. However, upgrading is risky because of incompatibilities, so they only upgrade server 3 to try it out.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v3, Config C, Power Out (emergent state change)
Suddenly, a thunderstorm reaches the area where the servers are located, and lightning strikes. Because there is no overvoltage protection, server 3’s power supply is destroyed, and the server shuts down.
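The walkthrough above can be replayed as a small simulation that counts how many distinct server states exist after each event. This is only a sketch: the `ServerState` type, the `distinct_states` helper, and the event list are hypothetical names introduced here to mirror the slides, not part of any real tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServerState:
    os: str      # "v1", "v2", or "v3"
    config: str  # "A", "B", or "C"
    power: str   # "On", "Off", or "Out"


def distinct_states(fleet: dict[int, ServerState]) -> int:
    """Number of different states in the fleet; 1 means fully ordered."""
    return len(set(fleet.values()))


# Initial, fully ordered fleet: three identical servers.
fleet = {n: ServerState("v1", "A", "On") for n in (1, 2, 3)}

# (server, field, new value, reason), in the order of the walkthrough.
events = [
    (2, "os", "v2", "security upgrade"),
    (3, "os", "v2", "security upgrade"),
    (1, "config", "B", "sysadmin A fixes slow disk access"),
    (2, "power", "Off", "turned off to save energy"),
    (3, "config", "C", "sysadmin B fixes slow disk access differently"),
    (3, "os", "v3", "desired change: trial upgrade"),
    (3, "power", "Out", "emergent change: lightning strike"),
]

history = [distinct_states(fleet)]
for server, field, value, why in events:
    # Frozen dataclasses are immutable, so build an updated copy.
    fleet[server] = replace(fleet[server], **{field: value})
    history.append(distinct_states(fleet))
    print(f"server {server}: {field} -> {value} ({why}); "
          f"distinct states: {history[-1]}")
```

The fleet starts with one shared state and ends with three, one per server, even though every individual change looked reasonable when it was made.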
Operators: Encode human-like knowledge
Operators = automated reconcile loops with “human-like” operational knowledge.
Coined in 2016 by Brandon Philips, back then at CoreOS.
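A reconcile loop of this kind can be sketched in a few lines. The sketch is a toy and not how any particular operator framework is implemented: `DESIRED`, `reconcile_once`, and `reconcile_fleet` are hypothetical names, the desired state is an arbitrary choice, and each "action" simply overwrites a dictionary entry where a real system would perform and then verify an operation (for example, replacing a power supply).

```python
# Hypothetical desired state shared by all servers (an assumption).
DESIRED = {"os": "v2", "config": "A", "power": "On"}


def reconcile_once(actual: dict[str, str], desired: dict[str, str]) -> list[str]:
    """One pass: move `actual` toward `desired` and return the actions taken."""
    actions = []
    # Illustrative operational knowledge: a server without power cannot be
    # upgraded or reconfigured, so power is handled before anything else.
    for key in ("power", "os", "config"):
        if actual[key] != desired[key]:
            actions.append(f"{key}: {actual[key]} -> {desired[key]}")
            actual[key] = desired[key]  # stands in for the real operation
    return actions


def reconcile_fleet(fleet, desired):
    """Reconcile every server; return the actions taken per server."""
    return {name: reconcile_once(state, desired) for name, state in fleet.items()}


# Actual state of the three servers at the end of the walkthrough.
fleet = {
    1: {"os": "v1", "config": "B", "power": "On"},
    2: {"os": "v2", "config": "A", "power": "Off"},
    3: {"os": "v3", "config": "C", "power": "Out"},
}

first_pass = reconcile_fleet(fleet, DESIRED)
second_pass = reconcile_fleet(fleet, DESIRED)  # already converged

for name, actions in first_pass.items():
    print(f"server {name}: {actions}")
print(f"second pass: {second_pass}")
```

The first pass reports the drift on every server and corrects it. The second pass finds nothing to do, which is the property that makes it safe to run such a loop continuously.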