Slide 1

Slide 1 text

How we operate an OpenStack Public Cloud
Jan Peschke · Teamlead Cloud @ SysEleven GmbH
Steffen Neubauer · Cloud Architect @ SysEleven GmbH

Slide 2

Slide 2 text

- Meltdown
- Spectre
- L1 Terminal Fault

Slide 3

Slide 3 text

Easy fix: Update
1. Install update
2. Reboot your computer
3. Repeat
Right?

Slide 4

Slide 4 text

Customer expectation
1. Always available
2. Secure

Slide 5

Slide 5 text

(Spoiler alert) 2500 reboots per year

Slide 6

Slide 6 text

Never change a running system?
- Because of security threats, we have to change it constantly!
- Regular small changes are better than rare, huge changes

Slide 7

Slide 7 text

Humans are smarter than machines?
- Humans make mistakes when doing repetitive tasks
- Example: accidentally rebooting two servers at the same time takes the storage down

Slide 8

Slide 8 text

How to reboot 2500 times?

Slide 9

Slide 9 text

Always be aware of the cluster state!
- Should-be state: Configuration management
- Actual state: Consul
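The comparison between should-be state and actual state can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function `find_drift`, the node names, and the service names are hypothetical, and the two dictionaries merely stand in for data that would come from configuration management and from Consul health checks.

```python
def find_drift(desired, actual):
    """Return {node: sorted list of services that should run but are not passing}.

    desired: services each node should run (stand-in for configuration management)
    actual:  services currently passing their checks (stand-in for Consul)
    """
    drift = {}
    for node, services in desired.items():
        passing = actual.get(node, set())
        missing = sorted(set(services) - passing)
        if missing:
            drift[node] = missing
    return drift


# Hypothetical example data
desired = {"compute1": {"nova-compute", "neutron-agent"}, "storage1": {"storage"}}
actual = {"compute1": {"nova-compute"}, "storage1": {"storage"}}

print(find_drift(desired, actual))  # {'compute1': ['neutron-agent']}
```

An empty result means the cluster matches its should-be state; any entry is a reason not to start a reboot.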

Slide 10

Slide 10 text

Actual state vs. should-be state
- Should-be state: Configuration management

Slide 11

Slide 11 text

1. Every node regularly checks whether a reboot is necessary
2. If yes, it acquires a lock (so nobody else can reboot)
3. If successful, it asks Consul whether all cluster services are OK
4. Local tasks are executed (e.g. live-migrate VMs)
5. Create monitoring downtime (Consul maintenance)
6. Run reboot command
7. Once the node is back:
8. Remove the downtime
9. Run local tasks (e.g. re-enable nova compute)
10. Check if all services are OK again
11. Release the lock
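The steps above can be sketched in Python as two phases, one before the reboot and one after the node is back. This is not the rebootmgr implementation: `FakeCluster`, `prepare_reboot`, and `finish_reboot` are hypothetical names, the in-memory class only stands in for Consul, and two behaviours are assumptions not stated on the slide (releasing the lock when the cluster is unhealthy before the reboot, and keeping the lock when the check after the reboot fails).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeCluster:
    """In-memory stand-in for Consul (hypothetical, for illustration only)."""
    lock_holder: Optional[str] = None
    healthy: bool = True
    in_maintenance: set = field(default_factory=set)

    def acquire_lock(self, node: str) -> bool:
        # Only one node may hold the reboot lock at a time.
        if self.lock_holder is None:
            self.lock_holder = node
        return self.lock_holder == node

    def release_lock(self, node: str) -> None:
        if self.lock_holder == node:
            self.lock_holder = None


def prepare_reboot(node, cluster, reboot_required, pre_tasks, log):
    """Steps 1-6. Returns True if the reboot command was issued."""
    if not reboot_required():            # 1. is a reboot necessary?
        return False
    if not cluster.acquire_lock(node):   # 2. get the lock
        return False
    if not cluster.healthy:              # 3. are all cluster services OK?
        cluster.release_lock(node)       # assumption: give the lock back
        return False
    for task in pre_tasks:               # 4. local tasks, e.g. live-migrate VMs
        task()
    cluster.in_maintenance.add(node)     # 5. create monitoring downtime
    log.append(f"reboot {node}")         # 6. placeholder for the reboot command
    return True


def finish_reboot(node, cluster, post_tasks):
    """Steps 7-11, run once the node is back. Returns True if the lock was released."""
    cluster.in_maintenance.discard(node)  # 8. remove the downtime
    for task in post_tasks:               # 9. local tasks, e.g. re-enable nova compute
        task()
    if not cluster.healthy:               # 10. are all services OK again?
        return False                      # assumption: keep the lock, block other reboots
    cluster.release_lock(node)            # 11. release the lock
    return True
```

With this shape, a second node calling `prepare_reboot` while the first still holds the lock simply gets `False` and tries again on its next regular check, which is what prevents two servers from rebooting at once.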

Slide 12

Slide 12 text


Slide 13

Slide 13 text

Open source since today: GitHub.com/syseleven/rebootmgr