Escaping Operations Hell

Imagine you are responsible for a system: it breaks, the user notices and informs you, but you do not have a clue what is going on. And guess what? The clock is ticking. Welcome to operations hell!

Two years ago we were in exactly that situation, and not only once. Let me tell you: it is as horrible as it sounds, and we never want to be there again. So we started our journey to escape this hell.

For sure, our system still breaks, but we have learned and improved a lot. Sometimes we can prevent the user from experiencing any impact at all, or at least keep the mean time to recover short.

Join me on this exciting journey from being called unexpectedly to preventing outages, and see what helped us escape and defeat all the demons we met along the way.

25.10.2019, code.talks, Hamburg, https://www.codetalks.de/
Video: https://www.youtube.com/watch?v=EuKLna3wDww

Silvia Schreier

October 25, 2019

Transcript

  1. Let me tell you a story…
    photo by hannah grace https://unsplash.com/photos/hIvsDdNT_f8
  2. a brave team of developers
    photo by Hugo L. Casanova https://unsplash.com/photos/GDre1q4wEJk
  3. a food wholesale online shop and fulfillment solution.
    photo by ja ma https://unsplash.com/photos/-gOUx23DNks
  4. It was supposed to run happily ever after. There were even users on the system.
    photo by Anders Jildén https://unsplash.com/photos/O85h02qZ24w
  5. But suddenly everything was on fire.
    photo by raquel raclette https://unsplash.com/photos/MYjFOiVWWT8
  6. ~250 Microservices
    Kubernetes
    Cassandra, Solr, Kafka, Postgres
    JVM (Java, Kotlin, Clojure), Go, Python, Node.js, React
    11 teams, ~80 developers
    Users are METRO customers & employees
    20 countries
    Depending on 20 external systems
  7. How to get out of here?
    photo by Carolina Pimenta https://unsplash.com/photos/ELO-NmuvFCM
  8. Information from the users: "Nothing is working." "The system is slow." "We have performance problems." "We see wrong data."
    photo by Jens Johnsson https://unsplash.com/photos/qFYBki6u3Ik
  9. Improve structure and flow of user information.
    photo by Kelly Sikkema https://unsplash.com/photos/SiOW0btU0zk
  10. At the beginning you might feel lost with all that data.
    photo by Rosie Fraser https://unsplash.com/photos/1L71sPT5XKc
  11. Which metrics are important?
    CPU, memory & network traffic
    Request count, response times and status codes
    Threads & their status
  12. Start building up your own toolbox and share it with others.
    photo by Hunter Haley https://unsplash.com/photos/s8OO2-t-HmQ
  13. You will get to know the patterns. Therefore, trust your gut feeling!
    photo by James McDonald https://unsplash.com/photos/GZMjMukr5zU
  14. How can we start fighting the fire?
    photo by Andrei Slobtsov https://unsplash.com/photos/7RfP8lLkHwI
  15. Have you tried turning it off and on again?
    photo by Aleksandar Cvetanovic https://unsplash.com/photos/cw_uvISXkCI
  16. Get to know your system and its critical components.
    photo by Michał Parzuchowski https://unsplash.com/photos/geNNFqfvw48
  17. This helps to understand dependencies and potential side effects.
    photo by Hunter Haley https://unsplash.com/photos/ZiQkhI7417A
  18. Additionally, you can prioritize better and find suitable workarounds!
    photo by Cupcake Media https://unsplash.com/photos/JfOT-WwO1Ig
  19. What if your monitoring says everything is fine? But actually it isn't?
    photo by Katya Austin https://unsplash.com/photos/4Vg6ez9jaec
  20. Your system is not alone! There is so much more around.
    photo by Bryan Goff https://unsplash.com/photos/RF4p4rTM-2s
  21. Get an overview of your landscape.
    photo by Silas Baisch https://unsplash.com/photos/bNpAPNJCHsY
  22. Start monitoring your system from different sources.
    photo by Donald Giannatti https://unsplash.com/photos/Wj1D-qiOseE
  23. There are coincidences you will never believe.
    photo by Brett Jordan https://unsplash.com/photos/4aB1nGtD_Sg
  24. What is happening around my system? | What is happening in my system? | Prevent failure proactively
  25. Check common early warning signs in past incidents and anomalies.
    photo by Michael Dam https://unsplash.com/photos/RF4p4rTM-2s
  26. Come up with alerts so you know before the user notices.
    photo by Liam Briese https://unsplash.com/photos/8iwplTLLSWg
  27. Fine-tune the alerts to reduce false positives.
    photo by Filip Barna https://unsplash.com/photos/SlIu4D_rTPo
  28. What is happening around my system? | What is happening in my system? | Prevent failure proactively | Data-driven operations
  29. Communication | Prevent failure proactively | What is happening around my system? | What is happening in my system? | Data-driven operations
  30. Communication | Responsibility | Prevent failure proactively | What is happening around my system? | What is happening in my system? | Data-driven operations
  31. You will write code differently if you might need to debug it at 3am!
    photo by Bailey Torres https://unsplash.com/photos/C5vBBUkyBss
  32. Communication | Responsibility | People | Prevent failure proactively | What is happening around my system? | What is happening in my system? | Data-driven operations
  33. Avoid politics! But learn the game!
    photo by Ricardo Gomez Angel https://unsplash.com/photos/w6diABfADkg
  34. Never underestimate knowing whom to ask to solve an issue.
    photo by Brittany Colette https://unsplash.com/photos/GFLMi4c8XMg
  35. Communication | Responsibility | Culture | Prevent failure proactively | What is happening around my system? | What is happening in my system? | Data-driven operations | People
  36. Post Mortems and Culture of Failure
    photo by Agence Olloweb https://unsplash.com/photos/d9ILr-dbEdg
  37. Communication | Responsibility | People | Culture | Continuous Improvement | Prevent failure proactively | What is happening around my system? | What is happening in my system? | Data-driven operations
  38. Start to get out of here!
    photo by JR Korpa https://unsplash.com/photos/-AsMlld5e2I
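
Slides 11, 26 and 27 mention request counts and status codes as key metrics, and alerts that should be tuned to reduce false positives. The talk does not show how the team built its alerts, so the following is only a minimal sketch of one common way to do it: alert on the share of 5xx responses, but only when several monitoring intervals in a row breach the threshold and each has enough traffic to be meaningful. The `Window` type, the function names and all default values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request statistics for one monitoring interval (hypothetical shape)."""
    requests: int
    server_errors: int  # responses with a 5xx status code


def error_rate(window: Window) -> float:
    """Share of requests that failed; 0.0 when there was no traffic."""
    if window.requests == 0:
        return 0.0
    return window.server_errors / window.requests


def should_alert(windows: list[Window], threshold: float = 0.05,
                 consecutive: int = 3, min_requests: int = 20) -> bool:
    """Fire only if the last `consecutive` windows all breach the threshold.

    Requiring several breaching windows in a row, each with at least
    `min_requests` requests, filters out short spikes and low-traffic noise.
    The default values are illustrative assumptions, not recommendations.
    """
    recent = windows[-consecutive:]
    if len(recent) < consecutive:
        return False
    return all(
        w.requests >= min_requests and error_rate(w) > threshold
        for w in recent
    )
```

With these assumed defaults, a single bad interval between two healthy ones does not fire, while three breaching intervals in a row do. Raising `consecutive` trades detection speed for fewer false positives, which is the kind of tuning slide 27 refers to.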