
Escaping Operations Hell

Imagine you are responsible for a system: it breaks, the users notice and inform you, but you do not have a clue what is going on. And guess what? The clock is ticking. Welcome to operations hell!

Two years ago we were in exactly that situation, and not just once. Let me tell you: it is as horrible as it sounds, and we never want to be there again. So we started our journey to escape this hell.

Of course, our system still breaks, but we have learned and improved a lot. Sometimes we can prevent users from experiencing any impact at all, or at least keep the mean time to recover short.

Join me on this exciting journey from being called unexpectedly to preventing outages, and see what helped us escape and defeat all the demons we met along the way.

25.10.2019, code.talks, Hamburg, https://www.codetalks.de/
Video: https://www.youtube.com/watch?v=EuKLna3wDww

Silvia Schreier

October 25, 2019

Transcript

  1. Escaping Operations Hell Silvia Schreier, @aivlis_s code.talks 2019, Hamburg photo

    by JR Korpa https://unsplash.com/photos/-AsMlld5e2I
  2. Let me tell you a story… photo by hannah grace

    https://unsplash.com/photos/hIvsDdNT_f8
  3. Once upon a time photo by Cederic X https://unsplash.com/photos/21DP3hytVHw

  4. a brave team of developers photo by Hugo L. Casanova

    https://unsplash.com/photos/GDre1q4wEJk
  5. received the mission photo by Ricardo Cruz https://unsplash.com/photos/P8LZaU52NME

  6. to develop and operate photo by Fleur https://unsplash.com/photos/dQf7RZhMOJU

  7. a food wholesale online shop and fulfillment solution. photo by

    ja ma https://unsplash.com/photos/-gOUx23DNks
  8. There were good days photo by Johann Siemens https://unsplash.com/photos/EPy0gBJzzZU

  9. and also bad days. photo by Dieter Pelz https://unsplash.com/photos/dQf7RZhMOJU

  10. It was supposed to run happily ever after. There were

    even users on the system. photo by Anders Jildén https://unsplash.com/photos/O85h02qZ24w
  11. But suddenly everything was on fire. photo by raquel raclette

    https://unsplash.com/photos/MYjFOiVWWT8
  12. Welcome to Operations Hell! photo by JR Korpa https://unsplash.com/photos/-AsMlld5e2I

  13. ~250 microservices; Kubernetes; Cassandra, Solr, Kafka, Postgres; JVM (Java, Kotlin, Clojure), Go, Python, Node.js, React; 11 teams; ~80 developers; users are METRO customers & employees; 20 countries; depending on 20 external systems
  14. How to get out of here? photo by Carolina Pimenta

    https://unsplash.com/photos/ELO-NmuvFCM
  15. Where is it burning? photo by Jens Johnsson https://unsplash.com/photos/qFYBki6u3Ik

  16. What is happening in my system?

  17. Information from the users Nothing is working. The system is

    slow. We have performance problems. We see wrong data. photo by Jens Johnsson https://unsplash.com/photos/qFYBki6u3Ik
  18. Improve structure and flow of user information. photo by Kelly

    Sikkema https://unsplash.com/photos/SiOW0btU0zk
  19. What is burning? photo by Jens Johnsson https://unsplash.com/photos/qFYBki6u3Ik

  20. Meanwhile the developers photo by Jan Voth https://janvoth.com/

  21. are searching through the logs. photo by Cristina Gottardi https://unsplash.com/photos/8hJQKRIQZMY

  22. You want monitoring! photo by Chris Leipelt https://unsplash.com/photos/4UgUpo3YdKk

  23. You want monitoring!

  24. At the beginning you might feel lost with all that

    data. photo by Rosie Fraser https://unsplash.com/photos/1L71sPT5XKc
  25. Which metrics are important? CPU, memory & network traffic; request count, response times and status codes; threads & their status
  26. Start building up your own toolbox and share it with

    others. photo by Hunter Haley https://unsplash.com/photos/s8OO2-t-HmQ
  27. You will get to know the patterns. Therefore, trust your

    gut feeling! photo by James McDonald https://unsplash.com/photos/GZMjMukr5zU
  28. Sometimes you might not understand graphs but recognize patterns.

  29. How can we start fighting the fire? photo by Andrei

    Slobtsov https://unsplash.com/photos/7RfP8lLkHwI
  30. Have you tried turning it off and on again?

    photo by Aleksandar Cvetanovic https://unsplash.com/photos/cw_uvISXkCI
  31. Get to know your system and its critical components. photo

    by Michał Parzuchowski https://unsplash.com/photos/geNNFqfvw48
  32. Know the business and the usage of your system.

  33. This helps to understand dependencies and potential side effects. photo

    by Hunter Haley https://unsplash.com/photos/ZiQkhI7417A
  34. Additionally, you can prioritize better and find suitable workarounds! photo

    by Cupcake Media https://unsplash.com/photos/JfOT-WwO1Ig
  35. What is happening around my system? What is happening in

    my system?
  36. What if your monitoring says everything is fine? But actually

    it isn't? photo by Katya Austin https://unsplash.com/photos/4Vg6ez9jaec
  37. Your system is not alone! There is so much more

    around. photo by Bryan Goff https://unsplash.com/photos/RF4p4rTM-2s
  38. Get an overview of your landscape. photo by Silas Baisch

    https://unsplash.com/photos/bNpAPNJCHsY
  39. Start monitoring your system from different sources. photo by Donald

    Giannatti https://unsplash.com/photos/Wj1D-qiOseE
  40. There are coincidences you will never believe. photo by Brett

    Jordan https://unsplash.com/photos/4aB1nGtD_Sg
  41. What is happening around my system? What is happening in

    my system? Prevent failure proactively
  42. Check past incidents and anomalies for common early warning signs.

    photo by Michael Dam https://unsplash.com/photos/RF4p4rTM-2s
  43. Come up with alerts so you know before the

    user notices. photo by Liam Briese https://unsplash.com/photos/8iwplTLLSWg
  44. Fine tune the alerts to reduce false positives. photo by

    Filip Barna https://unsplash.com/photos/SlIu4D_rTPo
  45. What is happening around my system? What is happening in

    my system? Prevent failure proactively Data-driven operations
  46. What is your current stability? photo by Harshal Desai https://unsplash.com/photos/0hCIrw8dVfE

  47. What are your SLIs? Response times; error rates; availability / latency; functional tests
  48. What is your target? photo by Annie Spratt https://unsplash.com/photos/t3IYuQZRDNE

  49. What are your SLOs? photo by Crystal Kwok https://unsplash.com/photos/9XsXOdkdxPQ

  50. Communication Prevent failure proactively What is happening around my system?

    What is happening in my system? Data-driven operations
  51. Direct communication is key. photo by Paweł Czerwiński https://unsplash.com/photos/-0xCCPIbl3M photo

    by Kirsty TG https://unsplash.com/photos/xmY3qMBfzBs
  52. Communication Responsibility Prevent failure proactively What is happening around my

    system? What is happening in my system? Data-driven operations
  53. Ensure clear responsibility and embrace it. photo by Anton Shuvalov

    https://unsplash.com/photos/tOJDsuU9MlE
  54. You will write code differently if you might need to

    debug it at 3am! photo by Bailey Torres https://unsplash.com/photos/C5vBBUkyBss
  55. Communication Responsibility People Prevent failure proactively What is happening around

    my system? What is happening in my system? Data-driven operations
  56. It is about people and cooperation. photo by Jan Voth

    https://janvoth.com/
  57. Build your network. Know your counterparts. photo by Jan Voth

    https://janvoth.com/
  58. Be nice! Collect karma and invest it carefully! photo by

    Jan Voth https://janvoth.com/
  59. Avoid politics! But learn the game! photo by Ricardo Gomez

    Angel https://unsplash.com/photos/w6diABfADkg
  60. Never underestimate knowing whom to ask to solve an issue.

    photo by Brittany Colette https://unsplash.com/photos/GFLMi4c8XMg
  61. Communication Responsibility Culture Prevent failure proactively What is happening around

    my system? What is happening in my system? Data-driven operations People
  62. Accept your system will fail! photo by chuttersnap https://unsplash.com/photos/cGXdjyP6-NU

  63. Be prepared. Have a plan. photo by Glenn Carstens-Peters https://unsplash.com/photos/RLw-UC03Gwc

  64. Post Mortems and Culture of Failure photo by Agence Olloweb

    https://unsplash.com/photos/d9ILr-dbEdg
  65. Communication Responsibility People Culture Continuous Improvement Prevent failure proactively What

    is happening around my system? What is happening in my system? Data-driven operations
  66. Don't panic! photo by Dharm Singh https://unsplash.com/photos/S2eX-jJSiOM

  67. Start to get out of here! photo by JR Korpa

    https://unsplash.com/photos/-AsMlld5e2I
  68. @aivlis_s @wearemetronom photo by Jan Voth https://janvoth.com/
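
The slides on SLIs and SLOs (47 to 49) name response times and error rates as indicators and ask for a target. As a minimal sketch of that idea, and not the tooling used in the talk, the following Python computes two SLIs from a batch of request records and compares them with SLO targets. The `Request` record, the function names, and all thresholds (1% errors, 95% of requests within 300 ms) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status: int


def error_rate(requests: Sequence[Request]) -> float:
    """SLI: fraction of requests answered with a 5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def fraction_within(requests: Sequence[Request], threshold_ms: float) -> float:
    """SLI: fraction of requests answered within threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.duration_ms <= threshold_ms)
    return fast / len(requests)


def check_slos(
    requests: Sequence[Request],
    max_error_rate: float = 0.01,      # assumed target: at most 1% errors
    min_fast_fraction: float = 0.95,   # assumed target: 95% of requests are fast
    threshold_ms: float = 300.0,       # assumed definition of "fast"
) -> Dict[str, bool]:
    """Return, per SLO, whether the measured SLI meets its target."""
    return {
        "error_rate": error_rate(requests) <= max_error_rate,
        "latency": fraction_within(requests, threshold_ms) >= min_fast_fraction,
    }


# 97 fast, successful requests and 3 slow server errors.
sample = [Request(120.0, 200)] * 97 + [Request(900.0, 503)] * 3
print(check_slos(sample))  # {'error_rate': False, 'latency': True}
```

In this sample the latency target is met, but the error-rate target is missed, which is the kind of signal that could feed an alert before users report a problem.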