SRE: the good, the bad, and the ouch

SRE sounds like a plan with no drawbacks … but making it work in practice can be trickier than the theory says. This talk shares stories of SRE wins and SRE accidents.

Holly Cummins

April 28, 2022

Transcript

  1. SRE: the good, the bad, and the ouch. Holly Cummins, Red Hat. WTF is SRE, April 28, 2022
  2. @holly_cummins PREVAIL Technical Conference 2021 #WTFisSRE confession: i am not an SRE, but some of my good friends are (Robert Barron, Cansu Kavılı Örnek)
  3. thanks for the stories, Robert and Cansu. Robert Barron (IBM Garage), Cansu Kavılı Örnek (Red Hat Open Innovation Labs)
  6. old ops: manual; repetitive; siloed; not aligned to business goals; unable to handle complexity of cloud native
  7. SRE: what ops would be like if it was done by software engineers (label: software engineer)
  9. SRE: what ops would be like if it was done by software engineers (labels: ops, software engineer)
  10. IBM Garage true story: “we are just as good: we have scripts” (the cunning rebrand)
  11. team mainframe: we’re responsible for stability of the mainframe; team mobile: we’re responsible for stability of the front end
  12. the ambassador; team mainframe: we’re responsible for stability of the mainframe; team mobile: we’re responsible for stability of the front end; we’re responsible for stability of the mainframe … as long as it’s used correctly
  13. IBM Garage true story: “we have a ticket per team, not per incident” (dots aren’t connected)
  14. “we want to do SRE but we don’t have enough permissions on our systems”
  15. “it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”
  17. IBM Garage true story: “we do post-mortems after every incident … maybe” (the gap between intent and reality)
  18. measure the number of incidents; measure the number of post-mortems; see if they match
  19. advanced metrics: how many people were in the post-mortem? does it include more than the people directly involved?
  20. advanced metrics: how many people were in the post-mortem? does it include more than the people directly involved? did we invite more than our own team?
  21. if involvement in an incident is punished, people will avoid engaging with systems
  22. “great idea, go build that!” if ideas are punished with extra work, people will try not to have ideas
  23. IBM Garage true story: “we count how many incidents we have; if the number goes down, it means we are working better” (the perverse incentive)
  24. IBM Garage true story: “we never seem to complete the work we planned” (the email timesink)
  25. 1 sprint. theory: 50% story points, 50% unplanned work (tickets). reality: 10% story points
  26. 1 sprint. theory: 50% story points, 50% unplanned work (tickets). reality: 10% story points, 50% tickets
  27. 1 sprint. theory: 50% story points, 50% unplanned work (tickets). reality: 10% story points, 50% tickets, 40% ??
  29. “can you just …” “how do I do this?” “where is this documented?”
  31. this wasn’t a team failure; it was a data quality issue; it was a process issue
  32. what is failure in a complex system? if a system goes down but user experience is fine, does that count?
  33. measure “what have I learned”; measure “have I made sure it won’t happen again”
  34. “we can’t ship until we have more confidence in the quality”: you can fix that
  35. IBM Garage true client story: “we can’t release this microservice… we deploy all our microservices at the same time… because otherwise nothing works.” (the monolithic microservices)
  36. IBM Garage true client story: “every time we change code, something breaks” (the peril of microservices)
  37. just because a system runs across 6 containers doesn’t mean it’s decoupled
  38. they tested it … but stubbed out one component. that component was the one that broke.
  39. you can’t a/b test a $370 million rocket; the ariane failed in 36 seconds
  40. remember this bank? team mainframe; team mobile: we’re responsible for stability of the front end
  41. remember this bank? team mainframe; team mobile: we’re responsible for stability of the front end; the ambassador; we’re responsible for stability of the mainframe … as long as it’s used correctly
  42. one team, range of techniques: canary deploys; CI/CD pipelines; one team; CI/CD pipelines; big-bang deploys onto AIX