Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The True Cost of Moving Fast & Breaking Things

Komodor
October 12, 2024

The True Cost of Moving Fast & Breaking Things

Komodor

October 12, 2024
Tweet

More Decks by Komodor

Other Decks in Technology

Transcript

  1. The Fast [DEV] & The Furious [CFO] The true cost

    of moving fast and breaking things
  2. The Region-Beta Paradox TL;DR: people are generally better off being

    in a really bad situation then a mediocre one Identifying you've outran your ability to do something can actually be a good thing, because it will motivate you to do something about it!
  3. Signs I Ran Too Fast devVelocity.then > devVelocity.now select *

    from developers_who_can_solve_this; – 1 line returned timeouts in query(find_rootcause) is taking more than expected and affecting business locate `$$$`
  4. Signs I Ran Too Fast Our development velocity is slowing

    down - what used to take us a fraction of the time now takes too long select * from developers_who_can_solve_this; – 1 line returned timeouts in query(find_rootcause) is taking more than expected and affecting business locate `$$$`
  5. Signs I Ran Too Fast Our development velocity is slowing

    down - what used to take us a fraction of the time now takes too long We all have that one engineer that knows how to fix that one thing that keeps on happening and without him everything goes bananas timeouts in query(find_rootcause) is taking more than expected and affecting business locate `$$$`
  6. Signs I Ran Too Fast Our development velocity is slowing

    down - what used to take us a fraction of the time now takes too long We all have that one engineer that knows how to fix that one thing that keeps on happening and without him everything goes bananas We spend too long trying to understand the root causes for production issues We can’t make sense of what’s draining our resources
  7. What KPIs Should I Measure, and Why? % of an

    engineering team solving critical bugs: how many people are actually solving the problem MTTR & MTTD: how long it takes to resolve an incident, once the RC is detected. How long to detect an incident in the first place? Onboarding to activation time: how long before a new dev becomes effective in the team and starts adding value (i.e ready to take on-call shifts) Costs: cloud providers, 3rd party integrations
  8. Oops, I debt it again… Case study - R&D MTTD

    increased in the last Quarter by 20% Question: Why does it take us longer to discover issues? 1. Lack of knowledge by the devs doing on-call duty? ◦ Train on-call engineers ◦ Document playbooks for what to do when such errors arise ◦ Use tools that democratize knowledge around that area 2. Lack of tooling to allow us to pinpoint the root cause quickly? ◦ Logs metrics and APMs for the win 3. Lack of processes to determine who should handle which errors? ◦ Are the right people looking at the issues?
  9. Don’t hit me one more time… Always add a better

    next - look at the bigger picture! • IF the problem was mitigated by creating metrics and logs for the issues that we’re currently facing • THEN install a process that enforces APMs logs or metrics for new features as part of a DoD for any delivery