The overnight failure

This talk is based on a true horror story.

Imagine your work week ends after releasing a set of features to production. Your team is happy and you feel good about yourself. A call about a problem with payments wakes you up the next morning. You find out your most valuable users were charged hundreds of times, maxing out the credit cards of some and leaving others in overdraft. They're angry because they can't even buy milk at the store.

Learn how a "perfect bug storm" caused the problem, how our processes failed to catch it, and how hard it was to win back our users' trust.

Presented at: RubyConf Taiwan 2016

Sebastian Sogamoso

December 02, 2016

Transcript

  24. 40.

    The story: Recap
    • Users carpool every day
    • The billing process is run once a week
    • It charges the passengers
    • And pays the driver
  26. 59.

    The failure
    • 06:00 Weekly process was run
    • 06:25 User couldn't buy milk and bread for breakfast
    • 06:34 Users reported bug
  27. 62.

    The failure
    Manager: Hey, sorry to call you this early, but we have a problem with payments in production and a lot of customers are complaining about it.
  28. 65.

    The failure
    • 06:00 Weekly process was run
    • 06:25 User couldn't buy milk and bread for breakfast
    • 06:34 Users reported bug
    • 06:43 Manager woke me up
  33. 70.

    The failure
    • 06:00 Weekly process was run
    • 06:25 User couldn't buy milk and bread for breakfast
    • 06:34 Users reported bug
    • 06:43 Manager woke me up
    • 06:58 Stopped processing payment jobs
  34. 79.

    The failure
    • 06:00 Weekly process was run
    • 06:25 User couldn't buy milk and bread for breakfast
    • 06:34 Users reported bug
    • 06:43 Manager woke me up
    • 06:58 Stopped processing payment jobs
    • 22:50 Deployed a fix to production
  35. 80.

    The failure
    • 06:00 Weekly process was run
    • 06:25 User couldn't buy milk and bread for breakfast
    • 06:34 Users reported bug
    • 06:43 Manager woke me up
    • 06:58 Stopped processing payment jobs
    • 22:50 Deployed a fix to production
    • 22:55 Started looking for a new job
  36. 81.

    The failure
    • 06:00 Weekly process was run
    • 06:25 User couldn't buy milk and bread for breakfast
    • 06:34 Users reported bug
    • 06:43 Manager woke me up
    • 06:58 Stopped processing payment jobs
    • 22:50 Deployed a fix to production
  37. 96.

    Lessons learned: Public Relationships
    The reputation of your product is going to be the average of customers’ experiences