The overnight failure

This talk is based on a true horror story.

Imagine your work week ends with a release of new features to production. Your team is happy and you feel good about yourself. The next morning, a call about a problem with payments wakes you up. You find out that your most valuable users were charged hundreds of times, maxing out some users' credit cards and leaving others in overdraft. They're angry because they can't even buy milk at the store.

Learn how a "perfect bug storm" caused the problem, how our processes failed to catch it, and how hard it was to win back our users' trust.

Presented at: RubyConf Taiwan 2016

Sebastian Sogamoso

December 02, 2016
Transcript

  1. The Overnight Failure

  3. My name is Sebastián. Tweet to me @sebasoga

  4. cookpad.com/tw

  14. #rubyfriends

  17. Made in Taiwan

  18. The Overnight Failure

  19. The story

  20. The story

  21. The story

  22. The story

  23. A B The story

  24. A B The story

  25. A B The story

  26. A B The story

  27. The story

  28. The story

  29. The story

  30. The story A B

  31. The story A B 1

  32. The story A B

  33. The story A B 2

  34. The story

  35. The story One week

  36. The story A B 9

  37. The story

  38. The story

  39. The story

  40. The story: Recap

    • Users carpool every day
    • The billing process is run once a week
    • It charges the passengers
    • And pays the driver
  41. The story 7

  42. The story 7

  43. The story 7

  44. The story 7 Driver Passenger Trips

  45. The story 7

  46. The story 7

  47. The story 7

  48. The story 7

  49. The failure

  50. The failure: Black Saturday

  51. 7 6:00am The failure

  52. The failure: 06:00 Weekly process was run

  53. The failure

  54. The failure 6:25am

  55. The failure 6:25am

  56. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
  57. The failure @#$&!! @#$&! @#$&!

  58. The failure 6:34am @#$&!! @#$&! @#$&! @#$&!!! @#$&!

  59. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
    06:34 Users reported bug
  60. The failure

  61. The failure 6:43am

  62. The failure Manager: Hey, sorry to call you this early, but we have a problem with payments in production and a lot of customers are complaining about it.
  63. The failure Me: Sure, I’ll take a look right away. Let’s talk over chat.
  64. The failure Me: (thinks he hung up) Me: F********k!!!!!! Manager: I’m still on the call
  65. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
    06:34 Users reported bug
    06:43 Manager woke me up
  66. 7

  67. 7 ✓

  68. 7

  69. 7

  70. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
    06:34 Users reported bug
    06:43 Manager woke me up
    06:58 Stopped processing payment jobs
  71. 7 The failure

  72. 7 The failure

  73. 7 The failure

  74. 7 The failure

  75. 7 The failure

  76. Driver Passenger Trips The failure

  77. The failure

  78. 7 The failure

  79. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
    06:34 Users reported bug
    06:43 Manager woke me up
    06:58 Stopped processing payment jobs
    22:50 Deployed a fix to production
  80. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
    06:34 Users reported bug
    06:43 Manager woke me up
    06:58 Stopped processing payment jobs
    22:50 Deployed a fix to production
    22:55 Started looking for a new job
  81. The failure

    06:00 Weekly process was run
    06:25 User couldn't buy milk and bread for breakfast
    06:34 Users reported bug
    06:43 Manager woke me up
    06:58 Stopped processing payment jobs
    22:50 Deployed a fix to production
  82. Lessons learned

  83. Lessons learned: Engineering • Support • Public Relations

  84. Lessons learned: Engineering

  85. Lessons learned (Engineering): Make sure you understand what caused the issue

  86. Lessons learned (Engineering): What other sorts of bugs are you hitting?

  87. Lessons learned (Engineering): Document your mistakes publicly

  88. Lessons learned: Support

  89. Lessons learned (Support): Respond fast while being careful

  90. Lessons learned (Support): Be proactive in reaching out to your users when something goes wrong

  91. Lessons learned (Support): Overcompensate for the damage you caused

  92. Lessons learned: Public Relations

  93. Lessons learned (Public Relations): Admit that you made a mistake

  94. Lessons learned (Public Relations): Apologize like a human

  95. Lessons learned (Public Relations): Tell the whole truth

  96. Lessons learned (Public Relations): The reputation of your product is going to be the average of customers’ experiences

  97. Lessons learned (Public Relations): Individual customers may not be forgiving, but a customer base is
  98. Notes from a fellow developer

  99. Embarrassing things happen

  100. It is all temporary

  101. If possible, take a few days off after dealing with a stressful incident
  102. sourcediving.com

  103. 拽拽 @sebasoga