The overnight failure

This talk is based on a true story.

Here I share the story of how I created a big bug and the lessons I learned from the experience, in the hope that when it happens to you (it eventually will), you are better prepared.

Presented at: RubyConf 2017

Sebastian Sogamoso

November 10, 2017

Transcript

  1. The Overnight Failure

  2. None
  3. None
  4. My name is Sebastián. Tweet to me @sebasoga

  5. cookpad.com

  6. Why?

  7. We all have broken the internet

  8. We normally don’t talk about it in public

  9. Impostor syndrome

  10. None
  11. None
  12. None
  13. None
  14. No matter the scale, our bugs affect our users’ lives

  15. We learn more from failure than success

  16. What’s the worst thing that could happen to you at work?
  17. The Overnight Failure

  18. How?

  19. None
  20. None
  21. None
  22. A B

  23. A B

  24. A B

  25. A B

  26. None
  27. None
  28. None
  29. None
  30. None
  31. A B 0

  32. A B A B 1

  33. A B A B 1

  34. A B 2

  35. None
  36. None
  37. A B 9

  38. Once a week

  39. None
  40. None
  41. None
  42. Once a week

  43. Once a week

  44. Once a week

  45. Once a week

  46. None
  47. Passenger Driver Total for trips

  48. Once a week

  49. Once a week

  50. Once a week

  51. Once a week

  52. Recap • Users carpooled every day • The payment process ran once a week • Passengers were charged • Drivers were paid
  53. What?

  54. Once a week 6:00am

  55. Saturday • 06:00 Weekly process was run

  56. Black Saturday

  57. None
  58. None
  59. None
  60. 6:25am

  61. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast
  62. @#$&! @#$&!

  63. 6:25am @#$&! @#$&!! @#$&! @#$&! @#$&!!!

  64. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug
  65. None
  66. 6:34am

  67. Boss: Hey, sorry to call you this early, but we have a problem with payments in production and a lot of customers are complaining about it
  68. Me: Sure, I’ll take a look right away. Let’s talk over chat
  69. Me: (thinking I ended the call) Me: F********k!!!!!!

  70. Me: (thinking I ended the call) Me: F********k!!!!!! Boss: I’m still on the line
  71. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up
  72. Lots of duplicated charges

  73. Once a week

  74. Once a week

  75. Once a week

  76. Once a week

  77. Refunded charges

  78. Once a week

  79. Reversed transfers

  80. Once a week 7:28am

  81. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained
  82. None
  83. None
  84. Passenger: User ID: 9 Driver: User ID: 100 $10.00

  85. Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 Passenger: User ID: 9 Driver: User ID: 100 $10.00 0 0 0
  86. Once a week

  87. Once a week

  88. Once a week

  89. Wrote tests and a fix, and deployed it 10:50pm

  90. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production
  91. None
  92. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production • 22:55 Started looking for a new job
  93. Black Saturday • 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production
  94. Thousands of users affected by the bug • Users were charged up to 200 times • A single user was charged over $5k • Maxed out credit cards • Emptied bank accounts
  95. Refunds take up to 5 business days

  96. Reached out to users to offer an expedited reimbursement option

  97. Why?

  98. Embarrassing things happen

  99. Tests won’t save you

  100. Code review won’t save you

  101. QA won’t save you

  102. Software is built by humans

  103. We need to make admitting mistakes easy

  104. Trust that we won’t be judged

  105. Make sure you understand what happened

  106. Move slooooow

  107. Document the problem

  108. Document the fix

  109. Document the lesson learnt

  110. Don’t git blame

  111. You are not your failures

  112. No one will care about the bugs you create…

  113. When you die

  114. It’s all temporary

  115. #RubyConf

  116. We are hiring

  117. The end @sebasoga
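
The transcript does not say what caused the duplicated charges on slide 72, so the following is only a sketch of one common safeguard, not the fix deployed in the talk. It assumes the weekly process from the recap (slide 52) totals each passenger's trips per driver and charges once a week, and it makes that run idempotent: a repeated run for the same week and the same passenger/driver pair is skipped. `Trip`, `FakeGateway`, `WeeklyPaymentRun`, and the week label are hypothetical names invented for illustration. The user IDs and the $10.00 total echo slide 84, and splitting that total into two trips is an assumption.

```ruby
require "set"

# Hypothetical trip record: one ride shared by a passenger and a driver.
Trip = Struct.new(:passenger_id, :driver_id, :amount_cents)

# Hypothetical stand-in for a payment provider; it only records charges.
class FakeGateway
  attr_reader :charges

  def initialize
    @charges = []
  end

  def charge(passenger_id:, driver_id:, amount_cents:)
    @charges << { passenger_id: passenger_id, driver_id: driver_id, amount_cents: amount_cents }
  end
end

# Hypothetical weekly process: total the week's trips per passenger/driver
# pair and charge each pair at most once per week.
class WeeklyPaymentRun
  def initialize(gateway)
    @gateway = gateway
    # Assumption: an in-memory set stands in for durable storage, such as a
    # database table with a unique index on (week, passenger, driver).
    @processed = Set.new
  end

  def call(week:, trips:)
    totals = Hash.new(0)
    trips.each { |trip| totals[[trip.passenger_id, trip.driver_id]] += trip.amount_cents }

    totals.each do |(passenger_id, driver_id), amount_cents|
      # Set#add? returns nil when the key is already present, so a pair
      # that was already charged for this week is skipped.
      next unless @processed.add?([week, passenger_id, driver_id])

      @gateway.charge(passenger_id: passenger_id, driver_id: driver_id, amount_cents: amount_cents)
    end
  end
end

trips = [Trip.new(9, 100, 500), Trip.new(9, 100, 500)]
gateway = FakeGateway.new
run = WeeklyPaymentRun.new(gateway)

run.call(week: "week-1", trips: trips)
run.call(week: "week-1", trips: trips) # accidental second run of the same week

puts gateway.charges.length
```

Keying the guard on the week plus the passenger/driver pair means a retry or an accidental second run cannot charge the same pair again. In a real system the key would have to live somewhere that survives restarts and concurrent runs, which this in-memory sketch does not attempt.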