The overnight failure

This talk is based on a true horror story.

It is very likely that you too have "broken the internet" at some point in your career. Here I share the story of how I did it this time and the lessons I learned from this experience.

Presented at: wroc_love.rb 2017

Sebastian Sogamoso

March 18, 2017

Transcript

  1. The Overnight Failure

  2. My name is Sebastián • Tweet to me @sebasoga

  4. Why?

  5. We have all broken the internet

  6. We normally don’t talk about it in public

  7. Impostor syndrome

  12. No matter the scale, our bugs affect our users’ lives

  13. Big problems are usually the situations in which we learn the most

  14. What’s the worst thing that could happen to you at work?

  15. The Overnight Failure

  16. How?

  20. A B

  21. A B

  22. A B

  23. A B

  29. A B 0

  30. A B A B 1

  31. A B A B 1

  32. A B 2

  35. 9 A B

  36. Once a week

  40. Once a week

  41. Once a week

  42. Once a week

  43. Once a week

  45. Passenger • Driver • Total for trips

  46. Once a week

  47. Once a week

  48. Once a week

  49. Once a week

  50. Recap • Users carpooled every day • Billing process ran once a week • Passengers were charged • Drivers were paid

  51. What?

  52. Once a week 6:00am

  53. Saturday • 06:00 Weekly process ran

  54. Black Saturday

  58. 6:25am

  59. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast

  60. @#$&! @#$&!

  61. 6:25am @#$&! @#$&!! @#$&! @#$&! @#$&!!!

  62. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug

  64. 6:34am

  65. Manager: Hey, sorry to call you this early, but we have a problem with payments in production and a lot of customers are complaining about it

  68. * photo provided by my former boss

  69. Me: Sure, I’ll take a look right away. Let’s talk over chat

  70. Me: (thinking I ended the call) F********k!!!!!!

  71. Me: (thinking I ended the call) F********k!!!!!! • Manager: I’m still on the call

  72. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up

  73. Lots of duplicated charges

  74. Once a week

  75. Once a week

  76. Once a week

  77. Once a week

  78. Refunded charges

  79. Once a week

  80. Reversed transfers

  81. Once a week 7:28am

  82. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained

  85. Passenger: User ID 9 • Driver: User ID 100 • $10.00

  86. Passenger: User ID 9 • Driver: User ID 100 • $10.00 (the same charge repeated over and over)

  87. Once a week

  88. Once a week

  89. Once a week

  90. Wrote tests and a fix, and deployed it • 10:50pm

  91. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production

  92. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production • 22:55 Started looking for a new job

  93. Black Saturday • 06:00 Weekly process ran • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production

  94. Thousands of users affected by the bug • Users were charged up to 200 times • A single user was charged over $5k • Maxed out credit cards • Emptied bank accounts

  95. And people still say Ruby is slow

  96. Refunds take up to 5 business days

  97. Reached out to users to offer an expedited reimbursement option

  98. Why?

  99. Embarrassing things happen

  100. We need to make admitting mistakes easy

  101. Trust that we won’t be judged

  102. Make sure you understand what happened

  103. Move slow

  104. Document the problem

  105. Document the fix

  106. Document the lesson learnt

  107. You are not your failures

  108. It’s all temporary

  109. #ibrokeshit

  110. cookpad.com

  111. sourcediving.com

  112. We are hiring

  113. Dzięki @sebasoga
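
The weekly billing flow from the recap slide (one total per passenger/driver pair, charged once a week) and the duplicated charges seen on Black Saturday can be illustrated with a minimal Ruby sketch. The talk does not show the real code or name the root cause, so everything here is an assumption for illustration: `Trip`, `weekly_totals`, and `duplicated_charges` are hypothetical names, and amounts are stored in cents.

```ruby
# Hypothetical sketch: none of these names come from the talk.
Trip = Struct.new(:passenger_id, :driver_id, :amount_cents)

# Sum the week's trips into one total per passenger/driver pair,
# so that each pair should produce exactly one charge and one payout.
def weekly_totals(trips)
  trips.group_by { |trip| [trip.passenger_id, trip.driver_id] }
       .transform_values { |pair_trips| pair_trips.sum(&:amount_cents) }
end

# Given a list of [passenger_id, driver_id, amount_cents] charges,
# return the ones that appear more than once, with how often they appear.
def duplicated_charges(charges)
  charges.tally.select { |_charge, count| count > 1 }
end

trips = [
  Trip.new(9, 100, 500),
  Trip.new(9, 100, 500),
  Trip.new(7, 100, 300)
]
totals = weekly_totals(trips)

# Simulate a run that emitted the same charge twice for one pair.
charges = [[9, 100, 1000], [9, 100, 1000], [7, 100, 300]]
duplicates = duplicated_charges(charges)
```

A check like `duplicated_charges` could run against a dry-run of the weekly job before any money moves; whether the team adopted anything similar is not stated in the slides.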