The Overnight Failure

This talk is based on a true horror story.

It is very likely that you too have caused a big problem in production at some point in your career, whether by introducing a bug or by running the wrong command. Here I share the story of how I did it this time and the lessons I learned from the experience.

Presented at: Euruko 2017

Sebastian Sogamoso

September 30, 2017

Transcript

  1. Szia

  2. My name is Sebastián. Tweet to me @sebasoga

  3. !

  4. None
  5. None
  6. None
  7. The Overnight Failure

  8. Why?

  9. We all have broken the internet

  10. We normally don’t talk about it in public

  11. Impostor syndrome

  12. None
  13. None
  14. None
  15. None
  16. No matter the scale, our bugs affect our users’ lives

  17. Big problems are usually the situations in which we learn the most
  18. What’s the worst thing that could happen to you at work?
  19. The Overnight Failure

  20. How?

  21. None
  22. None
  23. None
  24. A B

  25. A B

  26. A B

  27. A B

  28. None
  29. None
  30. None
  31. None
  32. None
  33. A B 0

  34. A B A B 1

  35. A B A B 1

  36. A B 2

  37. None
  38. None
  39. 9 A B

  40. Once a week

  41. None
  42. None
  43. None
  44. Once a week

  45. Once a week

  46. Once a week

  47. Once a week

  48. None
  49. Passenger Driver Total for trips

  50. Once a week

  51. Once a week

  52. Once a week

  53. Once a week

  54. Recap • Users carpooled every day • The payment process ran once a week • Passengers were charged • Drivers were paid
  55. What?

  56. Once a week 6:00am

  57. Saturday: 06:00 Weekly process was run

  58. Black Saturday

  59. None
  60. None
  61. None
  62. 6:25am

  63. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast
  64. @#$&! @#$&!

  65. 6:25am @#$&! @#$&!! @#$&! @#$&! @#$&!!!

  66. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug
  67. None
  68. 6:34am

  69. Boss: Hey, sorry to call you this early, but we have a problem with payments in production and a lot of customers are complaining about it
  70. Me: Sure, I’ll take a look right away. Let’s talk over chat
  71. Me (thinking I ended the call): F********k!!!!!!

  72. Me (thinking I ended the call): F********k!!!!!! Boss: I’m still on the line
  73. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up
  74. Lots of duplicated charges

  75. Once a week

  76. Once a week

  77. Once a week

  78. Once a week

  79. Refunded charges

  80. Once a week

  81. Reversed transfers

  82. Once a week 7:28am

  83. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained
  84. None
  85. None
  86. Passenger: User ID: 9 • Driver: User ID: 100 • $10.00

  87. Passenger: User ID: 9 • Driver: User ID: 100 • $10.00 (the same charge, repeated over and over)
  88. Once a week

  89. Once a week

  90. Once a week

  91. Wrote tests and a fix, and deployed it at 10:50pm

  92. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production
  93. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production • 22:55 Started looking for a new job
  94. Black Saturday: 06:00 Weekly process was run • 06:25 User couldn’t pay for breakfast • 06:34 Users reported bug • 06:43 Manager woke me up • 07:28 Problem contained • 22:50 Deployed a fix to production
  95. None
  96. Thousands of users affected by the bug • Users were charged up to 200 times • A single user was charged over $5k • Maxed out credit cards • Emptied bank accounts
  97. And Matz thinks Ruby is slow…

  98. Refunds take up to 5 business days

  99. Reached out to users to offer an expedited reimbursement option

  100. Why?

  101. Embarrassing things happen

  102. Tests won’t save you

  103. Code review won’t save you

  104. QA won’t save you

  105. Software is built by humans

  106. We need to make admitting mistakes easy

  107. Trust that we won’t be judged

  108. Make sure you understand what happened

  109. Move slow

  110. Document the problem

  111. Document the fix

  112. Document the lesson learnt

  113. Don’t git blame

  114. You are not your failures

  115. It’s all temporary

  116. #EuRuKo2017

  117. Kösz @sebasoga
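
The slides describe a weekly process that totals trips per passenger and driver and then charges the passenger, and an incident in which the same charge was made many times. The transcript does not say what caused the duplicates, so the following is only a minimal sketch of one common guard, an idempotency key per week and passenger/driver pair. Every name in it (`Trip`, `WeeklySettlement`, the `"week-1"` label, the key format) is hypothetical and not taken from the talk.

```ruby
require "set"

# Hypothetical trip record: who rode with whom, and the fare in cents.
Trip = Struct.new(:passenger_id, :driver_id, :amount_cents)

# Hypothetical weekly settlement; an illustration, not the speaker's code.
class WeeklySettlement
  attr_reader :charges

  def initialize
    # Assumption: in a real system this would be a unique index in the
    # database rather than an in-memory set.
    @processed_keys = Set.new
    @charges = []
  end

  # Sums the week's trips per passenger/driver pair and records one charge
  # per pair. Running it again for the same week adds nothing.
  def run(week, trips)
    totals = Hash.new(0)
    trips.each { |t| totals[[t.passenger_id, t.driver_id]] += t.amount_cents }

    totals.each do |(passenger_id, driver_id), total_cents|
      key = "#{week}:#{passenger_id}:#{driver_id}"
      # Set#add? returns nil when the key is already present.
      next unless @processed_keys.add?(key)

      @charges << { key: key, passenger_id: passenger_id,
                    driver_id: driver_id, amount_cents: total_cents }
    end
    @charges
  end
end

trips = [Trip.new(9, 100, 500), Trip.new(9, 100, 500)]
settlement = WeeklySettlement.new
settlement.run("week-1", trips)
settlement.run("week-1", trips) # accidental second run of the same week
puts settlement.charges.length
```

With this guard, a second run for the same week leaves the passenger with a single charge for the week's total instead of a repeated one.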