
Peeking Behind the Curtain: The Anatomy of a Major Incident

A largely unredacted true-story recap of a real major incident that occurred at PagerDuty. The incident is shared in full to demonstrate how we approach system failures with response and postmortem processes you can use with your own teams.

George Miranda

September 10, 2019

Transcript

  1. This is a true story. The events in this presentation took place in San Francisco and Toronto in January 2017.
  2. • We learn via transparency
     • The story is largely unredacted
     • Some details have been omitted for brevity
     • PagerDuty the company, NOT the product
  3. THE SERVICES
     • Db2Kafka Service: publishes change events from the web monolith to Kafka for other services to consume
     • Incident Log Entries Service: stores log entries for incidents
     Both run on Docker and Mesos / Marathon on top of the Linux kernel.
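     For readers unfamiliar with this pattern, a minimal sketch of what a change-event publisher like Db2Kafka does, using the kafka-python client. The broker address, topic name, and event payload below are illustrative placeholders, not PagerDuty's actual configuration.

     # Sketch: publish a change event to Kafka (placeholder broker and topic).
     import json
     from kafka import KafkaProducer  # kafka-python client

     producer = KafkaProducer(
         bootstrap_servers=["kafka-broker:9092"],                  # placeholder broker address
         value_serializer=lambda v: json.dumps(v).encode("utf-8"),
     )

     # Illustrative change event captured from the web monolith's database.
     event = {"table": "incidents", "op": "update", "id": 12345}
     producer.send("change-events", value=event)                   # placeholder topic name
     producer.flush()                                              # block until the event is delivered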
  4. THE PEOPLE
     Major incident response principal roles:
     • Eric: Incident Commander
     • Ken: Deputy
     • Peter: Scribe
     • Luke: Communications Liaison
     Subject Matter Experts (SMEs):
     • David: Core on-call
     • Cees: Core eng
     • Evan: SRE on-call
     • Renee: IM People on-call
     • Zayna: Mobile on-call
     • JD: IM Data on-call
     • Priyam: EM on-call
  5. [3:21 PM]
     David (SME): !ic page
     Officer URL (chat bot): Paging Incident Commander(s)
       ✔ Eric has been paged.
       ✔ Ken has been paged.
       ✔ Peter has been paged.
       Incident triggered in the following service: https://pd.pagerduty.com/services/PERDDFI
     David (SME): db2kafka is down, and I'm not sure what's going on
     [3:21 PM]
     Eric (IC): Taking IC [incident command]
     David kicked off the major incident process; Eric took the Incident Commander role (he was the IC primary on-call).
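     The !ic page chat command above is what pages the Incident Commanders. As a rough illustration of triggering a PagerDuty incident programmatically, here is a sketch against the public Events API v2; the routing key and payload values are placeholders, and this is not the chatbot's actual implementation.

     # Sketch: trigger a PagerDuty incident via the Events API v2 (placeholder routing key).
     import requests

     resp = requests.post(
         "https://events.pagerduty.com/v2/enqueue",
         json={
             "routing_key": "YOUR_EVENTS_V2_ROUTING_KEY",  # placeholder integration key
             "event_action": "trigger",
             "payload": {
                 "summary": "db2kafka is down",
                 "source": "db2kafka",
                 "severity": "critical",
             },
         },
         timeout=10,
     )
     resp.raise_for_status()
     print(resp.json())  # response includes a dedup_key for later acknowledge/resolve events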
  6. THE INCIDENT COMMANDER
     • The Wartime General: decision maker during a major incident
     • GOAL: drive the incident to resolution quickly and effectively
     • Gather data: ask subject matter experts to diagnose and debug various aspects of the system
     • Listen: collect proposed repair actions from SMEs
     • Decide: decide on a course of action
     • Act (via delegation): once a decision is made, ask the team to act on it. The IC should always delegate all diagnosis and repair actions to the rest of the team.
  7. [3:22 PM]
     Priyam (SME): I’m here from EM [event management]
     Evan (SME): lmk if you need SRE
     Ken (DEPUTY): @renee, please join the call
     Ken took the deputy role, and other Subject Matter Experts joined.
  8. THE DEPUTY (BACKUP IC)
     • The Sidekick: right-hand person for the IC
     • Monitor the status of the incident
     • Follow up on time-boxed tasks
     • Be prepared to page other people
     • Keep the IC focused on the incident
  9. [3:23 PM]
     Ken (DEPUTY): Notifications are still going out, subject lines are filled in but not email bodies (they use incident log entries)
     Peter (SCRIBE): I am now the scribe
     Eric (IC): Looking to find Mesos experts
     Evan (SME): Looking for logs & dashboards
     [3:24 PM]
     Zayna (SME): seeing a steady rise in crashes in the Android app around trigger incident log entries
     JD (SME): No incident log entries will be generated due to the Log Entry Service not being able to query db2kafka
     [3:25 PM]
     Eric (IC): David, what have you looked at?
     David (SME): trolling logs, see errors
     David (SME): tried restarting, doesn’t help
     Peter became the scribe (Ken had been covering both deputy and scribe); Renee and the team discussed the customer-visible impact of the incident.
  10. THE SCRIBE
     • The Record-keeper
     • Add notes to the chatroom when findings are determined or significant actions are taken
     • Add TODOs to the room that indicate follow-ups for later (generally after the incident)
     • Generate notes that will be helpful when reconstructing the timeline of events
  11. [3:26 PM]
     Peter (SCRIBE): Eric: What’s the customer impact?
     David (SME): No incident log entries can be created
     Renee (SME): no incident details, error msg in the UI
     [3:27 PM]
     Peter (SCRIBE): Eric: Comms rep on the phone? Luke
     Peter (SCRIBE): Eric to Luke: Please compose a message to our customers
     [3:29 PM]
     Luke (CUST LIAISON): suggested message: "There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating."
     Peter (SCRIBE): Luke to post the message [to our status page and Twitter]
     The Incident Commander asked the customer liaison to write a message to customers, and the message was sent out to customers.
  12. THE COMMUNICATIONS LIAISON
     • The link to the customer
     • Monitor customer and business impact
     • Provide regular updates to customers
     • Provide regular updates to business stakeholders
  13. [3:34 PM]
     Peter (SCRIBE): David added Cees to the incident
     Eric (IC): Is there a runbook for mesos?
     David (SME): Yes, but not for this issue.
     [3:36 PM]
     Eric (IC): @cees Would you join us on the bridge? We have a few Mesos questions
     Cees (SME): I’m away from any laptops, just arrived at a pub for dinner.
     Evan (SME): might need to kick new hardware if system is actually unreachable.
     Evan (SME): replica01 is reachable
     David (SME): replica02 is not reachable.
     David (SME): replica03 is not reachable.
     David (SME): only 3 replicas for mesos
     Eric (IC): We are down to only one host
     Evan (SME): Seeing some stuff. Memory exhaustion.
     [3:37 PM]
     Peter (SCRIBE): TODO: Create a runbook for mesos to stop the world and start again
     The scribe captured a TODO to record and remember a follow-up that should happen after the incident is resolved. We paged a Mesos expert (Cees) who was not on-call, and he joined the chat.
  14. [3:38 PM]
     Peter (SCRIBE): David: Only 3 replicas in that cluster, we have another cluster in another region. Eric: Two options: kick more slaves or restart marathon.
     [3:39 PM]
     Peter (SCRIBE): Evan: OOM killer has kicked in on replica01
     [3:41 PM]
     Peter (SCRIBE): Eric to Evan: please reboot replica02 and replica03. Restart replicas first.
     [3:42 PM]
     Evan (SME): replica01 is now down
     [3:44 PM]
     Peter (SCRIBE): David: Consider bringing up another cluster? Cees: Should be doable pretty quickly
     [3:47 PM]
     Peter (SCRIBE): Eric: Stop replicas in region1, startup db2kafka in region1. Evan: replica02 is alive! Eric: Waiting 2 minutes.
     They considered bringing up another Mesos cluster in another region, but replica02 came back up after reboot, so they held off on flipping to region2. They also noticed that oom-killer (the out-of-memory killer) had killed the docker process on replica01.
  15. [3:49 PM]
     Peter (SCRIBE): Evan: Replica02 is quiet. Evan: Replica02 is trying to start, exiting with code 137. Evan: Replica03 is quiet. Evan: 137 means it’s being killed by OOM; OOM is killing docker containers continuously.
     [3:53 PM]
     Peter (SCRIBE): Proposed Action: David is going to configure marathon to allow more memory
     [3:54 PM]
     Peter (SCRIBE): Proposed Action: Evan to force reboot replica01
     [3:55 PM]
     Luke (CUST LIAISON): Customer impact: there are 4 tickets so far and 2 customers chatting with us, which is another 2 tickets
     [3:56 PM]
     Peter (SCRIBE): David: Db2kafka appears to be running. Eric: Looks like all things are running. Renee: Things are fine with notifications. JD: Log Entry Service is seeing progress.
     They realized the problem: oom-killer was killing the docker containers over and over. The resolution action was to redeploy db2kafka with a higher cgroup/Docker memory limit: 2 GB (vs 512 MB before), as sketched below. The customer liaison provided an update on the customer impact, and the system began recovering.
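     For a sense of what "configure marathon to allow more memory" can look like, here is a sketch that raises an app's memory allocation through Marathon's REST API. The Marathon endpoint and app id are hypothetical, and this is not necessarily how the change was applied during the incident.

     # Sketch: raise a Marathon app's memory limit (hypothetical Marathon URL and app id).
     import requests

     MARATHON = "http://marathon.internal:8080"  # hypothetical Marathon endpoint
     APP_ID = "/db2kafka"                        # hypothetical Marathon app id

     # Partial update: raise the app's memory allocation (in MiB) from 512 to 2048.
     resp = requests.put(
         f"{MARATHON}/v2/apps{APP_ID}",
         json={"id": APP_ID, "mem": 2048},
         params={"force": "true"},               # override a stuck deployment if needed
         timeout=10,
     )
     resp.raise_for_status()
     print(resp.json())  # Marathon returns the deployment id for the rolling restart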
  16. INCIDENT SUMMARY: Contributing Factors
     • An increase in traffic caused db2kafka to increase its memory usage
     • This caused the Linux oom-killer to kill the process
     • Then Mesos / Marathon immediately restarted it, it ramped up memory again, oom-killer killed it, and so on
     • After this restart-kill cycle repeated multiple times, we hit a race-condition bug in the Linux kernel, causing a kernel panic and killing the host
     • Other services running on the host were impacted, notably the Log Entries Service
  17. RESPONDER ROLES
     Incident Commander
     • Most important role, crucial to fast decision making and action
     • Takes practice and experience
     Deputy
     • Do anything necessary to keep the Incident Commander focused
     • Tracks follow-ups & action items (e.g. IC says "Evan, do X, report back in 5 mins")
     Scribe
     • Provides context in the chatroom (not a stenographer)
     • Produces realtime documentation helpful for the post-mortem
     Communications Liaison
     • Communicates status to customers & stakeholders
  18. POST-MORTEM PROCESS
     1. Pick an owner for the post-mortem (David, the first responder on the Core team, became the owner)
     2. Schedule the post-mortem meeting (scheduled for Jan 11; the incident happened on Jan 6)
     3. Create a draft of the post-mortem report
     4. Conduct the post-mortem meeting
     5. Finalize the post-mortem report based on the discussion in the meeting
     Recommendations: have a fast turnaround for the post-mortem, with an SLA of 3-5 days. One owner only; the owner can pull in additional resources as needed.
  19. THE POST-MORTEM REPORT
     Written by the owner ahead of the PM meeting:
     • Description of what happened + timeline
     • Customer and business impact
     • Root cause / contributing factors
     • How we resolved the incident
     Discussed and created in the PM meeting:
     • What went well, what didn’t go so well
     • Action items
  20. THE TIMELINE
     Time (UTC)   Event
     22:50        Spike in ile_groups_requests traffic
     22:50:26     First instance of OOM-killer killing db2kafka
     22:57        First page to Core on-call (David)
     22:57        David acknowledges the page
     23:09        replica03 freezes
     23:12        Incident manually escalated to Core Secondary (Joseph)
     23:18        Incident manually assigned to Cees
     23:20        SEV-2 incident manually triggered
     23:25        Ken starts trying to reach Cees (not on-call)
     23:26        Discussion about customer impact
     23:29        "Investigating issue" tweet from support account
     23:29        Ken Rose starts trying to reach Div (not on-call)
     23:30:39     replica02 freezes
     23:34        Evan says that replica03 is unreachable
     23:35        Cees joins call
     23:37:00     Evan says replica01 is reachable, 02 and 03 are not. David confirms.
     23:39        Evan says OOM killer has kicked in on slave01
     23:39        Rich says replica02 is failing status check in cloud provider
     Time-To-Detect was 7 minutes and Time-To-Acknowledge was under 1 minute; the incident was escalated to SEV-2 in 23 minutes. David tried to get help from the secondary on-call and from Cees.
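     The detection and acknowledgement numbers above are simple arithmetic over the timeline; a small sketch of that calculation, using the timestamps from the table:

     # Sketch: derive Time-To-Detect and Time-To-Acknowledge from the timeline above.
     from datetime import datetime

     fmt = "%H:%M:%S"
     first_failure = datetime.strptime("22:50:26", fmt)  # first OOM kill of db2kafka
     first_page    = datetime.strptime("22:57:00", fmt)  # first page to Core on-call (David)
     acknowledged  = datetime.strptime("22:57:00", fmt)  # David acknowledges within the same minute
     sev2_declared = datetime.strptime("23:20:00", fmt)  # SEV-2 incident manually triggered

     print("Time-To-Detect:", first_page - first_failure)      # ~7 minutes
     print("Time-To-Acknowledge:", acknowledged - first_page)  # under 1 minute at this resolution
     print("Time to SEV-2:", sev2_declared - first_page)       # ~23 minutes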
  21. THE TIMELINE (continued)
     Time (UTC)   Event
     23:41:00     Eric requests Evan to reboot replica02 and replica03
     23:41        replica01 is frozen. All replicas now frozen.
     23:42        Evan says replica01 is now unreachable as well
     23:43        Discussion about Chef changes affecting Mesos, and discussion about bringing db2kafka up in another Mesos cluster
     23:46:28     replica02 is back up
     23:46:37     replica03 is back up
     23:49        Evan reports replica02 failing to start containers; they are exiting with code 137
     23:51        Evan reports code 137 is "killed by OOM"
     23:53        Start db2kafka with 2 GB allowed memory; David takes action
     23:53        Force reboot replica01; Evan takes action
     23:53:43     Last instance of OOM-killer killing db2kafka
     23:56        David confirms db2kafka running. Discussion about backlog, other services catching up.
     23:59:08     replica01 is back up
     00:05        David reports db2kafka has cleared backlog
     00:10        Recovery tweet posted
     00:11        Other services have caught up / are catching up fine. Call closed.
     Over this span the issue is diagnosed, remediation actions are taken, the system recovers, and the incident is fully resolved.
  22. UNDERSTANDING THE INCIDENT
     What happened / contributing factors: determined to be an increase in traffic, or possibly larger records, causing db2kafka to increase its memory usage. It was exacerbated by the Linux kernel bug causing Mesos slaves to freeze up.
     Resolution: the temporary resolution was to increase db2kafka's cgroup/Docker memory limit to 2 GB from 512 MB. Longer-term, we should investigate why this memory usage increased.
  23. INCIDENT IMPACT
     • Time in SEV-2: 1h 9m
     • Time in SEV-1: 0
     • Notifications delivered out of SLA: 0
     • Customers experiencing 500s on the incident details page: 1%
     • Customers experiencing an error message on the /incidents page when clicking the "Show Details" link: 2%
     • Customers experiencing errors in the Android app: 1%
     • Support requests raised: 8
  24. THE POST-MORTEM MEETING
     1. Go through the timeline
     2. Ensure everyone agrees the timeline is accurate and comprehensive
     3. Talk about what the group has learned from going through the timeline; this can include but is not limited to action items. Do this in a brainstorming format.
     4. Talk about what went well and what didn’t go so well; capture this in the post-mortem report
     5. Create the list of action items
  25. INCIDENT LEARNINGS
     WHAT WENT WELL
     • Non on-call Core team members were able to help resolve the issue
     • SRE on-calls were very helpful in identifying issues and restarting boxes
     • Notifications still went out, with less detailed bodies
     WHAT DIDN'T GO SO WELL
     • The Core primary on-call was unsure about when to initiate SEV-2 incident response due to the lack of a db2kafka SLA
     • The failing db2kafka caused cascading failures in other Mesos-hosted applications
     • Various UI components and the Android app did not handle missing log entries well (lack of graceful degradation)
  26. ACTION ITEMS (Summary | Created | Status | Resolution)
     • Android app shouldn't crash when no log entries are viewable | 7-Jan-17 | DONE | Done
     • Ensure all UI/API components degrade gracefully when ILE / ALEs are missing | 11-Jan-17 | DONE | Done
     • Reproduce the cgroup oom-killer race condition in Mesos, verify it does not exist with kernel 3.14 | 11-Jan-17 | DONE | Done
     • Update the db2kafka ops guide with how-to-start-without-mesos instructions | 11-Jan-17 | DONE | Done
     • Exponentially backoff db2kafka restart time (see the sketch below) | 11-Jan-17 | DONE | Done
     • Investigate and fix possible unbounded memory usage in db2kafka | 11-Jan-17 | DONE | Done
     • Upgrade Mesos slave boxes to kernel >= 3.14 | 11-Jan-17 | DONE | Won't Do
     • Update the operations guide and define an SLA for db2kafka | 11-Jan-17 | DONE | Done
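     The action item "Exponentially backoff db2kafka restart time" is the kind of change that breaks the tight OOM-kill/restart loop described in the incident summary. A minimal sketch of the idea, where restart_service is a hypothetical hook rather than PagerDuty's actual tooling:

     # Sketch: exponential backoff between restart attempts, to avoid a tight kill/restart loop.
     import time

     def restart_service() -> bool:
         """Hypothetical hook: attempt to restart db2kafka and report success."""
         ...

     def restart_with_backoff(base_delay=1.0, max_delay=300.0, max_attempts=10):
         """Retry restarts with exponentially growing waits instead of a tight loop."""
         delay = base_delay
         for attempt in range(1, max_attempts + 1):
             if restart_service():
                 return True
             print(f"restart attempt {attempt} failed; waiting {delay:.0f}s")
             time.sleep(delay)
             delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay
         return False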
  27. THE PUBLIC POST-MORTEM
     • Create a higher-level version of the PM for public consumption
     • 3 sections: Summary; What happened?; What are we doing about this?
     • Posted on our status page: https://status.pagerduty.com/incidents/510k1bnvwv6g
  28. • Sharing helps us figure out how to improve
     • Use the Incident Command System
     • Publishing a post-mortem is what closes an incident
     • Use open-source resources to develop your own plan (or contribute to the community!)
  29. FURTHER RESOURCES
     • Incident Command: https://response.pagerduty.com
     • Blameless Post-Mortems: https://postmortems.pagerduty.com
     • Stakeholder Comms: https://stakeholders.pagerduty.com
     • Business Incident Response: https://business-response.pagerduty.com
     • Simulated Incident Call & more: https://ops-guides.pagerduty.com