Slide 158
•Step 0 - Incident classification, including SEV descriptions and levels,
the SEV timeline, and the TTD timeline.
•Step 1 - Organization-wide critical service monitoring, including key
dashboards and KPI metrics emails.
•Step 2 - Service ownership and metrics, including measuring TTD by
service, service triage, service ownership, building a service ownership
service (SOS), and service alerting.
•Step 3 - On-Call Principles, including the Pareto principle, rotation
structure, alert threshold maintenance, and escalation practices.
•Step 4 - Chaos Engineering, including chaos days and continuous chaos.
•Step 5 - Self-Healing Systems, including when automation incidents
occur, and monitoring and metrics for self-healing system automation.
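
As a rough illustration of "measuring TTD by service" from Step 2, here is a minimal sketch. It assumes TTD means time to detect, measured from when an incident started to when it was first detected. The incident records, service names, and the function `mean_ttd_minutes_by_service` are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (service, started_at, detected_at).
# Real data would come from an incident tracker; these values are made up.
INCIDENTS = [
    ("checkout", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    ("checkout", datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 10)),
    ("search", datetime(2024, 1, 3, 12, 0), datetime(2024, 1, 3, 12, 1)),
]


def mean_ttd_minutes_by_service(incidents):
    """Return the mean time to detect, in minutes, for each service."""
    ttds = defaultdict(list)
    for service, started_at, detected_at in incidents:
        # Assumed definition: TTD = detection time minus incident start time.
        ttds[service].append((detected_at - started_at).total_seconds() / 60)
    return {service: mean(values) for service, values in ttds.items()}


if __name__ == "__main__":
    for service, ttd in sorted(mean_ttd_minutes_by_service(INCIDENTS).items()):
        print(f"{service}: mean TTD {ttd:.1f} min")
```

Grouping by service rather than reporting one organization-wide number makes it easier to see which service owners need better alerting.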