Definitive Guide to GameDays - Road to Resilient Services

Definitive Guide to GameDays: Road to Resilient Services
Ho Ming Li (@HoReaL), Solutions Architect @ Gremlin

Ho Ming Li (@HoReaL), Solutions Architect
SW Development, Release Engineering, QA Engineering, Professional Services, Enterprise Support, Solutions Architecture

What is Chaos Engineering?

Thoughtful, planned, experiments designed to reveal weaknesses in your services

Business Continuity Plan Disaster Recovery Failover

Unplug Cables

Chaos Experiment = Micro Exercise

See ACTUAL behaviours of your service under failure scenarios

What is a GameDay?

Dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your services

Goal: Build (& Operate) Resilient Services

Run GameDays

Goal: Run your own GameDay

Why Run a GameDay?

Why do I want to run a GameDay?

Business Owners: Cost of Downtime

Managers: Retain Talent

Engineers: Please stop paging me!

It’s a Team Game

"Managing sensitive projects: a lateral approach" by Olivier d'Herbemont, Bruno César, Tom Curtin, Pascal Etcheber

Group | Who | Topics to expect | Useful Topics
Influencers | Engineers On-Call | How to practice CE | Continuous Chaos
Waverers | Senior Management | What is CE | Cost of Downtime
Passives | Engineering Managers | Why practice CE | Incident & on-call reduction
Moaners | Specific Individuals | “I’m too busy” | We don’t learn by “always doing things the way we’ve always done them”
Opponents | Support | “There’s already so much chaos” | Impact of incident management
Fanatics | Specific Individuals | “I believe in unit tests” | CE is unit tests for alerting and monitoring
Skeptics | Specific Individuals | “We won’t get value from this” | Defence protection & training
Mutineers | Specific Individuals | “We don’t need to do this” | Data on top 5 most unreliable services & focus on resilience

WHAT ARE THEIR MAJOR CHALLENGES? What could they gain by
collaborating with you?

“Sharpen your Saw”

Reduce Incident Severity and Frequency 10x

Planning a GameDay

Who’s in?

It’s a Team Game

THE Chaos Crew to make it happen
Executive: CTO / VP Engineering | Budget / Objectives
Executive Assistant / Organizer | Invitations / Coordination
Engineering Director / Manager | Prioritization / E. Availability
Engineers / Subject Matter Expert | Architecture / Experiments
New Hires / Interns | Learning / New Perspectives

When & Where? Date, Time, Venue, Environment

Service about to Launch? Major re-architecture? Wait till known issues are fixed?

Other Priorities? There is never a good time. “Sharpen your Saw”: Start Finding Time Now

Better In-Person

Choose an Environment: Staging or Production?

Start in Staging, Mature to Production

lim (delta → 0) of (PRODUCTION + delta)

Plan Experiments → Run Experiments → Analyze Results

Do This Pre-GameDay

Be Thoughtful

Pick 1 Service

What are your top 5 critical services? What was your previous outage?

Define Experiments

Whiteboard Service Architecture List Internal/External Dependencies

What could go wrong?

How to (re)Create Failure Scenario?

Scoping: Think about the Blast Radius

Start small, THEN dial up

DON’T START HERE

START HERE
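The "start small, then dial" idea can be sketched as a ramp of target counts. This is only an illustration: the function name `blast_radius_steps` and the percentage stages (1%, 5%, 25%, 50%) are assumptions, not values from the deck. Each stage should only run after the previous one met expectations.

```python
def blast_radius_steps(total_hosts, percents=(1, 5, 25, 50)):
    """Yield how many hosts to target at each stage, smallest first.

    The percentages are illustrative defaults, not recommendations
    from the deck. Integer math avoids floating-point surprises.
    """
    previous = 0
    for percent in percents:
        # Always target at least one host so the first stage does something.
        count = max(1, total_hosts * percent // 100)
        # Skip stages that would not actually grow the blast radius.
        if count > previous:
            yield count
            previous = count


if __name__ == "__main__":
    for hosts in blast_radius_steps(200):
        print(f"Next stage: attack {hosts} of 200 hosts")
```

On a small fleet the early stages collapse into a single host, which matches the "START HERE" end of the scale.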

Safety First At what point do you stop the experiment?
Think about the Customers
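One way to make "at what point do you stop?" concrete is to agree on abort thresholds before the GameDay and check them during each experiment. The sketch below is hypothetical: `AbortCondition`, `breached_conditions`, the metric names, and the threshold values are all assumptions for illustration, and a real setup would read the observed values from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class AbortCondition:
    metric: str       # hypothetical metric name, e.g. "error_rate"
    threshold: float  # halt the experiment when the observed value exceeds this


def breached_conditions(observed, conditions):
    """Return the metric names whose abort threshold was exceeded.

    A metric missing from `observed` is treated as breached, on the
    assumption that losing visibility is itself a reason to stop.
    """
    return [
        c.metric
        for c in conditions
        if observed.get(c.metric, float("inf")) > c.threshold
    ]


if __name__ == "__main__":
    conditions = [
        AbortCondition("error_rate", 0.05),
        AbortCondition("p99_latency_ms", 800),
    ]
    breached = breached_conditions(
        {"error_rate": 0.01, "p99_latency_ms": 900}, conditions
    )
    if breached:
        print("Halt the experiment:", ", ".join(breached))
```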

Be Thoughtful

Experiments should validate...

Application: Retries, Timeouts, Fallbacks

Monitoring: Metrics, Log Events (don’t forget KPIs)

Alerting: Paging System

People: Incident Response
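For the application layer, the behaviour an experiment would validate looks roughly like the sketch below: retry a dependency call with a timeout, then fall back when the retries are exhausted. The helper `call_with_resilience` and its parameters are hypothetical, and the `operation` callable is assumed to accept a timeout in seconds and raise on failure.

```python
import time


def call_with_resilience(operation, fallback, attempts=3, timeout_s=0.5,
                         backoff_s=0.0):
    """Try `operation(timeout_s)` up to `attempts` times, then use `fallback()`.

    Catching Exception broadly keeps the sketch short; a real client
    would catch only the timeout and connection errors it expects.
    """
    for attempt in range(attempts):
        try:
            return operation(timeout_s)
        except Exception:
            if attempt < attempts - 1:
                # Exponential backoff between retries (no sleep when backoff_s is 0).
                time.sleep(backoff_s * (2 ** attempt))
    return fallback()


if __name__ == "__main__":
    def dependency_down(timeout_s):
        # Stands in for a dependency under an injected failure.
        raise ConnectionError("injected failure")

    print(call_with_resilience(dependency_down, lambda: "cached response"))
```

During a GameDay, the question is whether the real service behaves this way when the dependency is actually degraded, and whether monitoring and alerting show it.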

Think End-to-End

Don’t Over-complicate Experiments

Be Thoughtful

Write it all down!
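A lightweight way to "write it all down" is one record per experiment, filled in before the GameDay and extended with observations during it. The `ExperimentPlan` class and its field names are assumptions chosen to mirror the planning steps above, not a template from the deck.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    service: str
    failure_scenario: str
    hypothesis: str
    blast_radius: str
    abort_condition: str
    observations: list = field(default_factory=list)

    def summary(self):
        """Render the plan as plain text for a shared doc or chat channel."""
        lines = [
            f"Service: {self.service}",
            f"Failure scenario: {self.failure_scenario}",
            f"Hypothesis: {self.hypothesis}",
            f"Blast radius: {self.blast_radius}",
            f"Abort condition: {self.abort_condition}",
        ]
        lines += [f"Observation: {note}" for note in self.observations]
        return "\n".join(lines)


if __name__ == "__main__":
    plan = ExperimentPlan(
        service="checkout",  # example values only
        failure_scenario="cache unreachable",
        hypothesis="requests fall back to the database",
        blast_radius="1 host in staging",
        abort_condition="error rate above 5%",
    )
    plan.observations.append("fallback kicked in, no alert fired")
    print(plan.summary())
```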

It’s GameDay!

It’s a Team Game

Communicate, Review Plan, Execute

Lean In

Take Time

How long did it take to launch? How long until the service recovers? How much time is left before the Fallback fails?
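The recovery question can be answered from timestamped health checks recorded during the experiment. This is a minimal sketch under assumptions: `seconds_to_recover` is a hypothetical helper, and `samples` is assumed to be a time-sorted list of `(timestamp_seconds, healthy)` pairs collected by whatever health check you already have.

```python
def seconds_to_recover(samples, attack_end):
    """Return seconds from `attack_end` until health is restored and stays restored.

    Returns None if the service was never healthy through the final sample.
    """
    recovered_at = None
    for timestamp, healthy in samples:
        if timestamp < attack_end:
            continue  # ignore samples taken while the attack was running
        if not healthy:
            recovered_at = None  # flapped: the earlier recovery did not hold
        elif recovered_at is None:
            recovered_at = timestamp
    return None if recovered_at is None else recovered_at - attack_end


if __name__ == "__main__":
    samples = [(0, True), (10, False), (20, False), (30, True),
               (40, False), (50, True), (60, True)]
    print(seconds_to_recover(samples, attack_end=15))
```

Requiring recovery to hold through the last sample avoids recording an optimistic number when the service flaps.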

Write it all down!

Validation / Discovery: 80/20

How do you ensure that your service Stays Resilient Over Time?

Met Expectations? Automate Experiments. Re-Run Experiments.

Plan your Next GameDay

GameDay Starter Pack https://www.gremlin.com/gameday/

The End Beginning

“Sharpen your Saw”

It’s a Team Game

Be Thoughtful

Safely, Thoughtfully, Collaboratively Break Things On Purpose. Go Run Your Own GameDay!

Break things together! Join us. Learn from us. Teach us.
Chaos Engineering Community Slack (https://tinyurl.com/chaoseng)

Thank You! Ho Ming Li (@HoReaL), Solutions Architect, Gremlin; Chaos Engineering Community Member
