Slide 1

Slide 1 text

GameDays: Practice Thoughtful Chaos Engineering (Gamer Edition)
Ho Ming Li, Lead Solutions Architect @ Gremlin

Slide 2

Slide 2 text

Ho Ming Li (@HoReaL)
- Lead Solutions Architect @ Gremlin
- Ex-AWS, NetApp, IBM
- Ran 10+ GameDays with 5+ companies
- A retired gamer

Slide 3

Slide 3 text

Rainbows Unicorns

Slide 4

Slide 4 text

Reality Fire Fighting

Slide 5

Slide 5 text

Own Your User’s Experience Because Everything Fails

Slide 6

Slide 6 text

Chaos Engineering: thoughtful, planned experiments designed to reveal weaknesses in your systems.

Slide 7

Slide 7 text

Chaos Engineering: Engineer Chaos… on your own terms

Slide 8

Slide 8 text

Like a vaccine, we inject harm to build immunity.

Slide 9

Slide 9 text

GameDay: dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your systems.

Slide 10

Slide 10 text

GameDay Let’s Run some Chaos Experiments Together Day

Slide 11

Slide 11 text

“Why do we do what we do?”

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

“Computers aren’t the thing. They’re the thing that gets us to the thing.” - Halt and Catch Fire
Chaos Engineering isn’t the thing. It’s the thing that gets us to Resilience.

Slide 14

Slide 14 text

Know your Objective

Slide 15

Slide 15 text

Why resilience matters - Motivations
- Business Case: Avoid very costly downtime.
- Engineering Case: Improve quality and increase agility.
- On-Call Case: Avoid constant fire fighting and pager fatigue.

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Level 1 Level 80+

Slide 20

Slide 20 text

Challenge to Scale.

Slide 21

Slide 21 text

Inevitably die a horrible death.

Slide 22

Slide 22 text

Let’s do… Active-Active Multi-Region AI-Enabled Detection with Auto-magic Recovery System Using Request Tokens on Globally Decentralized Blockchain
But… can you withstand a critical host going down right now?

Slide 23

Slide 23 text

Start at your level. Make progress.

Slide 24

Slide 24 text

Start Small Start Simple

Slide 25

Slide 25 text

Now, this is You
- Done your homework
- Ready to go
Go take on the world alone?

Slide 26

Slide 26 text

Form your Party
Everyone brings something different to the table.
Together Everyone Achieves More

Slide 27

Slide 27 text

Chaos Party
- CTO / VP Engineering: Budget / Objectives
- Organizer: Invitations / Coordination
- Engineering Director / Manager: Prioritization / Engineer Availability
- Engineers / Subject Matter Expert: Architecture / Experiments
- New Hires / Interns: Learning / New Perspectives
- Other Stakeholders / Observers: Understand Impact on their Functions

Slide 28

Slide 28 text

It’s more Fun with Friends

Slide 29

Slide 29 text

“Where do we Start?”

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Start in Staging
Mature to Production

Slide 33

Slide 33 text

Anatomy of a GameDay
- GameDay
  - Experiment #1
    - Attack (Inject Failure)
    - Attack
    - ...
  - Experiment #2
    - Attack
    - Attack
    - ...
  - Experiment #3
    - Attack
    - Attack
    - ...
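
The hierarchy on this slide (a GameDay contains experiments, and each experiment contains one or more attacks) can be sketched as a small data model. This is an illustrative sketch only: the class names, fields, and the sample hypotheses and targets are assumptions, not something the deck or any tool defines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attack:
    """One injected failure. Names and targets here are illustrative."""
    name: str
    target: str


@dataclass
class Experiment:
    """A hypothesis plus the attacks used to test it."""
    hypothesis: str
    attacks: List[Attack] = field(default_factory=list)


@dataclass
class GameDay:
    """A dedicated session that groups several experiments."""
    title: str
    experiments: List[Experiment] = field(default_factory=list)

    def attack_count(self) -> int:
        # Total number of attacks across all experiments.
        return sum(len(e.attacks) for e in self.experiments)


def example_gameday() -> GameDay:
    # Hypothetical plan: three experiments, four attacks in total.
    return GameDay(
        title="GameDay #1",
        experiments=[
            Experiment(
                hypothesis="Service stays up when one host is lost",
                attacks=[
                    Attack("terminate-host", "web-1"),
                    Attack("terminate-host", "web-2"),
                ],
            ),
            Experiment(
                hypothesis="Slow database only slows the user a little",
                attacks=[Attack("inject-latency", "db")],
            ),
            Experiment(
                hypothesis="A full disk triggers an alert",
                attacks=[Attack("fill-disk", "log-volume")],
            ),
        ],
    )
```

Writing the plan down in a structure like this before the session makes it easier to agree on scope and to compare one GameDay with the next.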

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

What could go wrong? What went wrong?

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

What is the scope of the attack?

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

DON’T START HERE

Slide 41

Slide 41 text

START HERE

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

“But I know already what will happen.”

Slide 46

Slide 46 text

World Cup 2018
Guess which team won?
If they play again, do you know which team will win?
vs vs vs

Slide 47

Slide 47 text

World Cup 2018
Guess which team won?
If they play again, do you know which team will win?
1(3) - 1(4), 0 - 2, 0 - 1

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

Stop Guessing Prove it!

Slide 50

Slide 50 text

Scenarios and Attacks
- Scenario: Bad Hosts Going Offline. Attack: Terminate Hosts
- Scenario: Service is Slow. Attack: Inject Latency
- Scenario: Workload runs Hot. Attack: Consume CPU
- Scenario: Logs not Rotating. Attack: Fill up Disk
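
To make the "Service is Slow" scenario concrete, here is a minimal in-process stand-in for the "Inject Latency" attack. It is a sketch under assumptions: the tools named in this deck typically inject latency at the network or host level, whereas this decorator simply sleeps before calling the wrapped function. The names `with_injected_latency` and `fetch_profile` are hypothetical.

```python
import functools
import time


def with_injected_latency(delay_s: float):
    """Return a decorator that delays every call by delay_s seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Simulate a slow dependency before doing the real work.
            time.sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_injected_latency(0.02)
def fetch_profile(user_id: int) -> dict:
    # Hypothetical downstream call; returns canned data for the sketch.
    return {"id": user_id, "name": "player-one"}
```

Calling `fetch_profile(7)` now takes at least 20 ms longer than the undecorated function, which is enough to observe how callers, timeouts, and dashboards react to a slow dependency in a test environment.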

Slide 51

Slide 51 text

What to use?
- Bare Hand
- Knife
- Machine Gun
- Sniper Rifle
- Shotgun

Slide 52

Slide 52 text

Equip Yourself
- Chaos Monkey
- Gremlin
- Pumba
- Toxiproxy
- Chaos Toolkit
- PowerfulSeal

Slide 53

Slide 53 text

“What am I looking at?”

Slide 54

Slide 54 text

Observability
Get rid of the Fog of War so you can clearly see the map and strategize accordingly.
Gain Deep Insight with:
- Metrics
- Logging
- Request Tracing

Slide 55

Slide 55 text

Win/Lose Pass/Fail
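
One way to turn an experiment into a clear Pass/Fail result is to agree on thresholds before the attack and compare the observed metrics against them afterwards. The sketch below assumes two metrics (error rate and p99 latency) and hypothetical names `Observation`, `Thresholds`, and `evaluate`; the deck does not prescribe which metrics or limits to use.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Metrics observed while the attack was running (assumed fields)."""
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float    # 99th percentile latency in milliseconds


@dataclass
class Thresholds:
    """Limits agreed on before the experiment started."""
    max_error_rate: float
    max_p99_latency_ms: float


def evaluate(obs: Observation, limits: Thresholds) -> str:
    """Return 'pass' if every metric stayed within its limit, else 'fail'."""
    within_limits = (
        obs.error_rate <= limits.max_error_rate
        and obs.p99_latency_ms <= limits.max_p99_latency_ms
    )
    return "pass" if within_limits else "fail"
```

For example, with limits of 1% errors and 500 ms p99, an observation of 0.5% errors and 320 ms passes, while 0.5% errors and 900 ms fails.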

Slide 56

Slide 56 text

Level Up
Manual Runs
- Exploratory
- GameDays
Automated Runs
- Scheduled Runs
- Include in Pipeline

Slide 57

Slide 57 text

GameDay #1 .. N
- GameDay is not just a one-time event
- Think about the next GameDay
- Track and measure success over time

Slide 58

Slide 58 text

Practice Practice Practice

Slide 59

Slide 59 text

“I don’t need Chaos, I do Serverless”

Slide 60

Slide 60 text

Application Level User Experience

Slide 61

Slide 61 text

Chaos All the Things
A very loose breakdown of “an application”:
- “Edge”: DNS, CDN, Cloudflare
- “Front End”: LB, API, Nginx
- “Back End”: MySQL, Kafka, ES
- “Infrastructure”: Kubernetes, Container, Virtual Machine, Physical Server, Data Center
"AWS Lambda protects you against some infrastructure failures, but you still need to defend against weakness in your own code." -- Yan Cui, Principal Engineer at DAZN

Slide 62

Slide 62 text

Don’t Forget the Human

Slide 63

Slide 63 text

GameDay Findings Teaser

Slide 64

Slide 64 text

Sort of Expected
Attack: Disconnect from DynamoDB.
Hypothesis: Frontend gets a 5XX Error from Backend.

Slide 65

Slide 65 text

Magnified Wait
App Server to Database slowed
Attack: Inject a small amount of latency between services.
Hypothesis: The slowness users experience roughly equals the injected delay.
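
The "Magnified Wait" title suggests the user-visible delay turned out larger than the injected one. The slide does not say why; one plausible mechanism, assumed here purely for illustration, is that a single user request makes several sequential calls to the database, so each call pays the injected delay again. The function name and the numbers are hypothetical.

```python
def user_visible_delay_ms(injected_ms: float, sequential_calls: int) -> float:
    """Extra delay a user sees when each sequential call pays injected_ms.

    Assumes the calls run one after another, not in parallel.
    """
    if sequential_calls < 0:
        raise ValueError("sequential_calls must be non-negative")
    return injected_ms * sequential_calls


# Hypothetical example: 100 ms injected, 10 sequential queries per page.
extra_ms = user_visible_delay_ms(100.0, 10)
```

Under these assumptions a 100 ms injection adds a full second per page load, which is why the hypothesis of "roughly the injected delay" is worth testing instead of assuming.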

Slide 66

Slide 66 text

I can see this, but I can’t see that
Requiring Database to process messages from Queue
Attack: Consumer cannot connect to Database.
Hypothesis: Consumer can no longer process messages.

Slide 67

Slide 67 text

Loosely coupled… or Not
Orchestrating Containers in Microservices
Attack: Container dies.
Hypothesis: Orchestrator will spawn new container.
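
A hypothesis such as "the orchestrator will spawn a new container" can be checked by polling until the running count returns to the desired value or a deadline passes. The sketch below is orchestrator-agnostic and rests on an assumption: `running_count` is a hypothetical callable you would supply yourself (for example, one that queries your orchestrator's API); no real orchestrator client is used here.

```python
import time
from typing import Callable


def wait_for_recovery(
    running_count: Callable[[], int],
    desired: int,
    timeout_s: float,
    poll_s: float = 0.5,
) -> bool:
    """Poll running_count until it reaches desired or timeout_s elapses.

    Returns True if the system recovered in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if running_count() >= desired:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The boolean result maps directly onto the hypothesis: True means the replacement container appeared within the agreed window, False means it did not and the finding deserves a follow-up.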

Slide 68

Slide 68 text

Game On! Start your Journey
tinyurl.com/chaoseng
meetup.com/pro/chaos