Doing SRE the right way - 2

Understanding the SRE mindset and tools
Doing SRE the Right Way, Part 2/3

Piyush Verma CTO, last9.io @realmeson10 2

FAQs Pulumi or Terraform? Automated vs automatic? Ansible or K8s?
Serverless or needless codeless? 3

I used to believe these were the answers to reliability,
and that things would begin to fail less. But... 4

Will it make a sound, when it breaks? 5

Would this? 6

Would this? 7

Would this? 8

Would this?
meson10@meson10-xps-9370$: LAST9_ORG=last9 make launch
[2020/06/20 11:27:40:2869] N: liblast9 1.6.0-c15cfb7
[2020/06/20 11:27:40:2870] N: tty configuration:
[2020/06/20 11:27:40:2870] N: start command: poetry run
[2020/06/20 11:27:40:2870] N: close signal: SIGHUP (1)
[2020/06/20 11:27:40:2872] N: Using foreign event loop...
[2020/06/20 11:27:40:2873] N: Listening on port: 7681 9

If a tree falls in a forest and no one
is around to hear it, does it make a sound? 10

If a bug happens and no user is around to
see it, should it raise an alert? 11
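One way to read this question is as a paging rule: page only when enough real traffic is failing. The sketch below is a minimal illustration, not the speaker's method. The names `Window` and `should_page` and both thresholds (`slo_error_rate`, `min_requests`) are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one evaluation window."""
    total_requests: int
    failed_requests: int


def should_page(window: Window,
                slo_error_rate: float = 0.01,
                min_requests: int = 100) -> bool:
    """Page only when enough users are affected to matter.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if window.total_requests < min_requests:
        # Too little traffic to claim user impact: record it, do not page.
        return False
    error_rate = window.failed_requests / window.total_requests
    return error_rate > slo_error_rate
```

With these assumed thresholds, five failures out of ten requests does not page, while fifty failures out of a thousand does.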

You got paged, now what? 12

Journey of an Incident 13

~7:30 AM, 25 hours before a country launch, PagerDuty goes
off 14

… Elasticsearch shows some 5-6 5xx requests 15

Logs come in at 1 mbps, with no Correlation-ID to isolate
the failing requests 16
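A correlation ID is what would have made those few failing requests easy to pull out of the log stream. The sketch below shows one possible way to stamp every log line with a per-request ID using Python's standard `logging` and `contextvars` modules. The names `correlation_id`, `CorrelationFilter`, `build_logger`, and `handle_request` are hypothetical, and nothing here reflects the setup used in the incident.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose lines are prefixed with the correlation ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger("api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    request_id = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        logger.info("request started")
        logger.error("upstream returned 503")
    finally:
        # Restore the previous value so IDs never leak across requests.
        correlation_id.reset(token)
    return request_id
```

Every line emitted while handling one request then starts with the same ID, so a search for that ID isolates the request regardless of log volume.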

~7:35 AM 500s stop. PagerDuty auto-resolves 17

~7:40 AM Pingdom alerts PagerDuty: public-API is unreachable 18

~7:40 AM Grafana triggers PagerDuty: some 5xx requests 19

~7:40 AM Sentry triggers PagerDuty: Elasticsearch -> Elastalert -> Sentry
20

~7:45 AM 500s stop. PagerDuty auto-resolves 21

Check Rundeck: was there a new deployment? ✔ 22

Call Release Manager: was there a new deployment? ✔ 23

Call last on-call SRE: was there a change? ✔ 24

Grafana ✔ Sentry ✔ Prometheus ✔ APM ✔ 25

Check Firewall: is it dropping traffic? ✔ 26

20 hours later … mount hadn’t run on one DB
shard; the shard rebooted and the data was wiped! 27

What’s the next step? 28

- Where else is this failing?
- How do we avoid this? 29

What’s the Root Cause? 30

Failure reasons
➔ New Deployment
➔ Incorrect expectation
➔ Network failure
➔ Traffic/Load
➔ Configuration change
➔ Service-Provider fault 31

Failure reasons NOT ➔ Individuals 32

RCA #1 Raghu forgot to execute mount command 33

RCA #2 Infrastructure team forgot to execute mount command 34

RCA #3 Why is it possible to SSH? 35

RCA #4 Why don’t we have a tool to alert
on config mismatch? 36

To err is human Failures are inevitable 37

Bad Apple theory Skill <> Ownership 38

What’s the Root Cause?
- System Configuration Validator
- Introduce FMEA
- Non-Latent Configuration Validator 39
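A configuration validator for this incident could be as small as comparing the mounts a host is expected to have against what is actually mounted. The sketch below is one possible shape for such a check, assuming a Linux host and text in the `/proc/mounts` format. The function names and the example paths and devices are hypothetical.

```python
def parse_mounts(proc_mounts_text: str) -> dict:
    """Map mount point -> device from text in /proc/mounts format."""
    mounts = {}
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            device, mount_point = fields[0], fields[1]
            mounts[mount_point] = device
    return mounts


def find_mismatches(expected: dict, proc_mounts_text: str) -> list:
    """Return one human-readable problem per expected mount that is wrong."""
    actual = parse_mounts(proc_mounts_text)
    problems = []
    for mount_point, device in expected.items():
        if mount_point not in actual:
            problems.append(
                f"{mount_point}: not mounted (expected {device})")
        elif actual[mount_point] != device:
            problems.append(
                f"{mount_point}: mounted from {actual[mount_point]}, "
                f"expected {device}")
    return problems
```

On a real host the second argument would come from reading `/proc/mounts`. Running the check at boot and on a schedule, and alerting on a non-empty result, is what would make the mismatch non-latent: it surfaces before a reboot wipes data rather than 20 hours into an incident.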

- Where else is this failing?
- How do we avoid this? 40

More managers != fewer failures 41

More process != fewer failures 42

Curiosity == fewer failures: FMEA, Fault Injections 43
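FMEA (Failure Mode and Effects Analysis) commonly ranks failure modes by a Risk Priority Number: severity times occurrence times detection, each scored from 1 to 10. The sketch below shows that arithmetic. The failure modes and their scores are invented for illustration and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: higher means address it sooner."""
        return self.severity * self.occurrence * self.detection


def rank(modes: list) -> list:
    """Sort failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative scores only; a real FMEA session would debate each one.
modes = [
    FailureMode("deployment ships a bad config", 6, 5, 3),
    FailureMode("data volume not mounted after reboot", 10, 3, 9),
]
for mode in rank(modes):
    print(mode.rpn, mode.name)
```

A failure that is rare but severe and hard to detect can outrank a frequent one that is caught quickly, which is the kind of result the exercise is meant to surface.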

Standardization == fewer failures: Infrastructure as Code, Policies as Code
44

Clear SLOs == fewer failures: Throughput as TPS vs Concurrency
45
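TPS and concurrency are different numbers tied together by latency, which is why an SLO has to say which one it means. Little's Law gives the relationship: average concurrency equals throughput times average latency. The sketch below applies it; the function names and the example figures are assumptions for illustration.

```python
def required_concurrency(tps: float, avg_latency_seconds: float) -> float:
    """Average in-flight requests needed to sustain `tps` (Little's Law)."""
    if tps < 0 or avg_latency_seconds < 0:
        raise ValueError("tps and latency must be non-negative")
    return tps * avg_latency_seconds


def max_tps(concurrency: float, avg_latency_seconds: float) -> float:
    """Throughput ceiling for a fixed number of concurrent workers."""
    if avg_latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return concurrency / avg_latency_seconds
```

For example, sustaining 200 TPS at 250 ms average latency needs about 50 requests in flight. If latency doubles to 500 ms, those same 50 slots can only carry 100 TPS.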

Better tools == fewer failures 46

What are better tools? 47

Clever Hans 48

There are no better tools. There is only better usage.
49

Which one is better? 50

Terraform > Pulumi? 51

Ansible > Chef? 53

Prometheus > InfluxDB? 54

Elasticsearch > Humio? 55

Recap
Infrastructure: Terraform, Pulumi
Configuration Management: Ansible, Chef
Observability: Prometheus, InfluxDB
Logging: Elasticsearch, Humio 56

Culture will eat tools for breakfast, lunch, and dinner 57

Step 1: Build a culture. Adopt tools. Standardize. Frequent RCAs.
Step 2: Improve the culture. Share knowledge. Embrace failures. 58

To err is human Failures are inevitable 59

Thank you last9.io/failures Piyush Verma
