Doing SRE the right way - 2

Slide 1

Slide 1 text

Doing SRE the Right Way, Part 2/3: Understanding the SRE mindset and tools

Slide 2

Slide 2 text

Piyush Verma, CTO, last9.io, @realmeson10

Slide 3

Slide 3 text

FAQs
- Pulumi or Terraform?
- Automated vs automatic?
- Ansible or K8s?
- Serverless or needless codeless?

Slide 4

Slide 4 text

I used to believe these were the answers to reliability, and that things would begin to fail less. But...

Slide 5

Slide 5 text

Will it make a sound when it breaks?

Slide 6

Slide 6 text

Would this?

Slide 7

Slide 7 text

Would this?

Slide 8

Slide 8 text

Would this?

Slide 9

Slide 9 text

Would this?
meson10@meson10-xps-9370$: LAST9_ORG=last9 make launch
[2020/06/20 11:27:40:2869] N: liblast9 1.6.0-c15cfb7
[2020/06/20 11:27:40:2870] N: tty configuration:
[2020/06/20 11:27:40:2870] N: start command: poetry run
[2020/06/20 11:27:40:2870] N: close signal: SIGHUP (1)
[2020/06/20 11:27:40:2872] N: Using foreign event loop...
[2020/06/20 11:27:40:2873] N: Listening on port: 7681

Slide 10

Slide 10 text

If a tree falls in a forest and no one is around to hear it, does it make a sound?

Slide 11

Slide 11 text

If a bug happens and no user is around to see it, should it raise an alert?

Slide 12

Slide 12 text

You got paged, now what?

Slide 13

Slide 13 text

Journey of an Incident

Slide 14

Slide 14 text

~7:30 AM, 25 hours before a country launch, PagerDuty goes off

Slide 15

Slide 15 text

… Elasticsearch shows some 5-6 5xx requests

Slide 16

Slide 16 text

Logs come in at 1 mbps, with no Correlation-ID to isolate the failing requests
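To make the missing Correlation-ID concrete, here is a minimal sketch of stamping every log line with a per-request ID so that one request can be pulled out of a busy log stream. It is an illustration, not the setup from the talk: the logger name "api", the handler function, and the log messages are all hypothetical, and only the Python standard library is used.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    """Return a logger whose lines are prefixed with the correlation ID."""
    logger = logging.getLogger("api")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_id=None):
    """Hypothetical handler: reuse the caller's ID or mint a new one."""
    request_id = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        logger.info("request received")
        logger.error("upstream returned 500")
    finally:
        correlation_id.reset(token)
    return request_id


def lines_for(log_text, request_id):
    """Isolate the log lines that belong to a single request."""
    prefix = request_id + " "
    return [line for line in log_text.splitlines() if line.startswith(prefix)]
```

With an ID on every line, the handful of 5xx requests can be filtered out of the stream by ID instead of being read in bulk.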

Slide 17

Slide 17 text

~7:35 AM 500s stop. PagerDuty is auto-resolved

Slide 18

Slide 18 text

~7:40 AM Pingdom alerts PagerDuty: public API is unreachable

Slide 19

Slide 19 text

~7:40 AM Grafana triggers PagerDuty: some 5xx requests

Slide 20

Slide 20 text

~7:40 AM Sentry triggers PagerDuty: Elasticsearch -> ElastAlert -> Sentry

Slide 21

Slide 21 text

~7:45 AM 500s stop. PagerDuty is auto-resolved

Slide 22

Slide 22 text

Check Rundeck: was there a new deployment? ✔

Slide 23

Slide 23 text

Call Release Manager: was there a new deployment? ✔

Slide 24

Slide 24 text

Call the last on-call SRE: was there a change? ✔

Slide 25

Slide 25 text

Grafana ✔ Sentry ✔ Prometheus ✔ APM ✔

Slide 26

Slide 26 text

Check Firewall: is it dropping traffic? ✔

Slide 27

Slide 27 text

20 hours later … mount hadn’t run on one DB shard; the shard rebooted, and its data was wiped!

Slide 28

Slide 28 text

What’s the next step?

Slide 29

Slide 29 text

- Where else is this failing?
- How do we avoid this?

Slide 30

Slide 30 text

What’s the Root Cause?

Slide 31

Slide 31 text

Failure reasons
➔ New Deployment
➔ Incorrect expectation
➔ Network failure
➔ Traffic/Load
➔ Configuration change
➔ Service-Provider fault

Slide 32

Slide 32 text

Failure reasons
NOT ➔ Individuals

Slide 33

Slide 33 text

RCA #1: Raghu forgot to execute the mount command

Slide 34

Slide 34 text

RCA #2: Infrastructure team forgot to execute the mount command

Slide 35

Slide 35 text

RCA #3: Why is it possible to SSH?

Slide 36

Slide 36 text

RCA #4: Why don’t we have a tool to alert on config mismatch?

Slide 37

Slide 37 text

To err is human. Failures are inevitable.

Slide 38

Slide 38 text

Bad Apple theory
Skill <> Ownership

Slide 39

Slide 39 text

What’s the Root Cause?
- System Configuration Validator
- Introduce FMEA
- Non-Latent Configuration Validator
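As a rough illustration of what a configuration validator for this incident might check, the sketch below compares the mounts a node is expected to have against the mounts it actually has, using text in the /proc/mounts format (device, mount point, filesystem type, and so on). The device names and mount points are hypothetical, and the talk does not describe how its validator was built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    mount_point: str
    expected_device: str
    actual_device: str  # empty string when nothing is mounted there


def parse_mounts(mounts_text):
    """Parse /proc/mounts-style text into {mount_point: device}."""
    mounted = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            device, mount_point = fields[0], fields[1]
            mounted[mount_point] = device
    return mounted


def validate_mounts(expected, mounts_text):
    """Return every expected mount that is missing or backed by the wrong device.

    `expected` maps mount point -> device, e.g. loaded from a config repo.
    """
    mounted = parse_mounts(mounts_text)
    mismatches = []
    for mount_point, device in sorted(expected.items()):
        actual = mounted.get(mount_point, "")
        if actual != device:
            mismatches.append(Mismatch(mount_point, device, actual))
    return mismatches


if __name__ == "__main__":
    # Hypothetical expectation for a database shard.
    expected = {"/var/lib/db": "/dev/sdb1"}
    with open("/proc/mounts") as handle:  # Linux only
        for problem in validate_mounts(expected, handle.read()):
            print("config mismatch:", problem)
```

Run on a schedule, a check like this surfaces the unmounted volume before a reboot does, which is roughly what "non-latent" asks for.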

Slide 40

Slide 40 text

- Where else is this failing?
- How do we avoid this?

Slide 41

Slide 41 text

More managers != less failures

Slide 42

Slide 42 text

More process != less failures

Slide 43

Slide 43 text

Curiosity == less failures
- FMEA
- Fault Injections
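FMEA (Failure Mode and Effects Analysis) conventionally ranks failure modes by a Risk Priority Number: severity x occurrence x detection, each scored from 1 to 10, where a high detection score means the failure is hard to notice. The sketch below shows that arithmetic; the failure modes and their scores are made up for illustration and are not from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 = negligible, 10 = catastrophic
    occurrence: int  # 1 = rare, 10 = almost certain
    detection: int   # 1 = always caught, 10 = almost never caught

    def __post_init__(self):
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("FMEA scores must be between 1 and 10")

    @property
    def rpn(self):
        """Risk Priority Number: higher means fix this first."""
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Order failure modes from highest to lowest risk priority."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


if __name__ == "__main__":
    # Illustrative scores only.
    modes = [
        FailureMode("bad deployment", severity=6, occurrence=5, detection=2),
        FailureMode("shard volume not mounted", severity=9, occurrence=2, detection=9),
    ]
    for mode in rank(modes):
        print(mode.rpn, mode.name)
```

A rare failure can still rank first when it is both severe and hard to detect, which is the shape of the unmounted-shard incident.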

Slide 44

Slide 44 text

Standardization == less failures
- Infrastructure as Code
- Policies as Code
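One way to picture "Policies as Code" is as plain functions that inspect a declared resource and return a violation message. The sketch below assumes a hypothetical resource schema (the keys "role", "data_volume", "mount_on_boot", and "ingress" are invented here) and encodes two rules suggested by the RCAs above; real deployments would more likely use a dedicated policy engine.

```python
def require_boot_time_data_mount(resource):
    """Database nodes must declare a data volume that is mounted at boot."""
    if resource.get("role") != "database":
        return None
    volume = resource.get("data_volume")
    if not volume or not volume.get("mount_on_boot", False):
        return f"{resource['name']}: database node without a boot-time data mount"
    return None


def forbid_ssh_from_anywhere(resource):
    """No resource may expose port 22 to the whole internet."""
    for rule in resource.get("ingress", []):
        if rule.get("port") == 22 and rule.get("cidr") == "0.0.0.0/0":
            return f"{resource['name']}: SSH open to 0.0.0.0/0"
    return None


POLICIES = [require_boot_time_data_mount, forbid_ssh_from_anywhere]


def evaluate(resources, policies=POLICIES):
    """Run every policy against every resource and collect the violations."""
    violations = []
    for resource in resources:
        for policy in policies:
            message = policy(resource)
            if message is not None:
                violations.append(message)
    return violations


if __name__ == "__main__":
    # Hypothetical resource declarations.
    resources = [
        {
            "name": "db-shard-3",
            "role": "database",
            "data_volume": {"mount_on_boot": False},
            "ingress": [{"port": 22, "cidr": "0.0.0.0/0"}],
        },
        {"name": "web-1", "role": "web"},
    ]
    for violation in evaluate(resources):
        print(violation)
```

Because the rules live in version control next to the infrastructure code, they can run in CI and block a change before it reaches a shard.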

Slide 45

Slide 45 text

Clear SLOs == less failures
Throughput as TPS vs Concurrency
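A possible reading of "TPS vs Concurrency" is that the two numbers are tied together by latency, so an SLO that names only one of them is ambiguous. Little's Law states the relationship: average concurrency = throughput x average latency. The sketch below applies it; the traffic figures in the usage section are invented for illustration.

```python
def required_concurrency(tps, mean_latency_seconds):
    """Little's Law: in-flight requests = arrival rate x time in system."""
    if tps < 0 or mean_latency_seconds < 0:
        raise ValueError("tps and latency must be non-negative")
    return tps * mean_latency_seconds


def sustainable_tps(concurrency, mean_latency_seconds):
    """Throughput a fixed pool of workers can sustain at a given latency."""
    if concurrency < 0:
        raise ValueError("concurrency must be non-negative")
    if mean_latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return concurrency / mean_latency_seconds


if __name__ == "__main__":
    # Illustrative numbers: 200 TPS at 250 ms needs about 50 requests in flight.
    print(required_concurrency(200, 0.25))
    # If latency doubles to 500 ms, the same 50 workers sustain only 100 TPS.
    print(sustainable_tps(50, 0.5))
```

The second call shows why the distinction matters: a latency regression silently halves the throughput a fixed-size pool can deliver, even though nothing about the pool changed.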

Slide 46

Slide 46 text

Better tools == less failures

Slide 47

Slide 47 text

What are better tools?

Slide 48

Slide 48 text

Clever Hans

Slide 49

Slide 49 text

There are no better tools. There is only better usage.

Slide 50

Slide 50 text

Which one is better?

Slide 51

Slide 51 text

Terraform > Pulumi?

Slide 52

Slide 52 text

Slide 53

Slide 53 text

Ansible > Chef?

Slide 54

Slide 54 text

Prometheus > InfluxDB?

Slide 55

Slide 55 text

Elasticsearch > Humio?

Slide 56

Slide 56 text

Recap
- Infrastructure: Terraform, Pulumi
- Configuration Management: Ansible, Chef
- Observability: Prometheus, InfluxDB
- Logging: Elasticsearch, Humio

Slide 57

Slide 57 text

Culture will eat tools for breakfast, lunch, and dinner

Slide 58

Slide 58 text

Step 1: Build a culture. Adopt tools. Standardize. Frequent RCAs.
Step 2: Improve the culture. Share knowledge. Embrace failures.