Doing SRE the right way - 2

Pulumi or Terraform?
How much and what all to automate?
Automated vs automatic?
Ansible or K8s?
Serverless or needless?

Choices, choices, choices - and very expensive these choices are.

In session 2, speaker Piyush Verma, co-founder of Last9 Inc, will lay out facts (from popular tech fiction) and the trade-offs associated with these choices.

Piyush Verma

June 23, 2020
Transcript

  1. Understanding the SRE mindset and tools. Doing SRE the Right Way, Part 2/3
  2. Piyush Verma, CTO, last9.io, @realmeson10

  3. FAQs: Pulumi or Terraform? Automated vs automatic? Ansible or K8s? Serverless or needless codeless?
  4. I used to believe these are the answers to reliability, and things will begin to fail less. But...
  5. Will it make a sound, when it breaks?
  6. Would this?
  7. Would this?
  8. Would this?

  9. Would this?
    meson10@meson10-xps-9370$: LAST9_ORG=last9 make launch
    [2020/06/20 11:27:40:2869] N: liblast9 1.6.0-c15cfb7
    [2020/06/20 11:27:40:2870] N: tty configuration:
    [2020/06/20 11:27:40:2870] N: start command: poetry run
    [2020/06/20 11:27:40:2870] N: close signal: SIGHUP (1)
    [2020/06/20 11:27:40:2872] N: Using foreign event loop...
    [2020/06/20 11:27:40:2873] N: Listening on port: 7681
  10. If a tree falls in a forest and no one is around to hear it, does it make a sound?
  11. If a bug happens and no user is around to see it, should it raise an alert?
  12. You got paged, now what?

  13. Journey of an Incident
  14. ~7:30 AM, 25 hours before a country launch, PagerDuty goes off
  15. … Elasticsearch shows some 5-6 5xx requests
  16. Logs come in at 1 Mbps. No Correlation-ID to isolate
  17. ~7:35 AM: 500s stop. PagerDuty is auto-resolved
  18. ~7:40 AM: Pingdom alerts PagerDuty. Public API is unreachable
  19. ~7:40 AM: Grafana triggers PagerDuty. Some 5xx requests
  20. ~7:40 AM: Sentry triggers PagerDuty (Elasticsearch -> ElastAlert -> Sentry)
  21. ~7:45 AM: 500s stop. PagerDuty is auto-resolved

  22. Check Rundeck: was there a new deployment? ✔
  23. Call Release Manager: was there a new deployment? ✔
  24. Call last on-call SRE: was there a change? ✔
  25. Grafana ✔ Sentry ✔ Prometheus ✔ APM ✔
  26. Check firewall: is it dropping traffic? ✔
  27. 20 hours later … mount hadn’t run on one DB shard; the shard rebooted, data wiped!
  28. What’s the next step?
  29. - Where else is this failing? - How do we avoid this?
  30. What’s the Root Cause?
  31. Failure reasons: ➔ New Deployment ➔ Incorrect expectation ➔ Network failure ➔ Traffic/Load ➔ Configuration change ➔ Service-Provider fault
  32. Failure reasons, NOT: ➔ Individuals

  33. RCA #1: Raghu forgot to execute mount command
  34. RCA #2: Infrastructure team forgot to execute mount command
  35. RCA #3: Why is it possible to SSH?
  36. RCA #4: Why don’t we have a tool to alert on config mismatch?
  37. To err is human. Failures are inevitable
  38. Bad Apple theory: Skill <> Ownership
  39. What’s the Root Cause? - System Configuration Validator - Introduce FMEA - Non-Latent Configuration Validator
  40. - Where else is this failing? - How do we avoid this?
  41. More managers != less failures
  42. More process != less failures
  43. Curiosity == less failures: FMEA, Fault Injections
  44. Standardization == less failures: Infrastructure as Code, Policies as Code
  45. Clear SLOs == less failures: Throughput as TPS vs Concurrency
  46. Better tools == less failures
  47. What are better tools?
  48. Clever Hans
  49. There are no better tools. There is only better usage
  50. Which one is better?
  51. Terraform > Pulumi?
  52.
  53. Ansible > Chef?
  54. Prometheus > InfluxDB?
  55. Elasticsearch > Humio?
  56. Recap. Infrastructure: Terraform, Pulumi. Configuration Management: Ansible, Chef. Observability: Prometheus, InfluxDB. Logging: Elasticsearch, Humio
  57. Culture will eat tools for breakfast, lunch, and dinner
  58. Step 1: Build a culture. Adopt tools. Standardize. Frequent RCAs. Step 2: Improve the culture. Share knowledge. Embrace failures.
  59. To err is human. Failures are inevitable
  60. Thank you. last9.io/failures, Piyush Verma
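
The incident in slides 27 to 39 ends with a mount that never ran on one DB shard, and with the question of why no tool alerts on a configuration mismatch. A minimal sketch of such a check in Python follows. It is not the tool from the talk: it assumes a Linux host whose mounts are listed in /proc/mounts format, and the function names and the example mount point /var/lib/db-data are hypothetical.

```python
from typing import Iterable, List, Set


def mounted_paths(proc_mounts_text: str) -> Set[str]:
    """Return the mount points found in /proc/mounts-style text.

    Each line has whitespace-separated fields; the second field is the
    mount point (for example: "/dev/sdb1 /data ext4 rw 0 0").
    """
    paths = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            paths.add(fields[1])
    return paths


def missing_mounts(expected: Iterable[str], proc_mounts_text: str) -> List[str]:
    """Return, sorted, the expected mount points that are not mounted."""
    actual = mounted_paths(proc_mounts_text)
    return sorted(path for path in expected if path not in actual)


# Hypothetical example: one shard where the data volume was never mounted.
SAMPLE_MOUNTS = "/dev/sda1 / ext4 rw 0 0\nproc /proc proc rw 0 0\n"
EXPECTED = ["/", "/var/lib/db-data"]  # assumed policy, not from the talk

for path in missing_mounts(EXPECTED, SAMPLE_MOUNTS):
    print(f"ALERT: expected mount point is not mounted: {path}")
```

On a real host the text would come from reading /proc/mounts, and the check would run on a schedule and after every reboot so that the mismatch raises an alert before the shard serves traffic, rather than 20 hours into an incident.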