Doing SRE the right way - 2

Pulumi or Terraform?
How much to automate, and what exactly?
Automated vs automatic?
Ansible or K8s?
Serverless or needless?

Choices, choices, choices - and very expensive these choices are.

In session 2, speaker Piyush Verma, co-founder of Last9 Inc, will lay out facts (from popular tech fiction) and the trade-offs associated with these choices.

Piyush Verma

June 23, 2020

Transcript

  1. Understanding the SRE mindset and tools
    Doing SRE the Right Way
    Part 2/3

  2. Piyush Verma
    CTO, last9.io
    @realmeson10

  3. FAQs
    Pulumi or Terraform?
    Automated vs automatic?
    Ansible or K8s?
    Serverless or needless codeless?

  4. I used to believe
    these are the answers to reliability,
    and things will begin to fail less
    But...

  5. Will it make a sound,
    when it breaks?

  6. Would this?

  7. Would this?

  8. Would this?

  9. Would this?
    meson10@meson10-xps-9370$: LAST9_ORG=last9 make launch
    [2020/06/20 11:27:40:2869] N: liblast9 1.6.0-c15cfb7
    [2020/06/20 11:27:40:2870] N: tty configuration:
    [2020/06/20 11:27:40:2870] N: start command: poetry run
    [2020/06/20 11:27:40:2870] N: close signal: SIGHUP (1)
    [2020/06/20 11:27:40:2872] N: Using foreign event loop...
    [2020/06/20 11:27:40:2873] N: Listening on port: 7681

  10. If a tree falls in a forest and no
    one is around to hear it,
    does it make a sound?

  11. If a bug happens and no user
    is around to see it,
    should it raise an alert?

  12. You got paged,
    now what?

  13. Journey of an Incident

  14. ~7:30 AM,
    25 hours before a country launch,
    PagerDuty goes off

  15. … Elasticsearch shows
    some 5-6 5xx requests

  16. Logs come in at 1 Mbps,
    No Correlation-ID to isolate
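
One common way to make a firehose of logs searchable per request is to stamp every log line with a correlation ID. The sketch below illustrates that idea with Python's standard `logging` and `contextvars` modules; it is not something shown in the talk, and the names `correlation_id`, `CorrelationIdFilter`, `build_logger`, and `handle_request` are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream):
    """Return a logger whose lines are prefixed with the correlation ID."""
    logger = logging.getLogger("api")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if one was supplied, otherwise mint a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.error("upstream returned 502")
    finally:
        # Restore the previous value so IDs never leak between requests.
        correlation_id.reset(token)


if __name__ == "__main__":
    out = io.StringIO()
    handle_request(build_logger(out), "req-42")
    print(out.getvalue(), end="")
```

With an ID on every line, the handful of failing requests can be pulled out of the stream by filtering on a single field instead of reading everything.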

  17. ~7:35 AM
    500s stop.
    PagerDuty auto-resolves

  18. ~7:40 AM
    Pingdom alerts PagerDuty:
    public API is unreachable

  19. ~7:40 AM
    Grafana triggers PagerDuty
    Some 5xx requests

  20. ~7:40 AM
    Sentry triggers PagerDuty
    Elasticsearch -> ElastAlert -> Sentry

  21. ~7:45 AM
    500s stop.
    PagerDuty auto-resolves

  22. Check Rundeck
    was there a new deployment? ✔

  23. Call the Release Manager
    was there a new deployment? ✔

  24. Call the last on-call SRE
    was there a change? ✔

  25. Grafana ✔
    Sentry ✔
    Prometheus ✔
    APM ✔

  26. Check Firewall
    Is it dropping traffic? ✔

  27. 20 hours later
    … mount hadn’t run on one DB shard,
    the shard rebooted,
    data wiped!

  28. What’s the next step?

  29. - Where else is this failing?
    - How do we avoid this?

  30. What’s the Root Cause?

  31. Failure reasons
    ➔ New Deployment
    ➔ Incorrect expectation
    ➔ Network failure
    ➔ Traffic/Load
    ➔ Configuration change
    ➔ Service-Provider fault

  32. Failure reasons NOT
    ➔ Individuals

  33. RCA #1
    Raghu forgot to execute the mount command

  34. RCA #2
    The infrastructure team forgot to execute the mount command

  35. RCA #3
    Why is it possible to SSH?

  36. RCA #4
    Why don’t we have a tool to alert on config mismatches?

  37. To err is human
    Failures are inevitable

  38. Bad Apple theory
    Skill <> Ownership

  39. What’s the Root Cause?
    - System Configuration Validator
    - Introduce FMEA
    - Non-Latent Configuration Validator
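
The talk does not show what a system configuration validator looks like. As a minimal sketch, assuming a Linux host where `/proc/mounts` lists the active mounts, a check for the failure in this incident (a volume that was never mounted) could compare the expected mount points against the live ones. The function names and the `/var/lib/db` path are hypothetical.

```python
def parse_mount_points(mounts_text):
    """Extract mount points from text in /proc/mounts format.

    Each line looks like: "<device> <mount point> <fstype> <options> 0 0".
    Paths containing spaces are escaped by the kernel and are not
    unescaped here, to keep the sketch short.
    """
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def missing_mounts(expected, mounts_text):
    """Return the expected mount points that are not currently mounted."""
    return sorted(set(expected) - parse_mount_points(mounts_text))


def read_proc_mounts(path="/proc/mounts"):
    """Read the live mount table (Linux only)."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical expectation for a database shard.
    expected = ["/", "/var/lib/db"]
    missing = missing_mounts(expected, read_proc_mounts())
    if missing:
        raise SystemExit(f"config mismatch, not mounted: {missing}")
    print("all expected mounts present")
```

Run on a schedule or at boot, a check like this surfaces the mismatch when it is introduced rather than at the next reboot, which is one reading of "non-latent".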

  40. - Where else is this failing?
    - How do we avoid this?

  41. More managers != fewer failures

  42. More process != fewer failures

  43. Curiosity == fewer failures
    FMEA
    Fault Injections
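
FMEA (Failure Mode and Effects Analysis) conventionally rates each failure mode for severity, occurrence, and detection difficulty, commonly on a 1-10 scale each, and multiplies the three into a Risk Priority Number (RPN). The sketch below ranks failure modes by RPN; the failure modes and their ratings are invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = caught immediately, 10 = effectively invisible

    def __post_init__(self):
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self):
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


if __name__ == "__main__":
    # Illustrative ratings only.
    modes = [
        FailureMode("bad deployment", severity=6, occurrence=5, detection=2),
        FailureMode("volume not mounted after reboot",
                    severity=10, occurrence=3, detection=8),
    ]
    for mode in rank(modes):
        print(f"{mode.rpn:4d}  {mode.name}")
```

A failure that is rare but both severe and hard to detect can outrank a frequent, loud one, which is the kind of latent risk the incident above illustrates.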

  44. Standardization == fewer failures
    Infrastructure as Code
    Policies as Code
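
To make "policies as code" concrete, here is a minimal sketch in which each policy is a plain function run against a declared resource. It assumes resources are described as dictionaries; the resource schema, the policy names, and the two rules (loosely modeled on the RCA questions earlier in the deck) are hypothetical and do not correspond to any particular tool.

```python
def volumes_have_mount_point(resource):
    """Every declared volume must state where it is mounted."""
    if resource.get("type") == "volume" and not resource.get("mount_point"):
        return "volume declares no mount_point"
    return None


def ssh_not_open_to_world(resource):
    """Port 22 must not be reachable from any address."""
    if (
        resource.get("type") == "firewall_rule"
        and resource.get("port") == 22
        and resource.get("source") == "0.0.0.0/0"
    ):
        return "SSH is open to 0.0.0.0/0"
    return None


POLICIES = [volumes_have_mount_point, ssh_not_open_to_world]


def evaluate(resources, policies=POLICIES):
    """Return (resource name, violation message) pairs."""
    violations = []
    for resource in resources:
        for policy in policies:
            message = policy(resource)
            if message:
                violations.append((resource.get("name", "<unnamed>"), message))
    return violations


if __name__ == "__main__":
    # Hypothetical resource declarations.
    resources = [
        {"type": "volume", "name": "db-shard-3-data"},
        {"type": "firewall_rule", "name": "allow-ssh",
         "port": 22, "source": "0.0.0.0/0"},
    ]
    for name, message in evaluate(resources):
        print(f"{name}: {message}")
```

Because the policies live in version control next to the infrastructure definitions, they can run in CI and block a change before it reaches production.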

  45. Clear SLOs == fewer failures
    Throughput as TPS vs Concurrency
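
One way to read "TPS vs concurrency" is through Little's Law: in steady state, the average number of in-flight requests equals throughput multiplied by average latency. That reading is an assumption about the slide, and the function names and figures below are illustrative.

```python
def required_concurrency(tps, avg_latency_s):
    """Little's Law: in-flight requests = arrival rate x time in system."""
    return tps * avg_latency_s


def max_tps(concurrency, avg_latency_s):
    """Throughput a fixed pool of workers can sustain at a given latency."""
    return concurrency / avg_latency_s


if __name__ == "__main__":
    # 200 requests/second at 250 ms each keeps 50 requests in flight.
    print(required_concurrency(200, 0.25))
    # If latency doubles to 500 ms, the same 50 workers sustain half the TPS.
    print(max_tps(50, 0.5))
```

An SLO stated only as TPS hides this dependency: the same throughput target needs twice the capacity when latency doubles, so it helps to state which of the two the objective actually constrains.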

  46. Better tools == fewer failures

  47. What are better tools?

  48. Clever Hans

  49. There are no better tools
    There is only better usage

  50. Which one is better?

  51. Terraform > Pulumi?

  52. Ansible > Chef?

  53. Prometheus > InfluxDB?

  54. Elasticsearch > Humio?

  55. Recap
    Infrastructure            Terraform      Pulumi
    Configuration Management  Ansible        Chef
    Observability             Prometheus     InfluxDB
    Logging                   Elasticsearch  Humio

  56. Culture will eat tools
    for breakfast, lunch, and dinner

  57. Step 1: Build a culture
    Adopt tools, standardize.
    Frequent RCAs.
    Step 2: Improve the culture
    Share knowledge.
    Embrace failures.

  58. To err is human
    Failures are inevitable

  59. Thank you
    last9.io/failures
    Piyush Verma
