Doing SRE the right way - 2

Pulumi or Terraform?
How much to automate, and what?
Automated vs automatic?
Ansible or K8s?
Serverless or needless?

Choices, choices, choices - and very expensive these choices are.

In session 2, speaker Piyush Verma, co-founder of Last9 Inc, will lay out the facts (from popular tech fiction) and the trade-offs associated with these choices.

Piyush Verma

June 23, 2020

Transcript

  1. Understanding the SRE mindset and tools
    Doing SRE the Right Way
    Part 2/3

  2. Piyush Verma
    CTO, last9.io
    @realmeson10

  3. FAQs
    Pulumi or Terraform?
    Automated vs automatic?
    Ansible or K8s?
    Serverless or needless codeless?

  4. I used to believe
    these are the answers to reliability,
    and things will begin to fail less
    But...

  5. Will it make a sound,
    when it breaks?

  6. Would this?

  7. Would this?

  8. Would this?

  9. Would this?
    meson10@meson10-xps-9370$: LAST9_ORG=last9 make launch
    [2020/06/20 11:27:40:2869] N: liblast9 1.6.0-c15cfb7
    [2020/06/20 11:27:40:2870] N: tty configuration:
    [2020/06/20 11:27:40:2870] N: start command: poetry run
    [2020/06/20 11:27:40:2870] N: close signal: SIGHUP (1)
    [2020/06/20 11:27:40:2872] N: Using foreign event loop...
    [2020/06/20 11:27:40:2873] N: Listening on port: 7681

  10. If a tree falls in a forest and no
    one is around to hear it,
    does it make a sound?

  11. If a bug happens and no user
    is around to see it,
    should it raise an alert?
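The slide leaves this question open. One possible reading is that paging should depend on user impact rather than on the mere presence of errors. The sketch below illustrates that reading only; `ErrorWindow`, `should_page`, and both thresholds are hypothetical names and values, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorWindow:
    """Errors observed for one endpoint over a fixed time window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    affected_users: int  # distinct users who saw at least one failure


def should_page(window: ErrorWindow, min_users: int = 1,
                max_error_rate: float = 0.01) -> bool:
    """Page only when real users are affected and the error rate exceeds the threshold.

    Both thresholds are illustrative defaults, not recommendations.
    """
    if window.total_requests == 0 or window.affected_users < min_users:
        return False
    return window.failed_requests / window.total_requests > max_error_rate
```

Under these assumptions, 50 failures out of 1000 requests would page if at least one user saw them, and would stay silent if nobody did.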

  12. You got paged,
    now what?

  13. Journey of an Incident

  14. ~7:30 AM,
    25 hours before a country launch,
    PagerDuty goes off

  15. … Elasticsearch shows
    some 5-6 5xx requests

  16. Logs come in at 1 Mbps,
    No Correlation-ID to isolate
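The pain point on this slide is that nothing ties the log lines of one request together. A minimal sketch of one way to attach a correlation ID to every log line, using only the Python standard library, is shown below. The logger name `api`, the `cid=` field, and the functions `build_logger` and `handle_request` are assumptions made for illustration; the talk does not describe the actual logging setup.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means "no request".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    """Create a logger whose lines all carry a cid=<id> field."""
    logger = logging.getLogger("api")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s"))
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.error("upstream returned 500")
    finally:
        correlation_id.reset(token)
```

With an ID on every line, the handful of failing requests can be isolated by searching for a single `cid` value instead of scanning the whole stream.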

  17. ~7:35 AM
    500s stop.
    PagerDuty auto-resolves

  18. ~7:40 AM
    Pingdom alerts PagerDuty
    public-API is unreachable

  19. ~7:40 AM
    Grafana triggers PagerDuty
    Some 5xx requests

  20. ~7:40 AM
    Sentry triggers PagerDuty
    Elasticsearch -> Elastalert -> Sentry
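Three tools paging at ~7:40 AM for what turns out to be one problem suggests grouping alerts by time before they reach a human. The sketch below shows one simple way to do that; the `Alert` and `Incident` types, the `group_alerts` function, and the five-minute window are hypothetical and are not features of PagerDuty, Pingdom, Grafana, or Sentry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str
    message: str
    fired_at: datetime


@dataclass
class Incident:
    opened_at: datetime
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Fold alerts that fire within `window` of an incident's opening into that incident."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        if incidents and alert.fired_at - incidents[-1].opened_at <= window:
            incidents[-1].alerts.append(alert)
        else:
            incidents.append(Incident(opened_at=alert.fired_at, alerts=[alert]))
    return incidents
```

Applied to the timeline in these slides, the 7:30 alert would open one incident and the three 7:40 alerts would collapse into a second one, so the on-call engineer sees two pages instead of four.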

  21. ~7:45 AM
    500s stop.
    PagerDuty auto-resolves

  22. Check Rundeck
    Was there a new deployment? ✔

  23. Call Release Manager
    Was there a new deployment? ✔

  24. Call last oncall SRE
    Was there a change? ✔

  25. Grafana ✔
    Sentry ✔
    Prometheus ✔
    APM ✔

  26. Check Firewall
    Is it dropping traffic? ✔

  27. 20 hours later
    … mount hadn’t run on one db shard,
    shard rebooted,
    data wiped!

  28. What’s the next step?

  29. - Where else is this failing?
    - How do we avoid this?

  30. What’s the Root Cause?

  31. Failure reasons
    ➔ New Deployment
    ➔ Incorrect expectation
    ➔ Network failure
    ➔ Traffic/Load
    ➔ Configuration change
    ➔ Service-Provider fault

  32. Failure reasons NOT
    ➔ Individuals

  33. RCA #1
    Raghu forgot to execute the mount command

  34. RCA #2
    Infrastructure team forgot to execute the mount command

  35. RCA #3
    Why is it possible to SSH?

  36. RCA #4
    Why don’t we have a tool to alert on config
    mismatch?
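A tool of the kind RCA #4 asks for could start as a comparison of each host's reported settings against one expected baseline. The sketch below is a rough illustration under that assumption; `find_config_mismatches`, the host names, and the `data_dir_mounted` key are hypothetical, and how the actual values get collected from each shard is left out.

```python
def find_config_mismatches(expected: dict, actual_by_host: dict) -> dict:
    """Return {host: {key: (expected, actual)}} for every setting that differs."""
    mismatches = {}
    for host, actual in actual_by_host.items():
        diff = {
            key: (want, actual.get(key))
            for key, want in expected.items()
            if actual.get(key) != want
        }
        if diff:
            mismatches[host] = diff
    return mismatches


# Hypothetical usage: one shard out of three reports that its data directory is not mounted.
expected = {"data_dir_mounted": True}
reported = {
    "shard-1": {"data_dir_mounted": True},
    "shard-2": {"data_dir_mounted": False},
    "shard-3": {"data_dir_mounted": True},
}
drift = find_config_mismatches(expected, reported)
```

Any non-empty result is something to alert on before a reboot turns the mismatch into data loss.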

  37. To err is human
    Failures are inevitable

  38. Bad Apple theory
    Skill <> Ownership

  39. What’s the Root Cause?
    - System Configuration Validator
    - Introduce FMEA
    - Non-Latent Configuration Validator
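For the incident in this talk, a configuration validator that is not latent would have to fail at service start rather than 20 hours after a reboot. A minimal sketch of such a check, assuming the database expects certain paths to be mount points, could look like the following. `missing_mounts` and `assert_mounted` are hypothetical names; `os.path.ismount` is the standard-library call, and it is injectable here only so the check can be exercised without real mounts.

```python
import os


def missing_mounts(required_paths, is_mount=os.path.ismount):
    """Return the required paths that are not currently mount points."""
    return [path for path in required_paths if not is_mount(path)]


def assert_mounted(required_paths, is_mount=os.path.ismount):
    """Refuse to continue if any required path is not mounted."""
    missing = missing_mounts(required_paths, is_mount)
    if missing:
        raise RuntimeError(f"refusing to start, not mounted: {', '.join(missing)}")
```

Calling `assert_mounted` before the database process starts turns a silent misconfiguration into an immediate, loud failure.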

  40. - Where else is this failing?
    - How do we avoid this?

  41. More managers != fewer failures

  42. More process != fewer failures

  43. Curiosity == fewer failures
    FMEA
    Fault Injections
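FMEA (Failure Mode and Effects Analysis) conventionally ranks failure modes by a Risk Priority Number, the product of severity, occurrence, and detection scores, each usually on a 1 to 10 scale. The sketch below shows that arithmetic; the `FailureMode` type, the `rank` function, and any scores passed in are illustrative assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Highest-risk failure modes first."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)
```

A rare but hard-to-detect failure, such as a volume that was never mounted, can outrank a frequent failure that monitoring catches immediately, which is the kind of result this exercise is meant to surface.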

  44. Standardization == fewer failures
    Infrastructure as Code
    Policies as Code
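"Policies as Code" can be read as rules that run against declared infrastructure before anything is applied. The sketch below illustrates the idea in plain Python rather than in any specific policy tool. The resource fields (`type`, `data_volume`, `ssh_enabled`) and the two rules are hypothetical, chosen to mirror the questions raised in the RCA slides.

```python
def check_policies(resources, policies):
    """Return (resource name, policy name) for every rule a resource violates."""
    violations = []
    for resource in resources:
        for policy_name, rule in policies.items():
            if not rule(resource):
                violations.append((resource["name"], policy_name))
    return violations


# Hypothetical rules: each maps a resource definition to True when it complies.
POLICIES = {
    "data volume must be declared":
        lambda r: r["type"] != "db_shard" or bool(r.get("data_volume")),
    "no direct SSH access":
        lambda r: not r.get("ssh_enabled", False),
}

# Hypothetical resource definitions, as they might be parsed from an IaC plan.
RESOURCES = [
    {"name": "shard-1", "type": "db_shard", "data_volume": "/dev/sdb"},
    {"name": "shard-2", "type": "db_shard", "ssh_enabled": True},
]

violations = check_policies(RESOURCES, POLICIES)
```

Running such a check in the deployment pipeline moves the mistake from a person's memory to a failed build.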

  45. Clear SLOs == fewer failures
    Throughput as TPS vs Concurrency
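The slide does not spell out how TPS and concurrency differ. One common way to relate them is Little's Law: the average number of requests in flight equals throughput multiplied by average latency. The sketch below applies that relation; the function names and the example figures are assumptions for illustration.

```python
def required_concurrency(tps: float, latency_seconds: float) -> float:
    """Little's Law: average in-flight requests = throughput x average latency."""
    return tps * latency_seconds


def max_tps(concurrency: int, latency_seconds: float) -> float:
    """Throughput ceiling for a fixed number of concurrent workers."""
    return concurrency / latency_seconds
```

For example, sustaining 200 TPS at 250 ms latency needs about 50 requests in flight, and the same 50 workers deliver only 100 TPS if latency doubles to 500 ms. An SLO stated only in TPS hides that dependency on latency.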

  46. Better tools == fewer failures

  47. What are better tools?

  48. Clever Hans

  49. There are no better tools
    There is only better usage

  50. Which one is better?

  51. Terraform > Pulumi?

  52. [No text on this slide]

  53. Ansible > Chef?

  54. Prometheus > InfluxDB?

  55. Elasticsearch > Humio?

  56. Recap
    Infrastructure             Terraform      Pulumi
    Configuration Management   Ansible        Chef
    Observability              Prometheus     InfluxDB
    Logging                    Elasticsearch  Humio

  57. Culture will eat tools
    for breakfast, lunch, and dinner

  58. Step 1: Build a culture
    Adopt tools, standardize.
    Frequent RCAs.
    Step 2: Improve the culture
    Share knowledge.
    Embrace failures.

  59. To err is human
    Failures are inevitable

  60. Thank you
    last9.io/failures
    Piyush Verma
