Doing SRE the right way - 2

Pulumi or Terraform?
How much and what all to automate?
Automated vs automatic?
Ansible or K8s?
Serverless or needless?

Choices, choices, choices - and very expensive these choices are.

In session 2, speaker Piyush Verma, co-founder of Last9 Inc, will lay out the facts (from popular tech fiction) and the trade-offs associated with these choices.

Piyush Verma

June 23, 2020

Transcript

  1. I used to believe these are the answers to reliability, and things will begin to fail less. But...
  2. Would this?

     meson10@meson10-xps-9370$: LAST9_ORG=last9 make launch
     [2020/06/20 11:27:40:2869] N: liblast9 1.6.0-c15cfb7
     [2020/06/20 11:27:40:2870] N: tty configuration:
     [2020/06/20 11:27:40:2870] N: start command: poetry run
     [2020/06/20 11:27:40:2870] N: close signal: SIGHUP (1)
     [2020/06/20 11:27:40:2872] N: Using foreign event loop...
     [2020/06/20 11:27:40:2873] N: Listening on port: 7681
  3. If a tree falls in a forest and no one is around to hear it, does it make a sound?
  4. If a bug happens and no user is around to see it, should it raise an alert?
  5. 20 hours later … mount hadn’t run on one db shard, shard rebooted, data wiped!
  6. Failure reasons
     ➔ New Deployment
     ➔ Incorrect expectation
     ➔ Network failure
     ➔ Traffic/Load
     ➔ Configuration change
     ➔ Service-Provider fault
  7. What’s the Root Cause?
     - System Configuration Validator
     - Introduce FMEA
     - Non-Latent Configuration Validator
  9. Step 1: Build a culture. Adopt tools, standardize, run frequent RCAs.
     Step 2: Improve the culture. Share knowledge. Embrace failures.
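
The incident in item 5 (a mount that had not run on one db shard, found only after a reboot wiped the data) and the "Non-Latent Configuration Validator" in item 7 point toward a check that fails loudly at startup instead of surfacing hours later. The slides do not show how such a validator was built, so the following is only a minimal sketch in Python under that reading. The function names and the example paths are hypothetical; `os.path.ismount` is the standard-library call used to test whether a path is a mount point.

```python
import os
import sys


def unmounted_paths(required_mounts, is_mount=os.path.ismount):
    """Return the required paths that are not currently mount points.

    `is_mount` is injectable so the check can be exercised without real mounts.
    """
    return [path for path in required_mounts if not is_mount(path)]


def startup_check(required_mounts, is_mount=os.path.ismount):
    """Report every missing mount on stderr and return a process exit code.

    Returns 0 when all required paths are mounted and 1 otherwise, so a
    service wrapper can refuse to start the database on an unmounted disk.
    """
    missing = unmounted_paths(required_mounts, is_mount)
    for path in missing:
        print(f"FATAL: {path} is not a mounted volume; refusing to start",
              file=sys.stderr)
    return 1 if missing else 0
```

A service wrapper could call `sys.exit(startup_check(["/var/lib/db/data"]))` before launching the database (the path is a placeholder), so that a shard with a missing mount never starts writing to its root disk.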