Slide 1

Slide 1 text

Break Glass, Repair Fast, Reconcile Automation
All Things Open 2023
October 16, 2023

Slide 2

Slide 2 text

alert: application error rate is high ⚠

Slide 3

Slide 3 text

alert: database is not responding 🚨

Slide 4

Slide 4 text

let’s troubleshoot 🔍

Slide 5

Slide 5 text

The Layers of Automation
• UI / CLI
• GitOps
• Infrastructure as Code
• Infrastructure API

Slide 6

Slide 6 text

Rosemary Wang (she/her)
@joatmon08
joatmon08.github.io

Slide 7

Slide 7 text

The (Least Discussed) Operational Pattern
Break Glass → Repair Fast → Reconcile Automation

Slide 8

Slide 8 text

The Layers of Automation
• UI / CLI
• GitOps
• Infrastructure as Code
• Infrastructure API

Slide 9

Slide 9 text

Break Glass

Slide 10

Slide 10 text

“As Code”: Commit → Push → Deploy

Slide 11

Slide 11 text

“As Code”: Commit → Push → Deploy
Break Glass: Fix ⚡

Slide 12

Slide 12 text

Break Glass
fix the way you’re “not supposed to”

Slide 13

Slide 13 text

• UI / CLI
• GitOps
• Infrastructure as Code
• Infrastructure API
😓 🤔

Slide 14

Slide 14 text

Who can break glass?

Slide 15

Slide 15 text

The Access Lifecycle
Identify → Grant → Fix → Revoke

Slide 16

Slide 16 text

Automate Ephemeral Access

Slide 17

Slide 17 text

Automation Workflow
• Define break-glass role
• Break glass: add user to role (pull request / webhook)
• Resolve: remove user from role (pull request / webhook / scheduled job)

Slide 18

Slide 18 text

Example

    resource "aws_ssoadmin_permission_set_inline_policy" "break_glass" {
      inline_policy      = data.aws_iam_policy_document.break_glass.json
      instance_arn       = tolist(data.aws_ssoadmin_instances.production.arns)[0]
      permission_set_arn = data.aws_ssoadmin_permission_set.production.arn
    }

    resource "aws_ssoadmin_permission_set" "production" {
      # omitted
    }

Break glass
• Triggers: user opens incident; alertmanager detects high % of errors
• Actions: terraform apply (separate state); PUT to access management API

Resolve
• Triggers: user resolves; alertmanager resolves; resolve after 1 day
• Actions: terraform destroy; DELETE to access management API
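The "add user to role" and "remove user from role" steps could be expressed as one account assignment kept in its own state, so that terraform apply grants access and terraform destroy revokes it. This is a minimal sketch assuming AWS IAM Identity Center, as in the example above. The variables break_glass_user_id and production_account_id are hypothetical and do not appear in the slides.

```hcl
# Hypothetical inputs, supplied by the pull request or webhook that breaks glass.
variable "break_glass_user_id" {
  type        = string
  description = "Identity store ID of the user who needs temporary access"
}

variable "production_account_id" {
  type        = string
  description = "AWS account ID the break-glass role applies to"
}

# Creating this resource adds the user to the break-glass permission set;
# destroying it removes the user again.
resource "aws_ssoadmin_account_assignment" "break_glass" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.production.arns)[0]
  permission_set_arn = data.aws_ssoadmin_permission_set.production.arn

  principal_id   = var.break_glass_user_id
  principal_type = "USER"

  target_id   = var.production_account_id
  target_type = "AWS_ACCOUNT"
}
```

Keeping the assignment in a separate state means revoking access never touches the permission set or policy definitions themselves.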

Slide 19

Slide 19 text

Repair Fast

Slide 20

Slide 20 text

incident resolution

Slide 21

Slide 21 text

Some specific steps…
Pause → Assess → Record → End State

Slide 22

Slide 22 text

Pause → Assess → Record → End State
• UI / CLI
• GitOps: stop auto-sync
• Infrastructure as Code: stop auto-approve
• Infrastructure API: lock changes

Slide 23

Slide 23 text

Pause → Assess → Record → End State
• Cascading failure from…
  • Automation?
  • System dependencies?
  • Active control loops?
• Criticality of resource?

Slide 24

Slide 24 text

Quarantine & Fix
Scenario: very important virtual machine failed

Quarantine:

    terraform state rm aws_instance.worker

    # resource "aws_instance" "worker" {
    #   # omitted
    # }

Break glass: developer logs into machine

Resolve:

    import {
      to = aws_instance.worker
      id = ""
    }

    # uncomment resource
    terraform apply
    # delete import block
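On Terraform 1.7 or later, a removed block is a declarative alternative to running terraform state rm for the quarantine step. This is a swapped-in technique, not what the slide shows. It assumes the resource block for aws_instance.worker has been commented out, as above.

```hcl
# Stands in for the commented-out aws_instance.worker resource block
# while the machine is quarantined.
removed {
  from = aws_instance.worker

  lifecycle {
    # Forget the instance in state but leave the real machine running.
    destroy = false
  }
}
```

The next terraform apply drops the instance from state without destroying it, and the change is recorded in version control instead of only in a terminal history.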

Slide 25

Slide 25 text

What if it can’t be fixed?

Slide 26

Slide 26 text

Immutability
create new resources with changes

Slide 27

Slide 27 text

“Tainting” Resources
Scenario: virtual machine failed test dependencies

Break glass: developer logs into machine (manual fix not possible)
Fix as code: terraform taint
Resolve: terraform apply

    # module.boundary_worker_rds.aws_instance.worker is tainted, so must be replaced
    -/+ resource "aws_instance" "worker" {

Slide 28

Slide 28 text

Blue / Green Deployment
Scenario: database migration failed

Break glass: developer logs into machine (database cannot be restored, manual fix failed)
Fix as code: developer generates new resource
Resolve: switch application

    # resource "aws_db_instance" "prod_v1" {
    # }

    data "aws_db_snapshot" "latest_prod_snapshot" {
      # omitted
    }

    resource "aws_db_instance" "prod_v2" {
      snapshot_identifier = data.aws_db_snapshot.latest_prod_snapshot.id
    }

Slide 29

Slide 29 text

Pause → Assess → Record → End State
• ⭐ As code
• For endpoints…
  • Session recording
  • Terminal history
• For resources…
  • Manual modification

Slide 30

Slide 30 text

Pause → Assess → Record → End State
• Identify final state of all resources
• Clean up blue/green deployment

Slide 31

Slide 31 text

Reconcile Automation

Slide 32

Slide 32 text

Drift
difference between expected and actual environment

Slide 33

Slide 33 text

Expected (Code)

    resource "aws_db_instance" "database" {
      engine         = "postgres"
      engine_version = "13.11"
      # omitted
    }

Actual (State)

    {
      "module": "module.this",
      "mode": "data",
      "type": "aws_db_instance",
      "name": "check",
      "instances": [
        {
          "index_key": 0,
          "schema_version": 0,
          "attributes": {
            "db_parameter_groups": [
              "default.postgres14"
            ],
            "engine": "postgres",
            "engine_version": "14.9",
            "license_model": "postgresql-license"
          }
        }
      ]
    }
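Drift like this can be surfaced on every plan and apply with a check block (Terraform 1.5 or later). The state excerpt shows a data source named check, but how the speaker wired it up is not shown, so the following is only a guess at one possible shape. The expected version is copied from the slide, and the lookup through aws_db_instance.database.identifier is an assumption.

```hcl
check "database_engine_version" {
  # Scoped data source: reads the live database each time Terraform runs.
  data "aws_db_instance" "check" {
    db_instance_identifier = aws_db_instance.database.identifier
  }

  assert {
    # "13.11" is the version declared in code on the slide.
    condition     = data.aws_db_instance.check.engine_version == "13.11"
    error_message = "Database engine version differs from the 13.11 declared in code."
  }
}
```

A failed assertion in a check block is reported as a warning and does not block the run, which suits drift detection: the operator sees the mismatch without the pipeline halting mid-incident.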

Slide 34

Slide 34 text

Drift affects future automation

Slide 35

Slide 35 text

[Diagram: code (commit → commit2) vs. state (drift) ⚡ 🚨]

Slide 36

Slide 36 text

• UI / CLI
• GitOps: start sync
• Infrastructure as Code: terraform plan -refresh-only
• Infrastructure API: allow changes
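Once terraform plan -refresh-only has shown what changed, one way to reconcile is to update the code so it matches the actual state. The sketch below reuses the database example from the drift slide and assumes the team decides to keep the upgraded engine version. The other option, reverting the infrastructure to match the code, is equally valid.

```hcl
resource "aws_db_instance" "database" {
  engine = "postgres"

  # Was "13.11"; updated to the version found in state after the manual fix.
  engine_version = "14.9"

  # omitted
}
```

After this change, a following terraform plan should report no differences for engine_version.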

Slide 37

Slide 37 text

Idempotence
reconciliation should achieve the same result

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

Summary

Slide 40

Slide 40 text

• UI / CLI
• GitOps
• Infrastructure as Code
• Infrastructure API
break glass → repair fast → reconcile automation

Slide 41

Slide 41 text

Thank you!
Rosemary Wang
joatmon08.github.io