Break Glass, Repair Fast, Reconcile Automation

Originally presented at All Things Open, October 16, 2023.

Rosemary Wang

October 16, 2023

Transcript

  1. Automation Workflow

    Define break-glass role.
    Break glass: add user to role (pull request / webhook).
    Resolve: remove user from role (pull request / webhook / scheduled job).
  2. Example

        resource "aws_ssoadmin_permission_set_inline_policy" "break_glass" {
          inline_policy      = data.aws_iam_policy_document.break_glass.json
          instance_arn       = tolist(data.aws_ssoadmin_instances.production.arns)[0]
          permission_set_arn = data.aws_ssoadmin_permission_set.production.arn
        }

        resource "aws_ssoadmin_permission_set" "production" {
          # omitted
        }

    Break glass:
    • terraform apply (separate state)
    • PUT to access management API
    • user opens incident
    • alertmanager detects high % of errors

    Resolve:
    • terraform destroy
    • DELETE to access management API
    • user resolves
    • alertmanager resolves
    • resolve after 1 day
  3. UI/CLI, GitOps, Infrastructure as Code, Infrastructure API

    • stop auto-sync
    • stop auto-approve
    • lock changes

    Stages: Pause, Assess, Record, End State
  4. Stages: Pause, Assess, Record, End State

    • Cascading failure from…
    • Automation?
    • System dependencies?
    • Active control loops?
    • Criticality of resource?
  5. Quarantine & Fix

    Very important virtual machine failed.
    Quarantine: terraform state rm aws_instance.worker

        # resource "aws_instance" "worker" {
        #   # omitted
        # }

    Break glass: developer logs into machine.
    Resolve: terraform apply, then delete import block.

        import {
          to = aws_instance.worker
          id = "<id here>"
        }
        # uncomment resource
  6. “Tainting” Resources

    Virtual machine failed; manual fix not possible.
    Break glass: developer logs into machine, test dependencies.
    Fix as code: terraform taint
    Resolve: terraform apply

        # module.boundary_worker_rds.aws_instance.worker is tainted, so must be replaced
        -/+ resource "aws_instance" "worker" {
  7. Blue / Green Deployment

    Database migration failed; database cannot be restored, manual fix failed.

        # resource "aws_db_instance" "prod_v1" {
        # }

    Break glass: developer logs into machine.
    Fix as code: developer generates new resource.

        data "aws_db_snapshot" "latest_prod_snapshot" {
          # omitted
        }

        resource "aws_db_instance" "prod_v2" {
          snapshot_identifier = data.aws_db_snapshot.latest_prod_snapshot.id
        }

    Resolve: switch application.
  8. Stages: Pause, Assess, Record, End State

    • ⭐ As code
    • For endpoints…
    • Session recording
    • Terminal history
    • For resources…
    • Manual modification
  9. Stages: Pause, Assess, Record, End State

    • Identify final state of all resources
    • Clean up blue/green deployment
  10. Expected (Code) versus Actual (State)

    Expected (Code):

        resource "aws_db_instance" "database" {
          engine         = "postgres"
          engine_version = "13.11"
          # omitted
        }

    Actual (State):

        {
          "module": "module.this",
          "mode": "data",
          "type": "aws_db_instance",
          "name": "check",
          "instances": [
            {
              "index_key": 0,
              "schema_version": 0,
              "attributes": {
                "db_parameter_groups": [
                  "default.postgres14"
                ],
                "engine": "postgres",
                "engine_version": "14.9",
                "license_model": "postgresql-license"
              }
            }
          ]
        }
  11. UI/CLI, GitOps, Infrastructure as Code, Infrastructure API

    • start sync
    • terraform plan -refresh-only
    • allow changes
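
The Expected (Code) versus Actual (State) slide shows code that expects engine_version "13.11" while the state records "14.9". A minimal Python sketch of that comparison is given here to make the reconcile step concrete. The names `EXPECTED`, `STATE_JSON`, and `find_drift` are hypothetical, the values are copied from the slide, and only the first instance in the state entry is inspected.

```python
import json

# Expected values, copied from the resource block on the slide.
EXPECTED = {"engine": "postgres", "engine_version": "13.11"}

# Actual values, copied from the state JSON on the slide.
STATE_JSON = """
{
  "module": "module.this",
  "mode": "data",
  "type": "aws_db_instance",
  "name": "check",
  "instances": [
    {
      "index_key": 0,
      "schema_version": 0,
      "attributes": {
        "db_parameter_groups": ["default.postgres14"],
        "engine": "postgres",
        "engine_version": "14.9",
        "license_model": "postgresql-license"
      }
    }
  ]
}
"""


def find_drift(expected: dict, state_resource: dict) -> dict:
    """Map each drifted attribute to an (expected, actual) pair.

    Assumption: only attributes named in `expected` are compared, and only
    the first instance of the state entry is read.
    """
    actual = state_resource["instances"][0]["attributes"]
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    drift = find_drift(EXPECTED, json.loads(STATE_JSON))
    for key, (want, got) in drift.items():
        print(f"{key}: code expects {want!r}, state has {got!r}")
```

This only illustrates the comparison a person makes when reconciling code with state; it is not how Terraform itself computes a plan.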
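
The Example slide lists "resolve after 1 day" as one way break-glass access ends, and the Automation Workflow slide mentions a scheduled job for removing a user from the role. A minimal Python sketch of the time-based check such a job might run is given here. The names `BREAK_GLASS_TTL`, `should_revoke`, and `expired_users`, and the idea of tracking a grant timestamp per user, are assumptions for illustration and are not from the talk.

```python
from datetime import datetime, timedelta, timezone

# Assumption: access ends one day after it was granted, following the
# "resolve after 1 day" note on the slide.
BREAK_GLASS_TTL = timedelta(days=1)


def should_revoke(granted_at: datetime, now: datetime,
                  incident_resolved: bool = False) -> bool:
    """Return True when break-glass access should be removed."""
    if incident_resolved:
        # A resolved incident ends access regardless of elapsed time.
        return True
    return now - granted_at >= BREAK_GLASS_TTL


def expired_users(grants: dict, now: datetime) -> list:
    """Return the sorted users whose grant has reached the time limit.

    `grants` is a hypothetical mapping of user name to grant time.
    """
    return sorted(
        user for user, granted_at in grants.items()
        if should_revoke(granted_at, now)
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    grants = {
        "alice": now - timedelta(days=2),
        "bob": now - timedelta(hours=3),
    }
    print(expired_users(grants, now))
```

The actual removal step (terraform destroy or a DELETE to the access management API, as listed on the slide) is left out, since the talk does not specify that API.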