Break Glass, Repair Fast, Reconcile Automation

Originally presented at All Things Open, October 16, 2023.

Rosemary Wang

October 16, 2023

Transcript

  1. October 16, 2023
    Break Glass, Repair Fast,
    Reconcile Automation
    All Things Open 2023

  2. alert: application error rate is high


  3. alert: database is not responding
    🚨

  4. let’s troubleshoot
    🔍

  5. The Layers of Automation
    UI/CLI
    GitOps
    Infrastructure as Code
    Infrastructure API

  6. Rosemary Wang
    she/her
    @joatmon08
    joatmon08.github.io

  7. The (Least Discussed) Operational Pattern
    Break Glass
    Repair Fast
    Reconcile Automation

  8. The Layers of Automation
    UI/CLI
    GitOps
    Infrastructure as Code
    Infrastructure API

  9. Break Glass

  10. Commit Push Deploy
    “As Code”

  11. Commit Push Deploy
    “As Code”
    Break Glass
    Fix

  12. Break Glass


    fix the way you’re “not supposed to”

  13. UI/CLI
    GitOps
    Infrastructure as Code
    Infrastructure API
    😓
    🤔

  14. Who can break glass?

  15. The Access Lifecycle
    Identify
    Grant
    Fix
    Revoke

  16. Automate Ephemeral Access

  17. Automation Workflow
    define break-glass role
    break glass
      add user to role
      pull request / webhook
    resolve
      remove user from role
      pull request / webhook / scheduled job

  18. Example
    resource "aws_ssoadmin_permission_set_inline_policy" "break_glass" {
      inline_policy      = data.aws_iam_policy_document.break_glass.json
      instance_arn       = tolist(data.aws_ssoadmin_instances.production.arns)[0]
      permission_set_arn = data.aws_ssoadmin_permission_set.production.arn
    }
    resource "aws_ssoadmin_permission_set" "production" {
      # omitted
    }

    break glass
      user opens incident
      alertmanager detects high % of errors
      terraform apply (separate state)
      PUT to access management API

    resolve
      user resolves
      alertmanager resolves
      resolve after 1 day
      terraform destroy
      DELETE to access management API
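
The "add user to role" step can itself live in the separate Terraform state mentioned on this slide, so that `terraform apply` grants access and `terraform destroy` revokes it. What follows is a minimal sketch, assuming AWS IAM Identity Center and the AWS provider's `aws_ssoadmin_account_assignment` resource. The three variables, the lookup of the permission set by name, and the account-level target are illustrative assumptions, not taken from the talk.

```hcl
# Hypothetical break-glass assignment, kept in its own state so that
# `terraform destroy` removes only the temporary access.

variable "responder_user_id" {
  description = "Identity store user ID of the responder (assumed input)"
  type        = string
}

variable "production_account_id" {
  description = "AWS account ID to grant access to (assumed input)"
  type        = string
}

variable "permission_set_name" {
  description = "Name of the existing break-glass permission set (assumed input)"
  type        = string
}

data "aws_ssoadmin_instances" "production" {}

data "aws_ssoadmin_permission_set" "production" {
  instance_arn = tolist(data.aws_ssoadmin_instances.production.arns)[0]
  name         = var.permission_set_name
}

# break glass: add the user to the role
resource "aws_ssoadmin_account_assignment" "break_glass" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.production.arns)[0]
  permission_set_arn = data.aws_ssoadmin_permission_set.production.arn

  principal_id   = var.responder_user_id
  principal_type = "USER"

  target_id   = var.production_account_id
  target_type = "AWS_ACCOUNT"
}
```

Because the assignment is the only managed resource in this state, destroying it on resolution (or from a scheduled job after one day) revokes access without touching the permission set.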


  19. Repair Fast

  20. incident resolution

  21. Some specific steps…
    Pause

    Assess

    Record

    End State

  22. Pause
    UI/CLI
    GitOps
    Infrastructure as Code
    Infrastructure API

    stop auto-sync
    stop auto-approve
    lock changes

  23. Assess
    • Cascading failure from…
      • Automation?
      • System dependencies?
      • Active control loops?
    • Criticality of resource?

  24. Quarantine & Fix
    very important virtual machine failed

    Quarantine
      terraform state rm aws_instance.worker
      # resource "aws_instance" "worker" {
      #   omitted
      # }

    break glass
      developer logs into machine

    resolve
      import {
        to = aws_instance.worker
        id = ""
      }
      # uncomment resource
      terraform apply
      delete import block

  25. What if it can’t be fixed?

  26. Immutability


    create new resources with changes

  27. “Tainting” Resources
    virtual machine failed

    break glass
      developer logs into machine
      manual fix not possible

    fix as code
      terraform taint
      # module.boundary_worker_rds.aws_instance.worker is tainted, so must be replaced
      -/+ resource "aws_instance" "worker" {

    resolve
      terraform apply

    test dependencies

  28. Blue / Green Deployment
    database migration failed

    break glass
      developer logs into machine
      database cannot be restored, manual fix failed

    fix as code
      developer generates new resource
      data "aws_db_snapshot" "latest_prod_snapshot" {
        # omitted
      }
      resource "aws_db_instance" "prod_v2" {
        snapshot_identifier = data.aws_db_snapshot.latest_prod_snapshot.id
      }

    resolve
      switch application
      # resource "aws_db_instance" "prod_v1" {
      # }

  29. Record
    • ⭐ As code
    • For endpoints…
      • Session recording
      • Terminal history
    • For resources…
      • Manual modification

  30. End State
    • Identify final state of all resources
    • Clean up blue/green deployment

  31. Reconcile Automation

  32. Drift


    difference between expected and actual environment

  33. Expected (Code)
    resource "aws_db_instance" "database" {
      engine         = "postgres"
      engine_version = "13.11"
      # omitted
    }

    Actual (State)
    {
      "module": "module.this",
      "mode": "data",
      "type": "aws_db_instance",
      "name": "check",
      "instances": [
        {
          "index_key": 0,
          "schema_version": 0,
          "attributes": {
            "db_parameter_groups": [
              "default.postgres14"
            ],
            "engine": "postgres",
            "engine_version": "14.9",
            "license_model": "postgresql-license"
          }
        }
      ]
    }
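
Drift like this, where the code pins `13.11` but the running database reports `14.9`, can be surfaced on every plan. The following is a minimal sketch, assuming Terraform 1.5 or later and its `check` block with a scoped data source. The variable names are hypothetical, and the configuration used in the talk may be structured differently.

```hcl
terraform {
  required_version = ">= 1.5.0"
}

variable "db_identifier" {
  description = "Identifier of the running database instance (assumed input)"
  type        = string
}

variable "expected_engine_version" {
  description = "Engine version pinned in code"
  type        = string
  default     = "13.11"
}

check "database_engine_version" {
  # Scoped data source: reads the actual instance on each plan and apply.
  data "aws_db_instance" "check" {
    db_instance_identifier = var.db_identifier
  }

  assert {
    condition     = data.aws_db_instance.check.engine_version == var.expected_engine_version
    error_message = "Database engine version has drifted from the version pinned in code."
  }
}
```

A failed assertion is reported as a warning rather than blocking the run, so it flags drift without stopping a repair that is still in progress.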


  34. Drift


    affects future automation

  35. commit commit2
    code
    state
    drift
    🚨

  36. UI/CLI
    GitOps
    Infrastructure as Code
    Infrastructure API

    start sync
    terraform plan -refresh-only
    allow changes

  37. Idempotence


    reconciliation should achieve the same result

  38. UI/CLI
    GitOps
    Infrastructure as Code
    Infrastructure API

    break glass
    repair fast
    reconcile automation

  39. Thank you!
    Rosemary Wang
    joatmon08.github.io