Roman Neß
December 02, 2021

LightningTalk: CloudFormation auto-import + existing AWS resources = Major Incident 💣

https://talks.cosee.biz/talk/3bcb22e1-9e7a-447e-b068-3d34479bbcd0

Roman is an Infrastructure as Code enthusiast and tells you about a fuckup from the distant past, when not every one of our AWS resources was managed by Terraform or CloudFormation. In his talk you will learn how we provoked a multi-hour incident on a popular online platform by carelessly running and deleting a CloudFormation configuration, and what we learned from it.


Transcript

  1. Cloud infrastructure @ cosee in 2021

     Infrastructure as Code with Terraform all the way 😍
     • all resources inside AWS accounts
     • AWS accounts inside our AWS organisation
     • other TF providers for full automation: GitLab, Kubernetes, PagerDuty
     https://knowyourmeme.com/memes/swole-doge-vs-cheems
  2. Cloud infrastructure @ cosee in 2016

     • A platform with thousands of active users 😎
     • New AWS resources created and managed with CloudFormation 🏗
     • Network, DNS, IAM, …
     • partially untouched for years 😧
     • not managed by Infrastructure as Code 😨
     • Some AWS resources shared between production and staging environment 😬 🔧
  3. Course of events (1/2)

     actual footage of the person on call
     https://knowyourmeme.com/memes/limmy-waking-up
     T+0: An increased error rate incident triggered during the weekend
  4. Course of events (1/2)

     T+0: An increased error rate incident triggered during the weekend
     T+30m: A small portion of the backend routes throw errors continuously
     T+45m: Errors happen when microservices talk to each other
     T+1h15m: We haven’t changed anything. Let’s contact AWS support.
  5. Course of events (2/2)

     T+1h45m: Private Subnet does not seem to have outbound internet access
     T+2h00m: Checking CloudTrail logs for events during the time frame
     T+2h45m: Apparently a RouteTableAssociation was deleted; the Private Subnet is now using the main RouteTable of the VPC
     T+2h50m: Created missing RouteTableAssociation. Incident resolved.
     But what happened?
     https://knowyourmeme.com/memes/confused-nick-young
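
The CloudTrail search in the timeline above can be sketched as a filter over event records. This is only an illustration, not the query used during the incident: the records below are hand-written and hypothetical (timestamps, IP addresses, and the two-hour window are made up), while the field names follow the CloudTrail record format and the event names are real EC2 API actions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical records shaped like CloudTrail events; every value is invented.
SAMPLE_EVENTS = [
    {
        "eventTime": "2000-01-01T09:12:00Z",
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DescribeInstances",
        "sourceIPAddress": "198.51.100.7",
    },
    {
        "eventTime": "2000-01-01T10:03:00Z",
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DisassociateRouteTable",
        # Assumption: calls made on behalf of CloudFormation show the service name here.
        "sourceIPAddress": "cloudformation.amazonaws.com",
    },
    {
        "eventTime": "2000-01-02T08:00:00Z",
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DeleteRoute",
        "sourceIPAddress": "198.51.100.7",
    },
]

# EC2 API calls that can change how a subnet routes its traffic.
ROUTING_CHANGES = {
    "DisassociateRouteTable",
    "ReplaceRouteTableAssociation",
    "DeleteRouteTable",
    "DeleteRoute",
    "ReplaceRoute",
}


def parse_time(value):
    """Parse a CloudTrail-style UTC timestamp such as 2000-01-01T10:03:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def routing_changes_between(events, start, end):
    """Return routing-related EC2 events inside [start, end], oldest first."""
    hits = [
        e for e in events
        if e["eventSource"] == "ec2.amazonaws.com"
        and e["eventName"] in ROUTING_CHANGES
        and start <= parse_time(e["eventTime"]) <= end
    ]
    return sorted(hits, key=lambda e: parse_time(e["eventTime"]))


# Look at the two hours before the (made-up) start of the incident.
incident_start = datetime(2000, 1, 1, 10, 30, tzinfo=timezone.utc)
suspects = routing_changes_between(
    SAMPLE_EVENTS, incident_start - timedelta(hours=2), incident_start
)
for event in suspects:
    print(event["eventTime"], event["eventName"], event["sourceIPAddress"])
```

Narrowing the search to mutating network calls keeps the result short enough to read during an incident; in a real account the records would come from CloudTrail itself rather than from a list in code.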
  6. Post Mortem

     • A colleague tries something out with a new CloudFormation stack
     • Existing RouteTableAssociation is automatically imported into CloudFormation
     • Managed RouteTableAssociation is deleted with the CloudFormation stack
     • Subnet falls back to the main RouteTable (which does not route to the NatGateway)
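
The fallback described in the post mortem can be modelled in a few lines. This is a toy model, not AWS code: the class names, subnet ID, and CIDR ranges are hypothetical, and it assumes the main RouteTable only carried the local route. It relies on one documented VPC behaviour: a subnet without an explicit association uses the VPC's main route table.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RouteTable:
    name: str
    routes: Dict[str, str]  # destination CIDR -> target


@dataclass
class Vpc:
    main_route_table: RouteTable
    # Explicit associations: subnet ID -> route table.
    associations: Dict[str, RouteTable] = field(default_factory=dict)

    def effective_route_table(self, subnet_id: str) -> RouteTable:
        """A subnet without an explicit association uses the main route table."""
        return self.associations.get(subnet_id, self.main_route_table)

    def default_route_target(self, subnet_id: str) -> Optional[str]:
        """Where internet-bound traffic (0.0.0.0/0) goes, or None if nowhere."""
        return self.effective_route_table(subnet_id).routes.get("0.0.0.0/0")


# Assumption: the main table only has the local route, the private table adds a NAT route.
main = RouteTable("main", {"10.0.0.0/16": "local"})
private = RouteTable("private", {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-gateway"})
vpc = Vpc(main_route_table=main, associations={"subnet-private": private})

before = vpc.default_route_target("subnet-private")

# Deleting the stack removes the association it had taken over.
del vpc.associations["subnet-private"]
after = vpc.default_route_target("subnet-private")

print("before:", before)
print("after:", after)
```

Nothing fails at the moment the association disappears, which is why the change only showed up later as errors in services that needed outbound access.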
  7. Private Subnet does not route to NatGateway; Internet connectivity lost

     Uptime monitor “statping” is hosted in ECS to reproduce the problem
     (https://github.com/statping/statping)
  8. Learnings

     • CloudFormation can silently perform unexpected changes (just use Terraform 😏)
     • Use distinct AWS resources for production and staging environments
     • Test out new services and stacks in new VPCs (or even new accounts)
     • AWS CloudTrail is your best bet if you need to investigate infrastructure changes
     • A proper Infrastructure as Code setup can make your infrastructure stateless!