
Best practices for testing AWS infrastructure.

Ben Whaley
December 04, 2019


My take on Yevgeniy "Jim" Brikman's talk entitled, "How to test infrastructure code: automated testing for Terraform, Kubernetes, Docker, Packer and more," presented at AWS re:Invent 2019.

See Jim's original slide deck with more content here: https://www.slideshare.net/brikis98/how-to-test-infrastructure-code-automated-testing-for-terraform-kubernetes-docker-packer-and-more



Transcript

  1. Best practices for testing AWS infrastructure. Ben Whaley, Principal Software Engineer, Gruntwork.io. DVC04. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Slides by Yevgeniy Brikman, Co-founder, Gruntwork.io, @brikis98
  3. But how do you test infrastructure provisioned with Terraform?

        resource "aws_lambda_function" "web_app" {
          function_name = var.name
          role          = aws_iam_role.lambda.arn
          # ...
        }

        resource "aws_api_gateway_integration" "proxy" {
          type = "AWS_PROXY"
          uri  = aws_lambda_function.web_app.invoke_arn
          # ...
        }
  4. How do you test infrastructure provisioned with the AWS CDK?

        export class DemoCdkStack extends cdk.Stack {
          constructor(scope: cdk.Construct, id: string) {
            super(scope, id);
            new s3.Bucket(this, 'reInventDemoBucket', {
              versioned: true,
            });
          }
        }

        const app = new cdk.App();
        new DemoCdkStack(app, "DemoCdkStack");
  5. This talk is about how to test infrastructure code. Automated testing techniques:

        ✓ terraform
        ✓ cdk
        ✓ unit tests
        ✓ integration tests
        ✓ end-to-end tests

        Passed: 5. Failed: 0. Skipped: 0.
        Test run successful.
  6. Compiler / parser / interpreter check

    • terraform: terraform validate
    • cdk: npm run build
    • kubernetes: kubectl apply -f <file> --dry-run --validate=true
    • cloudformation: aws cloudformation validate-template
  7. Linters

    • terraform: conftest, terraform_validate, tflint
    • cdk: cdk doctor
    • kubernetes: kube-score, kube-lint, yamllint
    • cloudformation: cfn-python-lint, cfn-nag
  8. Linters find this type of problem in advance:

        $ tflint
        Error: instance_type is not a valid value
          on main.tf line 3:
           3: instance_type = "t2.2xlarge"
  9. Dry run

    • terraform: terraform plan
    • cdk: cdk diff, Snapshots
    • kubernetes: kubectl apply -f <file> --server-dry-run
    • cloudformation: Change Sets
  10. cdk-app: Create an S3 bucket

        infrastructure-as-code-testing-talk └ examples └ modules └ cdk-app └ bin └ lib └ cdk-demo-stack.ts └ test
  11. Create an S3 bucket with versioning enabled:

        export class DemoCdkStack extends cdk.Stack {
          constructor(scope: cdk.Construct, id: string) {
            super(scope, id);
            new s3.Bucket(this, 'reInventDemoBucket', {
              versioned: true,
            });
          }
        }

        const app = new cdk.App();
        new DemoCdkStack(app, "DemoCdkStack");
  12. When you run cdk deploy, it creates the CloudFormation stack and outputs the stack ARN:

        $ cdk deploy
        DemoCdkStack: deploying...
        DemoCdkStack: creating CloudFormation changeset...
        0/3 | 1:04:13 PM | CREATE_IN_PROGRESS
        … [snip] …
        3/3 | 1:04:37 PM | CREATE_COMPLETE
        ✅ DemoCdkStack
        Stack ARN: arn:aws:cloudformation:us-west-2:…[snip]…
  13. Create cdk-demo.test.ts

        infrastructure-as-code-testing-talk └ examples └ modules └ cdk-app └ bin └ lib └ test └ cdk-demo.test.ts
  14-17. The basic test structure:

        test('Bucket Stack', () => {
          const app = new cdk.App();
          const stack = new CdkDemo.DemoCdkStack(app, 'DemoTestStack');
          expectCDK(stack).to(haveResource('AWS::S3::Bucket', {
            VersioningConfiguration: { Status: "Enabled" },
          }));
        });

    1) CDK uses the Jest testing framework
    2) Set up the app and stack
    3) Assert that the stack has an S3 bucket with versioning enabled
  18. Run npm test. You now have a unit test you can run after every commit!

        $ npm run test
        PASS test/cdk-demo.test.ts
          ✓ Bucket Stack (27ms)

        Test Suites: 1 passed, 1 total
        Tests:       1 passed, 1 total
        Snapshots:   0 total
        Time:        2.391s
        Ran all test suites.
  19. What if somebody makes a change to the bucket, like removing the versioning?

        export class DemoCdkStack extends cdk.Stack {
          constructor(scope: cdk.Construct, id: string) {
            super(scope, id);
            //new s3.Bucket(this, 'reInventDemoBucket', {
            //  versioned: true,
            //});
            new s3.Bucket(this, 'reInventDemoBucket');
          }
        }
  20. All of the properties of the resource must match:

        $ npm run test
        FAIL test/cdk-demo.test.ts
          ✕ Bucket Stack (29ms)

        None of 1 resources matches resource 'AWS::S3::Bucket' with properties
        {
          "VersioningConfiguration": {
            "Status": "Enabled"
          }
        }
  21. *CDK says this is a unit test, but from an infra standpoint, it’s actually a static test.
  22. Instead, break your infra code into small modules and unit test those! [Diagram: a grid of modules]
  23. But infrastructure code is all about talking to the outside world!

        resource "aws_lambda_function" "web_app" {
          function_name = var.name
          role          = aws_iam_role.lambda.arn
          # ...
        }

        resource "aws_api_gateway_integration" "proxy" {
          type = "AWS_PROXY"
          uri  = aws_lambda_function.web_app.invoke_arn
          # ...
        }
  24. Therefore, the test strategy is:

    1. Deploy real infrastructure
    2. Validate it works (e.g., via API calls, SSH commands, etc.)
    3. Destroy the infrastructure

    (So it’s really integration testing of a single unit!)
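The deploy, validate, destroy sequence maps naturally onto Go's `defer`. The following is a minimal sketch using only the standard library; `step`, `runLifecycle`, and `demoLifecycle` are hypothetical names chosen for illustration and are not part of any real tool. It shows that cleanup still runs when validation fails.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for one phase of an infrastructure test,
// for example a wrapper around a deploy or destroy command.
type step func() error

// runLifecycle deploys, validates, and always destroys.
// The deferred destroy runs even when deploy or validate fails,
// so a failed test does not leave resources behind.
func runLifecycle(deploy, validate, destroy step) (err error) {
	// Register cleanup before deploying: a deploy that fails halfway
	// may still have created resources that need to be removed.
	defer func() {
		if derr := destroy(); derr != nil && err == nil {
			err = fmt.Errorf("destroy: %w", derr)
		}
	}()

	if err = deploy(); err != nil {
		return fmt.Errorf("deploy: %w", err)
	}
	if err = validate(); err != nil {
		return fmt.Errorf("validate: %w", err)
	}
	return nil
}

// demoLifecycle runs the lifecycle with a failing validate step and
// returns the order in which the steps were called.
func demoLifecycle() ([]string, error) {
	var calls []string
	record := func(name string, result error) step {
		return func() error {
			calls = append(calls, name)
			return result
		}
	}
	err := runLifecycle(
		record("deploy", nil),
		record("validate", errors.New("unexpected response")),
		record("destroy", nil),
	)
	return calls, err
}
```

In this sketch `demoLifecycle` reports a validation error, yet the recorded call order still ends with `destroy`.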
  25. Tools that help with infrastructure “unit” testing:

    | Tool | Deploy/Destroy | Validate | Supported technologies |
    | --- | --- | --- | --- |
    | Terratest | Yes | Yes | Terraform, k8s, packer, docker, OS, Cloud APIs |
    | kitchen-terraform | Yes | Yes | Terraform |
    | Inspec | No | Yes | OS, Cloud APIs |
    | awspec | No | Yes | AWS API |
  26. hello-world-app: deploy a “Hello, World” web service

        infrastructure-as-code-testing-talk └ examples └ hello-world-app └ main.tf └ outputs.tf └ variables.tf └ modules └ test └ README.md
  27. Under the hood, this example runs on top of AWS Lambda & API Gateway:

        resource "aws_lambda_function" "web_app" {
          function_name = var.name
          role          = aws_iam_role.lambda.arn
          # ...
        }

        resource "aws_api_gateway_integration" "proxy" {
          type = "AWS_PROXY"
          uri  = aws_lambda_function.web_app.invoke_arn
          # ...
        }
  28-32. The basic test structure:

        func TestHelloWorldAppUnit(t *testing.T) {
          terraformOptions := &terraform.Options{
            TerraformDir: "../examples/hello-world-app",
          }

          defer terraform.Destroy(t, terraformOptions)
          terraform.InitAndApply(t, terraformOptions)
          validate(t, terraformOptions)
        }

    1. Tell Terratest where your Terraform code lives
    2. Run terraform init and terraform apply to deploy your module
    3. Validate the infrastructure works. We’ll come back to this shortly.
    4. Run terraform destroy at the end of the test to clean up
  33-37. The validate function:

        func validate(t *testing.T, opts *terraform.Options) {
          url := terraform.Output(t, opts, "url")

          http_helper.HttpGetWithRetry(t,
            url,             // URL to test
            200,             // Expected status code
            "Hello, World!", // Expected body
            10,              // Max retries
            3*time.Second,   // Time between retries
          )
        }

    1. Run terraform output to get the web service URL
    2. Make HTTP requests to the URL
    3. Check the response for an expected status and body
    4. Retry the request up to 10 times, as deployment is asynchronous
  38. Then run go test. You now have a unit test you can run after every commit!

        $ go test -v -timeout 15m -run TestHelloWorldAppUnit
        …
        --- PASS: TestHelloWorldAppUnit (31.57s)
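Retrying matters because a freshly deployed service may not answer correctly on the first request. The following is a minimal sketch of that retry idea using only the standard library. It is not Terratest's implementation; `retry`, `errNotReady`, and `demoRetry` are hypothetical names, and the sketch assumes one initial attempt followed by up to `maxRetries` retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical error a check returns while the
// deployment is still coming up.
var errNotReady = errors.New("service not ready yet")

// retry calls check until it succeeds or the retries are used up,
// sleeping between attempts. It makes one initial attempt plus up to
// maxRetries retries and returns the number of attempts made.
func retry(maxRetries int, sleep time.Duration, check func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxRetries+1; attempt++ {
		if err = check(); err == nil {
			return attempt, nil
		}
		if attempt <= maxRetries {
			time.Sleep(sleep) // wait before the next attempt
		}
	}
	return maxRetries + 1, fmt.Errorf("giving up after %d attempts: %w", maxRetries+1, err)
}

// demoRetry simulates a service that becomes ready on the third request.
func demoRetry() (int, error) {
	requests := 0
	check := func() error {
		requests++
		if requests < 3 {
			return errNotReady
		}
		return nil
	}
	return retry(10, time.Millisecond, check)
}
```

In this sketch `demoRetry` succeeds on its third attempt. A check that never succeeds returns an error that wraps the last failure, so the caller can see why the service was not ready.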
  39. Examples of other ways to validate:

    | Type | Example | Validation | Helper |
    | --- | --- | --- | --- |
    | Web service | Containerized or serverless web app | HTTP requests | Terratest http_helper package |
    | Server | EC2 instance | SSH commands | Terratest ssh package |
    | Cloud resource | SQS queue | API calls | Terratest aws package |
    | Database | MySQL | SQL queries | MySQL driver for Go |
  40. Pro tip #2: run these tools in cron jobs to clean up stale resources. Features of each tool:

    • cloud-nuke: Delete resources older than a certain date; in a certain region; of a certain type.
    • Janitor Monkey: Configurable rules of what to delete. Notify owners of pending deletions.
    • aws-nuke: Target specific AWS accounts and resource types.
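The "older than a certain date" rule comes down to comparing each resource's creation time against a cutoff. The following is a minimal sketch of that selection step only. It calls no cloud API and does not reproduce how any of the tools above work; the `resource` type, `olderThan`, `demoStale`, and the sample names are all hypothetical.

```go
package main

import "time"

// resource is a hypothetical, simplified view of a cloud resource.
type resource struct {
	Name      string
	CreatedAt time.Time
}

// olderThan returns the names of resources created more than maxAge before now.
func olderThan(resources []resource, now time.Time, maxAge time.Duration) []string {
	cutoff := now.Add(-maxAge)
	var stale []string
	for _, r := range resources {
		if r.CreatedAt.Before(cutoff) {
			stale = append(stale, r.Name)
		}
	}
	return stale
}

// demoStale lists which of two made-up resources are more than 48 hours old.
func demoStale() []string {
	now := time.Date(2019, time.December, 4, 12, 0, 0, 0, time.UTC)
	resources := []resource{
		{Name: "test-bucket-old", CreatedAt: now.Add(-72 * time.Hour)},
		{Name: "test-bucket-new", CreatedAt: now.Add(-1 * time.Hour)},
	}
	return olderThan(resources, now, 48*time.Hour)
}
```

Passing `now` in as a parameter, rather than calling `time.Now()` inside the function, keeps the selection logic deterministic and easy to test.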
  41-43. Let’s say you have two Terraform modules you want to test together:

        infrastructure-as-code-testing-talk └ examples └ hello-world-app └ docker-kubernetes └ proxy-app └ web-service └ modules └ test └ README.md

    • proxy-app: an app that acts as an HTTP proxy for other web services.
    • web-service: a web service that you want proxied.
  44. proxy-app takes in the URL to proxy via an input variable:

        variable "url_to_proxy" {
          description = "The URL to proxy."
          type        = string
        }
  45-51. The basic test structure:

        func TestProxyApp(t *testing.T) {
          webServiceOpts := configWebService(t)
          defer terraform.Destroy(t, webServiceOpts)
          terraform.InitAndApply(t, webServiceOpts)

          proxyAppOpts := configProxyApp(t, webServiceOpts)
          defer terraform.Destroy(t, proxyAppOpts)
          terraform.InitAndApply(t, proxyAppOpts)

          validate(t, proxyAppOpts)
        }

    1. Configure options for the web service
    2. Deploy the web service
    3. Configure options for the proxy app (passing it the web service options)
    4. Deploy the proxy app
    5. Validate the proxy app works
    6. At the end of the test, destroy the proxy app and the web service
  52-54. The configProxyApp function:

        func configProxyApp(t *testing.T, webServiceOpts *terraform.Options) *terraform.Options {
          url := terraform.Output(t, webServiceOpts, "url")

          return &terraform.Options{
            TerraformDir: "../examples/proxy-app",
            Vars: map[string]interface{}{
              "url_to_proxy": url,
            },
          }
        }

    1. Read the url output from the web-service module
    2. Pass it in as the url_to_proxy input to the proxy-app module
  55. The validate function:

        func validate(t *testing.T, opts *terraform.Options) {
          url := terraform.Output(t, opts, "url")

          http_helper.HttpGetWithRetry(t,
            url,                         // URL to test
            200,                         // Expected status code
            `{"text":"Hello, World!"}`,  // Expected body
            10,                          // Max retries
            3*time.Second,               // Time between retries
          )
        }
  56-57. Run go test. You’re now testing multiple modules together!

        $ go test -v -timeout 15m -run TestProxyApp
        …
        --- PASS: TestProxyApp (182.44s)

    But integration tests can take (many) minutes to run…
  58. Enable test parallelism in Go by adding t.Parallel() as the first line of each test:

        func TestProxyApp(t *testing.T) {
          t.Parallel()
          // The rest of the test code
        }

        func TestHelloWorldAppUnit(t *testing.T) {
          t.Parallel()
          // The rest of the test code
        }
  59. Now, if you run go test, all the tests with t.Parallel() will run in parallel:

        $ go test -v -timeout 15m
        === RUN   TestHelloWorldApp
        === RUN   TestProxyApp
  60-61. Example: module with hard-coded IAM Role and Security Group names

        resource "aws_iam_role" "role_example" {
          name = "example-iam-role"
        }

        resource "aws_security_group" "sg_example" {
          name = "security-group-example"
        }

    If two tests tried to deploy this module in parallel, the names would conflict!
  62. Example: use variables in all resource names…

        resource "aws_iam_role" "role_example" {
          name = var.name
        }

        resource "aws_security_group" "sg_example" {
          name = var.name
        }
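With every name driven by `var.name`, a test only has to supply a value that no other run is using. The following is a minimal sketch, using only the Go standard library, of generating such a value; `uniqueName` is a hypothetical helper for illustration and is not Terratest's API.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueName appends a random hexadecimal suffix to prefix so that
// parallel test runs do not collide on resource names.
func uniqueName(prefix string) (string, error) {
	buf := make([]byte, 4) // 4 random bytes become 8 hex characters
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("reading random bytes: %w", err)
	}
	return prefix + "-" + hex.EncodeToString(buf), nil
}
```

A call such as `uniqueName("example")` returns the prefix, a hyphen, and eight random hex characters. The result would be passed to the module as the value of its `name` variable.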
  63. At test time, set the variables to a randomized value to avoid conflicts:

        uniqueId := random.UniqueId()

        return &terraform.Options{
          TerraformDir: "../examples/proxy-app",
          Vars: map[string]interface{}{
            "name": "text-proxy-app-" + uniqueId,
          },
        }
  64. You could use the same strategy…

    1. Deploy all the infrastructure
    2. Validate it works (e.g., via API calls, SSH commands, etc.)
    3. Destroy all the infrastructure
  65-66. Typical run times by test type:

    • e2e tests: 60 – 240+ minutes
    • Integration tests: 5 – 60 minutes
    • Unit tests: 1 – 20 minutes
    • Static analysis: 1 – 60 seconds

    E2E tests are too slow to be useful
  67-68. The more resources your tests deploy, the flakier they will be.

    | Type of test | # of resources | Chance of failure |
    | --- | --- | --- |
    | Unit tests | 1-10 | ~1% |
    | Integration | 11-100 | ~10% |
    | End-to-end | 400-500 | ~40-50% |

    You can work around the failure rate for unit & integration tests with retries
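One way to read these numbers is to suppose that each resource independently has a small chance of failing to deploy, so the chance of hitting at least one failure grows with the resource count. The sketch below works through that arithmetic. The independence assumption and the 0.1% per-resource rate are illustrative assumptions, chosen because they roughly reproduce the figures above; they do not come from the talk. The function names are hypothetical.

```go
package main

import "math"

// chanceOfFailure returns the probability that at least one of n resources
// fails to deploy, assuming each fails independently with probability perResource.
func chanceOfFailure(perResource float64, n int) float64 {
	return 1 - math.Pow(1-perResource, float64(n))
}

// chanceAfterAttempts returns the probability that a test fails on every one
// of the given number of attempts, assuming the attempts are independent.
func chanceAfterAttempts(failure float64, attempts int) float64 {
	return math.Pow(failure, float64(attempts))
}
```

Under these assumptions, `chanceOfFailure(0.001, n)` gives about 1% for 10 resources, about 10% for 100, and about 39% for 500. Retries help the smaller tests far more: `chanceAfterAttempts(0.10, 3)` is about 0.1%, while `chanceAfterAttempts(0.45, 3)` is still about 9%.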
  69-70. [Diagram: a grid of modules]

    1. Deploy a persistent test environment and leave it running.
    2. Each time you update a module, deploy & validate just that module
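Deploying "just that module" requires knowing which modules a change touched. The following is a minimal sketch of one possible approach, assuming a hypothetical repository layout of `modules/<name>/...` and a list of changed file paths obtained elsewhere (for example, from version control). `changedModules` and the module names in the usage note are made up for illustration.

```go
package main

import (
	"sort"
	"strings"
)

// changedModules maps changed file paths to the sorted set of module names
// they belong to, assuming a hypothetical layout of modules/<name>/... .
// Paths outside that layout are ignored.
func changedModules(paths []string) []string {
	seen := map[string]bool{}
	for _, p := range paths {
		parts := strings.Split(p, "/")
		if len(parts) >= 3 && parts[0] == "modules" {
			seen[parts[1]] = true
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names) // stable order, so the output is predictable
	return names
}
```

Given the paths `modules/vpc/main.tf`, `modules/vpc/outputs.tf`, `README.md`, and `modules/database/variables.tf`, this returns `database` and `vpc`. Only those two modules would then be redeployed and validated in the persistent environment.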
  71. Strengths and weaknesses of each technique:

    Static analysis
      Strengths: 1. Fast 2. Stable 3. No need to deploy real resources 4. Easy to use
      Weaknesses: 1. Very limited in errors you can catch 2. You don’t get much confidence in your code solely from static analysis

    Unit tests
      Strengths: 1. Fast enough (1 – 10 min) 2. Mostly stable (with retry logic) 3. High level of confidence in individual units
      Weaknesses: 1. Need to deploy real resources 2. Requires writing non-trivial code

    Integration tests
      Strengths: 1. Mostly stable (with retry logic) 2. High level of confidence in multiple units working together
      Weaknesses: 1. Need to deploy real resources 2. Requires writing non-trivial code 3. Slow (10 – 30 min)

    End-to-end tests
      Strengths: 1. Build confidence in your entire architecture
      Weaknesses: 1. Need to deploy real resources 2. Requires writing non-trivial code 3. Very slow (60 min – 240+ min)* 4. Can be brittle (even with retry logic)*