Best practices for testing AWS infrastructure.

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Ben Whaley, Principal Software Engineer, Gruntwork.io
DVC04

Slides by Yevgeniy Brikman, Co-founder, Gruntwork.io (@brikis98)

The world of software is full of Fear

Fear of downtime

Fear of security breaches

Fear of data loss

Fear of change

Teams deal with this fear in two ways:

1) Live in a perpetual state of anxiety

2) Deploy code less frequently

Sadly, both of these just make the problem worse!

There is a better way to deal with the fear:

Automated tests

Automated tests reduce anxiety and give you confidence

We know how to write automated tests for application code

    resource "aws_lambda_function" "web_app" {
      function_name = var.name
      role          = aws_iam_role.lambda.arn
      # ...
    }

    resource "aws_api_gateway_integration" "proxy" {
      type = "AWS_PROXY"
      uri  = aws_lambda_function.web_app.invoke_arn
      # ...
    }

But how do you test infrastructure provisioned with Terraform?

    // Imports assume the CDK v1 package names
    import * as cdk from '@aws-cdk/core';
    import * as s3 from '@aws-cdk/aws-s3';

    export class DemoCdkStack extends cdk.Stack {
      constructor(scope: cdk.Construct, id: string) {
        super(scope, id);
        new s3.Bucket(this, 'reInventDemoBucket', {
          versioned: true,
        });
      }
    }

    const app = new cdk.App();
    new DemoCdkStack(app, "DemoCdkStack");

How do you test infrastructure provisioned with the AWS CDK?

This talk is about how to test infrastructure code.

Automated testing techniques:
✓ terraform
✓ cdk
✓ unit tests
✓ integration tests
✓ end-to-end tests
Passed: 5. Failed: 0. Skipped: 0.
Test run successful.

Agenda
1. Static analysis
2. Unit tests
3. Integration tests
4. End-to-end tests
5. Conclusion

Static analysis: test your code without deploying it

Static analysis
1. Compiler / parser / interpreter
2. Linter
3. Dry run
4. Example: CDK testing

Compiler: Check your code for syntactic or structural issues

Compiler / parser / interpreter check
terraform:      terraform validate
cdk:            npm run build
kubernetes:     kubectl apply -f <file> --dry-run --validate=true
cloudformation: aws cloudformation validate-template

Linter: Validate your code to catch common errors

Linters
terraform:      conftest, terraform_validate, tflint
cdk:            cdk doctor
kubernetes:     kube-score, kube-lint, yamllint
cloudformation: cfn-python-lint, cfn-nag

    resource "aws_instance" "foo" {
      ami           = "ami-0ff8a91507f77f867"
      instance_type = "t1.2xlarge"
    }

t1.2xlarge is an invalid type!

    $ tflint
    Error: instance_type is not a valid value
    on main.tf line 3:
      3: instance_type = "t1.2xlarge"

Linters find this type of problem in advance

Dry run: partially execute the code to check for errors, but don’t fully deploy

Dry run
terraform:      terraform plan
cdk:            cdk diff, Snapshots
kubernetes:     kubectl apply -f <file> --server-dry-run
cloudformation: Change Sets

Sample code for this talk is at: github.com/gruntwork-io/infrastructure-as-code-testing-talk

An example of a CDK app that you might want
to test:

infrastructure-as-code-testing-talk └ examples └ modules └ cdk-app └ bin └ lib └ cdk-demo-stack.ts └ test

cdk-app: Create an S3 bucket

    export class DemoCdkStack extends cdk.Stack {
      constructor(scope: cdk.Construct, id: string) {
        super(scope, id);
        new s3.Bucket(this, 'reInventDemoBucket', {
          versioned: true,
        });
      }
    }

    const app = new cdk.App();
    new DemoCdkStack(app, "DemoCdkStack");

Create an S3 bucket with versioning enabled

    $ cdk deploy
    DemoCdkStack: deploying...
    DemoCdkStack: creating CloudFormation changeset...
    0/3 | 1:04:13 PM | CREATE_IN_PROGRESS
    … [snip] …
    3/3 | 1:04:37 PM | CREATE_COMPLETE
    ✅ DemoCdkStack
    Stack ARN: arn:aws:cloudformation:us-west-2:…[snip]…

When you run cdk deploy, it creates the CloudFormation stack and outputs the stack ARN

Let’s create a test for our CDK app

infrastructure-as-code-testing-talk └ examples └ modules └ cdk-app └ bin └ lib └ test └ cdk-demo.test.ts

Create cdk-demo.test.ts

    // Imports assume the CDK v1 assertion library and the file layout shown above
    import { expect as expectCDK, haveResource } from '@aws-cdk/assert';
    import * as cdk from '@aws-cdk/core';
    import * as CdkDemo from '../lib/cdk-demo-stack';

    test('Bucket Stack', () => {
      const app = new cdk.App();
      const stack = new CdkDemo.DemoCdkStack(app, 'DemoTestStack');
      expectCDK(stack).to(haveResource('AWS::S3::Bucket', {
        VersioningConfiguration: { Status: "Enabled" },
      }));
    });

The basic test structure:
1) CDK uses the Jest testing framework
2) Set up the app and stack
3) Assert that the stack has an S3 bucket with versioning enabled

    $ npm run test
    PASS test/cdk-demo.test.ts
      ✓ Bucket Stack (27ms)

    Test Suites: 1 passed, 1 total
    Tests:       1 passed, 1 total
    Snapshots:   0 total
    Time:        2.391s
    Ran all test suites.

Run npm test. You now have a unit test you can run after every commit!

    export class DemoCdkStack extends cdk.Stack {
      constructor(scope: cdk.Construct, id: string) {
        super(scope, id);
        // new s3.Bucket(this, 'reInventDemoBucket', {
        //   versioned: true,
        // });
        new s3.Bucket(this, 'reInventDemoBucket');
      }
    }

What if somebody makes a change to the bucket, like removing the versioning?

    $ npm run test
    FAIL test/cdk-demo.test.ts
      ✕ Bucket Stack (29ms)

    None of 1 resources matches resource 'AWS::S3::Bucket' with properties
    {
      "VersioningConfiguration": {
        "Status": "Enabled"
      }
    }

All of the properties of the resource must match

*CDK says this is a unit test, but from an
infra standpoint, it’s actually a static test

Unit tests: test that an individual unit works in isolation

Unit tests
1. Unit testing basics
2. Example: Terraform unit tests
3. Cleaning up after tests

You can’t “unit test” an entire end-to-end architecture

Instead, break your infra code into small modules and unit test those!

With app code, you can test units in isolation from
the outside world

    resource "aws_lambda_function" "web_app" {
      function_name = var.name
      role          = aws_iam_role.lambda.arn
      # ...
    }

    resource "aws_api_gateway_integration" "proxy" {
      type = "AWS_PROXY"
      uri  = aws_lambda_function.web_app.invoke_arn
      # ...
    }

But infrastructure code is all about talking to the outside world!

You can only truly test infra code by deploying to
a real environment

Key takeaway: there’s no pure unit testing for infrastructure code.

Therefore, the test strategy is:
1. Deploy real infrastructure
2. Validate it works (e.g., via API calls, SSH commands, etc.)
3. Destroy the infrastructure
(So it’s really integration testing of a single unit!)

Tools that help with infrastructure “unit” testing:

Tool               Deploy/Destroy  Validate  Supported technologies
Terratest          Yes             Yes       Terraform, k8s, packer, docker, OS, Cloud APIs
kitchen-terraform  Yes             Yes       Terraform
Inspec             No              Yes       OS, Cloud APIs
awspec             No              Yes       AWS API

An example of a Terraform module that you might want
to test:

infrastructure-as-code-testing-talk └ examples └ hello-world-app └ main.tf └ outputs.tf └ variables.tf └ modules └ test └ README.md

hello-world-app: deploy a “Hello, World” web service

    resource "aws_lambda_function" "web_app" {
      function_name = var.name
      role          = aws_iam_role.lambda.arn
      # ...
    }

    resource "aws_api_gateway_integration" "proxy" {
      type = "AWS_PROXY"
      uri  = aws_lambda_function.web_app.invoke_arn
      # ...
    }

Under the hood, this example runs on top of AWS Lambda & API Gateway

    $ terraform apply
    Outputs:
    url = ruvvwv3sh1.execute-api.us-east-2.amazonaws.com

    $ curl ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
    Hello, World!

When you run terraform apply, it deploys and outputs the URL

Let’s write a unit test for hello-world-app with Terratest

infrastructure-as-code-testing-talk └ examples └ modules └ test └ hello_world_app_test.go └ README.md

Create hello_world_app_test.go

    // Imports for hello_world_app_test.go (used by the test and by validate)
    import (
        "testing"
        "time"

        http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
        "github.com/gruntwork-io/terratest/modules/terraform"
    )

    func TestHelloWorldAppUnit(t *testing.T) {
        terraformOptions := &terraform.Options{
            TerraformDir: "../examples/hello-world-app",
        }

        defer terraform.Destroy(t, terraformOptions)
        terraform.InitAndApply(t, terraformOptions)
        validate(t, terraformOptions)
    }

The basic test structure:
1. Tell Terratest where your Terraform code lives
2. Run terraform init and terraform apply to deploy your module
3. Validate the infrastructure works. We’ll come back to this shortly.
4. Run terraform destroy at the end of the test to clean up

    func validate(t *testing.T, opts *terraform.Options) {
        url := terraform.Output(t, opts, "url")
        http_helper.HttpGetWithRetry(t,
            url,             // URL to test
            200,             // Expected status code
            "Hello, World!", // Expected body
            10,              // Max retries
            3*time.Second,   // Time between retries
        )
    }

The validate function:
1. Run terraform output to get the web service URL
2. Make HTTP requests to the URL
3. Check the response for an expected status and body
4. Retry the request up to 10 times, as deployment is asynchronous

    $ export AWS_ACCESS_KEY_ID=xxxx
    $ export AWS_SECRET_ACCESS_KEY=xxxxx

To run the test, first authenticate to AWS

    $ go test -v -timeout 15m -run TestHelloWorldAppUnit
    …
    --- PASS: TestHelloWorldAppUnit (31.57s)

Then run go test. You now have a unit test you can run after every commit!

Since we’re testing a web service, we use HTTP requests
to validate it

Examples of other ways to validate:

Type            Example                              Validation     Helper
Web service     Containerized or serverless web app  HTTP requests  Terratest http_helper package
Server          EC2 instance                         SSH commands   Terratest ssh package
Cloud resource  SQS queue                            API calls      Terratest aws package
Database        MySQL                                SQL queries    MySQL driver for Go

Tests create and destroy many resources!

Pro tip: run tests in isolated “sandbox” accounts

Tool            Features
cloud-nuke      Delete resources older than a certain date; in a certain region; of a certain type.
Janitor Monkey  Configurable rules of what to delete. Notify owners of pending deletions.
aws-nuke        Specify specific AWS accounts and resource types to target.

Pro tip #2: run these tools in cron jobs to clean up stale resources

Integration tests: test that multiple “units” work together.

Integration tests
1. Example: Terraform integration tests
2. Test parallelism

Let’s say you have two Terraform modules you want to test together:

infrastructure-as-code-testing-talk └ examples └ hello-world-app └ docker-kubernetes └ proxy-app └ web-service └ modules └ test └ README.md

proxy-app: an app that acts as an HTTP proxy for other web services.
web-service: a web service that you want proxied.

    variable "url_to_proxy" {
      description = "The URL to proxy."
      type        = string
    }

proxy-app takes in the URL to proxy via an input variable

    output "url" {
      value = module.web_service.url
    }

web-service exposes its URL via an output variable

infrastructure-as-code-testing-talk └ examples └ modules └ test └ hello_world_app_test.go └ proxy_app_test.go └ README.md

Create proxy_app_test.go

    func TestProxyApp(t *testing.T) {
        webServiceOpts := configWebService(t)
        defer terraform.Destroy(t, webServiceOpts)
        terraform.InitAndApply(t, webServiceOpts)

        proxyAppOpts := configProxyApp(t, webServiceOpts)
        defer terraform.Destroy(t, proxyAppOpts)
        terraform.InitAndApply(t, proxyAppOpts)

        validate(t, proxyAppOpts)
    }

The basic test structure:
1. Configure options for the web service
2. Deploy the web service
3. Configure options for the proxy app (passing it the web service options)
4. Deploy the proxy app
5. Validate the proxy app works
6. At the end of the test, destroy the proxy app and the web service

    func configWebService(t *testing.T) *terraform.Options {
        return &terraform.Options{
            TerraformDir: "../examples/web-service",
        }
    }

The configWebService function

    func configProxyApp(t *testing.T, webServiceOpts *terraform.Options) *terraform.Options {
        url := terraform.Output(t, webServiceOpts, "url")
        return &terraform.Options{
            TerraformDir: "../examples/proxy-app",
            Vars: map[string]interface{}{
                "url_to_proxy": url,
            },
        }
    }

The configProxyApp function:
1. Read the url output from the web-service module
2. Pass it in as the url_to_proxy input to the proxy-app module

    func validate(t *testing.T, opts *terraform.Options) {
        url := terraform.Output(t, opts, "url")
        http_helper.HttpGetWithRetry(t,
            url,                        // URL to test
            200,                        // Expected status code
            `{"text":"Hello, World!"}`, // Expected body
            10,                         // Max retries
            3*time.Second,              // Time between retries
        )
    }

The validate function

    $ go test -v -timeout 15m -run TestProxyApp
    …
    --- PASS: TestProxyApp (182.44s)

Run go test. You’re now testing multiple modules together!

But integration tests can take (many) minutes to run…

Infrastructure tests can take a long time to run

One way to save time: run tests in parallel

    func TestProxyApp(t *testing.T) {
        t.Parallel()
        // The rest of the test code
    }

    func TestHelloWorldAppUnit(t *testing.T) {
        t.Parallel()
        // The rest of the test code
    }

Enable test parallelism in Go by adding t.Parallel() as the first line of each test.

    $ go test -v -timeout 15m
    === RUN   TestHelloWorldAppUnit
    === RUN   TestProxyApp

Now, if you run go test, all the tests with t.Parallel() will run in parallel

But there’s a gotcha: resource conflicts

    resource "aws_iam_role" "role_example" {
      name = "example-iam-role"
    }

    resource "aws_security_group" "sg_example" {
      name = "security-group-example"
    }

Example: module with hard-coded IAM Role and Security Group names

If two tests tried to deploy this module in parallel, the names would conflict!

Key takeaway: you must namespace all your resources

    resource "aws_iam_role" "role_example" {
      name = var.name
    }

    resource "aws_security_group" "sg_example" {
      name = var.name
    }

Example: use variables in all resource names…

    uniqueId := random.UniqueId()
    return &terraform.Options{
        TerraformDir: "../examples/proxy-app",
        Vars: map[string]interface{}{
            "name": "text-proxy-app-" + uniqueId,
        },
    }

At test time, set the variables to a randomized value to avoid conflicts

End-to-end tests: test your entire infrastructure works together.

How do you test this entire thing?

You could use the same strategy…
1. Deploy all the infrastructure
2. Validate it works (e.g., via API calls, SSH commands, etc.)
3. Destroy all the infrastructure

But it’s rare to write end-to-end tests this way. Here’s
why:

Test pyramid, from top to bottom: e2e tests, integration tests, unit tests, static analysis.

Cost, brittleness, and run time all increase toward the top of the pyramid.

Typical run times:
e2e tests:          60 – 240+ minutes
Integration tests:  5 – 60 minutes
Unit tests:         1 – 20 minutes
Static analysis:    1 – 60 seconds

E2E tests are too slow to be useful

Another problem with E2E tests: brittleness

Let’s do some math:

Assume a single resource (e.g., EC2 instance) has a 1/1000
(0.1%) chance of failure.

The more resources your tests deploy, the flakier they will be.

Type of test  # of resources  Chance of failure
Unit tests    1-10            ~1%
Integration   11-100          ~10%
End-to-end    400-500         ~40-50%

You can work around the failure rate for unit & integration tests with retries

Key takeaway: E2E tests from scratch are too slow and
too brittle to be useful

Instead, you can do incremental E2E testing!

1. Deploy a persistent test environment and leave it running.
2. Each time you update a module, deploy & validate just that module

Testing techniques compared:

Static analysis
  Strengths: 1. Fast 2. Stable 3. No need to deploy real resources 4. Easy to use
  Weaknesses: 1. Very limited in errors you can catch 2. You don’t get much confidence in your code solely from static analysis

Unit tests
  Strengths: 1. Fast enough (1 – 10 min) 2. Mostly stable (with retry logic) 3. High level of confidence in individual units
  Weaknesses: 1. Need to deploy real resources 2. Requires writing non-trivial code

Integration tests
  Strengths: 1. Mostly stable (with retry logic) 2. High level of confidence in multiple units working together
  Weaknesses: 1. Need to deploy real resources 2. Requires writing non-trivial code 3. Slow (10 – 30 min)

End-to-end tests
  Strengths: 1. Build confidence in your entire architecture
  Weaknesses: 1. Need to deploy real resources 2. Requires writing non-trivial code 3. Very slow (60 min – 240+ min)* 4. Can be brittle (even with retry logic)*

So which should you use?

All of them! They all catch different types of bugs.

Bear in mind the test pyramid (from top to bottom: e2e tests, integration tests, unit tests, static analysis):
1. Lots of unit tests + static analysis
2. Fewer integration tests
3. A handful of high-value e2e tests

Infrastructure code without tests is scary

Fight the fear & build confidence in your code with
automated tests

Questions?
