Best practices for testing AWS infrastructure.

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

The world of software is full of Fear

Slide 4

Slide 4 text

Fear of downtime

Slide 5

Slide 5 text

Fear of security breaches

Slide 6

Slide 6 text

Fear of data loss

Slide 7

Slide 7 text

Fear of change

Slide 8

Slide 8 text

Teams deal with this fear in two ways:

Slide 9

Slide 9 text

1) Live in a perpetual state of anxiety

Slide 10

Slide 10 text

2) Deploy code less frequently

Slide 11

Slide 11 text

Sadly, both of these just make the problem worse!

Slide 12

Slide 12 text

There is a better way to deal with the fear:

Slide 13

Slide 13 text

Automated tests

Slide 14

Slide 14 text

Automated tests reduce anxiety and give you confidence

Slide 15

Slide 15 text

We know how to write automated tests for application code

Slide 16

Slide 16 text

resource "aws_lambda_function" "web_app" {
  function_name = var.name
  role          = aws_iam_role.lambda.arn
  # ...
}

resource "aws_api_gateway_integration" "proxy" {
  type = "AWS_PROXY"
  uri  = aws_lambda_function.web_app.invoke_arn
  # ...
}

But how do you test infrastructure provisioned with Terraform?

Slide 17

Slide 17 text

export class DemoCdkStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string) {
    super(scope, id);

    new s3.Bucket(this, 'reInventDemoBucket', {
      versioned: true,
    });
  }
}

const app = new cdk.App();
new DemoCdkStack(app, "DemoCdkStack");

How do you test infrastructure provisioned with the AWS CDK?

Slide 18

Slide 18 text

This talk is about how to test infrastructure code.

Automated testing techniques:
✓ terraform
✓ cdk
✓ unit tests
✓ integration tests
✓ end-to-end tests
Passed: 5. Failed: 0. Skipped: 0.
Test run successful.

Slide 19

Slide 19 text

Agenda 1. Static analysis 2. Unit tests 3. Integration tests 4. End-to-end tests 5. Conclusion

Slide 20

Slide 20 text

Static analysis: test your code without deploying it

Slide 21

Slide 21 text

Static analysis 1. Compiler / parser / interpreter 2. Linter 3. Dry run 4. Example: CDK testing

Slide 22

Slide 22 text

Compiler: Check your code for syntactic or structural issues

Slide 23

Slide 23 text

Compiler / parser / interpreter check

terraform:       terraform validate
cdk:             npm run build
kubernetes:      kubectl apply -f <file> --dry-run --validate=true
cloudformation:  aws cloudformation validate-template

Slide 24

Slide 24 text

Static analysis 1. Compiler / parser / interpreter 2. Linter 3. Dry run 4. Example: CDK testing

Slide 25

Slide 25 text

Linter: Validate your code to catch common errors

Slide 26

Slide 26 text

Linters

terraform:
• conftest
• terraform_validate
• tflint

cdk:
• cdk doctor

kubernetes:
• kube-score
• kube-lint
• yamllint

cloudformation:
• cfn-python-lint
• cfn-nag

Slide 27

Slide 27 text

resource "aws_instance" "foo" {
  ami           = "ami-0ff8a91507f77f867"
  instance_type = "t2.2xlarge"
}

t2.2xlarge is an invalid type!

Slide 28

Slide 28 text

$ tflint
Error: instance_type is not a valid value
  on main.tf line 3:
   3: instance_type = "t2.2xlarge"

Linters find this type of problem in advance

Slide 29

Slide 29 text

Static analysis 1. Compiler / parser / interpreter 2. Linter 3. Dry run 4. CDK testing

Slide 30

Slide 30 text

Dry run: partially execute the code to check for errors, without fully deploying it

Slide 31

Slide 31 text

Dry run

terraform:       terraform plan
cdk:             cdk diff, Snapshots
kubernetes:      kubectl apply -f <file> --server-dry-run
cloudformation:  Change Sets

Slide 32

Slide 32 text

Static analysis 1. Compiler / parser / interpreter 2. Linter 3. Dry run 4. CDK testing

Slide 33

Slide 33 text

Sample code for this talk is at: github.com/gruntwork-io/infrastructure-as-code-testing-talk

Slide 34

Slide 34 text

An example of a CDK app that you might want to test:

Slide 35

Slide 35 text

infrastructure-as-code-testing-talk
└ examples
└ modules
  └ cdk-app
    └ bin
    └ lib
      └ cdk-demo-stack.ts
    └ test

cdk-app: Create an S3 bucket

Slide 36

Slide 36 text

Slide 37

Slide 37 text

$ cdk deploy
DemoCdkStack: deploying...
DemoCdkStack: creating CloudFormation changeset...
0/3 | 1:04:13 PM | CREATE_IN_PROGRESS
… [snip] …
3/3 | 1:04:37 PM | CREATE_COMPLETE
✅ DemoCdkStack
Stack ARN: arn:aws:cloudformation:us-west-2:…[snip]…

When you run cdk deploy, it creates the CloudFormation stack and outputs the stack ARN

Slide 38

Slide 38 text

Let’s create a test for our CDK app

Slide 39

Slide 39 text

infrastructure-as-code-testing-talk
└ examples
└ modules
  └ cdk-app
    └ bin
    └ lib
    └ test
      └ cdk-demo.test.ts

Create cdk-demo.test.ts

Slide 40

Slide 40 text

test('Bucket Stack', () => {
  const app = new cdk.App();
  const stack = new CdkDemo.DemoCdkStack(app, 'DemoTestStack');

  // Assert that the synthesized template contains a versioned S3 bucket
  expectCDK(stack).to(haveResource('AWS::S3::Bucket', {
    VersioningConfiguration: {
      Status: "Enabled"
    },
  }));
});

The basic test structure

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Slide 44

Slide 44 text

$ npm run test
PASS test/cdk-demo.test.ts
  ✓ Bucket Stack (27ms)

Test Suites: 1 passed, 1 total
Tests:       1 passed, 1 total
Snapshots:   0 total
Time:        2.391s
Ran all test suites.

Run npm test. You now have a unit test you can run after every commit!

Slide 45

Slide 45 text

export class DemoCdkStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string) {
    super(scope, id);

    //new s3.Bucket(this, 'reInventDemoBucket', {
    //  versioned: true,
    //});
    new s3.Bucket(this, 'reInventDemoBucket');
  }
}

What if somebody makes a change to the bucket, like removing the versioning?

Slide 46

Slide 46 text

$ npm run test
FAIL test/cdk-demo.test.ts
  ✕ Bucket Stack (29ms)

None of 1 resources matches resource 'AWS::S3::Bucket' with properties
{
  "VersioningConfiguration": {
    "Status": "Enabled"
  }
}

All of the properties of the resource must match

Slide 47

Slide 47 text

*CDK says this is a unit test, but from an infra standpoint, it’s actually a static test

Slide 48

Slide 48 text

Agenda 1. Static analysis 2. Unit tests 3. Integration tests 4. End-to-end tests 5. Conclusion

Slide 49

Slide 49 text

Unit tests: test that an individual unit works in isolation

Slide 50

Slide 50 text

Unit tests 1. Unit testing basics 2. Example: Terraform unit tests 3. Cleaning up after tests

Slide 51

Slide 51 text

You can’t “unit test” an entire end-to-end architecture

Slide 52

Slide 52 text

Instead, break your infra code into small modules and unit test those!

[Diagram: the architecture split into a grid of individual modules]

Slide 53

Slide 53 text

With app code, you can test units in isolation from the outside world

Slide 54

Slide 54 text

resource "aws_lambda_function" "web_app" {
  function_name = var.name
  role          = aws_iam_role.lambda.arn
  # ...
}

resource "aws_api_gateway_integration" "proxy" {
  type = "AWS_PROXY"
  uri  = aws_lambda_function.web_app.invoke_arn
  # ...
}

But infrastructure code is all about talking to the outside world!

Slide 55

Slide 55 text

You can only truly test infra code by deploying to a real environment

Slide 56

Slide 56 text

Key takeaway: there’s no pure unit testing for infrastructure code.

Slide 57

Slide 57 text

Therefore, the test strategy is:
1. Deploy real infrastructure
2. Validate it works (e.g., via API calls, SSH commands, etc.)
3. Destroy the infrastructure

(So it’s really integration testing of a single unit!)

Slide 58

Slide 58 text

Tools that help with infrastructure “unit” testing:

Tool               Deploy/Destroy  Validate  Supported Technologies
Terratest          Yes             Yes       Terraform, k8s, packer, docker, OS, Cloud APIs
kitchen-terraform  Yes             Yes       Terraform
Inspec             No              Yes       OS, Cloud APIs
awspec             No              Yes       AWS API

Slide 59

Slide 59 text

Unit tests 1. Unit testing basics 2. Example: Terraform unit tests 3. Cleaning up after tests

Slide 60

Slide 60 text

An example of a Terraform module that you might want to test:

Slide 61

Slide 61 text

infrastructure-as-code-testing-talk
└ examples
  └ hello-world-app
    └ main.tf
    └ outputs.tf
    └ variables.tf
└ modules
└ test
└ README.md

hello-world-app: deploy a “Hello, World” web service

Slide 62

Slide 62 text

resource "aws_lambda_function" "web_app" {
  function_name = var.name
  role          = aws_iam_role.lambda.arn
  # ...
}

resource "aws_api_gateway_integration" "proxy" {
  type = "AWS_PROXY"
  uri  = aws_lambda_function.web_app.invoke_arn
  # ...
}

Under the hood, this example runs on top of AWS Lambda & API Gateway

Slide 63

Slide 63 text

$ terraform apply
Outputs:
url = ruvvwv3sh1.execute-api.us-east-2.amazonaws.com

$ curl ruvvwv3sh1.execute-api.us-east-2.amazonaws.com
Hello, World!

When you run terraform apply, it deploys and outputs the URL

Slide 64

Slide 64 text

Let’s write a unit test for hello-world-app with Terratest

Slide 65

Slide 65 text

infrastructure-as-code-testing-talk
└ examples
└ modules
└ test
  └ hello_world_app_test.go
└ README.md

Create hello_world_app_test.go

Slide 66

Slide 66 text

func TestHelloWorldAppUnit(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../examples/hello-world-app",
    }

    // Destroy runs at the end of the test, even if validation fails
    defer terraform.Destroy(t, terraformOptions)

    terraform.InitAndApply(t, terraformOptions)

    validate(t, terraformOptions)
}

The basic test structure

Slide 67

Slide 67 text

Slide 68

Slide 68 text

Slide 69

Slide 69 text

Slide 70

Slide 70 text

Slide 71

Slide 71 text

func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")

    http_helper.HttpGetWithRetry(t,
        url,             // URL to test
        200,             // Expected status code
        "Hello, World!", // Expected body
        10,              // Max retries
        3*time.Second,   // Time between retries
    )
}

The validate function

Slide 72

Slide 72 text

Slide 73

Slide 73 text

Slide 74

Slide 74 text

Slide 75

Slide 75 text

Slide 76

Slide 76 text

$ export AWS_ACCESS_KEY_ID=xxxx
$ export AWS_SECRET_ACCESS_KEY=xxxxx

To run the test, first authenticate to AWS

Slide 77

Slide 77 text

$ go test -v -timeout 15m -run TestHelloWorldAppUnit
…
--- PASS: TestHelloWorldAppUnit (31.57s)

Then run go test. You now have a unit test you can run after every commit!

Slide 78

Slide 78 text

Since we’re testing a web service, we use HTTP requests to validate it

Slide 79

Slide 79 text

Examples of other ways to validate:

Type            Example                               Validation     Helper
Web service     Containerized or serverless web app   HTTP requests  Terratest http_helper package
Server          EC2 instance                          SSH commands   Terratest ssh package
Cloud resource  SQS queue                             API calls      Terratest aws package
Database        MySQL                                 SQL queries    MySQL driver for Go

Slide 80

Slide 80 text

Unit tests 1. Unit testing basics 2. Example: Terraform unit tests 3. Cleaning up after tests

Slide 81

Slide 81 text

Tests create and destroy many resources!

Slide 82

Slide 82 text

Pro tip: run tests in isolated “sandbox” accounts

Slide 83

Slide 83 text

Tool            Features
cloud-nuke      Delete resources older than a certain date; in a certain region; of a certain type.
Janitor Monkey  Configurable rules of what to delete. Notify owners of pending deletions.
aws-nuke        Specify specific AWS accounts and resource types to target.

Pro tip #2: run these tools in cron jobs to clean up stale resources

Slide 84

Slide 84 text

Agenda 1. Static analysis 2. Unit tests 3. Integration tests 4. End-to-end tests 5. Conclusion

Slide 85

Slide 85 text

Integration tests: test that multiple “units” work together.

Slide 86

Slide 86 text

Integration tests 1. Example: Terraform integration tests 2. Test parallelism

Slide 87

Slide 87 text

Let’s say you have two Terraform modules you want to test together:

infrastructure-as-code-testing-talk
└ examples
  └ hello-world-app
  └ docker-kubernetes
  └ proxy-app
  └ web-service
└ modules
└ test
└ README.md

Slide 88

Slide 88 text

infrastructure-as-code-testing-talk
└ examples
  └ hello-world-app
  └ docker-kubernetes
  └ proxy-app
  └ web-service
└ modules
└ test
└ README.md

proxy-app: an app that acts as an HTTP proxy for other web services.

Slide 89

Slide 89 text

infrastructure-as-code-testing-talk
└ examples
  └ hello-world-app
  └ docker-kubernetes
  └ proxy-app
  └ web-service
└ modules
└ test
└ README.md

web-service: a web service that you want proxied.

Slide 90

Slide 90 text

variable "url_to_proxy" {
  description = "The URL to proxy."
  type        = string
}

proxy-app takes in the URL to proxy via an input variable

Slide 91

Slide 91 text

output "url" {
  value = module.web_service.url
}

web-service exposes its URL via an output variable

Slide 92

Slide 92 text

infrastructure-as-code-testing-talk
└ examples
└ modules
└ test
  └ hello_world_app_test.go
  └ proxy_app_test.go
└ README.md

Create proxy_app_test.go

Slide 93

Slide 93 text

func TestProxyApp(t *testing.T) {
    // 1. Deploy the web service
    webServiceOpts := configWebService(t)
    defer terraform.Destroy(t, webServiceOpts)
    terraform.InitAndApply(t, webServiceOpts)

    // 2. Deploy the proxy app, pointing it at the web service
    proxyAppOpts := configProxyApp(t, webServiceOpts)
    defer terraform.Destroy(t, proxyAppOpts)
    terraform.InitAndApply(t, proxyAppOpts)

    // 3. Validate the proxy app
    validate(t, proxyAppOpts)
}

The basic test structure
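The test above relies on Go running deferred calls in last-in, first-out order, so proxy-app is destroyed before the web-service it depends on. A small stdlib-only sketch makes that ordering visible; teardownOrder is a hypothetical function that records events as strings instead of calling Terraform.

```go
package main

import "fmt"

// teardownOrder mimics the shape of TestProxyApp: every "deploy" is paired
// with a deferred "destroy". It returns the events in the order they happen.
func teardownOrder() (events []string) {
	record := func(e string) { events = append(events, e) }

	defer record("destroy web-service")
	record("deploy web-service")

	defer record("destroy proxy-app")
	record("deploy proxy-app")

	record("validate proxy-app")
	return events
}

func main() {
	// Deferred calls run in reverse order of registration, so the
	// dependent module (proxy-app) is torn down first.
	for i, e := range teardownOrder() {
		fmt.Printf("%d. %s\n", i+1, e)
	}
}
```

Registering each destroy immediately before its apply keeps teardown in reverse dependency order without any extra bookkeeping.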

Slide 94

Slide 94 text

Slide 95

Slide 95 text

Slide 96

Slide 96 text

Slide 97

Slide 97 text

Slide 98

Slide 98 text

Slide 99

Slide 99 text

Slide 100

Slide 100 text

func configWebService(t *testing.T) *terraform.Options {
    return &terraform.Options{
        TerraformDir: "../examples/web-service",
    }
}

The configWebService method

Slide 101

Slide 101 text

func configProxyApp(t *testing.T, webServiceOpts *terraform.Options) *terraform.Options {
    // Read the web service's URL output and pass it to the proxy app as an input
    url := terraform.Output(t, webServiceOpts, "url")

    return &terraform.Options{
        TerraformDir: "../examples/proxy-app",
        Vars: map[string]interface{}{
            "url_to_proxy": url,
        },
    }
}

The configProxyApp method

Slide 102

Slide 102 text

Slide 103

Slide 103 text

Slide 104

Slide 104 text

func validate(t *testing.T, opts *terraform.Options) {
    url := terraform.Output(t, opts, "url")

    http_helper.HttpGetWithRetry(t,
        url,                        // URL to test
        200,                        // Expected status code
        `{"text":"Hello, World!"}`, // Expected body
        10,                         // Max retries
        3*time.Second,              // Time between retries
    )
}

The validate method

Slide 105

Slide 105 text

$ go test -v -timeout 15m -run TestProxyApp
…
--- PASS: TestProxyApp (182.44s)

Run go test. You’re now testing multiple modules together!

Slide 106

Slide 106 text

$ go test -v -timeout 15m -run TestProxyApp
…
--- PASS: TestProxyApp (182.44s)

But integration tests can take (many) minutes to run…

Slide 107

Slide 107 text

Integration tests 1. Example: Terraform integration tests 2. Test parallelism

Slide 108

Slide 108 text

Infrastructure tests can take a long time to run

Slide 109

Slide 109 text

One way to save time: run tests in parallel

Slide 110

Slide 110 text

func TestProxyApp(t *testing.T) {
    t.Parallel()
    // The rest of the test code
}

func TestHelloWorldAppUnit(t *testing.T) {
    t.Parallel()
    // The rest of the test code
}

Enable test parallelism in Go by adding t.Parallel() as the first line of each test.

Slide 111

Slide 111 text

$ go test -v -timeout 15m
=== RUN TestHelloWorldAppUnit
=== RUN TestProxyApp

Now, if you run go test, all the tests with t.Parallel() will run in parallel

Slide 112

Slide 112 text

But there’s a gotcha: resource conflicts

Slide 113

Slide 113 text

resource "aws_iam_role" "role_example" {
  name = "example-iam-role"
}

resource "aws_security_group" "sg_example" {
  name = "security-group-example"
}

Example: module with hard-coded IAM Role and Security Group names

Slide 114

Slide 114 text

resource "aws_iam_role" "role_example" {
  name = "example-iam-role"
}

resource "aws_security_group" "sg_example" {
  name = "security-group-example"
}

If two tests tried to deploy this module in parallel, the names would conflict!

Slide 115

Slide 115 text

Key takeaway: you must namespace all your resources

Slide 116

Slide 116 text

resource "aws_iam_role" "role_example" {
  name = var.name
}

resource "aws_security_group" "sg_example" {
  name = var.name
}

Example: use variables in all resource names…

Slide 117

Slide 117 text

uniqueId := random.UniqueId()

return &terraform.Options{
    TerraformDir: "../examples/proxy-app",
    Vars: map[string]interface{}{
        "name": "text-proxy-app-" + uniqueId,
    },
}

At test time, set the variables to a randomized value to avoid conflicts
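For readers who want to see what such a random suffix involves, here is a stdlib-only sketch. It is not Terratest's random.UniqueId implementation: uniqueID and namespaced are hypothetical helpers, and the 6-character length and alphanumeric alphabet are assumptions chosen for illustration.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// idChars is the assumed alphabet for generated suffixes.
const idChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// uniqueID returns n random alphanumeric characters. The modulo step has a
// slight bias toward early characters, which is acceptable for test names.
func uniqueID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	for i := range b {
		b[i] = idChars[int(b[i])%len(idChars)]
	}
	return string(b)
}

// namespaced appends a random suffix so that parallel test runs do not
// compete for the same resource name.
func namespaced(prefix string) string {
	return prefix + "-" + uniqueID(6)
}

func main() {
	// Two concurrent tests would get two different names.
	fmt.Println(namespaced("proxy-app"))
	fmt.Println(namespaced("proxy-app"))
}
```

A short random suffix is usually enough here because it only has to be unique among the test runs active in the same account at the same time.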

Slide 118

Slide 118 text

Agenda 1. Static analysis 2. Unit tests 3. Integration tests 4. End-to-end tests 5. Conclusion

Slide 119

Slide 119 text

End-to-end tests: test that your entire infrastructure works together.

Slide 120

Slide 120 text

How do you test this entire thing?

Slide 121

Slide 121 text

You could use the same strategy…
1. Deploy all the infrastructure
2. Validate it works (e.g., via API calls, SSH commands, etc.)
3. Destroy all the infrastructure

Slide 122

Slide 122 text

But it’s rare to write end-to-end tests this way. Here’s why:

Slide 123

Slide 123 text

Test pyramid (top to bottom):
e2e Tests
Integration Tests
Unit Tests
Static analysis

Slide 124

Slide 124 text

Test pyramid (top to bottom):
e2e Tests
Integration Tests
Unit Tests
Static analysis

Cost, brittleness, and run time increase toward the top.

Slide 125

Slide 125 text

e2e Tests:          60 – 240+ minutes
Integration Tests:  5 – 60 minutes
Unit Tests:         1 – 20 minutes
Static analysis:    1 – 60 seconds

Slide 126

Slide 126 text

E2E tests are too slow to be useful

e2e Tests:          60 – 240+ minutes
Integration Tests:  5 – 60 minutes
Unit Tests:         1 – 20 minutes
Static analysis:    1 – 60 seconds

Slide 127

Slide 127 text

Another problem with E2E tests: brittleness

Slide 128

Slide 128 text

Let’s do some math:

Slide 129

Slide 129 text

Assume a single resource (e.g., EC2 instance) has a 1/1000 (0.1%) chance of failure.

Slide 130

Slide 130 text

The more resources your tests deploy, the flakier they will be.

Type of test  # of resources  Chance of failure
Unit tests    1-10            ~1%
Integration   11-100          ~10%
End-to-end    400-500         ~40-50%

Slide 131

Slide 131 text

You can work around the failure rate for unit & integration tests with retries

Type of test  # of resources  Chance of failure
Unit tests    1-10            ~1%
Integration   11-100          ~10%
End-to-end    400-500         ~40-50%

Slide 132

Slide 132 text

Key takeaway: E2E tests from scratch are too slow and too brittle to be useful

Slide 133

Slide 133 text

Instead, you can do incremental E2E testing!

Slide 134

Slide 134 text

1. Deploy a persistent test environment and leave it running.

[Diagram: a grid of modules, all deployed]

Slide 135

Slide 135 text

2. Each time you update a module, deploy & validate just that module

[Diagram: the same grid of modules, with only the updated module redeployed]

Slide 136

Slide 136 text

Agenda 1. Static analysis 2. Unit tests 3. Integration tests 4. End-to-end tests 5. Conclusion

Slide 137

Slide 137 text

Testing techniques compared:

Slide 138

Slide 138 text

Static analysis
  Strengths:
    1. Fast
    2. Stable
    3. No need to deploy real resources
    4. Easy to use
  Weaknesses:
    1. Very limited in errors you can catch
    2. You don’t get much confidence in your code solely from static analysis

Unit tests
  Strengths:
    1. Fast enough (1 – 10 min)
    2. Mostly stable (with retry logic)
    3. High level of confidence in individual units
  Weaknesses:
    1. Need to deploy real resources
    2. Requires writing non-trivial code

Integration tests
  Strengths:
    1. Mostly stable (with retry logic)
    2. High level of confidence in multiple units working together
  Weaknesses:
    1. Need to deploy real resources
    2. Requires writing non-trivial code
    3. Slow (10 – 30 min)

End-to-end tests
  Strengths:
    1. Build confidence in your entire architecture
  Weaknesses:
    1. Need to deploy real resources
    2. Requires writing non-trivial code
    3. Very slow (60 min – 240+ min)*
    4. Can be brittle (even with retry logic)*

Slide 139

Slide 139 text

So which should you use?

Slide 140

Slide 140 text

All of them! They all catch different types of bugs.

Slide 141

Slide 141 text

Bear in mind the test pyramid (top to bottom):
e2e Tests
Integration Tests
Unit Tests
Static analysis

Slide 142

Slide 142 text

Lots of unit tests + static analysis

[Test pyramid, bottom layers highlighted: Unit Tests, Static analysis]

Slide 143

Slide 143 text

Fewer integration tests

[Test pyramid, middle layer highlighted: Integration Tests]

Slide 144

Slide 144 text

A handful of high-value e2e tests

[Test pyramid, top layer highlighted: e2e Tests]