Slide 1

Slide 1 text

StormForger Proudly Presents: The Life of A Load Generator • Sebastian Cohnen, Stephan Zeissler • 2020

Slide 2

Slide 2 text

• Sebastian Cohnen (@tisba), Founder & CTO, StormForger • Stephan Zeissler (@zeisss), Software Engineer, StormForger

Slide 3

Slide 3 text

StormForger.com • Performance Testing as a Service (SaaS) • Founded in 2014, based in Cologne • Fully managed, scalable platform to run perf tests • Available from 17 AWS regions

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

StormForger Proudly Presents: The Life of A Load Generator • Sebastian Cohnen, Stephan Zeissler • 2020

Slide 6

Slide 6 text

General Design Goals • Keep things simple: monolithic application + load generators. • Always default to the safe option and terminate running tests. • EC2 VMs are ephemeral. • Load generators are dynamically configured and have very little prior knowledge of the world. • Things fail, so retry where possible and use timeouts.
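
The last goal (retry where possible, use timeouts) could look roughly like the sketch below. It is only an illustration, not the talk's implementation: the function name, the exponential backoff and all default values are assumptions.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.5, deadline=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Run operation() until it succeeds, attempts run out, or the deadline passes.

    All parameter names and defaults are illustrative assumptions.
    """
    started = clock()
    last_error = None
    for attempt in range(attempts):
        # Overall timeout: stop retrying once the deadline has passed.
        if clock() - started > deadline:
            break
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower error types
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    raise RuntimeError("giving up after retries or deadline") from last_error
```

Passing `sleep` and `clock` in as parameters keeps the helper easy to test without real waiting.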

Slide 7

Slide 7 text

Setup • Ruby on Rails-based Command & Control application (aka “the forge”) • Background worker operations are designed to be idempotent and retryable • Custom load generator written in Go • A load generator cluster consists of a controller and generators • Infrastructure is fully managed, running on EC2 on-demand instances

Slide 8

Slide 8 text

Server Life Cycle • Launch • Busy • Idle • Termination

Slide 9

Slide 9 text

Launch • Instances are managed by the forge via the AWS SDK for Ruby • The AWS region is selected by the customer • AMI ID, instance type and engine version are determined • Instances are launched in batches • Launch typically takes ~20–30 seconds (from API call to ready)
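
Launching in batches might be sketched as follows. The forge itself is a Ruby application, so this Python version is purely illustrative; `launch_batch` is a hypothetical callback standing in for the actual cloud API call, and the batch size is an assumption.

```python
def batches(total, batch_size):
    """Split `total` instances into request sizes of at most `batch_size`."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    sizes = []
    remaining = total
    while remaining > 0:
        size = min(batch_size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def launch_in_batches(total, batch_size, launch_batch):
    """Call launch_batch(count) once per batch and collect the instance IDs.

    launch_batch is a hypothetical callback that returns a list of IDs.
    """
    instance_ids = []
    for size in batches(total, batch_size):
        instance_ids.extend(launch_batch(size))
    return instance_ids
```

For example, `batches(25, 10)` yields three requests of 10, 10 and 5 instances.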

Slide 10

Slide 10 text

Launch (cont.) • cloud-init performs the final provisioning steps: write environment-specific configuration; download our engine (pre-signed S3 URL); update the internal X509 root certificate (CA); download the AWS EC2 Instance Identity Documents; explicitly wait for the clock to sync

Slide 11

Slide 11 text

Authentication • Each instance initially authenticates itself to our API with its AWS Instance Identity Document (provided by the EC2 metadata API) and a Certificate Signing Request (CSR) • The identity of the instance is verified against AWS public certificates • A short-lived X509 certificate is generated and passed back to the instance for mutually authenticated communication
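
On the instance side, assembling that first authentication request might look like the sketch below. The two metadata paths are the standard EC2 instance identity endpoints; everything else (the function name, the payload field names, the injected `fetch_metadata` callable, and the CSR being generated elsewhere and passed in as PEM text) is an assumption for illustration.

```python
import json

# Standard EC2 instance metadata paths for the identity document and its
# PKCS7 signature (served from the link-local metadata endpoint).
IDENTITY_DOCUMENT_PATH = "/latest/dynamic/instance-identity/document"
IDENTITY_SIGNATURE_PATH = "/latest/dynamic/instance-identity/pkcs7"


def build_auth_request(fetch_metadata, csr_pem):
    """Assemble the initial authentication payload for one instance.

    fetch_metadata(path) -> str is injected so the sketch runs without network
    access. The payload field names are hypothetical, not StormForger's API.
    """
    document = fetch_metadata(IDENTITY_DOCUMENT_PATH)
    signature = fetch_metadata(IDENTITY_SIGNATURE_PATH)
    instance_id = json.loads(document)["instanceId"]
    return {
        "instance_id": instance_id,
        "identity_document": document,    # verified server-side against the signature
        "identity_signature": signature,
        "csr": csr_pem,                   # signed server-side into a short-lived cert
    }
```

The server would verify the signature over the document before signing the CSR; that verification is not shown here.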

Slide 12

Slide 12 text

Running a Test • Assign available instances to the test • Deploy the test configuration and required test data, then start the test • Monitor the running test and pre-process metrics • After completion, fetch logs and measurements, then do post-processing and analysis • Release the instances

Slide 13

Slide 13 text

Termination • After a certain idle time, instances are terminated. • Terminating instances are tracked until full termination is confirmed.

Slide 14

Slide 14 text

Instance Pooling • Instances are not terminated immediately after use. • Minimize “Time to Test” by keeping recently used resources around. • At all times: make sure that instances are used exclusively! • Instances have a capped maximum life span (ephemeral design).
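
The pooling rules above (exclusive use, idle timeout, maximum life span) can be sketched as a small in-memory model. The class, its method names and the all-or-nothing assignment are assumptions for illustration; the real forge presumably tracks this state in its database.

```python
class InstancePool:
    """Toy model of an instance pool; times are plain numbers (e.g. seconds)."""

    def __init__(self, idle_timeout, max_age):
        self.idle_timeout = idle_timeout
        self.max_age = max_age
        self._instances = {}  # instance_id -> state dict

    def add(self, instance_id, now):
        self._instances[instance_id] = {
            "launched_at": now, "idle_since": now, "test_id": None}

    def _expired(self, state, now):
        too_old = now - state["launched_at"] >= self.max_age
        idle_too_long = (state["idle_since"] is not None
                         and now - state["idle_since"] >= self.idle_timeout)
        return too_old or idle_too_long

    def acquire(self, test_id, count, now):
        """Assign `count` idle instances to exactly one test, or none at all."""
        idle = [i for i, s in self._instances.items()
                if s["test_id"] is None and not self._expired(s, now)]
        if len(idle) < count:
            return []
        chosen = idle[:count]
        for instance_id in chosen:
            self._instances[instance_id]["test_id"] = test_id
            self._instances[instance_id]["idle_since"] = None
        return chosen

    def release(self, instance_id, now):
        state = self._instances[instance_id]
        state["test_id"] = None
        state["idle_since"] = now

    def reap(self, now):
        """Remove and return idle instances that are due for termination."""
        doomed = [i for i, s in self._instances.items()
                  if s["test_id"] is None and self._expired(s, now)]
        for instance_id in doomed:
            del self._instances[instance_id]
        return doomed
```

Busy instances are never reaped in this sketch, even past `max_age`; they become eligible once released.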

Slide 15

Slide 15 text

“Failures are a given and everything will eventually fail[…]” – Werner Vogels • Source: https://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

Slide 16

Slide 16 text

Robustness • Heartbeats • Test runs: push state changes to the API and poll state periodically • Retries and timeouts (for requests and state machines) • Dead man's switches
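
A dead man's switch driven by heartbeats could be sketched like this. It is a minimal illustration under assumptions: the class name, the timeout value and the `on_expire` callback (which might stop the test or terminate the instance) are all hypothetical.

```python
class DeadMansSwitch:
    """Fires on_expire once if no heartbeat arrives within `timeout`."""

    def __init__(self, timeout, on_expire, started_at):
        self.timeout = timeout
        self.on_expire = on_expire
        self.last_heartbeat = started_at
        self.fired = False

    def heartbeat(self, now):
        # Heartbeats after the switch has fired are ignored: the safe
        # default is to stay terminated rather than to resume.
        if not self.fired:
            self.last_heartbeat = now

    def check(self, now):
        """Call periodically; returns True once the switch has fired."""
        if not self.fired and now - self.last_heartbeat > self.timeout:
            self.fired = True
            self.on_expire()
        return self.fired
```

Calling `check` from a periodic timer means a hung or partitioned peer is detected even when no further messages arrive.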

Slide 17

Slide 17 text

Break Glass Procedure • Software issues…? • Network issues…? • VM issues…? • Terminate the EC2 instance

Slide 18

Slide 18 text

Tooling

Slide 19

Slide 19 text

AMI Provisioning • Packer is used for provisioning and to copy AMIs to all regions • Packer uses Ansible to remove unused configs and packages, disable unattended updates, remove the AWS SSM Agent, etc., and to apply network and kernel optimisations

Slide 20

Slide 20 text

Other Tooling • Base AMI Update Checker • Orphaned Server Detection • Exception Tracking (Rollbar) • Integration Testing (automated tests for new AMI and engine releases)

Slide 21

Slide 21 text

Future Work

Slide 22

Slide 22 text

Future Work • Close the gap to fully automated AMI updates and assessments • Cost optimisation using EC2 Spot instances with defined durations • Support dedicated IPs for customers • Dedicated customer pools & self-hosted setups • Improve troubleshooting for cases where we have to auto-terminate instances

Slide 23

Slide 23 text

The END • Find the slides at stormforger.com/presentations/2020/german-aws-community-day-2020/ • @tisba @zeisss @StormForgerApp