Slide 1

Dynamically scaling a news & activism hub
(scaling out up to 5x the write-traffic in 20 minutes)
Susan Potter
April 26, 2019

Slide 2

Outline
• Intro
• Problem Outline
• Before & After
• AWS EC2 AutoScaling: An overview
• Related Side Notes
• Questions?

Slide 3

Intro

Slide 4

whoami

$ finger $(whoami)
Name: Susan Potter
Last login Sun Jan 18 18:30 1996 (GMT) on tty1

- 23 years writing software
- Server-side/backend/infrastructure engineering, mostly
- Likes: functional programming (e.g. Haskell)

Today:
- Build new backend services in Haskell
- I babysit a bloated Rails webapp

Previously: trading systems, SaaS products, CI/CD

Slide 5

In the cloud

Figure 1: Programming cloud infrastructures from the soy bean fields

Slide 6

Problem Outline

Slide 7

Traffic

Slides 8–12

Legacy/History
• Deliver news, discussions, & campaigns to over two million users/day
• Traffic varies significantly during the day
• Heavy reads (Varnish saves our site every day)
• Writes go to content publishing backends, which are slow and expensive (Perl, Ruby)
• When news breaks or the newsletter is sent, our active users want to log in, comment, recommend, write their own stories, etc., all of which are WRITES.

Slide 13

Related Problems
• Our deployment method (Capistrano) has horrific failure modes during scale-out/in events
• Chef converged less and less, and the work to maintain it kept increasing
• Moving to dynamic autoscaling didn't fix these directly, but our solution considered how it could help.

Slide 14

Making me …

Slide 15

Before & After

Slides 16–18

Before: When I started (Sept 2016)
• Only one problematic service was in a static autoscaling group (no scaling policies, manually modified by a human :gasp:, hence "static")
• Services used atrophying AMIs that might not converge because external APT source dependencies changed in significant ways :(
• AMIs often didn't successfully bootstrap within 15 minutes

Slides 19–21

Today: All services in dynamic autoscaling groups
• Frontend caching/routing layer
• Both content publishing backends
• Internal systems, e.g. logging, metrics, etc.

Slide 22

AWS EC2 AutoScaling: An overview

Slides 24–27

High-level Primitives
• AutoScaling Group
• Launch Configuration
• Scaling Policies
• Lifecycle Hooks

Slide 28

AutoScaling Lifecycle

Figure 2: Transition between instance states in the Amazon EC2 AutoScaling lifecycle

Slide 29

AutoScaling Group: Properties
• Min, max, desired
• Launch configuration (exactly one pointer)
• Health check type (EC2/ELB)
• AZs
• Timeouts
• Scaling policies (zero or more)
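
The capacity properties can be adjusted after creation; a minimal sketch (mine, not from the deck) of nudging the bounds on a live group, reusing the ${asg_name} variable from the other snippets:

# Adjust capacity bounds on an existing group; AWS enforces
# min <= desired <= max, so keep the three consistent.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "${asg_name}" \
  --min-size 2 \
  --max-size 12 \
  --desired-capacity 4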

Slide 30

AutoScaling Group: Create via CLI

declare -r rid="ResourceId=${asg_name}"
declare -r rtype="ResourceType=auto-scaling-group"

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name "${asg_name}" \
  --launch-configuration-name "${lc_name}" \
  --min-size ${min_size:-1} \
  --max-size ${max_size:-9} \
  --default-cooldown ${cooldown:-120} \
  --availability-zones ${availability_zones} \
  --health-check-type "${health_check_type:-ELB}" \
  --health-check-grace-period "${grace_period:-90}" \
  --vpc-zone-identifier "${subnet_ids}" \
  --tags "${rid},${rtype},Key=LifeCycle,Value=alive,PropagateAtLaunch=false"

Slide 31

Autoscaling Group: Enable metrics collection

# After creation
aws autoscaling enable-metrics-collection \
  --auto-scaling-group-name "${asg_name}" \
  --granularity "1Minute"

Slide 32

Autoscaling Group: Querying instance IDs in ASG

aws autoscaling describe-auto-scaling-groups \
  --output text \
  --region "${region}" \
  --auto-scaling-group-names "${asg_name}" \
  --query 'AutoScalingGroups[].Instances[].InstanceId'

Slide 33

Launch Configuration: Properties
• AMI
• Instance type
• User-data
• Instance tags
• Security groups
• Block device mappings
• IAM instance profiles

Note: immutable after creation (a replacement sketch follows)
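
A consequence of that immutability (my sketch, not on the slides): changing anything means creating a fresh launch configuration and repointing the group at it. The ${new_lc_name} and ${old_lc_name} variables here are hypothetical:

# Point the ASG at a newly created launch configuration.
# Existing instances keep running with the old config; only
# instances launched afterwards pick up the new one.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "${asg_name}" \
  --launch-configuration-name "${new_lc_name}"

# Once nothing references the old config, delete it.
aws autoscaling delete-launch-configuration \
  --launch-configuration-name "${old_lc_name}"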

Slide 34

Launch Configuration: Create via CLI

declare -r bdev="DeviceName=/dev/sda1"
declare -r vtype="VolumeType=gp2"
declare -r term="DeleteOnTermination=true"

aws autoscaling create-launch-configuration \
  --launch-configuration-name "${lc_name}" \
  --image-id "${image_id}" \
  --iam-instance-profile "${lc_name}-profile" \
  --security-groups ${security_groups} \
  --instance-type ${instance} \
  --block-device-mappings \
    "${bdev},Ebs={${term},${vtype},VolumeSize=${disk_size}}"

Slides 35–38

Scaling Policies: Properties
• Policy name
• Metric type
• Adjustment type
• Scaling adjustment

Slide 39

Scaling Policies: Create via CLI

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name "${asg_name}" \
  --policy-name "${scaling_policy_name}" \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1
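
The next slide's alarm needs ${policy_arn}; one way to capture it (a sketch, not shown in the deck) is to take the PolicyARN field that put-scaling-policy prints:

# Same call as above, capturing the new policy's ARN for the alarm.
policy_arn="$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name "${asg_name}" \
  --policy-name "${scaling_policy_name}" \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1 \
  --output text \
  --query 'PolicyARN')"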

Slide 40

Scaling Policies: Attach Metric Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name Step-Scaling-AlarmHigh-AddCapacity \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 120 \
  --evaluation-periods 2 \
  --threshold 60 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \
  --alarm-actions "${policy_arn}"

Slide 41

Custom Metrics: Report metric data

aws cloudwatch put-metric-data \
  --metric-name custom-metric-name \
  --namespace MyOrg/Custom \
  --unit Count \
  --value ${value} \
  --storage-resolution 1 \
  --dimensions "AutoScalingGroupName=${asg_name}"

Slide 42

Lifecycle Hooks

We don't use these, but they are for adding hooks that provision software on newly launched instances and similar actions.
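
For completeness, registering one would look roughly like this (a sketch; we don't run this, and the hook name is made up):

# Pause newly launched instances in Pending:Wait so an external
# process can provision them; CONTINUE into service on completion
# or when the heartbeat timeout expires.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name "provision-on-launch" \
  --auto-scaling-group-name "${asg_name}" \
  --lifecycle-transition "autoscaling:EC2_INSTANCE_LAUNCHING" \
  --heartbeat-timeout 300 \
  --default-result CONTINUE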

Slide 43

Related Side Notes

Slides 45–47

EC2 Instance Bootstrapping
• Chef-converge bootstrapping took ~15 minutes
• Improved bootstrapping by an order of magnitude with fully baked AMIs
• Now we fully bake AMIs for each config and app change (5 minutes, once per release per environment, a constant factor, using NixOS)

Fully baking AMIs also gives us system reproducibility that convergent configuration systems like Chef couldn't give us.

Slides 48–50

Right-Size Instance Types per Service
• We used to use whatever instance type was already set, because $REASONS
• Now we inspect each service's resource usage in production at peak, typical, and overnight resting states to know how to size a service's cluster (one way to pull that data is sketched below)
• Recommended practice post-ASG, or you are dropping $$$ in AWS's lap and potentially hurting your product's UX
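
One way to pull that usage data (my example, not from the deck; choose a window that covers peak and resting hours):

# Hourly average and max CPU for one ASG over a day;
# repeat per service and per metric you care about.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \
  --statistics Average Maximum \
  --period 3600 \
  --start-time 2019-04-25T00:00:00Z \
  --end-time 2019-04-26T00:00:00Z \
  --output table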

Slides 51–53

Find Leading Indicator Metric for Dynamic Scale Out/In
• Every service behaves differently under load
• We initially scaled dynamically using policies based purely on CPU (a start, but not good enough for us)
• Now we report custom metrics to AWS CloudWatch that are leading indicators that our cluster needs to scale out or in.

Leads to more predictable performance on the site even under traffic spikes.
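
Combined with the put-metric-data slide earlier, a reporter can be a small looping script (a sketch; the metric name and queue-depth measurement are hypothetical stand-ins for your own leading indicator, and ${asg_name} is assumed to be exported):

#!/usr/bin/env bash
# Hypothetical: report the app's write-queue depth every 10s as a
# high-resolution custom metric for scaling policies to act on.
set -euo pipefail
while sleep 10; do
  depth="$(wc -l < /var/spool/myapp/write-queue)"  # stand-in measurement
  aws cloudwatch put-metric-data \
    --namespace MyOrg/Custom \
    --metric-name WriteQueueDepth \
    --unit Count \
    --value "${depth}" \
    --storage-resolution 1 \
    --dimensions "AutoScalingGroupName=${asg_name}"
done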

Slides 54–63

Fail-Safe Semantics for Deploy
• AMI artifacts built and tested
• AMIs for each service uploaded and registered with AWS EC2
• Brand new ASG + LC created referring to new AMI for release
• Scaling policies from current/live ASG copied over to new ASG
• Copy over min, max, and desired capacities from current to new
• Wait for all desired instances to report app-level healthy
• Add new ASG to ALB alongside the current/old ASG (see the sketch after this list)
• Remove current/old ASG from ALB
• Set min=desired=0 in old ASG
• Clean up stale ASG (not the old one, but older)
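
The ALB swap and drain steps in that list map onto a few CLI calls (a condensed sketch assuming ALB target groups; ${tg_arn}, ${new_asg_name}, and ${old_asg_name} are mine, not from the deck):

# Put the new ASG behind the ALB's target group alongside the old one.
aws autoscaling attach-load-balancer-target-groups \
  --auto-scaling-group-name "${new_asg_name}" \
  --target-group-arns "${tg_arn}"

# Once the new instances report healthy, pull the old ASG out...
aws autoscaling detach-load-balancer-target-groups \
  --auto-scaling-group-name "${old_asg_name}" \
  --target-group-arns "${tg_arn}"

# ...and drain it: min=desired=0 (max left intact for fast rollback).
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "${old_asg_name}" \
  --min-size 0 --desired-capacity 0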

Slides 64–67

Other stuff
• DONE Script rollback (~1 minute to previous version; sketched below)
• TODO Implement canary deploy capability
• TODO Check error rates and/or latencies haven't increased before removing old ASG from ALB
• REMINDER your max capacity should be determined by your backend runtime dependencies (it's transitive)
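
Rollback is fast because the previous release's ASG still exists, merely drained; reversing the swap is roughly (a sketch under the same assumptions as the deploy snippet above):

# Scale the previous release's ASG back up...
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "${old_asg_name}" \
  --min-size "${min_size}" --desired-capacity "${desired}"

# ...re-attach it to the ALB, then detach the bad release.
aws autoscaling attach-load-balancer-target-groups \
  --auto-scaling-group-name "${old_asg_name}" \
  --target-group-arns "${tg_arn}"
aws autoscaling detach-load-balancer-target-groups \
  --auto-scaling-group-name "${new_asg_name}" \
  --target-group-arns "${tg_arn}"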

Slide 68

Questions?

Slide 69

LinkedIn /in/susanpotter
GitHub @mbbx6spp
Keybase @mbbx6spp
Twitter @SusanPotter