Dynamically scaling a political news and activism hub (up to 5x the traffic in 20 minutes)

On any given day our website can see traffic peaks of up to five times our base traffic, sometimes requiring us to double our backend app server capacity within a 10-20 minute window, often at unpredictable times. In this talk, Susan Potter will discuss the use of autoscaling in EC2, from the essential components to some gotchas learned along the way.

Susan Potter

April 26, 2019

Transcript

  1. Dynamically scaling a news & activism hub (scaling out up

    to 5x the write-traffic in 20 minutes) Susan Potter April 26, 2019
  2. Outline Intro Problem Outline Before & After AWS EC2 AutoScaling:

    An overview Related Side Notes Questions? 1
  3. whoami $ finger $(whoami) Name: Susan Potter Last login Sun

    Jan 18 18:30 1996 (GMT) on tty1 - 23 years writing software - Server-side/backend/infrastructure engineering, mostly - Likes: functional programming (e.g. Haskell) Today: - Build new backend services in Haskell - I babysit a bloated Rails webapp Previously: trading systems, SaaS products, CI/CD 2
  4. Legacy/History • Deliver news, discussions, & campaigns to over two

    million users/day • Traffic varies significantly during the day • Heavy reads (varnish saves our site every day) • Writes go to content publishing backends, which are slow and expensive (Perl, Ruby) • When news breaks or the newsletter is sent our active users want to login, comment, recommend, write their own story, etc, which are WRITES. 5
  9. Related Problems • Deployment method (Capistrano) has horrific failure modes

    during scale out/in events • Chef converged less and less, and the work to maintain it kept increasing • Moving to dynamic autoscaling didn’t fix these directly, but our solution considered how it could help. 6
  10. Before: When I started (Sept 2016) • Only one problematic

    service was in a static autoscaling group (no scaling policies, manually modified by a human :gasp:, "static") • Services used atrophying AMIs that may not converge due to external APT source dependencies changing in significant ways :( • Often AMIs didn’t successfully bootstrap within 15 minutes 8
  13. Today: All services in dynamic autoscaling groups • Frontend caching/routing

    layer • Both content publishing backends • Internal systems, e.g. logging, metrics, etc. 9
  17. AutoScaling Group: Properties • Min, max, desired • Launch configuration

    (exactly one pointer) • Health check type (EC2/ELB) • AZs • Timeouts • Scaling policies (zero or more) 12
  18. AutoScaling Group: Create via CLI declare -r rid="ResourceId=${asg_name}" declare -r

    rtype="ResourceType=auto-scaling-group" aws autoscaling create-auto-scaling-group \ --auto-scaling-group-name "${asg_name}" \ --launch-configuration-name "${lc_name}" \ --min-size ${min_size:-1} \ --max-size ${max_size:-9} \ --default-cooldown ${cooldown:-120} \ --availability-zones ${availability_zones} \ --health-check-type "${health_check_type:-ELB}" \ --health-check-grace-period "${grace_period:-90}" \ --vpc-zone-identifier "${subnet_ids}" \ --tags \ "${rid},${rtype},Key=LifeCycle,Value=alive,PropagateAtLaunch=false" 13
  19. Autoscaling Group: Enable metrics collection # After creation aws autoscaling

    enable-metrics-collection \ --auto-scaling-group-name "${asg_name}" \ --granularity "1Minute" 14
  20. Autoscaling Group: Querying instance IDs in ASG aws autoscaling describe-auto-scaling-groups

    \ --output text \ --region "${region}" \ --auto-scaling-group-names "${asg_name}" \ --query 'AutoScalingGroups[].Instances[].InstanceId' 15
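
A usage sketch (not from the slides): the instance IDs returned by that query can be fed into ec2 describe-instances, for example to resolve the ASG's members to private IPs for an ad-hoc inventory. The variable names below are illustrative.

    # Collect the ASG's current instance IDs (as on the previous slide).
    declare -r instance_ids=$(aws autoscaling describe-auto-scaling-groups \
      --output text \
      --region "${region}" \
      --auto-scaling-group-names "${asg_name}" \
      --query 'AutoScalingGroups[].Instances[].InstanceId')
    # Resolve those instances to their private IP addresses.
    aws ec2 describe-instances \
      --output text \
      --region "${region}" \
      --instance-ids ${instance_ids} \
      --query 'Reservations[].Instances[].PrivateIpAddress'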
  21. Launch Configuration: Properties • AMI • Instance type • User-data

    • Instance tags • Security groups • Block device mappings • IAM instance profiles Note: immutable after creation 16
  22. Launch Configuration: Create via CLI declare -r bdev="DeviceName=/dev/sda1" declare -r

    vtype="VolumeType=gp2" declare -r term="DeleteOnTermination=true" aws autoscaling create-launch-configuration \ --launch-configuration-name "${lc_name}" \ --image-id "${image_id}" \ --iam-instance-profile "${lc_name}-profile" \ --security-groups ${security_groups} \ --instance-type ${instance} \ --block-device-mappings \ "${bdev},Ebs={${term},${vtype},VolumeSize=${disk_size}}" 17
  23. Scaling Policies: Properties • Policy name • Metric type •

    Adjustment type • Scaling adjustment 18
  27. Scaling Policies: Create via CLI aws autoscaling put-scaling-policy \ --auto-scaling-group-name

    "${asg_name}" \ --policy-name "${scaling_policy_name}" \ --adjustment-type ChangeInCapacity \ --scaling-adjustment 1 19
  28. Scaling Policies: Attach Metric Alarm aws cloudwatch put-metric-alarm \ --alarm-name

    Step-Scaling-AlarmHigh-AddCapacity \ --metric-name CPUUtilization \ --namespace AWS/EC2 \ --statistic Average \ --period 120 \ --evaluation-periods 2 \ --threshold 60 \ --comparison-operator GreaterThanOrEqualToThreshold \ --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \ --alarm-actions "${policy_arn}" 20
  29. Custom Metrics: Report metric data aws cloudwatch put-metric-data \ --metric-name

    custom-metric-name \ --namespace MyOrg/Custom \ --unit Count \ --value ${value} \ --storage-resolution 1 \ --dimensions "AutoScalingGroupName=${asg_name}" 21
  30. Lifecycle Hooks: Properties We don’t use these, but lifecycle

    hooks can be used to provision software on newly launched instances and run similar actions. 22
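
Since the deck doesn't use lifecycle hooks, the following is only an illustrative sketch of what one looks like; the hook name, timeout, and default result are assumptions, not our configuration.

    # Hold newly launched instances in Pending:Wait until provisioning
    # completes (or the heartbeat timeout expires).
    aws autoscaling put-lifecycle-hook \
      --lifecycle-hook-name "provision-on-launch" \
      --auto-scaling-group-name "${asg_name}" \
      --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
      --heartbeat-timeout 300 \
      --default-result ABANDON

The instance (or a provisioning worker) then signals completion with aws autoscaling complete-lifecycle-action so the launch can proceed.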
  32. EC2 Instance Bootstrapping • Chef converge bootstrapping took ~15 minutes •

    Improved bootstrapping by an order of magnitude with fully baked AMIs • Now we fully bake AMIs for each config and app change (5 mins, one time per release per environment, a constant factor, using NixOS). Fully baking AMIs also gives us system reproducibility that convergent configuration systems like Chef couldn’t give us. 23
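
As a rough sketch of the "upload and register" step (the NixOS image-building pipeline itself isn't shown in the deck; the snapshot ID and naming scheme below are placeholders):

    # Register a fully baked image from an uploaded EBS snapshot.
    aws ec2 register-image \
      --name "backend-app-${release_version}" \
      --architecture x86_64 \
      --virtualization-type hvm \
      --ena-support \
      --root-device-name /dev/xvda \
      --block-device-mappings \
        "DeviceName=/dev/xvda,Ebs={SnapshotId=${snapshot_id},VolumeType=gp2,DeleteOnTermination=true}"

The returned ImageId is what a new launch configuration's --image-id points at.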
  35. Right-Size Instance Types per Service • We used to use

    whatever instance type was set before because $REASONS • Now we inspect each service’s resource usage in production at peak, typical, and overnight resting states to know how to size a service’s cluster. • Recommend this practice post-ASG or you are dropping $$$ in AWS’s lap and potentially hurting your product’s UX 24
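
One way to inspect a service's resource usage at peak versus resting states (a sketch, not necessarily the tooling we use) is to pull per-ASG CloudWatch statistics and compare time windows:

    # Hourly average and peak CPU for the ASG over the last day
    # (the -d flag assumes GNU date).
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \
      --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --period 3600 \
      --statistics Average Maximum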
  38. Find Leading Indicator Metric for Dynamic Scale Out/In • Every

    service behaves differently under load • We initially started dynamically scaling using policies based purely on CPU (a start, but not good enough for us) • Now we report custom metrics to AWS CloudWatch that are leading indicators that our cluster needs to scale out or in. This leads to more predictable performance on the site even under traffic spikes. 25
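
Tying the earlier slides together, a scale-out alarm on a custom leading-indicator metric looks like the CPUUtilization alarm on slide 28 but points at the custom namespace; the threshold, period, and alarm name here are illustrative, not our production values.

    # Alarm on the custom leading indicator instead of CPUUtilization.
    aws cloudwatch put-metric-alarm \
      --alarm-name Custom-Metric-AlarmHigh-AddCapacity \
      --namespace MyOrg/Custom \
      --metric-name custom-metric-name \
      --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \
      --statistic Average \
      --period 60 \
      --evaluation-periods 2 \
      --threshold "${scale_out_threshold}" \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-actions "${policy_arn}"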
  41. Fail-Safe Semantics for Deploy • AMI artifacts built and tested

    • AMIs for each service uploaded and registered with AWS EC2 • Brand new ASG + LC created referring to new AMI for release • Scaling policies from current/live ASG copied over to new ASG • Copy over min, max, and desired capacities from current to new • Wait for all desired instances to report app-level healthy • Add ASG to ALB with current/old ASG • Remove current/old ASG from ALB • Set min=desired=0 in old ASG • Clean up stale ASG (not old one, but older) 26
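
A minimal sketch of the ALB swap and scale-down steps from that list, assuming ALB target groups and plain aws CLI calls (the deck doesn't show the actual deploy script; variable names are illustrative):

    # Put the new release's ASG behind the same target group as the old one.
    aws autoscaling attach-load-balancer-target-groups \
      --auto-scaling-group-name "${new_asg_name}" \
      --target-group-arns "${target_group_arn}"
    # ... wait here for all desired instances to pass app-level health checks ...
    # Take the old ASG out of rotation, then drain it to zero.
    aws autoscaling detach-load-balancer-target-groups \
      --auto-scaling-group-name "${old_asg_name}" \
      --target-group-arns "${target_group_arn}"
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name "${old_asg_name}" \
      --min-size 0 \
      --desired-capacity 0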
  51. Other stuff • DONE Script rollback (~1 minute to previous

    version) • TODO Implement canary deploy capability • TODO Check error rates and/or latencies haven’t increased before removing old ASG from ALB • REMINDER your max capacity should be determined by your backend runtime dependencies (it’s transitive) 27
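
Rollback is then roughly the reverse of the swap above (a sketch only; it assumes the previous release's ASG still exists and can be scaled back up quickly thanks to the fully baked AMIs):

    # Bring the previous release's ASG back to capacity and swap it in.
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name "${old_asg_name}" \
      --min-size "${min_size}" \
      --desired-capacity "${desired_capacity}"
    # ... wait for app-level health, then swap the target group attachment ...
    aws autoscaling attach-load-balancer-target-groups \
      --auto-scaling-group-name "${old_asg_name}" \
      --target-group-arns "${target_group_arn}"
    aws autoscaling detach-load-balancer-target-groups \
      --auto-scaling-group-name "${new_asg_name}" \
      --target-group-arns "${target_group_arn}"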