Dynamically scaling a political news and activism hub (up to 5x the traffic in 20 minutes)

On any given day our website can see traffic peaks of up to five times our base traffic, sometimes requiring us to double our backend app server capacity within a 10-20 minute window, often at unpredictable times. In this talk, Susan Potter will discuss the use of autoscaling in EC2, from the essential components to some gotchas learned along the way.

Susan Potter

April 26, 2019

Transcript

  1. Dynamically scaling a news & activism hub (scaling out up

    to 5x the write-traffic in 20 minutes) Susan Potter April 26, 2019
  2. Outline Intro Problem Outline Before & After AWS EC2 AutoScaling:

    An overview Related Side Notes Questions? 1
  3. whoami $ finger $(whoami) Name: Susan Potter Last login Sun

    Jan 18 18:30 1996 (GMT) on tty1 - 23 years writing software - Server-side/backend/infrastructure engineering, mostly - Likes: functional programming (e.g. Haskell) Today: - Build new backend services in Haskell - I babysit a bloated Rails webapp Previously: trading systems, SaaS products, CI/CD 2
  4. Legacy/History • Deliver news, discussions, & campaigns to over two

    million users/day • Traffic varies significantly during the day • Heavy reads (varnish saves our site every day) • Writes go to content publishing backends, which are slow and expensive (Perl, Ruby) • When news breaks or the newsletter is sent our active users want to login, comment, recommend, write their own story, etc, which are WRITES. 5
  9. Related Problems • Deployment method (Capistrano) has horrific failure modes

    during scale out/in events • Chef converged less and less, and the work to maintain it kept increasing • Moving to dynamic autoscaling didn’t fix these directly, but our solution considered how it could help. 6
  10. Before: When I started (Sept 2016) • Only one problematic

    service was in a static autoscaling group (no scaling policies, manually modified by a human :gasp:, "static") • Services used atrophying AMIs that may not converge due to external APT source dependencies changing in significant ways :( • Often AMIs didn’t successfully bootstrap within 15 minutes 8
  13. Today: All services in dynamic autoscaling groups • Frontend caching/routing

    layer • Both content publishing backends • Internal systems, e.g. logging, metrics, etc. 9
  17. AutoScaling Group: Properties • Min, max, desired • Launch configuration

    (exactly one pointer) • Health check type (EC2/ELB) • AZs • Timeouts • Scaling policies (zero or more) 12
  18. AutoScaling Group: Create via CLI declare -r rid="ResourceId=${asg_name}" declare -r

    rtype="ResourceType=auto-scaling-group" aws autoscaling create-auto-scaling-group \ --auto-scaling-group-name "${asg_name}" \ --launch-configuration-name "${lc_name}" \ --min-size ${min_size:-1} \ --max-size ${max_size:-9} \ --default-cooldown ${cooldown:-120} \ --availability-zones ${availability_zones} \ --health-check-type "${health_check_type:-ELB}" \ --health-check-grace-period "${grace_period:-90}" \ --vpc-zone-identifier "${subnet_ids}" \ --tags \ "${rid},${rtype},Key=LifeCycle,Value=alive,PropagateAtLaunch=false" 13
  19. Autoscaling Group: Enable metrics collection # After creation aws autoscaling

    enable-metrics-collection \ --auto-scaling-group-name "${asg_name}" \ --granularity "1Minute" 14
  20. Autoscaling Group: Querying instance IDs in ASG aws autoscaling describe-auto-scaling-groups

    \ --output text \ --region "${region}" \ --auto-scaling-group-names "${asg_name}" \ --query 'AutoScalingGroups[].Instances[].InstanceId' 15
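
A usage sketch (not from the slides): the instance IDs returned by that query can be fed into ec2 describe-instances, for example to resolve the ASG's members to private IPs for an ad-hoc inventory. The variable names below are illustrative.

    # Collect the ASG's current instance IDs (as on the previous slide).
    declare -r instance_ids=$(aws autoscaling describe-auto-scaling-groups \
      --output text \
      --region "${region}" \
      --auto-scaling-group-names "${asg_name}" \
      --query 'AutoScalingGroups[].Instances[].InstanceId')
    # Resolve those instances to their private IP addresses.
    aws ec2 describe-instances \
      --output text \
      --region "${region}" \
      --instance-ids ${instance_ids} \
      --query 'Reservations[].Instances[].PrivateIpAddress'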
  21. Launch Configuration: Properties • AMI • Instance type • User-data

    • Instance tags • Security groups • Block device mappings • IAM instance profiles Note: immutable after creation 16
  22. Launch Configuration: Create via CLI declare -r bdev="DeviceName=/dev/sda1" declare -r

    vtype="VolumeType=gp2" declare -r term="DeleteOnTermination=true" aws autoscaling create-launch-configuration \ --launch-configuration-name "${lc_name}" \ --image-id "${image_id}" \ --iam-instance-profile "${lc_name}-profile" \ --security-groups ${security_groups} \ --instance-type ${instance} \ --block-device-mappings \ "${bdev},Ebs={${term},${vtype},VolumeSize=${disk_size}}" 17
  23. Scaling Policies: Properties • Policy name • Metric type •

    Adjustment type • Scaling adjustment 18
  27. Scaling Policies: Create via CLI aws autoscaling put-scaling-policy \ --auto-scaling-group-name

    "${asg_name}" \ --policy-name "${scaling_policy_name}" \ --adjustment-type ChangeInCapacity \ --scaling-adjustment 1 19
  28. Scaling Policies: Attach Metric Alarm aws cloudwatch put-metric-alarm \ --alarm-name

    Step-Scaling-AlarmHigh-AddCapacity \ --metric-name CPUUtilization \ --namespace AWS/EC2 \ --statistic Average \ --period 120 \ --evaluation-periods 2 \ --threshold 60 \ --comparison-operator GreaterThanOrEqualToThreshold \ --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \ --alarm-actions "${policy_arn}" 20
  29. Custom Metrics: Report metric data aws cloudwatch put-metric-data \ --metric-name

    custom-metric-name \ --namespace MyOrg/Custom \ --unit Count \ --value ${value} \ --storage-resolution 1 \ --dimensions "AutoScalingGroupName=${asg_name}" 21
  30. Lifecycle Hooks: Properties We don’t use these, but lifecycle

    hooks can be used to provision software on newly launched instances and run similar actions. 22
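
Since the deck doesn't use lifecycle hooks, the following is only an illustrative sketch of what one looks like; the hook name, timeout, and default result are assumptions, not our configuration.

    # Hold newly launched instances in Pending:Wait until provisioning
    # completes (or the heartbeat timeout expires).
    aws autoscaling put-lifecycle-hook \
      --lifecycle-hook-name "provision-on-launch" \
      --auto-scaling-group-name "${asg_name}" \
      --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
      --heartbeat-timeout 300 \
      --default-result ABANDON

The instance (or a provisioning worker) then signals completion with aws autoscaling complete-lifecycle-action so the launch can proceed.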
  32. EC2 Instance Bootstrapping • Chef converge bootstrapping took ~15 minutes •

    Improved bootstrapping by an order of magnitude with fully baked AMIs • Now we fully bake AMIs for each config and app change (5 mins, one time per release per environment, a constant factor, using NixOS). Fully baking AMIs also gives us system reproducibility that convergent configuration systems like Chef couldn’t give us. 23
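
As a rough sketch of the "upload and register" step (the NixOS image-building pipeline itself isn't shown in the deck; the snapshot ID and naming scheme below are placeholders):

    # Register a fully baked image from an uploaded EBS snapshot.
    aws ec2 register-image \
      --name "backend-app-${release_version}" \
      --architecture x86_64 \
      --virtualization-type hvm \
      --ena-support \
      --root-device-name /dev/xvda \
      --block-device-mappings \
        "DeviceName=/dev/xvda,Ebs={SnapshotId=${snapshot_id},VolumeType=gp2,DeleteOnTermination=true}"

The returned ImageId is what a new launch configuration's --image-id points at.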
  35. Right-Size Instance Types per Service • We used to use

    whatever instance type was set before because $REASONS • Now we inspect each service’s resource usage in production at peak, typical, and overnight resting states to know how to size a service’s cluster. • Recommend this practice post-ASG or you are dropping $$$ in AWS’s lap and potentially hurting your product’s UX 24
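
One way to inspect a service's resource usage at peak versus resting states (a sketch, not necessarily the tooling we use) is to pull per-ASG CloudWatch statistics and compare time windows:

    # Hourly average and peak CPU for the ASG over the last day
    # (the -d flag assumes GNU date).
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \
      --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --period 3600 \
      --statistics Average Maximum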
  38. Find Leading Indicator Metric for Dynamic Scale Out/In • Every

    service behaves differently under load • We initially started dynamically scaling using policies based purely on CPU (a start, but not good enough for us) • Now we report custom metrics to AWS CloudWatch that are leading indicators that our cluster needs to scale out or in. This leads to more predictable performance on the site even under traffic spikes. 25
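
Tying the earlier slides together, a scale-out alarm on a custom leading-indicator metric looks like the CPUUtilization alarm on slide 28 but points at the custom namespace; the threshold, period, and alarm name here are illustrative, not our production values.

    # Alarm on the custom leading indicator instead of CPUUtilization.
    aws cloudwatch put-metric-alarm \
      --alarm-name Custom-Metric-AlarmHigh-AddCapacity \
      --namespace MyOrg/Custom \
      --metric-name custom-metric-name \
      --dimensions "Name=AutoScalingGroupName,Value=${asg_name}" \
      --statistic Average \
      --period 60 \
      --evaluation-periods 2 \
      --threshold "${scale_out_threshold}" \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-actions "${policy_arn}"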
  41. Fail-Safe Semantics for Deploy • AMI artifacts built and tested

    • AMIs for each service uploaded and registered with AWS EC2 • Brand new ASG + LC created referring to new AMI for release • Scaling policies from current/live ASG copied over to new ASG • Copy over min, max, and desired capacities from current to new • Wait for all desired instances to report app-level healthy • Add ASG to ALB with current/old ASG • Remove current/old ASG from ALB • Set min=desired=0 in old ASG • Clean up stale ASG (not old one, but older) 26
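
A minimal sketch of the ALB swap and scale-down steps from that list, assuming ALB target groups and plain aws CLI calls (the deck doesn't show the actual deploy script; variable names are illustrative):

    # Put the new release's ASG behind the same target group as the old one.
    aws autoscaling attach-load-balancer-target-groups \
      --auto-scaling-group-name "${new_asg_name}" \
      --target-group-arns "${target_group_arn}"
    # ... wait here for all desired instances to pass app-level health checks ...
    # Take the old ASG out of rotation, then drain it to zero.
    aws autoscaling detach-load-balancer-target-groups \
      --auto-scaling-group-name "${old_asg_name}" \
      --target-group-arns "${target_group_arn}"
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name "${old_asg_name}" \
      --min-size 0 \
      --desired-capacity 0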
  51. Other stuff • DONE Script rollback (~1 minute to previous

    version) • TODO Implement canary deploy capability • TODO Check error rates and/or latencies haven’t increased before removing old ASG from ALB • REMINDER your max capacity should be determined by your backend runtime dependencies (it’s transitive) 27
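
Rollback is then roughly the reverse of the swap above (a sketch only; it assumes the previous release's ASG still exists and can be scaled back up quickly thanks to the fully baked AMIs):

    # Bring the previous release's ASG back to capacity and swap it in.
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name "${old_asg_name}" \
      --min-size "${min_size}" \
      --desired-capacity "${desired_capacity}"
    # ... wait for app-level health, then swap the target group attachment ...
    aws autoscaling attach-load-balancer-target-groups \
      --auto-scaling-group-name "${old_asg_name}" \
      --target-group-arns "${target_group_arn}"
    aws autoscaling detach-load-balancer-target-groups \
      --auto-scaling-group-name "${new_asg_name}" \
      --target-group-arns "${target_group_arn}"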