More bang for your buck: How Yelp autoscales Mesos & Marathon on AWS Spotfleet

Rob Johnson

October 26, 2017

Transcript

  1. Rob Johnson [email protected]
    More bang for your buck: How Yelp autoscales Mesos & Marathon on AWS Spot Fleet
  2. • Transitioned to a service-oriented architecture (SOA) over a three-year period
    • Monolith still exists, but is now deployed and monitored like any other service (just not so micro)
  3. • 3 main production clusters
    • ~900 Marathon Apps in biggest cluster
    • ~5500 Mesos tasks
    • ~600 Mesos Agents
    • Spans across metal DC and AWS
  4. $ cat yelpsoa_configs/my_service/marathon-norcal-prod.yaml
    main:
      cpu: 10
      mem: 500
      min_instances: 5
      max_instances: 20
      metrics_provider: cpu
      decision_policy: pid
      cmd: dumb-init exec ./run-yelp
  5. service_autoscaler.py
    target = get_target_utilization(service, instance)
    real = get_real_utilization(service, instance)
    required_instances = decision_policy.get_instances(target, real)
    zk.set('/autoscaling/service/instance/instances', required_instances)
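  6. deploy_daemon.py
    def handle_instance_change(service, instance, new_val):
        current_val = get_instance_count(service, instance)
        if new_val >= current_val:
            # Scaling up (or no change): ask Marathon directly.
            marathon.scale_app('service.instance', new_val)
        else:
            # Scaling down: drain tasks before removing them.
            drain_and_scale('service.instance')

    # Re-run the handler whenever the autoscaler writes a new instance count.
    zookeeper.watch('/autoscaling/', handle_instance_change)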
  7. • Single decision policy: proportional
    • Runs every 20 minutes
    • Aim for 80% utilization
    • Errs on the side of defence: lots of checks to avoid accidentally killing too much capacity
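
A proportional policy with an 80% target can be sketched in a few lines. This is a minimal illustration, not Yelp's implementation: the function name, the capacity bounds, and the `max_step_down` guard (a stand-in for the "lots of checks" mentioned on the slide) are all assumptions.

```python
import math


def proportional_target(current_capacity, utilization, setpoint=0.8,
                        min_capacity=1, max_capacity=1000, max_step_down=0.25):
    """Return the capacity that would bring utilization back to the setpoint.

    All parameter names and defaults except the 0.8 setpoint are hypothetical.
    """
    if current_capacity <= 0:
        return min_capacity
    # Proportional rule: scale capacity by the ratio of real to target utilization.
    ideal = math.ceil(current_capacity * utilization / setpoint)
    # Defensive check (assumed): never drop more than max_step_down of the
    # current capacity in a single run.
    floor = math.floor(current_capacity * (1 - max_step_down))
    target = max(ideal, floor)
    # Clamp to the configured bounds.
    return max(min_capacity, min(max_capacity, target))
```

With 100 units running at 90% utilization the sketch asks for more capacity; at 10% utilization it shrinks, but only as far as the step-down guard allows in one run.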
  8. • Give each host a ‘fitness’ score according to how much churn is caused by shutting it down.
    • Data points = AWS events, number of tasks, Chronos batches.
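
One way to picture such a score is below. It is a rough sketch only: the slide names the data points but not how they are combined, so the `Host` fields, the weights, and the rule that a host with a pending AWS event costs nothing to remove are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    running_tasks: int
    chronos_batches: int
    aws_event_scheduled: bool  # e.g. AWS has already scheduled the host for removal


def churn_cost(host, task_weight=1.0, batch_weight=5.0):
    """Estimate the churn caused by shutting the host down (weights are hypothetical)."""
    if host.aws_event_scheduled:
        # Assumed rule: the host is going away anyway, so removing it adds no churn.
        return 0.0
    return host.running_tasks * task_weight + host.chronos_batches * batch_weight


def pick_hosts_to_remove(hosts, count):
    """Return the `count` hosts whose removal causes the least churn."""
    return sorted(hosts, key=churn_cost)[:count]
```

Sorting by the estimated churn means a scale-down removes idle or already-doomed hosts before busy ones.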
  9. - Users bid for Amazon’s spare capacity
    - Lowest winning bid is the $$ paid
    [Diagram] Capacity: Used, Used, Used, Available, Available, Available, Available
    [Diagram] Bids: User A - $4, User A - $4, User B - $3, User C - $2, User C - $2, User D - $1, User D - $1, User D - $1
  10. - Users bid for Amazon’s spare capacity
    - Lowest winning bid is the $$ paid
    [Diagram] Capacity: Used, Used, Used, User A - $2, User A - $2, User B - $2, User C - $2
    [Diagram] Bids: User A - $4, User A - $4, User B - $3, User C - $2, User C - $2, User D - $1, User D - $1, User D - $1
  11. - Users bid for Amazon’s spare capacity
    - Lowest winning bid is the $$ paid
    [Diagram] Capacity: Used, Used, Used, User A - $3, User A - $3, User B - $3, User B - $3
    [Diagram] Bids: User A - $4, User A - $4, User B - $3, User B - $3, User C - $2, User C - $2, User D - $1, User D - $1
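
The simplified auction in the three diagrams above can be reproduced with a short sketch. It models only the slides' rule (highest bids win the spare slots, and every winner pays the lowest winning bid); it is not a description of AWS's actual pricing mechanism, and the function name and data layout are assumptions.

```python
def clear_spot_market(bids, available):
    """Allocate `available` slots to the highest bids.

    `bids` is a list of (user, price) tuples, one per instance requested.
    Returns (winning users, price everyone pays); the price is None if
    nothing was allocated.
    """
    # sorted() is stable, so equal bids keep their original order.
    ranked = sorted(bids, key=lambda bid: bid[1], reverse=True)
    winners = ranked[:available]
    price = winners[-1][1] if winners else None
    return [user for user, _ in winners], price
```

Feeding in the bids from the second diagram with four free slots gives users A, A, B, C at $2; once User B places a second $3 bid, as in the third diagram, User C is outbid and the price everyone pays rises to $3.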
  12. • High Bid Price
    ◦ Savings in low periods will outweigh expenditure in expensive periods
  13. • High Bid Price
    ◦ Savings in low periods will outweigh expenditure in expensive periods
    ◦ We bid 2X instance price
  14. • Diversify by AZ, Instance Type
    ◦ Ask Amazon to fulfill the request by diversifying across instance types, rather than picking the cheapest selection (Allocation Strategy)
  15. module "norcal-prod-uswest1a-highcpus6" {
      source = "git::ssh://[email protected]/terraform-modules/paasta_spot_cluster"
      cluster = "norcal-prod"
      region = "${var.region}"
      account = "${var.account}"
      ecosystem = "${var.ecosystem}"
      instances_data = "${file("instances_high_cpus_weighted.json")}"
      account_id = "${var.account_id}"
      valid_until = "2118-12-31T23:59:59Z"
      # One unit = 100 vCPU
      min_capacity = 7
      max_capacity = 70
      ami_type = "paasta-optimized"
      initial_target_capacity = 25
      spot_price = 0.154
      instance_profile = "paasta"
    }
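
To get a feel for what capacity units mean in practice, the sketch below converts a target capacity into an instance count for a single instance type. It assumes Spot Fleet's documented weighting behaviour (target capacity divided by the instance weight, rounded up) and assumes the module passes its capacity values through unchanged; the function name is hypothetical, and the weights 0.15 and 0.35 are the kind found in the weighted instance file the module loads.

```python
import math
from decimal import Decimal


def instances_needed(target_capacity, weight):
    """Instances of one type needed to reach `target_capacity` units.

    `weight` is passed as a string (as in the JSON file) and parsed with
    Decimal to avoid binary floating-point rounding surprises.
    """
    return math.ceil(Decimal(target_capacity) / Decimal(weight))


# With "One unit = 100 vCPU", an initial target of 25 units is 2500 vCPUs.
initial_vcpus = 25 * 100
```

For example, `instances_needed(25, "0.15")` and `instances_needed(25, "0.35")` show how many instances of a small or a large type would be needed if the whole initial target were met by that one type.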
  16. robj@xenialdev1-uswest1cdevc:~/terraform/paasta master % cat instances_high_cpus_weighted.json
    {
      "instance_data": [
        { "type": "c4.4xlarge", "price": "2.098", "weight": "0.15" },
        { "type": "c4.8xlarge", "price": "4.196", "weight": "0.35" },
        { "type": "m4.4xlarge", "price": "2.234", "weight": "0.15" }
      ]
    }
    [Slide annotations] 16 vCPUs, 36 vCPUs, 14 vCPUs
  17. Type / Region    us-west-1   us-east-1   us-west-2
      c3.4xlarge       29.00%      0.00%       0.00%
      c3.8xlarge       27.00%      0.00%       42.00%
      c4.4xlarge       52.00%      49.00%      78.00%
      c4.8xlarge       49.00%      53.00%      81.00%
      m4.10xlarge      65.00%      77.00%      65.00%
      m4.16xlarge      47.00%      59.00%      58.00%
      m4.4xlarge       60.00%      70.00%      62.00%
      r3.4xlarge       32.00%      0.00%       34.00%
      r3.8xlarge       41.00%      0.00%       48.00%
      r4.16xlarge      71.00%      62.00%      61.00%
      r4.4xlarge       45.00%      49.00%      35.00%
      r4.8xlarge       48.00%      34.00%      42.00%
      Weighted Total   47.00%      51.00%      60.00%
  18. • Autoscaling provides Yelp business value as we save $$$ by reducing excess capacity
    • Running at 80% efficiency means we can quickly scale up services.
    • Spot Fleet can further reduce our AWS bill, but comes with significant risk.
    • Mesos maintenance primitives provide building blocks for us to reduce this risk.