
The Life of a Placement

A deep dive into how Nomad makes placement decisions.

Alex Dadgar

May 15, 2017

Transcript

  1. HASHICORP example.nomad

     # Define our simple redis job
     job "redis" {
       # Run only in us-east-1
       datacenters = ["us-east-1"]

       # Define the single redis task using Docker
       task "redis" {
         driver = "docker"

         config {
           image = "redis:latest"
         }

         resources {
           cpu    = 500 # MHz
           memory = 256 # MB

           network {
             mbits = 10
             port "redis" {}
           }
         }
       }
     }
  3. # Define our simple redis job
     job "redis" {
       ...

       group "redis" {
         # Ensure reliable network speed
         constraint {
           attribute = "${attr.platform.aws.instance-type}"
           value     = "m4.xlarge"
         }

         task "redis" {
           driver = "docker"

           config {
             image = "redis:latest"
           }

           resources {
             cpu    = 500 # MHz
             memory = 256 # MB

             network {
               mbits = 10
               port "redis" {}
             }
           }
         }
       }
     }
  4. # Define our simple redis job
     job "redis" {
       ...

       group "redis" {
         count = 3

         constraint {
           distinct_hosts = true
         }

         task "redis" {
           driver = "docker"

           config {
             image = "redis:latest"
           }

           resources {
             cpu    = 500 # MHz
             memory = 256 # MB

             network {
               mbits = 10
               port "redis" {}
             }
           }
         }
       }
     }
  5. # Define our simple redis job
     job "redis" {
       ...

       group "redis" {
         count = 3

         constraint {
           distinct_property = "${meta.rack}"
         }

         task "redis" {
           driver = "docker"

           config {
             image = "redis:latest"
           }

           resources {
             cpu    = 500 # MHz
             memory = 256 # MB

             network {
               mbits = 10
               port "redis" {}
             }
           }
         }
       }
     }
  6. What Impacts Placement?

     • Allowed datacenters
     • Driver type
     • Constraints
     • Resource request
       • CPU, memory, disk, network (bandwidth and ports)
       • Future: GPU

     @adadgar
  7. Scheduler

     • Reconcile with what already exists
     • Stop any unneeded allocations
     • In-place upgrade where possible
     • Rolling updates
     • Place the required new allocations
  8. Stack

     Node Source → Filter 1 → Filter 2 → Ranker 1 → Ranker 2 → Limit Selector

     • Chain multiple iterators together
     • Start with all possibly eligible nodes
     • The final output is an eligible, highly ranked node
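The iterator stack above can be sketched as chained closures. This is an illustrative Go sketch, not Nomad's actual types: the `node`, `source`, `filter`, and `rank` names are hypothetical. Each stage wraps the previous one, so candidate nodes stream lazily from the source through filters and rankers.

```go
package main

import "fmt"

// node is a hypothetical candidate with free CPU (MHz) and a score.
type node struct {
	name  string
	cpu   int
	score float64
}

// iterator yields candidate nodes one at a time; nil means exhausted.
type iterator func() *node

// source returns an iterator over all nodes in the cluster.
func source(nodes []node) iterator {
	i := 0
	return func() *node {
		if i >= len(nodes) {
			return nil
		}
		n := &nodes[i]
		i++
		return n
	}
}

// filter drops nodes that fail the predicate.
func filter(in iterator, keep func(*node) bool) iterator {
	return func() *node {
		for n := in(); n != nil; n = in() {
			if keep(n) {
				return n
			}
		}
		return nil
	}
}

// rank assigns a score to each node that made it through the filters.
func rank(in iterator, score func(*node) float64) iterator {
	return func() *node {
		n := in()
		if n != nil {
			n.score = score(n)
		}
		return n
	}
}

func main() {
	nodes := []node{{"a", 400, 0}, {"b", 900, 0}, {"c", 700, 0}}
	it := source(nodes)
	it = filter(it, func(n *node) bool { return n.cpu >= 500 }) // placement needs 500 MHz
	it = rank(it, func(n *node) float64 { return float64(n.cpu) / 1000 })
	for n := it(); n != nil; n = it() {
		fmt.Printf("%s scored %.1f\n", n.name, n.score)
	}
}
```

Because each stage is lazy, a downstream limit selector can stop pulling nodes early without the whole cluster being scored.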
  9. job "redis" {
       datacenters = ["us-east-1"]

       group "redis" {
         count = 3

         constraint {
           attribute = "${attr.platform.aws.instance-type}"
           value     = "m4.xlarge"
         }

         task "redis" {
           driver = "docker"

           config {
             image = "redis:latest"
           }

           resources {
             cpu    = 500 # MHz
             memory = 256 # MB

             network {
               mbits = 10
               port "redis" {}
             }
           }
         }
       }
     }
  12. Types of Constraint Filters

     • Regex matcher: AWS instance type matches "m4.[0-9]+xlarge"
     • Version: is the Java version >= 1.8.0?
     • Distinct hosts: spread across unique hosts
     • Distinct property: spread across unique values of an attribute (e.g. different racks)
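The regex-matcher kind of constraint can be sketched with Go's standard `regexp` package; `matchesConstraint` is a hypothetical helper for illustration, not Nomad's API.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesConstraint reports whether a node attribute satisfies a
// regex constraint, mirroring the slide's instance-type example.
func matchesConstraint(attr, pattern string) bool {
	ok, err := regexp.MatchString(pattern, attr)
	return err == nil && ok
}

func main() {
	fmt.Println(matchesConstraint("m4.4xlarge", "m4.[0-9]+xlarge")) // true
	fmt.Println(matchesConstraint("t2.micro", "m4.[0-9]+xlarge"))   // false
}
```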
  13. Stack

     Node Source → Filter 1 → Filter 2 → Ranker 1 → Ranker 2 → Limit Selector

     • A ranker can both filter and add a score to each node
  14. Bin Packer

     • Filter nodes that can't fit the placement (not enough resources)
     • Score nodes based upon their density
  15. How?

     • When nodes join, they fingerprint the host and determine its resources
     • Filtering step: do the already placed allocations plus the new allocation fit on the node?
     • Rank: produce a score between 0 and 18
       • Score = 20 - 10^x - 10^y, where x and y are the node's free CPU and memory fractions
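The formula on this slide can be checked with a short sketch. This is a minimal illustration of the stated formula (CPU and memory only; the `scoreFit` name and parameters are simplified for the example):

```go
package main

import (
	"fmt"
	"math"
)

// scoreFit sketches the slide's formula: score = 20 - 10^x - 10^y,
// where x and y are the free CPU and memory fractions remaining after
// the proposed placement. A fully packed node scores 18 (20 minus
// 10^0 + 10^0) and an idle node scores 0 (20 minus 10^1 + 10^1), so
// denser placements rank higher.
func scoreFit(nodeCPU, nodeMem, usedCPU, usedMem float64) float64 {
	freeCPU := 1 - usedCPU/nodeCPU
	freeMem := 1 - usedMem/nodeMem
	return 20 - math.Pow(10, freeCPU) - math.Pow(10, freeMem)
}

func main() {
	fmt.Println(scoreFit(1000, 1024, 1000, 1024)) // fully utilized node
	fmt.Println(scoreFit(1000, 1024, 0, 0))       // idle node
}
```

The exponential shape means the score changes slowly on mostly-empty nodes but rewards the last bit of packing heavily.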
  16. Why Bin Packing versus Spread?

     • Accommodate heterogeneous workloads
     • Easier capacity planning
     • Future: autoscaling
  17–22. Heterogeneous Workloads

     [Diagram, built up across six slides: a stream of 3 GB and 4 GB (and eventually 6 GB) RAM tasks is placed onto nodes, comparing the resulting layouts under bin packing versus spread.]
  23. Why Bin Packing versus Spread?

     • Accommodate heterogeneous workloads
     • Easier capacity planning
     • Future: autoscaling
  24. Capacity Planning

     • Assume 10 nodes
     • Scheduling many jobs that take roughly half of the overall resources
  25. Why Bin Packing versus Spread?

     • Accommodate heterogeneous workloads
     • Easier capacity planning
     • Future: autoscaling
  26–28. Auto Scaling - Scaling Up

     [Diagram, built up across three slides: additional 6 GB RAM tasks arrive; under bin packing they fit onto the existing nodes, while under spread the fragmented free memory forces an extra node.]
  29. Auto Scaling - Scaling Up

     [Diagram: final layouts.] Bin pack wasted = 5 / 32 = 15.6%. Spread wasted = 35%, plus an extra node.
  30–37. Auto Scaling - Scaling Down

     [Diagram, built up across eight slides: as tasks finish, bin packing concentrates the remaining tasks so whole nodes drain and can be released, while spread leaves every node partially utilized.]
  38. Job Anti-Affinity

     • Does not filter nodes
     • Applies a penalty for colocating multiple placements of the same job onto the same node
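The penalty described above can be sketched as a score adjustment. The function name and the 0.5 weight here are illustrative, not Nomad's actual values:

```go
package main

import "fmt"

// antiAffinityPenalty returns the score penalty for placing another
// allocation of a job onto a node that already runs some: the node is
// never filtered out, it just ranks lower the more it already holds.
func antiAffinityPenalty(existingSameJobAllocs int) float64 {
	const penaltyPerAlloc = 0.5 // hypothetical weight
	return penaltyPerAlloc * float64(existingSameJobAllocs)
}

func main() {
	base := 12.0 // score from earlier rankers, e.g. bin packing
	fmt.Println(base - antiAffinityPenalty(0)) // empty node keeps its score
	fmt.Println(base - antiAffinityPenalty(2)) // two colocated allocs lower the rank
}
```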
  39. Stack

     Node Source → Filter 1 → Filter 2 → Ranker 1 → Ranker 2 → Limit Selector

     • The limit selector picks the highest-scored node using sampling
  40. Limit Selector

     [Diagram: the limit selector samples from ranked nodes with scores such as 17, 1, 3, 11, 8, 16, 12, 5, 3, 7, 14 and picks the 17. Ranking is costly.]
  41. Why is ranking expensive?

     • Each placement must go through all iterators
     • Must be done sequentially
     • The bin-packing and job anti-affinity scores may change between placements
  42. Power of Two Choices

     • Imagine n balls placed into n buckets
     • Maximum load in any bucket:
       • Chosen at random: ~ log(n) / log log(n)
       • Placed sequentially, picking the least full of d sampled bins: ~ log log(n) / log(d) + O(1)
     • d = 2 yields an exponential improvement
  43. Power of Two Choices

     • Inspired by the Berkeley Sparrow scheduler
     • Originally from load-balancing research
     • Talk: https://people.eecs.berkeley.edu/~keo/talks/sparrow-sosp-talk.pdf
     • Paper: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf
  44. All that for one allocation!

     Node Source → Filter 1 → Filter 2 → Ranker 1 → Ranker 2 → Limit Selector → Allocation