
Scheduling Applications on Clusters with Nomad

A brief look at Nomad (https://www.nomadproject.io/), created by HashiCorp for the scalable deployment and scheduling of applications. It is part of a toolset that makes DevOps work much easier.

Also a look at the basics of a 'Nomad Simulator' project, which simulates a cluster in memory, presents its nodes to Nomad as if they were real machines, runs Nomad jobs on them, and retrieves statistics about the scheduling process. The simulator, together with a separate visualizer tailored specifically to its output, was part of a software engineering internship project at Hooklift, Inc.


Santiago Zubieta

March 01, 2016


  1. Scheduling Applications on Clusters with Nomad

    Santiago Martín Zubieta Ortiz, Software Engineering Intern at Hooklift, Inc.
  2. Thanks for the speaker space

  3. @mde_devops Thanks for the awesome community

  4. Thanks for the experience

  5. Scheduling

  6. Scheduling

    How to assign resources to perform some work, with a strategy that aims at achieving a certain goal.
  7. Operating Systems

    Many think first of operating systems. Give resources... to long-lived tasks? To the most recent tasks? Equally to everything? To whichever tasks fit best in what's available? In this case, work competes for the resources of a 'single' entity, a computer, and takes turns sharing the use of that entity.
  8. Operating Systems: https://vissicompcodder.wordpress.com/category/best-fit/

  9. Clusters

    Now we have multiple networked computers. Competing for the resources of a single entity is no longer necessary, but scheduling work across the multiple entities still is. Run a task on every machine in the cluster? Assign work only to entities that meet some constraints?
  10. Clusters

    When placing work, don't think about how it will be laid out inside a machine. Just treat each entity as its overall resources; once work is assigned to it, the entity resolves on its own how to allocate that work within its operating system.

    [Diagram: a unit of work needing cpu, mem, and storage, waiting ("?") to be placed on a cluster of three clients (client 1, client 2, client 3), each with its own cpu, memory, and storage.]
  11. Clusters

    [Diagram: three clients (client 1, client 2, client 3), each with its own cpu, memory, and storage, shown individually and grouped as a cluster that receives work (cpu, mem, storage).]
  12. Nomad

  13. None
  14. gossip protocol consensus protocol

  15. Regions are fully independent from each other, and do not share jobs, clients, or state. They are loosely-coupled using a gossip protocol, which allows users to submit jobs to any region or query the state of any region transparently. Requests are forwarded to the appropriate server to be processed and the results returned.

    https://www.nomadproject.io/docs/internals/architecture.html
  16. Optimistic vs Pessimistic, Internal vs External State, Single vs Multi Level, Service vs Batch Oriented
  17. Optimistic vs Pessimistic

    Optimistic: do the work assuming it will succeed, and abort if it fails. Pessimistic: assume others will try to use my resource, so guard it with a mutex.
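To make the contrast on this slide concrete, here is a minimal Python sketch. It is only an illustration: the `Node` class, its `version` counter, and the function names are assumptions made up for this example, not part of Nomad or of any scheduler named in the talk.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node: free CPU plus a version stamp bumped on every change."""
    free_cpu: int
    version: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


def place_pessimistic(node: Node, cpu: int) -> bool:
    """Pessimistic: hold the lock for the whole read-decide-write step."""
    with node.lock:
        if node.free_cpu < cpu:
            return False
        node.free_cpu -= cpu
        node.version += 1
        return True


def plan_optimistic(node: Node, cpu: int):
    """Optimistic: plan against a snapshot without locking; None if it cannot fit."""
    if node.free_cpu < cpu:
        return None
    return (node.version, cpu)


def commit_optimistic(node: Node, plan) -> bool:
    """Apply the plan only if the node is unchanged since the snapshot was taken."""
    seen_version, cpu = plan
    with node.lock:  # short critical section, only for the final check
        if node.version != seen_version or node.free_cpu < cpu:
            return False  # someone else got there first: abort, caller may retry
        node.free_cpu -= cpu
        node.version += 1
        return True


node = Node(free_cpu=4)
plan_a = plan_optimistic(node, 3)  # two schedulers plan against the same snapshot
plan_b = plan_optimistic(node, 3)
first = commit_optimistic(node, plan_a)   # succeeds
second = commit_optimistic(node, plan_b)  # aborted: the node changed in between
```

The optimistic path lets both planners work in parallel and pays only when a conflict actually happens; the pessimistic path never aborts but serializes every placement on the lock.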
  18. Internal vs External State

    Management and coordination of state: should the user provide it (easier for the developer), or should the program acquire it itself (easier for the user)? Google Omega: internal state allows good concurrency.
  19. Single vs Multi Level

    Single level ~ fixed, multi level ~ pluggable. Sparrow: fixed function, optimized for high-throughput batch scheduling, no alternatives. Mesos: multiple schedulers on top of one cluster manager that handles different kinds of work.
  20. Service vs Batch Oriented

    Service: long-lived jobs. Batch: fast, short-lived jobs. Microsoft's Apollo is extremely optimistic with batch work: it doesn't consider conflicts up front, sends the work to a client, and lets the client resolve any problems.
  21. What is provided to Nomad:

    Job - A Job is a specification provided by users that declares a workload for Nomad. A Job is composed of one or more task groups. See job spec example: https://www.nomadproject.io/docs/jobspec/index.html

    Task Group - A Task Group is a set of tasks that must be run together. The entire group must run on the same client node and cannot be split.

    Task - Tasks are executed by drivers (e.g. Docker, Qemu, Java, exec binaries, etc.). Tasks specify their driver, configuration for the driver, constraints, and resources required.
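The Job / Task Group / Task hierarchy described on this slide can be pictured as nested records. The sketch below is a plain Python model for illustration only; the class and field names (`cpu_mhz`, `memory_mb`, `total_resources`, and so on) are assumptions and are neither Nomad's real structs nor its job spec syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resources:
    cpu_mhz: int
    memory_mb: int


@dataclass
class Task:
    name: str
    driver: str                      # e.g. "docker", "qemu", "java", "exec"
    config: Dict[str, str]           # driver-specific configuration
    resources: Resources
    constraints: List[str] = field(default_factory=list)


@dataclass
class TaskGroup:
    name: str
    tasks: List[Task]

    def total_resources(self) -> Resources:
        """A group runs on a single node, so the node must fit the sum of its tasks."""
        return Resources(
            cpu_mhz=sum(t.resources.cpu_mhz for t in self.tasks),
            memory_mb=sum(t.resources.memory_mb for t in self.tasks),
        )


@dataclass
class Job:
    name: str
    task_groups: List[TaskGroup]


# Hypothetical job: one group with a web server and a sidecar that must stay together.
web = Task("web", "docker", {"image": "example/web"}, Resources(500, 256))
sidecar = Task("metrics", "exec", {"command": "/bin/metrics"}, Resources(200, 128))
job = Job("frontend", [TaskGroup("web-group", [web, sidecar])])
needed = job.task_groups[0].total_resources()
```

Because a task group cannot be split across clients, the scheduler has to find one node that can hold `needed` as a whole, rather than placing each task separately.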
  22. What is provided to Nomad:

    Client - A Client of Nomad is a machine that tasks can be run on. No information is provided about the clients other than networking info such as ports and IPs of servers. The running agent is in charge of obtaining the resources and constraints of the client's system, via 'fingerprinting'.

    Server - Nomad servers are the brains of the cluster. 3~5 servers per region give a good balance between availability (in case of failure) and speed (more servers make consensus slower). The servers in each datacenter are part of a single consensus group. See server/client example: https://www.nomadproject.io/intro/getting-started/cluster.html

    After servers and clients are registered, they are loaded into memory as structs. Scheduling is done in memory over those structs, and the resulting operation is then relayed to the respective agent.
  23. We only provide what we want to run and the places available to run it; Nomad makes the decision. But how does it decide? Easy! (?)
  24. Bin Packing

    Fitting best into the overall resources is ideal, but sometimes other things matter, such as software constraints, availability, speed of response, etc., which may yield results that differ from a perfect best fit. Nomad aims to provide optimal bin packing, but tradeoffs can be made according to the desired speed of response.
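As a rough illustration of the best-fit idea on this slide, the sketch below places work on the node that would be left with the least free capacity. The scoring function is a deliberately simple assumption (average leftover fraction of CPU and memory), not Nomad's actual scoring, and the `Node` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu: int
    mem: int
    used_cpu: int = 0
    used_mem: int = 0


def fits(node: Node, cpu: int, mem: int) -> bool:
    return node.used_cpu + cpu <= node.cpu and node.used_mem + mem <= node.mem


def leftover_after(node: Node, cpu: int, mem: int) -> float:
    """Average fraction of capacity left free after the placement (lower = tighter)."""
    free_cpu = (node.cpu - node.used_cpu - cpu) / node.cpu
    free_mem = (node.mem - node.used_mem - mem) / node.mem
    return (free_cpu + free_mem) / 2


def best_fit(nodes: List[Node], cpu: int, mem: int) -> Optional[Node]:
    """Pick the feasible node with the tightest fit and record the allocation."""
    candidates = [n for n in nodes if fits(n, cpu, mem)]
    if not candidates:
        return None  # failed placement
    chosen = min(candidates, key=lambda n: leftover_after(n, cpu, mem))
    chosen.used_cpu += cpu
    chosen.used_mem += mem
    return chosen


nodes = [
    Node("a", cpu=4000, mem=8192),                               # empty node
    Node("b", cpu=4000, mem=8192, used_cpu=3000, used_mem=6144), # mostly full node
]
first = best_fit(nodes, cpu=500, mem=1024)    # tightest fit is the mostly full node
second = best_fit(nodes, cpu=1000, mem=1024)  # no longer fits on "b"
third = best_fit(nodes, cpu=5000, mem=1024)   # fits nowhere
```

Filling the busiest feasible node first keeps the empty node free for larger work, which is the point of bin packing; a real scheduler would weigh this against the other concerns listed on the slide.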
  25. System Scheduling

    The system scheduler is used to register jobs that should be run on all clients that meet the job's constraints. The system scheduler is also invoked when clients join the cluster or transition into the ready state. This means that all registered system jobs will be re-evaluated and their tasks will be placed on the newly available nodes if the constraints are met. This scheduler type is extremely useful for deploying and managing tasks that should be present on every node in the cluster. Since these tasks are being managed by Nomad, they can take advantage of job updating, rolling deploys, service discovery and more.

    https://www.nomadproject.io/docs/jobspec/schedulers.html
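The behaviour described here (run on every eligible client, and re-evaluate when a client becomes ready) can be sketched in a few lines of Python. Everything in this example is hypothetical: the `Cluster` class, the attribute-equality constraints, and the method names are assumptions for illustration, not Nomad's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    name: str
    attributes: Dict[str, str]
    running: Set[str] = field(default_factory=set)


@dataclass
class SystemJob:
    name: str
    constraints: Dict[str, str]  # attribute -> required value (assumed form)

    def eligible(self, node: Node) -> bool:
        return all(node.attributes.get(k) == v for k, v in self.constraints.items())


class Cluster:
    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.system_jobs: List[SystemJob] = []

    def register_job(self, job: SystemJob) -> None:
        """A new system job is placed on every node that meets its constraints."""
        self.system_jobs.append(job)
        for node in self.nodes:
            self._place(job, node)

    def node_ready(self, node: Node) -> None:
        """A node joining triggers re-evaluation of all registered system jobs."""
        self.nodes.append(node)
        for job in self.system_jobs:
            self._place(job, node)

    @staticmethod
    def _place(job: SystemJob, node: Node) -> None:
        if job.eligible(node):
            node.running.add(job.name)


cluster = Cluster()
n1 = Node("n1", {"kernel": "linux"})
n2 = Node("n2", {"kernel": "linux"})
n3 = Node("n3", {"kernel": "windows"})
cluster.node_ready(n1)
cluster.register_job(SystemJob("log-shipper", {"kernel": "linux"}))
cluster.node_ready(n2)  # joins later, still receives the system job
cluster.node_ready(n3)  # does not meet the constraint
```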
  26. Batch Scheduling

    Batch jobs are much less sensitive to short-term performance fluctuations and are short-lived, finishing in a few minutes to a few days. Although the batch scheduler is very similar to the service scheduler, it makes certain optimizations for the batch workload. The main distinction is that after finding the set of nodes that meet the job's constraints, it uses the power of two choices described in Berkeley's Sparrow scheduler to limit the number of nodes that are ranked.

    https://www.nomadproject.io/docs/jobspec/schedulers.html
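A minimal sketch of the "power of two choices" idea mentioned on this slide: instead of ranking every feasible node, sample two at random and rank only those. The function name, the dictionary of free CPU, and the tightest-fit ranking are assumptions for illustration; the slide does not say how Nomad ranks the sampled nodes.

```python
import random
from typing import Dict, Optional


def pick_two_choices(free_cpu: Dict[str, int], cpu: int,
                     rng: random.Random) -> Optional[str]:
    """Sample at most two feasible nodes and return the better of the sample."""
    feasible = [name for name, free in free_cpu.items() if free >= cpu]
    if not feasible:
        return None
    sample = rng.sample(feasible, k=min(2, len(feasible)))
    # Rank only the sampled nodes; "better" here means the tightest fit.
    return min(sample, key=lambda name: free_cpu[name] - cpu)


rng = random.Random(0)
free = {"a": 100, "b": 900, "c": 500, "d": 50}
choice = pick_two_choices(free, cpu=200, rng=rng)    # only "b" and "c" are feasible
nothing = pick_two_choices(free, cpu=1000, rng=rng)  # no node has enough CPU
```

The cost of a decision stays constant no matter how many nodes are feasible, at the price of sometimes missing the globally best node. That trade suits short-lived batch work better than long-lived services.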
  27. Service Scheduling

    The service scheduler is designed for scheduling long-lived services that should never go down. As such, the service scheduler ranks a large portion of the nodes that meet the job's constraints and selects the optimal node to place a task group on. The service scheduler uses a best fit scoring algorithm influenced by Google's work on Borg. Ranking this larger set of candidate nodes increases scheduling time but provides greater guarantees about the optimality of a job placement, which given the service workload is highly desirable.

    https://www.nomadproject.io/docs/jobspec/schedulers.html
  28. My internship project at Hooklift, Inc.

    What if we wanted our own 'virtual' cluster to evaluate the performance of some jobs we want to run, for example ahead of acquiring a real cluster? Or to evaluate the efficiency of bin packing and gather metrics useful for comparison with other schedulers? Hopeful OSS contribution to Nomad ;-)
  29. Simulator Hopeful OSS contribution to Nomad ;-)

  30. Simulator Hopeful OSS contribution to Nomad ;-)

  31. Visualizer Hooklift’s own tool :-P

  32. Node consumption; distribution of allocation times; sorted list of allocation counts; playback controls; CPU, memory, disk; changed node(s). Each line represents a node. Each iteration is a job evaluation and some relevant statistics. Press to see simulator input/output specifications :-)
  33. Evaluating Bin Packing

    The usual rules apply for allocation: Jobs must belong to the same region as the nodes, otherwise they won't allocate. Jobs must name a datacenter that matches the nodes' datacenter, otherwise they won't allocate. Jobs' task drivers must be available on the nodes, otherwise they won't allocate. To evaluate bin packing specifically, it's better if all jobs and nodes belong to a single datacenter, in the same region, with the same task drivers, so that all nodes are eligible for scheduling and the bin packing is easier to see.
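The three eligibility rules on this slide amount to a filter applied before any bin packing happens. The sketch below shows one way to express that filter; the `Node` and `Job` records and their fields are assumptions for illustration, not the simulator's input format.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    datacenter: str
    drivers: FrozenSet[str]


@dataclass(frozen=True)
class Job:
    name: str
    region: str
    datacenters: FrozenSet[str]
    drivers: FrozenSet[str]  # drivers needed by the job's tasks


def feasible_nodes(job: Job, nodes: List[Node]) -> List[Node]:
    """Keep only nodes matching the job's region, datacenter, and drivers."""
    return [
        n for n in nodes
        if n.region == job.region
        and n.datacenter in job.datacenters
        and job.drivers <= n.drivers  # every needed driver is present
    ]


nodes = [
    Node("n1", "global", "dc1", frozenset({"docker", "exec"})),
    Node("n2", "global", "dc2", frozenset({"docker"})),  # wrong datacenter
    Node("n3", "europe", "dc1", frozenset({"docker"})),  # wrong region
    Node("n4", "global", "dc1", frozenset({"java"})),    # missing driver
]
job = Job("web", "global", frozenset({"dc1"}), frozenset({"docker"}))
eligible = feasible_nodes(job, nodes)
```

When every node shares one region, one datacenter, and the same drivers, this filter keeps all of them, so any difference in where work lands comes from the packing logic alone.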
  34. Simulator Goals Met

    • It should simulate a cluster of machines or nodes in memory, presenting them to Nomad as if they were real machines
    • It should allow configuring a pre-existing cluster state before a given simulation is run
    • It should exercise Nomad's scheduling algorithms on the simulated cluster
    • It should show metrics about placements such as:
      ◦ Time taken for a given scheduling algorithm to make a placement decision, under different cluster load conditions.
      ◦ TaskGroup Placements per second
      ◦ Job Placements per second
      ◦ Failed placements
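The placement metrics listed on this slide could be gathered with a small wrapper around each placement attempt. This is a hedged sketch only: `PlacementStats`, `record`, and the toy `place_one` function are hypothetical names, and the real simulator's measurement code is not shown in the slides.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PlacementStats:
    durations: List[float] = field(default_factory=list)  # seconds per decision
    placed: int = 0
    failed: int = 0

    def record(self, place: Callable[[], bool]) -> bool:
        """Time one placement attempt and count it as placed or failed."""
        start = time.perf_counter()
        ok = place()
        self.durations.append(time.perf_counter() - start)
        if ok:
            self.placed += 1
        else:
            self.failed += 1
        return ok

    def placements_per_second(self) -> float:
        total = sum(self.durations)
        return self.placed / total if total > 0 else 0.0


capacity = [2]  # toy cluster with room for two placements


def place_one() -> bool:
    """Stand-in for a scheduler call: succeeds while capacity remains."""
    if capacity[0] > 0:
        capacity[0] -= 1
        return True
    return False


stats = PlacementStats()
results = [stats.record(place_one) for _ in range(3)]
rate = stats.placements_per_second()
```

Running the same wrapper against clusters with different pre-existing load would give the per-decision timings under different load conditions that the goals call for.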