Scheduling Applications on Clusters with Nomad

Santiago Martín Zubieta Ortiz
Software Engineering Intern at Hooklift, Inc.

Thanks for the speaker space

@mde_devops Thanks for the awesome community

Thanks for the experience

Scheduling

How to assign resources to perform some work, with a strategy that aims to achieve a certain goal.

Operating Systems

Many think first of operating systems. Give resources... to long-lived tasks? To the most recent tasks? Equally to everything? To fit tasks best into what's available?

In this case, there is competition for the resources of a ‘single’ entity, which is a computer, and the work takes turns sharing the usage of that entity.

Operating Systems

https://vissicompcodder.wordpress.com/category/best-fit/

Clusters

Now we have multiple networked computers. Competing for the resources of a single entity is no longer necessary, but scheduling work across the multiple entities still is.

Run a task on the whole cluster? Assign work to entities that meet some constraints?

Clusters

Don’t think about the internal placement of work when placing it. Just see an entity as its overall resources; if work is assigned to it, it will resolve on its own how to allocate that work in its operating system.

[Figure: a cluster of three clients (client 1, client 2, client 3), each with cpu, memory, and storage, and a piece of work with its own cpu, mem, and storage needs waiting to be placed.]

Clusters

[Figure: two views of a cluster with client 1, client 2, and client 3 (cpu, memory, and storage each) and the work (cpu, mem, storage) assigned to the cluster.]

Nomad

[Figure labels: gossip protocol, consensus protocol]

Regions are fully independent from each other, and do not share jobs, clients, or state. They are loosely coupled using a gossip protocol, which allows users to submit jobs to any region or query the state of any region transparently. Requests are forwarded to the appropriate server to be processed and the results returned.

https://www.nomadproject.io/docs/internals/architecture.html

• Optimistic vs Pessimistic
• Internal vs External State
• Single vs Multi Level
• Service vs Batch Oriented

Optimistic vs Pessimistic

Optimistic: do the work assuming it will succeed, and abort if it fails.
Pessimistic: assume others will use my resource, so lock it (mutex) first.
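The two strategies can be contrasted with a small sketch. This is an illustration only, not how any particular scheduler implements it; the `Node` class, its `version` counter, and the function names are hypothetical.

```python
import threading

class Node:
    """A hypothetical node with some free CPU and a version counter."""
    def __init__(self, free_cpu):
        self.free_cpu = free_cpu
        self.version = 0
        self.lock = threading.Lock()

def place_pessimistic(node, cpu):
    """Lock the node first, then check and claim capacity."""
    with node.lock:
        if node.free_cpu < cpu:
            return False
        node.free_cpu -= cpu
        node.version += 1
        return True

def plan_optimistic(node, cpu):
    """Plan against a snapshot without locking; None if it cannot fit."""
    if node.free_cpu < cpu:
        return None
    return {"cpu": cpu, "seen_version": node.version}

def commit_optimistic(node, plan):
    """Apply the plan only if the node is unchanged since the snapshot."""
    with node.lock:  # short critical section, only at commit time
        if plan["seen_version"] != node.version or node.free_cpu < plan["cpu"]:
            return False  # conflict detected: abort, the caller may re-plan
        node.free_cpu -= plan["cpu"]
        node.version += 1
        return True
```

In the optimistic path, two planners can both plan against the same snapshot; the first commit wins and the second is rejected and has to plan again.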

Internal vs External State

Management and coordination: should the user provide it (easier for the developer), or should the program acquire it (easier for the user)?

Google Omega: internal state allows good concurrency.

Single vs Multi Level

Single Level ~ fixed, Multi Level ~ pluggable

Sparrow: fixed function, optimized for high-throughput batch scheduling, no alternatives.
Mesos: multiple schedulers, a cluster manager that manages different kinds of work.

Service vs Batch Oriented

Service: long-lived jobs.
Batch: fast, short-lived jobs.

Microsoft’s Apollo is extremely optimistic with batch: it doesn’t consider conflicts, sends work to the client, and lets the client resolve any problems.

What is provided to Nomad:

Job - A Job is a specification provided by users that declares a workload for Nomad. A Job is composed of one or more task groups.

Task Group - A Task Group is a set of tasks that must be run together. The entire group must run on the same client node and cannot be split.

Task - Tasks are executed by drivers (e.g. Docker, Qemu, Java, exec binaries, etc.). Tasks specify their driver, configuration for the driver, constraints, and resources required.

SEE JOB SPEC EXAMPLE: https://www.nomadproject.io/docs/jobspec/index.html
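The Job, Task Group, and Task hierarchy can be pictured as nested data. The sketch below is a hypothetical model for illustration; the class and field names are assumptions, not Nomad's actual structs or job spec syntax, and the image names and resource numbers are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    name: str
    driver: str  # e.g. "docker", "qemu", "java", "exec"
    config: Dict[str, str] = field(default_factory=dict)
    constraints: List[str] = field(default_factory=list)
    resources: Dict[str, int] = field(default_factory=dict)

@dataclass
class TaskGroup:
    name: str
    tasks: List[Task]

    def total_resources(self):
        """Sum the tasks' resources, since the whole group lands on one client."""
        total = {}
        for task in self.tasks:
            for key, amount in task.resources.items():
                total[key] = total.get(key, 0) + amount
        return total

@dataclass
class Job:
    name: str
    task_groups: List[TaskGroup]

# Illustrative values only.
web = Task("web", "docker", {"image": "example/web"}, [], {"cpu": 500, "memory": 256})
cache = Task("cache", "docker", {"image": "example/cache"}, [], {"cpu": 250, "memory": 128})
job = Job("site", [TaskGroup("frontend", [web, cache])])
```

Because a task group cannot be split, a scheduler would compare `total_resources()` of the group against a single client's free resources rather than placing each task separately.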

What is provided to Nomad:

Client - A Client of Nomad is a machine that tasks can be run on. No information is provided about the clients other than networking info such as ports and IPs of servers. The running agent is in charge of obtaining the resources and constraints of the client’s system, via ‘fingerprinting’.

Server - Nomad servers are the brains of the cluster. 3 to 5 servers per region give a good balance between availability (in case of failure) and speed (more servers make consensus slower). The servers in each datacenter are part of a single consensus group.

SEE SERVER/CLIENT EXAMPLE: https://www.nomadproject.io/intro/getting-started/cluster.html

After servers and clients are registered, they are loaded into memory as structs. Scheduling is done in memory over those structs, and then the operation is relayed to the respective agent.

Only what we want to run and the available places to run it are provided; Nomad makes the decision. But how does it decide? Easy! (?)

Bin Packing

Fitting best into the overall resources is ideal, but sometimes other things matter, such as software constraints, availability, speed of response, etc., which may yield different results than a perfect best fit. Nomad aims to provide optimal bin packing, but tradeoffs can be made according to the desired speed of response.
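A minimal best-fit sketch, assuming each node is a dictionary with `total` and `free` resources. The scoring function here (average fraction of capacity left free) is a simple stand-in chosen for illustration; it is not Nomad's actual scoring formula.

```python
def fits(node, ask):
    """True if every requested resource fits in what the node has free."""
    return all(node["free"].get(k, 0) >= v for k, v in ask.items())

def leftover(node, ask):
    """Average fraction of capacity left free after placing `ask` (lower is tighter)."""
    fractions = [
        (node["free"][k] - ask.get(k, 0)) / node["total"][k]
        for k in node["total"]
    ]
    return sum(fractions) / len(fractions)

def best_fit(nodes, ask):
    """Pick the feasible node that ends up most tightly packed, or None."""
    feasible = [n for n in nodes if fits(n, ask)]
    if not feasible:
        return None
    return min(feasible, key=lambda n: leftover(n, ask))
```

Scoring every feasible node gives the tightest packing but costs time proportional to the cluster size, which is the tradeoff against speed of response mentioned above.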

System Scheduling

The system scheduler is used to register jobs that should be run on all clients that meet the job's constraints. The system scheduler is also invoked when clients join the cluster or transition into the ready state. This means that all registered system jobs will be re-evaluated and their tasks will be placed on the newly available nodes if the constraints are met.

This scheduler type is extremely useful for deploying and managing tasks that should be present on every node in the cluster. Since these tasks are being managed by Nomad, they can take advantage of job updating, rolling deploys, service discovery and more.

https://www.nomadproject.io/docs/jobspec/schedulers.html
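The re-evaluation behavior can be sketched as follows. The `SystemJobs` class, the attribute-matching constraints, and the method names are hypothetical, used only to show the idea of placing a job on every eligible node and re-checking when a node becomes ready.

```python
def meets(node, constraints):
    """True if the node's attributes match every required key/value pair."""
    return all(node["attrs"].get(k) == v for k, v in constraints.items())

class SystemJobs:
    """Hypothetical registry that keeps system jobs on every eligible node."""
    def __init__(self):
        self.jobs = {}           # job name -> constraints
        self.nodes = {}          # node name -> node
        self.placements = set()  # (job name, node name) pairs

    def _evaluate(self):
        # Re-check every registered job against every ready node.
        for job, constraints in self.jobs.items():
            for name, node in self.nodes.items():
                if node["ready"] and meets(node, constraints):
                    self.placements.add((job, name))

    def register_job(self, name, constraints):
        self.jobs[name] = constraints
        self._evaluate()

    def node_ready(self, name, attrs):
        """Called when a node joins or transitions into the ready state."""
        self.nodes[name] = {"attrs": attrs, "ready": True}
        self._evaluate()
```

For example, after registering a job constrained to `{"os": "linux"}`, any later call to `node_ready` with matching attributes adds a placement for that job on the new node.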

Batch Scheduling

Batch jobs are much less sensitive to short-term performance fluctuations and are short lived, finishing in a few minutes to a few days. Although the batch scheduler is very similar to the service scheduler, it makes certain optimizations for the batch workload. The main distinction is that after finding the set of nodes that meet the job's constraints it uses the power of two choices described in Berkeley's Sparrow scheduler to limit the number of nodes that are ranked.

https://www.nomadproject.io/docs/jobspec/schedulers.html
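The power of two choices can be sketched in a few lines: instead of ranking every feasible node, sample two at random and keep the better one. The function name, the `score` callback, and the `rng` parameter are assumptions for illustration; Nomad's implementation details may differ.

```python
import random

def power_of_two_choices(feasible, score, rng=random):
    """Sample up to two feasible nodes and keep the better-scoring one."""
    if not feasible:
        return None
    candidates = rng.sample(feasible, min(2, len(feasible)))
    return max(candidates, key=score)
```

Only two nodes are scored per placement regardless of cluster size, which trades some placement quality for much faster decisions, a reasonable bargain for short-lived batch work.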

Service Scheduling

The service scheduler is designed for scheduling long-lived services that should never go down. As such, the service scheduler ranks a large portion of the nodes that meet the job's constraints and selects the optimal node to place a task group on. The service scheduler uses a best-fit scoring algorithm influenced by Google's work on Borg. Ranking this larger set of candidate nodes increases scheduling time but provides greater guarantees about the optimality of a job placement, which given the service workload is highly desirable.

https://www.nomadproject.io/docs/jobspec/schedulers.html

What if we wanted our own ‘virtual’ cluster to evaluate the performance of some jobs we want to run, for example before acquiring a real cluster? Or to evaluate the efficiency of bin packing and gather metrics that may be useful for comparing with other schedulers?

My internship project at Hooklift, Inc.
Hopeful OSS contribution to Nomad ;-)

Simulator Hopeful OSS contribution to Nomad ;-)

Visualizer Hooklift’s own tool :-P

• Node consumption: CPU, Memory, Disk; each line represents a node
• Distribution of allocation times
• Sorted list of allocation counts
• Playback controls
• Changed node(s)
• Each iteration is a job evaluation and some relevant statistics

Press to see simulator input/output specifications :-)

Evaluating Bin Packing

The usual rules apply for allocation:
• Jobs must belong to the same region as the nodes, otherwise they won't allocate.
• Jobs must have a datacenter related to the one in the nodes, otherwise they won't allocate.
• Jobs' task drivers must be related to the ones in the nodes, otherwise they won't allocate.

For specifically evaluating the bin packing, it's better if all jobs and nodes belong to a single datacenter, in the same region, with the same task drivers, so all nodes are eligible for scheduling and the bin packing can be seen more clearly.
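The three rules amount to a feasibility filter applied before any bin packing. The sketch below assumes one reading of "related": the node's datacenter is listed in the job's datacenters, and every driver the job needs is available on the node. The dictionary keys are hypothetical.

```python
def eligible(job, node):
    """Same region, a shared datacenter, and all required drivers available."""
    return (
        job["region"] == node["region"]
        and node["datacenter"] in job["datacenters"]
        and set(job["drivers"]) <= set(node["drivers"])
    )

def eligible_nodes(job, nodes):
    """Nodes that pass all three rules and can go on to be ranked."""
    return [n for n in nodes if eligible(job, n)]
```

When every job and node shares one region, one datacenter, and the same drivers, `eligible_nodes` returns the whole cluster, so any difference in placement comes from bin packing alone.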

Simulator Goals Met

• It should simulate a cluster of machines or nodes in memory, presenting them to Nomad as if they were real machines
• It should allow configuring a pre-existing cluster state before a given simulation is run
• It should exercise Nomad’s scheduling algorithms on the simulated cluster
• It should show metrics about placements such as:
  ◦ Time taken for a given scheduling algorithm to make a placement decision, under different cluster load conditions
  ◦ TaskGroup placements per second
  ◦ Job placements per second
  ◦ Failed placements
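The placement metrics listed above could be aggregated from per-evaluation records along these lines. The record fields (`seconds`, `placed_groups`, `failed_groups`) and the rule that a job counts as placed only when none of its task groups failed are assumptions for illustration, not the simulator's actual output format.

```python
def summarize(evaluations):
    """Aggregate per-evaluation records into placement metrics."""
    total_time = sum(e["seconds"] for e in evaluations)
    placed_groups = sum(e["placed_groups"] for e in evaluations)
    failed = sum(e["failed_groups"] for e in evaluations)
    # Assumption: a job counts as placed only if none of its groups failed.
    placed_jobs = sum(1 for e in evaluations if e["failed_groups"] == 0)
    return {
        "task_groups_per_second": placed_groups / total_time if total_time else 0.0,
        "jobs_per_second": placed_jobs / total_time if total_time else 0.0,
        "failed_placements": failed,
    }
```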
