Scheduling Applications on Clusters with Nomad

A brief look at Nomad (https://www.nomadproject.io/), created by HashiCorp for the scalable deployment and scheduling of applications, part of their toolset that makes DevOps lives much easier.

Also a look at the basics of a 'Nomad Simulator' project, which simulates a cluster in memory, presenting the simulated nodes to Nomad as if they were real machines, runs Nomad jobs on them, and retrieves statistics about the scheduling process. Together with a separate visualizer tailored specifically to this simulator's output, it was part of a software engineering internship project at Hooklift, Inc.

Santiago Zubieta

March 01, 2016

Transcript

  1. Scheduling Applications
    on Clusters with
    Santiago Martín Zubieta Ortiz
    Software Engineering Intern at Hooklift, Inc.

    View Slide

  2. Thanks for the speaker space

    View Slide

  3. @mde_devops
    Thanks for the awesome community

    View Slide

  4. Thanks for the experience

    View Slide

  5. Scheduling

    View Slide

6. How to assign resources to perform some work, with
    a strategy that aims toward achieving a certain goal.
    Scheduling

    View Slide

7. Many think first of operating systems. Give resources...
    to long-lived tasks? to the most recent tasks? equally to
    everything? to fit tasks best into what's available?
    Operating Systems
    In this case, there's competition for the resources of a
    'single' entity, a computer, and the work takes
    turns sharing the usage of that entity.

    View Slide

  8. https://vissicompcodder.wordpress.com/category/best-fit/
    Operating Systems
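The best-fit strategy from the link above can be sketched in a few lines (an illustrative sketch with made-up block sizes, not code from any real allocator):

```python
# Best-fit allocation sketch: pick the smallest free block that still
# fits the request, minimizing leftover fragmentation. Block sizes and
# the request size below are made-up numbers for illustration.

def best_fit(free_blocks, request):
    """Return the index of the smallest block >= request, or None."""
    best = None
    for i, size in enumerate(free_blocks):
        if size >= request and (best is None or size < free_blocks[best]):
            best = i
    return best

free_blocks = [100, 500, 200, 300, 600]
idx = best_fit(free_blocks, 212)
# the 300-unit block is the tightest fit for a 212-unit request
```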

    View Slide

9. Now we have multiple networked computers. Competing
    for the resources of a single entity is no longer necessary,
    but scheduling work between the multiple entities still is.
    Clusters
    Run a task on the whole cluster? Assign work to entities that
    meet some constraints?

    View Slide

  10. Clusters
Don’t think about the internal placement of
    work when placing; just see an entity as its
    overall resources. If work is assigned
    to it, it will resolve on its own how to
    allocate that work in its operating system.
    (diagram: three clients in a cluster, each shown as its overall
    cpu / memory / storage; a piece of work with its own cpu / mem /
    storage ask must be placed on one of them)
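The idea in the diagram can be sketched as a check of a work item's resource ask against each client's overall resources (a hypothetical sketch; field names and numbers are invented, this is not Nomad's actual data model):

```python
# Hypothetical sketch: treat each client only as its overall free
# resources and find the clients where a piece of work fits. All field
# names and quantities are invented for illustration.

def fits(client, work):
    return (client["cpu"] >= work["cpu"]
            and client["memory"] >= work["memory"]
            and client["storage"] >= work["storage"])

clients = {
    "client1": {"cpu": 4, "memory": 8, "storage": 100},
    "client2": {"cpu": 2, "memory": 4, "storage": 50},
    "client3": {"cpu": 8, "memory": 16, "storage": 200},
}
work = {"cpu": 3, "memory": 6, "storage": 80}
eligible = [name for name, c in clients.items() if fits(c, work)]
# client1 and client3 can hold the work; client2 cannot
```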

    View Slide

  11. Clusters
(diagram: the same three clients shown before and after placement;
    the work's cpu / mem / storage ask is carved out of the chosen
    client's resources)

    View Slide

  12. Nomad

    View Slide

  13. (image slide)

  14. gossip
    protocol
    consensus
    protocol

    View Slide

  15. https://www.nomadproject.io/docs/internals/architecture.html
    Regions are fully independent from
    each other, and do not share jobs,
    clients, or state. They are loosely-
    coupled using a gossip protocol,
    which allows users to submit jobs
    to any region or query the state of
    any region transparently. Requests
    are forwarded to the appropriate
    server to be processed and the
    results returned.

    View Slide

  16. Optimistic vs Pessimistic
    Internal vs External State
    Single vs Multi Level
    Service vs Batch Oriented

    View Slide

17. Optimistic vs Pessimistic
    Optimistic: do work assuming it will succeed, abort if it
    fails. Pessimistic: assume others will use my resource,
    so lock (mutex) it first.

    View Slide

18. Management and coordination state: should the
    user provide it (easier for the developer), or should the
    program acquire it (easier for the user)?
    Internal vs External State
    Google Omega: internal allows good concurrency

    View Slide

  19. Sparrow: fixed function, optimized for high
    throughput batch scheduling, no alternatives.
Mesos: multiple schedulers, a cluster manager
    that manages different kinds of workloads.
    Single vs Multi Level
    Single Level ~ fixed, Multi Level ~ pluggable

    View Slide

20. Service vs Batch Oriented
    Long-lived jobs vs fast, short-lived jobs
    Microsoft's Apollo is extremely optimistic with batch work: it doesn't think about
    conflicts, it sends work to the client and lets it resolve any problems.

    View Slide

  21. What is provided to Nomad:
    Job - A Job is a specification provided by users that declares a
    workload for Nomad. A Job is composed of one or more task
    groups.
    SEE JOB SPEC EXAMPLE: https://www.nomadproject.io/docs/jobspec/index.html
    Task Group - A Task Group is a set of tasks that must be run
    together. The entire group must run on the same client node and
    cannot be split.
    Task - Tasks are executed by drivers. Tasks specify their driver,
    configuration for the driver, constraints, and resources required.
(e.g. Docker, Qemu, Java, exec binaries, etc.)
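A minimal job spec along these lines might look as follows (an illustrative sketch; see the linked job spec documentation for the authoritative syntax and fields):

```hcl
# One job -> one task group -> one task, run with the Docker driver.
job "example" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 1

    task "frontend" {
      driver = "docker"

      config {
        image = "nginx:latest"
      }

      resources {
        cpu    = 500  # MHz
        memory = 256  # MB
      }
    }
  }
}
```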

  22. What is provided to Nomad:
    Client - A Client of Nomad is a machine that tasks can be run on.
    No information is provided about the clients other than networking info
    such as ports and IPs of servers. The running agent is in charge of obtaining
    the resources and constraints of the client’s system, via ‘fingerprinting’.
    Server - Nomad servers are the brains of the cluster.
3 to 5 servers per region give a good balance between availability (in case of failure) and
    performance (more servers mean slower consensus). The servers in each datacenter are part of a
    single consensus group.
    SEE SERVER/CLIENT EXAMPLE: https://www.nomadproject.io/intro/getting-started/cluster.html
    After servers and clients are registered, they are loaded into memory as
structs, and scheduling is done in memory over those structs; the resulting
    operation is then relayed to the respective agent.

    View Slide

23. Only what we want to run is provided, along
    with the available places to run it;
    Nomad makes the decision. But how
    does it decide?
    Easy! (?)

    View Slide

  24. Bin Packing
Fitting best into the overall resources is ideal, but sometimes
    other things matter, such as software constraints,
    availability, speed of response, etc., which may yield
    different results than a perfect best fit. Nomad aims to
    provide optimal bin packing, but tradeoffs can be made
    according to the desired speed of response.
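The speed-versus-optimality tradeoff can be illustrated with the two classic heuristics (a toy sketch with made-up numbers, not Nomad's scoring code): first-fit stops at the first node with room, while best-fit scans every node for the tightest fit.

```python
# Toy comparison of two bin-packing heuristics over a single dimension
# (free memory per node; numbers are invented). First-fit is faster but
# can leave more fragmentation than best-fit.

def first_fit(nodes, ask):
    """Return the first node with enough room, or None."""
    for name, free in nodes.items():
        if free >= ask:
            return name
    return None

def best_fit(nodes, ask):
    """Return the node with the tightest fit, or None."""
    candidates = [(free, name) for name, free in nodes.items() if free >= ask]
    return min(candidates)[1] if candidates else None

nodes = {"n1": 512, "n2": 300, "n3": 280}
# for an ask of 256: first-fit picks n1 (first with room),
# best-fit picks n3 (tightest fit, least leftover)
```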

    View Slide

  25. System Scheduling
    The system scheduler is used to register jobs that should be run
    on all clients that meet the job's constraints. The system
    scheduler is also invoked when clients join the cluster or
    transition into the ready state. This means that all registered
    system jobs will be re-evaluated and their tasks will be placed
    on the newly available nodes if the constraints are met.
    This scheduler type is extremely useful for deploying and
    managing tasks that should be present on every node in the
    cluster. Since these tasks are being managed by Nomad, they can
    take advantage of job updating, rolling deploys, service
    discovery and more.
    https://www.nomadproject.io/docs/jobspec/schedulers.html
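A job opts into this scheduler through its type field (an illustrative fragment, not a complete spec):

```hcl
# Run this job's tasks on every eligible client in the cluster.
job "monitoring-agent" {
  datacenters = ["dc1"]
  type        = "system"
  # ... task groups as usual ...
}
```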

    View Slide

  26. Batch jobs are much less sensitive to short term performance fluctuations
    and are short lived, finishing in a few minutes to a few days. Although the
    batch scheduler is very similar to the service scheduler, it makes certain
    optimizations for the batch workload. The main distinction is that after
finding the set of nodes that meet the job's constraints it uses the power of
    two choices described in Berkeley's Sparrow scheduler to limit the number
    of nodes that are ranked.
    Batch Scheduling
    https://www.nomadproject.io/docs/jobspec/schedulers.html
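The 'power of two choices' technique can be sketched as follows: rather than ranking every feasible node, sample two at random and place on the less loaded one (a simplified sketch of the general technique, not Nomad's implementation; load numbers are made up):

```python
import random

# Power-of-two-choices sketch: instead of scoring all feasible nodes,
# sample two at random and place the work on the less loaded of the
# pair. This bounds ranking cost while keeping load well balanced.

def place(loads, rng):
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

loads = {"n1": 0.9, "n2": 0.2, "n3": 0.5, "n4": 0.7}
rng = random.Random(42)
chosen = place(loads, rng)
# chosen is whichever of the two sampled nodes carries less load
```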

    View Slide

  27. https://www.nomadproject.io/docs/jobspec/schedulers.html
    The service scheduler is designed for scheduling long lived services that
    should never go down. As such, the service scheduler ranks a large
portion of the nodes that meet the job's constraints and selects the optimal
    node to place a task group on. The service scheduler uses a best fit
    scoring algorithm influenced by Google's work on Borg. Ranking this larger
    set of candidate nodes increases scheduling time but provides greater
    guarantees about the optimality of a job placement, which given the service
    workload is highly desirable.
    Service Scheduling

    View Slide

28. What if we wanted our own ‘virtual’ cluster to
    evaluate the performance of some jobs we want to
    run, for example before acquiring a real cluster?
    Or to evaluate the efficiency of the bin packing and
    gather metrics that may be useful for comparing
    with other schedulers? That was my internship project
    at Hooklift, Inc.
    Hopeful OSS contribution to Nomad ;-)

    View Slide

  29. Simulator
    Hopeful OSS contribution to Nomad ;-)

    View Slide

  30. Simulator
    Hopeful OSS contribution to Nomad ;-)

    View Slide

  31. Visualizer
    Hooklift’s own tool :-P

    View Slide

  32. Node consumption
    Distribution of allocation times
    Sorted list of allocation counts
    Playback controls
    CPU
    Memory
    Disk
    Changed node(s)
    each line represents a node
    each iteration is a job evaluation
    and some relevant statistics
    Press to see simulator input/output specifications :-)

    View Slide

33. The usual rules apply for allocation:
    Jobs must belong to the same region as nodes, otherwise they won't allocate.
    Jobs must have a datacenter matching one of the nodes', otherwise they won't allocate.
    Jobs' task drivers must be available on the nodes, otherwise they won't allocate.
    For specifically evaluating the bin packing, it's better if all jobs and nodes belong to a
    single datacenter, in the same region, with the same task drivers, so all nodes are
    eligible for scheduling and the bin packing can be seen more clearly.
    Evaluating Bin Packing
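The three rules above can be sketched as a single eligibility check (a hypothetical sketch with invented field names, not the simulator's actual code):

```python
# Sketch of the eligibility rules: a node can host a job only if the
# regions match, the node's datacenter is one of the job's datacenters,
# and the node offers every task driver the job needs. Field names and
# values are invented for illustration.

def eligible(node, job):
    return (node["region"] == job["region"]
            and node["datacenter"] in job["datacenters"]
            and set(job["drivers"]) <= set(node["drivers"]))

node = {"region": "global", "datacenter": "dc1", "drivers": ["docker", "exec"]}
job = {"region": "global", "datacenters": ["dc1", "dc2"], "drivers": ["docker"]}
# eligible(node, job) holds here; change the region and it no longer does
```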

    View Slide

  34. Simulator Goals Met
• It should simulate a cluster of machines or nodes in memory, presenting them
    to Nomad as if they were real machines
    • It should allow configuring a pre-existing cluster state before a given simulation
    is run
    • It should exercise Nomad’s scheduling algorithms on the simulated cluster
    • It should show metrics about placements such as:
    ◦ Time taken for a given scheduling algorithm to make a placement decision,
    under different cluster load conditions.
    ◦ TaskGroup Placements per second
    ◦ Job Placements per second
    ◦ Failed placements

    View Slide