Scheduling Applications at Scale

Tools like Docker and rkt make it easier than ever to package and distribute applications. Unfortunately, not all organizations have the luxury of being able to package their applications in a container runtime.

Many organizations have virtualized workloads that cannot be easily containerized, such as applications that require full hardware isolation or virtual appliances. On the opposite end of the spectrum, some organizations deploy workloads that are already self-contained, such as static Go binaries or Java applications that rely only on the JVM. These applications gain little from containerization. To address the growing heterogeneity of workloads, HashiCorp created Nomad, a globally aware, distributed scheduler and cluster manager.

Nomad is designed to handle many types of workloads, on a variety of operating systems, at massive scale. Nomad empowers developers to specify jobs and tasks using a high-level specification in a plain-text file. Nomad accepts the job specification, parses the information, determines which compatible hosts have available resources, and then automatically manages the placement, healing, and scaling of the application. By placing multiple applications per host, Nomad maximizes resource utilization and dramatically reduces infrastructure costs.
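
The workflow described above starts from a plain-text job file. As a rough sketch (job, task, and image names here are illustrative; the deck shows a fuller example later), a minimal Nomad job specification might look like:

```hcl
# Minimal, illustrative Nomad job file sketch.
job "web" {
  # Nomad picks compatible hosts in these datacenters.
  datacenters = ["us-east-1"]

  task "web" {
    driver = "docker"

    config {
      image = "nginx:latest"
    }

    # Resource asks drive placement and bin packing.
    resources {
      cpu    = 500 # MHz
      memory = 256 # MB
    }
  }
}
```

Submitting the file (e.g. `nomad run web.nomad`) hands placement, healing, and scaling over to the scheduler.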

The flexibility of Nomad’s design brings the benefits of a scheduled application workflow to organizations with heterogeneous workloads and operating systems. This talk will discuss the pros and cons of running in a scheduled environment and includes a series of live demos to supplement the learning experience.

Seth Vargo

April 28, 2016

Transcript

  1. Scheduling Applications at Scale: Meeting Tomorrow's Application Needs, Today (background image: http://1stchoicesportsrehab.com/wp-content/uploads/2012/05/calendar.jpg)

  2. SETH VARGO @sethvargo

  3. None
  4. None
  5. Nomad Globally Distributed Optimistically Concurrent Scheduler

  6. Nomad Globally Distributed Optimistically Concurrent Scheduler

  7. sched·ul·er COMPUTING (n.) a program that arranges jobs or a computer's operations into an appropriate sequence.

  8. NORMAL sched·ul·er (n.) a person or machine that organizes or maintains schedules.

  9. Schedulers map a set of work to a set of resources
  10. https://cdn.shopify.com/s/files/1/0167/3936/files/04-MS-Excel-4-0-Office-4-0_large.gif

  11. (diagram: an operator and a datacenter)

  12. (diagram: the operator manages four hosts: Skywalker, Vader, Leia, and Solo)

  13. (diagram: the operator places PYTHON and GOLANG applications across Skywalker, Vader, Leia, and Solo)

  14. (diagram: more applications arrive: RUBY, three PYTHON, four GOLANG, and NODE spread across the four hosts)

  15. (diagram: the operator now hand-maintains a table of hosts, applications, IP addresses, and MAC addresses; meanwhile, chaos randomly kills applications)

  16. (diagram: Vader fails, and its applications are marked failed)

  17. (diagram: the three PYTHON applications that ran on Vader must be re-placed)

  18. (diagram: the failed PYTHON applications are gone from the placement grid)

  19. (diagram: Vader is rebuilt on 04/20/2016 with a new IP address, and the table must be updated by hand)

  20. (same slide, repeated)
  21. This does not scale

  22. CPU Scheduler (diagram: the kernel's CPU scheduler maps APACHE, REDIS, and BASH onto four cores)

  23. CPU Scheduler (diagram: the same workloads scheduled onto two cores)

  24. Schedulers in the Wild
      Type              | Work             | Resources
      CPU Scheduler     | Threads          | Physical Cores
      EC2 / Nova        | Virtual Machines | Hypervisors
      Hadoop YARN       | MapReduce Jobs   | Client Nodes
      Cluster Scheduler | Applications     | Machines

  25. Scheduler Advantages: Higher Resource Utilization, Decouple Work from Resources, Better Quality of Service

  26. Scheduler Advantages: Bin Packing, Over-Subscription, Job Queueing (Higher Resource Utilization)

  27. Scheduler Advantages: Abstraction, API Contracts, Standardization (Decouple Work from Resources)

  28. Scheduler Advantages: Priorities, Resource Isolation, Pre-emption (Better Quality of Service)
  29. Not a New Concept

  30. Not Alone

  31. Nomad

  32. Nomad Cluster Scheduler Deployments Job Specification

  33. example.nomad:

      job "redis" {
        datacenters = ["us-east-1"]
        task "redis" {
          driver = "docker"
          config {
            image = "redis:latest"
          }
          resources {
            cpu    = 500 # MHz
            memory = 256 # MB
            network {
              mbits = 10
              dynamic_ports = ["redis"]
            }
          }
        }
      }
  34. Job specification declares what to run

  35. Nomad determines how and where to run

  36. Nomad abstracts work from resources

  37. Nomad: Higher Resource Utilization, Decouple Work from Resources, Better Quality of Service
  38. Designing Nomad

  39. Nomad: Multi-Datacenter, Multi-Region, Flexible Workloads, Job Priorities, Bin Packing, Large Scale, Operationally Simple

  40. Scaling Requirements: Thousands of regions, Tens of thousands of clients per region, Thousands of jobs per region
  41. Built on Experience GOSSIP CONSENSUS

  42. Serf: Cluster Management, Gossip Based (P2P), Membership, Failure Detection, Event System

  43. Serf: Gossip Protocol, Large Scale, Production Hardened, Operationally Simple

  44. Consul: Service Discovery, Configuration, Coordination (Locking), Central Servers + Distributed Clients

  45. Consul: Multi-Datacenter, Raft Consensus, Large Scale, Production Hardened

  46. Built on Experience GOSSIP CONSENSUS Mature Libraries Proven Design Patterns

  47. Built on Experience GOSSIP CONSENSUS Mature Libraries Proven Design Patterns Lacking Scheduling Logic
  48. Built on Research GOSSIP CONSENSUS

  49. None
  50. Optimistic vs Pessimistic · Internal vs External State · Single vs Multi-Level · Fixed vs Pluggable · Service vs Batch Oriented

  51. Nomad, Inspired by Google Omega: Optimistic Concurrency, State Coordination, Service & Batch Workloads, Pluggable Architecture

  52. Consul Architecture (diagram: clients reach a set of servers over RPC and LAN gossip; servers replicate amongst themselves and connect across datacenters via WAN gossip)

  53. Consul: Multi-Datacenter, Servers per DC, Failure Isolation Domain is the Datacenter

  54. Single-Region Architecture (diagram: one leader and two follower servers with replication and forwarding; clients in DC1, DC2, and DC3 connect over RPC)

  55. Multi-Region Architecture (diagram: each region runs its own leader/follower server cluster with internal replication; regions are connected by gossip and request forwarding)

  56. Nomad: Region is Isolation Domain, 1-N Datacenters Per Region, Flexibility to do 1:1 (Consul), Scheduling Boundary
  57. Data Model ALLOCATION JOB EVALUATION NODE

  58. Evaluation ~= State Change

  59. Evaluations: Create / Update / Delete Job · Node Up / Node Down · Allocation Failed
  60. Evaluations SCHEDULER func(Evaluation) => []AllocationUpdates

  61. Evaluations SCHEDULER func(Evaluation) => []AllocationUpdates Service, Batch, System
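
The contract on the two slides above — a scheduler is a function from an Evaluation (a state change) to a set of allocation updates — can be sketched in Go. All type and field names here are hypothetical simplifications, not Nomad's real API:

```go
package main

import "fmt"

// Evaluation is a hypothetical, simplified state change that triggers
// scheduling work (e.g. a job registration or a node going down).
type Evaluation struct {
	Trigger string // e.g. "job-register", "node-down", "alloc-failed"
	JobID   string
}

// AllocationUpdate is a simplified placement decision.
type AllocationUpdate struct {
	JobID  string
	NodeID string // node chosen for placement
}

// Scheduler mirrors the slide's func(Evaluation) => []AllocationUpdates.
// Service, batch, and system schedulers would each be one such function.
type Scheduler func(Evaluation) []AllocationUpdate

// serviceScheduler is a toy implementation that "places" one allocation
// on a fixed node; real logic would consult cluster state and constraints.
func serviceScheduler(eval Evaluation) []AllocationUpdate {
	return []AllocationUpdate{{JobID: eval.JobID, NodeID: "node-1"}}
}

func main() {
	var sched Scheduler = serviceScheduler
	updates := sched(Evaluation{Trigger: "job-register", JobID: "redis"})
	fmt.Println(len(updates), updates[0].NodeID)
}
```

Modeling the scheduler as a pure function of an evaluation is what makes the logic pluggable: service, batch, and system schedulers are just different implementations of the same signature.
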

  62. Server Architecture: Omega Class Scheduler, Pluggable Logic, Internal Coordination and State, Multi-Region / Multi-Datacenter

  63. Client Architecture: Broad OS Support, Host Fingerprinting, Pluggable Drivers

  64. Fingerprinting
      Type                | Examples
      Operating System    | Kernel, OS, Version
      Hardware            | CPU, Memory, Disk
      Apps (Capabilities) | Docker, Java, Consul
      Environment         | AWS, GCE
  65. Constrain Placement and Bin Pack

  66. “Task Requires Linux, Docker, and PCI-Compliant Hardware” expressed as constraints in job file

  67. “Task needs 512MB RAM and 1 Core” expressed as resources in job file
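
A hedged sketch of how the two statements above might be written in a job file; the attribute name and CPU figure are illustrative, not authoritative:

```hcl
task "api" {
  driver = "docker"

  # "Requires Linux" as a constraint against a fingerprinted node
  # attribute; Docker support is implied by the chosen driver.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  # "512MB RAM and 1 Core" as resources; CPU is asked for in MHz.
  resources {
    cpu    = 2000 # MHz, roughly one core (illustrative)
    memory = 512  # MB
  }
}
```

A requirement like PCI-compliant hardware would typically be expressed the same way, as a constraint against a custom node attribute the operator assigns.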
  68. Execute Tasks Provide Resource Isolation Drivers

  69. Containerized: Docker, rkt · Virtualized: Qemu / KVM · Standalone: Java Jar, Static Binaries

  70. Containerized: Docker, rkt, Windows Server Containers · Virtualized: Qemu / KVM, Hyper-V, Xen · Standalone: Java Jar, C#, Static Binaries
  71. Nomad Schedulers Fingerprints Drivers Job Specification

  72. Nomad Single Binary No Dependencies Highly Available

  73. Nomad Million Container Challenge: 1,000 Jobs, 1,000 Tasks per Job, 5,000 Hosts on GCE, 1,000,000 Containers
  74. None
  75. “No one would ever need to schedule a million containers.” – Negative Nancy
  76. None
  77. None
  78. None
  79. Nomad Globally Distributed Optimistically Concurrent Scheduler

  80. Nomad: Higher Resource Utilization, Decouple Work from Resources, Better Quality of Service
  81. SETH VARGO @sethvargo QUESTIONS?