Slide 1

Slide 1 text

Scheduling Applications at Scale
Meeting Tomorrow's Application Needs, Today

Slide 2

Slide 2 text

SETH VARGO @sethvargo

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Nomad Globally Distributed Optimistically Concurrent Scheduler

Slide 6

Slide 6 text

Nomad Globally Distributed Optimistically Concurrent Scheduler

Slide 7

Slide 7 text

sched·ul·er COMPUTING (n.) a program that arranges jobs or a computer's operations into an appropriate sequence.

Slide 8

Slide 8 text

NORMAL sched·ul·er (n.) a person or machine that organizes or maintains schedules.

Slide 9

Slide 9 text

Schedulers map a set of work to a set of resources

Slide 10

Slide 10 text

https://cdn.shopify.com/s/files/1/0167/3936/files/04-MS-Excel-4-0-Office-4-0_large.gif

Slide 11

Slide 11 text

Diagram: Operator, Datacenter

Slide 12

Slide 12 text

Diagram: Operator, Datacenter
Servers: Skywalker, Vader, Leia, Solo

Slide 13

Slide 13 text

Diagram: Operator, Datacenter
Applications: PYTHON, PYTHON, GOLANG, GOLANG, GOLANG
Servers: Skywalker, Vader, Leia, Solo

Slide 14

Slide 14 text

Diagram: Operator, Datacenter
Applications: RUBY, PYTHON, PYTHON, PYTHON, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Vader, Leia, Solo

Slide 15

Slide 15 text

Diagram: Operator, Datacenter
Applications: RUBY, PYTHON, PYTHON, PYTHON, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Vader, Leia, Solo
Labels: RUBY, VADER, LEIA, SOLO
IP addresses: 192.168.1.4, 192.168.1.5, 192.168.1.7, 192.168.1.253
MAC addresses: 88:45:13:B6:87:C4, 94:CE:4F:C8:54:C3, CA:9A:3D:7F:8B:CB, 72:30:9C:0D:1E:74
Note: Randomly kills applications

Slide 16

Slide 16 text

Diagram: Operator, Datacenter
Applications: RUBY, PYTHON, PYTHON, PYTHON, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Leia, Solo
Labels: RUBY, VADER, LEIA, SOLO
IP addresses: 192.168.1.4, 192.168.1.5, 192.168.1.7, 192.168.1.253
MAC addresses: 88:45:13:B6:87:C4, 94:CE:4F:C8:54:C3, CA:9A:3D:7F:8B:CB, 72:30:9C:0D:1E:74
Note: Randomly kills applications
Also shown: F, F, Vader

Slide 17

Slide 17 text

Diagram: Operator, Datacenter
Applications: RUBY, PYTHON, PYTHON, PYTHON, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Leia, Solo
Labels: RUBY, VADER, LEIA, SOLO
IP addresses: 192.168.1.4, 192.168.1.5, 192.168.1.7, 192.168.1.253
MAC addresses: 88:45:13:B6:87:C4, 94:CE:4F:C8:54:C3, CA:9A:3D:7F:8B:CB, 72:30:9C:0D:1E:74
Note: Randomly kills applications
Also shown: F, F, Vader, PYTHON, PYTHON, PYTHON

Slide 18

Slide 18 text

Diagram: Operator, Datacenter
Applications: RUBY, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Leia, Solo
Labels: RUBY, VADER, LEIA, SOLO
IP addresses: 192.168.1.4, 192.168.1.5, 192.168.1.7, 192.168.1.253
MAC addresses: 88:45:13:B6:87:C4, 94:CE:4F:C8:54:C3, CA:9A:3D:7F:8B:CB, 72:30:9C:0D:1E:74
Note: Randomly kills applications
Also shown: Vader, PYTHON, PYTHON, PYTHON

Slide 19

Slide 19 text

Diagram: Operator, Datacenter
Applications: RUBY, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Leia, Solo
Labels: RUBY, VADER, LEIA, SOLO
IP addresses: 192.168.1.4, 192.168.1.9, 192.168.1.7, 192.168.1.253
MAC addresses: 88:45:13:B6:87:C4, 94:CE:4F:C8:54:C3, CA:9A:3D:7F:8B:CB, 72:30:9C:0D:1E:74
Note: Rebuilt on 04/20/2016
Also shown: Vader, PYTHON, PYTHON, PYTHON

Slide 20

Slide 20 text

Diagram: Operator, Datacenter
Applications: RUBY, GOLANG, GOLANG, GOLANG, GOLANG, NODE
Servers: Skywalker, Leia, Solo
Labels: RUBY, VADER, LEIA, SOLO
IP addresses: 192.168.1.4, 192.168.1.9, 192.168.1.7, 192.168.1.253
MAC addresses: 88:45:13:B6:87:C4, 94:CE:4F:C8:54:C3, CA:9A:3D:7F:8B:CB, 72:30:9C:0D:1E:74
Note: Rebuilt on 04/20/2016
Also shown: Vader, PYTHON, PYTHON, PYTHON

Slide 21

Slide 21 text

This does not scale

Slide 22

Slide 22 text

CPU Scheduler
Diagram labels: CORE, CORE, CORE, CORE; CPU SCHEDULER; KERNEL; APACHE, REDIS, BASH

Slide 23

Slide 23 text

CPU Scheduler
Diagram labels: CORE, CORE; CPU SCHEDULER; KERNEL; APACHE, REDIS, BASH

Slide 24

Slide 24 text

Schedulers in the Wild

Type               Work              Resources
CPU Scheduler      Threads           Physical Cores
EC2 / Nova         Virtual Machines  Hypervisors
Hadoop YARN        MapReduce Jobs    Client Nodes
Cluster Scheduler  Applications      Machines

Slide 25

Slide 25 text

Scheduler Advantages
Higher Resource Utilization
Decouple Work from Resources
Better Quality of Service

Slide 26

Slide 26 text

Scheduler Advantages
Higher Resource Utilization: Bin Packing, Over-Subscription, Job Queueing
Decouple Work from Resources
Better Quality of Service
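
To make the bin-packing point concrete, here is a minimal sketch of one classic heuristic, first-fit decreasing, over a single resource dimension. It is an illustration only and not Nomad's placement algorithm; the function name, the task sizes, and the 1024 MB node capacity are all assumptions chosen for the example.

```python
def first_fit_decreasing(task_sizes, node_capacity):
    """Pack tasks onto equal-capacity nodes, largest task first.

    Returns a list of nodes, each a list of the task sizes placed on it.
    """
    nodes = []
    for size in sorted(task_sizes, reverse=True):
        if size > node_capacity:
            raise ValueError(f"task of size {size} cannot fit on any node")
        for node in nodes:
            # Place the task on the first node that still has room for it.
            if sum(node) + size <= node_capacity:
                node.append(size)
                break
        else:
            # No existing node had room, so start a new one.
            nodes.append([size])
    return nodes


# Hypothetical memory requests in MB, packed onto 1024 MB nodes.
packed = first_fit_decreasing([512, 256, 256, 768, 128, 128], node_capacity=1024)
print(packed)
```

In this example six tasks fit onto two fully used nodes, which is the utilization gain the slide refers to.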

Slide 27

Slide 27 text

Scheduler Advantages
Higher Resource Utilization
Decouple Work from Resources: Abstraction, API Contracts, Standardization
Better Quality of Service

Slide 28

Slide 28 text

Scheduler Advantages
Higher Resource Utilization
Decouple Work from Resources
Better Quality of Service: Priorities, Resource Isolation, Pre-emption
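
A small sketch of how priorities and pre-emption can interact, assuming a node with a fixed number of job slots. The `admit` function, the `(name, priority)` tuples, and the slot model are hypothetical and only illustrate the idea; they do not describe how any particular scheduler implements pre-emption.

```python
def admit(running, slots, job):
    """Try to admit `job` onto a node that has `slots` job slots.

    `running` is a list of (name, priority) tuples; higher priority wins.
    Returns (new_running, evicted), where evicted is the pre-empted job or None.
    """
    if len(running) < slots:
        # Free slot available: no pre-emption needed.
        return running + [job], None
    lowest = min(running, key=lambda j: j[1])
    if job[1] > lowest[1]:
        # Node is full: evict the lowest-priority job to make room.
        kept = [j for j in running if j is not lowest]
        return kept + [job], lowest
    # Node is full and the new job does not outrank anything: leave it queued.
    return running, None


running = [("batch", 20), ("web", 70)]
new_running, evicted = admit(running, slots=2, job=("db", 90))
print(new_running, evicted)
```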

Slide 29

Slide 29 text

Not a New Concept

Slide 30

Slide 30 text

Not Alone

Slide 31

Slide 31 text

Nomad

Slide 32

Slide 32 text

Nomad
Cluster Scheduler
Deployments
Job Specification

Slide 33

Slide 33 text

example.nomad

job "redis" {
  datacenters = ["us-east-1"]

  task "redis" {
    driver = "docker"

    config {
      image = "redis:latest"
    }

    resources {
      cpu    = 500 # MHz
      memory = 256 # MB

      network {
        mbits         = 10
        dynamic_ports = ["redis"]
      }
    }
  }
}

Slide 34

Slide 34 text

Job specification declares what to run

Slide 35

Slide 35 text

Nomad determines how and where to run

Slide 36

Slide 36 text

Nomad abstracts work from resources

Slide 37

Slide 37 text

Nomad
Higher Resource Utilization
Decouple Work from Resources
Better Quality of Service

Slide 38

Slide 38 text

Designing Nomad

Slide 39

Slide 39 text

Nomad
Multi-Datacenter
Multi-Region
Flexible Workloads
Job Priorities
Bin Packing
Large Scale
Operationally Simple

Slide 40

Slide 40 text

Scaling Requirements
Thousands of regions
Tens of thousands of clients per region
Thousands of jobs per region

Slide 41

Slide 41 text

Built on Experience
GOSSIP
CONSENSUS

Slide 42

Slide 42 text

Serf
Cluster Management
Gossip Based (P2P)
Membership
Failure Detection
Event System

Slide 43

Slide 43 text

Serf
Gossip Protocol
Large Scale
Production Hardened
Operationally Simple
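
The reason gossip suits large clusters is that a rumor spreads in roughly logarithmic time. The following is a generic push-style gossip simulation, offered as a rough sketch; it is not Serf's actual protocol, and the function name, fan-out, and seed are assumptions for illustration.

```python
import random


def gossip_rounds(node_count, fanout=1, seed=0):
    """Count rounds until a rumor started at node 0 reaches every node.

    Each round, every informed node pushes the rumor to `fanout` peers
    chosen uniformly at random (possibly already-informed ones).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    informed = {0}
    rounds = 0
    while len(informed) < node_count:
        newly_informed = set()
        for _ in informed:
            for _ in range(fanout):
                newly_informed.add(rng.randrange(node_count))
        informed |= newly_informed
        rounds += 1
    return rounds


rounds_small = gossip_rounds(64)
rounds_large = gossip_rounds(4096)
print(rounds_small, rounds_large)
```

With a fan-out of 1 the informed set can at most double per round, so 64 nodes need at least 6 rounds and 4096 nodes at least 12; the simulated counts are somewhat higher because some pushes hit nodes that already know the rumor.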

Slide 44

Slide 44 text

Consul
Service Discovery
Configuration
Coordination (Locking)
Central Servers + Distributed Clients

Slide 45

Slide 45 text

Consul
Multi-Datacenter
Raft Consensus
Large Scale
Production Hardened
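
Raft commits a change only once a majority of servers has accepted it, which is why server counts are usually odd. The helper names below are hypothetical; the arithmetic is a short sketch of the majority rule.

```python
def quorum(servers):
    """Smallest number of servers that forms a majority."""
    return servers // 2 + 1


def failures_tolerated(servers):
    """Servers that can be lost while a majority remains reachable."""
    return servers - quorum(servers)


for n in (3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
```

Note that going from 3 to 4 servers raises the quorum size without tolerating any additional failure.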

Slide 46

Slide 46 text

Built on Experience
GOSSIP
CONSENSUS
Mature Libraries
Proven Design Patterns

Slide 47

Slide 47 text

Built on Experience
GOSSIP
CONSENSUS
Mature Libraries
Proven Design Patterns
Lacking Scheduling Logic

Slide 48

Slide 48 text

Built on Research
GOSSIP
CONSENSUS

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Optimistic vs Pessimistic
Internal vs External State
Single vs Multi Level
Fixed vs Pluggable
Service vs Batch Oriented

Slide 51

Slide 51 text

Nomad
Inspired by Google Omega
Optimistic Concurrency
State Coordination
Service & Batch workloads
Pluggable Architecture
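
Optimistic concurrency means schedulers plan in parallel against a snapshot of cluster state and conflicts are detected only when a plan is committed. The sketch below illustrates that idea under simplifying assumptions; `ClusterState`, `make_plan`, the node names, and the memory figures are all hypothetical and do not reflect Nomad's internal types.

```python
class ClusterState:
    """Hypothetical shared state: free memory (MB) per node."""

    def __init__(self, free_mb):
        self.free_mb = dict(free_mb)

    def snapshot(self):
        # Schedulers plan against a copy, so no lock is held while planning.
        return dict(self.free_mb)

    def apply(self, plan):
        # Commit point: re-check the plan against current state, not the snapshot.
        node, mb = plan
        if self.free_mb[node] < mb:
            return False  # conflict: another plan consumed the capacity first
        self.free_mb[node] -= mb
        return True


def make_plan(snapshot, mb):
    # Pick the node with the most free memory in the (possibly stale) snapshot.
    node = max(snapshot, key=snapshot.get)
    return (node, mb) if snapshot[node] >= mb else None


state = ClusterState({"node-a": 1024, "node-b": 1000})

# Two schedulers plan concurrently from the same snapshot and pick the same node.
plan_one = make_plan(state.snapshot(), 768)
plan_two = make_plan(state.snapshot(), 768)

first_ok = state.apply(plan_one)   # commits
second_ok = state.apply(plan_two)  # rejected: node-a no longer has room

# The rejected scheduler retries against fresh state and lands elsewhere.
retry_plan = make_plan(state.snapshot(), 768)
retry_ok = state.apply(retry_plan)
print(first_ok, second_ok, retry_plan, retry_ok)
```

The trade-off is that planning never blocks, at the cost of occasionally redoing work when two plans collide.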

Slide 52

Slide 52 text

Consul Architecture
Diagram labels: CLIENT (x6); SERVER (x3) with REPLICATION; RPC; LAN GOSSIP; SERVER (x3) with REPLICATION; WAN GOSSIP

Slide 53

Slide 53 text

Consul
Multi-Datacenter
Servers per DC
Failure Isolation Domain is the Datacenter

Slide 54

Slide 54 text

Single-Region Architecture
Diagram labels: SERVER (x3: FOLLOWER, LEADER, FOLLOWER); REPLICATION; FORWARDING; CLIENT (x3); DC1, DC2, DC3; RPC

Slide 55

Slide 55 text

Multi-Region Architecture
Diagram labels: REGION A and REGION B, each with SERVER (x3: FOLLOWER, LEADER, FOLLOWER), REPLICATION, FORWARDING; GOSSIP; REGION FORWARDING

Slide 56

Slide 56 text

Nomad
Region is Isolation Domain
1-N Datacenters Per Region
Flexibility to do 1:1 (Consul)
Scheduling Boundary

Slide 57

Slide 57 text

Data Model
Diagram labels: ALLOCATION, JOB, EVALUATION, NODE

Slide 58

Slide 58 text

Evaluation ~= State Change

Slide 59

Slide 59 text

Evaluations
Create / Update / Delete Job
Node Up / Node Down
Allocation Failed

Slide 60

Slide 60 text

Evaluations
SCHEDULER
func(Evaluation) => []AllocationUpdates

Slide 61

Slide 61 text

Evaluations
SCHEDULER
func(Evaluation) => []AllocationUpdates
Service, Batch, System
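
The signature `func(Evaluation) => []AllocationUpdates` describes a scheduler as a function from a state change to a list of allocation changes. A minimal sketch of that shape follows, assuming a trivial "reconcile desired count against running count" policy. The field names, the `trigger` values, and the `schedule` function are hypothetical, and the running count is passed in explicitly here rather than read from cluster state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    job: str
    desired_count: int
    trigger: str  # hypothetical labels, e.g. "job-update" or "node-down"


@dataclass(frozen=True)
class AllocationUpdate:
    job: str
    action: str  # "place" or "stop"


def schedule(evaluation, running_count):
    """Return the allocation updates that move the job to its desired count."""
    diff = evaluation.desired_count - running_count
    action = "place" if diff > 0 else "stop"
    return [AllocationUpdate(evaluation.job, action) for _ in range(abs(diff))]


updates = schedule(Evaluation("redis", desired_count=3, trigger="job-update"), running_count=1)
print(updates)
```

Keeping the scheduler a function of this shape is one way different policies (service, batch, system) could be swapped in behind the same interface.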

Slide 62

Slide 62 text

Server Architecture
Omega Class Scheduler
Pluggable Logic
Internal Coordination and State
Multi-Region / Multi-Datacenter

Slide 63

Slide 63 text

Client Architecture
Broad OS Support
Host Fingerprinting
Pluggable Drivers

Slide 64

Slide 64 text

Fingerprinting

Type                 Examples
Operating System     Kernel, OS, Version
Hardware             CPU, Memory, Disk
Apps (Capabilities)  Docker, Java, Consul
Environment          AWS, GCE

Slide 65

Slide 65 text

Constrain Placement and Bin Pack

Slide 66

Slide 66 text

“Task Requires Linux, Docker, and PCI-Compliant Hardware” expressed as constraints in job file

Slide 67

Slide 67 text

“Task needs 512MB RAM and 1 Core” expressed as resource in job file
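
The two statements above combine into a feasibility check: a node qualifies only if its fingerprinted attributes satisfy every constraint and its free resources cover the request. The sketch below illustrates that filter; the attribute keys (`kernel`, `docker`, `pci_compliant`), the resource keys, and the node data are invented for the example and are not Nomad's attribute names.

```python
def feasible_nodes(nodes, constraints, resources):
    """Return names of nodes that meet all constraints and have enough free resources."""
    matches = []
    for node in nodes:
        meets_constraints = all(
            node["attrs"].get(key) == value for key, value in constraints.items()
        )
        has_room = all(
            node["free"].get(key, 0) >= amount for key, amount in resources.items()
        )
        if meets_constraints and has_room:
            matches.append(node["name"])
    return matches


# Hypothetical fingerprint results for four hosts.
nodes = [
    {"name": "skywalker", "attrs": {"kernel": "linux", "docker": True, "pci_compliant": True},
     "free": {"memory_mb": 256, "cores": 1}},
    {"name": "vader", "attrs": {"kernel": "windows", "docker": False, "pci_compliant": True},
     "free": {"memory_mb": 4096, "cores": 4}},
    {"name": "leia", "attrs": {"kernel": "linux", "docker": True, "pci_compliant": True},
     "free": {"memory_mb": 2048, "cores": 2}},
    {"name": "solo", "attrs": {"kernel": "linux", "docker": True, "pci_compliant": False},
     "free": {"memory_mb": 4096, "cores": 4}},
]

constraints = {"kernel": "linux", "docker": True, "pci_compliant": True}
resources = {"memory_mb": 512, "cores": 1}

candidates = feasible_nodes(nodes, constraints, resources)
print(candidates)
```

Here only one host passes both checks: one is too small, one runs the wrong operating system, and one is not PCI-compliant.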

Slide 68

Slide 68 text

Drivers
Execute Tasks
Provide Resource Isolation

Slide 69

Slide 69 text

Containerized  Virtualized  Standalone
Docker         Qemu / KVM   Java Jar
rkt                         Static Binaries

Slide 70

Slide 70 text

Containerized              Virtualized  Standalone
Docker                     Qemu / KVM   Java Jar
rkt                        Hyper-V      C#
Windows Server Containers  Xen          Static Binaries

Slide 71

Slide 71 text

Nomad
Schedulers
Fingerprints
Drivers
Job Specification

Slide 72

Slide 72 text

Nomad
Single Binary
No Dependencies
Highly Available

Slide 73

Slide 73 text

Nomad
Million Container Challenge
1,000 Jobs
1,000 Tasks per Job
5,000 Hosts on GCE
1,000,000 Containers
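
The figures on this slide can be checked with a few lines of arithmetic. The per-host average is derived here and is not stated on the slide; it assumes containers are spread evenly across hosts.

```python
jobs = 1_000
tasks_per_job = 1_000
hosts = 5_000

containers = jobs * tasks_per_job        # one container per task
containers_per_host = containers // hosts  # average, assuming an even spread

print(containers, containers_per_host)
```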

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

“No one would ever need to schedule a million containers.” – Negative Nancy

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

Nomad Globally Distributed Optimistically Concurrent Scheduler

Slide 80

Slide 80 text

Nomad
Higher Resource Utilization
Decouple Work from Resources
Better Quality of Service

Slide 81

Slide 81 text

SETH VARGO @sethvargo QUESTIONS?