Distributed Scheduling with Apache Mesos on the Cloud
UberConf 2015, Denver
Diptanu Gon Choudhury
@diptanu
Slide 3
Slide 3 text
This Talk
• Challenges of traditional Data Centre environments
• Taming the complexities of running services at scale
• Cloud Native Cluster Management with Titan
Slide 4
Slide 4 text
Evolution of Data Centers
Mid-1990s, Early 2000s
Slide 5
Slide 5 text
The Modern Data Centre
[Diagram: a grid of VMs running on top of SDN and Network Storage]
Slide 6
Slide 6 text
Data Centre of 2015
[Diagram: a grid of "C" units running on top of SDN and Network Storage, alongside Cloud Storage and Cloud Persistence]
Slide 7
Slide 7 text
Internet Scale Complexities
London Virginia Tokyo
Distributed Across Geographies to lower latencies
Highly Available within a region by distribution across
buildings
Slide 8
Slide 8 text
Evolution of Applications
Application
Server
Database
Software Load Balancers
API Servers
Mid Tier Services
Caches
Caches
Distributed K/V Stores
Slide 9
Slide 9 text
The Multi Core World
Scale by running on more cores
Scale by adding more servers
Scale by running on commodity hardware
Slide 10
Slide 10 text
The Modern Internet Scale
Application
Is a Data Center Application
Slide 11
Slide 11 text
Data Centre Applications
are essentially Distributed
Systems
Slide 13
Slide 13 text
Operational Aspects of
Distributed Systems
• Provisioning Compute Resources
• Configuration of services
• Distribution of services
• Supervision and Fault Tolerance
• Service Discovery
Slide 14
Slide 14 text
Provisioning Resources
Usually, Ops folks assign servers to specific teams or
applications
[Diagram: VM1 to VM6 divided among Service, Database and Batch Process]
Statically partitions the data centre
Slide 15
Slide 15 text
Challenges of manual
provisioning
Node failures: services from failed nodes have to be restored on
specific servers
Distributing services across fault domains is harder
at scale
Slide 16
Slide 16 text
Different Fault Domains
• Node
• Memory, Disk, CPU, Network card, etc
• Rack
• PDU
• Switch
• Data Centre
• Power
• Cooling
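The fault domains above can be treated as labels on each node that a placement routine tries to spread across. The following is a minimal sketch under stated assumptions, not something shown in the talk: the `Node` class, the `spread_replicas` function and the rack and data centre names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str          # rack-level fault domain (shared PDU and switch)
    data_centre: str   # data-centre-level fault domain (shared power and cooling)


def spread_replicas(nodes, replicas, domain="rack"):
    """Pick `replicas` nodes, always drawing from the least-used fault domain."""
    used = Counter()
    chosen = []
    candidates = list(nodes)
    for _ in range(replicas):
        if not candidates:
            raise ValueError("not enough nodes for the requested replicas")
        # Prefer the node whose fault domain hosts the fewest replicas so far;
        # the node name is only a deterministic tie-breaker.
        best = min(candidates, key=lambda n: (used[getattr(n, domain)], n.name))
        chosen.append(best)
        used[getattr(best, domain)] += 1
        candidates.remove(best)
    return chosen


# Hypothetical inventory: two nodes share rack-a, the others are alone.
nodes = [
    Node("n1", "rack-a", "dc-1"),
    Node("n2", "rack-a", "dc-1"),
    Node("n3", "rack-b", "dc-1"),
    Node("n4", "rack-c", "dc-2"),
]
placement = spread_replicas(nodes, replicas=3)
print([n.name for n in placement])
```

With three replicas and three racks, no two replicas share a rack, so the loss of a single PDU or switch takes out at most one replica. Passing `domain="data_centre"` spreads on the coarser domain instead.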
Slide 17
Slide 17 text
Maintenance
• Upgrading software is harder
• Choosing which machines to upgrade is more difficult
[Diagram: VM1 to VM6 divided among Service, Database and Batch Process]
Challenges of Manual
Scheduling
• Homogeneous distribution of applications on a single
node decreases utilization.
• Static partitioning doesn’t allow sharing a group of
machines’ resources across multiple applications.
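The utilization argument above can be made concrete with a small calculation. This is a rough sketch with made-up numbers: the machine size, the per-instance CPU demands and the simplification that static partitioning gives every instance its own machine are all assumptions for illustration.

```python
MACHINE_CPUS = 8  # hypothetical machine size

# Hypothetical per-instance CPU demand for each application.
APPS = {
    "service": [2, 2, 2],
    "database": [4, 4],
    "batch": [1, 1, 1, 1],
}


def machines_dedicated(apps):
    """Static partitioning, simplified: every instance gets its own machine."""
    return sum(len(demands) for demands in apps.values())


def machines_shared(apps, capacity):
    """Shared pool: first-fit-decreasing packing of all instances.

    Assumes every single demand fits on one machine.
    """
    free = []  # remaining CPUs on each machine opened so far
    demands = sorted((d for ds in apps.values() for d in ds), reverse=True)
    for demand in demands:
        for i, spare in enumerate(free):
            if spare >= demand:
                free[i] -= demand
                break
        else:
            free.append(capacity - demand)  # no machine had room: open a new one
    return len(free)


def utilization(apps, machines, capacity):
    """Fraction of provisioned CPUs that are actually requested."""
    used = sum(sum(ds) for ds in apps.values())
    return used / (machines * capacity)


dedicated = machines_dedicated(APPS)
shared = machines_shared(APPS, MACHINE_CPUS)
print(dedicated, utilization(APPS, dedicated, MACHINE_CPUS))
print(shared, utilization(APPS, shared, MACHINE_CPUS))
```

In this toy setup the same workload needs 9 machines at 25% CPU utilization when each instance is pinned to its own machine, and 3 machines at 75% when the machines are shared.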
Slide 21
Slide 21 text
Enter the Era of Cluster
Managers
• Mesos
• Borg
• Kubernetes
• CoreOS Fleet
Slide 22
Slide 22 text
Cluster Managers
[Diagram: a Cluster Manager layered over the grid of "C" units, SDN and Network Storage, running API Servers, Batch Apps and DBs]
Slide 23
Slide 23 text
Mesos
Slide 24
Slide 24 text
Mesos
Moves the focus from
Servers to Compute
Resources
Slide 25
Slide 25 text
Mesos
• Two level scheduler
• Provides discovery and brokerage of compute
resources
• Semantics for launching processes
• Event driven API for monitoring life cycle of
applications
• Resource Isolation on a single node
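The two-level split can be illustrated with a toy model: one level only tracks and offers free resources, and the other level (the framework) decides what to launch on them. This sketch does not use the real Mesos API; `ToyMaster`, `ToyFramework`, `Offer` and `Task` are hypothetical names, and resource accounting after a launch is left out.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


class ToyFramework:
    """Second level: the framework decides which offered resources to use."""

    def __init__(self, pending):
        self.pending = list(pending)

    def resource_offers(self, offers):
        launches = []  # (node, task name) pairs the framework accepts
        for offer in offers:
            cpus, mem = offer.cpus, offer.mem_mb
            # Iterate over a copy because accepted tasks leave the pending list.
            for task in list(self.pending):
                if task.cpus <= cpus and task.mem_mb <= mem:
                    cpus -= task.cpus
                    mem -= task.mem_mb
                    self.pending.remove(task)
                    launches.append((offer.node, task.name))
        return launches


class ToyMaster:
    """First level: only knows free resources and offers them to a framework."""

    def __init__(self, free):
        self.free = free  # node name -> (cpus, mem_mb)

    def offer_round(self, framework):
        offers = [Offer(n, c, m) for n, (c, m) in self.free.items()]
        return framework.resource_offers(offers)


master = ToyMaster({"node-1": (4, 8192), "node-2": (2, 4096)})
framework = ToyFramework([
    Task("api", 2, 2048),
    Task("batch", 3, 4096),
    Task("cache", 1, 1024),
])
launches = master.offer_round(framework)
print(launches, [t.name for t in framework.pending])
```

The master never looks at the task list, and the framework never sees the cluster beyond the offers it receives. Here `batch` stays pending because no single offer has 3 free CPUs left once `api` has been placed.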
Slide 26
Slide 26 text
Domain Specific
Frameworks
• Scheduling decisions are left to users
• Allows plugging in multiple frameworks
• Allows users to define their own states for
processes
• Pending -> Running -> Dead
• Sends messages when state of a process changes
• Task Dispatched -> Task Staging -> Task Running -> Task Finished
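One way to read the two state sequences above is that a framework folds the task messages it receives into its own, coarser process states. The sketch below assumes a particular mapping between the two sequences; the mapping, `JobState` and `apply_events` are illustrative and not taken from Mesos or from the talk.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    DEAD = "Dead"


# Assumed mapping from the task messages on the slide to framework states.
EVENT_TO_STATE = {
    "Task Dispatched": JobState.PENDING,
    "Task Staging": JobState.PENDING,
    "Task Running": JobState.RUNNING,
    "Task Finished": JobState.DEAD,
}

# Transitions the framework accepts; a state may always repeat itself.
ALLOWED = {
    JobState.PENDING: {JobState.PENDING, JobState.RUNNING, JobState.DEAD},
    JobState.RUNNING: {JobState.RUNNING, JobState.DEAD},
    JobState.DEAD: {JobState.DEAD},
}


def apply_events(events, state=JobState.PENDING):
    """Fold task messages into the framework's own state history."""
    history = [state]
    for event in events:
        next_state = EVENT_TO_STATE[event]
        if next_state not in ALLOWED[state]:
            raise ValueError(f"illegal transition {state} -> {next_state}")
        if next_state is not state:
            history.append(next_state)
        state = next_state
    return history


history = apply_events(
    ["Task Dispatched", "Task Staging", "Task Running", "Task Finished"]
)
print([s.value for s in history])
```

Four task messages collapse into the three framework states Pending, Running and Dead, and a message that would move a Dead process back to Running is rejected.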
Slide 27
Slide 27 text
[Diagram: Scheduler 1 and Scheduler 2 on top of Mesos, which manages 10s of 1000s of Compute Nodes]
Mesos - Indirections
Mesos - Custom Executors
[Diagram: a Compute Node running the Linux Kernel, a Mesos Slave, a Mesos Executor and a Process]
Slide 32
Slide 32 text
Mesos - SDK for building
Data Centre OS
• SDK in Java, Python, Go
• Provides interfaces for exchanging messages
between schedulers and executors
• Provides log replication capabilities
Slide 33
Slide 33 text
Data Centre OS Services
• A highly available and consistent control plane for
managing state of the cluster
• An API for users and other services to submit job
specifications
• Custom Executors to set up processes and
communicate life cycle events of processes
• Containerizer for providing process isolation
Slide 34
Slide 34 text
Cloud Native Scheduling
• AutoScaling
• Dynamic Reservations
• Automatic Node Replacements
• Fail-overs across multiple Regions
Slide 35
Slide 35 text
Titan
Slide 36
Slide 36 text
Titan
• A distributed compute service native to public
clouds
• Provides AutoScaling to clusters of Containers
• Supervises containers and provides failover
mechanisms to applications running in containers
• Provides logging, monitoring and volume
management capabilities
Slide 37
Slide 37 text
Titan
Slide 38
Slide 38 text
A Compute Node of Titan
Slide 39
Slide 39 text
Scheduling Library for
Mesos
Slide 40
Slide 40 text
Dynamic Reservations
• Titan allows reserving resources for specific applications
• Enforces the reservations under resource contention
• Reservations are made on a priority level
• P1 (Guaranteed Reservations) <-> P3 (Best Effort)
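Priority-based enforcement under contention can be sketched as serving requests in priority order until capacity runs out. The talk does not describe how Titan implements this, so the `allocate` function, the request tuples and the numbers below are assumptions for illustration only.

```python
def allocate(capacity_cpus, requests):
    """Grant CPUs in priority order (1 = P1, guaranteed; 3 = P3, best effort).

    `requests` is a list of (app, priority, cpus) tuples. Lower-priority
    applications only receive what is left after higher priorities are served.
    """
    granted = {}
    remaining = capacity_cpus
    # sorted() is stable, so requests with equal priority keep their order.
    for app, _priority, cpus in sorted(requests, key=lambda r: r[1]):
        give = min(cpus, remaining)
        granted[app] = give
        remaining -= give
    return granted


# Hypothetical contention: 14 CPUs requested, 10 available.
granted = allocate(10, [("batch", 3, 6), ("api", 1, 4), ("etl", 2, 4)])
print(granted)
```

In this example the P1 and P2 applications get their full request, and the P3 batch job absorbs the whole shortfall.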
Slide 41
Slide 41 text
AutoScaling
• Two Levels of AutoScaling
• Scaling of underlying compute resources
• Application Scaling based on business and
performance metrics
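The two levels can be sketched as two sizing functions, one for the application and one for the compute pool underneath it. This is a simplified illustration with hypothetical metrics and thresholds, not Titan's actual policy.

```python
import math


def desired_instances(requests_per_sec, rps_per_instance, min_instances=1):
    """Application level: size the job from a business or performance metric."""
    return max(min_instances, math.ceil(requests_per_sec / rps_per_instance))


def desired_nodes(instances, cpus_per_instance, cpus_per_node, headroom_pct=25):
    """Cluster level: size the node pool to fit the instances plus headroom."""
    needed = instances * cpus_per_instance * (100 + headroom_pct)
    return -(-needed // (100 * cpus_per_node))  # integer ceiling division


# Hypothetical load: 950 requests/s, each instance handles about 100.
instances = desired_instances(requests_per_sec=950, rps_per_instance=100)
nodes = desired_nodes(instances, cpus_per_instance=2, cpus_per_node=8)
print(instances, nodes)
```

The application level reacts to traffic, and the cluster level reacts to what the applications ask for. Keeping some headroom at the cluster level means new instances can usually start without waiting for a node to boot.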
Slide 42
Slide 42 text
jobs.netflix.com
The Data Centre As a Computer
http://www.cs.berkeley.edu/~rxin/db-papers/WarehouseScaleComputing.pdf
Resource Scheduling using Fenzo
http://www.slideshare.net/spodila/aws-reinvent-2014-talk-scheduling-using-apache-mesos-in-the-cloud
Apache Mesos
https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
Large Scale Cluster Management at Google
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf