Netflix has a complex micro-services architecture which is run in an active-active manner from multiple geographies on top of AWS. Application deployment and management is a very important aspect of running services in this manner at scale. We have developed Titan to make cluster management, application deployments and process supervision much more robust and efficient in terms of CPU/memory utilization across all of our servers in different geographies.
Titan is built on top of Apache Mesos for scheduling processes of applications on top of AWS EC2. In this talk we will talk about the design of our Mesos framework and the scheduler. We will focus on the following aspects of the scheduler – Bin packing algorithms, Scaling in and out of clusters, fault tolerance of processes via reconciliation and processing life cycle events, multi-geography/cross data center redundancy.