
Reliably shipping containers in a resource rich world using Titan

Diptanu Choudhury
June 23, 2015

Netflix has a complex microservices architecture that is operated in an active-active manner from multiple geographies on top of AWS. Amazon gives us the flexibility to tap into massive amounts of resources, but how we use and manage those is a constantly evolving and ever-growing task. We have developed Titan to make cluster management, application deployments using Docker and process supervision much more robust and efficient in terms of CPU/memory utilization across all of our servers in different geographies.
Titan, a combination of Docker and Apache Mesos, is an application infrastructure that gives us a highly resilient and dynamic PaaS that is native to public clouds and runs across multiple geographies. It makes it easy for us to manage applications in our complex infrastructure and gives us the ability to make changes in the IaaS layer without impacting developer productivity or sacrificing insight into our production infrastructure.


  1. Reliably shipping containers in a resource rich world using Titan
    Diptanu Choudhury, Software Engineer, Netflix, @diptanu
  2. Titan
    • A distributed compute service native to public clouds
    • Provides Auto Scaling to clusters of containers
    • Supervises containers and provides failover mechanisms to applications running in containers
    • Provides logging, monitoring, volume management capabilities
  3. Why we chose Docker
    • Process isolation
    • Immutable deployment artifacts
    • Ability to package dependencies of an application in a single binary
    • Tooling around the runtime for building and deploying
    • Scalable distribution of binaries across clusters
  4. Logging
    • Titan allows users to stream logs of a Task from a running container in a location-transparent manner
    • Logs are archived off-instance, and Titan provides an API to stream logs of finished tasks
  5. Network
    • In EC2 Classic, Titan exposes ports on containers on the host machine.
      - Mesos is used as a broker for port allocation
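The slide does not show how ports are picked, so the following is only a minimal sketch of the idea of using Mesos as a port broker. It assumes that a resource offer carries host ports as inclusive `(begin, end)` ranges, which is how Mesos models its `ports` resource; the function name `allocate_ports` and the first-fit strategy are hypothetical and are not taken from Titan.

```python
from typing import List, Tuple

# Inclusive (begin, end) range, mirroring a Mesos "ports" ranges resource.
Range = Tuple[int, int]


def allocate_ports(offered: List[Range], count: int) -> Tuple[List[int], List[Range]]:
    """Take `count` host ports from the offered ranges (first fit).

    Returns the chosen ports and the ranges left over in the offer.
    Hypothetical helper for illustration; not Titan's actual logic.
    """
    ports: List[int] = []
    leftover: List[Range] = []
    for begin, end in offered:
        need = count - len(ports)
        if need <= 0:
            # Already satisfied: the whole range stays available.
            leftover.append((begin, end))
            continue
        take = min(need, end - begin + 1)
        ports.extend(range(begin, begin + take))
        if begin + take <= end:
            # Keep the unused tail of a partially consumed range.
            leftover.append((begin + take, end))
    if len(ports) < count:
        raise ValueError("offer does not contain enough ports")
    return ports, leftover
```

A scheduler built this way would map each chosen host port to a container port when launching the task and decline or reuse the leftover ranges.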
  6. Network
    • In VPC, every container gets its own IP address.
      - Mesos is completely out of the picture for port management
      - We use ENIs and move them into the network namespace of containers
      - Developing a custom network plugin
  7. Monitoring
    • cgroup metrics published by the kernel are pushed to Atlas.
    • Users can see all the cgroup metrics per task.
    • cgroup notification API for alerting
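As an illustration of collecting per-task cgroup metrics, here is a minimal sketch that parses the `key value` line format the kernel uses for files such as a cgroup v1 `memory.stat` and tags each value with a task id. The metric naming scheme, the `task` tag, and both function names are assumptions made for this example; the call that would actually publish to Atlas is deliberately left out because the slides do not describe it.

```python
from typing import Dict, List


def parse_cgroup_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines (the layout of a cgroup memory.stat file)."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines rather than failing the whole scrape.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def to_metrics(task_id: str, stats: Dict[str, int]) -> List[dict]:
    """Attach a per-task tag to every counter (hypothetical metric layout)."""
    return [
        {"name": f"cgroup.memory.{key}", "tags": {"task": task_id}, "value": value}
        for key, value in sorted(stats.items())
    ]


# Made-up sample content, shaped like memory.stat output.
sample = "cache 4096\nrss 8192\n"
metrics = to_metrics("task-1", parse_cgroup_stat(sample))
```

Tagging each data point with the task is what would let users view every cgroup metric per task, as the slide describes.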
  8. Failover
    • Titan allows SREs to drain a cluster of containers into newer compute nodes
    • Underlying VMs are automatically terminated when containers crash because of hardware/OS problems
    • Allows failover across multiple data centers
  9. AutoScaling
    • Two Levels of Autoscaling
      - Scaling of underlying compute resources
      - Application Scaling based on business and performance metrics
  10. AutoScaling
    • Two Types of Autoscaling
      - Predictive
        • Titan scales up infrastructure based on historical data and statistical modeling.
      - Reactive
        • Scaling activities are triggered based on predefined thresholds
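The reactive type can be sketched as a simple threshold rule. This is only an illustration under stated assumptions: the function `desired_instances`, its parameters, and the one-instance step are hypothetical, and the slides do not say which metrics or thresholds Titan uses.

```python
def desired_instances(
    current: int,
    metric: float,
    scale_up_at: float,
    scale_down_at: float,
    min_instances: int,
    max_instances: int,
    step: int = 1,
) -> int:
    """Return the target instance count for one evaluation of the rule.

    Scale up when the metric exceeds the upper threshold, scale down when
    it falls below the lower one, otherwise hold. The result is clamped
    to the [min_instances, max_instances] bounds.
    """
    if metric > scale_up_at:
        target = current + step
    elif metric < scale_down_at:
        target = current - step
    else:
        target = current
    return max(min_instances, min(max_instances, target))
```

Keeping a gap between the two thresholds avoids flapping when the metric hovers near a single cutoff; a predictive scaler would instead set the target ahead of time from a forecast.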
  11. Where are we with Titan at Netflix
    • Prototype for running cron jobs
    • Non Mission Critical Algorithms
    • Mission Critical Batch Jobs in Production
    • Prototypes for running online processes and web services
    • Parts of Netflix Data Pipeline
    • The Netflix API and Edge Systems
    [Timeline figure; axis markers: May '14, Near Term, Future]