Reliably shipping containers in a resource rich world using Titan

Diptanu Choudhury
June 23, 2015

Netflix has a complex microservices architecture that is operated in an active-active manner from multiple geographies on top of AWS. Amazon gives us the flexibility to tap into massive amounts of resources, but how we use and manage those is a constantly evolving and ever-growing task. We have developed Titan to make cluster management, application deployments using Docker and process supervision much more robust and efficient in terms of CPU/memory utilization across all of our servers in different geographies.
Titan, a combination of Docker and Apache Mesos, is an application infrastructure that gives us a highly resilient and dynamic PaaS that is native to public clouds and runs across multiple geographies. It makes it easy for us to manage applications in our complex infrastructure and gives us the ability to make changes in the IaaS layer without impacting developer productivity or sacrificing insight into our production infrastructure.



  1. Reliably shipping containers in a resource rich world using Titan

    Diptanu Choudhury, Software Engineer, Netflix (@diptanu)
  2. Titan

    • A distributed compute service native to public clouds
    • Provides Auto Scaling to clusters of containers
    • Supervises containers and provides failover mechanisms to applications running in containers
    • Provides logging, monitoring, and volume management capabilities
  3. None
  4. A Cloud Native Application built on MicroServices Architecture

  5. None
  6. Architected with High Availability in mind

  7. The operational benefits of a PaaS without the dilemmas of sandboxing technologies.
  8. A need for a common resource scheduler for domain specific distributed systems
  9. Consistent tooling and operational control plane for SREs across all technology stacks
  10. Faster turnaround time from development to production

  11. Auto Scaling Groups are harder to adopt for event based orchestration systems
  12. Increasing density of application processes per server

  13. None
  14. Why we chose Docker

    • Process isolation
    • Immutable deployment artifacts
    • Ability to package dependencies of an application in a single binary
    • Tooling around the runtime for building and deploying
    • Scalable distribution of binaries across clusters
  15. Docker Containers are the deployment artifacts and process runtime for

  16. The Titan API

  17. A Titan Compute Node Direct Netflix Titan

  18. From 1,000 Feet

  19. From 5,000 Feet

  20. Disk

    • Titan manages ephemeral volumes for containers.
    • Data volumes are mounted within containers
  21. We use ZFS on Linux

  22. Logging

    • Titan allows users to stream logs of a Task from a running container in a location transparent manner
    • Logs are archived off-instance and Titan provides an API to stream logs of finished tasks
  23. Network

    • In EC2 Classic, Titan exposes ports on containers on the host machine.
      - Mesos is used as a broker for port allocation
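A rough illustration of the port-brokering idea on this slide, not Titan's or Mesos's actual code: the scheduler receives a range of free host ports in a resource offer and maps each container port to one of them. The names `PortOffer` and `HostPortBroker` and the example port range are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PortOffer:
    """Hypothetical stand-in for a range of free host ports (inclusive)."""
    begin: int
    end: int


@dataclass
class HostPortBroker:
    """Hypothetical broker that maps container ports to offered host ports."""
    offer: PortOffer
    # (task_id, container_port) -> host_port
    assigned: dict = field(default_factory=dict)

    def allocate(self, task_id, container_ports):
        """Assign the lowest free host ports to the given container ports."""
        used = set(self.assigned.values())
        free = [p for p in range(self.offer.begin, self.offer.end + 1)
                if p not in used]
        if len(free) < len(container_ports):
            raise RuntimeError("offer does not contain enough free ports")
        mapping = dict(zip(container_ports, free))
        for container_port, host_port in mapping.items():
            self.assigned[(task_id, container_port)] = host_port
        return mapping

    def release(self, task_id):
        """Return all host ports held by a finished task to the pool."""
        for key in [k for k in self.assigned if k[0] == task_id]:
            del self.assigned[key]


broker = HostPortBroker(PortOffer(31000, 31002))  # example range only
web_ports = broker.allocate("web-1", [8080, 8443])
```

Because several containers share one host network stack in this mode, two tasks can both listen on container port 8080 while being reachable on different host ports.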
  24. Network

    • In VPC, every container gets its own IP address.
      - Mesos is completely out of the picture for port management
      - We use ENIs and move them into the network namespace of containers
      - Developing a custom network plugin
  25. Monitoring

    • cgroup metrics published by the kernel are pushed to Atlas.
    • Users can see all the cgroup metrics per task.
    • cgroup notification API for alerting
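As a sketch of the first two bullets, the snippet below parses the flat `key value` text that cgroup stat files such as `memory.stat` expose and tags each value with a task id. The metric naming scheme and the `task` tag are assumptions for illustration; how Titan actually publishes to Atlas is not shown in the slides, so no Atlas client call is made here.

```python
def parse_cgroup_stat(text):
    """Parse flat 'name value' lines, as found in cgroup stat files."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or unexpected lines instead of failing the whole scrape.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        metrics[parts[0]] = int(parts[1])
    return metrics


def to_task_metrics(task_id, controller, stats):
    """Attach a per-task tag to every value (naming scheme is hypothetical)."""
    return [
        {"name": f"cgroup.{controller}.{name}",
         "tags": {"task": task_id},
         "value": value}
        for name, value in sorted(stats.items())
    ]


sample = "cache 4096\nrss 1048576\n\nmalformed line here\n"
stats = parse_cgroup_stat(sample)
points = to_task_metrics("task-42", "memory", stats)
```

In a real agent the text would be read from the container's cgroup directory on the host; a fixed string is used here so the example is self-contained.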
  26. Failover

    • Titan allows SREs to drain a cluster of containers into newer compute nodes
    • Underlying VMs are automatically terminated when containers crash because of hardware/OS problems
    • Allows failover across multiple data centers
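One way to picture the drain operation in the first bullet is as a placement problem: every task on the old nodes needs a slot on a newer node with enough free capacity. The sketch below uses a simple first-fit, largest-task-first heuristic on CPU only. This is an assumed stand-in for illustration, not Titan's scheduler, and `plan_drain` is a hypothetical name.

```python
def plan_drain(tasks_on_old_nodes, new_node_free_cpus):
    """Map each task to a new node using first-fit, largest task first.

    tasks_on_old_nodes: {task_id: cpus_required}
    new_node_free_cpus: {node_id: cpus_free}
    """
    remaining = dict(new_node_free_cpus)
    plan = {}
    # Place the largest tasks first so they are not starved by small ones.
    ordered = sorted(tasks_on_old_nodes.items(), key=lambda kv: (-kv[1], kv[0]))
    for task_id, cpus in ordered:
        for node_id in sorted(remaining):
            if remaining[node_id] >= cpus:
                plan[task_id] = node_id
                remaining[node_id] -= cpus
                break
        else:
            # No node can host this task: refuse to drain rather than drop it.
            raise RuntimeError(f"no capacity for task {task_id}")
    return plan


plan = plan_drain({"a": 2, "b": 1, "c": 2}, {"n1": 3, "n2": 2})
```

Computing the whole plan before moving anything means an operator can see that the drain fits before any container is stopped.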
  27. AutoScaling

    • Two Levels of Autoscaling
      - Scaling of underlying compute resources
      - Application Scaling based on business and performance metrics
  28. AutoScaling

  29. AutoScaling

    • Two Types of Autoscaling
      - Predictive: Titan scales up infrastructure based on statistical modeling of historical data.
      - Reactive: Scaling activities are triggered based on predefined thresholds
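The two types can be contrasted with a minimal sketch. The reactive function compares a current metric against predefined thresholds; the predictive function uses a deliberately naive model (the mean of past load observed at the same hour) as a stand-in for the statistical modeling mentioned on the slide. All function names, parameters, and numbers are assumptions for illustration.

```python
import math


def clamp(value, low, high):
    """Keep the cluster size inside its configured bounds."""
    return max(low, min(high, value))


def reactive_desired(current, metric, scale_up_at, scale_down_at,
                     min_size, max_size, step=1):
    """Threshold-triggered scaling: react to what is happening now."""
    if metric > scale_up_at:
        desired = current + step
    elif metric < scale_down_at:
        desired = current - step
    else:
        desired = current
    return clamp(desired, min_size, max_size)


def predictive_desired(same_hour_history, per_instance_capacity,
                       min_size, max_size):
    """Naive forecast: expect the average load seen at this hour before."""
    expected_load = sum(same_hour_history) / len(same_hour_history)
    desired = math.ceil(expected_load / per_instance_capacity)
    return clamp(desired, min_size, max_size)


up = reactive_desired(4, 0.9, scale_up_at=0.8, scale_down_at=0.3,
                      min_size=2, max_size=10)
forecast = predictive_desired([900, 1100, 1000], per_instance_capacity=250,
                              min_size=2, max_size=10)
```

A predictive policy can add capacity ahead of an expected peak, while a reactive policy acts as a safety net when real traffic departs from the forecast.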
  30. Where are we with Titan at Netflix

    • Prototype for running cron jobs
    • Non Mission Critical Algorithms
    • Mission Critical Batch Jobs in Production
    • Prototypes for running online processes and web services
    • Parts of Netflix Data Pipeline
    • The Netflix API and Edge Systems
    • Timeline labels: May '14, Near Term, Future
  31. Thank you Diptanu Choudhury @diptanu