Reliably shipping containers in a resource rich world using Titan

Slide 1

Slide 1 text

Reliably shipping containers in a resource rich world using Titan
Diptanu Choudhury, Software Engineer, Netflix
@diptanu

Slide 2

Slide 2 text

Titan
• A distributed compute service native to public clouds
• Provides Auto Scaling to clusters of containers
• Supervises containers and provides failover mechanisms to applications running in containers
• Provides logging, monitoring, volume management capabilities

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

A Cloud Native Application built on MicroServices Architecture

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Architected with High Availability in mind

Slide 7

Slide 7 text

The operational benefits of a PaaS without the dilemmas of sandboxing technologies.

Slide 8

Slide 8 text

A need for a common resource scheduler for domain specific distributed systems

Slide 9

Slide 9 text

Consistent tooling and operational control plane for SREs across all technology stacks

Slide 10

Slide 10 text

Faster turnaround time from development to production

Slide 11

Slide 11 text

Auto Scaling Groups are harder to adopt for event-based orchestration systems

Slide 12

Slide 12 text

Increasing density of application processes per server

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Why we chose Docker
• Process isolation
• Immutable deployment artifacts
• Ability to package dependencies of an application in a single binary
• Tooling around the runtime for building and deploying
• Scalable distribution of binaries across clusters

Slide 15

Slide 15 text

Docker Containers are the deployment artifacts and process runtime for Titan

Slide 16

Slide 16 text

The Titan API

Slide 17

Slide 17 text

A Titan Compute Node
(Diagram; surviving labels: Direct, Netflix, Titan)

Slide 18

Slide 18 text

From 1,000 Feet

Slide 19

Slide 19 text

From 5,000 Feet

Slide 20

Slide 20 text

Disk
• Titan manages ephemeral volumes for containers.
• Data volumes are mounted within containers

Slide 21

Slide 21 text

We use ZFS on Linux

Slide 22

Slide 22 text

Logging
• Titan allows users to stream logs of a Task from a running container in a location-transparent manner
• Logs are archived off-instance and Titan provides an API to stream logs of finished tasks

Slide 23

Slide 23 text

Network
• In EC2 Classic, Titan exposes ports on containers on the host machine.
  - Mesos is used as a broker for port allocation
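The port brokering above can be sketched roughly as follows. Mesos offers ports as a resource made of ranges; everything else here (the function name `allocate_ports`, the first-fit policy, the example port numbers) is an assumption for illustration, not Titan's actual code.

```python
from typing import Dict, List, Tuple


def allocate_ports(offered_ranges: List[Tuple[int, int]],
                   container_ports: List[int]) -> Dict[int, int]:
    """Map each container port to a host port taken from the offered ranges.

    offered_ranges: inclusive (low, high) host port ranges from a resource offer.
    container_ports: ports the application listens on inside the container.
    Hypothetical first-fit policy: hand out host ports in ascending offer order.
    """
    free_ports = (port
                  for low, high in offered_ranges
                  for port in range(low, high + 1))
    mapping: Dict[int, int] = {}
    for container_port in container_ports:
        host_port = next(free_ports, None)
        if host_port is None:
            raise ValueError("offer does not contain enough ports")
        mapping[container_port] = host_port
    return mapping
```

For example, `allocate_ports([(31000, 31001), (31005, 31005)], [80, 443, 8080])` maps the three container ports to host ports 31000, 31001 and 31005.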

Slide 24

Slide 24 text

Network
• In VPC, every container gets its own IP address.
  - Mesos is completely out of the picture for port management
  - We use ENIs and move them into the network namespace of containers
  - Developing a custom network plugin

Slide 25

Slide 25 text

Monitoring
• cgroup metrics published by the kernel are pushed to Atlas.
• Users can see all the cgroup metrics per task.
• cgroup notification API for alerting
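A minimal sketch of the per-task metrics idea, assuming the flat "key value" text layout the kernel uses for cgroup stat files such as memory.stat. The record shape, the `cgroup.memory.` name prefix and the `task` tag are hypothetical; the slides do not show how Titan actually publishes to Atlas.

```python
from typing import Dict, List


def parse_cgroup_stat(text: str) -> Dict[str, int]:
    """Parse a flat 'key value' cgroup stat file into a dict of integers."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            stats[parts[0]] = int(parts[1])
    return stats


def to_task_metrics(task_id: str, stats: Dict[str, int]) -> List[dict]:
    """Tag every metric with its task so it can be viewed per task.

    The record layout here is an assumption, not the Atlas wire format.
    """
    return [
        {"name": f"cgroup.memory.{key}", "tags": {"task": task_id}, "value": value}
        for key, value in sorted(stats.items())
    ]
```

A collector on each compute node could read the stat file for a container's cgroup, pass its contents through `parse_cgroup_stat`, and hand the tagged records to whatever client pushes metrics.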

Slide 26

Slide 26 text

Failover
• Titan allows SREs to drain a cluster of containers into newer compute nodes
• Underlying VMs are automatically terminated when containers crash because of hardware or OS problems
• Allows failover across multiple data centers
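Draining containers onto newer compute nodes could be planned along these lines. This is a sketch under stated assumptions: the function `plan_drain`, the task-count capacity model and the least-loaded placement rule are all hypothetical and are not taken from Titan.

```python
from typing import Dict, List, Set, Tuple


def plan_drain(placements: Dict[str, List[str]],
               old_nodes: Set[str],
               capacity: Dict[str, int]) -> List[Tuple[str, str, str]]:
    """Return (task, source_node, target_node) moves that empty the old nodes.

    placements: node name -> tasks currently running there.
    capacity: node name -> maximum number of tasks a node may run.
    Each task goes to the least-loaded new node that still has room.
    """
    load = {node: len(placements.get(node, []))
            for node in capacity if node not in old_nodes}
    moves: List[Tuple[str, str, str]] = []
    for source in sorted(old_nodes):
        for task in placements.get(source, []):
            candidates = [n for n in load if load[n] < capacity[n]]
            if not candidates:
                raise RuntimeError("not enough capacity on newer nodes")
            # Tie-break on node name so the plan is deterministic.
            target = min(candidates, key=lambda n: (load[n], n))
            load[target] += 1
            moves.append((task, source, target))
    return moves
```

Once every move has completed, the old nodes run no tasks and their VMs can be terminated.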

Slide 27

Slide 27 text

AutoScaling
• Two Levels of Autoscaling
  - Scaling of underlying compute resources
  - Application Scaling based on business and performance metrics

Slide 28

Slide 28 text

AutoScaling

Slide 29

Slide 29 text

AutoScaling
• Two Types of Autoscaling
  - Predictive
    • Titan scales up infrastructure based on statistical modeling of historical data.
  - Reactive
    • Scaling activities are triggered based on predefined thresholds
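The two types can be contrasted with a small sketch. The thresholds, step size, headroom factor and the plain average used as the "statistical model" are all assumptions chosen for illustration; the slides do not describe Titan's actual policies or models.

```python
import math
from typing import Sequence


def reactive_desired(current: int, utilization: float,
                     high: float = 0.75, low: float = 0.25,
                     step: int = 1, min_size: int = 1,
                     max_size: int = 100) -> int:
    """Reactive: adjust the instance count when a metric crosses a threshold."""
    if utilization > high:
        current += step
    elif utilization < low:
        current -= step
    # Clamp to the allowed cluster size.
    return max(min_size, min(max_size, current))


def predictive_desired(history: Sequence[float],
                       per_instance_capacity: float,
                       headroom: float = 1.25) -> int:
    """Predictive: size the cluster ahead of time from historical load.

    The forecast is a plain mean of past observations, a stand-in for
    whatever statistical model is really used.
    """
    forecast = sum(history) / len(history)
    return math.ceil(forecast * headroom / per_instance_capacity)
```

With these defaults, `reactive_desired(4, 0.9)` asks for 5 instances, while `predictive_desired([900, 1000, 1100], 100)` provisions 13 instances for a forecast load of 1000 requests per second with 25% headroom.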

Slide 30

Slide 30 text

Where are we with Titan at Netflix
• Prototype for running cron jobs
• Non Mission Critical Algorithms
• Mission Critical Batch Jobs in Production
• Prototypes for running online processes and web services
• Parts of Netflix Data Pipeline
• The Netflix API and Edge Systems
(Timeline; axis labels: May '14, Near Term, Future)

Slide 31

Slide 31 text

Thank you
Diptanu Choudhury
@diptanu
www.linkedin.com/in/diptanu