Pinterest's Journey from VMs to Containers

Journey from VMs to Containers
Micheal Benedict (@micheal), Cloud & Data Infrastructure
Lida Li (@lidali), Cloud Management Platform

1. About Pinterest
2. Stats & Current State
3. Compute Platform: Vision / Scope, Orchestration Evaluation, Moving to Containers
4. Future

Diverse workloads
Services (ONLINE)
• Number of Services: 10^3
• Number of Hosts: 10^4
Batch Jobs (OFFLINE)
• Number of Data Jobs: 10^5
• Model Training, Analytics Pipeline
• Block Storage in GBs: 10^5
• Also used as an analytics backend (transactions)

Current state of the world
USE-CASES: LONG RUNNING (STATELESS / STATEFUL) | JOBS (ASYNC / DATA)
COMPUTE PLATFORMS: TELETRAAN (GENERAL COMPUTE ON VM) | PINLATER (JOB QUEUE / ASYNC EXECUTION) | MONARCH & OVERWATCH (HADOOP + SPARK)
PROVISIONING: DIRECT API ACCESS | RHODMUS | TERRAFORM
CLOUD
• Lack of consistent E2E Developer Experience (Develop, Deploy, Operate)
• Tech debt / moving to new platforms challenging
• Support / Operations challenging
• Difficult to implement Infrastructure Governance

Compute Platform

Vision: the fastest path from an idea to production, without worrying about infrastructure

Focus #1: Simplify E2E Dev XP
What steps is a developer required (but not expected) to perform when building, launching & managing services, batch jobs, etc.?

Focus #2: An integrated Infra Platform
What is required to build a reliable, scalable, efficient & well integrated infrastructure platform?

Focus #3: Infra Governance
Without hampering developer experience or adding ops work, what controls are required to effectively utilize & manage infrastructure?

Scope
SETUP | TEST & BUILD | UNIT TEST | APP IMAGE MANAGEMENT | OPERATIONS | METRICS | LOGS | TRACING | DEPLOY & RELEASE | WORKFLOW MANAGEMENT | JOB SUBMISSION | INTEGRATION TEST | OWNERSHIP | SCAFFOLDING | ROLES, KEYS & SECRETS | RESOURCE MANAGEMENT | QUOTA | AMI MANAGEMENT | CLUSTER PROVISIONING | METERING | HEALTH CHECK | JOB STATUS | JOB CONFIG

Timeline: H1 2016 | H2 2016 | H1 2017 | H2 2017 | H1 2018
Containers @Pinterest Kickoff
Phase 1: Docker MVP
• Developer Workflow
• Image Management
• Integration w/ existing security & networking systems
• First Production Service migrated
Phase 2: Productionize Docker & Adoption
• Metric, logging, security and high availability support
• Fully production ready and over one hundred services migrated (including major API fleet)
Container Orchestration @Pinterest Kickoff
• Orchestration Evaluation
• MVP build & Operate production cluster for a use-case

Container Orchestration Evaluation Framework: CHOICES, POC, CRITERIA, OUTCOME

Container Orchestration: CRITERIA
• Resource and Task Scheduling (Flexibility, Multi-Tenancy, Extensibility etc.)
• Scalability and Performance
• Integration Cost
• Docker Support, Sidecar support and Runtime extensibility
• Network Support on AWS*
• Security Support on AWS*
• Stateful Service Support
• Ecosystem and Community
• Cluster Operations & Support

Container Orchestration (CHOICES, POC, CRITERIA, OUTCOME)
• Custom Scheduler

Launching a Service: Before Containerization Project
Diagram labels: Teletraan, Monit, Supervisor, Upstart, Service AMI, Puppet, Service (x4), Cron Jobs, Crond, Base AMI
• Multiple AMIs, complex management
• End users did Puppet authoring & testing
• Unpredictability around Puppet runs
• Disparate process management (monit, upstart, supervisor)

Launching a Service: Post Containerization Project
Diagram labels: Unified (single) AMI, Container (x4), Teletraan + Telefig, Container Engine (Docker)
• Single AMI for Containers
• No Puppet authoring!
• Unified process management
• Immutable infrastructure & deterministic behavior

Developer Workflow: Docker Containers
Developer Workflow: Code, Build, Test
Under the hood:
• Introduce Pinterest Service Description Language (PSDL)
• Optimize image building on large shared image
• ECR and a Self Hosted Registry (HA)

version: 1
myservice:
  docker:
    image: myservice-server
    user: prod
    environment:
      - CONFIG_FILE=config/myservice.props
  sidecars:
    zum:  #sidecar container
      deps:
        - myservice.dep
    singer:  #sidecar container
      property_sets:
        - myservice.singer
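
The PSDL example can be read as a mapping from a service to its main container and its sidecars. The following is a minimal Python sketch under that reading. The slides do not document the PSDL schema, so the dictionary nesting and the helper names `list_containers` and `parse_environment` are assumptions made for illustration, not Pinterest's implementation.

```python
# Hypothetical in-memory form of the PSDL example. The real schema is not
# documented in the slides, so this nesting is an assumption.
psdl = {
    "version": 1,
    "myservice": {
        "docker": {
            "image": "myservice-server",
            "user": "prod",
            "environment": ["CONFIG_FILE=config/myservice.props"],
        },
        "sidecars": {
            "zum": {"deps": ["myservice.dep"]},
            "singer": {"property_sets": ["myservice.singer"]},
        },
    },
}


def list_containers(description, service):
    """Return the main container image followed by the sidecar names."""
    spec = description[service]
    return [spec["docker"]["image"]] + list(spec.get("sidecars", {}))


def parse_environment(entries):
    """Turn ["KEY=value", ...] into a dict, splitting on the first "="."""
    return dict(entry.split("=", 1) for entry in entries)
```

Keeping sidecars in the same description as the service lets a deploy tool derive every container a host needs from one file.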

Application Runtime: Docker Containers
Diagram labels: Logging, Service Discovery, Metrics, AWS Service, Service Proxy, Config Files, Application, Secrets Management, Process Control
Under the hood:
• Manage container run order for a service & its sidecars
• Default --net=host
• Docker engine running with --live-restore and overlay2 file system
• Local image cleanup (garbage collection)
• Parallel prefetching of images for deploy performance
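
Two of the runtime points, container run order and parallel image prefetching, can be sketched in a few lines of Python. This is an illustration only: the slide does not say which order is used (sidecars first is an assumption here), and `start_order`, `docker_run_argv` and `prefetch` are hypothetical helper names. The `pull` callable is injected so the sketch runs without a Docker daemon.

```python
from concurrent.futures import ThreadPoolExecutor


def start_order(service, sidecars):
    """Assumed order: sidecars first, then the service itself."""
    return list(sidecars) + [service]


def docker_run_argv(name, image):
    """Build a `docker run` command line that uses host networking."""
    return ["docker", "run", "-d", "--name", name, "--net=host", image]


def prefetch(images, pull, max_workers=4):
    """Pull images in parallel; `pull` is any callable taking an image name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() returns results in the same order as the inputs.
        return list(pool.map(pull, images))
```

In a real deploy agent, `pull` could wrap `subprocess.run(["docker", "pull", image], check=True)`, so that a failed pull raises before any container is started.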

State of Migration
Stateless Services Migrated: 75% Hosts

Team Status:

Learnings
• Run Containers and Non-Containers together with adjustable ratio
• Ability to measure & compare metrics
• Automate deploy migration & runtime configuration validation (IAM, Security Groups, Service Discovery)
• Understand Company / Team Dynamics
• Migrate a complex service early
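
One way to run containerized and non-containerized hosts side by side with an adjustable ratio is to assign each host to a group by a stable hash. The slides do not describe how Pinterest selected hosts, so the sketch below is an assumed technique, and `runs_containers` and `split_fleet` are hypothetical names.

```python
import hashlib


def runs_containers(host, ratio):
    """Deterministically place `host` in the containerized group for a ratio in [0.0, 1.0]."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32), scaled to [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < ratio


def split_fleet(hosts, ratio):
    """Return (containerized hosts, legacy hosts), preserving input order."""
    containerized = [h for h in hosts if runs_containers(h, ratio)]
    legacy = [h for h in hosts if not runs_containers(h, ratio)]
    return containerized, legacy
```

Because the bucket for a host never changes, raising the ratio only adds hosts to the containerized group, which keeps metric comparisons between the two groups stable during a rollout.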

Container Orchestration Proof of Concept (MVP)

Networking: High Level
• Use ENI natively supported by EC2
• Also support different CNI plugins (configured by Pod annotations)
• Support AWS IAM role and Security Group on ENI with our own meta-proxy
• Collaborating w/ AWS on amazon-vpc-cni-k8s

Networking: Pod Network Setup (Proxy CNI Daemon/Plugin)
• Proxy CNI invokes the daemon to get the actual CNI and parameters
• Also runs customized commands before and after CNI execution
• Pod spec has an annotation specifying the network mode: pinterest.com/networkmode: xxx
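
The lookup the daemon performs can be pictured as a table from network mode to the actual CNI plugin, its parameters, and the commands to run around it. In the Python sketch below, only the annotation key comes from the slide. The mode names, plugin names, commands and the helpers `resolve_cni` and `plan` are hypothetical.

```python
ANNOTATION = "pinterest.com/networkmode"  # annotation key from the slide

# Hypothetical mode table: the real mode names, plugins and parameters
# are not given in the slides.
MODES = {
    "eni": {"plugin": "eni-cni", "params": {}, "pre": [], "post": []},
    "bridge": {
        "plugin": "bridge",
        "params": {},
        "pre": ["echo pre"],
        "post": ["echo post"],
    },
}


def resolve_cni(annotations, default_mode="eni"):
    """Return the CNI configuration selected by the Pod's annotation."""
    mode = annotations.get(ANNOTATION, default_mode)
    if mode not in MODES:
        raise ValueError("unknown network mode: %r" % mode)
    return MODES[mode]


def plan(annotations):
    """Ordered steps: pre-commands, the actual CNI, then post-commands."""
    cfg = resolve_cni(annotations)
    return list(cfg["pre"]) + ["cni:" + cfg["plugin"]] + list(cfg["post"])
```

Keeping the table in a daemon rather than in the plugin means network modes can be added or changed without redeploying the CNI binary on every node.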

IAM: Pod IAM Setup
• Role set as an annotation of the Pod
• iptables rule redirects to the local metaproxy (Drome)
• Drome consults the Kubelet and gets a token from an external role-assume service
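
The metaproxy flow can be sketched as two steps: map the caller's IP to a Pod and its role annotation, then ask the role-assume service for credentials. The Python below is an illustration under stated assumptions. The annotation key `example.com/iam-role`, the simplified Pod dictionaries (not the Kubelet API's actual shape) and the injected `assume_role` callable are all hypothetical, and nothing here reflects Drome's real interface.

```python
ROLE_ANNOTATION = "example.com/iam-role"  # hypothetical key; the slide does not name it


def role_for_ip(source_ip, pods):
    """Find the role annotated on the Pod that owns `source_ip`."""
    for pod in pods:
        if pod["ip"] == source_ip:
            return pod["annotations"].get(ROLE_ANNOTATION)
    return None


def credentials_for(source_ip, pods, assume_role):
    """Resolve the caller's role, then fetch a token via `assume_role`."""
    role = role_for_ip(source_ip, pods)
    if role is None:
        raise PermissionError("no role for %s" % source_ip)
    return assume_role(role)
```

Refusing requests from unknown IPs, rather than falling back to the host's own role, keeps a Pod without an annotation from inheriting node-level permissions.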

Future

H1 2018
• Productionize the Cluster (Setup & Operations)
• Adoption: initial use-cases are Jenkins and non-critical long running services
• Experiment: Spark & TensorFlow on Kubernetes
• Job Definition Abstraction & Job Submission service (Data Jobs & Long running tasks)
• Service Identity Management & Resource Metering

Thanks!
