Journey from VMs to Containers

Pinterest helps you discover and do what you love. Its infrastructure is built for its scale (over 150M monthly active users across the globe contributing and combing through a billion pins), which imposes unique requirements. Micheal Benedict explains how Pinterest, a company that has run on VMs in the public cloud since its inception, moved to containers.

Topics include:

Pinterest’s infrastructure (offline compute, online serving)
VMs versus containers, with regard to developer velocity, service reliability, infrastructure governance, and efficiency
Moving to containers, using Docker
Pinterest’s compute platform

Transcript

  1. Micheal Benedict
    • I am @micheal
    • Eng Manager, Infrastructure. Focus on:
        • Continuous Delivery
        • Kubernetes
        • Infrastructure Governance
  2. Pin: A bookmark someone has saved from the internet to a board they’ve created.
    [Figure: example Pin card, "The perfect path to cold brew" (Caffeinated Inc.), Omar Seyal, Cravings]
  3. [Figure: the example Pin card from slide 2, shown again]
  4. Board: A greater collection of ideas.
    [Figure: the same example Pin card]
  5. 200m+ People on Pinterest each month • 100b+ Pins • 3b+ Boards • 10b+ Recommendations/Day • 450+ Engineers
  6. How to recommend for this pin?
    Challenges: Graph traversal, Candidate generation, Scoring & Serving across billions of objects
    [Figure: graph of Boards and Pins]
    Source: The Interplay of User Experience & Machine Learning by Vanja Josifovski @Pinterest
  7. How to recommend for this pin?
    Challenges: Graph traversal, Candidate generation, Scoring & Serving across billions of objects
    [Figure: graph of Boards and Pins]
    • Random walks from a node = Personalized PageRank.
    • More connections = higher score.
    • 100K+ steps < 50ms
    Source: The Interplay of User Experience & Machine Learning by Vanja Josifovski @Pinterest
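
The random-walk idea on slide 7 can be illustrated with a minimal sketch. This is not Pinterest's production system: the toy graph, the `pin:`/`board:` naming, the restart probability, and the function names are all assumptions made for illustration. It counts how often a restarting random walk from the query pin lands on each node, which approximates Personalized PageRank, so pins that share more boards with the query pin collect more visits.

```python
import random
from collections import Counter

# Toy pin/board graph (hypothetical data): each pin links to the boards that
# contain it, and each board links back to its pins.
EXAMPLE_GRAPH = {
    "pin:cold_brew": ["board:cravings", "board:coffee"],
    "pin:espresso": ["board:cravings", "board:coffee"],
    "pin:latte_art": ["board:coffee"],
    "pin:hiking": ["board:outdoors"],
    "board:cravings": ["pin:cold_brew", "pin:espresso"],
    "board:coffee": ["pin:cold_brew", "pin:espresso", "pin:latte_art"],
    "board:outdoors": ["pin:hiking"],
}


def random_walk_scores(graph, start, steps=100_000, restart_prob=0.5, seed=0):
    """Count node visits of a random walk that keeps restarting at `start`."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        neighbors = graph.get(node, [])
        # Jump back to the query node on a dead end or with probability
        # restart_prob; the restarts are what personalize the scores.
        if not neighbors or rng.random() < restart_prob:
            node = start
        else:
            node = rng.choice(neighbors)
        visits[node] += 1
    return visits


def recommend(graph, start, top_k=2, **walk_args):
    """Return the top_k most-visited pins, excluding the query pin itself."""
    scores = random_walk_scores(graph, start, **walk_args)
    candidates = [
        (node, count)
        for node, count in scores.items()
        if node.startswith("pin:") and node != start
    ]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [node for node, _ in candidates[:top_k]]
```

On the toy graph, `recommend(EXAMPLE_GRAPH, "pin:cold_brew")` ranks `pin:espresso` ahead of `pin:latte_art` because it shares two boards with the query pin rather than one, and `pin:hiking` is never reached.
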
  8. Visual Search
    Define visual similarity between any visual object and images in a dataset, in real time.
    Object Detection
  9. Visual Search
    Define visual similarity between any visual object and images in a dataset, in real time.
    Lens • Near Real-time
  10. Overall Architecture (Serving & Analytics)
    [Diagram; labels recovered from the slide: iOS, Android, Web, Mobile Web • API, Cache, µService, NoSQL, Index, Sharded DB • Message Bus • Big Data Storage • Hadoop / Spark / TensorFlow • Hive, Presto, Impala • Streaming, Hybrid, Batch • Machine Learning, Neural Nets, … • Analytics, Dashboard]
    Source: Thinking with both sides of the brain by David Chaiken @Pinterest
  11. Vision: fastest path from an idea to production, without worrying about infrastructure
  12. [Same as slide 11]
  13. Focus #1: Simplify E2E Dev XP
    What are the steps a developer is required (but not expected) to do when building, launching & managing services, batch jobs, etc.?
  14. Focus #2: An integrated Infra Platform
    What is required to build a reliable, scalable, efficient & well-integrated infrastructure platform?
  15. Focus #3: Infra Governance
    Without hampering developer experience or adding ops work, what controls are required to effectively utilize & manage infrastructure?
  16. Scope
    DEV XP: UI • CLI • API
    SETUP • TEST & BUILD • UNIT TEST • IMAGE MANAGEMENT • OPERATIONS • METRICS • LOGS • TRACING • DELIVERY • WORKFLOW MANAGEMENT • JOB SUBMISSION • INTEGRATION TEST • OWNERSHIP • SCAFFOLDING • ROLES, KEYS & SECRETS • RESOURCE MANAGEMENT • QUOTA • AMI MANAGEMENT • CLUSTER PROVISIONING • METERING • HEALTH CHECK • JOB STATUS • JOB CONFIG
  17. Containers @Pinterest: Timeline (axis: H1 2016, H2 2016, H1 2017, H2 2017, H1 2018)
    Kickoff
    Phase 1: Docker MVP
        • Developer Workflow
        • Image Management
        • Integration w/ existing security & networking systems
        • First Production Service migrated
    Phase 2: Productionize Docker & Adoption
        • Metric, logging, security and high availability support.
        • Fully production ready and over one hundred services migrated (+API fleet)
  18. Container Orchestration @Pinterest: Timeline (axis: H1 2017, H2 2017, H1 2018, H2 2018)
    Kickoff
        • Motivation / Evaluation
        • MVP build & Operate production cluster for a use-case
    Phase 1: Onboard workloads (non-serving, batch type)
        • Ad hoc job submission (Tooling)
        • Onboarded Jenkins
        • Onboard JupyterHub
        • Prototyped TensorFlow (using Kubeflow)
    Phase 2: Onboard workloads (serving but non-critical)
        • Productionize TensorFlow (using Kubeflow)
        • Onboard non-critical serving workloads
        • Deployment workflow manager*
        • Infrastructure Governance
  19. Container Orchestration
    CHOICES • POC • CRITERIA • OUTCOME
    • Resource and task Scheduling (Flexibility, Multi-Tenancy, Extensibility etc.)
    • Scalability
    • Integration Cost
    • Docker Support, Sidecar support and Runtime extensibility
    • Network Support on AWS*
    • Security Support on AWS*
    • Stateful Service Support
    • Ecosystem and Community
    • Cluster Operations & Support
  20. Adopting K8S: Cluster
    • Self-hosted vs. Managed - Using a combination of both. kops (for self-hosted) & etcd-manager
    • Number of Clusters - Mixed opinions. POC to evaluate burden vs. flexibility
    • HA Strategy - Go cross-AZ at the minimum. Multi-region is flaky (without federation)
    • Ingress - Mostly for internal web tools, using Amazon’s ALB (Inter-VPC routing key)
    • Machine Types (homogeneous vs. heterogeneous) - Use node scheduling policy, taints & labels judiciously. POC to capture benefits of diverse instance types & workloads.
    • Maintenance - Decide SLOs upfront
    • Stateless vs. Persistent vs. Durable Store - Leaning towards Persistent
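
The "taints & labels" point on slide 20 can be made concrete with a small sketch. It is a simplified model of how Kubernetes decides whether a pod may land on a node (node labels matched by `nodeSelector`, node taints matched by pod tolerations); it is not the scheduler's actual code, and the label key `workload-class`, the taint `dedicated=ml`, and the function names are hypothetical.

```python
def tolerates(toleration, taint):
    """Simplified Kubernetes rule for whether a toleration matches a taint."""
    # An empty effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def can_schedule(pod, node):
    """True if the node's labels satisfy the pod and all taints are tolerated."""
    labels = node.get("labels", {})
    for key, value in pod.get("nodeSelector", {}).items():
        if labels.get(key) != value:
            return False
    blocking = [
        taint for taint in node.get("taints", [])
        if taint["effect"] in ("NoSchedule", "NoExecute")
    ]
    tolerations = pod.get("tolerations", [])
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


# Hypothetical heterogeneous cluster: one tainted GPU node, one general node.
GPU_NODE = {
    "labels": {"workload-class": "ml"},
    "taints": [{"key": "dedicated", "value": "ml", "effect": "NoSchedule"}],
}
GENERAL_NODE = {"labels": {"workload-class": "general"}, "taints": []}

TRAINING_POD = {
    "nodeSelector": {"workload-class": "ml"},
    "tolerations": [
        {"key": "dedicated", "operator": "Equal", "value": "ml", "effect": "NoSchedule"}
    ],
}
WEB_POD = {}  # no selector and no tolerations
```

In this model the taint keeps `WEB_POD` off the GPU node, while the label selector keeps `TRAINING_POD` off the general node; using both together is what pins a workload to a dedicated instance type.
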
  21. Adopting K8S: Platform
    • Security[1] - Ensure workloads can be authenticated & access controlled (enforce or trust/verify)
    • Networking[2] - Offer dedicated & shared
    • Service Discovery - Support existing and provide path to move to new
        • Pinterest’s internal solution (ZUM) backed by ZooKeeper.
        • Next generation is Envoy based (still POC)
    • Ingress - For internal services, everyone likes Heroku! Expose HTTP and provide sharable URL
        • Contour (github.com/heptio/contour)
    • Metrics & Logging - Observe pods automatically (both metrics & logging)
        • Offer tiered SLOs (ex, Application Logging > Debuggability)
    • Governance - Ensure ownership of Jobs, Quota and integration with Chargeback*
  22. K8S Platform - Security[1]: Pod IAM Setup
    • Role set as annotation of Pod
    • IPTables rule redirects to local meta-proxy (Drome)
    • Drome agent consults the Kubelet, acquires token from the “Role Assume Service”
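
The three steps on slide 22 can be sketched as follows. This is only an illustration of the flow, not Drome's implementation: the annotation key, the function names, and the two injected callables (standing in for the Kubelet pod listing and the "Role Assume Service") are assumptions. The only real-world details relied on are the standard pod fields `metadata.annotations` and `status.podIP`.

```python
# Hypothetical annotation key; the slide does not name the real one.
ROLE_ANNOTATION = "example.com/iam-role"


def find_pod_by_ip(pods, source_ip):
    """Return the pod on this node whose IP matches the redirected request."""
    for pod in pods:
        if pod.get("status", {}).get("podIP") == source_ip:
            return pod
    return None


def credentials_for_request(source_ip, list_local_pods, assume_role):
    """Resolve a credential request that iptables redirected to the local proxy.

    list_local_pods: callable returning the node's pods (stand-in for the Kubelet).
    assume_role: callable mapping a role name to a token (stand-in for the
    "Role Assume Service").
    """
    pod = find_pod_by_ip(list_local_pods(), source_ip)
    if pod is None:
        raise LookupError(f"no local pod has IP {source_ip}")
    role = pod.get("metadata", {}).get("annotations", {}).get(ROLE_ANNOTATION)
    if role is None:
        raise PermissionError("pod has no IAM role annotation")
    return assume_role(role)
```

Injecting the two callables keeps the sketch testable without a cluster; a pod without the annotation gets no credentials rather than falling back to the node's role.
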
  23. K8S Platform - Networking[2]: Support for ENI & Bridge mode
    • Support for AWS IAM role and Security Group, Network Isolation and VPC routable IP
    • AWS’s Elastic Network Interfaces per pod.
    • Support different CNI plugins (configured by Pod annotations)
    • Collaborating w/ AWS on amazon-vpc-cni-k8s
  24. Adopting K8S: Application Tooling
    • Developer Experience: Define the dev and prod deployment user experience upfront
    • CLI - PinCloud
    • UI - Infrastructure Console
    • App Configs - Pinterest Service Description Spec
    • Canonical JobTypes, Ownership, SidecarConfig
    • Deployment workflow manager*
    • Job Submission Service - Manage canonical job metadata (agnostic of underlying compute infra)
    [Diagram labels: UI/CLI, JOB SUBMISSION SERVICE, WORKFLOW MANAGER, K8S, HADOOP]
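
One way to picture the Job Submission Service on slide 24 is a thin layer that stores canonical job metadata and hands the job to whichever compute backend is registered for its type. The sketch below is an assumption-laden illustration, not Pinterest's service: the class names, fields, job types, and backend functions are all hypothetical, chosen to mirror the slide's "JobTypes, Ownership, SidecarConfig" and its K8S and Hadoop boxes.

```python
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    """Canonical job metadata, independent of where the job runs."""
    name: str
    owner: str
    job_type: str  # hypothetical canonical types, e.g. "service" or "batch"
    sidecars: list = field(default_factory=list)


class JobSubmissionService:
    def __init__(self):
        self._backends = {}  # job_type -> callable that submits a JobSpec
        self._jobs = {}      # job name -> recorded metadata

    def register_backend(self, job_type, submit_fn):
        self._backends[job_type] = submit_fn

    def submit(self, spec):
        # Ownership is mandatory so quota and chargeback can be attributed.
        if not spec.owner:
            raise ValueError("every job needs an owner")
        backend = self._backends.get(spec.job_type)
        if backend is None:
            raise ValueError(f"no backend for job type {spec.job_type!r}")
        handle = backend(spec)
        self._jobs[spec.name] = {"spec": spec, "handle": handle}
        return handle

    def owner_of(self, name):
        return self._jobs[name]["spec"].owner


# Stand-in backends; real ones would call the Kubernetes or Hadoop APIs.
def submit_to_k8s(spec):
    return f"k8s/{spec.name}"


def submit_to_hadoop(spec):
    return f"hadoop/{spec.name}"
```

A caller would register `submit_to_k8s` for `"service"` and `submit_to_hadoop` for `"batch"`, then submit `JobSpec` objects without knowing which cluster runs them; moving a job type to a different backend changes one registration, not every client.
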