
Improving Efficiency of Twitter Infrastructure Using Chargeback

Twitter is powered by thousands of applications that run on our internal Cloud platform, a suite of multi-tenant platform services that offer Compute, Storage, Messaging, Monitoring, etc., as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers, across multiple zones. This scale makes it difficult to evaluate resource utilization, cost & efficiency across platforms in a canonical way.
We share how we built a platform-agnostic metering & chargeback infrastructure for Twitter's complex platform topology. We use our Compute platform (powered by Apache Aurora/Mesos) as a case study to show how both the platform owner and the users of the platform used it not only to measure resource utilization & cost (across private/public Cloud in a canonical way) but also to improve overall resource utilization & drive the cost-per-core down, leading to significant savings.

Vinu Charanya

August 22, 2016

Transcript

  1. (Micro) Services Oriented Model
     FENCING & OWNERSHIP: Clear isolation of services & their ownership.
     RELIABILITY: Failure isolation and graceful degradation.
     SCALABILITY & EFFICIENCY: Scale independently, ensuring efficient use of infrastructure.
     DEVELOPER PRODUCTIVITY: Make it simple for engineers to build and launch services quickly and easily.
  2. August 2 at 7:21:50 PDT: 143,199 Tweets Per Sec (TPS), a 28X increase on avg. TPS.
  3. Twitter's platform architecture overview:
     INFRASTRUCTURE AND DATACENTER MANAGEMENT
     CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
     PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, REVERSE PROXY
     FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
     MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG, DEPLOY (Workflows)
     DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
     INFRASTRUCTURE SERVICES:
       STORAGE: MANHATTAN (Key-Val Store), HDFS (File System), BLOBSTORE, GRAPH STORE
       COMPUTE: AURORA (Scheduler), HADOOP (Map-Reduce), MESOS (Cluster Manager)
  4. (Repeats the platform architecture overview from slide 3.)
  5. (Repeats the platform architecture overview from slide 3.)
  6. (Repeats the platform architecture overview from slide 3.)
  7. (Repeats the platform architecture overview from slide 3.)
  8. What is the overall use of infrastructure & platform resources across Twitter's services?
  9. What is the overall use of infrastructure & platform resources across Twitter's services? How to attribute resource consumption to teams/organizations?
  10. What is the overall use of infrastructure & platform resources across Twitter's services? How to attribute resource consumption to teams/organizations? How do you incentivize the right behavior to improve efficiency of resource usage?
  11. CHARGEBACK: the ability to meter allocation and utilization of resources per service, per engineering team, and charge them accordingly.
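A minimal sketch of the core chargeback idea stated on this slide, not Twitter's actual implementation: a tenant's charge is its metered usage of each offering multiplied by that offering's unit price. The names (MeterRecord, UNIT_PRICE, chargeback) and the prices are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class MeterRecord:
        """One metered measurement: how much of an offering a service used."""
        service_id: str   # canonical identifier, e.g. "adshard.prod.adshard"
        offering: str     # e.g. "aurora-compute"
        measure: str      # e.g. "core-days"
        quantity: float   # metered amount for the billing period

    # Hypothetical unit prices per (offering, measure), in dollars.
    UNIT_PRICE = {("aurora-compute", "core-days"): 1.25,
                  ("hdfs-storage", "gb-days"): 0.002}

    def chargeback(records: list[MeterRecord]) -> dict[str, float]:
        """Aggregate metered usage into a dollar charge per service."""
        bill: dict[str, float] = {}
        for r in records:
            price = UNIT_PRICE[(r.offering, r.measure)]
            bill[r.service_id] = bill.get(r.service_id, 0.0) + r.quantity * price
        return bill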
  12. Multi-tenant platforms (COMPUTE, STORAGE, PLATFORM AND OTHER SERVICES) host many services, e.g. Tweet Service, Ads Shard, Who To Follow. RESOURCE: the unit of abstraction. MULTI-TENANCY: tenant management using canonical identifiers.
  13. Adds SERVICE IDENTITY and RESOURCE CATALOG to the picture from slide 12.
  14. Adds METERING AND CHARGEBACK.
  15. Adds SERVICE METADATA.
  16. UNIFIED CLOUD PLATFORM: SERVICE IDENTITY, RESOURCE CATALOG, METERING AND CHARGEBACK, and SERVICE METADATA, spanning COMPUTE, STORAGE, and PLATFORM AND OTHER SERVICES, with services such as Tweet Service, Ads Shard, and Who To Follow as tenants.
  17. PROBLEM
      • Disparate identifiers across infrastructure and platform services
      • Multiple provisioning workflows (Self-Serve, Tickets)
      • Disparate ownership trackers (Email, LDAP)
      • Lack of support for public cloud Identity and Access Management (IAM) systems
      Example identifier schemes:
      COMPUTE: role: cim-service; env: prod; job_name: ui; id: <role>.<env>.<job_name>
      BATCH COMPUTE: role: cim-service; pool: etl_pipe_prod; job_name: compute_cost; id: <role>.<pool>.<job_name>
      STORAGE: app_id: cost_reporting; id: <app_id>
      Project metadata: Project: chargeback; Team: Cloud Infra Mgmt; Source code: /cim
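A minimal sketch of the mapping this problem implies, with a hypothetical registry and helper name: each platform-native identifier above must be resolved back to a single owning project before any usage can be attributed or charged.

    # Illustrative registry, not Twitter's actual data: platform-native ids
    # built from the schemes above, resolved to the owning project.
    NATIVE_TO_PROJECT = {
        ("compute", "cim-service.prod.ui"): "chargeback",
        ("batch-compute", "cim-service.etl_pipe_prod.compute_cost"): "chargeback",
        ("storage", "cost_reporting"): "chargeback",
    }

    def owning_project(platform: str, native_id: str) -> str:
        """Map a platform-specific identifier back to one project/team."""
        return NATIVE_TO_PROJECT[(platform, native_id)]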
  18. OUR APPROACH: an IDENTITY MANAGER (with a DASHBOARD and PROVISION/CONSUMPTION APIs) in front of the infrastructure services.
      • Designed an Entity Model that:
        • Defines a canonical identifier scheme across infrastructure and platform services
        • Defines an ownership structure aligned with the org
      • Single pane of glass for every developer to manage their project IDs (including abstracting out public cloud IAM systems)
      • Provider APIs for infrastructure services to provision and manage identity
  19. IMPACT
      • Source of truth for identifier-to-org-structure mapping, improving service ownership within the org
      • Enables service-to-service authentication/authorization
  20. ENTITY MODEL FOR SERVICE IDENTITY: a model that provides a canonical identifier across infrastructure and platform services and ties it to an org structure.
      BUSINESS OWNER → TEAM → PROJECT → SERVICE/SYSTEM ACCOUNT → <INFRA, CLIENTID> (each relationship is 1:N)
  21. ENTITY MODEL: EXAMPLE of services running on Aurora/Mesos
      (business owner → team → project → service/system account → <infra, client id>)
      REVENUE → ADS SERVING → adshard → adshard → <Aurora, adshard.prod.adshard>
      REVENUE → ADS PREDICTION → prediction → ads-prediction → <Aurora, ads-prediction.prod.campaign-x>
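A minimal sketch of this entity model expressed as data, assuming hypothetical class names; the hierarchy and the adshard/ads-prediction example come straight from the two slides above.

    from dataclasses import dataclass, field

    @dataclass
    class ServiceAccount:
        name: str
        client_ids: list[tuple[str, str]] = field(default_factory=list)  # (infra, client id), 1:N

    @dataclass
    class Project:
        name: str
        accounts: list[ServiceAccount] = field(default_factory=list)     # 1:N

    @dataclass
    class Team:
        name: str
        projects: list[Project] = field(default_factory=list)            # 1:N

    @dataclass
    class BusinessOwner:
        name: str
        teams: list[Team] = field(default_factory=list)                  # 1:N

    # The example from the slide, expressed in this model.
    revenue = BusinessOwner("REVENUE", teams=[
        Team("ADS SERVING", projects=[
            Project("adshard", accounts=[
                ServiceAccount("adshard",
                               client_ids=[("Aurora", "adshard.prod.adshard")]),
            ]),
        ]),
        Team("ADS PREDICTION", projects=[
            Project("prediction", accounts=[
                ServiceAccount("ads-prediction",
                               client_ids=[("Aurora", "ads-prediction.prod.campaign-x")]),
            ]),
        ]),
    ])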
  22. PROBLEM
      • Lack of clarity on what is available & how many resources are consumed
      • Need to capture resource fluidity across infrastructure and platform services
      • Better support needed to model abstract resources (e.g., QPS, Tweets per Second)
      • Need to define the TCO (Total Cost of Ownership) of a resource per unit time
      Example resource dimensions: COMPUTE: CPU, MEMORY, DISK; STORAGE: STORAGE IN GB, WPS, RPS; BATCH COMPUTE: CPU, FILES ACCESSED, STORAGE IN GB
  23. CORES, MEMORY, DISK:
      application = Task(
        name = 'application',
        resources = Resources(cpu = 1.0, ram = 512 * MB, disk = 1024 * MB),
        processes = [stage_application, run_application],
        constraints = order(stage_application, run_application))
  24. The same Task config, but now with GPU and NETWORK dimensions: need for fluidity!
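A minimal sketch of what a more fluid resource vocabulary could look like, assuming a hypothetical open mapping of measures rather than the fixed cpu/ram/disk fields above; this illustrates the idea, not Aurora's actual API.

    # Hypothetical: resources as an open set of (measure, quantity) pairs so
    # new dimensions (gpu, network) can be added without schema changes.
    FLUID_RESOURCES = {
        "cpu": 1.0,           # cores
        "ram_mb": 512,
        "disk_mb": 1024,
        "gpu": 1,             # new dimension, no code change needed
        "network_mbps": 100,  # new dimension, no code change needed
    }

    def metered_quantities(resources: dict[str, float], hours: float) -> dict[str, float]:
        """Turn a reservation into metered quantities for a billing window."""
        return {measure: qty * hours for measure, qty in resources.items()}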
  25. • Defining a unit price for a resource
      • A framework to price resources
      • Ensuring true Total Cost of Ownership, e.g. license cost, chargeback cost from other services, human cost, etc.
      • Support for time granularity, e.g. machines/VMs used per day, cores used per day
  26. Total Cost of Ownership of the Twitter Compute Platform, expressed as $X per core-day. Components: Used Cores, Non-Prod Used Cores, Operational Overhead, Headroom (Disaster Recovery & Event Spikes), Underutilized Quota Allocation (Excess Quota and Reservation), Container Size Buffer (Underutilized Reservation).
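A minimal sketch of how an effective cost-per-core-day could be derived from such a breakdown; every dollar figure, component name, and core count below is made up for illustration and not from the deck.

    # Illustrative numbers only: total platform cost for a period divided by
    # the cores actually used, so overheads (headroom, underutilized quota,
    # buffers) inflate the effective unit price and create the incentive to
    # reduce them.
    monthly_cost = {
        "servers_and_power": 1_000_000.0,
        "operational_overhead": 150_000.0,
        "licenses_and_other_chargeback": 50_000.0,
    }
    days_in_month = 30
    total_cores = 100_000   # cores provisioned in the cluster
    used_cores = 60_000     # cores actually used by prod + non-prod work

    total_cost = sum(monthly_cost.values())
    cost_per_core_day_provisioned = total_cost / (total_cores * days_in_month)
    cost_per_core_day_used = total_cost / (used_cores * days_in_month)

    print(f"per provisioned core-day: ${cost_per_core_day_provisioned:.3f}")
    print(f"per used core-day:        ${cost_per_core_day_used:.3f}")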
  27. ENTITY MODEL FOR RESOURCE CATALOG: a model that supports resource fluidity and captures and manages the unit price of a resource over time.
      PROVIDER → INFRASTRUCTURE SERVICE → OFFERINGS → OFFER MEASURES → OFFER MEASURE COST (each relationship is 1:N)
  28. ENTITY MODEL: EXAMPLE of Resource Catalog
      (provider → infrastructure service → offerings → offer measures → offer measure costs)
      TWITTER DC / PUBLIC CLOUD → AURORA → COMPUTE → CORE-DAYS → $X
      TWITTER DC → HADOOP → STORAGE, PROCESSING CLUSTER, … → GB-RAM, FILE ACCESSES, … → $X, $Y, …, $M, $N, …
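A minimal sketch of the catalog entities as data, assuming hypothetical class names; the Aurora row from the example above is filled in, with a placeholder price and effective date.

    from dataclasses import dataclass

    @dataclass
    class OfferMeasureCost:
        measure: str          # e.g. "CORE-DAYS"
        unit_price: float     # dollars per unit of the measure
        effective_from: str   # prices can change over time

    @dataclass
    class Offering:
        name: str                        # e.g. "COMPUTE"
        measures: list[OfferMeasureCost]

    @dataclass
    class InfrastructureService:
        name: str                        # e.g. "AURORA"
        offerings: list[Offering]

    @dataclass
    class Provider:
        name: str                        # e.g. "TWITTER DC" or "PUBLIC CLOUD"
        services: list[InfrastructureService]

    # The Aurora example from the slide; price and date are placeholders.
    catalog = [
        Provider("TWITTER DC", services=[
            InfrastructureService("AURORA", offerings=[
                Offering("COMPUTE", measures=[
                    OfferMeasureCost("CORE-DAYS", unit_price=1.25,
                                     effective_from="2015-06-01"),
                ]),
            ]),
        ]),
    ]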
  29. CUSTOM REPORTS
      • Infrastructure & Platform Owners: overall cluster growth; allocation vs. utilization of resources by customer team
      • Service Owners: allocation vs. utilization of resources across each infrastructure & platform
      • Finance: budget management (budget vs. spend)
      • Execs: efficiency; trends
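A minimal sketch of the allocation-vs-utilization view these reports imply; the team names, field names, and core counts are illustrative assumptions, not real data.

    # Illustrative data: allocated quota vs. utilized cores per customer team.
    usage = {
        "ads-serving":   {"allocated_cores": 12_000, "utilized_cores": 7_500},
        "tweet-service": {"allocated_cores": 8_000,  "utilized_cores": 6_800},
    }

    for team, u in usage.items():
        utilization = u["utilized_cores"] / u["allocated_cores"]
        # Low utilization flags excess quota that could be reclaimed.
        print(f"{team}: {utilization:.0%} of allocated quota utilized")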
  30. [Chart] Twitter Compute Platform (Aurora/Mesos), 3 months (Jun 1, 2015 - Sept 1, 2015): Allocated Quota vs. Utilized Cores.
  31. [Chart] Twitter Compute Platform (Aurora/Mesos), 4 months (Sept 1, 2015 - Jan 1, 2016): Allocated Quota vs. Utilized Cores.
  32. IMPACT
      • Ensures true-to-cost unit price computation
      • Input for capacity planning and budgeting
      • Visibility into organizational spend; enables accountability
      • Improved utilization of infrastructure service resources
      • Enables comparison with public cloud offerings
      • Improved service ownership
  33. Overall architecture: DASHBOARD (SINGLE PANE OF GLASS) with SERVICE IDENTITY MANAGER, RESOURCE PROVISIONING MANAGER, METERING & CHARGEBACK, and REPORTING; supporting components: SERVICE LIFECYCLE WORKFLOWS, METADATA, RESOURCE QUOTA MANAGEMENT, DEPLOY, and IDENTITY PROVIDER APIS & ADAPTERS; all layered over the INFRASTRUCTURE & PLATFORM SERVICES.