
Improving Efficiency of Twitter Infrastructure Using Chargeback


Twitter is powered by thousands of applications that run on our internal Cloud platform, a suite of multi-tenant platform services that offer Compute, Storage, Messaging, Monitoring, and more as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers across multiple zones. This scale makes it difficult to evaluate resource utilization, cost, and efficiency across platforms in a canonical way.
We share how we built a platform-agnostic metering & chargeback infrastructure for Twitter's complex platform topology. Using our Compute platform (powered by Apache Aurora/Mesos) as a case study, we show how both the platform owner and the platform's users used it not only to measure resource utilization and cost (across private/public cloud in a canonical way) but also to improve overall resource utilization and drive the cost-per-core down, leading to huge savings.


Vinu Charanya

August 22, 2016

Transcript

  1. Improving efficiency of Twitter Infrastructure using Chargeback @vinucharanya @micheal

  2. AGENDA • Brief History • Problem • Chargeback • Engineering Challenges • The Product • Impact • Future
  3. © Getty Images from http://www.fifa.com/worldcup/news/y=2010/m=7/news=pride-for-africa-spain-strike-gold-2247372.html 2010

  4. © Getty Images from http://www.fifa.com/worldcup/news/y=2010/m=7/news=pride-for-africa-spain-strike-gold-2247372.html

  5. 3,283 Tweets Per Second (TPS)

  6. 5X increase on avg. TPS: 3,283 Tweets Per Second (TPS)
  7. ©The Simpsons

  8. None
  9. MONOLITH SERVICES

  10. (Micro) Services Oriented Model
    FENCING & OWNERSHIP: clear isolation of services & their ownership
    RELIABILITY: failure isolation and graceful degradation
    SCALABILITY & EFFICIENCY: scale independently, ensuring efficient use of infrastructure
    DEVELOPER PRODUCTIVITY: make it simple for engineers to build and launch services quickly and easily
  11. 2013

  12. August 2 at 7:21:50 PDT

  13. August 2 at 7:21:50 PDT: 143,199 Tweets Per Second (TPS)

  14. 28X increase on avg. TPS: 143,199 Tweets Per Second (TPS)
  15. Hundreds of thousands of #events at any given instant

  16. Most Retweeted Tweet in History

  17. RELIABILITY, DEVELOPER AGILITY, SCALABILITY, EFFICIENCY

  18. “Do More with Less”

  19. Fast forward to 2016

  20. [Architecture diagram]
    INFRASTRUCTURE AND DATACENTER MANAGEMENT
    CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
    PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, REVERSE PROXY
    FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
    MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG, DEPLOY (Workflows)
    DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
    INFRASTRUCTURE SERVICES:
      STORAGE: MANHATTAN (Key-Val Store), HDFS (File System), BLOBSTORE, GRAPH STORE
      COMPUTE: AURORA (Scheduler), HADOOP (Map-Reduce), MESOS (Cluster Manager)
  21. (Same architecture diagram as slide 20)

  22. (Same architecture diagram as slide 20)

  23. (Same architecture diagram as slide 20)

  24. (Same architecture diagram as slide 20)
  25. THOUSANDS OF SERVICES HUNDREDS OF TEAMS

  26. What is the overall use of infrastructure & platform resources across Twitter's services?

  27. How do we attribute resource consumption to teams/organizations?

  28. How do we incentivize the right behavior to improve the efficiency of resource usage?
  29. CHARGEBACK: the ability to meter the allocation and utilization of resources per service and per engineering team, and to charge them accordingly
  30. [Diagram] COMPUTE, STORAGE, PLATFORM AND OTHER SERVICES, with tenants such as Tweet Service, Ads Shard, and Who To Follow
    RESOURCE: unit of abstraction
    MULTI-TENANCY: tenant management using canonical identifiers
  31. (Adds SERVICE IDENTITY and RESOURCE CATALOG to the slide 30 diagram)

  32. (Adds METERING AND CHARGEBACK)

  33. (Adds SERVICE METADATA)

  34. (The whole composed as the UNIFIED CLOUD PLATFORM)
  35. SERVICE IDENTITY

  36. A canonical way of identifying a service that consumes resources on various platform infrastructure.
  37. PROBLEM
    • Disparate identifiers across infrastructure and platform services
    • Multiple provisioning workflows (Self-Serve, Tickets)
    • Disparate ownership trackers (Email, LDAP)
    • Lack of support for public cloud Identity and Access Management (IAM) systems

    Example of disparate identifiers for one project (Project: chargeback; Team: Cloud Infra Mgmt; Source code: /cim):
    COMPUTE: role: cim-service; env: prod; job_name: ui; id: <role>.<env>.<job_name>
    BATCH COMPUTE: role: cim-service; pool: etl_pipe_prod; job_name: compute_cost; id: <role>.<pool>.<job_name>
    STORAGE: app_id: cost_reporting; id: <app_id>
  38. OUR APPROACH
    • Designed an Entity Model that:
      • defines a canonical identifier scheme across infrastructure and platform services
      • defines an ownership structure within the org
    • Single pane of glass for every developer to manage their project IDs (including abstracting out public cloud IAM systems)
    • Provider APIs for infrastructure services to provision and manage identity
    [Diagram: DASHBOARD; IDENTITY MANAGER (PROVISION, CONSUMPTION); API; INFRASTRUCTURE SERVICES]
  39. IMPACT
    • Source of truth for identifier-to-org-structure mapping, improving service ownership within the org
    • Enables service-to-service authentication/authorization
  40. ENTITY MODEL FOR SERVICE IDENTITY
    BUSINESS OWNER →(1:N)→ TEAM →(1:N)→ PROJECT →(1:N)→ SERVICE/SYSTEM ACCOUNT →(1:N)→ <INFRA, CLIENTID>
    A model that provides a canonical identifier across infrastructure and platform services and ties it to an org structure.
  41. ENTITY MODEL: EXAMPLE of services running on Aurora/Mesos
    REVENUE → ADS SERVING → adshard → adshard → <Aurora, adshard.prod.adshard>
    REVENUE → ADS PREDICTION → prediction → ads-prediction → <Aurora, ads-prediction.prod.campaign-x>
  42. RESOURCE CATALOG

  43. A consistent way of identifying and inventorying resources of various platform infrastructure.
  44. PROBLEM
    • Lack of clarity on what is available & how many resources are consumed
    • Need to capture resource fluidity across infrastructure and platform services
    • Better support to model abstract resources (e.g. QPS, Tweets per Second)
    • Need to define the TCO (Total Cost of Ownership) of a resource per unit time
    Resource examples: COMPUTE: CPU, MEMORY, DISK; STORAGE: STORAGE IN GB, WPS, RPS; BATCH COMPUTE: CPU, FILES ACCESSED, STORAGE IN GB
  45. CORES, MEMORY, DISK

    application = Task(
        name = 'application',
        resources = Resources(cpu = 1.0, ram = 512 * MB, disk = 1024 * MB),
        processes = [stage_application, run_application],
        constraints = order(stage_application, run_application))

  46. Need for Fluidity! The same task definition must also accommodate new resource types: GPU, NETWORK.
  47. • Defining a unit price for a resource
    • A framework to price resources
    • Ensuring Total Cost of Ownership, e.g. license cost, chargeback cost from other services, human cost, etc.
    • Support for time granularity, e.g. machines/VMs used per day, cores used per day
  48. Total Cost of Ownership of the Twitter Compute Platform: $X per core-day, spread across
    • Used Cores (Prod and Non-Prod)
    • Operational Overhead
    • Headroom (Disaster Recovery & Event Spikes)
    • Underutilized Quota Allocation: Container Size Buffer (Underutilized Reservation), Excess Quota and Reservation
  49. ENTITY MODEL FOR RESOURCE CATALOG
    PROVIDER →(1:N)→ INFRASTRUCTURE SERVICE →(1:N)→ OFFERINGS →(1:N)→ OFFER MEASURES →(1:N)→ OFFER MEASURE COST
    A model that supports resource fluidity and captures and manages the unit price of a resource over time.
  50. ENTITY MODEL: EXAMPLE of Resource Catalog
    TWITTER DC / PUBLIC CLOUD → AURORA → COMPUTE → CORE-DAYS → $X
    TWITTER DC → HADOOP → STORAGE, PROCESSING CLUSTER → GB-RAM ($M), FILE ACCESSES ($N), …
  51. METERING PIPELINE

  52. HIGH LEVEL ARCHITECTURE

  53. The Product

  54. TEAM/ORG BILL

  55. INFRASTRUCTURE PNL

  56. ORG/TEAM BUDGET

  57. CUSTOM REPORTS
    • Infrastructure & Platform Owners: overall cluster growth; allocation vs. utilization of resources by customer team
    • Service Owners: allocation vs. utilization of resources across each infrastructure & platform
    • Finance: budget management (budget vs. spend)
    • Execs: efficiency; trends
  58. What has been the Impact?

  59. Twitter Compute Platform (Aurora/Mesos), Jun 1, 2015 - Sept 1, 2015 (3 months): Allocated Quota vs. Utilized Cores

  60. Twitter Compute Platform (Aurora/Mesos), Sept 1, 2015 - Jan 1, 2016 (4 months): Allocated Quota vs. Utilized Cores
  61. 33% more core usage against reservation compared to May 2015

  62. IMPACT
    • Ensures true-to-cost unit price computation
    • Input for capacity planning and budgeting
    • Visibility into organizational spend; enables accountability
    • Improved utilization of infrastructure service resources
    • Enables comparison with public cloud offerings
    • Improved service ownership
  63. Kite - Unified Cloud Platform: a cloud-agnostic service lifecycle manager
  64. [Kite architecture] DASHBOARD (SINGLE PANE OF GLASS), REPORTING, SERVICE IDENTITY MANAGER, RESOURCE PROVISIONING MANAGER, SERVICE LIFECYCLE WORKFLOWS, METADATA, RESOURCE QUOTA MANAGEMENT, DEPLOY, METERING & CHARGEBACK, IDENTITY PROVIDER APIS & ADAPTERS, INFRASTRUCTURE & PLATFORM SERVICES
  65. @vinucharanya @dpkagrawal @pragashjj @fvrojas @micheal @igb @imjessicayuen @_jordanly @xcv58

  66. None