[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructure utilization and efficiency at scale

Twitter is powered by thousands of microservices running on an internal cloud platform: a suite of multitenant platform services that offer compute, storage, messaging, monitoring, and more as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers, both on-premises and in the public cloud. The scale and diversity of Twitter’s multitenant infrastructure services make it extremely difficult to forecast capacity, compute resource utilization and cost, and drive efficiency.

Vinu Charanya explains how she and her team are building a system that captures, defines, provisions, meters, and charges infrastructure resources, redefining how systems are built atop Twitter infrastructure. The infrastructure resources include primitive bare metal servers and VMs in the public cloud and abstract resources offered by multitenant services such as a compute platform (powered by Apache Aurora and Mesos), storage (Manhattan for key-value, cache, RDBMS), and observability. Along the way, Vinu shares how Twitter used this data to better plan capacity and drive a cultural change in engineering that helped improve overall resource utilization and led to significant savings in infrastructure spending.

Vinu Charanya

October 04, 2017

Transcript

  1. Agenda

    1. History & Context
    2. Chargeback @Twitter
    3. Kite - Service Lifecycle Manager
    4. Impact & Future Work
  2. [Platform stack diagram]

    • INFRASTRUCTURE & DATACENTER MANAGEMENT
    • CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
    • PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
    • FRAMEWORK / LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
    • MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
    • DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
    • INFRASTRUCTURE SERVICES
      STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
      COMPUTE: MESOS/AURORA, HADOOP
      DB / DW: MYSQL, VERTICA, POSTGRES
    • DEPLOY (Workflows)
  3. [Chart: Number of Servers for MESOS/AURORA, HADOOP, MANHATTAN; 67%]

    How to get visibility into resources used by individual jobs & datasets?
  4. How to attribute resource consumption to teams/organization?
  5. How do you incentivize the right behavior to improve efficiency of resource usage?
  6. Chargeback @Twitter

    Ability to meter allocation & utilization of resources per service, per project, per engineering team to improve visibility & enable accountability
  7. Chargeback @Twitter: Support diverse Infrastructure and Platform Services

    1. Resource Catalog: Consistent way to inventory infrastructure resources
       • Resource Fluidity: Support primitive (CPU) and abstract resource (“Tweets / second”). Extend existing resource
    2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability
  10. RESOURCE CATALOG ENTITY MODEL

    PROVIDER (1:N) INFRASTRUCTURE SERVICE (1:N) OFFERINGS (1:N) OFFER MEASURES (1:N) OFFER MEASURE COST
    Example values shown: TWITTER DC / PUBLIC CLOUD, COMPUTE, CORE-DAYS, $X
  11. RESOURCE CATALOG ENTITY MODEL

    PROVIDER (1:N) INFRASTRUCTURE SERVICE (1:N) OFFERINGS (1:N) OFFER MEASURES (1:N) OFFER MEASURE COST
    Example values shown: TWITTER DC / PUBLIC CLOUD, COMPUTE, CORE-DAYS, $X
    Further example values shown: TWITTER DC, STORAGE, PROCESSING CLUSTER, GB-RAM, FILE ACCESSES, …, with costs $X, $Y, …, $M, $N, …
  12. Chargeback @Twitter: /api/1/measures

    { measures: [
      { "measure_id": 1, "measure_label": "core-days", "measure_unit_label": "per 1 core-day",
        "offering_id": 1, "offering_label": "Compute",
        "infrastructure_id": 1, "infrastructure_name": "Aurora" },
      { "measure_id": 2, "measure_label": "machine-days", "measure_unit_label": "per 1 machine-day",
        "offering_id": 2, "offering_label": "zone:aquila",
        "infrastructure_id": 8, "infrastructure_name": "Physical Infrastructure", },
      { …
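The resource catalog hierarchy and the flat measure records on the `/api/1/measures` slide could be related roughly as follows. This is only a sketch: the class names and `flatten_measures` are hypothetical, and only the field names and the two sample records are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Measure:
    measure_id: int
    label: str       # e.g. "core-days"
    unit_label: str  # e.g. "per 1 core-day"


@dataclass
class Offering:
    offering_id: int
    label: str       # e.g. "Compute"
    measures: List[Measure] = field(default_factory=list)


@dataclass
class InfrastructureService:
    infrastructure_id: int
    name: str        # e.g. "Aurora"
    offerings: List[Offering] = field(default_factory=list)


def flatten_measures(services: List[InfrastructureService]) -> List[Dict[str, object]]:
    """Walk the 1:N hierarchy and emit one flat record per measure."""
    rows: List[Dict[str, object]] = []
    for svc in services:
        for off in svc.offerings:
            for m in off.measures:
                rows.append({
                    "measure_id": m.measure_id,
                    "measure_label": m.label,
                    "measure_unit_label": m.unit_label,
                    "offering_id": off.offering_id,
                    "offering_label": off.label,
                    "infrastructure_id": svc.infrastructure_id,
                    "infrastructure_name": svc.name,
                })
    return rows


# The two records visible on the slide, rebuilt from the hierarchy.
aurora = InfrastructureService(1, "Aurora", [
    Offering(1, "Compute", [Measure(1, "core-days", "per 1 core-day")]),
])
physical = InfrastructureService(8, "Physical Infrastructure", [
    Offering(2, "zone:aquila", [Measure(2, "machine-days", "per 1 machine-day")]),
])
measures = flatten_measures([aurora, physical])
```

Keeping the hierarchy nested and flattening only at the API boundary is one way to let a new offering or measure be added to an existing service without touching other records.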
  13. So, how do you incentivize the right behavior to improve efficiency of resource usage?
  14. Total Cost of Ownership for Aurora: $X core-day

    [Diagram, built up over slides 14 to 18: Cost of Physical Server ($X / day) and Total available Cores]
    Labels:
    • Operational Overhead
    • Headroom
    • Production Used Cores
    • Non-Prod Used Cores
    • Quota Buffer (Underutilized Quota)
    • Container Size Buffer (Underutilized Reservation)
    Callouts added on successive slides:
    • Total used Cores
    • Excess Cores (incl. DR, Spikes, Overallocation)
    • Cores used by platform for operations & maintenance
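The slides do not show how the core-day price is derived from these quantities. One plausible reading is that the full server cost is spread over the cores tenants can actually use, so overhead and headroom are amortized into the unit price. The sketch below assumes exactly that; the function name, parameters, and all numbers are hypothetical.

```python
def core_day_price(server_cost_per_day: float,
                   servers: int,
                   cores_per_server: int,
                   overhead_fraction: float,
                   headroom_fraction: float) -> float:
    """Price of one core-day when operational overhead and headroom
    are amortized over the remaining, billable cores (an assumption)."""
    total_cores = servers * cores_per_server
    billable_cores = total_cores * (1.0 - overhead_fraction - headroom_fraction)
    if billable_cores <= 0:
        raise ValueError("no billable cores left after overhead and headroom")
    total_cost_per_day = server_cost_per_day * servers
    return total_cost_per_day / billable_cores


# Hypothetical cluster: 100 servers, 32 cores each, 10.0 per server per day.
price = core_day_price(server_cost_per_day=10.0, servers=100,
                       cores_per_server=32,
                       overhead_fraction=0.05, headroom_fraction=0.20)

# Price if every core were billable, for comparison.
naive_price = (10.0 * 100) / (100 * 32)
```

Under this assumption the amortized price is always higher than the naive cost per core, which makes the cost of unused headroom visible to tenants.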
  19. Chargeback @Twitter: Metering Pipeline (ETL Job)

    [Diagram, repeated on slides 19 to 27 with one stage highlighted at a time]
    Components: INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2, INGEST METRICS, RAW FACT, TRANSFORMER, RESOLVED FACT, RESOURCE CATALOG, IDENTIFIER OWNERSHIP MAPPING, DATA FIDELITY, REPORT, REPORT
    Stages highlighted:
    • Metrics Ingestor: Schema(client_identifier, offering_measure, volume, metadata, timestamp)
    • Transformer
      1. Resolve Ownership
      2. Cost Computation
    • Data Fidelity & Reporting
      1. Verify Data Integrity & Fidelity
      2. Alert when things don’t seem the way they should be
    The final slide in the sequence labels the first stage EXPORT METRICS instead of INGEST METRICS.
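The transformer and fidelity stages can be illustrated with a small in-memory sketch. The raw fact fields follow the schema on the slide; everything else (class and function names, the ownership map, the unit price, the unknown identifier) is a hypothetical stand-in, not the actual pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RawFact:
    # Fields follow Schema(client_identifier, offering_measure, volume, metadata, timestamp).
    client_identifier: str
    offering_measure: str
    volume: float
    metadata: Dict[str, str]
    timestamp: str


@dataclass
class ResolvedFact:
    owner: str
    offering_measure: str
    volume: float
    cost: float
    timestamp: str


def transform(raw: List[RawFact],
              ownership: Dict[str, str],
              unit_prices: Dict[str, float]) -> Tuple[List[ResolvedFact], List[RawFact]]:
    """Resolve ownership, then compute cost; facts that cannot be resolved are set aside."""
    resolved: List[ResolvedFact] = []
    unresolved: List[RawFact] = []
    for fact in raw:
        owner = ownership.get(fact.client_identifier)
        price = unit_prices.get(fact.offering_measure)
        if owner is None or price is None:
            unresolved.append(fact)
            continue
        resolved.append(ResolvedFact(owner, fact.offering_measure, fact.volume,
                                     fact.volume * price, fact.timestamp))
    return resolved, unresolved


def fidelity_ok(raw: List[RawFact],
                resolved: List[ResolvedFact],
                unresolved: List[RawFact]) -> bool:
    """Check that every raw fact is accounted for and total volume is conserved."""
    volume_in = sum(f.volume for f in raw)
    volume_out = sum(f.volume for f in resolved) + sum(f.volume for f in unresolved)
    return len(raw) == len(resolved) + len(unresolved) and abs(volume_in - volume_out) < 1e-9


raw_facts = [
    RawFact("tweetypie.prod.tweetypie", "core-days", 120.0, {"infra": "Aurora"}, "2017-10-01"),
    RawFact("unknown.prod.job", "core-days", 8.0, {"infra": "Aurora"}, "2017-10-01"),
]
resolved, unresolved = transform(
    raw_facts,
    ownership={"tweetypie.prod.tweetypie": "tweetypie"},
    unit_prices={"core-days": 0.5},  # hypothetical price
)
```

Setting unresolved facts aside instead of dropping them gives the fidelity stage something concrete to alert on, such as identifiers with no owner.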
  28. Chargeback @Twitter: Reports, by customer

    • Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources
    • Infrastructure & Platform Operators: Overall Cluster Growth; Allocation vs. Utilization of resources by Client/Tenant
    • Finance & Execs: Budget vs. Spend per Org; Infrastructure PnL; Overall Efficiency & Trends
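An allocation versus utilization report per team could be aggregated along these lines. This is a sketch only: the row layout, the function name, the numbers, and the `tweetypie-canary` service are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (team, service, allocated core-days, used core-days); a hypothetical row layout.
UsageRow = Tuple[str, str, float, float]


def team_report(rows: Iterable[UsageRow]) -> Dict[str, Dict[str, float]]:
    """Sum allocation and usage per team and derive a utilization ratio."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: {"allocated": 0.0, "used": 0.0})
    for team, _service, allocated, used in rows:
        totals[team]["allocated"] += allocated
        totals[team]["used"] += used
    report: Dict[str, Dict[str, float]] = {}
    for team, t in totals.items():
        utilization = t["used"] / t["allocated"] if t["allocated"] else 0.0
        report[team] = {"allocated": t["allocated"], "used": t["used"],
                        "utilization": utilization}
    return report


rows = [
    ("tweetypie", "tweetypie", 100.0, 50.0),
    ("tweetypie", "tweetypie-canary", 20.0, 10.0),  # hypothetical service
    ("prediction", "ads-prediction", 80.0, 20.0),
]
report = team_report(rows)
```

The same aggregation keyed by client or tenant instead of team would give the operator view listed on the slide.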
  30. Chargeback @Twitter: Learnings

    1. Invest in data fidelity
       • Trust in data is most important.
       • Invest in monitoring & alerting for data inconsistencies
       • Leverage this for detecting abnormal increase/decrease and notify users
    2. Accurate ownership mapping
       • Static mappings go out of date quickly
       • Invest in systems (e.g., Kite) for users to manage it themselves
    3. Logical grouping of resources
       • Identifiers were too granular and teams were too broad.
       • Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain
    4. Change history
       • Unit prices change over time
       • Orgs / Teams change over time
       • Resources get added / removed
       • Change history is essential for consistency, which is used for capacity planning
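The slides do not say how abnormal increases or decreases are detected. A very simple approach, shown here purely as an assumption, compares the latest metered volume against the mean of recent history; the function name and the 50% threshold are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def abnormal_change(history: List[float], latest: float,
                    threshold: float = 0.5) -> Optional[str]:
    """Return "increase" or "decrease" when latest deviates from the
    historical mean by more than threshold (as a fraction), else None."""
    if not history:
        return None
    baseline = mean(history)
    if baseline == 0:
        return "increase" if latest > 0 else None
    change = (latest - baseline) / baseline
    if change > threshold:
        return "increase"
    if change < -threshold:
        return "decrease"
    return None


# Hypothetical daily core-day volumes for one client identifier.
recent = [100.0, 110.0, 90.0]
spike = abnormal_change(recent, 200.0)
drop = abnormal_change(recent, 40.0)
steady = abnormal_change(recent, 105.0)
```

A sudden drop can matter as much as a spike, since it may indicate missing metrics rather than real savings.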
  35. [Architecture diagram]

    Labels: SERVICE IDENTITY MANAGER; RESOURCE PROVISIONING MANAGER; DASHBOARD (SINGLE PANE OF GLASS); REPORTING; INFRASTRUCTURE SERVICE (three boxes); INFRASTRUCTURE & PLATFORM SERVICE; SERVICE LIFECYCLE WORKFLOWS; METADATA; RESOURCE QUOTA MANAGEMENT; METERING & CHARGEBACK; CLIENT IDENTITY PROVIDER APIS & ADAPTERS
  36. Kite @Twitter: Client Identifier Management

    • Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership
    • Capture Org Structure: Support org structure changes, project transfer workflows to ensure up-to-date ownership of identifiers
    • Unify client identifier provisioning workflow: Enables single source of truth and reduces operator pain around provisioning and managing client identifiers.
  37. IDENTITY ENTITY MODEL

    SERVICE/SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>
    • tweetypie: <Aurora, tweetypie.prod.tweetypie>
    • ads-prediction: <Aurora, ads-prediction.prod.campaign-x>
  38. IDENTITY ENTITY MODEL

    BUSINESS OWNER (1:N) TEAM (1:N) PROJECT (1:N) SERVICE/SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>
    • INFRASTRUCTURE, TWEETYPIE, tweetypie, tweetypie, <Aurora, tweetypie.prod.tweetypie>
    • REVENUE, ADS PREDICTION, prediction, ads-prediction, <Aurora, ads-prediction.prod.campaign-x>
    Entities are time varying dimensions
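Treating ownership as a time-varying dimension means a lookup needs a date as well as an identifier. A minimal sketch of such a lookup follows; the `OwnershipHistory` class, the dates, and the `ads-ranking` project are hypothetical, and only the `<Aurora, ads-prediction.prod.campaign-x>` identifier and the `ads-prediction` account come from the slides.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

ClientId = Tuple[str, str]  # (infrastructure, client identifier)


class OwnershipHistory:
    """Keeps dated ownership assignments so past usage resolves to the owner at that time."""

    def __init__(self) -> None:
        self._history: Dict[ClientId, List[Tuple[date, str]]] = {}

    def assign(self, client: ClientId, owner: str, effective: date) -> None:
        entries = self._history.setdefault(client, [])
        entries.append((effective, owner))
        entries.sort()  # keep assignments ordered by effective date

    def owner_on(self, client: ClientId, day: date) -> Optional[str]:
        entries = self._history.get(client, [])
        # Index of the first assignment that takes effect after `day`.
        idx = bisect_right([effective for effective, _ in entries], day)
        return entries[idx - 1][1] if idx else None


history = OwnershipHistory()
client = ("Aurora", "ads-prediction.prod.campaign-x")
history.assign(client, "ads-prediction", date(2017, 1, 1))
history.assign(client, "ads-ranking", date(2017, 7, 1))  # hypothetical transfer

before = history.owner_on(client, date(2016, 12, 31))
spring = history.owner_on(client, date(2017, 3, 15))
autumn = history.owner_on(client, date(2017, 8, 1))
```

Resolving each fact against the owner on its own timestamp keeps historical bills stable when a project is later transferred.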
  40. Impact & Future Work: Future Work

    1. Capacity Planning
       • Provide historic trends and help with forecast of capacity
    2. Extend Quota Manager
       • Onboard Hadoop, Storage and other systems
    3. Enable project deprecation
       • Detect unused resources, notify users, trigger deprecation process based on policy