[Kubecon 2017 Austin, TX] How We Built a Framew...

Vinu Charanya
December 06, 2017

[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Service Ownership & Improve Infrastructure Utilization at Scale [I] - Vinu Charanya, Twitter

Twitter is powered by thousands of microservices that run on our internal Cloud platform, which consists of a suite of multi-tenant platform services that offer Compute, Storage, Messaging, Monitoring, etc. as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers, across on-prem & the public cloud. The scale & diversity of these multi-tenant infrastructure services make it extremely difficult to effectively forecast capacity, compute resource utilization & cost, and drive efficiency.

In this talk, I would like to share how my team is building a system (Kite - A unified service manager) to help define, model, provision, meter & charge infrastructure resources. The infrastructure resources include primitive bare metal servers / VMs on the public cloud and abstract resources offered by multi-tenant services such as our Compute platform (powered by Apache Aurora/Mesos), Storage (Manhattan for key/value, Cache, RDBMS), and Observability. Along with how we solved this problem, I also intend to share a few case studies on how we were able to use this data to better plan capacity & drive a cultural change in engineering that helped improve overall resource utilization & drive significant savings in infrastructure spending.

Transcript

  1. Hello! I am @VinuCharanya

    ★ Sr. Systems Engineer @Twitter ★ Proud to be a member of @TwitterWomen, @Techwomen and @WomenWhoCode
  2. Agenda

    1. History & Context
    2. Chargeback @Twitter
    3. Kite - Service Lifecycle Manager
    4. Impact & Future Work
  3. [Stack diagram, top to bottom]

    • Core Application Services: Tweets, Users, Social Graph
    • Platform Services: Search, Messaging & Queues, Cache, Monitoring and Alerting, Ingress & Proxy
    • Framework/Libraries: Finagle (RPC), Scalding (Map Reduce in Scala), Heron (Streaming Compute), JVM
    • Management Tools: Self Serve, Service Directory, Chargeback, Config Mgmt
    • Data & Analytics Platform: Interactive Query, Data Discovery, Workflow Management
    • Infrastructure Services: Storage (Manhattan, Blobstore, Graphstore, TimeSeriesDB), Compute (Mesos/Aurora, Hadoop), DB/DW (MySQL, Vertica, Postgres)
    • Deploy (Workflows)
    • Infrastructure & Datacenter Management
  4. [Chart: Number of Servers; Mesos/Aurora, Hadoop, Manhattan; 67%]

    How to get visibility into resources used by individual jobs & datasets?
  5. [Same chart]

    How to attribute resource consumption to teams/organizations?
  6. [Same chart]

    How do you incentivize the right behavior to improve efficiency of resource usage?
  7. Chargeback @Twitter

    The ability to meter allocation & utilization of resources per service, per project, and per engineering team, to improve visibility & enable accountability.
  8. Chargeback @Twitter: Support diverse Infrastructure and Platform Services

    1. Resource Catalog: a consistent way to inventory infrastructure resources
       • Resource Fluidity: support primitive resources (CPU) and abstract resources ("Tweets / second"); extend existing resources
    2. Resource <> Client Identifier Ownership: a map of client identifier to owner, to enable accountability
  11. Resource Catalog entity model

    Provider -(1:N)-> Infrastructure Service -(1:N)-> Offerings -(1:N)-> Offer Measures -(1:N)-> Offer Measure Cost
    • Example values (first build): Twitter DC / Public Cloud; Compute; Core-days; $X
    • Example values (second build): Twitter DC; Storage; Processing Cluster; GB-RAM, File Accesses, …; $X, $Y, …, $M, $N, …
  13. Chargeback @Twitter: /api/1/measures

    {
      measures: [
        {
          "measure_id": 1,
          "measure_label": "core-days",
          "measure_unit_label": "per 1 core-day",
          "offering_id": 1,
          "offering_label": "Compute",
          "infrastructure_id": 1,
          "infrastructure_name": "Aurora"
        },
        {
          "measure_id": 2,
          "measure_label": "machine-days",
          "measure_unit_label": "per 1 machine-day",
          "offering_id": 2,
          "offering_label": "zone:tweety",
          "infrastructure_id": 8,
          "infrastructure_name": "Physical Infrastructure",
        },
        {
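The flattened measure records above carry enough information to rebuild the catalog hierarchy. The following is a minimal sketch, assuming each record has exactly the seven fields visible on the slide; the `Measure` class and `index_catalog` helper are hypothetical names for illustration and are not part of the Chargeback API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """One flattened catalog row, mirroring the fields shown on the slide."""
    measure_id: int
    measure_label: str
    measure_unit_label: str
    offering_id: int
    offering_label: str
    infrastructure_id: int
    infrastructure_name: str


def index_catalog(rows):
    """Group measure labels by infrastructure service, then by offering."""
    catalog = defaultdict(lambda: defaultdict(list))
    for row in rows:
        m = Measure(**row)
        catalog[m.infrastructure_name][m.offering_label].append(m.measure_label)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {infra: dict(offerings) for infra, offerings in catalog.items()}


# The two complete records visible on the slide.
ROWS = [
    {"measure_id": 1, "measure_label": "core-days",
     "measure_unit_label": "per 1 core-day", "offering_id": 1,
     "offering_label": "Compute", "infrastructure_id": 1,
     "infrastructure_name": "Aurora"},
    {"measure_id": 2, "measure_label": "machine-days",
     "measure_unit_label": "per 1 machine-day", "offering_id": 2,
     "offering_label": "zone:tweety", "infrastructure_id": 8,
     "infrastructure_name": "Physical Infrastructure"},
]

CATALOG = index_catalog(ROWS)
```

Keeping the rows flat and deriving the hierarchy on read is only one possible design; the slides do not say how the catalog is stored.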
  14. So, how do you incentivize the right behavior to improve efficiency of resource usage?
  15. Total Cost of Ownership for Aurora: $X per core-day

    Cost of Physical Server ($X / day)
    Components of the core breakdown:
    • Operational Overhead
    • Headroom
    • Quota Buffer (Underutilized Quota)
    • Container Size Buffer (Underutilized Reservation)
    • Production Used Cores
    • Non-Prod Used Cores
    Annotations added across the build:
    • Total available Cores
    • Total used Cores
    • Excess Cores (incl. DR, Spikes, Overallocation)
    • Cores used by platform for operations & maintenance
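One plausible reading of this breakdown is that the price of a core-day is set so that the cores tenants actually use also pay for overhead, headroom, and buffers. The slides do not give a formula, so the sketch below is an assumption: `core_day_price` is a hypothetical helper and every number is invented for illustration.

```python
def core_day_price(server_cost_per_day, servers, cores_per_server, billable_cores):
    """Spread the full daily fleet cost over the cores tenants are billed for.

    Assumption (not from the slides): non-billable cores such as operational
    overhead, headroom, and buffers are recovered through the billable cores.
    """
    total_cores = servers * cores_per_server
    if billable_cores <= 0:
        raise ValueError("billable_cores must be positive")
    if billable_cores > total_cores:
        raise ValueError("billable cores cannot exceed total available cores")
    return (server_cost_per_day * servers) / billable_cores


# Hypothetical fleet: 100 servers at $10/day, 32 cores each, 2000 billable cores.
PRICE = core_day_price(10.0, 100, 32, 2000)

# Naive price that ignores overhead: cost divided by every available core.
NAIVE_PRICE = (10.0 * 100) / (100 * 32)
```

Under these made-up numbers the loaded price is higher than the naive one, which is the effect the breakdown appears to illustrate: unused reservation and headroom raise the cost of each core-day that is actually billed.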
  19. Chargeback @Twitter: Metering Pipeline (ETL Job)

    Diagram components: Infrastructure Service 1, Infrastructure Service 2, Ingest Metrics (labelled Export Metrics on the final build), Raw Fact, Transformer, Resolved Fact, Resource Catalog, Identifier Ownership Mapping, Data Fidelity, Reports
    Stages highlighted across the build:
    • Metrics Ingestor: Schema(client_identifier, offering_measure, volume, metadata, timestamp)
    • Transformer
      1. Resolve Ownership
      2. Cost Computation
    • Data Fidelity & Reporting
      1. Verify Data Integrity & Fidelity
      2. Alert when things don't look the way they should
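The two transformer steps named in the pipeline, resolving ownership and computing cost, can be sketched as follows. This is an illustration under stated assumptions, not Twitter's implementation: the raw fact fields follow the schema on the slide, while `ResolvedFact`, `transform`, `report_by_owner`, the `UNRESOLVED` bucket, and all prices and volumes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawFact:
    """Fields follow Schema(client_identifier, offering_measure, volume, metadata, timestamp)."""
    client_identifier: str
    offering_measure: str
    volume: float
    metadata: dict
    timestamp: str


@dataclass(frozen=True)
class ResolvedFact:
    """Hypothetical output row: a raw fact plus its owner and computed cost."""
    owner: str
    client_identifier: str
    offering_measure: str
    volume: float
    cost: float
    timestamp: str


UNRESOLVED = "UNRESOLVED"  # assumed bucket for identifiers with no known owner


def transform(raw_facts, ownership, unit_prices):
    """Step 1: resolve ownership. Step 2: cost = volume * unit price."""
    resolved = []
    for fact in raw_facts:
        owner = ownership.get(fact.client_identifier, UNRESOLVED)
        cost = fact.volume * unit_prices[fact.offering_measure]
        resolved.append(ResolvedFact(owner, fact.client_identifier,
                                     fact.offering_measure, fact.volume,
                                     cost, fact.timestamp))
    return resolved


def report_by_owner(resolved):
    """Sum cost per owner, the basis for a team bill."""
    totals = {}
    for fact in resolved:
        totals[fact.owner] = totals.get(fact.owner, 0.0) + fact.cost
    return totals


# Invented sample data; the second identifier is deliberately unmapped.
FACTS = [
    RawFact("tweetypie.prod.tweetypie", "core-days", 100.0, {}, "2017-12-06"),
    RawFact("orphan.prod.job", "core-days", 5.0, {}, "2017-12-06"),
]
OWNERSHIP = {"tweetypie.prod.tweetypie": "TWEETYPIE"}
UNIT_PRICES = {"core-days": 2.0}

REPORT = report_by_owner(transform(FACTS, OWNERSHIP, UNIT_PRICES))
```

Routing unmapped identifiers to an explicit bucket rather than dropping them keeps the total cost reconcilable and makes ownership gaps visible in the report.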
  28. Chargeback @Twitter: Reports, by customer

    • Infrastructure & Platform Operators: Overall Cluster Growth; Allocation vs. Utilization of resources by Client/Tenant
    • Finance & Execs: Budget vs. Spend per Org; Infrastructure PnL; Overall Efficiency & Trends
    • Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources
  30. Chargeback @Twitter: Learnings

    1. Invest in data fidelity
       • Trust in data is most important.
       • Invest in monitoring & alerting for data inconsistencies.
       • Leverage this to detect abnormal increases/decreases and notify users.
    2. Accurate ownership mapping
       • Static mappings go out of date quickly.
       • Invest in systems (e.g., Kite) that let users manage mappings themselves.
    3. Logical grouping of resources
       • Identifiers were too granular and teams were too broad.
       • Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it.
    4. Track historical data
       • Unit prices change over time.
       • Orgs / teams change over time.
       • Resources get added / removed.
       • Change history is essential for consistency, which capacity planning relies on.
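The first learning mentions detecting abnormal increases or decreases. A minimal sketch of such a check follows, assuming a simple rule that compares the latest volume against the median of recent history; the function name, the rule, and the 50% tolerance are assumptions for illustration and are not taken from the talk.

```python
from statistics import median


def abnormal_change(history, latest, tolerance=0.5):
    """Flag `latest` if it deviates from the median of `history` by more than
    `tolerance`, expressed as a fraction of that median."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline == 0:
        return latest != 0
    return abs(latest - baseline) / baseline > tolerance


# Invented daily core-day volumes for one client identifier.
HISTORY = [100, 102, 98, 101]
SPIKE = abnormal_change(HISTORY, 180)
DROP = abnormal_change(HISTORY, 20)
NORMAL = abnormal_change(HISTORY, 105)
```

A median baseline is used here because a single bad day in the history shifts it less than a mean would; a production check would likely need seasonality handling that this sketch omits.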
  35. [Architecture diagram; labels as extracted]

    Service Identity Manager; Resource Provisioning Manager; Dashboard (Single Pane of Glass); Reporting; Infrastructure Service (×3); Infrastructure & Platform Service; Service Lifecycle Workflows; Metadata; Resource Quota Management; Metering & Chargeback; Client Identity Provider APIs & Adapters
  36. Kite @Twitter: Client Identifier Management

    • Identity System: built a consistent way to group client identifiers of different infrastructure services into a project, enabling ownership.
    • Capture Org Structure: support org structure changes and project transfer workflows to keep ownership of identifiers up to date.
    • Unify client identifier provisioning workflow: enables a single source of truth and reduces operator pain around provisioning and managing client identifiers.
  37. Identity entity model

    Business Owner -(1:N)-> Team -(1:N)-> Project -(1:N)-> Service/System Account -(1:N)-> <Infra, ClientID>
    • Infrastructure -> TWEETYPIE -> tweetypie -> tweetypie -> <Aurora, tweetypie.prod.tweetypie>
    • Revenue -> ADS PREDICTION -> prediction -> ads-prediction -> <Aurora, ads-prediction.prod.campaign-x>
    Entities are time-varying dimensions.
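Treating entities as time-varying dimensions means an ownership lookup needs a date as well as an identifier. The sketch below shows one common way to do that, with effective-dated assignments and a binary search; the `OwnershipHistory` class, its methods, the dates, and the second team name are all hypothetical.

```python
from bisect import bisect_right
from datetime import date


class OwnershipHistory:
    """Effective-dated mapping from <infra, client_id> to an owning team."""

    def __init__(self):
        self._changes = {}  # (infra, client_id) -> sorted [(effective_date, team)]

    def assign(self, infra, client_id, team, effective):
        changes = self._changes.setdefault((infra, client_id), [])
        changes.append((effective, team))
        changes.sort()

    def owner_on(self, infra, client_id, day):
        """Return the team that owned the identifier on `day`, or None."""
        changes = self._changes.get((infra, client_id), [])
        dates = [effective for effective, _ in changes]
        i = bisect_right(dates, day)  # count of assignments effective on or before `day`
        return changes[i - 1][1] if i else None


HISTORY_BY_ID = OwnershipHistory()
# Invented dates; "hypothetical-new-team" illustrates a project transfer.
HISTORY_BY_ID.assign("Aurora", "tweetypie.prod.tweetypie", "TWEETYPIE", date(2017, 1, 1))
HISTORY_BY_ID.assign("Aurora", "tweetypie.prod.tweetypie", "hypothetical-new-team", date(2017, 7, 1))

BEFORE_TRANSFER = HISTORY_BY_ID.owner_on("Aurora", "tweetypie.prod.tweetypie", date(2017, 3, 15))
AFTER_TRANSFER = HISTORY_BY_ID.owner_on("Aurora", "tweetypie.prod.tweetypie", date(2017, 7, 1))
```

Keeping the full change list, rather than overwriting the current owner, lets a past bill be recomputed against the ownership that applied at the time.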
  40. Impact & Future Work: Future Work

    1. Resource provisioning
       • Extend Quota Manager and unify the experience into Kite
       • Onboard Hadoop, Storage and other systems
    2. Enable project deprecation
       • Detect unused resources, notify users, trigger deprecation process based on policy
    3. Capacity Planning
       • Provide historic trends and help with forecast of capacity
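The deprecation item describes detecting unused resources based on policy. As a rough sketch, assuming the policy is "no utilization for a fixed number of consecutive days", candidate identifiers could be found as below; the function, the policy, and the sample data are hypothetical, since the talk lists this only as future work.

```python
def deprecation_candidates(daily_usage, idle_days=30):
    """Return identifiers whose most recent `idle_days` samples are all zero.

    `daily_usage` maps a client identifier to a list of daily utilization
    values, oldest first. Identifiers with too little history are skipped.
    """
    candidates = []
    for identifier, samples in daily_usage.items():
        recent = samples[-idle_days:]
        if len(recent) == idle_days and not any(recent):
            candidates.append(identifier)
    return sorted(candidates)


# Invented usage series, checked against a 3-day idle policy for brevity.
USAGE = {
    "idle.prod.job": [4.0, 0.0, 0.0, 0.0],
    "busy.prod.job": [4.0, 0.0, 0.0, 1.5],
    "new.prod.job": [0.0, 0.0],
}
CANDIDATES = deprecation_candidates(USAGE, idle_days=3)
```

Skipping identifiers with short histories avoids flagging newly provisioned resources that have not started serving traffic yet.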