[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Service Ownership & Improve Infrastructure Utilization at Scale [I] - Vinu Charanya, Twitter

How we built a framework to improve infrastructure resource utilization at
scale

Hello! I am @VinuCharanya
★ Sr. Systems Engineer @Twitter
★ Proud to be a member of @TwitterWomen, @Techwomen and @WomenWhoCode

Agenda
1. History & Context
2. Chargeback @Twitter
3. Kite - Service Lifecycle Manager
4. Impact & Future Work

History & Context

Thousands of MicroServices

INFRASTRUCTURE & DATACENTER MANAGEMENT
CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
FRAMEWORK / LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
INFRASTRUCTURE SERVICES
• STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
• COMPUTE: MESOS/AURORA, HADOOP
• DB / DW: MYSQL, VERTICA, POSTGRES
DEPLOY (Workflows)

[Chart: Number of Servers. MESOS/AURORA, HADOOP, MANHATTAN: 67%]

How to get visibility into resources used by individual jobs & datasets?

How to attribute resource consumption to teams/organization?

How do you incentivize the right behavior to improve efficiency of resource usage?

Chargeback @Twitter

Ability to meter allocation & utilization of resources
per service, per project, per engineering team
to improve visibility & enable accountability

Features
• Supports diverse Infra Services
• Meters abstract resources at daily granularity
• Detailed Reports

Support diverse Infrastructure and Platform Services
1. Resource Catalog: Consistent way to inventory infrastructure resources
• Resource Fluidity: Support primitive (CPU) and abstract resource ("Tweets / second"). Extend existing resource
2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability

RESOURCE CATALOG ENTITY MODEL
PROVIDER (1:N) INFRASTRUCTURE SERVICE (1:N) OFFERINGS (1:N) OFFER MEASURES (1:N) OFFER MEASURE COST
Example: TWITTER DC / PUBLIC CLOUD -> COMPUTE -> CORE-DAYS -> $X
Further example values shown in the diagram: TWITTER DC; STORAGE, PROCESSING CLUSTER, …; GB-RAM, FILE ACCESSES, …; $X, $Y, …, $M, $N, …
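The 1:N hierarchy of the resource catalog could be modeled roughly as nested records. The following is a minimal Python sketch, not Twitter's actual schema: the class names, field names, and the unit cost of 1.0 are assumptions made for illustration, while "Aurora", "Compute" and "core-days" come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OfferMeasureCost:
    unit_cost: float  # stands in for the "$X" on the slide (value is made up)


@dataclass
class OfferMeasure:
    label: str  # e.g. "core-days"
    costs: List[OfferMeasureCost] = field(default_factory=list)


@dataclass
class Offering:
    label: str  # e.g. "Compute"
    measures: List[OfferMeasure] = field(default_factory=list)


@dataclass
class InfrastructureService:
    name: str  # e.g. "Aurora"
    offerings: List[Offering] = field(default_factory=list)


@dataclass
class Provider:
    name: str  # e.g. "Twitter DC"
    services: List[InfrastructureService] = field(default_factory=list)


def measure_paths(provider: Provider) -> List[str]:
    """Flatten the hierarchy into 'provider/service/offering/measure' paths."""
    return [
        f"{provider.name}/{svc.name}/{off.label}/{m.label}"
        for svc in provider.services
        for off in svc.offerings
        for m in off.measures
    ]


core_days = OfferMeasure("core-days", [OfferMeasureCost(unit_cost=1.0)])
aurora = InfrastructureService("Aurora", [Offering("Compute", [core_days])])
catalog = Provider("Twitter DC", [aurora])
print(measure_paths(catalog))
```

Keeping cost as its own record under a measure leaves room for several prices per measure, which fits the later observation that unit prices change over time.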

/api/1/measures

    {
      "measures": [
        {
          "measure_id": 1,
          "measure_label": "core-days",
          "measure_unit_label": "per 1 core-day",
          "offering_id": 1,
          "offering_label": "Compute",
          "infrastructure_id": 1,
          "infrastructure_name": "Aurora"
        },
        {
          "measure_id": 2,
          "measure_label": "machine-days",
          "measure_unit_label": "per 1 machine-day",
          "offering_id": 2,
          "offering_label": "zone:tweety",
          "infrastructure_id": 8,
          "infrastructure_name": "Physical Infrastructure"
        },
        ...
      ]
    }
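A consumer of a response shaped like the one above might group measures by the infrastructure that offers them. This is a hedged sketch: it parses an inline sample containing only the two entries visible on the slide rather than calling any real endpoint, and the helper name `measures_by_infrastructure` is hypothetical.

```python
import json
from collections import defaultdict

# Inline sample with only the two entries visible on the slide.
SAMPLE = """
{"measures": [
  {"measure_id": 1, "measure_label": "core-days",
   "measure_unit_label": "per 1 core-day",
   "offering_id": 1, "offering_label": "Compute",
   "infrastructure_id": 1, "infrastructure_name": "Aurora"},
  {"measure_id": 2, "measure_label": "machine-days",
   "measure_unit_label": "per 1 machine-day",
   "offering_id": 2, "offering_label": "zone:tweety",
   "infrastructure_id": 8, "infrastructure_name": "Physical Infrastructure"}
]}
"""


def measures_by_infrastructure(payload: str) -> dict:
    """Group measure labels under the infrastructure that offers them."""
    grouped = defaultdict(list)
    for measure in json.loads(payload)["measures"]:
        grouped[measure["infrastructure_name"]].append(measure["measure_label"])
    return dict(grouped)


print(measures_by_infrastructure(SAMPLE))
```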

So, how do you incentivize the right behavior to
improve efficiency of resource usage?

Pricing is one way…

Total Cost of Ownership for Aurora -> $X core-day
Cost of Physical Server ($X / day)
Total available Cores, split into: Operational Overhead, Headroom, Production Used Cores, Non-Prod Used Cores
Annotations on the split:
• Total used Cores
• Excess Cores (incl. DR, Spikes, Overallocation)
• Cores used by platform for operations & maintenance
Also shown: Quota Buffer (Underutilized Quota), Container Size Buffer (Underutilized Reservation)
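The slide does not state the pricing formula. One plausible reading is that the full cost of the servers is spread over only the cores that tenants can actually be billed for, so excess capacity and platform overhead are folded into the core-day price. The sketch below works under that assumption; the function name, parameters, and every number in the example are hypothetical.

```python
def core_day_price(
    server_cost_per_day: float,
    servers: int,
    total_cores: int,
    excess_cores: int,
    platform_cores: int,
) -> float:
    """Price of one core-day, ASSUMING total cost is recovered from billable cores.

    Billable cores are taken to be what remains after removing excess capacity
    (DR, spikes, overallocation) and cores the platform uses for itself.
    """
    total_cost = server_cost_per_day * servers
    billable_cores = total_cores - excess_cores - platform_cores
    if billable_cores <= 0:
        raise ValueError("no billable cores left to carry the cost")
    return total_cost / billable_cores


# Made-up numbers: 100 servers at 10.0 per day, 3200 cores in total,
# 600 held as excess and 100 used by the platform itself.
price = core_day_price(10.0, 100, 3200, 600, 100)
print(f"{price:.2f} per core-day")
```

Under this reading, a less efficient cluster (more excess or overhead) yields a higher unit price, which is one way pricing can make inefficiency visible to users.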

Features
• Supports diverse Infra/Platform Services
• Meters abstract resources at daily granularity

Metering Pipeline (ETL Job)
INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2 -> INGEST METRICS -> RAW FACT -> TRANSFORMER -> RESOLVED FACT -> DATA FIDELITY -> REPORT
Also in the diagram: RESOURCE CATALOG, IDENTIFIER OWNERSHIP MAPPING

Metrics Ingestor
Schema(client_identifier, offering_measure, volume, metadata, timestamp)
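The ingestion schema could be represented as a small validated record. This is a sketch only: the five field names come from the slide, while the field types, the validation rules, and the example volume are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class RawFact:
    """One metered data point, mirroring the slide's schema (types are assumed)."""

    client_identifier: str
    offering_measure: str
    volume: float
    timestamp: datetime
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject obviously bad records at ingestion time.
        if not self.client_identifier:
            raise ValueError("client_identifier must not be empty")
        if self.volume < 0:
            raise ValueError("volume must not be negative")


fact = RawFact(
    client_identifier="tweetypie.prod.tweetypie",
    offering_measure="core-days",
    volume=120.0,  # made-up value
    timestamp=datetime(2017, 12, 6, tzinfo=timezone.utc),
)
print(fact)
```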

Transformer
1. Resolve Ownership
2. Cost Computation
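The two transformer steps can be sketched as a single pass over raw facts. This is an illustration under stated assumptions: the `ownership` and `unit_prices` dictionaries stand in for the identifier ownership mapping and the resource catalog, the type and function names are hypothetical, and the volumes and the 0.5 price are made up.

```python
from typing import Dict, List, NamedTuple, Optional


class RawFact(NamedTuple):
    client_identifier: str
    offering_measure: str
    volume: float


class ResolvedFact(NamedTuple):
    client_identifier: str
    offering_measure: str
    volume: float
    owner: Optional[str]  # None when no owner is known for the identifier
    cost: float


def transform(
    facts: List[RawFact],
    ownership: Dict[str, str],
    unit_prices: Dict[str, float],
) -> List[ResolvedFact]:
    resolved = []
    for fact in facts:
        # Step 1: resolve ownership via the identifier -> owner mapping.
        owner = ownership.get(fact.client_identifier)
        # Step 2: compute cost as volume times the measure's unit price.
        cost = fact.volume * unit_prices[fact.offering_measure]
        resolved.append(ResolvedFact(*fact, owner=owner, cost=cost))
    return resolved


facts = [
    RawFact("tweetypie.prod.tweetypie", "core-days", 100.0),
    RawFact("unknown.prod.job", "core-days", 10.0),
]
resolved = transform(
    facts,
    ownership={"tweetypie.prod.tweetypie": "TWEETYPIE"},
    unit_prices={"core-days": 0.5},
)
for row in resolved:
    print(row)
```

Facts with no known owner are kept with `owner=None` rather than dropped, so unowned spend stays visible in reports.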

Data Fidelity & Reporting
1. Verify Data Integrity & Fidelity
2. Alert when things don't look the way they should
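A fidelity check of this kind could compare each identifier's volume against the previous day and flag gaps or large swings. The talk does not describe the actual checks, so the 50% threshold, the function name, and the sample data below are all assumptions.

```python
from typing import Dict, List


def fidelity_alerts(
    previous: Dict[str, float],
    current: Dict[str, float],
    max_change: float = 0.5,
) -> List[str]:
    """Return alert messages for missing identifiers or large day-over-day swings."""
    alerts = []
    for key in sorted(set(previous) | set(current)):
        before = previous.get(key)
        after = current.get(key)
        if before is None or after is None:
            alerts.append(f"{key}: missing from one day's data")
        elif before > 0 and abs(after - before) / before > max_change:
            alerts.append(f"{key}: volume changed from {before} to {after}")
    return alerts


# Made-up volumes for three hypothetical identifiers.
yesterday = {"a": 100.0, "b": 40.0, "c": 10.0}
today = {"a": 105.0, "b": 90.0}
alerts = fidelity_alerts(yesterday, today)
for message in alerts:
    print(message)
```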


Reports: Customers
• Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources
• Infrastructure & Platform Operators: Overall Cluster Growth; Allocation vs. Utilization of resources by Client/Tenant
• Finance & Execs: Budget vs. Spend per Org; Infrastructure PnL; Overall Efficiency & Trends

INFRASTRUCTURE PNL


CHARGEBACK BILL FOR A TEAM

CHARGEBACK DRILLDOWN FOR A TEAM

Learnings
1. Invest in data fidelity
• Trust in data is most important.
• Invest in monitoring & alerting for data inconsistencies
• Leverage this to detect abnormal increases/decreases and notify users
2. Accurate ownership mapping
• Static mappings go out of date quickly
• Invest in systems (e.g., Kite) for users to manage it themselves
3. Logical grouping of resources
• Identifiers were too granular and teams were too broad.
• Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it
4. Track historical data
• Unit prices change over time
• Orgs / Teams change over time
• Resources get added / removed
• Change history is essential for consistency, which is used for capacity planning

[Architecture diagram] Components shown:
• DASHBOARD (SINGLE PANE OF GLASS)
• SERVICE IDENTITY MANAGER
• RESOURCE PROVISIONING MANAGER
• REPORTING
• SERVICE LIFECYCLE WORKFLOWS
• METADATA
• RESOURCE QUOTA MANAGEMENT
• METERING & CHARGEBACK
• CLIENT IDENTITY PROVIDER
• APIS & ADAPTERS
• INFRASTRUCTURE & PLATFORM SERVICE: INFRASTRUCTURE SERVICE, INFRASTRUCTURE SERVICE, INFRASTRUCTURE SERVICE

Kite @Twitter
• 10,000+ Client Identifiers
• 1,000+ Projects
• 100+ Teams
• 8 Infrastructure Services

Kite @Twitter: Client Identifier Management
• Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership
• Capture Org Structure: Support org structure changes, project transfer workflows to ensure up-to-date ownership of identifiers
• Unify client identifier provisioning workflow: Enables single source of truth and reduces operator pain around provisioning and managing client identifiers.

IDENTITY ENTITY MODEL
BUSINESS OWNER (1:N) TEAM (1:N) PROJECT (1:N) SERVICE / SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>
Examples:
• INFRASTRUCTURE -> TWEETYPIE -> tweetypie -> tweetypie -> <Aurora, tweetypie.prod.tweetypie>
• REVENUE -> ADS PREDICTION -> prediction -> ads-prediction -> <Aurora, ads-prediction.prod.campaign-x>
Entities are time varying dimensions
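Treating entities as time-varying dimensions means ownership is looked up as of a date instead of being a single current value. The sketch below assumes validity intervals on each ownership record; the record layout, the dates, and the second team name are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class OwnershipRecord:
    infra: str
    client_id: str
    project: str
    team: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the record is still current


def owner_as_of(
    records: List[OwnershipRecord], infra: str, client_id: str, day: date
) -> Optional[OwnershipRecord]:
    """Return the record whose validity interval [valid_from, valid_to) covers `day`."""
    for record in records:
        if (record.infra, record.client_id) != (infra, client_id):
            continue
        if record.valid_from <= day and (record.valid_to is None or day < record.valid_to):
            return record
    return None


# Hypothetical history: the identifier moves to another team on 2017-07-01.
history = [
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "tweetypie", "TWEETYPIE",
                    date(2016, 1, 1), date(2017, 7, 1)),
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "tweetypie", "HYPOTHETICAL-TEAM",
                    date(2017, 7, 1)),
]
early = owner_as_of(history, "Aurora", "tweetypie.prod.tweetypie", date(2017, 1, 1))
late = owner_as_of(history, "Aurora", "tweetypie.prod.tweetypie", date(2017, 12, 6))
print(early.team, late.team)
```

With interval-based records, a bill for a past month can be recomputed against the ownership that applied at the time, even after a project transfer.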

Impact

10,000+ Client Identifiers

CLAIM OWNERSHIP

PROJECT DISCOVERY

TEAM OVERVIEW
• Released unused resources
• Q2 unit price update
• New project launch

PROJECT METADATA

AURORA QUOTA MANAGER

Future Work

1. Resource provisioning
• Extend Quota Manager and unify the experience into Kite
• Onboard Hadoop, Storage and other systems
2. Enable project deprecation
• Detect unused resources, notify users, trigger deprecation process based on policy
3. Capacity Planning
• Provide historic trends and help with forecast of capacity


@VinuCharanya
