Role of Site Reliability Engineers • Paving Roads for Autonomous Teams - Challenge 1: Organization Transformation for Greater Autonomy - Challenge 2: Feasible Self-service for Autonomous Teams
in terms of: • Service availability • Performance • Security • etc… • Build a great platform to support a growing product • Product development optimized platform • Software architects owning comprehensive knowledge for technology Role of Site Reliability Engineers
in terms of: • Service availability • Performance • Security • etc… • Build a great platform to support a growing product • Product development optimized platform • Software architects owning comprehensive knowledge for technology Role of Site Reliability Engineers Control service availability based on various factors
.JTDJOIPVTF 5PPMJOH 3FMFBTF &OHJOFFSJOH 3FTJMJFODF &OHJOFFSJOH w %JTUSJCVUFE5SBDJOH w .FUSJDT.POJUPSJOH w -PHHJOH4ZTUFN w "MFSUT.BOBHFNFOU w .-#BTFE"OPNBMZ%FUFDUJPO w %BUB"OBMZTJT w 5FBN)FBMUI7JTVBMJ[BUJPO w "84$PTU0QUJNJ[BUJPO w %FWFMPQFS'SJFOEMZ"VUI4ZTUFN w FUD w %FQMPZ1JQFMJOF w $POUJOVPVT*OUFHSBUJPO w $POUJOVPVT%FMJWFSZ w %FQMPZ4USBUFHZ w /8'BVMU*OKFDUJPO w 4QPU*OTUBODF w $JSDVJU#SFBLFS w 5ISPUUMJOH w 7.$POUBJOFS1MBUGPSNPO"84 Role of Site Reliability Engineers
Autonomous Teams • Organization Transformation: Chapter and Squad • Development style change for new team structure • Necessity of shared responsibility for service availability Challenge1: Organization Transformation for Greater Autonomy
UK • 2016: 5 people • 2017: 50 people • 2018: 100 people Challenge1: Organization Transformation for Greater Autonomy 8FC J04 "OESPJE 2" 43& 1. .- Team structure in 2016, 2017
Feed API DB Complete feed json/html Cache DB GET /user_id/feed List of activity primary keys in order, paginated Challenge1: Organization Transformation for Greater Autonomy > Development style change for new team structure
Feed API DB Complete feed json/html Cache DB New components developed by a squad GET /user_id/feed List of activity primary keys in order, paginated Challenge1: Organization Transformation for Greater Autonomy > Development style change for new team structure
code for new notification system # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy Rollout.add :notification_center, owner: "xxxxxx-squad" do # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE]) end • Feature toggle (application level control) • Prototype environment (platform level control) Challenge1: Organization Transformation for Greater Autonomy > Development style change for new team structure
for new team structure New development styles (Partial release in production) 39 # WIP code for new notification system # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy Rollout.add :notification_center, owner: "xxxxxx-squad" do # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE]) end • Feature toggle (application level control) • Prototype environment (platform level control) Only users know answers
of the most successful features in 2018 • New architecture • New technology stack • 100% release in production in short time Challenge1: Organization Transformation for Greater Autonomy > Development style change for new team structure
failures and improvements in short term • Developers had power and responsibility for feature developments • Feed was developed from scratch • Developers could choose appropriate technology • Introduce Streamy, Karafka (stream app frameworks) • Test Kafka, RabbitMQ, SQS, Kinesis (message brokers) Challenge1: Organization Transformation for Greater Autonomy > Development style change for new team structure
for new team structure 42 Why feed was successful? • A lot of trials, failures and improvements in short term • Developers had power and responsibility for feature developments • Feed was developed from scratch • Developers could choose appropriate technology • Introduce Streamy, Karafka (stream app frameworks) • Test Kafka, RabbitMQ, SQS, Kinesis (message brokers) Rapid prototyping was successful
for new team structure 43 Why feed was successful? • A lot of trials, failures and improvements in short term • Developers had power and responsibility for feature developments • Feed was developed from scratch • Developers could choose appropriate technology • Introduce Streamy, Karafka (stream app frameworks) • Test Kafka, RabbitMQ, SQS, Kinesis (message brokers) On the other hand …
Quadrants (Release new feed) Challenge1: Organization Transformation for Greater Autonomy > Necessity of shared responsibility for service availability
failures and improvements in short term • Developers had power and responsibility for feature developments • Feed was developed from scratch • Developers could choose appropriate technology • Introduce Streamy, Karafka (stream app frameworks) • Test Kafka, RabbitMQ, SQS, Kinesis (message brokers) • No concepts of shared responsibility for service availability Challenge1: Organization Transformation for Greater Autonomy > Necessity of shared responsibility for service availability
for organization sustainability • Reach consensus of service availability for each feature • Targets decided by product owners • Higher quality in emergency notifications • Alert handling by appropriate people • Another organization transformation based on ideal tech & business architectures Challenge1: Organization Transformation for Greater Autonomy > Necessity of shared responsibility for service availability
of service availability for each feature • Targets decided by product owners • Higher quality in emergency notifications • Alert handling by appropriate people • Another organization transformation based on ideal tech & business architectures Challenge1: Organization Transformation for Greater Autonomy > Necessity of shared responsibility for service availability Inverse Conway Maneuver … (WIP) Shared responsibility as Autonomous Teams
for Successful Autonomous Teams • Feasible Self-service for Developers • Our focused scope • Full-containerization • No ssh debugging Challenge2: Feasible Self-service for Autonomous Teams
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, Deploy pipeline Challenge2: Feasible Self-service for Autonomous Teams
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, Deploy pipeline Challenge2: Feasible Self-service for Autonomous Teams
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, Deploy pipeline Challenge2: Feasible Self-service for Autonomous Teams
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, Deploy pipeline Challenge2: Feasible Self-service for Autonomous Teams
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, deploy pipeline, feature toggle Challenge2: Feasible Self-service for Autonomous Teams
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, deploy pipeline, feature toggle Challenge2: Feasible Self-service for Autonomous Teams Organization strategy matter
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, deploy pipeline, feature toggle Challenge2: Feasible Self-service for Autonomous Teams Organization strategy matter Strong leaderships across tech and business are essential
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, deploy pipeline, feature toggle Challenge2: Feasible Self-service for Autonomous Teams SRE squad can contribute
Common rules in organization - Technology stack, team structure • Freedom: Ownership for individual developments - Small team, technology selection, system design • Responsibility: Commitments for whole software life cycle - Design, implementation, test, deploy, service availability monitoring • Optimization: Best practices for product developments - Logging and monitoring system, deploy pipeline, feature toggle Challenge2: Feasible Self-service for Autonomous Teams SRE squad can contribute Optimized self-service mechanisms providing company-wide best practices in SRE
e.g: Are you sure that developers are happy to learn and maintain k8s yaml? • Secure and painless operations in production • e.g: Are experiences provided by SREs comfortable and secure for developers? Challenge2: Feasible Self-service for Autonomous Teams
control software version upgrade timing • SREs don’t want to maintain legacy VM based service platform • Application of in-house tools and company-wide best practices • Auto Scaling • Cost optimization (spot fleets) • Container apps deployment tool (hako) • Centralized developer console (hako-console) • Easy service mesh integration • etc … • Immutable infrastructure • version controlled applications and infrastructures • No configuration drifts Challenge2: Feasible Self-service for Autonomous Teams > Full-containerization
control software version upgrade timing • SREs don’t want to maintain legacy VM based service platform • Application of in-house tools and company-wide best practices • Auto Scaling • Cost optimization (spot fleets) • Container apps deployment tool (hako) • Centralized developer console (hako-console) • Easy service mesh integration • etc … • Immutable infrastructure • version controlled applications and infrastructures • No configuration drifts Challenge2: Feasible Self-service for Autonomous Teams > Full-containerization
Happiness Quadrant (Software Upgrade without container) 6OQSPEVDUJWFUBTLT Challenge2: Feasible Self-service for Autonomous Teams > Full-containerization
43&T`IBQQJOFTT 5PUBMIBQQJOFTT IBQQZ IBQQZ VOIBQQZ VOIBQQZ Happiness Quadrant (Software Upgrade with container) Plus, SREs can focus on container platform (more best practices can be introduced)
for Developers • Lack of tools cause chaos • No ssh debugging systems for Global team • Granular and chronological order metrics dashboard • Container optimized New Relic agent deployment • Short-term log collection • Safe rails console for container Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
FYQPSUNFUSJDT (SBGBOB 5JNFTFSJFT%BUBCBTF 6TFS*OUFSGBDF Granular and chronological metrics dashboard Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
cannot dig errors caused by spike resource saturations • After • We can recognize errors caused by spike resource saturations • We can judge that errors should be fixed soon or not Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
ECS cannot deploy a New Relic agent to a specific container ( We want to save ) • Agents are gone when containers are killed accidentally • After • ECS can deploy a New Relic agent to a container • Distributed locking via `consul lock` sidecar • Rack middleware that provides an endpoint to start the New Relic agent • Agents are launched in a container when agent start flag exists on shared memory Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
wait for few minutes to search logs • After • Developers can check logs nearly real-time Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
ssh to servers and run `rails -c` (Sometimes `rails -c -s`) • Developers can run write queries in production ( historical technical debt ) • After • Developers can use REPL via web browser with safe options selected by SREs • Developers can only run read queries on designated database instance Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
Apr 19 10:53:28 UTC 2018 [email protected]:~$ htop PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command [snip] 31817 cookpad 20 0 794M 129M 188 R 93.7 1.7 1923h ruby bin/rails console production -s 8773 cookpad 20 0 734M 165M 152 R 91.7 2.2 1800h ruby bin/rails console production -s 8107 cookpad 20 0 959M 734M 14228 R 83.7 9.8 40h01:04 ruby bin/rails c production [snip] Challenge2: Feasible Self-service for Autonomous Teams > No ssh debugging
of Site Reliability Engineers • Paving Roads for Autonomous Teams - Challenge 1: Organization Transformation for Greater Autonomy - Challenge 2: Feasible Self-service for Autonomous Teams