Slide 1

Slide 1 text

Challenges for Global Service from a Perspective of SRE ~ 2nd season ~
Takayuki Watanabe, Technology Department, SRE Group, Cookpad Inc.
Feb 27th, 2019

Slide 2

Slide 2 text

About me
• Takayuki Watanabe
  - Twitter: takanabe_w / GitHub: takanabe
• Site Reliability Engineer (a.k.a. SRE)
  - Focus on Cookpad Global Projects

Slide 3

Slide 3 text

Today’s menu
• What is Cookpad Global?
• Role of Site Reliability Engineers
• Paving Roads for Autonomous Teams
  - Challenge 1: Organization Transformation for Greater Autonomy
  - Challenge 2: Feasible Self-service for Autonomous Teams

Slide 4

Slide 4 text

What is Cookpad Global?

Slide 5

Slide 5 text

What is Cookpad Global Service?
Japanese Cookpad App / Global Cookpad Apps

Slide 6

Slide 6 text

What is Cookpad Global Service?
Japanese Cookpad App / Global Cookpad Apps
JP Service ≠ Global Service

Slide 7

Slide 7 text

Our users across the globe
Service provided countries

Slide 8

Slide 8 text

Our users across the globe
Service provided countries
71 Countries, 26 Languages

Slide 9

Slide 9 text

Our users across the globe
Service provided countries
94 Million Monthly Average Users

Slide 10

Slide 10 text

# of Recipes for Global Service
Over … million recipes
… million recipes since …

Slide 11

Slide 11 text

Users and Developers across the globe
• Global Service and SRE?
• Empower a high-performing technology organization
Global headquarters: Bristol, UK

Slide 12

Slide 12 text

Users and Developers across the globe
• Global Service and SRE?
• Empower a high-performing technology organization
Global headquarters: Bristol, UK
100 People, 25 Nationalities

Slide 13

Slide 13 text

Users and Developers across the globe
• Global Service and SRE?
• Empower a high-performing technology organization
Global headquarters: Bristol, UK
The best people join from all over the world

Slide 14

Slide 14 text

Role of Site Reliability Engineers

Slide 15

Slide 15 text


Slide 16

Slide 16 text

A user living beyond a log

Slide 17

Slide 17 text


Slide 18

Slide 18 text

Our Product Developers

Slide 19

Slide 19 text

Missions for SREs in Cookpad
• Maximize user experiences in terms of:
  - Service availability
  - Performance
  - Security
  - etc.
• Build a great platform to support a growing product
  - A platform optimized for product development
  - Software architects with comprehensive knowledge of the technology

Slide 20

Slide 20 text

Missions for SREs in Cookpad
• Maximize user experiences in terms of:
  - Service availability
  - Performance
  - Security
  - etc.
• Build a great platform to support a growing product
  - A platform optimized for product development
  - Software architects with comprehensive knowledge of the technology
Control service availability based on various factors

Slide 21

Slide 21 text

SRE technology scope in Cookpad
Service Platforms
• VM/Container Platform on AWS

Slide 22

Slide 22 text

SRE technology scope in Cookpad
Areas: Service Platforms, Observability Engineering, Misc in-house Tooling, Release Engineering, Resilience Engineering
• Distributed Tracing
• Metrics Monitoring
• Logging System
• Alerts Management
• ML Based Anomaly Detection
• Data Analysis
• Team Health Visualization
• AWS Cost Optimization
• Developer Friendly Auth System
• etc.
• Deploy Pipeline
• Continuous Integration
• Continuous Delivery
• Deploy Strategy
• NW Fault Injection
• Spot Instance
• Circuit Breaker
• Throttling
• VM/Container Platform on AWS

Slide 23

Slide 23 text

Challenges in 2018
Attacks from China / GDPR / Recipe data migration / EKS based staging / Recruitment in UK / Observability / Full containerization / Spot instances / Expense reduction / Toil analysis automation

Slide 24

Slide 24 text

Paving Roads for Autonomous Teams

Slide 25

Slide 25 text

Paving Roads for Autonomous Teams
• Challenge 1: Organization Transformation for Greater Autonomy
• Challenge 2: Feasible Self-service for Autonomous Teams

Slide 26

Slide 26 text

Challenge 1: Organization Transformation for Greater Autonomy

Slide 27

Slide 27 text

Organization Transformation for Greater Autonomy
• Tipping Points for Autonomous Teams
• Organization Transformation: Chapter and Squad
• Development style change for new team structure
• Necessity of shared responsibility for service availability

Slide 28

Slide 28 text

Tipping Points for Autonomous Teams
• Cookpad employees in UK
  - 2016: 5 people
  - 2017: 50 people
  - 2018: 100 people
Team structure in 2016, 2017: Web, iOS, Android, QA, SRE, PM, ML

Slide 29

Slide 29 text

Tipping Points for Autonomous Teams
lines = n(n − 1) / 2

Slide 30

Slide 30 text

Tipping Points for Autonomous Teams
lines = n(n − 1) / 2
Communication cost ↑
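
To make the growth concrete, here is a small sketch that applies the formula above to the UK headcounts quoted on the "Tipping Points" slide. The method name `communication_lines` is made up for this illustration; only the formula and the headcounts come from the slides.

```ruby
# Number of pairwise communication lines among n people: n(n - 1) / 2.
# n * (n - 1) is always even, so integer division is exact here.
def communication_lines(n)
  n * (n - 1) / 2
end

# Headcounts quoted on the slide for 2016, 2017 and 2018.
{ 2016 => 5, 2017 => 50, 2018 => 100 }.each do |year, people|
  puts "#{year}: #{people} people -> #{communication_lines(people)} lines"
end
```

Going from 5 to 100 people multiplies headcount by 20 but the number of possible lines by roughly 500, which is the sense in which communication cost rises.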

Slide 31

Slide 31 text

Organization Transformation: Chapter and Squad
Before: Web, iOS, Android, QA, SRE, PM, ML
After: Web / iOS / Android / QA (×3), PM (×3), SRE, ML, …, grouped under the labels Chapter, Product Squad and Cross-platform Squad

Slide 34

Slide 34 text

Before: Web, iOS, Android, QA, SRE, PM, ML
After: Web / iOS / Android / QA (×3), PM (×3), SRE, ML, …, grouped under the labels Chapter, Product Squad and Cross-platform Squad
Conway's law … http://www.melconway.com/Home/Conways_Law.html

Slide 35

Slide 35 text

Development style change for new team structure
• Architecture of new feed
• New development styles

Slide 36

Slide 36 text

Architecture of new feed
Components: Message broker, Main API, Feed API, Cache (×2), DB (×2)
Labels: Complete feed json/html; GET /user_id/feed; List of activity primary keys in order, paginated

Slide 37

Slide 37 text

Architecture of new feed
Components: Message broker, Main API, Feed API, Cache (×2), DB (×2)
Labels: Complete feed json/html; GET /user_id/feed; List of activity primary keys in order, paginated
New components developed by a squad
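
One possible reading of the diagram labels is that the Feed API answers `GET /user_id/feed` with an ordered, paginated list of activity primary keys, and the Main API turns those keys into the complete feed. The sketch below illustrates that reading only. The data, the method names `feed_page` and `complete_feed`, and the page size are all hypothetical, and the message broker and caches are left out.

```ruby
# Hypothetical in-memory data; the real services and storage are not described in the slides.
FEED_IDS = { 42 => [105, 104, 103, 102, 101] }.freeze # user_id => activity primary keys, newest first
ACTIVITIES = {
  101 => "cooked a recipe",
  102 => "published a recipe",
  103 => "followed a cook",
  104 => "commented on a recipe",
  105 => "bookmarked a recipe"
}.freeze

# Feed API side: return one page of activity primary keys, in order.
def feed_page(user_id, page:, per_page:)
  return [] if page < 1
  FEED_IDS.fetch(user_id, []).each_slice(per_page).to_a.fetch(page - 1, [])
end

# Main API side: turn the ordered keys into complete feed entries.
def complete_feed(user_id, page: 1, per_page: 2)
  feed_page(user_id, page: page, per_page: per_page).map do |id|
    { id: id, text: ACTIVITIES.fetch(id) }
  end
end

p complete_feed(42, page: 1)
```

Keeping only primary keys on the feed side would let the new component stay small while the existing API keeps ownership of rendering, which fits the note that the new parts were built by a single squad.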

Slide 38

Slide 38 text

New development styles (Partial release in production)
    # WIP code for new notification system
    # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy
    Rollout.add :notification_center, owner: "xxxxxx-squad" do
      # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e
      Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE])
    end
• Feature toggle (application level control)
• Prototype environment (platform level control)

Slide 39

Slide 39 text

New development styles (Partial release in production)
    # WIP code for new notification system
    # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy
    Rollout.add :notification_center, owner: "xxxxxx-squad" do
      # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e
      Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE])
    end
• Feature toggle (application level control)
• Prototype environment (platform level control)
Only users know the answers
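
The slide shows how a toggle is registered but not how `Rollout` works internally. As a rough illustration of the idea, here is a minimal, self-contained toggle registry. `FeatureToggle`, `enabled?`, the squad name and the user ids are all invented for this sketch and are not Cookpad's implementation.

```ruby
# Hypothetical stand-in for an in-house toggle helper; not the Rollout class from the slide.
module FeatureToggle
  Feature = Struct.new(:name, :owner, :rule)

  @features = {}

  class << self
    # Register a feature with an owning team and a rule that decides per user.
    def add(name, owner:, &rule)
      @features[name] = Feature.new(name, owner, rule)
    end

    # A feature is off unless it is registered and its rule accepts the user.
    def enabled?(name, user_id)
      feature = @features[name]
      return false if feature.nil? || user_id.nil?
      feature.rule.call(user_id) ? true : false
    end
  end
end

ALLOWED_USER_IDS = [101, 102, 103].freeze # made-up developer account ids

FeatureToggle.add(:notification_center, owner: "example-squad") do |user_id|
  ALLOWED_USER_IDS.include?(user_id)
end

puts FeatureToggle.enabled?(:notification_center, 101)
puts FeatureToggle.enabled?(:notification_center, 999)
```

Defaulting to "off" for unknown features and anonymous users keeps work-in-progress code dark in production until someone is explicitly listed.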

Slide 40

Slide 40 text

Was feed a successful feature?
• Yes, feed was one of the most successful features in 2018
  - New architecture
  - New technology stack
  - Released to 100% of production in a short time

Slide 41

Slide 41 text

Why was feed successful?
• A lot of trials, failures and improvements in a short term
• Developers had power and responsibility for feature development
• Feed was developed from scratch
• Developers could choose appropriate technology
  - Introduce Streamy, Karafka (stream app frameworks)
  - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)

Slide 42

Slide 42 text

Why was feed successful?
• A lot of trials, failures and improvements in a short term
• Developers had power and responsibility for feature development
• Feed was developed from scratch
• Developers could choose appropriate technology
  - Introduce Streamy, Karafka (stream app frameworks)
  - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
Rapid prototyping was successful

Slide 43

Slide 43 text

Why was feed successful?
• A lot of trials, failures and improvements in a short term
• Developers had power and responsibility for feature development
• Feed was developed from scratch
• Developers could choose appropriate technology
  - Introduce Streamy, Karafka (stream app frameworks)
  - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
On the other hand …

Slide 44

Slide 44 text

Necessity of shared responsibility for service availability

Slide 45

Slide 45 text

Necessity of shared responsibility for service availability
Too many errors SREs cannot understand…

Slide 46

Slide 46 text

Happiness Quadrant (release new feed)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: new product

Slide 47

Slide 47 text

Happiness Quadrant (release new feed)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: new product, tough experiences

Slide 50

Slide 50 text

Happiness Quadrant (release new feed)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: new product, tough experiences
Not sustainable …

Slide 51

Slide 51 text

Why did this situation happen?
• A lot of trials, failures and improvements in a short term
• Developers had power and responsibility for feature development
• Feed was developed from scratch
• Developers could choose appropriate technology
  - Introduce Streamy, Karafka (stream app frameworks)
  - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
• No concept of shared responsibility for service availability

Slide 52

Slide 52 text

(WIP) Shared responsibility as Autonomous Teams
• Shared responsibility for organization sustainability
  - Reach consensus on service availability for each feature
  - Targets decided by product owners
• Higher quality in emergency notifications
  - Alert handling by appropriate people
• Another organization transformation based on ideal tech & business architectures

Slide 53

Slide 53 text

(WIP) Shared responsibility as Autonomous Teams
• Shared responsibility for organization sustainability
  - Reach consensus on service availability for each feature
  - Targets decided by product owners
• Higher quality in emergency notifications
  - Alert handling by appropriate people
• Another organization transformation based on ideal tech & business architectures
Inverse Conway Maneuver …
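
A per-feature availability target becomes easier to discuss once it is expressed as allowed downtime. The sketch below shows that arithmetic only. The feature names and percentages are invented examples rather than Cookpad's targets, and `allowed_downtime_minutes` is a name made up for this illustration.

```ruby
# Minutes of unavailability that an availability target allows over a time window.
def allowed_downtime_minutes(target_percent, window_days)
  window_minutes = window_days * 24 * 60
  (window_minutes * (100 - target_percent) / 100.0).round(2)
end

# Made-up targets that a product owner might pick per feature.
example_targets = { "feed" => 99.9, "search" => 99.0 }

example_targets.each do |feature, target|
  minutes = allowed_downtime_minutes(target, 30)
  puts "#{feature}: #{target}% over 30 days allows about #{minutes} minutes of downtime"
end
```

Numbers like these give developers and SREs a shared basis for deciding which alerts are urgent and who should handle them.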

Slide 54

Slide 54 text

Challenge 2: Feasible Self-service for Autonomous Teams

Slide 55

Slide 55 text

Feasible Self-service for Autonomous Teams
• Four Important Keys for Successful Autonomous Teams
• Feasible Self-service for Developers
• Our focused scope
  - Full-containerization
  - No ssh debugging

Slide 56

Slide 56 text

Four Important Keys for Successful Autonomous Teams
• Discipline: Common rules in the organization
  - Technology stack, team structure
• Freedom: Ownership of individual development
  - Small team, technology selection, system design
• Responsibility: Commitment to the whole software life cycle
  - Design, implementation, test, deploy, service availability monitoring
• Optimization: Best practices for product development
  - Logging and monitoring system, deploy pipeline

Slide 60

Slide 60 text

Four Important Keys for Successful Autonomous Teams
• Discipline: Common rules in the organization
  - Technology stack, team structure
• Freedom: Ownership of individual development
  - Small team, technology selection, system design
• Responsibility: Commitment to the whole software life cycle
  - Design, implementation, test, deploy, service availability monitoring
• Optimization: Best practices for product development
  - Logging and monitoring system, deploy pipeline, feature toggle

Slide 61

Slide 61 text

Four Important Keys for Successful Autonomous Teams
• Discipline: Common rules in the organization
  - Technology stack, team structure
• Freedom: Ownership of individual development
  - Small team, technology selection, system design
• Responsibility: Commitment to the whole software life cycle
  - Design, implementation, test, deploy, service availability monitoring
• Optimization: Best practices for product development
  - Logging and monitoring system, deploy pipeline, feature toggle
Organization strategy matters

Slide 62

Slide 62 text

Four Important Keys for Successful Autonomous Teams
• Discipline: Common rules in the organization
  - Technology stack, team structure
• Freedom: Ownership of individual development
  - Small team, technology selection, system design
• Responsibility: Commitment to the whole software life cycle
  - Design, implementation, test, deploy, service availability monitoring
• Optimization: Best practices for product development
  - Logging and monitoring system, deploy pipeline, feature toggle
Organization strategy matters
Strong leadership across tech and business is essential

Slide 63

Slide 63 text

Four Important Keys for Successful Autonomous Teams
• Discipline: Common rules in the organization
  - Technology stack, team structure
• Freedom: Ownership of individual development
  - Small team, technology selection, system design
• Responsibility: Commitment to the whole software life cycle
  - Design, implementation, test, deploy, service availability monitoring
• Optimization: Best practices for product development
  - Logging and monitoring system, deploy pipeline, feature toggle
SRE squad can contribute

Slide 64

Slide 64 text

Four Important Keys for Successful Autonomous Teams
• Discipline: Common rules in the organization
  - Technology stack, team structure
• Freedom: Ownership of individual development
  - Small team, technology selection, system design
• Responsibility: Commitment to the whole software life cycle
  - Design, implementation, test, deploy, service availability monitoring
• Optimization: Best practices for product development
  - Logging and monitoring system, deploy pipeline, feature toggle
SRE squad can contribute
Optimized self-service mechanisms providing company-wide best practices in SRE

Slide 65

Slide 65 text

Feasible Self-service for Developers
• Low learning cost
  - e.g. Are you sure that developers are happy to learn and maintain k8s yaml?
• Secure and painless operations in production
  - e.g. Are the experiences provided by SREs comfortable and secure for developers?

Slide 66

Slide 66 text

Our focused scope
• Full-containerization
• No ssh debugging

Slide 67

Slide 67 text

Full-containerization

Slide 68

Slide 68 text

Pros of Applications on Container Platform
• Developers can control software version upgrade timing
  - SREs don’t want to maintain legacy VM based service platform
• Application of in-house tools and company-wide best practices
  - Auto Scaling
  - Cost optimization (spot fleets)
  - Container apps deployment tool (hako)
  - Centralized developer console (hako-console)
  - Easy service mesh integration
  - etc.
• Immutable infrastructure
  - Version controlled applications and infrastructures
  - No configuration drifts

Slide 70

Slide 70 text

Development Lead Time

Slide 71

Slide 71 text

Dev / SRE

Slide 72

Slide 72 text


Slide 73

Slide 73 text

Gaps between Devs & SREs …

Slide 74

Slide 74 text

Happiness Quadrant (Software Upgrade without container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Unproductive tasks, New software version

Slide 75

Slide 75 text

Happiness Quadrant (Software Upgrade without container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Blocked time experience, New software version, Unproductive tasks

Slide 76

Slide 76 text

Happiness Quadrant (Software Upgrade without container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Total happiness

Slide 77

Slide 77 text

Happiness Quadrant (Software Upgrade without container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Total happiness
Great mechanism might put the vector onto the 1st quadrant…

Slide 78

Slide 78 text

Happiness Quadrant (Software Upgrade without container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Total happiness
Great mechanism might put the vector onto the 1st quadrant…
Run all stateless applications on container clusters

Slide 79

Slide 79 text

Progress of Full-containerization in Global

Slide 80

Slide 80 text

Progress of Full-containerization in Global
17/18 apps are running on containers (94% complete)

Slide 81

Slide 81 text

Dev

Slide 82

Slide 82 text

Dev
The date Ruby … was released: Dec …th
Developers can control Ruby versions without SREs' support

Slide 83

Slide 83 text

Happiness Quadrant (Software Upgrade with container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: training cost, learning cost, New software version, operational cost reduction

Slide 84

Slide 84 text

Happiness Quadrant (Software Upgrade with container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Total happiness

Slide 85

Slide 85 text

Happiness Quadrant (Software Upgrade with container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Total happiness
Win - Win

Slide 86

Slide 86 text

Happiness Quadrant (Software Upgrade with container)
Axes: Developers' happiness, SREs' happiness (happy / unhappy)
Labels: Total happiness
Plus, SREs can focus on container platform (more best practices can be introduced)

Slide 87

Slide 87 text

No ssh debugging

Slide 88

Slide 88 text

Cons of Applications on Container Platform
• Additional complexities for developers
• Lack of tools causes chaos

Slide 89

Slide 89 text

Cons of Applications on Container Platform
• Additional complexities for developers
• Lack of tools causes chaos
ref: https://speakerdeck.com/takanabe/challenges-for-global-service-from-a-perspective-of-sre (slide …)

Slide 90

Slide 90 text

Cons of Applications on Container Platform
• Additional complexities for developers
• Lack of tools causes chaos
ref: https://speakerdeck.com/takanabe/challenges-for-global-service-from-a-perspective-of-sre (slide …)
Already enough?

Slide 91

Slide 91 text


Slide 92

Slide 92 text

Cons of Applications on Container Platform
• Additional complexities for developers
• Lack of tools causes chaos
• No ssh debugging systems for the Global team
  - Granular and chronological metrics dashboard
  - Container optimized New Relic agent deployment
  - Short-term log collection
  - Safe rails console for container

Slide 93

Slide 93 text

Granular and chronological metrics dashboard
Diagram labels: Containers, export metrics, Time-series Database (Short-term TDB, Long-term TDB, InfluxDB, Prometheus), downsampling, User Interface (Grafana), Developer

Slide 94

Slide 94 text

Granular and chronological metrics dashboard
• Before
  - We could not dig into errors caused by spikes in resource saturation
• After
  - We can recognize errors caused by spikes in resource saturation
  - We can judge whether errors should be fixed soon or not
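
The slides name downsampling between the short-term and long-term stores but do not say how it is done. Assuming it averages consecutive samples, the sketch below shows why a short spike is visible only in the granular data. The sample values and the `downsample` helper are made up for this illustration.

```ruby
# Average consecutive samples into buckets of `size`, as a downsampling step might.
def downsample(samples, size)
  samples.each_slice(size).map { |bucket| bucket.sum.to_f / bucket.size }
end

cpu_percent = [10, 10, 10, 100, 10, 10] # made-up fine-grained CPU samples with one spike
coarse = downsample(cpu_percent, 6)

puts "fine-grained max: #{cpu_percent.max}"
puts "downsampled max:  #{coarse.max}"
```

In this example the granular series still shows the saturation at 100, while the averaged series flattens it to 25, which is why granular short-term data is what makes spike-driven errors diagnosable.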

Slide 95

Slide 95 text

Container optimized New Relic agent deployment
Diagram labels: Container, shared memory, agent start flag, http://app/_new_relic/start, APP, rack-new_relic-starter, hakopartiarelic, exec consul lock

Slide 96

Slide 96 text

Container optimized New Relic agent deployment
• Before
  - ECS cannot deploy a New Relic agent to a specific container (We want to save)
  - Agents are gone when containers are killed accidentally
• After
  - ECS can deploy a New Relic agent to a container
  - Distributed locking via `consul lock` sidecar
  - Rack middleware that provides an endpoint to start the New Relic agent
  - Agents are launched in a container when the agent start flag exists on shared memory
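
The "After" mechanism can be pictured as a small Rack-style middleware: a request to a start endpoint records a flag, and the agent is started once the flag is seen. The sketch below is only that picture, under stated assumptions. `AgentStartEndpoint` is a hypothetical class, a plain file stands in for shared memory, the path `/_new_relic/start` is an assumed reading of the diagram label, the `on_start` callback stands in for starting the real agent, and the `consul lock` step is not modeled.

```ruby
require "tmpdir"

# Rack-style middleware sketch (hypothetical; not the gem used in the talk).
class AgentStartEndpoint
  START_PATH = "/_new_relic/start" # assumed endpoint path

  def initialize(app, flag_path:, on_start:)
    @app = app
    @flag_path = flag_path # a file stands in for the shared-memory flag
    @on_start = on_start   # callback standing in for starting the agent
    @started = false
  end

  def call(env)
    start_request = env["REQUEST_METHOD"] == "POST" && env["PATH_INFO"] == START_PATH
    File.write(@flag_path, "1") if start_request
    start_agent_if_flagged
    return [200, { "content-type" => "text/plain" }, ["agent start requested\n"]] if start_request
    @app.call(env)
  end

  private

  # Start the agent at most once, and only when the flag exists.
  def start_agent_if_flagged
    return if @started || !File.exist?(@flag_path)
    @on_start.call
    @started = true
  end
end

Dir.mktmpdir do |dir|
  starts = 0
  inner_app = ->(_env) { [200, { "content-type" => "text/plain" }, ["app\n"]] }
  app = AgentStartEndpoint.new(inner_app,
                               flag_path: File.join(dir, "agent_start"),
                               on_start: -> { starts += 1 })
  app.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" })
  app.call({ "REQUEST_METHOD" => "POST", "PATH_INFO" => AgentStartEndpoint::START_PATH })
  app.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" })
  puts "agent started #{starts} time(s)"
end
```

Because the flag lives outside any single process, other workers in the same container can notice it on their next request, which is roughly what a flag on shared memory would provide.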

Slide 97

Slide 97 text

Short-term log collection
Diagram labels: Containers, export logs, Short-term log search (Elasticsearch), Long-term log search (S3, Athena), Log search User Interface (hako-console, Kibana), Developer

Slide 98

Slide 98 text

Short-term log collection
• Before
  - Developers had to wait a few minutes to search logs
• After
  - Developers can check logs in near real time

Slide 99

Slide 99 text

Safe rails console for container
Diagram labels: Developer, User Interface (trapdoor console), Interactive communication via WebSocket, trapdoor proxy, trapdoor agent, APP, Bin exec, data-only container, Containers, Manage access privileges (Azure AD), Export audit logs (Slack)

Slide 100

Slide 100 text

Safe rails console for container
• Before
  - Developers ssh to servers and run `rails c` (sometimes `rails c -s`)
  - Developers can run write queries in production (historical technical debt)
• After
  - Developers can use a REPL via web browser with safe options selected by SREs
  - Developers can only run read queries on a designated database instance
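
The slides do not say how read-only access is enforced. As a rough illustration of the intent, here is a deliberately naive statement check. `read_only_statement?` and the allow-list are assumptions made for this sketch, and a check like this is easy to bypass, so real enforcement would more plausibly rely on database privileges or a read-only instance, as the slide suggests.

```ruby
# Naive allow-list: only statements that begin with SELECT or SHOW pass.
READ_ONLY_STATEMENT = /\A\s*(select|show)\b/i

# True only if every semicolon-separated statement looks read-only.
# Splitting on ";" is crude and may reject valid queries; it errs on the side of refusing.
def read_only_statement?(sql)
  statements = sql.split(";").map(&:strip).reject(&:empty?)
  !statements.empty? && statements.all? { |statement| statement.match?(READ_ONLY_STATEMENT) }
end

puts read_only_statement?("SELECT id FROM recipes LIMIT 1")
puts read_only_statement?("UPDATE recipes SET title = 'x'")
```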

Slide 101

Slide 101 text

Safe rails console for container (before)
    takayuki-watanabe@ssh-accepatable-host-xxx:~$ date
    Thu Apr 19 10:53:28 UTC 2018
    takayuki-watanabe@ssh-accepatable-host-xxx:~$ htop
      PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+  Command
    [snip]
    31817 cookpad   20   0  794M  129M   188 R 93.7  1.7    1923h  ruby bin/rails console production -s
     8773 cookpad   20   0  734M  165M   152 R 91.7  2.2    1800h  ruby bin/rails console production -s
     8107 cookpad   20   0  959M  734M 14228 R 83.7  9.8 40h01:04  ruby bin/rails c production
    [snip]

Slide 102

Slide 102 text

Safe rails console for container (after)

Slide 103

Slide 103 text

Safe rails console for container (after)
Feasible Self-service makes product development reliable and autonomous!!

Slide 104

Slide 104 text

Recap
• What is Cookpad Global?
• Role of Site Reliability Engineers
• Paving Roads for Autonomous Teams
  - Challenge 1: Organization Transformation for Greater Autonomy
  - Challenge 2: Feasible Self-service for Autonomous Teams

Slide 105

Slide 105 text

Thank you!!