Challenges for Global Service from a Perspective of SRE 2nd season

Cookpad TechConf 2019: https://techconf.cookpad.com/2019/

Transcript

  1. Challenges for Global Service from a Perspective of SRE ~ 2nd season ~
     Cookpad Inc., Feb 27th, 2019
     Takayuki Watanabe, Technology Department, SRE Group
  2. About me
     • Takayuki Watanabe
       - Twitter: takanabe_w / GitHub: takanabe
     • Site Reliability Engineer (a.k.a. SRE)
       - Focus on Cookpad Global Projects
  3. Today’s menu
     • What is Cookpad Global?
     • Role of Site Reliability Engineers
     • Paving Roads for Autonomous Teams
       - Challenge 1: Organization Transformation for Greater Autonomy
       - Challenge 2: Feasible Self-service for Autonomous Teams
  4. What is Cookpad Global?
  5. What is Cookpad Global Service?
     • Japanese Cookpad App
     • Global Cookpad Apps
  6. What is Cookpad Global Service?
     • JP Service ≠ Global Service
  7. Our users across the globe
     • Service provided countries
  8. Our users across the globe
     • 71 Countries
     • 26 Languages
  9. Our users across the globe
     • 94 Million Monthly Average Users
  10. # of Recipes for Global Service
      • Over … million recipes
      • … million recipes since …
  11. Users and Developers across the globe
      • Global Service and SRE?
      • Empower a high-performing technology organization
      • Global headquarters: Bristol, UK
  12. Users and Developers across the globe
      • 100 People
      • 25 Nationalities
  13. Users and Developers across the globe
      • The best people join from all over the world
  14. Role of Site Reliability Engineers
  16. A user living beyond a log
  18. Our Product Developers
  19. Missions for SREs in Cookpad
      • Maximize user experiences in terms of:
        - Service availability
        - Performance
        - Security
        - etc…
      • Build a great platform to support a growing product
        - Product development optimized platform
        - Software architects owning comprehensive knowledge for technology
  20. Missions for SREs in Cookpad
      • Control service availability based on various factors
  21. SRE technology scope in Cookpad
      • Service Platforms
        - VM / Container Platform on AWS
  22. SRE technology scope in Cookpad
      • Service Platforms
        - VM / Container Platform on AWS
      • Observability Engineering
        - Distributed Tracing
        - Metrics Monitoring
        - Logging System
        - Alerts Management
        - ML Based Anomaly Detection
      • Misc in-house Tooling
        - Data Analysis
        - Team Health Visualization
        - AWS Cost Optimization
        - Developer Friendly Auth System
        - etc
      • Release Engineering
        - Deploy Pipeline
        - Continuous Integration
        - Continuous Delivery
        - Deploy Strategy
      • Resilience Engineering
        - NW Fault Injection
        - Spot Instance
        - Circuit Breaker
        - Throttling
  23. Challenges in 2018
      • Attacks from China
      • GDPR
      • Recipe data migration
      • EKS based staging
      • Recruitment in UK
      • Observability
      • Full containerization
      • Spot instances
      • Expense reduction
      • Toil analysis automation
  24. Paving Roads for Autonomous Teams
  25. Paving Roads for Autonomous Teams
      • Challenge 1: Organization Transformation for Greater Autonomy
      • Challenge 2: Feasible Self-service for Autonomous Teams
  26. Challenge 1: Organization Transformation for Greater Autonomy
  27. Organization Transformation for Greater Autonomy
      • Tipping Points for Autonomous Teams
      • Organization Transformation: Chapter and Squad
      • Development style change for new team structure
      • Necessity of shared responsibility for service availability
  28. Tipping Points for Autonomous Teams
      • Cookpad employees in UK
        - 2016: 5 people
        - 2017: 50 people
        - 2018: 100 people
      • Team structure in 2016, 2017: Web, iOS, Android, QA, SRE, PM, ML
  29. Tipping Points for Autonomous Teams
      • Communication lines among n people: lines = n(n − 1) / 2
  30. Tipping Points for Autonomous Teams
      • lines = n(n − 1) / 2
      • Communication cost ↑
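As a worked example of the formula on this slide, the sketch below plugs in the UK headcounts quoted earlier in the talk (5, 50 and 100 people). The method and constant names are illustrative assumptions, not something taken from the slides.

```ruby
# Number of pairwise communication lines among n people: n(n - 1) / 2.
def communication_lines(n)
  n * (n - 1) / 2
end

# Headcounts quoted in the talk for Cookpad employees in the UK.
HEADCOUNT_BY_YEAR = { 2016 => 5, 2017 => 50, 2018 => 100 }.freeze

HEADCOUNT_BY_YEAR.each do |year, people|
  puts "#{year}: #{people} people -> #{communication_lines(people)} lines"
end
```

Going from 5 to 100 people multiplies the headcount by 20 but the number of possible lines by roughly 500, which is one way to read "Communication cost ↑".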
  31. Organization Transformation: Chapter and Squad
      • Before (diagram labels): Web, iOS, Android, QA, SRE, PM, ML
      • After (diagram labels): Web, iOS, Android, QA (×3); PM (×3); Product Squad; SRE; ML; Cross-platform Squad; Chapter; …
  34. Organization Transformation: Chapter and Squad
      • Conway's law … http://www.melconway.com/Home/Conways_Law.html
  35. Development style change for new team structure
      • Architecture of new feed
      • New development styles
  36. Architecture of new feed
      • Diagram labels: Message broker; Main API; Feed API; Cache (×2); DB (×2)
      • GET /user_id/feed
      • Complete feed json/html
      • List of activity primary keys in order, paginated
  37. Architecture of new feed
      • New components developed by a squad
  38. New development styles (Partial release in production)

        # WIP code for new notification system
        # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy
        Rollout.add :notification_center, owner: "xxxxxx-squad" do
          # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e
          Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE])
        end

      • Feature toggle (application level control)
      • Prototype environment (platform level control)
  39. New development styles (Partial release in production)
      • Only users know the answers
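The feature toggle on this slide combines three things: a named feature, an owning squad, and a block that decides per user whether the feature is on. A minimal sketch of that idea follows. `FeatureToggle`, its methods and the user IDs are hypothetical stand-ins for illustration; this is not Cookpad's `Rollout` implementation, whose internals the slides do not show.

```ruby
# Hypothetical, minimal feature-toggle registry (not Cookpad's Rollout).
module FeatureToggle
  Feature = Struct.new(:name, :owner, :rule)

  @features = {}

  class << self
    # Register a feature with an owning squad and a block that decides,
    # per user ID, whether the feature is enabled.
    def add(name, owner:, &rule)
      @features[name] = Feature.new(name, owner, rule)
    end

    # Unknown features are treated as disabled.
    def enabled?(name, user_id)
      feature = @features[name]
      return false if feature.nil?

      feature.rule.call(user_id) ? true : false
    end

    # Which squad to contact about a feature, or nil if it is unknown.
    def owner_of(name)
      @features[name]&.owner
    end
  end
end

# Placeholder IDs standing in for the developers' own accounts.
ALLOWED_USER_IDS = [101, 102, 103].freeze

FeatureToggle.add(:notification_center, owner: "example-squad") do |user_id|
  ALLOWED_USER_IDS.include?(user_id)
end

puts FeatureToggle.enabled?(:notification_center, 101) # a listed developer
puts FeatureToggle.enabled?(:notification_center, 999) # any other user
```

Keeping the owner next to the rule means anyone who finds a stale toggle in production knows which squad to ask about it.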
  40. Was feed a successful feature?
      • Yes, feed was one of the most successful features in 2018
        - New architecture
        - New technology stack
        - 100% release in production in a short time
  41. Why was feed successful?
      • A lot of trials, failures and improvements in a short period
      • Developers had power and responsibility for feature development
      • Feed was developed from scratch
      • Developers could choose appropriate technology
        - Introduce Streamy, Karafka (stream app frameworks)
        - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
  42. Why was feed successful?
      • Rapid prototyping was successful
  43. Why was feed successful?
      • On the other hand …
  44. Necessity of shared responsibility for service availability
  45. Necessity of shared responsibility for service availability
      • Too many errors SREs cannot understand…
  46. Happiness Quadrant (release new feed)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: new product (×2)
  47. Happiness Quadrant (release new feed)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: tough experiences, new product
  50. Happiness Quadrant (release new feed)
      • Not sustainable …
  51. Why did this situation happen?
      • A lot of trials, failures and improvements in a short period
      • Developers had power and responsibility for feature development
      • Feed was developed from scratch
      • Developers could choose appropriate technology
        - Introduce Streamy, Karafka (stream app frameworks)
        - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
      • No concept of shared responsibility for service availability
  52. (WIP) Shared responsibility as Autonomous Teams
      • Shared responsibility for organization sustainability
        - Reach consensus on service availability for each feature
        - Targets decided by product owners
      • Higher quality in emergency notifications
        - Alert handling by appropriate people
      • Another organization transformation based on ideal tech & business architectures
  53. (WIP) Shared responsibility as Autonomous Teams
      • Inverse Conway Maneuver …
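One way to make "consensus on service availability for each feature" concrete is to translate a target percentage into the downtime it allows over a fixed window. The sketch below does that arithmetic. The feature names, the target values and the 30-day window are made-up assumptions for illustration; the slides do not state any actual targets.

```ruby
# Hypothetical per-feature availability targets (percent), as a product
# owner might set them. These values are not from the talk.
TARGETS = { "feed" => 99.9, "search" => 99.95 }.freeze

MINUTES_PER_30_DAYS = 30 * 24 * 60

# Minutes of unavailability a target tolerates within the window.
def allowed_downtime_minutes(target_percent, window_minutes = MINUTES_PER_30_DAYS)
  window_minutes * (100.0 - target_percent) / 100.0
end

TARGETS.each do |feature, target|
  minutes = allowed_downtime_minutes(target).round(1)
  puts "#{feature}: #{target}% allows about #{minutes} minutes per 30 days"
end
```

A number expressed in minutes is usually easier for product owners and SREs to argue about than a string of nines, which is the point of agreeing on targets per feature.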
  54. Challenge 2: Feasible Self-service for Autonomous Teams
  55. Feasible Self-service for Autonomous Teams
      • Four Important Keys for Successful Autonomous Teams
      • Feasible Self-service for Developers
      • Our focused scope
        - Full-containerization
        - No ssh debugging
  56. Four Important Keys for Successful Autonomous Teams
      • Discipline: Common rules in organization
        - Technology stack, team structure
      • Freedom: Ownership for individual developments
        - Small team, technology selection, system design
      • Responsibility: Commitments for whole software life cycle
        - Design, implementation, test, deploy, service availability monitoring
      • Optimization: Best practices for product developments
        - Logging and monitoring system, deploy pipeline
  60. Four Important Keys for Successful Autonomous Teams
      • Optimization: Best practices for product developments
        - Logging and monitoring system, deploy pipeline, feature toggle
  61. Four Important Keys for Successful Autonomous Teams
      • Organization strategy matters
  62. Four Important Keys for Successful Autonomous Teams
      • Organization strategy matters
      • Strong leadership across tech and business is essential
  63. Four Important Keys for Successful Autonomous Teams
      • SRE squad can contribute
  64. Four Important Keys for Successful Autonomous Teams
      • SRE squad can contribute
      • Optimized self-service mechanisms providing company-wide best practices in SRE
  65. Feasible Self-service for Developers
      • Low learning cost
        - e.g. Are you sure that developers are happy to learn and maintain k8s yaml?
      • Secure and painless operations in production
        - e.g. Are experiences provided by SREs comfortable and secure for developers?
  66. Our focused scope
      • Full-containerization
      • No ssh debugging
  67. Full-containerization
  68. Pros of Applications on Container Platform
      • Developers can control software version upgrade timing
        - SREs don’t want to maintain legacy VM based service platform
      • Application of in-house tools and company-wide best practices
        - Auto Scaling
        - Cost optimization (spot fleets)
        - Container apps deployment tool (hako)
        - Centralized developer console (hako-console)
        - Easy service mesh integration
        - etc …
      • Immutable infrastructure
        - Version controlled applications and infrastructures
        - No configuration drifts
  70. Development Lead Time
  71. Dev, SRE
  73. Gaps between Devs & SREs …
  74. Happiness Quadrant (Software Upgrade without container)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: Unproductive tasks, New software version
  75. Happiness Quadrant (Software Upgrade without container)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: Blocked time experience, New software version (×2), Unproductive tasks
  76. Happiness Quadrant (Software Upgrade without container)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: Total happiness
  77. Happiness Quadrant (Software Upgrade without container)
      • Great mechanism might put the vector onto the 1st quadrant…
  78. Happiness Quadrant (Software Upgrade without container)
      • Great mechanism might put the vector onto the 1st quadrant…
      • Run all stateless applications on container clusters
  79. Progress of Full-containerization in Global
  80. Progress of Full-containerization in Global
      • 17/18 apps are running on containers (94% completed)
  81. Dev
  82. Dev
      • The date Ruby … was released: Dec …th
      • Developers can control Ruby versions without SREs' support
  83. Happiness Quadrant (Software Upgrade with container)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: training cost, learning cost, New software version (×2), operational cost reduction
  84. Happiness Quadrant (Software Upgrade with container)
      • Axes: Developers' happiness, SREs' happiness (happy / unhappy)
      • Labels: Total happiness
  85. Happiness Quadrant (Software Upgrade with container)
      • Win - Win
  86. Happiness Quadrant (Software Upgrade with container)
      • Plus, SREs can focus on container platform (more best practices can be introduced)
  87. No ssh debugging
  88. Cons of Applications on Container Platform
      • Additional Complexities for Developers
      • Lack of tools causes chaos
  89. Cons of Applications on Container Platform
      • ref: https://speakerdeck.com/takanabe/challenges-for-global-service-from-a-perspective-of-sre (slide …)
  90. Cons of Applications on Container Platform
      • Already Enough?
  92. Cons of Applications on Container Platform
      • Additional Complexities for Developers
      • Lack of tools causes chaos
      • No ssh debugging systems for Global team
        - Granular and chronological metrics dashboard
        - Container optimized New Relic agent deployment
        - Short-term log collection
        - Safe rails console for container
  93. Granular and chronological metrics dashboard
      • Diagram labels: Containers; export metrics; Short-term TDB; Long-term TDB; InfluxDB; Prometheus; downsampling; Grafana; Developer; Timeseries Database; User Interface
  94. Granular and chronological metrics dashboard
      • Before
        - We cannot dig into errors caused by short spikes of resource saturation
      • After
        - We can recognize errors caused by short spikes of resource saturation
        - We can judge whether errors should be fixed soon or not
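To illustrate why granularity matters for spotting saturation spikes, the sketch below compares fine-grained samples with a coarser, averaged view. Averaging is only one possible downsampling strategy and is an assumption here, as are the sample values and the 10-second interval; the slides do not say how the real pipeline downsamples.

```ruby
# Average consecutive samples in fixed-size buckets (a simple form of
# downsampling; real time-series databases offer other aggregations too).
def downsample(samples, bucket_size)
  samples.each_slice(bucket_size).map { |bucket| bucket.sum.to_f / bucket.size }
end

# Made-up CPU utilisation (%) sampled every 10 seconds for one minute,
# containing a single saturation spike.
cpu = [20, 22, 21, 100, 23, 21]

coarse = downsample(cpu, 6) # one value per minute

puts "peak at 10s resolution: #{cpu.max}"
puts "peak at 60s resolution: #{coarse.max}"
```

At the coarse resolution the spike is averaged away, so an error that coincided with it would look unrelated to resource saturation. Keeping a short-term, fine-grained store next to a downsampled long-term one avoids that blind spot.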
  95. Container optimized New Relic agent deployment
      • Diagram labels: Container; shared memory; agent start flag; http app _new_relic start; APP; rack new_relic starter; hako partiarelic; exec consul lock
  96. Container optimized New Relic agent deployment
      • Before
        - ECS cannot deploy a New Relic agent to a specific container (We want to save …)
        - Agents are gone when containers are killed accidentally
      • After
        - ECS can deploy a New Relic agent to a container
        - Distributed locking via `consul lock` sidecar
        - Rack middleware that provides an endpoint to start the New Relic agent
        - Agents are launched in a container when agent start flag exists on shared memory
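The "After" design can be read as: an HTTP endpoint records a start flag, and any worker that sees the flag starts its agent once. Below is a rough sketch of that flow as a Rack-style middleware. Everything named here is an assumption made for illustration: the `AgentStarter` class, the `/_agent/start` path, the flag file (standing in for shared memory) and the injected `starter` callable (standing in for the real agent start call). It does not reproduce the actual middleware shown on the slide, and the `consul lock` step is not modelled.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical Rack-style middleware: starts an agent on demand and
# remembers the request in a flag file so other workers can follow.
class AgentStarter
  START_PATH = "/_agent/start" # assumed path, not from the slides

  def initialize(app, flag_path:, starter:)
    @app = app
    @flag_path = flag_path
    @starter = starter
    @started = false
  end

  def call(env)
    if env["REQUEST_METHOD"] == "POST" && env["PATH_INFO"] == START_PATH
      FileUtils.touch(@flag_path) # leave the flag for other workers
      ensure_started
      return [200, { "content-type" => "text/plain" }, ["agent start requested\n"]]
    end

    # A worker that sees the flag starts its own agent lazily.
    ensure_started if File.exist?(@flag_path)
    @app.call(env)
  end

  private

  # Start the agent at most once per middleware instance.
  def ensure_started
    return if @started

    @starter.call
    @started = true
  end
end

app = ->(_env) { [200, { "content-type" => "text/plain" }, ["hello\n"]] }
starts = 0
flag = File.join(Dir.mktmpdir, "agent_start_flag")
stack = AgentStarter.new(app, flag_path: flag, starter: -> { starts += 1 })

stack.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" })              # no flag yet
stack.call({ "REQUEST_METHOD" => "POST", "PATH_INFO" => "/_agent/start" }) # sets the flag
stack.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" })              # already started
puts "agent started #{starts} time(s)"
```

Because the flag outlives a single worker in this sketch, a restarted worker would pick the agent back up on its next request.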
  97. Short-term log collection
      • Diagram labels: Containers; export logs (×2); Short-term log search; Long-term log search; S3; Athena; Elasticsearch; hako-console; Log search User Interface; Kibana; Developer
  98. Short-term log collection
      • Before
        - Developers have to wait for a few minutes to search logs
      • After
        - Developers can check logs in near real time
  99. Safe rails console for container
      • Diagram labels: Containers; Export audit logs; trapdoor console; User Interface; Slack; trapdoor agent; APP; Interactive communication via WebSocket (×2); Developer; Manage access privileges; Azure AD; trapdoor proxy; Bin exec; data only container
  100. Safe rails console for container
       • Before
         - Developers ssh to servers and run `rails c` (sometimes `rails c -s`)
         - Developers can run write queries in production (historical technical debt)
       • After
         - Developers can use a REPL via web browser with safe options selected by SREs
         - Developers can only run read queries on a designated database instance
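As a small illustration of the "read queries only" rule, the sketch below rejects statements that do not begin with a read-only keyword. This is a naive, hypothetical filter written for this transcript; the method names and the keyword list are assumptions, and the slides do not describe how the real console enforces the rule. A prefix check is easy to get around, so the dependable enforcement would come from database privileges or a read-only instance, which matches the "designated database instance" on the slide.

```ruby
# Hypothetical guard: accept a single statement that starts with a
# read-only keyword. Illustrative only, not a security boundary.
READ_ONLY_PREFIXES = %w[SELECT SHOW EXPLAIN].freeze

def read_only_query?(sql)
  statement = sql.strip.chomp(";")
  return false if statement.include?(";") # refuse multi-statement input

  first_word = statement.split(/\s+/, 2).first.to_s.upcase
  READ_ONLY_PREFIXES.include?(first_word)
end

# Yield the SQL to the caller's block only if it passes the guard.
def run_guarded(sql)
  raise ArgumentError, "write queries are not allowed: #{sql}" unless read_only_query?(sql)

  yield sql
end

run_guarded("SELECT id FROM recipes LIMIT 1") { |sql| puts "would run: #{sql}" }

begin
  run_guarded("DELETE FROM recipes") { |sql| puts "would run: #{sql}" }
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```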
  101. Safe rails console for container (before)

       takayuki-watanabe@ssh-accepatable-host-xxx:~$ date
       Thu Apr 19 10:53:28 UTC 2018
       takayuki-watanabe@ssh-accepatable-host-xxx:~$ htop
         PID USER    PRI NI VIRT RES  SHR   S CPU% MEM% TIME+    Command
       [snip]
       31817 cookpad  20  0 794M 129M   188 R 93.7  1.7 1923h    ruby bin/rails console production -s
        8773 cookpad  20  0 734M 165M   152 R 91.7  2.2 1800h    ruby bin/rails console production -s
        8107 cookpad  20  0 959M 734M 14228 R 83.7  9.8 40h01:04 ruby bin/rails c production
       [snip]
  102. Safe rails console for container (after)
  103. Safe rails console for container (after)
       • Feasible Self-service makes product development reliable and autonomous!!
  104. Recap
       • What is Cookpad Global?
       • Role of Site Reliability Engineers
       • Paving Roads for Autonomous Teams
         - Challenge 1: Organization Transformation for Greater Autonomy
         - Challenge 2: Feasible Self-service for Autonomous Teams
  105. Thank you!! (takayuki-watanabe@cookpad.com)