Challenges for Global Service from a Perspective of SRE 2nd season

Cookpad TechConf 2019: https://techconf.cookpad.com/2019/

Transcript

  1. Challenges for Global Service from a Perspective of SRE ~ 2nd season ~
    Takayuki Watanabe, Technology Department SRE Group, Cookpad Inc.
    Feb 27th, 2019
  2. About me
    • Takayuki Watanabe
      - Twitter: takanabe_w / GitHub: takanabe
    • Site Reliability Engineer (a.k.a. SRE)
      - Focus on Cookpad Global Projects
  3. Today’s menu
    • What is Cookpad Global?
    • Role of Site Reliability Engineers
    • Paving Roads for Autonomous Teams
      - Challenge 1: Organization Transformation for Greater Autonomy
      - Challenge 2: Feasible Self-service for Autonomous Teams
  4. What is Cookpad Global Service?
    • Japanese Cookpad App / Global Cookpad Apps
    • JP Service ≠ Global Service
  5. What is Cookpad Global Service? Our users across the globe
    • [Map] Service provided countries
    • 71 Countries, 26 Languages
  6. What is Cookpad Global Service? Our users across the globe
    • [Map] Service provided countries
    • 94 Million Monthly Average Users
  7. What is Cookpad Global Service? # of Recipes for Global Service
    • Over … million recipes
    • … million recipes since …
  8. Users and Developers across the globe
    • Global Service and SRE?
    • Empower a high-performing technology organization
    • Global headquarters: Bristol, UK
  9. Users and Developers across the globe
    • Global headquarters: Bristol, UK
    • 100 People, 25 Nationalities
  10. Users and Developers across the globe
    • Global headquarters: Bristol, UK
    • The best people join from all over the world
  13. Role of Site Reliability Engineers: Missions for SREs in Cookpad
    • Maximize user experiences in terms of:
      - Service availability
      - Performance
      - Security
      - etc.
    • Build a great platform to support a growing product
      - A platform optimized for product development
      - Software architects with comprehensive knowledge of the technology
  14. Role of Site Reliability Engineers: Missions for SREs in Cookpad
    • Control service availability based on various factors
  15. Role of Site Reliability Engineers: SRE technology scope in Cookpad
    • Areas: Service Platforms, Observability Engineering, Misc in-house Tooling, Release Engineering, Resilience Engineering
    • Topics: Distributed Tracing, Metrics Monitoring, Logging System, Alerts Management, ML Based Anomaly Detection, Data Analysis, Team Health Visualization, AWS Cost Optimization, Developer Friendly Auth System, etc., Deploy Pipeline, Continuous Integration, Continuous Delivery, Deploy Strategy, NW Fault Injection, Spot Instance, Circuit Breaker, Throttling, VM / Container Platform on AWS
  16. Challenges in 2018
    • Attacks from China
    • GDPR
    • Recipe data migration
    • EKS based staging
    • Recruitment in UK
    • Observability
    • Full containerization
    • Spot instances
    • Expense reduction
    • Toil analysis automation
  17. Paving Roads for Autonomous Teams
    • Challenge 1: Organization Transformation for Greater Autonomy
    • Challenge 2: Feasible Self-service for Autonomous Teams
  18. Challenge 1: Organization Transformation for Greater Autonomy
    • Tipping Points for Autonomous Teams
    • Organization Transformation: Chapter and Squad
    • Development style change for new team structure
    • Necessity of shared responsibility for service availability
  19. Tipping Points for Autonomous Teams
    • Cookpad employees in UK
      - 2016: 5 people
      - 2017: 50 people
      - 2018: 100 people
    • Team structure in 2016, 2017: Web, iOS, Android, QA, SRE, PM, ML
  20. Tipping Points for Autonomous Teams
    • lines = n(n − 1) / 2
    • Communication cost ↑
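As a rough illustration of why communication cost rises so quickly, the sketch below evaluates the n(n − 1) / 2 formula for the UK headcounts quoted in the previous slide. The method name `communication_lines` is introduced here for illustration only.

```ruby
# Number of pairwise communication lines among n people: n(n - 1) / 2.
# The product of two consecutive integers is always even, so integer
# division is exact here.
def communication_lines(n)
  n * (n - 1) / 2
end

# Headcounts from the "Cookpad employees in UK" slide.
{ 2016 => 5, 2017 => 50, 2018 => 100 }.each do |year, people|
  puts "#{year}: #{people} people -> #{communication_lines(people)} lines"
end
```

Going from 5 to 100 people multiplies the headcount by 20 but the number of possible lines by 495 (10 to 4950), which is the pressure behind splitting into smaller squads.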
  21. Organization Transformation: Chapter and Squad
    • [Diagram] Before: Web, iOS, Android, QA, SRE, PM, ML
    • [Diagram] After: Web / iOS / Android / QA (×3), SRE, Chapter, PM (×3), Product Squad, ML, Cross-platform Squad ・・・
  24. Challenge 1: Organization Transformation for Greater Autonomy
    • [Diagram] Before / After team structure (Chapter, Product Squad, Cross-platform Squad)
    • Conway's law … http://www.melconway.com/Home/Conways_Law.html
  25. Development style change for new team structure
    • Architecture of new feed
    • New development styles
  26. Architecture of new feed
    • [Diagram] Message broker, Main API, Cache, DB, Feed API, Cache, DB
    • Complete feed json/html
    • GET /user_id/feed: List of activity primary keys in order, paginated
  27. Architecture of new feed
    • [Diagram] Message broker, Main API, Cache, DB, Feed API, Cache, DB
    • Complete feed json/html
    • GET /user_id/feed: List of activity primary keys in order, paginated
    • New components developed by a squad
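The split described on this slide, where one endpoint returns only an ordered, paginated list of activity primary keys and another component turns those keys into the complete feed, could be sketched roughly as follows. Everything here is hypothetical: the in-memory hashes stand in for the two databases, and `feed_page` / `complete_feed` are illustrative names, not Cookpad's actual API.

```ruby
# Hypothetical stand-in for the feed store: user_id => activity primary keys,
# newest first.
FEED_KEYS = { 42 => [105, 104, 103, 102, 101] }.freeze

# Hypothetical stand-in for the main DB: activity primary key => record.
ACTIVITIES = {
  101 => "cooked a recipe",
  102 => "published a recipe",
  103 => "followed a user",
  104 => "commented on a recipe",
  105 => "bookmarked a recipe"
}.freeze

# Feed side (GET /user_id/feed): return one page of ordered primary keys only.
def feed_page(user_id, page:, per_page: 2)
  keys = FEED_KEYS.fetch(user_id, [])
  keys.slice((page - 1) * per_page, per_page) || []
end

# Main side: resolve the keys into the complete feed, preserving their order.
def complete_feed(user_id, page:)
  feed_page(user_id, page: page).map { |id| { id: id, body: ACTIVITIES.fetch(id) } }
end

puts complete_feed(42, page: 1).inspect
```

Keeping only keys on the feed side means the new component does not need to know the shape of an activity, which is one plausible reason a squad could build it independently.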
  28. New development styles (Partial release in production)
    • Feature toggle (application level control)
        # WIP code for new notification system
        # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy
        Rollout.add :notification_center, owner: "xxxxxx-squad" do
          # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e
          Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE])
        end
    • Prototype environment (platform level control)
  29. New development styles (Partial release in production)
    • Feature toggle (application level control)
    • Prototype environment (platform level control)
    • Only users know answers
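To make the feature toggle idea concrete, here is a minimal, self-contained sketch of a toggle registry in the spirit of the `Rollout.add` snippet on the slide. `MiniRollout`, `enabled?` and the user ids are assumptions made for this example; the slide does not show how the real `Rollout` is implemented.

```ruby
# Hypothetical minimal registry; NOT the actual Rollout implementation.
module MiniRollout
  Feature = Struct.new(:name, :owner, :condition)
  @features = {}

  class << self
    # Register a feature with its owning squad and a block that decides
    # whether a given user sees it.
    def add(name, owner:, &condition)
      @features[name] = Feature.new(name, owner, condition)
    end

    # Unknown features are treated as disabled.
    def enabled?(name, user_id)
      feature = @features[name]
      feature ? !!feature.condition.call(user_id) : false
    end
  end
end

ALLOWED_IDS = [1001, 1002].freeze # hypothetical developer user ids

MiniRollout.add(:notification_center, owner: "xxxxxx-squad") do |user_id|
  ALLOWED_IDS.include?(user_id)
end

puts MiniRollout.enabled?(:notification_center, 1001) # listed developer
puts MiniRollout.enabled?(:notification_center, 9999) # any other user
```

Because the condition is just a block, the same registry could later switch from an allow list to a percentage of users without touching the call sites.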
  30. Was feed a successful feature?
    • Yes, feed was one of the most successful features in 2018
      - New architecture
      - New technology stack
      - 100% release in production in a short time
  31. Why was feed successful?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
      - Feed was developed from scratch
      - Developers could choose appropriate technology
        - Introduce Streamy, Karafka (stream app frameworks)
        - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
  32. Why was feed successful?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
    • Rapid prototyping was successful
  33. Why was feed successful?
    • Rapid prototyping was successful
    • On the other hand …
  34. Necessity of shared responsibility for service availability
  35. Necessity of shared responsibility for service availability
    • Too many errors SREs cannot understand…
  36. Happiness Quadrant (release new feed)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: new product, new product
  37. Happiness Quadrant (release new feed)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: new product, tough experiences
  40. Happiness Quadrant (release new feed)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: new product, tough experiences
    • Not sustainable …
  41. Why did this situation happen?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
      - Feed was developed from scratch
      - Developers could choose appropriate technology
        - Introduce Streamy, Karafka (stream app frameworks)
        - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
    • No concept of shared responsibility for service availability
  42. (WIP) Shared responsibility as Autonomous Teams
    • Shared responsibility for organization sustainability
      - Reach consensus on service availability for each feature
        - Targets decided by product owners
      - Higher quality in emergency notifications
        - Alert handling by appropriate people
    • Another organization transformation based on ideal tech & business architectures
  43. (WIP) Shared responsibility as Autonomous Teams
    • Shared responsibility for organization sustainability
    • Another organization transformation based on ideal tech & business architectures
    • Inverse Conway Maneuver …
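One way to make "consensus on service availability for each feature" tangible is to translate each availability target into the downtime it permits. The sketch below does that arithmetic; the feature names and percentages are invented for illustration and are not taken from the talk.

```ruby
# Allowed downtime in minutes for an availability target over a period.
def allowed_downtime_minutes(target_percent, days: 30)
  total_minutes = days * 24 * 60
  (total_minutes * (100 - target_percent) / 100.0).round(1)
end

# Hypothetical per-feature targets, as a product owner might set them.
TARGETS = { "search" => 99.9, "feed" => 99.5 }.freeze

TARGETS.each do |feature, target|
  minutes = allowed_downtime_minutes(target)
  puts "#{feature}: #{target}% over 30 days allows #{minutes} minutes of downtime"
end
```

Seeing a target as "about 43 minutes per month" versus "216 minutes per month" gives product owners and SREs a shared number to negotiate over and to alert against.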
  44. Challenge 2: Feasible Self-service for Autonomous Teams
    • Four Important Keys for Successful Autonomous Teams
    • Feasible Self-service for Developers
    • Our focused scope
      - Full-containerization
      - No ssh debugging
  45. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
      - Technology stack, team structure
    • Freedom: Ownership for individual developments
      - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
      - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
      - Logging and monitoring system, deploy pipeline
  49. Four Important Keys for Successful Autonomous Teams
    • Optimization: Best practices for product developments
      - Logging and monitoring system, deploy pipeline, feature toggle
  50. Four Important Keys for Successful Autonomous Teams
    • Organization strategy matters
  51. Four Important Keys for Successful Autonomous Teams
    • Organization strategy matters
    • Strong leadership across tech and business is essential
  52. Four Important Keys for Successful Autonomous Teams
    • SRE squad can contribute
  53. Four Important Keys for Successful Autonomous Teams
    • SRE squad can contribute
    • Optimized self-service mechanisms providing company-wide best practices in SRE
  54. Feasible Self-service for Developers
    • Low learning cost
      - e.g.: Are you sure that developers are happy to learn and maintain k8s yaml?
    • Secure and painless operations in production
      - e.g.: Are experiences provided by SREs comfortable and secure for developers?
  55. Our focused scope
    • Full-containerization
    • No ssh debugging
  56. Pros of Applications on Container Platform
    • Developers can control software version upgrade timing
      - SREs don’t want to maintain legacy VM based service platform
    • Application of in-house tools and company-wide best practices
      - Auto Scaling
      - Cost optimization (spot fleets)
      - Container apps deployment tool (hako)
      - Centralized developer console (hako-console)
      - Easy service mesh integration
      - etc …
    • Immutable infrastructure
      - Version controlled applications and infrastructures
      - No configuration drifts
  58. Happiness Quadrant (Software Upgrade without container)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: Unproductive tasks, New software version
  59. Happiness Quadrant (Software Upgrade without container)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: Blocked time experience, New software version, New software version, Unproductive tasks
  60. Happiness Quadrant (Software Upgrade without container)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: Total happiness
  61. Happiness Quadrant (Software Upgrade without container)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: Total happiness
    • Great mechanism might put the vector onto the 1st quadrant …
  62. Happiness Quadrant (Software Upgrade without container)
    • Great mechanism might put the vector onto the 1st quadrant …
    • Run all stateless applications on container clusters
  63. Progress of Full-containerization in Global
    • 17/18 apps are running on containers (94% complete)
  64. Happiness Quadrant (Software Upgrade with container)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: training cost, learning cost, New software version, New software version, operational cost reduction
  65. Happiness Quadrant (Software Upgrade with container)
    • [Figure] Axes: Developers' happiness, SREs' happiness (happy / unhappy)
    • Labels: Total happiness
  66. Happiness Quadrant (Software Upgrade with container)
    • Labels: Total happiness
    • Win - Win
  67. Happiness Quadrant (Software Upgrade with container)
    • Labels: Total happiness
    • Plus, SREs can focus on container platform (more best practices can be introduced)
  68. Cons of Applications on Container Platform
    • Additional Complexities for Developers
    • Lack of tools causes chaos
  69. Cons of Applications on Container Platform
    • Additional Complexities for Developers
    • Lack of tools creates chaos
    • Ref: speakerdeck.com, takanabe, "Challenges for Global Service from a Perspective of SRE", slide …
  70. Cons of Applications on Container Platform
    • Lack of tools creates chaos
    • Ref: speakerdeck.com, takanabe, "Challenges for Global Service from a Perspective of SRE", slide …
    • Already enough?
  72. Cons of Applications on Container Platform
    • Additional Complexities for Developers
      - Lack of tools causes chaos
    • No ssh debugging systems for Global team
      - Granular and chronological order metrics dashboard
      - Container optimized New Relic agent deployment
      - Short-term log collection
      - Safe rails console for container
  73. Granular and chronological metrics dashboard
    • [Diagram] Containers: export metrics
    • Time series Database: Short-term TDB, Long-term TDB, InfluxDB, Prometheus, downsampling
    • User Interface: Grafana, Developer
  74. Granular and chronological metrics dashboard
    • Before
      - We could not dig into errors caused by spikes in resource saturation
    • After
      - We can recognize errors caused by spikes in resource saturation
      - We can judge whether errors should be fixed soon or not
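The reason granular metrics matter for spike-driven errors can be shown with a small sketch: averaging fine-grained samples into one coarse bucket (a simple form of downsampling) flattens a short spike. The sample values and the `downsample` helper are made up for this example and do not describe the actual pipeline.

```ruby
# Average consecutive samples into coarser buckets (simple mean downsampling).
def downsample(samples, bucket_size)
  samples.each_slice(bucket_size).map { |bucket| bucket.sum / bucket.size.to_f }
end

# Hypothetical CPU utilisation (%) sampled every 10 seconds for one minute.
cpu = [20, 22, 98, 21, 19, 20]

fine_peak   = cpu.max                 # the spike is visible at full resolution
coarse_peak = downsample(cpu, 6).max  # one averaged point per minute

puts "peak at 10s resolution: #{fine_peak}%"
puts "peak at 1min resolution: #{coarse_peak.round(1)}%"
```

A short-term store at full resolution keeps the 98% sample available for debugging, while a long-term store holding only downsampled points remains cheap to retain.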
  75. Container optimized New Relic agent deployment
    • [Diagram] Container, shared memory, agent start flag, APP, exec consul lock
    • Labels: httpapp_new_relicstart, racknew_relicstarter, hakopartiarelic
  76. Container optimized New Relic agent deployment
    • Before
      - ECS cannot deploy a New Relic agent to a specific container (we want to save …)
      - Agents are gone when containers are killed accidentally
    • After
      - ECS can deploy a New Relic agent to a container
        - Distributed locking via `consul lock` sidecar
        - Rack middleware that provides an endpoint to start the New Relic agent
      - Agents are launched in a container when an agent start flag exists on shared memory
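The "Rack middleware that provides an endpoint to start the agent" idea can be sketched without any gems, since a Rack middleware is just an object that responds to `call(env)`. The class name, the `/_new_relic/start` path and the `starter` callable below are assumptions for illustration. This sketch also keeps the "started" flag in process memory, whereas the slide describes a flag on shared memory and a `consul lock` sidecar, neither of which is modelled here.

```ruby
# Hypothetical Rack-style middleware; not the actual in-house implementation.
class AgentStarter
  START_PATH = "/_new_relic/start" # assumed path

  def initialize(app, starter:)
    @app = app
    @starter = starter # callable that boots the monitoring agent
    @started = false
    @mutex = Mutex.new
  end

  def call(env)
    # Every request that is not the start request goes to the wrapped app.
    unless env["REQUEST_METHOD"] == "POST" && env["PATH_INFO"] == START_PATH
      return @app.call(env)
    end

    # Start the agent at most once per process, even under concurrency.
    started_now = @mutex.synchronize do
      next false if @started
      @starter.call
      @started = true
    end

    if started_now
      [200, { "content-type" => "text/plain" }, ["started\n"]]
    else
      [409, { "content-type" => "text/plain" }, ["already started\n"]]
    end
  end
end
```

In a real application the `starter` would wrap whatever call boots the agent, and an operator or deploy tool would POST to the endpoint of the one container chosen to run it.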
  77. Short-term log collection
    • [Diagram] Containers: export logs (short-term), export logs (long-term)
    • Short-term log search: Elasticsearch
    • Long-term log search: S3 / Athena
    • Log search User Interface: hako-console, Kibana, Developer
  78. Short-term log collection
    • Before
      - Developers had to wait a few minutes to search logs
    • After
      - Developers can check logs in near real time
  79. Safe rails console for container
    • [Diagram] Containers: trapdoor agent, APP, data only container, Bin exec
    • trapdoor proxy, trapdoor console (User Interface), Developer
    • Interactive communication via WebSocket
    • Export audit logs: Slack
    • Manage access privileges: Azure AD
  80. Safe rails console for container
    • Before
      - Developers ssh to servers and run `rails c` (sometimes `rails c -s`)
      - Developers can run write queries in production (historical technical debt)
    • After
      - Developers can use a REPL via web browser with safe options selected by SREs
      - Developers can only run read queries on a designated database instance
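As a loose illustration of "only read queries", the sketch below rejects any SQL statement that does not start with a read-only keyword. This is an assumption-laden toy: the keyword list and method names are invented, and a prefix check alone is not a security boundary. In practice the restriction would more plausibly come from database privileges or a read-only instance, as the slide's "designated database instance" suggests.

```ruby
# Hypothetical guard: allow only statements starting with a read-only keyword.
READ_ONLY_PREFIXES = %w[SELECT SHOW EXPLAIN].freeze

def read_only_query?(sql)
  first_word = sql.strip.split(/\s+/, 2).first.to_s.upcase
  READ_ONLY_PREFIXES.include?(first_word)
end

def run_query(sql)
  raise ArgumentError, "write queries are not allowed: #{sql}" unless read_only_query?(sql)

  "executed: #{sql}" # placeholder for sending the query to a read-only instance
end

puts run_query("SELECT id FROM recipes LIMIT 1")
begin
  run_query("DELETE FROM recipes")
rescue ArgumentError => e
  puts e.message
end
```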
  81. Safe rails console for container (before)
    takayuki-watanabe@ssh-accepatable-host-xxx:~$ date
    Thu Apr 19 10:53:28 UTC 2018
    takayuki-watanabe@ssh-accepatable-host-xxx:~$ htop
      PID USER    PRI NI VIRT RES  SHR   S CPU% MEM% TIME+    Command
    [snip]
    31817 cookpad  20  0 794M 129M 188   R 93.7  1.7 1923h    ruby bin/rails console production -s
     8773 cookpad  20  0 734M 165M 152   R 91.7  2.2 1800h    ruby bin/rails console production -s
     8107 cookpad  20  0 959M 734M 14228 R 83.7  9.8 40h01:04 ruby bin/rails c production
    [snip]
  82. Safe rails console for container (after)
    • Feasible Self-service makes product development reliable and autonomous!!
  83. Recap
    • What is Cookpad Global?
    • Role of Site Reliability Engineers
    • Paving Roads for Autonomous Teams
      - Challenge 1: Organization Transformation for Greater Autonomy
      - Challenge 2: Feasible Self-service for Autonomous Teams