Challenges for Global Service from a Perspective of SRE 2nd season

Cookpad TechConf 2019: https://techconf.cookpad.com/2019/

Transcript

  1. Cookpad Inc.
    Feb 27th, 2019
    Takayuki Watanabe
    Technology Department SRE Group
    Challenges for Global Service from a Perspective of SRE
    ~ 2nd season ~

  2. About me
    • Takayuki Watanabe
    - Twitter: takanabe_w / GitHub: takanabe
    • Site Reliability Engineer (a.k.a. SRE)
    - Focus on Cookpad Global Projects

  3. Today’s menu
    • What is Cookpad Global?
    • Role of Site Reliability Engineers
    • Paving Roads for Autonomous Teams
    - Challenge 1: Organization Transformation for Greater Autonomy
    - Challenge 2: Feasible Self-service for Autonomous Teams

  4. What is Cookpad Global?

  5. What is Cookpad Global Service?
    Japanese Cookpad App
    Global Cookpad Apps

  6. What is Cookpad Global Service?
    Japanese Cookpad App
    Global Cookpad Apps
    JP Service ≠ Global Service

  7. Our users across the globe
    Service provided countries

  8. Our users across the globe
    Service provided countries
    71 Countries
    26 Languages

  9. Our users across the globe
    Service provided countries
    94 Million Monthly Average Users

  10. # of Recipes for Global Service
    # of Recipes
    Over … million recipes
    … million recipes since …

  11. Users and Developers across the globe
    • Global Service and SRE?
    • Empower a high-performing technology organization
    Global headquarters: Bristol, UK

  12. Users and Developers across the globe
    • Global Service and SRE?
    • Empower a high-performing technology organization
    Global headquarters: Bristol, UK
    100 People
    25 Nationalities

  13. Users and Developers across the globe
    • Global Service and SRE?
    • Empower a high-performing technology organization
    Global headquarters: Bristol, UK
    The best people join from all over the world

  14. Role of Site Reliability Engineers

  15. 15

  16. A user living beyond a log

  17. 17

  18. Our Product Developers

  19. Missions for SREs in Cookpad
    • Maximize user experiences in terms of:
    - Service availability
    - Performance
    - Security
    - etc.
    • Build a great platform to support a growing product
    - Platform optimized for product development
    - Software architects with comprehensive knowledge of technology

  20. Missions for SREs in Cookpad
    • Maximize user experiences in terms of:
    - Service availability
    - Performance
    - Security
    - etc.
    • Build a great platform to support a growing product
    - Platform optimized for product development
    - Software architects with comprehensive knowledge of technology
    Control service availability based on various factors

  21. SRE technology scope in Cookpad
    Service Platforms
    • VM / Container Platform on AWS

  22. SRE technology scope in Cookpad
    Service Platforms
    • VM / Container Platform on AWS
    Observability Engineering
    • Distributed Tracing
    • Metrics Monitoring
    • Logging System
    • Alerts Management
    • ML Based Anomaly Detection
    Misc in-house Tooling
    • Data Analysis
    • Team Health Visualization
    • AWS Cost Optimization
    • Developer Friendly Auth System
    • etc.
    Release Engineering
    • Deploy Pipeline
    • Continuous Integration
    • Continuous Delivery
    • Deploy Strategy
    Resilience Engineering
    • NW Fault Injection
    • Spot Instance
    • Circuit Breaker
    • Throttling

  23. Challenges in 2018
    Attacks from China
    GDPR
    Recipe data migration
    EKS-based staging
    Recruitment in the UK
    Observability
    Full containerization
    Spot instances
    Expense reduction
    Toil analysis automation

  24. Paving Roads for Autonomous Teams

  25. Paving Roads for Autonomous Teams
    • Challenge 1: Organization Transformation for Greater Autonomy
    • Challenge 2: Feasible Self-service for Autonomous Teams

  26. Challenge 1
    Organization Transformation for Greater Autonomy

  27. Organization Transformation for Greater Autonomy
    • Tipping Points for Autonomous Teams
    • Organization Transformation: Chapter and Squad
    • Development style change for new team structure
    • Necessity of shared responsibility for service availability

  28. Tipping Points for Autonomous Teams
    • Cookpad employees in the UK
    - 2016: 5 people
    - 2017: 50 people
    - 2018: 100 people
    Team structure in 2016, 2017: Web, iOS, Android, QA, SRE, PM, ML

  29. Tipping Points for Autonomous Teams
    lines = n(n − 1) / 2

  30. Tipping Points for Autonomous Teams
    lines = n(n − 1) / 2
    Communication cost ↑
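
To make the formula concrete, the short Ruby sketch below applies lines = n(n − 1) / 2 to the UK headcounts quoted on the "Tipping Points" slide (5, 50 and 100 people). The method name `communication_lines` is an illustrative choice, not something from the talk.

```ruby
# Number of pairwise communication lines among n people: n(n - 1) / 2.
def communication_lines(n)
  n * (n - 1) / 2
end

# Headcounts from the "Cookpad employees in UK" slide.
{ 2016 => 5, 2017 => 50, 2018 => 100 }.each do |year, people|
  puts "#{year}: #{people} people -> #{communication_lines(people)} lines"
end
```

Growing from 5 to 100 people multiplies the headcount by 20 but the number of possible communication lines by roughly 500 (10 versus 4,950), which is the cost increase the slide points at.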

  31. Organization Transformation: Chapter and Squad
    Before: Web, iOS, Android, QA, SRE, PM, ML
    After (diagram labels): Product Squad × 3 … (PM, Web, iOS, Android, QA in each), Cross-platform Squad, SRE, ML, Chapter

  32. Organization Transformation: Chapter and Squad
    Before: Web, iOS, Android, QA, SRE, PM, ML
    After (diagram labels): Product Squad × 3 … (PM, Web, iOS, Android, QA in each), Cross-platform Squad, SRE, ML, Chapter

  33. Organization Transformation: Chapter and Squad
    Before: Web, iOS, Android, QA, SRE, PM, ML
    After (diagram labels): Product Squad × 3 … (PM, Web, iOS, Android, QA in each), Cross-platform Squad, SRE, ML, Chapter

  34. Organization Transformation: Chapter and Squad
    Before: Web, iOS, Android, QA, SRE, PM, ML
    After (diagram labels): Product Squad × 3 … (PM, Web, iOS, Android, QA in each), Cross-platform Squad, SRE, ML, Chapter
    Conway's law …
    http://www.melconway.com/Home/Conways_Law.html

  35. Development style change for new team structure
    • Architecture of new feed
    • New development styles

  36. Architecture of new feed
    Diagram labels: Message broker, Main API, Feed API, Cache (× 2), DB (× 2)
    GET /user_id/feed
    List of activity primary keys in order, paginated
    Complete feed json/html

  37. Architecture of new feed
    Diagram labels: Message broker, Main API, Feed API, Cache (× 2), DB (× 2)
    New components developed by a squad
    GET /user_id/feed
    List of activity primary keys in order, paginated
    Complete feed json/html
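
The feed diagram says that `GET /user_id/feed` returns a "list of activity primary keys in order, paginated". The Ruby sketch below shows one way such a response could be produced. Everything in it is an assumption made for illustration: the `FeedStore` class, the in-memory storage, and the offset-based paging are not taken from the talk.

```ruby
# Hypothetical in-memory feed store: per-user activity primary keys, newest first.
class FeedStore
  def initialize
    @feeds = Hash.new { |hash, user_id| hash[user_id] = [] }
  end

  # Would be called by whatever consumes the message broker when an
  # activity relevant to this user arrives.
  def push(user_id, activity_id)
    @feeds[user_id].unshift(activity_id)
  end

  # Returns one page of activity keys plus the offset of the next page
  # (nil when there are no more pages).
  def page(user_id, offset: 0, per_page: 3)
    ids = @feeds[user_id]
    slice = ids[offset, per_page] || []
    next_offset = offset + per_page < ids.length ? offset + per_page : nil
    { activity_ids: slice, next_offset: next_offset }
  end
end

store = FeedStore.new
(1..5).each { |activity_id| store.push(42, activity_id) }
p store.page(42)            # first page, newest activities first
p store.page(42, offset: 3) # remaining activities, no further page
```

Returning only ordered primary keys keeps the feed component small; in this sketch the caller would still have to load the activities themselves to build the complete feed JSON/HTML.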

  38. New development styles (Partial release in production)
    • Feature toggle (application level control)
    • Prototype environment (platform level control)
    # WIP code for new notification system
    # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy
    Rollout.add :notification_center, owner: "xxxxxx-squad" do
      # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e
      Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE])
    end

  39. New development styles (Partial release in production)
    • Feature toggle (application level control)
    • Prototype environment (platform level control)
    # WIP code for new notification system
    # https://github.com/cookpad/xxxxxx-squad/issues/yyyyyy
    Rollout.add :notification_center, owner: "xxxxxx-squad" do
      # @developer_a, @developer_b, @developer_c, @developer_d, @developer_e
      Current.user&.id&.in?([AAAAAA, BBBBBB, CCCCCC, DDDDDD, EEEEEE])
    end
    Only users know the answers
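
The slide's `Rollout.add` call belongs to an in-house module whose implementation is not shown. As a rough illustration of how an application-level feature toggle of this shape can work, here is a minimal self-contained sketch. `ToggleRegistry`, its methods, and the user ids are hypothetical stand-ins, not Cookpad's code.

```ruby
# Hypothetical stand-in for an application-level feature toggle registry.
# It is NOT the in-house Rollout module shown on the slide.
class ToggleRegistry
  Feature = Struct.new(:name, :owner, :rule)

  def initialize
    @features = {}
  end

  # Register a feature with its owning squad and a rule block that
  # receives the current user id and returns true when enabled.
  def add(name, owner:, &rule)
    @features[name] = Feature.new(name, owner, rule)
  end

  # Unknown features and anonymous users are treated as disabled.
  def enabled?(name, user_id)
    feature = @features[name]
    return false if feature.nil? || user_id.nil?

    feature.rule.call(user_id) ? true : false
  end
end

toggles = ToggleRegistry.new
allowed_ids = [101, 102, 103] # placeholder developer ids
toggles.add(:notification_center, owner: "example-squad") do |user_id|
  allowed_ids.include?(user_id)
end

puts toggles.enabled?(:notification_center, 101) # a listed developer
puts toggles.enabled?(:notification_center, 999) # any other user
```

Recording an owner next to each toggle, as the slide does, makes it easier to find the squad responsible for cleaning a toggle up once the feature is fully released.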

  40. Was feed a successful feature?
    • Yes, feed was one of the most successful features in 2018
    • New architecture
    • New technology stack
    • 100% release in production in a short time

  41. Why was feed successful?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
    • Feed was developed from scratch
    • Developers could choose appropriate technology
    - Introduce Streamy, Karafka (stream app frameworks)
    - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)

  42. Why was feed successful?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
    • Feed was developed from scratch
    • Developers could choose appropriate technology
    - Introduce Streamy, Karafka (stream app frameworks)
    - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
    Rapid prototyping was successful

  43. Why was feed successful?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
    • Feed was developed from scratch
    • Developers could choose appropriate technology
    - Introduce Streamy, Karafka (stream app frameworks)
    - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
    On the other hand …

  44. Necessity of shared responsibility for service availability

  45. Necessity of shared responsibility for service availability
    Too many errors SREs cannot understand…

  46. Happiness Quadrant (release new feed)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    new product

  47. Happiness Quadrant (release new feed)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    tough experiences
    new product

  48. Happiness Quadrant (release new feed)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    tough experiences
    new product

  49. Happiness Quadrant (release new feed)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    tough experiences
    new product

  50. Happiness Quadrant (release new feed)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    tough experiences
    new product
    Not sustainable …

  51. Why did this situation happen?
    • A lot of trials, failures and improvements in a short term
    • Developers had power and responsibility for feature development
    • Feed was developed from scratch
    • Developers could choose appropriate technology
    - Introduce Streamy, Karafka (stream app frameworks)
    - Test Kafka, RabbitMQ, SQS, Kinesis (message brokers)
    • No concept of shared responsibility for service availability

  52. (WIP) Shared responsibility as Autonomous Teams
    • Shared responsibility for organization sustainability
    • Reach consensus on service availability for each feature
    • Targets decided by product owners
    • Higher quality in emergency notifications
    • Alert handling by appropriate people
    • Another organization transformation based on ideal tech & business architectures
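
To illustrate what a per-feature availability target means in practice, the sketch below converts a target percentage into the downtime it tolerates over 30 days. The feature names and target values are hypothetical examples and do not come from the talk; only the arithmetic is standard.

```ruby
MINUTES_PER_30_DAYS = 30 * 24 * 60 # 43,200 minutes

# Downtime (in minutes) that an availability target tolerates over 30 days.
def allowed_downtime_minutes(target_percent)
  ((100 - target_percent) / 100.0 * MINUTES_PER_30_DAYS).round(2)
end

# Hypothetical features and targets, for illustration only.
targets = { "search" => 99.9, "feed" => 99.5, "notification_center" => 99.0 }
targets.each do |feature, target|
  puts "#{feature}: #{target}% -> #{allowed_downtime_minutes(target)} min / 30 days"
end
```

Expressing each target as minutes of tolerated downtime gives product owners and SREs a shared number to discuss when deciding whether an alert needs an immediate response.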

  53. (WIP) Shared responsibility as Autonomous Teams
    • Shared responsibility for organization sustainability
    • Reach consensus on service availability for each feature
    • Targets decided by product owners
    • Higher quality in emergency notifications
    • Alert handling by appropriate people
    • Another organization transformation based on ideal tech & business architectures
    Inverse Conway Maneuver …

  54. Challenge 2
    Feasible Self-service for Autonomous Teams

  55. Feasible Self-service for Autonomous Teams
    • Four Important Keys for Successful Autonomous Teams
    • Feasible Self-service for Developers
    • Our focused scope
    - Full-containerization
    - No ssh debugging

  56. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, Deploy pipeline

  57. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, Deploy pipeline

  58. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, Deploy pipeline

  59. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, Deploy pipeline

  60. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, deploy pipeline, feature toggle

  61. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, deploy pipeline, feature toggle
    Organization strategy matters

  62. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, deploy pipeline, feature toggle
    Organization strategy matters
    Strong leadership across tech and business is essential

  63. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, deploy pipeline, feature toggle
    SRE squad can contribute

  64. Four Important Keys for Successful Autonomous Teams
    • Discipline: Common rules in organization
    - Technology stack, team structure
    • Freedom: Ownership for individual developments
    - Small team, technology selection, system design
    • Responsibility: Commitments for whole software life cycle
    - Design, implementation, test, deploy, service availability monitoring
    • Optimization: Best practices for product developments
    - Logging and monitoring system, deploy pipeline, feature toggle
    SRE squad can contribute
    Optimized self-service mechanisms providing company-wide best practices in SRE

  65. Feasible Self-service for Developers
    • Low learning cost
    - e.g. Are you sure that developers are happy to learn and maintain k8s yaml?
    • Secure and painless operations in production
    - e.g. Are experiences provided by SREs comfortable and secure for developers?

  66. Our focused scope
    • Full-containerization
    • No ssh debugging

  67. Full-containerization

  68. Pros of Applications on Container Platform
    • Developers can control software version upgrade timing
    • SREs don’t want to maintain a legacy VM-based service platform
    • Application of in-house tools and company-wide best practices
    - Auto Scaling
    - Cost optimization (spot fleets)
    - Container apps deployment tool (hako)
    - Centralized developer console (hako-console)
    - Easy service mesh integration
    - etc.
    • Immutable infrastructure
    - Version-controlled applications and infrastructure
    - No configuration drift

  69. Pros of Applications on Container Platform
    • Developers can control software version upgrade timing
    • SREs don’t want to maintain a legacy VM-based service platform
    • Application of in-house tools and company-wide best practices
    - Auto Scaling
    - Cost optimization (spot fleets)
    - Container apps deployment tool (hako)
    - Centralized developer console (hako-console)
    - Easy service mesh integration
    - etc.
    • Immutable infrastructure
    - Version-controlled applications and infrastructure
    - No configuration drift

  70. Development Lead Time

  71. 71
    Dev
    SRE

  72. 72

  73. Gaps between Devs & SREs …

  74. Happiness Quadrant (Software Upgrade without container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Unproductive tasks
    New software version

  75. Happiness Quadrant (Software Upgrade without container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Blocked time experience
    New software version
    Unproductive tasks

  76. Happiness Quadrant (Software Upgrade without container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Total happiness

  77. Happiness Quadrant (Software Upgrade without container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Total happiness
    Great mechanism might put the vector onto the 1st quadrant …

  78. Happiness Quadrant (Software Upgrade without container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Total happiness
    Great mechanism might put the vector onto the 1st quadrant …
    Run all stateless applications on container clusters

  79. Progress of Full-containerization in Global

  80. Progress of Full-containerization in Global
    17/18 apps are running on containers
    (94% complete)

  81. 81
    Dev

  82. 82
    The date Ruby … was released: Dec …th
    Developers can control Ruby versions without SREs' support
    Dev

  83. Happiness Quadrant (Software Upgrade with container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    training cost
    learning cost
    New software version
    New software version, operational cost reduction

  84. Happiness Quadrant (Software Upgrade with container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Total happiness

  85. Happiness Quadrant (Software Upgrade with container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Total happiness
    Win - Win

  86. Happiness Quadrant (Software Upgrade with container)
    Axes: Developers' happiness (happy / unhappy), SREs' happiness (happy / unhappy)
    Total happiness
    Plus, SREs can focus on container platform
    (more best practices can be introduced)

  87. No ssh debugging

  88. Cons of Applications on Container Platform
    • Additional Complexities for Developers
    • Lack of tools causes chaos

  89. Cons of Applications on Container Platform
    • Additional Complexities for Developers
    • Lack of tools causes chaos
    ref: https://speakerdeck.com/takanabe/challenges-for-global-service-from-a-perspective-of-sre (slide …)

  90. Cons of Applications on Container Platform
    • Additional Complexities for Developers
    • Lack of tools causes chaos
    ref: https://speakerdeck.com/takanabe/challenges-for-global-service-from-a-perspective-of-sre (slide …)
    Already enough?

  91. 91

  92. Cons of Applications on Container Platform
    • Additional Complexities for Developers
    • Lack of tools causes chaos
    • No ssh debugging systems for Global team
    - Granular and chronological order metrics dashboard
    - Container optimized New Relic agent deployment
    - Short-term log collection
    - Safe rails console for container

  93. Granular and chronological metrics dashboard
    Diagram labels: Containers, export metrics, Time series Database (Short-term TDB, Long-term TDB, InfluxDB, Prometheus), downsampling, User Interface (Grafana), Developer

  94. Granular and chronological metrics dashboard
    • Before
    - We could not dig into errors caused by spikes in resource saturation
    • After
    - We can recognize errors caused by spikes in resource saturation
    - We can judge whether errors should be fixed soon or not
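
The dashboard diagram includes a downsampling step between the short-term and long-term time series databases. The sketch below shows why granular data matters for spike-driven errors: a short saturation spike that is obvious in per-second samples almost disappears once the samples are averaged per minute. The sample values are made up, and averaging is only one assumed downsampling method; the talk does not say which one is used.

```ruby
# Average consecutive samples into buckets of `bucket_size` points.
def downsample(samples, bucket_size)
  samples.each_slice(bucket_size).map { |bucket| bucket.sum / bucket.size.to_f }
end

# 60 one-second CPU samples (percent): steady 20% with a 3-second spike to 100%.
cpu = Array.new(60, 20)
cpu[30, 3] = [100, 100, 100]

puts "granular max: #{cpu.max}"
puts "1-minute average: #{downsample(cpu, 60).first}"
```

With only the one-minute average available, the saturation that caused the errors would look like a small bump, which matches the "Before" situation described on the slide.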

  95. Container optimized New Relic agent deployment
    Diagram labels: Container, shared memory, agent start flag, HTTP endpoint on the app (_new_relic start), APP (rack-new_relic-starter), hako, partiarelic, exec consul lock

  96. Container optimized New Relic agent deployment
    • Before
    - ECS cannot deploy a New Relic agent to a specific container (We want to save …)
    - Agents are gone when containers are killed accidentally
    • After
    - ECS can deploy a New Relic agent to a container
    - Distributed locking via `consul lock` sidecar
    - Rack middleware that provides an endpoint to start the New Relic agent
    - Agents are launched in a container when agent start flag exists on shared memory
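
The "After" bullets describe a Rack middleware that exposes an endpoint to start the agent, plus a start flag that lets other workers launch it too. The sketch below shows that general pattern in plain Ruby without any gems. It is not the middleware used in the talk: the class names, the `/_agent/start` path, the in-memory flag (standing in for shared memory), and the `start_agent` callable are all hypothetical, and the `consul lock` part is left out entirely.

```ruby
# Rack-style middleware (plain Ruby, no gems required) that exposes one
# endpoint for requesting an agent start. Path and flag store are hypothetical.
class AgentStarter
  START_PATH = "/_agent/start" # hypothetical path, not taken from the talk

  # `flag` stands in for the shared-memory start flag; any object with
  # `set?` and `set!` works. `start_agent` is a callable supplied by the caller.
  def initialize(app, flag:, start_agent:)
    @app = app
    @flag = flag
    @start_agent = start_agent
    @started = false
  end

  def call(env)
    if env["REQUEST_METHOD"] == "POST" && env["PATH_INFO"] == START_PATH
      @flag.set!
      ensure_started
      return [200, { "content-type" => "text/plain" }, ["agent start requested\n"]]
    end

    ensure_started if @flag.set? # e.g. another worker sees the existing flag
    @app.call(env)
  end

  private

  # Start the agent at most once per middleware instance.
  def ensure_started
    return if @started

    @start_agent.call
    @started = true
  end
end

# In-memory stand-in for the shared-memory flag.
class MemoryFlag
  def initialize
    @set = false
  end

  def set?
    @set
  end

  def set!
    @set = true
  end
end

starts = 0
app = ->(_env) { [200, {}, ["ok"]] }
stack = AgentStarter.new(app, flag: MemoryFlag.new, start_agent: -> { starts += 1 })
stack.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" })
stack.call({ "REQUEST_METHOD" => "POST", "PATH_INFO" => AgentStarter::START_PATH })
puts "agent starts: #{starts}"
```

Keeping the flag outside the middleware instance is what lets a second worker in the same container start its own agent on its next request, which mirrors the slide's "start flag exists on shared memory" condition.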

  97. Short-term log collection
    Diagram labels: Containers, export logs, Short-term log search, Long-term log search, S3, Athena, Elasticsearch, hako-console, Log search User Interface, Kibana, Developer

  98. Short-term log collection
    • Before
    - Developers had to wait a few minutes to search logs
    • After
    - Developers can check logs in near real time

  99. Safe rails console for container
    Diagram labels: Containers, APP, trapdoor agent, trapdoor proxy, trapdoor console (User Interface), Developer, Slack, Azure AD (manage access privileges), export audit logs, interactive communication via WebSocket, bin exec, data-only container

  100. Safe rails console for container
    • Before
    - Developers ssh to servers and run `rails c` (sometimes `rails c -s`)
    - Developers can run write queries in production (historical technical debt)
    • After
    - Developers can use a REPL via web browser with safe options selected by SREs
    - Developers can only run read queries on a designated database instance

  101. Safe rails console for container (before)
    takayuki-watanabe@ssh-accepatable-host-xxx:~$ date
    Thu Apr 19 10:53:28 UTC 2018
    takayuki-watanabe@ssh-accepatable-host-xxx:~$ htop
    PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
    [snip]
    31817 cookpad 20 0 794M 129M 188 R 93.7 1.7 1923h ruby bin/rails console production -s
    8773 cookpad 20 0 734M 165M 152 R 91.7 2.2 1800h ruby bin/rails console production -s
    8107 cookpad 20 0 959M 734M 14228 R 83.7 9.8 40h01:04 ruby bin/rails c production
    [snip]

  102. Safe rails console for container (after)

  103. Safe rails console for container (after)
    Feasible Self-service makes product development reliable and autonomous!!

  104. Recap
    • What is Cookpad Global?
    • Role of Site Reliability Engineers
    • Paving Roads for Autonomous Teams
    - Challenge 1: Organization Transformation for Greater Autonomy
    - Challenge 2: Feasible Self-service for Autonomous Teams

  105. Thank you!!