Outage Handling Process and Culture of the LINE Platform Server

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Agenda
    - Reliability is one of our important values
    - Introduction to the outage handling process
    - Continuous improvement driven by developer culture

  2. LINE Platform Server
    - LINE Messaging Platform
    - LINE Official Account
    - LINE Shop and LINE STORE
    - LINE Developer Platform
    - LINE OpenChat

  3. Reliability
    is one of the important values of the Server Platform

  4. Outage Count Trends
    Jan to Sep 2021
    [Chart: outage count per month, Jan to Sep; vertical axis from 0 to 8]

  5. Continuous Improvement from the Outage
    Learning from failure
    [Diagram labels: Outages / Not to fear / Developer Culture / Meaningful Experiences / Improved Platform]

  6. Outage Handling Process

  7. Outage Handling Process
    - DETECT & CONTACT: Contact the team in charge
    - CLASSIFY: Classify outage
    - BROADCAST: Broadcast outage
    - REPAIR: Restore service
    - WRITE REPORT: Write & distribute outage report
    - RETROSPECTIVE: Arrange retrospective meeting

  8. Outage Handling Process
    DETECT & CONTACT
    - Contact List
    - Easy to find (indexed by product and component)
    - The primary contact point is a person
    - The secondary contact point is a Slack channel

  9. Outage Handling Process
    CLASSIFY
    Each outage has a level (1 to 5)
    - The outage level is decided by each product's rules
    - Example axes for classification:
      - Coverage: percentage of DAU affected by the outage
      - Seriousness: decided by the feature and its status (unusable, unstable)

    Seriousness  | Coverage ≥ 1% | Coverage 0.3% to < 1% | Coverage < 0.3%
    High         | 1             | 1                     | 3
    Medium High  | 2             | 3                     | 4
    Medium       | 2             | 3                     | 4
    Medium Low   | 3             | 4                     | 4
    Low          | 4             | 4                     | 5
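
The classification matrix above can be read as a two-step lookup: pick the coverage band, then read the level for the given seriousness. The following is a minimal sketch of that lookup, not LINE's actual tooling; the names `LEVELS`, `coverage_band`, and `classify` are hypothetical, and the matrix values are copied from the slide as shown.

```python
# Level matrix copied from the slide: one row per seriousness,
# columns are the coverage bands (>= 1%, 0.3% to < 1%, < 0.3%).
LEVELS = {
    "High":        (1, 1, 3),
    "Medium High": (2, 3, 4),
    "Medium":      (2, 3, 4),
    "Medium Low":  (3, 4, 4),
    "Low":         (4, 4, 5),
}


def coverage_band(coverage_pct: float) -> int:
    """Return the column index for a coverage given as a percentage of DAU."""
    if coverage_pct >= 1.0:
        return 0
    if coverage_pct >= 0.3:
        return 1
    return 2


def classify(seriousness: str, coverage_pct: float) -> int:
    """Look up the outage level (1 to 5) for a seriousness and coverage."""
    return LEVELS[seriousness][coverage_band(coverage_pct)]


# The case study later in the deck (Medium seriousness, 0.01% coverage).
case_study_level = classify("Medium", 0.01)
```

With the case-study inputs, the lookup lands in the "Medium" row and the "< 0.3%" column, which matches the level reported on the Outage Summary slide.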

  10. Outage Handling Process
    BROADCAST
    - The Dev Lead is responsible for broadcasting, or delegates it to another member
    - Clarify the sharing channel
    - LINE has an internal Slack channel for outage sharing
    - Information spreads from there to stakeholders
    - The sharing template helps clear communication
    [Outage notice]
    • Outage level:
    • Outage product:
    • Detection time:
    • Issues:
    • Cause:
    • Services affected:
    • Status:
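
A template like the one above is easy to fill in consistently if it is generated from structured fields. The sketch below is one hedged way to do that; the `OutageNotice` class and its attribute names are hypothetical, and the example values are placeholders rather than text from a real notice.

```python
from dataclasses import dataclass


@dataclass
class OutageNotice:
    """Fields of the sharing template; the attribute names are illustrative."""
    level: int
    product: str
    detection_time: str
    issues: str
    cause: str
    services_affected: str
    status: str

    def render(self) -> str:
        """Render the notice in the same order as the template on the slide."""
        return "\n".join([
            "[Outage notice]",
            f"• Outage level: {self.level}",
            f"• Outage product: {self.product}",
            f"• Detection time: {self.detection_time}",
            f"• Issues: {self.issues}",
            f"• Cause: {self.cause}",
            f"• Services affected: {self.services_affected}",
            f"• Status: {self.status}",
        ])


# Placeholder values for illustration only.
notice = OutageNotice(
    level=4,
    product="(example product)",
    detection_time="(example time)",
    issues="(what users are seeing)",
    cause="Under investigation",
    services_affected="(example services)",
    status="Handling in progress",
)
text = notice.render()
```

Because every field is required by the dataclass, a notice cannot be posted with a template line silently missing.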

  11. Outage Handling Process
    REPAIR
    - Focus on reducing outage duration and impact
    - The Dev Lead is responsible for controlling the repair and updating the status

  12. Outage Handling Process
    WRITE REPORT
    - The report has a deadline (within 1 working day)
    - The outage report is created from a standardized template
    - All corrective and preventive measures are registered as issue tickets
    - The report is distributed by email to related product members
    [Summary]
    Product / Region
    Level / Seriousness
    Coverage (%) / Reliability (%)
    Occurred time / Detected time / Resolved time
    Brief description
    [Detail]
    Services affected by the outage
    Cause / Timeline and resolution
    [Corrective and preventive measures]
    Preventing future outages
    Improving outage detection
    Improving outage handling

  13. Outage Handling Process
    RETROSPECTIVE
    - Hold the retrospective meeting within 5 working days
    - Invite all Platform Server members
      - Mandatory: members of teams and services that were directly affected by the outage
      - Optional: all other Platform Server members
    - Explain the outage report and get feedback from various viewpoints
    - Update and finalize the report after the retrospective
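
The two deadlines in the process (report within 1 working day, retrospective within 5 working days) can be computed mechanically. The sketch below assumes that both deadlines are counted from the day of the outage and that working days are Monday to Friday with no public holidays; neither assumption is stated in the slides, and the function names are hypothetical.

```python
from datetime import date, timedelta

REPORT_DEADLINE_DAYS = 1         # "Within 1 working day"
RETROSPECTIVE_DEADLINE_DAYS = 5  # "within 5 working days"


def add_working_days(start: date, days: int) -> date:
    """Advance `start` by `days` working days (Mon-Fri, holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def deadlines(outage_day: date) -> dict:
    """Return the report and retrospective deadlines for an outage day."""
    return {
        "report": add_working_days(outage_day, REPORT_DEADLINE_DAYS),
        "retrospective": add_working_days(outage_day, RETROSPECTIVE_DEADLINE_DAYS),
    }


# Example: an outage on Wednesday, 4 Aug 2021.
example = deadlines(date(2021, 8, 4))
```

A team with a holiday calendar would swap the weekday check for a lookup against that calendar.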

  14. Developer Culture

  15. Prevention from early stage
    Review & Test
    Design Review
    - The developer writes tech spec documents (describing intention and design).
    - The document is reviewed by a peer developer.
    Code Review
    - All code changes require peer review.
    - Code cannot be released without approval from peer developers.
    Testing
    - Recently, it is hard to merge code without unit tests.
    - Many teams operate end-to-end tests (an outcome of preventive measures).

  16. On-Call Duty
    Reacting to monitoring system alarms
    - LINE platform servers have many monitoring conditions (an outcome of outage corrective measures).
    - Some developers have to respond to the alarms.
    - On-call duty is a rotating responsibility among developers.
    - We prevent complex outages through early detection and reaction.

  17. Managing Outage related action items
    OKR of LINE Platform Server
    - All outage action items are registered as issue tickets
    - An OKR tracks the resolution of outage tickets

  18. Outage Action Items Status
    Jan to Sep 2021
    [Chart: 83% Completed or In Progress, 17% To do; Total: 242]

  19. Case Study
    API inactivity due to increased GC during restart

  20. Outage Broadcast

  21. Outage Summary
    Product : LINE DPP
    Level : 4
    Seriousness : Medium
    Region : All
    Coverage : 0.01%
    Reliability : 99%
    Occurred At : 4th Aug 17:50:35
    Detected At : 4th Aug 17:53:00
    Resolved At : 4th Aug 18:32:09
    Brief Description
    Due to a proxy configuration error, request control did not work
    properly during the Channel Gateway restart. As a result, user
    requests were delivered as usual before the server had started, and
    GC occurred frequently during server start-up because slow start
    was not working…
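
The timestamps in this summary give the time to detect and the time to resolve directly. The sketch below works them out; it assumes the year is 2021 (the talk's reporting period) and takes the detection time as 17:53:00, in line with the error notification at 17:53 in the timeline. The function name `outage_durations` is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps from the summary; the year 2021 is an assumption.
occurred = datetime.strptime("2021-08-04 17:50:35", FMT)
detected = datetime.strptime("2021-08-04 17:53:00", FMT)
resolved = datetime.strptime("2021-08-04 18:32:09", FMT)


def outage_durations(occurred: datetime, detected: datetime,
                     resolved: datetime) -> dict:
    """Split an outage into detection, handling, and total durations."""
    return {
        "time_to_detect": detected - occurred,
        "detect_to_resolve": resolved - detected,
        "total": resolved - occurred,
    }


durations = outage_durations(occurred, detected, resolved)
# time_to_detect is 2 min 25 s; total outage time is 41 min 34 s.
```

Tracking these two numbers separately shows whether an improvement item should target detection (monitoring) or handling (rollback steps).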

  22. Services affected by the outage
    Overview
    Affected Users: 16,818
    Affected APIs: 92 kinds, 139,860 requests
    Affected Channels : 338

  23. Timeline and resolution
    Time      | Name                        | Detail | Remarks
    8/3 13:24 | Updated configuration       | Some field data was missing on private servers. |
    8/4 17:50 | Restart servers             | | Outage started
    8/4 17:53 | Received error notification | [ERROR] (rejected_requests[RedisClientCircuitBreaker] : count_redisCircuitBreaker > 0) 14.0 > 0 | Outage detected
    8/4 18:01 | Start to rollback           | | Start handling
    8/4 18:32 | Rollback finished           | | Outage recovered

  24. Corrective and preventive measures
    Preventing future outages
    • Active health check from proxy server to backends
    Improving outage detection
    • Adding health metrics
    Improving outage handling
    • Improve rollback & service-out steps in the release guide
    Rule
    • Every action item should be executable
    • Every action item should be registered as a ticket in the issue tracking system
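
To illustrate the first measure, here is a minimal sketch of what an active health check can look like: the proxy probes each backend itself and only routes to backends that have passed several consecutive probes. This is a generic illustration, not the configuration LINE deployed; the class name, thresholds, and the injected `probe` callable are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class ActiveHealthChecker:
    """Tracks backend health from probes initiated by the proxy side."""

    def __init__(self, backends: Iterable[str], probe: Callable[[str], bool],
                 healthy_after: int = 2, unhealthy_after: int = 3) -> None:
        self._probe = probe
        self._healthy_after = healthy_after
        self._unhealthy_after = unhealthy_after
        # Backends start as unhealthy, so a restarting server receives
        # no traffic until it has passed enough probes.
        self._healthy: Dict[str, bool] = {b: False for b in backends}
        self._passes: Dict[str, int] = {b: 0 for b in self._healthy}
        self._fails: Dict[str, int] = {b: 0 for b in self._healthy}

    def run_once(self) -> None:
        """Probe every backend once and update its health state."""
        for backend in self._healthy:
            if self._probe(backend):
                self._passes[backend] += 1
                self._fails[backend] = 0
                if self._passes[backend] >= self._healthy_after:
                    self._healthy[backend] = True
            else:
                self._fails[backend] += 1
                self._passes[backend] = 0
                if self._fails[backend] >= self._unhealthy_after:
                    self._healthy[backend] = False

    def routable(self) -> List[str]:
        """Return the backends that may currently receive traffic."""
        return [b for b, ok in self._healthy.items() if ok]


# Simulated probe results stand in for real HTTP or TCP checks.
state = {"backend-a": True, "backend-b": False}
checker = ActiveHealthChecker(state.keys(), probe=lambda b: state[b])
checker.run_once()
checker.run_once()
routable_now = checker.routable()
```

In this simulation only the backend that answers its probes becomes routable, which is the behavior that keeps requests away from a server that is still starting up.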

  25. Lessons learned over 10 years
    The process is fragile. It needs to be kept updated and be awarded
    Some things are not solved autonomously
    Culture over Process

  26. Summary
    Outage Handling Process and Culture in LINE Platform's Server
    [Diagram labels: Platform Reliability / Learn from Outages / Outage Handling Process / Develop Culture]
