Outage Handling Process and Culture of the LINE Platform Server

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Agenda

    - Reliability is one of our important values
    - Introduction to the outage handling process
    - Continuous improvement driven by developer culture
  2. LINE Platform Server

    - LINE Messaging Platform
    - LINE Official Account
    - LINE Shop and LINE STORE
    - LINE Developer Platform
    - LINE OpenChat
  3. Outage Count Trends, Jan to Sep 2021

    [Chart: number of outages per month, January to September 2021; vertical axis from 0 to 8]
  4. Continuous Improvement from Outages

    - Outages
    - Developer culture: learning from failure, not fearing it
    - Meaningful experiences
    - Improved platform
  5. Outage Handling Process

    - DETECT & CONTACT: Contact the team in charge
    - CLASSIFY: Classify the outage
    - BROADCAST: Broadcast the outage
    - REPAIR: Restore service
    - WRITE REPORT: Write and distribute the outage report
    - RETROSPECTIVE: Arrange a retrospective meeting
  6. Outage Handling Process: DETECT & CONTACT

    - Contact list
      - Easy to find (indexed by product and component)
      - The primary contact point is a person
      - The secondary contact point is a Slack channel
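
A contact list indexed by product and component could be sketched as a simple lookup table. This is only an illustration: the `Contact` type, the `CONTACTS` entries, and the `find_contact` helper are hypothetical names, and the talk does not say how the real list is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    primary_person: str     # a named person, contacted first
    secondary_channel: str  # a Slack channel, used as the fallback


# Hypothetical entries: real products, components, and contacts are not in the talk.
CONTACTS = {
    ("messaging", "gateway"): Contact("alice", "#messaging-gateway"),
    ("openchat", "api"): Contact("bob", "#openchat-api"),
}


def find_contact(product: str, component: str) -> Contact:
    """Look up the contact for a (product, component) pair, case-insensitively."""
    key = (product.lower(), component.lower())
    try:
        return CONTACTS[key]
    except KeyError:
        raise LookupError(f"no contact registered for {product}/{component}") from None
```

Keying on the (product, component) pair keeps the lookup to a single step, which matches the "easy to find" goal stated on the slide.
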
  7. Outage Handling Process: CLASSIFY

    - Each outage has a level (1 to 5)
    - The outage level is decided by each product's rules
    - Example axes for classification:
      - Coverage: percentage of DAU affected by the outage
      - Seriousness: decided by the affected feature and its status (unusable, unstable)

    Seriousness    Coverage ≥ 1%    0.3% to < 1%    < 0.3%
    High           1                1               3
    Medium High    2                3               4
    Medium         2                3               4
    Medium Low     3                4               4
    Low            4                4               5
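
The example matrix on this slide can be read as a lookup from seriousness and coverage to a level. The sketch below encodes exactly the values shown; the function names are hypothetical, and since each product defines its own rule, a real rule may differ.

```python
# Levels copied from the example matrix on the slide.
# Tuple order: coverage >= 1%, 0.3% to < 1%, < 0.3%.
LEVEL_MATRIX = {
    "High": (1, 1, 3),
    "Medium High": (2, 3, 4),
    "Medium": (2, 3, 4),
    "Medium Low": (3, 4, 4),
    "Low": (4, 4, 5),
}


def coverage_band(coverage_pct: float) -> int:
    """Map the affected DAU percentage to a column index of the matrix."""
    if coverage_pct < 0:
        raise ValueError("coverage percentage cannot be negative")
    if coverage_pct >= 1.0:
        return 0
    if coverage_pct >= 0.3:
        return 1
    return 2


def outage_level(seriousness: str, coverage_pct: float) -> int:
    """Return the outage level (1 is the most severe, 5 the least)."""
    return LEVEL_MATRIX[seriousness][coverage_band(coverage_pct)]
```

For instance, `outage_level("Medium", 0.01)` returns 4, which agrees with the sample outage report later in the deck (seriousness Medium, coverage 0.01%, level 4).
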
  8. Outage Handling Process: BROADCAST

    - The Dev Lead is responsible, or delegates to another member
    - Clarify the sharing channel
      - LINE has an internal Slack channel for outage sharing
      - Information spreads from there to stakeholders
    - The sharing template helps clear communication

    [Outage notice]
    • Outage level:
    • Outage product:
    • Detection time:
    • Issues:
    • Cause:
    • Services affected:
    • Status:
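
Filling in the sharing template can be automated so that every notice has the same fields in the same order. A minimal sketch, assuming the notice is posted as plain text; the `OutageNotice` class and its `render` method are hypothetical, and only the field labels come from the slide.

```python
from dataclasses import dataclass


@dataclass
class OutageNotice:
    level: int
    product: str
    detection_time: str
    issues: str
    cause: str
    services_affected: str
    status: str

    def render(self) -> str:
        """Render the notice using the field labels from the sharing template."""
        fields = [
            ("Outage level", self.level),
            ("Outage product", self.product),
            ("Detection time", self.detection_time),
            ("Issues", self.issues),
            ("Cause", self.cause),
            ("Services affected", self.services_affected),
            ("Status", self.status),
        ]
        lines = ["[Outage notice]"]
        lines += [f"• {label}: {value}" for label, value in fields]
        return "\n".join(lines)
```

Because every field is a required constructor argument, a notice cannot be rendered with a line silently left out.
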
  9. Outage Handling Process: REPAIR

    - Focus on reducing the outage's duration and impact
    - The Dev Lead is responsible for controlling the repair and updating the status
  10. Outage Handling Process: WRITE REPORT

    - Reporting has a deadline (within 1 working day)
    - The outage report is created from a standardized template
    - All corrective and preventive measures are registered as issue tickets
    - The report is distributed by email to related product members

    [Summary]
    - Product / Region
    - Level / Seriousness
    - Coverage (%) / Reliability (%)
    - Occurred time / Detected time / Resolved time
    - Brief description
    [Detail]
    - Services affected by the outage
    - Cause / Timeline and resolution
    [Corrective and preventive measures]
    - Preventing future outages
    - Improving outage detection
    - Improving outage handling
  11. Outage Handling Process: RETROSPECTIVE

    - Hold the retrospective meeting within 5 working days
    - Invite all Platform Server members
      - Mandatory: members of teams and services directly affected by the outage
      - Not mandatory: all other Platform Server members
    - Explain the outage report and get feedback from various viewpoints
    - Update and finalize the report after the retrospective
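
The two deadlines in the process (report within 1 working day, retrospective within 5 working days) can be computed with a small helper. This sketch assumes that working days are Monday to Friday minus an optional holiday set, and that counting starts on the day after the outage; neither detail is stated in the talk, and `add_working_days` is a hypothetical name.

```python
from datetime import date, timedelta

REPORT_DEADLINE_DAYS = 1         # from the WRITE REPORT slide
RETROSPECTIVE_DEADLINE_DAYS = 5  # from the RETROSPECTIVE slide


def add_working_days(start: date, days: int, holidays=frozenset()) -> date:
    """Return the date `days` working days after `start`.

    Assumption: Saturdays, Sundays, and dates in `holidays` do not count.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


# Example with an assumed outage date of Wednesday, 4 August 2021.
outage_day = date(2021, 8, 4)
report_due = add_working_days(outage_day, REPORT_DEADLINE_DAYS)
retrospective_due = add_working_days(outage_day, RETROSPECTIVE_DEADLINE_DAYS)
```

Passing a holiday set pushes the deadlines out accordingly, which matters for teams whose local calendars differ.
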
  12. Prevention from an Early Stage: Review & Test

    - Design review
      - The developer writes a tech spec document describing intention and design
      - The document is reviewed by a peer developer
    - Code review
      - All code changes require peer review
      - Code cannot be released without approval from peer developers
    - Testing
      - These days it is hard to get a change merged without unit tests
      - Many teams operate end-to-end tests (an outcome of preventive measures)
  13. On-Call Duty: Reacting to Monitoring System Alarms

    - LINE platform servers have many monitoring conditions (an outcome of outage corrective measures)
    - Some developers have to respond to the alarms
    - On-call duty is a rotating responsibility among developers
    - We prevent complex outages through early detection and reaction
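
A rotating on-call responsibility can be modeled as a fixed roster that advances one position per shift. The sketch below assumes week-long shifts and a roster that never changes; the talk gives neither detail, and `on_call_for` is a hypothetical helper.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], rotation_start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day`, assuming fixed-length shifts in roster order."""
    if not roster:
        raise ValueError("roster must not be empty")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    elapsed = (day - rotation_start).days
    if elapsed < 0:
        raise ValueError("day is before the rotation started")
    # Integer division finds the shift index; modulo wraps around the roster.
    return roster[(elapsed // shift_days) % len(roster)]
```

A real schedule also has to handle swaps, vacations, and handover times, which this sketch leaves out.
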
  14. Managing Outage-Related Action Items: OKR of LINE Platform Server

    - All outage action items are registered as issue tickets
    - There is an OKR for resolving outage tickets
  15. Outage Action Items Status, Jan to Sep 2021

    [Chart: shares of 83% and 17%; legend: "Completed, In Progress" and "To do"; total: 242]
  16. Outage Summary

    - Product: LINE DPP
    - Level: 4
    - Seriousness: Medium
    - Region: All
    - Coverage: 0.01%
    - Reliability: 99%
    - Occurred at: 4th Aug 17:50:35
    - Detected at: 4th Aug 17:53:00
    - Resolved at: 4th Aug 18:32:09

    Brief description: Due to a proxy configuration error, request control did not work properly during the Channel Gateway restart. As a result, user requests were delivered as normal before the server had finished starting, and GC occurred frequently during server start-up because slow start was not working…
  17. Services Affected by the Outage

    Overview
    - Affected users: 16,818
    - Affected APIs: 92 kinds, 139,860 requests
    - Affected channels: 338

    <error counts>
  18. Timeline and Resolution

    Each entry lists time, name, detail, and remarks:
    - 8/3 13:24: Updated configuration. Detail: <source link of script>. Remarks: some field data was missing on private servers.
    - 8/4 17:50: Restart servers. Remarks: Outage started.
    - 8/4 17:53: Received error notification. Detail: [ERROR] (rejected_requests[RedisClientCircuitBreaker] : count_redisCircuitBreaker > 0) 14.0 > 0. Remarks: Outage detected.
    - 8/4 18:01: Start to rollback. Remarks: Start handling.
    - 8/4 18:32: Rollback finished. Remarks: Outage recovered.
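
The timeline gives enough information to work out how long each phase of the outage took. A small sketch using the minute-level times from the timeline; the year 2021 is an assumption, and the phase names are hypothetical labels rather than terms from the talk.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Times from the timeline; the year is assumed to be 2021.
EVENTS = {
    "started": "2021-08-04 17:50",           # servers restarted, outage started
    "detected": "2021-08-04 17:53",          # error notification received
    "handling_started": "2021-08-04 18:01",  # rollback started
    "recovered": "2021-08-04 18:32",         # rollback finished
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def phase_durations(events: dict) -> dict:
    """Break the outage into phases, each measured in minutes."""
    return {
        "time_to_detect": minutes_between(events["started"], events["detected"]),
        "time_to_respond": minutes_between(events["detected"], events["handling_started"]),
        "time_to_repair": minutes_between(events["handling_started"], events["recovered"]),
        "total_outage": minutes_between(events["started"], events["recovered"]),
    }
```

With these inputs the rollback itself (31 minutes) accounts for most of the 42-minute outage, which fits the corrective measure of improving the rollback steps in the release guide.
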
  19. Corrective and Preventive Measures

    - Preventing future outages
      • Active health check from the proxy server to backends
    - Improving outage detection
      • Adding health metrics
    - Improving outage handling
      • Improve the rollback and service-out steps in the release guide

    Rule
      • Every action item should be executable
      • Every action item should be registered as a ticket in the issue tracking system
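
An active health check means the proxy probes each backend itself and routes traffic only to backends that pass. The following is a generic sketch of that idea, not the configuration LINE used: the `ActiveHealthChecker` class, its thresholds, and the injected `probe` callable are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List


class ActiveHealthChecker:
    """Tracks backend health from periodic probes run by the proxy side."""

    def __init__(self, backends: Iterable[str], probe: Callable[[str], bool],
                 unhealthy_after: int = 3, healthy_after: int = 2) -> None:
        self._probe = probe
        self._unhealthy_after = unhealthy_after
        self._healthy_after = healthy_after
        # Assumption: a backend receives no traffic until it has passed probes,
        # so a server that is still starting up is not sent requests.
        self._healthy: Dict[str, bool] = {b: False for b in backends}
        self._successes: Dict[str, int] = {b: 0 for b in self._healthy}
        self._failures: Dict[str, int] = {b: 0 for b in self._healthy}

    def run_once(self) -> None:
        """Probe every backend once and update its health state."""
        for backend in self._healthy:
            try:
                ok = self._probe(backend)
            except Exception:
                ok = False  # a probe that raises counts as a failed check
            if ok:
                self._successes[backend] += 1
                self._failures[backend] = 0
                if self._successes[backend] >= self._healthy_after:
                    self._healthy[backend] = True
            else:
                self._failures[backend] += 1
                self._successes[backend] = 0
                if self._failures[backend] >= self._unhealthy_after:
                    self._healthy[backend] = False

    def healthy_backends(self) -> List[str]:
        """Backends that may currently receive traffic."""
        return [b for b, ok in self._healthy.items() if ok]
```

Requiring several consecutive results before changing state avoids flapping on a single slow or lucky probe. In a real proxy, `run_once` would be called on a timer and `probe` would issue an HTTP or TCP check.
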
  20. Lessons Learned over 10 Years

    - The process is fragile. It needs to be kept up to date, and people need to stay aware of it
    - Some things are not solved autonomously
    - Culture over process
  21. Summary: Outage Handling Process and Culture of the LINE Platform Server

    - Platform reliability
    - Learn from outages
    - Outage handling process
    - Developer culture