How LINE operates tens of thousands of physical machines

Kodai Matsumoto
LINE System Development Team Infrastructure Engineer
https://linedevday.linecorp.com/2020/ja/sessions/6330
https://linedevday.linecorp.com/2020/en/sessions/6330

LINE DevDay 2020

November 25, 2020

Transcript

  1. None
  2. Speaker › Kodai Matsumoto › LINE Corporation › System Development

    Team › Joined LINE in 2019 as New Grad
  3. Agenda › LINE Infrastructure › LINE Server Operation › New

    Year / COVID-19 › Next Step
  4. LINE Infrastructure Private Cloud On-Premises Infrastructure …

  5. Scale of LINE Infrastructure › Physical Servers 50,000+ › Virtual Servers 67,000+

    › Peak User Traffic 3 Tbps+ › Daily Exchanged Messages 4.9 billion+
  6. Number of Physical Servers

    [Chart: number of physical servers per year, 2014 to 2020; y-axis 0 to 55,000]
  7. LINE Server Operation How is LINE operating tens of thousands

    of physical machines
  8. Roles Hardware OS Application LINE Developer

  9. Roles Hardware OS Application System Engineer IDC Operator

  10. System Engineer Purchase Operation Setup Performance Test Capacity Management Hardware

    Configuration OS Provisioning Monitoring Troubleshooting
  11. Hardware Monitoring Disk Memory PSU Ping BGP-Peer Status ... IDC

    Operator Troubleshooting Hardware Alert Server ... LINE Developer Inquiry System Engineer
  12. Scale of Monitoring › Monitoring Targets 117,000+ › Monthly Hardware Incidents 500+

    › Daily syslog lines 2 billion+
  13. Server Operation v0 Mail + Excel based manual operation System

    Engineer Server IDC Operator Developer A Operation Document Developer B Developer C Hostname prefix Chat Tool Alert troubleshooting Mail + Excel ...
  14. Server Operation v0 Mail + Excel based manual operation System

    Engineer Server IDC Operator Developer A Operation Document Developer B Developer C Alert Mail + Excel ...
  15. Server Operation v0 Mail + Excel based manual operation System

    Engineer Server IDC Operator Developer A Operation Document Developer B Developer C Hostname prefix Alert Mail + Excel ...
  16. Server Operation v0 Mail + Excel based manual operation System

    Engineer Server IDC Operator Developer A Operation Document Developer B Developer C Hostname prefix Chat Tool Alert troubleshooting Mail + Excel ...
  17. Server Operation v0 Mail + Excel based manual operation System

    Engineer IDC Operator Developer A Developer B Developer C Chat Tool Mail + Excel Too Much Communication Server Operation Document Hostname prefix Alert troubleshooting ...
  18. Server Operation v0 Mail + Excel based manual operation System

    Engineer Server IDC Operator Developer A Operation Document Developer B Developer C Hostname prefix Chat Tool Alert troubleshooting Mail + Excel Not User-friendly ...
  19. Server Operation v1 Ticket-based Operation, IOC Server IOC Alert Mail

    Troubleshooting LINE Developer System Engineer IDC Operator Ticket-based Hostname Operation Manual ... Watch
  20. Server Operation v1 Ticket-based Operation › Implemented a ticket system for hardware incidents

    › Summarizes information in one place
  21. Server Operation v1 Ticket-based Operation LINE Developer Queue IDC Operator

    Queue System Engineer Queue
  22. Server Operation v1 Ticket-based Operation LINE Developer Queue IDC Operator

    Queue System Engineer Queue LINE Developer IDC Operator System Engineer Watch Watch Watch
  23. Server Operation v1 Infra Operation Center - IOC Group A

    Server 1 Server 3 Server 5 Server 2 Server 4 Server 6 Operation Manuals Operation Manuals Operation Manuals ... ... Group B Server 7 Server 9 Server 11 Server 8 Server 10 Server 12 Operation Manuals Operation Manuals Operation Manuals ... › LINE Unique System › Server Grouping › Centralized management of operation manuals
  24. Server Operation v1 Infra Operation Center - IOC IDC Operator

    Hostname = Server 1 Failure Component = Disk Group A Server 1 Server 3 Server 5 Server 2 Server 4 Server 6 Operation Manuals Operation Manuals Operation Manuals ... ... Group B Server 7 Server 9 Server 11 Server 8 Server 10 Server 12 Operation Manuals Operation Manuals Operation Manuals ... IOC
  25. Server Operation v1 Infra Operation Center - IOC IDC Operator

    Hostname = Server 1 Failure Component = Disk Group A Server 1 Server 3 Server 5 Server 2 Server 4 Server 6 Operation Manuals Operation Manuals Operation Manuals ... ... Group B Server 7 Server 9 Server 11 Server 8 Server 10 Server 12 Operation Manuals Operation Manuals Operation Manuals ... IOC Auto Check Now Time = 12:00 Server Type = Hypervisor …
  26. Server Operation v1 Infra Operation Center - IOC IDC Operator

    Hostname = Server 1 Failure Component = Disk Group A Server 1 Server 3 Server 5 Server 2 Server 4 Server 6 Operation Manuals Operation Manuals Operation Manuals ... ... Group B Server 7 Server 9 Server 11 Server 8 Server 10 Server 12 Operation Manuals Operation Manuals Operation Manuals ... IOC Operation Manual Auto Check Now Time = 12:00 Server Type = Hypervisor Other Info…
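The IOC lookup shown on slides 23 to 26 can be sketched roughly as follows. This is a minimal illustration, not LINE's implementation: it assumes servers are assigned to groups, that each group keeps its operation manuals keyed by failure component, and that the "Auto Check" fields (current time, server type) are attached to the result. The names `ServerGroup`, `ManualLookup`, and `find_manual`, and the manual texts, are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Set


@dataclass
class ServerGroup:
    """A hypothetical IOC server group with its own operation manuals."""
    name: str
    hostnames: Set[str]
    # Failure component (e.g. "Disk") -> operation manual text or link.
    manuals: Dict[str, str] = field(default_factory=dict)


@dataclass
class ManualLookup:
    group: str
    manual: Optional[str]  # None when the group has no manual for the component
    now_time: str          # auto-checked context shown next to the manual
    server_type: str


def find_manual(groups: List[ServerGroup], server_types: Dict[str, str],
                hostname: str, component: str, now: datetime) -> ManualLookup:
    """Resolve hostname -> group -> manual, and attach auto-checked context."""
    for group in groups:
        if hostname in group.hostnames:
            return ManualLookup(
                group=group.name,
                manual=group.manuals.get(component),
                now_time=now.strftime("%H:%M"),
                server_type=server_types.get(hostname, "unknown"),
            )
    raise KeyError(f"{hostname} is not registered in any server group")


# Illustrative data only; hostnames mirror the slide labels.
groups = [
    ServerGroup("Group A", {"Server 1", "Server 2"},
                {"Disk": "Group A disk replacement manual"}),
    ServerGroup("Group B", {"Server 7", "Server 8"},
                {"Disk": "Group B disk replacement manual"}),
]
result = find_manual(groups, {"Server 1": "Hypervisor"}, "Server 1", "Disk",
                     datetime(2020, 11, 25, 12, 0))
print(result.group, "|", result.manual, "|", result.now_time, "|", result.server_type)
```

Keying manuals by group rather than by host is what lets the IDC Operator enter only a hostname and a failed component and still get the right procedure.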
  27. Server Operation v1 Ticket-based operation, IOC Server IOC Alert Mail

    Troubleshooting LINE Developer System Engineer IDC Operator Ticket-based Hostname Operation Manual ... Manually create incident Manually close incident Check
  28. Server Operation v2 Incident-Initiator, Incident-Closure Server Alert LINE Developer System

    Engineer IDC Operator Ticket-based ... Incident-Initiator Create incident Incident-Closure Check Close incident
  29. Server Operation v2 Alert Incident-Initiator Incident Incident-Initiator › Alert Type

    › Disk Error › Memory Error › Etc… › Server Information › Hostname › Region › Etc… ... Create
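As a rough sketch of what Incident-Initiator does on slide 29, the snippet below turns an alert (alert type plus server information) into an incident record. The field names, the title format, and the in-memory ID counter are assumptions made for illustration; the slides do not describe the actual ticket schema.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Alert:
    alert_type: str  # e.g. "Disk Error", "Memory Error"
    hostname: str    # server information carried by the alert
    region: str


@dataclass
class Incident:
    incident_id: int
    title: str
    alert_type: str
    hostname: str
    region: str
    status: str = "open"


# Hypothetical ID source; a real ticket system would assign its own IDs.
_incident_ids = count(1)


def create_incident(alert: Alert) -> Incident:
    """Build an incident from an alert so nobody has to file it by hand."""
    return Incident(
        incident_id=next(_incident_ids),
        title=f"[{alert.region}] {alert.alert_type} on {alert.hostname}",
        alert_type=alert.alert_type,
        hostname=alert.hostname,
        region=alert.region,
    )


incident = create_incident(Alert("Disk Error", "server-1", "JP"))
print(incident.title, incident.status)
```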
  30. Server Operation v2 Hardware Status Incident-Closure Incident › Change Status

    › Comment on Server status Incident-Closure Close
  31. Server Operation v2 Incident-Closure Server Failure Incident-Closure Close Incident Troubleshooting

    Hardware Status Escalate to Engineer IDC Operator System Engineer
  32. Server Operation v2 Incident-Closure › Server status › CPU ›

    Memory › Disk › Fan › PSU › etc...
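The Incident-Closure decision on slides 30 to 32 can be sketched as below: check the hardware status of each component, close the incident with a comment when everything is healthy, and escalate to an engineer otherwise. The status value "ok", the incident fields, and the choice to treat a missing reading as unhealthy are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Components listed on the slide; the real check may cover more ("etc...").
COMPONENTS = ("CPU", "Memory", "Disk", "Fan", "PSU")


@dataclass
class Incident:
    hostname: str
    status: str = "open"
    comments: List[str] = field(default_factory=list)


def close_or_escalate(incident: Incident, hardware_status: Dict[str, str]) -> Incident:
    """Close the incident if every component reports "ok", else escalate."""
    # A component with no reading is treated as unhealthy (conservative assumption).
    failing = [c for c in COMPONENTS if hardware_status.get(c) != "ok"]
    if failing:
        incident.status = "escalated"
        incident.comments.append("Escalated to engineer, not healthy: " + ", ".join(failing))
    else:
        incident.status = "closed"
        incident.comments.append("All checked components healthy: " + ", ".join(COMPONENTS))
    return incident


healthy = {c: "ok" for c in COMPONENTS}
closed = close_or_escalate(Incident("server-1"), healthy)
escalated = close_or_escalate(Incident("server-2"), {**healthy, "Disk": "failed"})
print(closed.status, "|", escalated.status, "|", escalated.comments[0])
```

Leaving the escalation path in place keeps engineers in the loop for servers that still report a fault after troubleshooting.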
  33. LINE Developer & System Engineer System Engineer Current Operation Lifecycle

    Server Failure Alert Create Incident Operation Manual Troubleshooting Close Incident Service-In Server Failure Alert Communication Troubleshooting Manual Server Check Service-In Incident-Initiator Incident-Closure IOC
  34. Incident Automation Milestone

    [Chart: monthly incident counts, 19/07 to 20/08; y-axis 0 to 600; series: by Incident-Closure, by Engineer]
  35. New Year / COVID-19

  36. Server Utilization Rate › Required computing resources are not constant

    › They depend on the situation › New Year “Akeome” (“Happy New Year”) › COVID-19
  37. System Engineer Purchase Operation Setup Performance Test Capacity Management Hardware

    Configuration OS Provisioning Monitoring Troubleshooting
  38. New Year “Akeome” (“Happy New Year”) Network Traffic (JST)

    [Chart: network traffic from 23:30 to 2:30 JST; New Year vs. other day; Japan, Taiwan, Thailand]
  39. New Year “Akeome” (“Happy New Year”) › 2,800+ physical server expansions in the

    past › Preparation starts a few months in advance
  40. COVID-19 GroupCall (VoIP) Network Traffic

    [Chart: GroupCall (VoIP) network traffic, 2/24 to 4/27]
  41. COVID-19 GroupCall (VoIP) Network Traffic

    [Chart: GroupCall (VoIP) network traffic, 2/24 to 4/27] › Google Trends “LINE 飲み会” (“LINE drinking party”) › Weekend Peak Traffic +750% › 650+ physical server expansions
  42. Supply Chain Disruption › Parts Vendors, Server Vendor, LINE

    › Delay › Flight reductions
  43. Next Step

  44. Next Step Server Operation v2 Now Zero Touch Provisioning

  45. ZTP Overview Server ZTP Server OS Installer CMDB Verda PM

    Service ... IDC Operator Rack-Mount Cabling … Auto Registration
  46. Thank you