How LINE operates tens of thousands of physical machines

Kodai Matsumoto
LINE System Development Team Infrastructure Engineer
https://linedevday.linecorp.com/2020/ja/sessions/6330
https://linedevday.linecorp.com/2020/en/sessions/6330

LINE DevDay 2020

November 25, 2020
Transcript

  1. Scale of LINE Infrastructure

    Physical Servers: 50,000+; Virtual Servers: 67,000+; Peak User Traffic: 3 Tbps+; Daily Exchanged Messages: 4.9 billion+
  2. Number of Physical Servers

    [Chart: number of physical servers by year, 2014 to 2020; vertical axis from 0 to 55,000]
  3. System Engineer

    Diagram labels: Purchase, Operation, Setup, Performance Test, Capacity Management, Hardware Configuration, OS Provisioning, Monitoring, Troubleshooting
  4. Hardware Monitoring

    Monitored items: Disk, Memory, PSU, Ping, BGP-Peer Status, ...
    Diagram labels: IDC Operator, Troubleshooting, Hardware Alert, Server, ..., LINE Developer, Inquiry, System Engineer
  5. Server Operation v0: Mail + Excel based manual operation

    Diagram labels: System Engineer, Server, IDC Operator, Developer A, Operation Document, Developer B, Developer C, Hostname prefix, Chat Tool, Alert, troubleshooting, Mail + Excel, ...
  6. Server Operation v0: Mail + Excel based manual operation

    Diagram labels: System Engineer, Server, IDC Operator, Developer A, Operation Document, Developer B, Developer C, Alert, Mail + Excel, ...
  7. Server Operation v0: Mail + Excel based manual operation

    Diagram labels: System Engineer, Server, IDC Operator, Developer A, Operation Document, Developer B, Developer C, Hostname prefix, Alert, Mail + Excel, ...
  8. Server Operation v0: Mail + Excel based manual operation

    Diagram labels: System Engineer, Server, IDC Operator, Developer A, Operation Document, Developer B, Developer C, Hostname prefix, Chat Tool, Alert, troubleshooting, Mail + Excel, ...
  9. Server Operation v0: Mail + Excel based manual operation

    Diagram labels: System Engineer, IDC Operator, Developer A, Developer B, Developer C, Chat Tool, Mail + Excel, Server, Operation Document, Hostname prefix, Alert, troubleshooting, ...
    Problem highlighted: Too Much Communication
  10. Server Operation v0: Mail + Excel based manual operation

    Diagram labels: System Engineer, Server, IDC Operator, Developer A, Operation Document, Developer B, Developer C, Hostname prefix, Chat Tool, Alert, troubleshooting, Mail + Excel, ...
    Problem highlighted: Not User-friendly
  11. Server Operation v1: Ticket-based Operation, IOC

    Diagram labels: Server, IOC, Alert Mail, Troubleshooting, LINE Developer, System Engineer, IDC Operator, Ticket-based, Hostname, Operation Manual, ..., Watch
  12. Server Operation v1: Ticket-based Operation

    Queues: LINE Developer Queue, IDC Operator Queue, System Engineer Queue
    Diagram labels: LINE Developer, IDC Operator, System Engineer; Watch (three times)
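The per-role queues on slide 12 could be modelled roughly as follows. This is only a minimal sketch: the `TicketBoard` class, its method names, and the first-in-first-out ordering are assumptions made for illustration, not details taken from the talk.

```python
from collections import deque

# The three roles that each have a queue on the slide.
ROLES = ("LINE Developer", "IDC Operator", "System Engineer")


class TicketBoard:
    """Hypothetical ticket board with one FIFO queue per role."""

    def __init__(self):
        self.queues = {role: deque() for role in ROLES}

    def submit(self, role, ticket):
        """Append a ticket to the queue watched by the given role."""
        if role not in self.queues:
            raise ValueError(f"unknown role: {role}")
        self.queues[role].append(ticket)

    def next_ticket(self, role):
        """Return the oldest ticket for a role, or None if its queue is empty."""
        queue = self.queues[role]
        return queue.popleft() if queue else None


board = TicketBoard()
board.submit("IDC Operator", "Replace disk on Server 1")
board.submit("IDC Operator", "Replace PSU on Server 7")
first = board.next_ticket("IDC Operator")
```

Each role only pulls from its own queue, which is one way to read the three "Watch" arrows in the diagram.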
  13. Server Operation v1: Infra Operation Center (IOC)

    Group A: Server 1, Server 3, Server 5, Server 2, Server 4, Server 6; Operation Manuals, ...
    Group B: Server 7, Server 9, Server 11, Server 8, Server 10, Server 12; Operation Manuals, ...
    › LINE Unique System
    › Server Grouping
    › Centralized management of operation manuals
  14. Server Operation v1: Infra Operation Center (IOC)

    IDC Operator input to IOC: Hostname = Server 1, Failure Component = Disk
    Group A: Server 1, Server 3, Server 5, Server 2, Server 4, Server 6; Operation Manuals, ...
    Group B: Server 7, Server 9, Server 11, Server 8, Server 10, Server 12; Operation Manuals, ...
  15. Server Operation v1: Infra Operation Center (IOC)

    IDC Operator input to IOC: Hostname = Server 1, Failure Component = Disk
    Group A: Server 1, Server 3, Server 5, Server 2, Server 4, Server 6; Operation Manuals, ...
    Group B: Server 7, Server 9, Server 11, Server 8, Server 10, Server 12; Operation Manuals, ...
    Auto Check: Now Time = 12:00, Server Type = Hypervisor, …
  16. Server Operation v1: Infra Operation Center (IOC)

    IDC Operator input to IOC: Hostname = Server 1, Failure Component = Disk
    Group A: Server 1, Server 3, Server 5, Server 2, Server 4, Server 6; Operation Manuals, ...
    Group B: Server 7, Server 9, Server 11, Server 8, Server 10, Server 12; Operation Manuals, ...
    IOC output: Operation Manual
    Auto Check: Now Time = 12:00, Server Type = Hypervisor, Other Info…
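The IOC lookup shown on slides 14 to 16 (a hostname and failure component go in, an operation manual and auto-check context come out) might be sketched as below. The function name, the manual titles, and the shape of the returned dictionary are hypothetical; only the server grouping and the example values (Server 1, Disk, 12:00, Hypervisor) come from the slides.

```python
from datetime import time

# Server grouping as drawn on the slides: Group A holds Servers 1-6, Group B holds Servers 7-12.
SERVER_GROUPS = {f"Server {n}": ("Group A" if n <= 6 else "Group B") for n in range(1, 13)}

# Hypothetical manual titles, keyed by (group, failure component).
OPERATION_MANUALS = {
    ("Group A", "Disk"): "Group A: disk failure manual",
    ("Group B", "Disk"): "Group B: disk failure manual",
}


def lookup_operation(hostname: str, failure_component: str, now: time, server_type: str) -> dict:
    """Resolve a host and failed component to its group's manual, plus auto-check context."""
    group = SERVER_GROUPS.get(hostname)
    if group is None:
        raise KeyError(f"unknown hostname: {hostname}")
    manual = OPERATION_MANUALS.get((group, failure_component))
    if manual is None:
        raise KeyError(f"no manual for {group} / {failure_component}")
    return {
        "hostname": hostname,
        "group": group,
        "operation_manual": manual,
        # Auto-check fields shown on the slide; how they are gathered is not described.
        "now_time": now.strftime("%H:%M"),
        "server_type": server_type,
    }


result = lookup_operation("Server 1", "Disk", time(12, 0), "Hypervisor")
print(result["group"], "->", result["operation_manual"])
```

Keying manuals by group rather than by individual server reflects the "Server Grouping" and "Centralized management of operation manuals" points from slide 13.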
  17. Server Operation v1: Ticket-based operation, IOC

    Diagram labels: Server, IOC, Alert Mail, Troubleshooting, LINE Developer, System Engineer, IDC Operator, Ticket-based, Hostname, Operation Manual, ...
    Manual steps: Manually create incident, Manually close incident, Check
  18. Server Operation v2: Incident-Initiator, Incident-Closure

    Diagram labels: Server, Alert, LINE Developer, System Engineer, IDC Operator, Ticket-based, ...
    Incident-Initiator: Create incident
    Incident-Closure: Check, Close incident
  19. Server Operation v2: Incident-Initiator

    Diagram labels: Alert, Incident-Initiator, Create, Incident, ...
    › Alert Type
      › Disk Error
      › Memory Error
      › Etc…
    › Server Information
      › Hostname
      › Region
      › Etc…
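As an illustration of slide 19, the sketch below turns an alert carrying an alert type and server information into an incident. All class and field names are hypothetical, and the de-duplication of repeated alerts for the same host and alert type is an added assumption that the slides do not mention.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Alert:
    alert_type: str  # e.g. "Disk Error", "Memory Error"
    hostname: str    # server information from the slide
    region: str      # server information from the slide


@dataclass
class Incident:
    incident_id: int
    alert_type: str
    hostname: str
    region: str
    status: str = "open"
    comments: list = field(default_factory=list)


class IncidentInitiator:
    """Hypothetical component that creates one incident per (hostname, alert type)."""

    def __init__(self):
        self._ids = count(1)
        self.open_incidents = {}

    def handle(self, alert: Alert) -> Incident:
        key = (alert.hostname, alert.alert_type)
        existing = self.open_incidents.get(key)
        if existing is not None:
            # Assumption: a repeated alert reuses the incident that is already open.
            return existing
        incident = Incident(next(self._ids), alert.alert_type, alert.hostname, alert.region)
        self.open_incidents[key] = incident
        return incident


initiator = IncidentInitiator()
first = initiator.handle(Alert("Disk Error", "Server 1", "JP"))
repeat = initiator.handle(Alert("Disk Error", "Server 1", "JP"))
other = initiator.handle(Alert("Memory Error", "Server 1", "JP"))
```

The region value "JP" is a placeholder; the slides list Region as a field without giving example values.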
  20. Server Operation v2: Incident-Closure

    Diagram labels: Hardware Status, Incident-Closure, Close, Incident
    › Change Status
    › Comment on Server status
  21. Server Operation v2: Incident-Closure

    Diagram labels: Server Failure, Incident-Closure, Close Incident, Troubleshooting, Hardware Status, Escalate to Engineer, IDC Operator, System Engineer
  22. Server Operation v2: Incident-Closure

    › Server status
      › CPU
      › Memory
      › Disk
      › Fan
      › PSU
      › etc...
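The close-or-escalate decision on slides 20 to 22 could look something like the following sketch. The names are hypothetical, and treating a component with no reported status as unhealthy is an assumption chosen for safety rather than something stated in the talk.

```python
from dataclasses import dataclass, field

# Components listed on slide 22; the slide's "etc..." is left out.
COMPONENTS = ("CPU", "Memory", "Disk", "Fan", "PSU")


@dataclass
class Incident:
    hostname: str
    status: str = "open"
    comments: list = field(default_factory=list)


def close_or_escalate(incident: Incident, hardware_status: dict) -> str:
    """Close the incident if every component is healthy, otherwise escalate it.

    hardware_status maps a component name to True when it is healthy.
    Assumption: a component missing from the mapping counts as unhealthy.
    """
    failed = [c for c in COMPONENTS if not hardware_status.get(c, False)]
    if failed:
        incident.status = "escalated"
        incident.comments.append("Still failing: " + ", ".join(failed))
    else:
        incident.status = "closed"
        incident.comments.append("All checked components healthy")
    return incident.status


healthy = Incident("Server 1")
healthy_result = close_or_escalate(healthy, {c: True for c in COMPONENTS})

broken = Incident("Server 2")
broken_result = close_or_escalate(broken, {"CPU": True, "Memory": True, "Disk": False, "Fan": True, "PSU": True})
```

Both branches change the status and leave a comment on the server status, matching the two bullets on slide 20; the escalated branch corresponds to "Escalate to Engineer" on slide 21.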
  23. Operation Lifecycle

    Diagram labels: LINE Developer & System Engineer, System Engineer, Current
    Sequence: Server Failure, Alert, Create Incident, Operation Manual, Troubleshooting, Close Incident, Service-In
    Sequence: Server Failure, Alert, Communication, Troubleshooting, Manual Server Check, Service-In
    Components: Incident-Initiator, Incident-Closure, IOC
  24. Incident Automation Milestone

    [Chart: monthly values from 19/07 to 20/08; vertical axis from 0 to 600; series: by Incident-Closure, by Engineer]
  25. Server Utilization Rate

    › Required computing resources are not always constant
    › They depend on the situation
      › New Year “あけおめ” (a casual "Happy New Year" greeting)
      › COVID-19
  26. System Engineer

    Diagram labels: Purchase, Operation, Setup, Performance Test, Capacity Management, Hardware Configuration, OS Provisioning, Monitoring, Troubleshooting
  27. New Year “あけおめ” ("Happy New Year") Network Traffic (JST)

    [Chart: network traffic from 23:30 to 2:30 JST; series: New Year, Other Day; regions: Japan, Taiwan, Thailand]
  28. COVID-19 GroupCall (VoIP) Network Traffic

    [Chart: GroupCall (VoIP) network traffic, 2/24 to 4/27]
  29. COVID-19 GroupCall (VoIP) Network Traffic

    [Chart: GroupCall (VoIP) network traffic, 2/24 to 4/27]
    › Google Trends “LINE 飲み会” ("LINE drinking party")
    › Weekend Peak Traffic +750%
    › 650+ physical server expansions
  30. ZTP Overview Server ZTP Server OS Installer CMDB Verda PM

    Service ... IDC Operator Rack-Mount Cabling … Auto Registration