2020 Open Infrastructure Summit

Speakers: Motomu Utsumi, Reedip Banerjee


LINE Developers

October 20, 2020




    On A Large Scale
  2. About Us
    • Motomu Utsumi: Motomu works on the operation and development of the LINE private cloud. He enjoys writing code to solve the problems the team faces in operations.
    • Reedip Banerjee: Reedip has been working with the OpenStack community since Mitaka and is currently a Cloud Infrastructure Engineer at LINE Corporation. He is interested in networking concepts. LinkedIn: https://www.linkedin.com/in/reedip/
  3. Agenda
    1. Background
      1. Introduction to LINE’s Infra
      2. Outages faced in LINE
      3. What can be done to improve the situation
    2. Introduction to oslo.metrics
      1. What is oslo.metrics
      2. Architecture
    3. How we are using oslo.metrics
      1. Metrics Visualisation
      2. Troubleshooting
      3. Metrics trend monitoring
  4. Background
    ➤ Who are we
    ➤ Introduction to LINE Infrastructure
    ➤ Issues faced in LINE’s Large Scale Infrastructure
    ➤ What can be done to improve the situation
  5. Who Are We
    ➤ LINE Corp:
      ➤ Messenger
      ➤ Pay
      ➤ Games
      ➤ Music
      ➤ TV
      ➤ And many more services…
  6. Who Are We
    ➤ From Japan, to Taiwan, Indonesia and Korea
    ➤ As of Jun. 2020:
      ➤ Number of monthly active users of LINE app in 4 major countries: 166m
      ➤ Number of monthly active users of LINE app in Japan: 84m
    [Chart: number of monthly users in Japan (millions). Source: https://www.statista.com/statistics/560545/number-of-monthly-active-line-app-users-japan/]
  7. Introduction to LINE Infrastructure (1/3)

  8. Introduction to LINE Infrastructure (2/3)
    • Hypervisors: 2000+
    • Virtual machines: 50+ thousand
    • Baremetal (physical servers): 20000~
    Data as of 7/13/2020
  9. Introduction to LINE Infrastructure (3/3)
    • IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare metal, LB
    • PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch
    • FaaS: Function as a Service
  10. Outages Faced in LINE
    • Outages which motivated us to look inside:
      • RabbitMQ messages are lost
      • RabbitMQ messages are delayed in delivery
      • RPC server got an exception and stopped working
      • Time taken by the server to perform an RPC >> RPC timeout
      • RabbitMQ node went down due to high memory/CPU usage, a memory leak, or max socket connections (aka "Too many open files")
      • RabbitMQ split brain, unsynchronised queues
  11. What Can Be Done To Improve the Situation
    • Verify scalability issues
    • Identify bottlenecks and limits as scale increases
    • Tune parameters / modify the architecture
    • Improve OpenStack reliability
    • Monitor and track latency issues
    • Monitor the number of parallel connections and the time taken
    • Make troubleshooting investigations easier
  12. Introduction To oslo.metrics
    • What is oslo.metrics
    • Architecture

  13. What Is oslo.metrics
    • OSS library
    • Collects metrics exposed by the internal Oslo libraries (oslo.messaging, oslo.db)
    • Lets admins and operators monitor usage of the Oslo libraries
    • Information on the number of RPC calls / API calls
    • Change (delta) in the number of RPC calls and in RPC time as the number of API servers/clients increases or decreases
    • Similar to rpc_monitor’s implementation for oslo.messaging, but adaptable to other Oslo libraries as well
  14. Architecture (1/3)
    • Oslo libraries -> oslo.metrics -> Prometheus [1]

  15. Architecture (2/3)
    • Collect metrics from both the RPC layer and the AMQP driver layer.
  16. Architecture (3/3)
    • Should work in an isolated internal network (security).
    • Uses a UDP Unix socket to communicate.
    • All processes send data to the same Unix socket (e.g. nova-api and neutron-server on the controller).
    • Oslo libraries (e.g. oslo.messaging) expose the data in oslo.metrics format.
    • oslo.metrics exposes the data on an HTTP endpoint in Prometheus exposition format (but changeable to other standards).
    • The required data format is shared by the Oslo library with oslo.metrics.
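    The data path on this slide can be sketched roughly as follows. This is a minimal illustration and not the oslo.metrics implementation: the JSON message layout, the helper names (`encode_metric`, `decode_metric`, `render_exposition`, `roundtrip`), the socket path, and the metric name `rpc_server_invocation_total` are all assumptions made for the example. It sends one sample over a datagram Unix socket and renders what arrives as a Prometheus text exposition line.

```python
import json
import os
import socket
import tempfile


def encode_metric(name, labels, value):
    """Serialise one metric sample as a JSON datagram (hypothetical layout)."""
    return json.dumps({"name": name, "labels": labels, "value": value}).encode("utf-8")


def decode_metric(payload):
    """Inverse of encode_metric."""
    return json.loads(payload.decode("utf-8"))


def render_exposition(samples):
    """Render samples as Prometheus text exposition lines: name{k="v"} value.

    Label values are not escaped here, so this only suits simple values.
    """
    lines = []
    for sample in samples:
        labels = ",".join(
            '{}="{}"'.format(key, val) for key, val in sorted(sample["labels"].items())
        )
        lines.append("{}{{{}}} {}".format(sample["name"], labels, sample["value"]))
    return "\n".join(lines) + "\n"


def roundtrip(sample_bytes):
    """Send one datagram through a Unix datagram socket and return what arrived."""
    sock_dir = tempfile.mkdtemp()
    path = os.path.join(sock_dir, "metrics.sock")  # hypothetical socket path
    receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        receiver.bind(path)
        receiver.settimeout(2.0)
        sender.sendto(sample_bytes, path)
        return receiver.recv(65536)
    finally:
        sender.close()
        receiver.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(sock_dir)


if __name__ == "__main__":
    sent = encode_metric(
        "rpc_server_invocation_total",
        {"service": "nova", "method": "build_instance"},
        1,
    )
    received = decode_metric(roundtrip(sent))
    print(render_exposition([received]), end="")
```

    In a real deployment the receiving side would be a long-running process that aggregates samples and serves the rendered text over HTTP for Prometheus to scrape; the sketch collapses that into a single round trip.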
  17. How Are We Using oslo.metrics?
    • Metrics visualization
    • Troubleshoot
    • Metrics trend monitoring
  18. Metrics Visualization With Prometheus and Grafana

  19. Metrics Change: Instance Build
    [Charts: RPC server invocation count; RPC server average processing time (s)]
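    The two quantities charted here, invocation count and average processing time per RPC method, can be derived from a counter and a running total. The sketch below assumes a hypothetical `track_rpc` decorator and an in-process `_STATS` table; it is not how oslo.messaging or oslo.metrics record these values, only an illustration of the arithmetic.

```python
import time
from collections import defaultdict
from functools import wraps

# method name -> [invocation count, total processing seconds] (hypothetical store)
_STATS = defaultdict(lambda: [0, 0.0])


def track_rpc(func):
    """Count invocations of func and accumulate its processing time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even when the call raises, so failures are counted too.
            entry = _STATS[func.__name__]
            entry[0] += 1
            entry[1] += time.monotonic() - start
    return wrapper


def invocation_count(method):
    """Number of recorded invocations for a method name."""
    return _STATS.get(method, (0, 0.0))[0]


def average_processing_time(method):
    """Mean processing time in seconds, or 0.0 if the method was never called."""
    count, total = _STATS.get(method, (0, 0.0))
    return total / count if count else 0.0


@track_rpc
def build_instance(name):
    """Stand-in for an RPC server endpoint (hypothetical)."""
    return "built " + name
```

    Exporting the count and the total separately, rather than a precomputed average, lets the monitoring side compute averages over any time window.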
  20. Metrics Change: Scheduling Instances
    [Charts: average select_destination processing time (s) for 10 instances, 100 instances and 500 instances]
  21. Troubleshoot With oslo.metrics
    • Instance had duplicated volumes_attached entries

  22. Did We Get an Error?
    • nova reserve_block_device_name: Messaging Timeout
    [Chart: RPC client exception count]
  23. How Was the RPC Processing Time?
    • Increased gradually and exceeded the timeout threshold
    [Chart: reserve_block_device_name processing time (s)]
  24. How Many Times Did We Have This Issue? How Likely Will It Happen?
    • Processing time did not exceed the timeout threshold in the last 2 weeks
    [Chart: reserve_block_device_name processing time (s)]
  25. Metrics Trend Monitoring
    [Chart: RPC processing time per quarter, 2019 Q3 to 2020 Q4, for RPC X, RPC Y and RPC Z, plotted against the RPC timeout threshold. Annotation: likely to exceed timeout threshold next quarter]
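    One simple way to turn such a trend into a "likely to exceed the threshold" estimate is a least-squares straight-line fit over the per-quarter values. The slides do not say how the projection is made, so the linear model, the function names, and the sample numbers below are assumptions for illustration only.

```python
import math


def fit_line(values):
    """Least-squares slope and intercept for values observed at x = 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_threshold(values, threshold):
    """Whole periods after the last sample until the fitted line reaches threshold.

    Returns None when the fitted trend is flat or decreasing, and 0 when the
    fitted line has already reached the threshold at the last sample.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    return max(0, math.ceil(crossing_x - (len(values) - 1)))


if __name__ == "__main__":
    # Made-up quarterly processing times (s) against a made-up 60 s threshold.
    print(periods_until_threshold([10, 20, 30, 40], 60))
```

    A straight line is a crude model; it is used here only because it makes the projection easy to read. Growth tied to hypervisor count may well be non-linear.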
  26. What’s in Progress?
    • Integration with other Oslo libraries
    • Bottleneck analysis with large scale hypervisor emulation
  27. Other Oslo Libraries Integration
    • oslo.db
      • Query time
      • How many queries are executed?
    • oslo.concurrency
      • Time to acquire lock
      • How long is the lock held?
    • oslo.service
      • Periodic job processing time
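    As an illustration of the two lock metrics listed for oslo.concurrency (time to acquire, time held), the sketch below wraps a plain `threading.Lock`. It does not use or imitate the oslo.concurrency API; `LockStats` and `timed_lock` are hypothetical names introduced for this example.

```python
import threading
import time
from contextlib import contextmanager


class LockStats:
    """Accumulated lock metrics (hypothetical container)."""

    def __init__(self):
        self.acquisitions = 0
        self.wait_seconds = 0.0   # total time spent waiting to acquire
        self.hold_seconds = 0.0   # total time the lock was held


@contextmanager
def timed_lock(lock, stats):
    """Acquire lock, recording how long the wait took and how long it is held."""
    wait_start = time.monotonic()
    lock.acquire()
    acquired_at = time.monotonic()
    # Stats are updated while the lock is held, so updates do not race
    # as long as every user of `stats` goes through the same lock.
    stats.acquisitions += 1
    stats.wait_seconds += acquired_at - wait_start
    try:
        yield
    finally:
        stats.hold_seconds += time.monotonic() - acquired_at
        lock.release()


if __name__ == "__main__":
    lock = threading.Lock()
    stats = LockStats()
    with timed_lock(lock, stats):
        time.sleep(0.01)  # stand-in for the protected work
    print(stats.acquisitions, stats.wait_seconds, stats.hold_seconds)
```

    Keeping wait time and hold time separate helps distinguish contention (many waiters) from slow critical sections (long holds).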
  28. Hypervisor Emulation
    • Can our cluster handle 10000 hypervisors?
    [Chart: Our Hypervisor Scale, number of HV per year, 2017 to 2020]
  29. Hypervisor Emulation
    [Diagram: Testing tool ("Need 2000HV", "Need 5000HV", "Run Integration Test", "Run Benchmark Test"); Hypervisor Emulator ("Spawn"); Emulator Runtime (Kubernetes) running multiple Hypervisor Emulation Containers, each with neutron-fake-agent and nova-compute fake-driver; "Register emulated HV" to the OpenStack C-plane]
  30. Upstreaming
    • oslo.metrics
      • https://specs.openstack.org/openstack/oslo-specs/specs/ussuri/oslo-metrics.html
      • https://review.opendev.org/730753
    • Integration with oslo.messaging
      • https://review.opendev.org/730755
  31. Related Presentation
    • How we used RabbitMQ in wrong way at scale
      • https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/23983/how-we-used-rabbitmq-in-wrong-way-at-a-scale