DISCOVER OPENSTACK'S NERVE WITH OSLO METRICS

2020 Open Infrastructure Summit
https://www.youtube.com/watch?v=pNyjMRJViac
https://summit.openinfra.dev/

Speakers: Motomu Utsumi, Reedip Banerjee

LINE Developers

October 20, 2020

Transcript

  1. About Us

    Motomu Utsumi: Motomu works on the operation and development of LINE's private cloud. He enjoys writing code to solve the problems the team faces in operation.
    Reedip Banerjee: Reedip has been working with the OpenStack community since Mitaka and is currently a Cloud Infrastructure Engineer at LINE Corporation. He is interested in networking concepts. LinkedIn: https://www.linkedin.com/in/reedip/
  2. Agenda

    1. Background
       1. Introduction to LINE’s Infra
       2. Outages faced in LINE
       3. What can be done to improve the situation
    2. Introduction to oslo.metrics
       1. What is oslo.metrics
       2. Architecture
    3. How we are using oslo.metrics
       1. Metrics Visualisation
       2. Troubleshooting
       3. Metrics trend monitoring
  3. Background

    ➤ Who are we
    ➤ Introduction to LINE Infrastructure
    ➤ Issues faced in LINE’s Large Scale Infrastructure
    ➤ What can be done to improve the situation
  4. Who Are We

    ➤ LINE Corp:
      ➤ Messenger
      ➤ Pay
      ➤ Games
      ➤ Music
      ➤ TV
      ➤ And many more services
  5. Who Are We

    ➤ From Japan, to Taiwan, Indonesia and Korea
    Number of monthly active users of the LINE app in 4 major countries: 166m
    Number of monthly active users of the LINE app in Japan: 84m
    As of Jun. 2020. Source: https://www.statista.com/statistics/560545/number-of-monthly-active-line-app-users-japan/
    [Chart: number of monthly users in Japan (millions)]
  6. Introduction to LINE Infrastructure (2/3)

    Hypervisors: 2000+
    Virtual machines: 50+ thousand
    Bare metal (physical servers): 20000~
    Data as of 7/13/2020
  7. Introduction to LINE Infrastructure (3/3)

    IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare metal, LB
    PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch
    FaaS: Function as a Service
  8. Outages Faced in LINE

    • Outages which motivated us to look inside:
      • RabbitMQ messages were lost
      • RabbitMQ messages were delayed in delivery
      • The RPC server hit an exception and stopped working
      • Time taken by the server to perform an RPC >> RPC timeout
      • A RabbitMQ node went down due to high memory/CPU usage, a memory leak, or reaching the maximum number of socket connections (aka "Too many open files")
      • RabbitMQ split brain, unsynchronised queues
  9. What Can Be Done To Improve the Situation

    • Verify scalability issues
      • Identify bottlenecks and limits as scale increases
      • Tune parameters / modify the architecture
    • Improve OpenStack reliability
      • Monitor and track latency issues
      • Monitor the number of parallel connections and the time taken
    • Make troubleshooting investigations easier
  10. What Is oslo.metrics

    • OSS library
    • Collects metrics exposed by the internal Oslo libraries (oslo.messaging, oslo.db)
    • Lets admins and operators monitor usage of the Oslo libraries
    • Information on the number of RPC calls / API calls
    • Change (delta) in the number of RPC calls and in RPC time as the number of API servers/clients increases or decreases
    • Similar to rpc_monitor’s implementation for oslo.messaging, but adaptable to other Oslo libraries as well
  11. Architecture (3/3)

    • Should work in an isolated internal network (security)
    • Uses a UDP (datagram) Unix socket to communicate
    • All processes send data to the same Unix socket (e.g. nova-api and neutron-server on the controller)
    • Oslo libraries (e.g. oslo.messaging) expose the data in oslo.metrics format
    • oslo.metrics exposes the data on an HTTP endpoint in Prometheus exposition format (but changeable to other standards)
    • The required data format is shared by the Oslo library with oslo.metrics
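The socket-based flow on the Architecture slide can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the oslo.metrics implementation: the JSON payload layout, the metric and label names, and the helpers `send_metric`, `render_prometheus` and `demo` are all assumptions made for the example.

```python
import json
import os
import socket
import tempfile
from collections import defaultdict


def send_metric(sock_path, name, labels, value=1):
    """Send one counter increment as a JSON datagram (hypothetical payload layout)."""
    payload = json.dumps({"name": name, "labels": labels, "value": value})
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(payload.encode("utf-8"), sock_path)


def render_prometheus(counters):
    """Render aggregated counters as Prometheus text exposition lines."""
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


def demo():
    """Two senders share one Unix datagram socket; the receiver aggregates."""
    counters = defaultdict(int)
    with tempfile.TemporaryDirectory() as tmp_dir:
        sock_path = os.path.join(tmp_dir, "metrics.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as server:
            server.bind(sock_path)
            server.settimeout(2.0)
            # In a real deployment these would be separate processes
            # (e.g. nova-api and neutron-server) writing to the same socket.
            nova = {"service": "nova-api", "method": "reserve_block_device_name"}
            neutron = {"service": "neutron-server", "method": "example_call"}
            send_metric(sock_path, "rpc_client_invocation_total", nova)
            send_metric(sock_path, "rpc_client_invocation_total", nova)
            send_metric(sock_path, "rpc_client_invocation_total", neutron)
            for _ in range(3):
                message = json.loads(server.recv(65535))
                key = (message["name"], tuple(sorted(message["labels"].items())))
                counters[key] += message["value"]
    return render_prometheus(counters)


if __name__ == "__main__":
    print(demo(), end="")
```

In this sketch the receiver only builds the exposition text; serving it over HTTP for Prometheus to scrape is left out to keep the example short.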
  12. How Are We Using oslo.metrics?

    • Metrics visualization
    • Troubleshooting
    • Metrics trend monitoring
  13. How Was the RPC Processing Time?

    Increased gradually and exceeded the timeout threshold.
    [Chart: reserve_block_device_name processing time (s)]
  14. How Many Times Did We Have This Issue? How Likely Will It Happen?

    Processing time did not exceed the timeout threshold in the last 2 weeks.
    [Chart: reserve_block_device_name processing time (s)]
  15. Metrics Trend Monitoring

    [Chart: RPC processing time (axis 0 to 120) per quarter, 2019 Q3 to 2020 Q4, for RPC X, RPC Y and RPC Z, plotted against the RPC timeout threshold. Annotation: likely to exceed timeout threshold next quarter.]
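One simple way to flag an RPC as "likely to exceed the timeout threshold next quarter" is a straight-line extrapolation over the quarterly values. The sketch below assumes an ordinary least-squares fit, which the slides do not specify; the sample data, the 60-second threshold and the function names are hypothetical.

```python
def linear_fit(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    count = len(values)
    xs = range(count)
    mean_x = sum(xs) / count
    mean_y = sum(values) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def projected_next(values):
    """Extrapolate the fitted line one period past the last sample."""
    slope, intercept = linear_fit(values)
    return slope * len(values) + intercept


def likely_to_exceed(values, threshold):
    """True if the projected next-period value is above the threshold."""
    return projected_next(values) > threshold


if __name__ == "__main__":
    # Hypothetical quarterly peak processing times in seconds.
    quarterly_peaks = {
        "RPC X": [12.0, 14.0, 13.0, 15.0],
        "RPC Y": [20.0, 32.0, 41.0, 55.0],
    }
    timeout_threshold = 60.0  # assumed value for illustration
    for rpc_name, peaks in quarterly_peaks.items():
        print(rpc_name, projected_next(peaks), likely_to_exceed(peaks, timeout_threshold))
```

A linear fit is the crudest possible model; it is used here only because it makes the "next quarter" projection easy to follow.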
  16. What’s in Progress?

    • Integration with other Oslo libraries
    • Bottleneck analysis with large scale hypervisor emulation
  17. Other Oslo Libraries Integration

    • oslo.db
      • Query time
      • How many queries are executed?
    • oslo.concurrency
      • Time to acquire lock
      • How long is the lock held?
    • oslo.service
      • Periodic job processing time
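The two lock metrics listed for oslo.concurrency (time to acquire the lock, and how long it is held) could be captured along these lines. This sketch uses a plain `threading.Lock` as a stand-in and does not call oslo.concurrency; the `timed_lock` helper and the sample field names are hypothetical.

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def timed_lock(lock, record):
    """Acquire `lock`, then report wait time and hold time to `record`."""
    wait_start = time.monotonic()
    lock.acquire()
    acquired_at = time.monotonic()
    try:
        yield
    finally:
        released_at = time.monotonic()
        lock.release()
        record({
            "wait_seconds": acquired_at - wait_start,
            "held_seconds": released_at - acquired_at,
        })


if __name__ == "__main__":
    samples = []
    shared_lock = threading.Lock()
    with timed_lock(shared_lock, samples.append):
        time.sleep(0.05)  # simulated critical section
    print(samples)
```

Passing a callable as `record` keeps the timing logic separate from wherever the samples end up, so the same wrapper could feed a list in tests or a metrics sender in production.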
  18. Hypervisor Emulation

    Can our cluster handle 10000 hypervisors?
    [Chart: our hypervisor scale, number of HV (axis 0 to 2200), 2017 to 2020]
  19. Hypervisor Emulation

    [Diagram: a testing tool asks the Hypervisor Emulator for emulated hypervisors ("Need 2000HV", "Need 5000HV"). The emulator spawns Hypervisor Emulation Containers on the emulator runtime (Kubernetes), each running a neutron fake-agent and nova-compute with the fake-driver. The emulated HVs register with the OpenStack C-plane, and the testing tool runs integration tests and benchmark tests.]
  20. Related Presentation

    • How we used RabbitMQ in wrong way at scale
    • https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/23983/how-we-used-rabbitmq-in-wrong-way-at-a-scale