About Us
Motomu Utsumi
Motomu works on the operation and development of LINE's private cloud. He enjoys writing code to solve the problems the team faces in operations.
Reedip Banerjee
Reedip has been working with the OpenStack community since Mitaka and is currently a Cloud Infrastructure Engineer at LINE Corporation. He is interested in networking concepts. LinkedIn: https://www.linkedin.com/in/reedip/
Agenda
1. Background
   1. Introduction to LINE’s Infra
   2. Outages faced in LINE
   3. What can be done to improve the situation
2. Introduction to oslo.metrics
   1. What is oslo.metrics
   2. Architecture
3. How we are using oslo.metrics
   1. Metrics Visualisation
   2. Troubleshooting
   3. Metrics trend monitoring
Background
➤ Who are we
➤ Introduction to LINE Infrastructure
➤ Issues faced in LINE’s Large Scale Infrastructure
➤ What can be done to improve the situation
Who Are We
➤ From Japan, to Taiwan, Indonesia and Korea
• Number of monthly active users of LINE app in 4 major countries: 166m
• Number of monthly active users of LINE app in Japan: 84m
As of Jun. 2020. Source: https://www.statista.com/statistics/560545/number-of-monthly-active-line-app-users-japan/
[Chart: number of monthly users in Japan (millions)]
Introduction to LINE Infrastructure (2/3)
• Hypervisors: 2000+
• Virtual Machines: 50+ thousand
• Physical Servers (Baremetal): 20000~
Data as of 7/13/2020
Introduction to LINE Infrastructure (3/3)
• IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare metal, LB
• PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch
• FaaS: Function as a Service
Outages Faced in LINE
• Outages that motivated us to look inside:
  • RabbitMQ messages were lost
  • RabbitMQ messages were delayed in delivery
  • An RPC server hit an exception and stopped working
  • Time taken by the server to perform an RPC >> RPC timeout
  • A RabbitMQ node went down due to high memory/CPU usage, a memory leak, or hitting the maximum number of socket connections (aka "Too many open files")
  • RabbitMQ split brain, unsynchronised queues
What Can Be Done To Improve the Situation
• Verify scalability issues
  • Identify bottlenecks and limits as scale increases
  • Tune parameters / modify the architecture
• Improve OpenStack reliability
  • Monitor and track latency issues
  • Monitor the number of parallel connections and the time taken
• Make troubleshooting investigations easier
What Is oslo.metrics
• OSS library
• Collects metrics exposed by the internal Oslo libraries (oslo.messaging, oslo.db)
• Lets admins and operators monitor usage of the Oslo libraries
  • Number of RPC calls / API calls
  • Change (delta) in the number of RPC calls and in RPC time as the number of API servers/clients increases or decreases
• Similar to rpc_monitor’s implementation for oslo.messaging, but adaptable to other Oslo libraries as well
Architecture (3/3)
• Should work in an isolated internal network (security)
• Uses a datagram (UDP-style) Unix socket to communicate
• All processes send data to the same Unix socket (e.g. nova-api and neutron-server on a controller)
• Oslo libraries (e.g. oslo.messaging) expose their data in the oslo.metrics format
• oslo.metrics exposes the data on an HTTP endpoint in the Prometheus exposition format (changeable to other standards)
• The required data format is shared between the Oslo library and oslo.metrics
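The data flow on this slide can be sketched in a few lines of standard-library Python. This is only an illustration of the pattern (several processes writing datagrams to one Unix socket, a collector aggregating them and rendering Prometheus-style text), not the actual oslo.metrics implementation. The socket path, the JSON message schema, the metric name `rpc_client_invocation_total`, and the helpers `send_metric` and `render_prometheus` are all assumptions made for the example.

```python
import json
import os
import socket
import tempfile
from collections import Counter

# Hypothetical socket path; a real deployment would use a fixed, configured path.
SOCKET_PATH = os.path.join(tempfile.mkdtemp(), "metrics.sock")


def send_metric(path, name, labels, value=1):
    """Client side: send one counter increment as a JSON datagram (assumed schema)."""
    payload = json.dumps({"name": name, "labels": labels, "value": value})
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), path)


def render_prometheus(counters):
    """Render aggregated counters as Prometheus text exposition lines."""
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Collector side: bind the single socket that every local process writes to.
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(SOCKET_PATH)

# Two different services on the same host report through the same socket.
send_metric(SOCKET_PATH, "rpc_client_invocation_total",
            {"process": "nova-api", "method": "reserve_block_device_name"})
send_metric(SOCKET_PATH, "rpc_client_invocation_total",
            {"process": "nova-api", "method": "reserve_block_device_name"})
send_metric(SOCKET_PATH, "rpc_client_invocation_total",
            {"process": "neutron-server", "method": "report_state"})

# Aggregate the three datagrams by metric name and label set.
counters = Counter()
for _ in range(3):
    msg = json.loads(server.recv(65535).decode("utf-8"))
    key = (msg["name"], tuple(sorted(msg["labels"].items())))
    counters[key] += msg["value"]
server.close()
os.unlink(SOCKET_PATH)

exposition = render_prometheus(counters)
print(exposition)
```

In a real collector the rendered text would be served over HTTP for Prometheus to scrape; here it is only printed to keep the sketch short.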
How Many Times Did We Have This Issue? How Likely Will It Happen?
• Processing time did not exceed the timeout threshold in the last 2 weeks
[Chart: reserve_block_device_name processing time (s)]
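The question on this slide (how often did processing time cross the timeout threshold in a given window) comes down to a simple check over collected samples. A minimal sketch, assuming the samples are available as `(timestamp, seconds)` pairs; the sample values, the `summarize` helper, and the 60-second threshold are illustrative assumptions, not data from the talk.

```python
from datetime import datetime, timedelta

# Assumed threshold for illustration; use the RPC timeout configured in your deployment.
RPC_TIMEOUT_S = 60.0


def summarize(samples, threshold_s, now, window=timedelta(days=14)):
    """Count samples inside the window and how many exceeded the threshold."""
    recent = [secs for ts, secs in samples
              if timedelta(0) <= now - ts <= window]
    exceeded = [secs for secs in recent if secs > threshold_s]
    worst = max(recent, default=0.0)
    return {
        "samples": len(recent),
        "exceeded": len(exceeded),
        "max_s": worst,
        "headroom_s": threshold_s - worst,  # distance between worst case and timeout
    }


# Hypothetical processing times for one RPC method.
now = datetime(2020, 7, 13)
samples = [
    (now - timedelta(days=20), 75.0),  # older than the 2-week window, ignored
    (now - timedelta(days=10), 12.5),
    (now - timedelta(days=3), 41.0),
    (now - timedelta(days=1), 8.2),
]

summary = summarize(samples, RPC_TIMEOUT_S, now)
print(summary)
```

The `headroom_s` value is one way to express "how likely will it happen": a shrinking gap between the worst observed time and the timeout is an early warning even when no sample has exceeded the threshold yet.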
Other Oslo Libraries Integration
• oslo.db
  • Query time
  • How many queries are executed?
• oslo.concurrency
  • Time to acquire lock
  • How long is the lock held?
• oslo.service
  • Periodic job processing time
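As an illustration of the lock metrics listed for oslo.concurrency, the sketch below measures both the time spent waiting to acquire a lock and the time the lock is held. It uses a plain `threading.Lock` and a hypothetical `timed_lock` wrapper with an in-memory list as the sink; it does not use or describe the oslo.concurrency API.

```python
import threading
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real integration would forward these to a collector.
lock_metrics = []


@contextmanager
def timed_lock(lock, name):
    """Acquire `lock`, recording time to acquire and time held."""
    wait_start = time.monotonic()
    lock.acquire()
    acquired_at = time.monotonic()
    try:
        yield
    finally:
        lock.release()
        released_at = time.monotonic()
        lock_metrics.append({
            "lock": name,
            "acquire_s": acquired_at - wait_start,  # time spent waiting for the lock
            "held_s": released_at - acquired_at,    # time spent inside the critical section
        })


resource_lock = threading.Lock()

with timed_lock(resource_lock, "compute_resources"):
    time.sleep(0.01)  # stand-in for work done while holding the lock

print(lock_metrics)
```

Separating the two durations matters when troubleshooting: a long `acquire_s` points at contention, while a long `held_s` points at slow work inside the critical section.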
Related Presentation
• How we used RabbitMQ in wrong way at scale
  • https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/23983/how-we-used-rabbitmq-in-wrong-way-at-a-scale