
Cluster Monitoring of the OpenStack Systems




Open Infra Days, Asia 2021

Co-authors: Reedip Banerjee & Yushiro Furukawa

LINE Developers

September 16, 2021



Transcript

  1. CLUSTER MONITORING OF THE OPENSTACK SYSTEMS: Keep track of what's wrong and what's right. Open Infra Days, Asia 2021
  2. About Us: Yushiro Furukawa has been working as an upstream OpenStack contributor to Neutron and Ironic for several years, and has been a cloud infrastructure engineer at LINE Corporation since 2020. LinkedIn: https://www.linkedin.com/in/yushiro-furukawa-96b273102. Reedip Banerjee has been working with the OpenStack community since Mitaka and is currently a Cloud Infrastructure Engineer at LINE Corporation. He is interested in networking concepts. LinkedIn: https://www.linkedin.com/in/reedip/
  3. Agenda ➤ Background ➤ About LINE ➤ Introduction to LINE Infrastructure ➤ Issues faced in LINE's Large Scale Infrastructure ➤ Reasons to set up Cluster Monitoring ➤ Cluster Monitoring Project ➤ Overview of LINE's OpenStack ➤ Monitored Metrics ➤ Architecture of Cluster Monitoring ➤ How we are using Cluster Monitoring ➤ Troubleshooting ➤ Metrics trend monitoring ➤ Further Improvements ➤ Alerts
  4. Background ➤ Who we are ➤ Introduction to LINE Infrastructure ➤ Reasons to Set Up Cluster Monitoring
  5. About LINE ➤ LINE Corp: ➤ Messenger ➤ Payment ➤ Games ➤ Music ➤ Live Streaming ➤ etc.
  6. About LINE ➤ From Japan to Taiwan, Indonesia, and Korea. As per data observed between April and June 2021: number of monthly active users of the LINE app (global): 188 million; number of daily messages: ~4.9 billion; number of monthly active users of the LINE app in Japan: 88 million.
  7. Introduction to LINE Infrastructure (2/3) ➤ 4,000+ hypervisors ➤ 74+ thousand virtual machines ➤ ~30,000 physical servers (bare metal). Data as of 7/23/2021.
  8. Introduction to LINE Infrastructure (3/3) ➤ IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare Metal, LB ➤ PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch ➤ FaaS: Function as a Service
  9. Actual Structure ➤ Handling issues only reactively and fixing them as they come. ➤ No clear view of end-user resource usage on a daily/weekly/monthly basis. ➤ Cannot evaluate our platform availability.
  10. Issues Faced in LINE's Large Scale Infrastructure ➤ Lack of resource failure awareness • End-users contact our team via Slack, e-mail, etc. if some resource operation has failed. • We cannot notice any resource operation failure by ourselves. ➤ Lack of production cluster status awareness • Production clusters have grown day by day. • But we don't have a way to keep track of cluster status in terms of the API / internal communication / process layers. ➤ Lack of a definition to ensure our platform availability • No SLO
  11. Actual Cases of Resource Failure Awareness ➤ User requests ➤ Cannot access the dashboard ➤ A VM instance's status has changed to "ERROR" ➤ Outages ➤ RabbitMQ messages are lost or delayed in delivery ➤ An RPC server raised an exception and stopped working ➤ RabbitMQ cluster getting partitioned, split brain, unsynchronized queues ➤ Control plane went down due to high memory/CPU usage, memory leaks, or hitting the max socket connections (aka "Too many open files")
  12. Production Cluster Status Awareness ➤ We cannot see details of "trend" topics like the following at a glance: ➤ How long does it take to create/rebuild/delete a VM instance today? ➤ How many projects are created daily? ➤ Performance of each operation on a daily/weekly/monthly basis.
  13. Definition of our Platform Availability ➤ New business units require an SLO for our platform ➤ This is a good time to define SLOs on a daily/monthly basis ➤ API success/failure ratio • e.g. POST /v2.1/{project_id}/servers, etc. ➤ Resource operation success ratio [%] • e.g. openstack server create/delete/rebuild/… ➤ Resource operation time • Synchronous operations (e.g. DNS recordset) • Asynchronous operations (e.g. VM instance)
  14. TODO for the Cluster Monitoring Project ➤ Lack of resource failure awareness → collect data, calculate metrics, and make them visible on a dashboard ➤ Lack of production cluster status awareness → make "trend" data and actual cluster capacity visible ➤ Lack of a definition to ensure our platform availability → define an SLO for each API/resource operation
  15. TO-BE Structure ➤ Get alerts related to the SLO ➤ Be able to check any "trend" or actual cluster resource data on the integrated dashboard.
  16. Monitoring Targets (Summary) ➤ API ➤ Latency, duration ➤ Health check ➤ Success ratio ➤ Internal components ➤ Messaging queue services ➤ Hypervisor status ➤ Agent status ➤ Physical resources ➤ CPU, memory, disk, file descriptors ➤ Trend ➤ SLO ➤ Resource usage by user
  17. Monitoring Metrics
Metric Type | Notes | Exporter Used
Resource Layer Metrics | User resource utilisation, e.g. number of VMs, number of network ports being used, etc. Includes VM creation time, deletion time, and Designate record creation. | Query Exporter for Prometheus
API Success Rate | How well the API responds to PUT, POST, GET requests, based on the return code. | Blackbox Exporter
API Response Time | Time it takes for the API to respond to any user request. Each API response is differentiated based on the type of the request. | Blackbox Exporter
Resource Utilization | Memory utilisation, disk utilisation, number of fds/sockets, CPU core utilisation, etc. Calculated for both C-plane and D-plane (hypervisors, worker nodes, pods, et al.). | Node Exporter, cAdvisor, Socket Statistics Exporter
Agent Status | Status of different agents (network agents, Nova Scheduler, Nova Conductor, Glance Registry, etc.). | Node Exporter
RabbitMQ Related Metrics | Consumer count, queue length, message count/message size. | RabbitMQ Exporter, oslo.metrics (introduced by LINE)
API Health Check | Verification whether the API is actually running or not. | HAProxy Exporter
Hypervisor VM Load | Number of VMs and their status (running, paused, shut down) on each HV. | Custom libvirt exporter with Pushgateway
  18. Detailed Design for Cluster Monitoring Exporters (1/8)
Exporter | How the Exporter Is Used
Query Exporter | Gathers metrics by running queries on the DB.
Blackbox Exporter | Gathers metrics by probing endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
Socket Statistics Exporter | Gathers metrics for network sockets, including SendQ, RecvQ, etc.
oslo.metrics | Gathers data from oslo libraries. Currently covers oslo.messaging, with oslo.db in progress.
HAProxy Exporter | Gathers HAProxy stats.
Pushgateway | Used by ephemeral and batch jobs to expose their metrics to Prometheus.
RabbitMQ Exporter | Gathers metrics from RabbitMQ clusters, including queue size, message size, partitioned clusters, etc.
Node Exporter | Exporter for hardware and OS metrics exposed by *NIX kernels.
cAdvisor | Provides container users an understanding of the resource usage and performance characteristics of their running containers.
  19. Detailed Design for Cluster Monitoring Exporters (2/8): Benchmarking SLO/SLI
➤ Goal:
➤ Benchmark and verify the time taken for a user-facing API operation (VM create/delete/rebuild, etc.) to finish
➤ Instance creation, deletion, and rebuild success rates are calculated from the number of Nova notifications and the status of the instance
➤ SLO calculation:
➤ API success ratio = Σ(matching API calls with response code 2XX or 4XX) / Σ(matching API calls)
➤ API failure ratio = Σ(matching API calls with response code 5XX) / Σ(matching API calls)
Operation | Target Notifications | Notes
Creation | instance.create.start, instance.create.end, instance.create.error | VM must be active after instance.create.end
Deletion | instance.delete.start, instance.delete.end, instance.delete.error |
Rebuild | instance.rebuild.start, instance.rebuild.end, instance.rebuild.error |
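To make the calculation concrete, here is a minimal sketch (not LINE's actual rule definitions) of the success/failure ratios above, bucketing response codes by class:

```python
from collections import Counter

def api_ratios(status_codes):
    """status_codes: HTTP status codes returned by the matching API calls."""
    counts = Counter(code // 100 for code in status_codes)  # bucket into 2xx, 4xx, 5xx, ...
    total = sum(counts.values())
    success = (counts[2] + counts[4]) / total  # 2XX and 4XX count as the platform behaving correctly
    failure = counts[5] / total                # 5XX means the platform itself failed
    return success, failure

# Example: 8 OK responses, 1 user error, 1 server error -> (0.9, 0.1)
print(api_ratios([200] * 8 + [404, 500]))
```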
  20. Detailed Design for Cluster Monitoring Exporters (3/8): Calculating SLO/SLI performance for VM CRD
SLO/SLI calculation procedure (for create):
- Nova/Designate/Neutron sends a resource.create.start notification
- RabbitMQ receives the notification and puts it in a queue
- The notification consumer subscribes to the queue and gets the start notification
- It sends the information to the notification agent, which keeps track of the time, the resource ID, the resource type, and the start notification
- When the resource operation is complete, Nova/Designate/Neutron sends a resource.create.end; if it fails, it sends resource.create.error
- The notification consumer gets the message and sends it to the agent
- The agent verifies the incoming message type (end or error), resource ID, resource type, and time, and sends it to the logs as a tuple
- Elasticsearch gets this information and subtracts the start time from the end/error time to get the SLO for the resource
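As a rough illustration of the pairing step in the procedure above, the sketch below tracks start notifications and emits a (resource, operation, outcome, duration) tuple when the matching end/error notification arrives. The class and event names are simplified stand-ins, not LINE's actual notification agent:

```python
import time

class DurationTracker:
    """Pairs *.start with *.end / *.error notifications and measures the duration."""

    def __init__(self):
        self._start_times = {}  # resource_id -> timestamp of the start notification

    def handle(self, event_type, resource_id, timestamp=None):
        timestamp = timestamp if timestamp is not None else time.time()
        if event_type.endswith(".start"):
            self._start_times[resource_id] = timestamp
            return None
        if event_type.endswith((".end", ".error")):
            started = self._start_times.pop(resource_id, None)
            if started is None:
                return None  # end/error without a matching start notification
            outcome = "success" if event_type.endswith(".end") else "failure"
            operation = event_type.rsplit(".", 1)[0]
            # This tuple is what would be shipped to the logs / Elasticsearch
            return (resource_id, operation, outcome, timestamp - started)

tracker = DurationTracker()
tracker.handle("instance.create.start", "vm-1", timestamp=100.0)
print(tracker.handle("instance.create.end", "vm-1", timestamp=142.5))
# ('vm-1', 'instance.create', 'success', 42.5)
```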
  21. Detailed Design for Cluster Monitoring Exporters (4/8): Gathering metrics using Query Exporter • The Query Exporter runs as a pod in each environment • It uses a config file to log in to the SQL DB • The config file has the DB name and access keys • The Query Exporter runs SQL commands on the DB to fetch the relevant info and exposes it on port 9560 for Prometheus to scrape • The Query Exporter can be used for any type of SQL query
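For illustration only, a stripped-down stand-in for this flow could look like the sketch below: run one SQL query and expose the result on port 9560 for Prometheus. The table and column names (nova.instances, project_id, deleted), credentials, and metric name are assumptions; the real Query Exporter is driven by its config file rather than hard-coded queries:

```python
import time

import pymysql
from prometheus_client import Gauge, start_http_server

VM_COUNT = Gauge("openstack_project_vm_count", "Active (non-deleted) VMs per project", ["project_id"])

def collect(db):
    # Hypothetical query against the Nova DB; the real exporter reads its queries from the config file
    with db.cursor() as cur:
        cur.execute(
            "SELECT project_id, COUNT(*) FROM instances "
            "WHERE deleted = 0 GROUP BY project_id"
        )
        for project_id, count in cur.fetchall():
            VM_COUNT.labels(project_id=project_id).set(count)

if __name__ == "__main__":
    start_http_server(9560)  # Prometheus scrapes http://<pod-ip>:9560/metrics
    db = pymysql.connect(host="nova-db", user="exporter", password="secret", database="nova")
    while True:
        collect(db)
        time.sleep(60)
```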
  22. Detailed Design for Cluster Monitoring Exporters (5/8): Gathering HAProxy metrics. The HAProxy Exporter runs alongside the HAProxy service and fetches all the relevant information about the incoming and outgoing traffic.
  23. Detailed Design for Cluster Monitoring Exporters (6/8): Gathering network metrics using ss-exporter. The ss-exporter gathers netstat/ss metrics from the current environment. It can be used to gather various socket information, including the length of the SendQ and RecvQ, to understand the load on a service.
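Conceptually, the exporter's job can be sketched as follows: parse `ss -tn` output and publish Recv-Q/Send-Q as gauges. The metric names and scrape port here are placeholders, not the actual ss-exporter's:

```python
import subprocess
import time

from prometheus_client import Gauge, start_http_server

RECVQ = Gauge("ss_socket_recvq", "Bytes waiting in the receive queue", ["local", "peer"])
SENDQ = Gauge("ss_socket_sendq", "Bytes waiting in the send queue", ["local", "peer"])

def scrape():
    # `ss -tn` columns: State  Recv-Q  Send-Q  Local-Address:Port  Peer-Address:Port
    output = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        _state, recvq, sendq, local, peer = fields[:5]
        RECVQ.labels(local=local, peer=peer).set(int(recvq))
        SENDQ.labels(local=local, peer=peer).set(int(sendq))

if __name__ == "__main__":
    start_http_server(9101)  # arbitrary port for this sketch
    while True:
        scrape()
        time.sleep(15)
```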
  24. Detailed Design for Cluster Monitoring Exporters (7/8): Calculating API health check and response time using the Blackbox Exporter. The Blackbox Exporter probes the API endpoints to get probe information. Prometheus scrapes the exporter's /probe endpoint to collect, for each API service, whether it returned 2xx/3xx/4xx and how long the probe took.
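The probe itself boils down to something like the sketch below: hit an endpoint and record the status code and response time. The metric names mirror those the Blackbox Exporter exposes, but the endpoint URL is a made-up example and this is not the exporter's implementation:

```python
import time

import requests

def probe(url, timeout=5.0):
    """Return blackbox-style probe results for a single HTTP endpoint."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return {"probe_success": 0, "probe_duration_seconds": time.monotonic() - start}
    # 2xx/3xx/4xx mean the API answered; 5xx (or no answer) means the service is unhealthy
    return {
        "probe_success": int(resp.status_code < 500),
        "probe_http_status_code": resp.status_code,
        "probe_duration_seconds": time.monotonic() - start,
    }

# Hypothetical endpoint: probe the Nova API root
print(probe("http://nova-api.example.local:8774/"))
```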
  25. Detailed Design for Cluster Monitoring Exporters (8/8): Calculating hypervisor utilisation using a custom libvirt exporter and Pushgateway. The libvirt exporter we used gets the number of VMs currently running, paused, or shut down on each HV. This information can be used to understand the scheduling algorithm. This is a custom libvirt exporter; other exporters exist that may not need the Pushgateway.
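A minimal sketch of this pattern is shown below: count libvirt domains per state on one hypervisor and push the result to a Pushgateway. The gateway address, job name, and metric name are placeholders, not those of LINE's custom exporter:

```python
import socket
from collections import Counter

import libvirt  # python3-libvirt bindings
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

STATE_NAMES = {
    libvirt.VIR_DOMAIN_RUNNING: "running",
    libvirt.VIR_DOMAIN_PAUSED: "paused",
    libvirt.VIR_DOMAIN_SHUTOFF: "shutoff",
}

def push_vm_counts(gateway="pushgateway.example.local:9091"):
    registry = CollectorRegistry()
    gauge = Gauge("hypervisor_vm_count", "VMs per state on this hypervisor",
                  ["hypervisor", "state"], registry=registry)
    conn = libvirt.openReadOnly("qemu:///system")
    counts = Counter(STATE_NAMES.get(dom.state()[0], "other")
                     for dom in conn.listAllDomains())
    for state, count in counts.items():
        gauge.labels(hypervisor=socket.gethostname(), state=state).set(count)
    push_to_gateway(gateway, job="libvirt-vm-load", registry=registry)

if __name__ == "__main__":
    push_vm_counts()
```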
  26. Troubleshooting Issue #1 (1/3): SLO for VMs. We saw a drop in the VM create and VM rebuild success rates; they were not even close to 100%.
  27. Troubleshooting Issue #1 (2/3): Agent Status. We found that the Nova Compute agents were not working properly and were reporting errors, which may have caused the create/rebuild operations to fail.
  28. Troubleshooting Issue #1 (3/3): RabbitMQ Health. On further investigation, we found RabbitMQ reporting a partition, with data nodes unable to communicate with the management nodes. As a result, the Nova compute nodes could not communicate, causing the failures in VM create and rebuild.
  29. Troubleshooting Issue #2: Control Plane Health. We monitor the C-plane pods, so we know if any pod on the control plane is failing. As you can see, this monitoring helps us know, as above, when Nova Console Auth and its corresponding containers fail.
  30. Trend Monitoring: SLO. We use this panel to monitor the SLO/SLI described in the earlier slides. As you can see, the method works well with asynchronous commands like VM create and rebuild as well as with synchronous calls (for example, recordset create in Designate).
  31. Trend Monitoring: Resource Utilisation. We monitor resource utilisation to determine whether any control-plane pods are short of a resource and whether we need to increase their request/limit thresholds.
  32. Trend Monitoring: Resource Utilization. Using the Query Exporter, we can determine usage by user/project in a particular cluster and understand whether there is any misuse.
  33. Trend Monitoring: API Success Rate. Using our exporters, we can see the API success rate and understand whether there is any issue, as above, in the currently deployed APIs.
  34. Related Presentations
• How we used RabbitMQ in wrong way at a scale: https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/23983/how-we-used-rabbitmq-in-wrong-way-at-a-scale
• Discover OpenStack's nerve with oslo.metrics: https://www.openstack.org/videos/virtual/Discover-OpenStacks-nerve-with-oslo.metrics-Have-a-robust-private-cloud-on-a-large-scale