
Cluster Monitoring of the OpenStack Systems




Open Infra Days, Asia 2021

Co-authors: Reedip Banerjee & Yushiro Furukawa

LINE Developers

September 16, 2021



Transcript

  1. CLUSTER MONITORING OF THE OPENSTACK SYSTEMS: Keep track of what's wrong and what's right. Open Infra Days, Asia 2021
  2. About Us: Yushiro Furukawa has been working as an upstream OpenStack contributor to Neutron and Ironic for several years, and has been a cloud infrastructure engineer at LINE Corporation since 2020. LinkedIn: https://www.linkedin.com/in/yushiro-furukawa-96b273102. Reedip Banerjee has been working with the OpenStack community since Mitaka and is currently a Cloud Infrastructure Engineer at LINE Corporation. He is interested in networking concepts. LinkedIn: https://www.linkedin.com/in/reedip/
  3. Agenda ➤ Background ➤ About LINE ➤ Introduction to LINE Infrastructure ➤ Issues faced in LINE's Large Scale Infrastructure ➤ Reasons to set up Cluster Monitoring ➤ Cluster Monitoring Project ➤ Overview of LINE's OpenStack ➤ Monitored Metrics ➤ Architecture of Cluster Monitoring ➤ How we are using Cluster Monitoring ➤ Troubleshooting ➤ Metrics trend monitoring ➤ Further Improvements ➤ Alerts
  4. Background ➤ Who we are ➤ Introduction to LINE Infrastructure ➤ Reasons to Set Up Cluster Monitoring
  5. About LINE ➤ LINE Corp: ➤ Messenger ➤ Payment ➤ Games ➤ Music ➤ Live Streaming ➤ etc.
  6. About LINE ➤ From Japan to Taiwan, Indonesia, and Korea. As per data observed between April and June 2021: number of monthly active users of the LINE app (global): 188 million; number of daily messages: ~4.9 billion; number of monthly active users of the LINE app in Japan: 88 million.
  7. Introduction to LINE Infrastructure (2/3) ➤ 4,000+ hypervisors ➤ 74+ thousand virtual machines ➤ ~30,000 physical servers (bare metal). Data as of 7/23/2021.
  8. Introduction to LINE Infrastructure (3/3) ➤ IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare Metal, LB ➤ PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch ➤ FaaS: Function as a Service
  9. Actual Structure ➤ Handling issues only reactively and fixing them as they come. ➤ No clear view of end-user resource usage on a daily/weekly/monthly basis. ➤ Cannot evaluate our platform availability.
  10. Issues Faced in LINE's Large Scale Infrastructure ➤ Lack of resource failure awareness • End-users contact our team via Slack, e-mail, etc. if some resource operation has failed. • We cannot notice any resource operation failure by ourselves. ➤ Lack of production cluster status awareness • Production clusters have grown day by day. • But we don't have a way to keep track of cluster status in terms of the API / internal communication / process layers. ➤ Lack of a definition to ensure our platform availability • No SLO
  11. Actual Cases of Resource Failure Awareness ➤ User requests ➤ Cannot access the dashboard ➤ A VM instance's status has changed to "ERROR" ➤ Outages ➤ RabbitMQ messages are lost or delayed in delivery ➤ An RPC server raised an exception and stopped working ➤ RabbitMQ cluster getting partitioned, split brain, unsynchronized queues ➤ Control plane went down due to high memory/CPU usage, memory leaks, or hitting the max socket connections (aka "Too many open files")
  12. Production Cluster Status Awareness ➤ We cannot see details of "trend" topics like the following at a glance: ➤ How long does it take to create/rebuild/delete a VM instance today? ➤ How many projects are created daily? ➤ Performance of each operation on a daily/weekly/monthly basis.
  13. Definition of our Platform Availability ➤ New business units require an SLO for our platform ➤ This is a good time to define SLOs on a daily/monthly basis ➤ API success/failure ratio • e.g. POST /v2.1/{project_id}/servers, etc. ➤ Resource operation success ratio [%] • e.g. openstack server create/delete/rebuild/… ➤ Resource operation time • Synchronous operations (e.g. DNS recordset) • Asynchronous operations (e.g. VM instance)
  14. TODO for the Cluster Monitoring Project ➤ Lack of resource failure awareness → collect data, calculate metrics, and make them visible on a dashboard ➤ Lack of production cluster status awareness → make "trend" data and actual cluster capacity visible ➤ Lack of a definition to ensure our platform availability → define an SLO for each API/resource operation
  15. TO-BE Structure ➤ Get alerts related to the SLO ➤ Be able to check any "trend" or actual cluster resource data on the integrated dashboard.
  16. Monitoring Targets (Summary) ➤ API ➤ Latency, duration ➤ Health check ➤ Success ratio ➤ Internal components ➤ Messaging queue services ➤ Hypervisor status ➤ Agent status ➤ Physical resources ➤ CPU, memory, disk, file descriptors ➤ Trend ➤ SLO ➤ Resource usage by user
  17. Monitoring Metrics
Metric Type | Notes | Exporter Used
Resource Layer Metrics | User resource utilisation, e.g. number of VMs, number of network ports being used, etc. Includes VM creation time, deletion time, and Designate record creation. | Query Exporter for Prometheus
API Success Rate | How well the API responds to PUT, POST, GET requests, based on the return code. | Blackbox Exporter
API Response Time | Time it takes for the API to respond to any user request. Each API response is differentiated based on the type of the request. | Blackbox Exporter
Resource Utilization | Memory utilisation, disk utilisation, number of fds/sockets, CPU core utilisation, etc. Calculated for both C-plane and D-plane (hypervisors, worker nodes, pods, et al.). | Node Exporter, cAdvisor, Socket Statistics Exporter
Agent Status | Status of different agents (network agents, Nova Scheduler, Nova Conductor, Glance Registry, etc.). | Node Exporter
RabbitMQ Related Metrics | Consumer count, queue length, message count/message size. | RabbitMQ Exporter, oslo.metrics (introduced by LINE)
API Health Check | Verification whether the API is actually running or not. | HAProxy Exporter
Hypervisor VM Load | Number of VMs and their status (running, paused, shut down) on each HV. | Custom libvirt exporter with Pushgateway
  18. Detailed Design for Cluster Monitoring Exporters (1/8)
Exporter | How the Exporter Is Used
Query Exporter | Gathers metrics by running queries on the DB.
Blackbox Exporter | Gathers metrics by probing endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
Socket Statistics Exporter | Gathers metrics for network sockets, including SendQ, RecvQ, etc.
oslo.metrics | Gathers data from oslo libraries. Currently covers oslo.messaging, with oslo.db in progress.
HAProxy Exporter | Gathers HAProxy stats.
Pushgateway | Used by ephemeral and batch jobs to expose their metrics to Prometheus.
RabbitMQ Exporter | Gathers metrics from RabbitMQ clusters, including queue size, message size, partitioned clusters, etc.
Node Exporter | Exporter for hardware and OS metrics exposed by *NIX kernels.
cAdvisor | Provides container users an understanding of the resource usage and performance characteristics of their running containers.
  19. Detailed Design for Cluster Monitoring Exporters (2/8): Benchmarking SLO/SLI
➤ Goal:
➤ Benchmark and verify the time taken for a user-facing API operation (VM create/delete/rebuild, etc.) to finish
➤ Instance creation, deletion, and rebuild success rates are calculated from the number of Nova notifications and the status of the instance
➤ SLO calculation:
➤ API success ratio = Σ(matching API calls with response code 2XX or 4XX) / Σ(matching API calls)
➤ API failure ratio = Σ(matching API calls with response code 5XX) / Σ(matching API calls)
Operation | Target Notifications | Notes
Creation | instance.create.start, instance.create.end, instance.create.error | VM must be active after instance.create.end
Deletion | instance.delete.start, instance.delete.end, instance.delete.error |
Rebuild | instance.rebuild.start, instance.rebuild.end, instance.rebuild.error |
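To make the calculation concrete, here is a minimal sketch (not LINE's actual rule definitions) of the success/failure ratios above, bucketing response codes by class:

```python
from collections import Counter

def api_ratios(status_codes):
    """status_codes: HTTP status codes returned by the matching API calls."""
    counts = Counter(code // 100 for code in status_codes)  # bucket into 2xx, 4xx, 5xx, ...
    total = sum(counts.values())
    success = (counts[2] + counts[4]) / total  # 2XX and 4XX count as the platform behaving correctly
    failure = counts[5] / total                # 5XX means the platform itself failed
    return success, failure

# Example: 8 OK responses, 1 user error, 1 server error -> (0.9, 0.1)
print(api_ratios([200] * 8 + [404, 500]))
```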
  20. Detailed Design for Cluster Monitoring Exporters (3/8): Calculating SLO/SLI performance for VM CRD
SLO/SLI calculation procedure (for create):
- Nova/Designate/Neutron sends a resource.create.start notification
- RabbitMQ receives the notification and puts it in a queue
- The notification consumer subscribes to the queue and gets the start notification
- It sends the information to the notification agent, which keeps track of the time, the resource ID, the resource type, and the start notification
- When the resource operation is complete, Nova/Designate/Neutron sends a resource.create.end; if it fails, it sends resource.create.error
- The notification consumer gets the message and sends it to the agent
- The agent verifies the incoming message type (end or error), resource ID, resource type, and time, and sends it to the logs as a tuple
- Elasticsearch gets this information and subtracts the start time from the end/error time to get the SLO for the resource
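As a rough illustration of the pairing step in the procedure above, the sketch below tracks start notifications and emits a (resource, operation, outcome, duration) tuple when the matching end/error notification arrives. The class and event names are simplified stand-ins, not LINE's actual notification agent:

```python
import time

class DurationTracker:
    """Pairs *.start with *.end / *.error notifications and measures the duration."""

    def __init__(self):
        self._start_times = {}  # resource_id -> timestamp of the start notification

    def handle(self, event_type, resource_id, timestamp=None):
        timestamp = timestamp if timestamp is not None else time.time()
        if event_type.endswith(".start"):
            self._start_times[resource_id] = timestamp
            return None
        if event_type.endswith((".end", ".error")):
            started = self._start_times.pop(resource_id, None)
            if started is None:
                return None  # end/error without a matching start notification
            outcome = "success" if event_type.endswith(".end") else "failure"
            operation = event_type.rsplit(".", 1)[0]
            # This tuple is what would be shipped to the logs / Elasticsearch
            return (resource_id, operation, outcome, timestamp - started)

tracker = DurationTracker()
tracker.handle("instance.create.start", "vm-1", timestamp=100.0)
print(tracker.handle("instance.create.end", "vm-1", timestamp=142.5))
# ('vm-1', 'instance.create', 'success', 42.5)
```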
  21. Detailed Design for Cluster Monitoring Exporters (4/8): Gathering metrics using Query Exporter • The Query Exporter runs as a pod in each environment • It uses a config file to log in to the SQL DB • The config file has the DB name and access keys • The Query Exporter runs SQL commands on the DB to fetch the relevant info and exposes it on port 9560 for Prometheus to scrape • The Query Exporter can be used for any type of SQL query
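For illustration only, a stripped-down stand-in for this flow could look like the sketch below: run one SQL query and expose the result on port 9560 for Prometheus. The table and column names (nova.instances, project_id, deleted), credentials, and metric name are assumptions; the real Query Exporter is driven by its config file rather than hard-coded queries:

```python
import time

import pymysql
from prometheus_client import Gauge, start_http_server

VM_COUNT = Gauge("openstack_project_vm_count", "Active (non-deleted) VMs per project", ["project_id"])

def collect(db):
    # Hypothetical query against the Nova DB; the real exporter reads its queries from the config file
    with db.cursor() as cur:
        cur.execute(
            "SELECT project_id, COUNT(*) FROM instances "
            "WHERE deleted = 0 GROUP BY project_id"
        )
        for project_id, count in cur.fetchall():
            VM_COUNT.labels(project_id=project_id).set(count)

if __name__ == "__main__":
    start_http_server(9560)  # Prometheus scrapes http://<pod-ip>:9560/metrics
    db = pymysql.connect(host="nova-db", user="exporter", password="secret", database="nova")
    while True:
        collect(db)
        time.sleep(60)
```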
  22. Detailed Design for Cluster Monitoring Exporters (5/8): Gathering HAProxy metrics. The HAProxy Exporter runs alongside the HAProxy service and fetches all the relevant information about the incoming and outgoing traffic.
  23. Detailed Design for Cluster Monitoring Exporters (6/8): Gathering network metrics using ss-exporter. The ss-exporter gathers netstat/ss metrics from the current environment. It can be used to gather various socket information, including the length of the SendQ and RecvQ, to understand the load on a service.
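Conceptually, the exporter's job can be sketched as follows: parse `ss -tn` output and publish Recv-Q/Send-Q as gauges. The metric names and scrape port here are placeholders, not the actual ss-exporter's:

```python
import subprocess
import time

from prometheus_client import Gauge, start_http_server

RECVQ = Gauge("ss_socket_recvq", "Bytes waiting in the receive queue", ["local", "peer"])
SENDQ = Gauge("ss_socket_sendq", "Bytes waiting in the send queue", ["local", "peer"])

def scrape():
    # `ss -tn` columns: State  Recv-Q  Send-Q  Local-Address:Port  Peer-Address:Port
    output = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        _state, recvq, sendq, local, peer = fields[:5]
        RECVQ.labels(local=local, peer=peer).set(int(recvq))
        SENDQ.labels(local=local, peer=peer).set(int(sendq))

if __name__ == "__main__":
    start_http_server(9101)  # arbitrary port for this sketch
    while True:
        scrape()
        time.sleep(15)
```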
  24. Detailed Design for Cluster Monitoring Exporters (7/8): Calculating API health check and response time using the Blackbox Exporter. The Blackbox Exporter probes the API endpoints to get probe information. Prometheus scrapes the exporter's /probe endpoint to collect, for each API service, whether it returned 2xx/3xx/4xx and how long the probe took.
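The probe itself boils down to something like the sketch below: hit an endpoint and record the status code and response time. The metric names mirror those the Blackbox Exporter exposes, but the endpoint URL is a made-up example and this is not the exporter's implementation:

```python
import time

import requests

def probe(url, timeout=5.0):
    """Return blackbox-style probe results for a single HTTP endpoint."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return {"probe_success": 0, "probe_duration_seconds": time.monotonic() - start}
    # 2xx/3xx/4xx mean the API answered; 5xx (or no answer) means the service is unhealthy
    return {
        "probe_success": int(resp.status_code < 500),
        "probe_http_status_code": resp.status_code,
        "probe_duration_seconds": time.monotonic() - start,
    }

# Hypothetical endpoint: probe the Nova API root
print(probe("http://nova-api.example.local:8774/"))
```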
  25. Detailed Design for Cluster Monitoring Exporters (8/8): Calculating hypervisor utilisation using a custom libvirt exporter and Pushgateway. The libvirt exporter we used gets the number of VMs currently running, paused, or shut down on each HV. This information can be used to understand the scheduling algorithm. This is a custom libvirt exporter; other exporters exist that may not need the Pushgateway.
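A minimal sketch of this pattern is shown below: count libvirt domains per state on one hypervisor and push the result to a Pushgateway. The gateway address, job name, and metric name are placeholders, not those of LINE's custom exporter:

```python
import socket
from collections import Counter

import libvirt  # python3-libvirt bindings
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

STATE_NAMES = {
    libvirt.VIR_DOMAIN_RUNNING: "running",
    libvirt.VIR_DOMAIN_PAUSED: "paused",
    libvirt.VIR_DOMAIN_SHUTOFF: "shutoff",
}

def push_vm_counts(gateway="pushgateway.example.local:9091"):
    registry = CollectorRegistry()
    gauge = Gauge("hypervisor_vm_count", "VMs per state on this hypervisor",
                  ["hypervisor", "state"], registry=registry)
    conn = libvirt.openReadOnly("qemu:///system")
    counts = Counter(STATE_NAMES.get(dom.state()[0], "other")
                     for dom in conn.listAllDomains())
    for state, count in counts.items():
        gauge.labels(hypervisor=socket.gethostname(), state=state).set(count)
    push_to_gateway(gateway, job="libvirt-vm-load", registry=registry)

if __name__ == "__main__":
    push_vm_counts()
```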
  26. Troubleshooting Issue #1 (1/3): SLO for VMs. We saw a drop in the VM create and VM rebuild success rates; they were not even close to 100%.
  27. Troubleshooting Issue #1 (2/3): Agent Status. We found that the Nova Compute agents were not working properly and were reporting errors, which may have caused the create/rebuild operations to fail.
  28. Troubleshooting Issue #1 (3/3): RabbitMQ Health. On further investigation, we found RabbitMQ reporting a partition, with data nodes unable to communicate with the management nodes. As a result, the Nova compute nodes could not communicate, causing the failures in VM create and rebuild.
  29. Troubleshooting Issue #2: Control Plane Health. We monitor the C-plane pods, so we know if any pod on the control plane is failing. As you can see, this monitoring helps us know, as above, when Nova Console Auth and its corresponding containers fail.
  30. Trend Monitoring: SLO. We use this panel to monitor the SLO/SLI described in the earlier slides. As you can see, the method works well with asynchronous commands like VM create and rebuild as well as with synchronous calls (for example, recordset create in Designate).
  31. Trend Monitoring: Resource Utilisation. We monitor resource utilisation to determine whether any control-plane pods are short of a resource and whether we need to increase their request/limit thresholds.
  32. Trend Monitoring: Resource Utilization. Using the Query Exporter, we can determine usage by user/project in a particular cluster and understand whether there is any misuse.
  33. Trend Monitoring: API Success Rate. Using our exporters, we can see the API success rate and understand whether there is any issue, as above, in the currently deployed APIs.
  34. Related Presentations
• How we used RabbitMQ in wrong way at a scale: https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/23983/how-we-used-rabbitmq-in-wrong-way-at-a-scale
• Discover OpenStack's nerve with oslo.metrics: https://www.openstack.org/videos/virtual/Discover-OpenStacks-nerve-with-oslo.metrics-Have-a-robust-private-cloud-on-a-large-scale