oslo.metrics Monitoring OpenStack RPC Calls

Open Infra Days, Asia 2021

Authors: Gene Kuo

LINE Developers

September 11, 2021

Transcript

  1. About Me
    • Gene Kuo
    • Infrastructure Engineer @ LINE
    • Co-Organizer @ Cloud Native Taiwan User Group
    • Co-chair @ Large Scale SIG
  2. Outline
    • Overview of LINE's Private Cloud: Verda
    • Introduction to oslo.metrics
      • Why oslo.metrics?
      • What is oslo.metrics?
      • Architecture
    • How is oslo.metrics Used?
      • Metrics visualization
      • Troubleshooting
      • Metrics trend monitoring
    • Upstream Efforts
    • Demo
  4. High Level Architecture
    • IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare metal, LB
    • PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch
    • FaaS: Function as a Service
  5. About Scale (data as of Jul. 2021)
    • Hypervisors: 4000+
    • Virtual machines: 74+ thousand
    • Physical servers (bare metal): 30000~
  6. Why oslo.metrics
    • Outages which motivated us to look inside:
      • RabbitMQ messages are lost
      • RabbitMQ messages are delayed in delivery
      • RPC server got an exception and stopped working
      • Time taken by the server to process an RPC > RPC timeout
      • RabbitMQ cluster went down
      • RabbitMQ split brain, unsynchronized queues
    • Some of these issues couldn't be detected by monitoring the RabbitMQ cluster alone
  7. What is oslo.metrics
    • Part of the oslo project
    • Collects metrics from oslo libraries and exposes them in Prometheus format
    • Enables operators to monitor usage of oslo libraries
      • Number of RPC calls
      • Number of RPC exceptions
      • Time used to process RPC calls
    • Monitoring from the OpenStack perspective
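To make "exposes them in Prometheus format" concrete, here is a minimal standard-library sketch of how a labelled counter looks in the Prometheus text exposition format. It is not the oslo.metrics implementation: the metric name, label names, and the `render_counter` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical metric name; the real oslo.metrics names may differ.
METRIC = "rpc_client_exception_total"


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    Label values are not escaped here, to keep the sketch short.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


samples = Counter()
# Each key is a sorted tuple of (label, value) pairs identifying one series.
series = (
    ("exception", "MessagingTimeout"),
    ("method", "reserve_block_device_name"),
)
samples[series] += 1  # first timeout observed
samples[series] += 1  # second timeout observed

text = render_counter(METRIC, "Number of RPC client exceptions", samples)
print(text)
```

A Prometheus server scraping such text would see one time series per distinct label combination, which is what lets an operator break RPC counts down by method or exception type.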
  8. Architecture
    • Should be used in an isolated network (Security)
    • Uses a UDP Unix socket to communicate
    • Oslo libraries are patched to send data
      • oslo.messaging patch
    • All processes send data to the same Unix socket on each host
      • Differentiated by labels
    • oslo.metrics listens on the socket, processes the data, and exposes it
    • Prometheus scrapes the exposed metrics
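The data flow on this slide (many processes sending to one Unix socket per host, a collector aggregating by label) can be sketched with the standard library alone. This is an illustration under assumptions, not the oslo.metrics code: the JSON payload shape, the metric name, the `process` label, and the socket path are all hypothetical, and the senders are simulated inside a single process.

```python
import json
import os
import socket
import tempfile
from collections import Counter

# Hypothetical socket path; a real deployment would configure its own.
sock_path = os.path.join(tempfile.mkdtemp(), "metrics.sock")

# Collector side: bind a datagram Unix socket and wait for metric events.
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(sock_path)
server.settimeout(1.0)


def send_metric(name, labels):
    """Client side: send one JSON event to the shared socket, no reply expected."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        payload = json.dumps({"name": name, "labels": labels}).encode()
        client.sendto(payload, sock_path)
    finally:
        client.close()


# Two simulated services report to the same socket; labels tell them apart.
method = "reserve_block_device_name"
send_metric("rpc_client_invocation_total", {"process": "nova-api", "method": method})
send_metric("rpc_client_invocation_total", {"process": "nova-compute", "method": method})
send_metric("rpc_client_invocation_total", {"process": "nova-api", "method": method})

# Collector loop: one counter per (metric name, label set).
counters = Counter()
for _ in range(3):
    event = json.loads(server.recv(4096))
    key = (event["name"], tuple(sorted(event["labels"].items())))
    counters[key] += 1

server.close()
os.unlink(sock_path)
```

Because datagram sends need no connection or reply, a design like this adds very little to the RPC path itself; the trade-off is that events can be dropped if the collector is not listening.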
  9. Did We Get an Error?
    • Graph: RPC client exception count
    • nova reserve_block_device_name: messaging timeout
  10. How Much Time Did the RPC Spend?
    • Graph: RPC server processing time (s)
    • Increased gradually and exceeded the timeout
  11. Did It Happen Before?
    • Graph: RPC server processing time (s)
    • Processing time did not exceed the timeout threshold in the last 2 weeks
  12. Trend Monitoring
    • Graph: RPC processing time (axis 0 to 120) for RPC X, RPC Y, and RPC Z, from 2019 Q3 to 2020 Q4, plotted against the RPC timeout threshold
    • Likely to exceed the timeout threshold next quarter
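One way to turn such a trend graph into an alert is to fit a straight line to past quarterly values and extrapolate one quarter ahead. The sketch below shows the idea with a plain least-squares fit. The numbers are made up for illustration, not taken from the talk, and the 60-second threshold is an assumed value.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Made-up processing times in seconds, one value per quarter (illustration only).
history = [10.0, 25.0, 40.0, 55.0]
threshold = 60.0  # assumed RPC timeout in seconds

slope, intercept = fit_line(history)
forecast = slope * len(history) + intercept  # predicted value for next quarter
at_risk = forecast >= threshold

print(f"forecast for next quarter: {forecast:.1f}s, at risk: {at_risk}")
```

A linear fit is the simplest possible model; a production setup might prefer a built-in prediction function in its monitoring system, but the reasoning is the same.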
  13. Current Statistics
    • oslo.metrics
      • Basic functionality
      • Unit tests
      • https://opendev.org/openstack/oslo.metrics
    • oslo.messaging integration
      • RPC client metrics
      • https://opendev.org/openstack/oslo.messaging/commit/bdbb6d62ee20bfd5ffc59f8772a5a0e60614ba90
  14. Current Statistics
    • Documentation
      • How to test with devstack
    • Moving forward to the 1.0.0 release
    • Encourage everyone to try it out and report bugs/suggestions!
      • https://bugs.launchpad.net/oslo
      • OpenStack-discuss mailing list
      • Large Scale SIG meetings
  15. Future Works
    • 1.0.0 release
    • Integration with more oslo libraries
      • oslo.db
        • Transaction count
        • Transaction time
        • Query count
    • More detailed documentation
    • Functional tests
    • Integration with deployment tools