
DISCOVER OPENSTACK'S NERVE WITH OSLO METRICS

2020 Open Infrastructure Summit
https://www.youtube.com/watch?v=pNyjMRJViac
https://summit.openinfra.dev/

Speakers: Motomu Utsumi, Reedip Banerjee

LINE Developers

October 20, 2020

Transcript

  1. DISCOVER OPENSTACK’S
    NERVE WITH
    OSLO.METRICS
    Have A Robust Private Cloud On A Large Scale

  2. About Us
    Motomu Utsumi
    Motomu works on the operation and development of LINE's private cloud.
    He enjoys writing code to solve the problems the team faces in operations.
    Reedip Banerjee
    Reedip has been working with the OpenStack community since Mitaka and is currently a Cloud Infrastructure Engineer at LINE Corporation.
    He is interested in networking concepts.
    LinkedIn: https://www.linkedin.com/in/reedip/

  3. Agenda
    1. Background
      1. Introduction to LINE’s Infra
      2. Outages faced in LINE
      3. What can be done to improve the situation
    2. Introduction to oslo.metrics
      1. What is oslo.metrics
      2. Architecture
    3. How we are using oslo.metrics
      1. Metrics Visualisation
      2. Troubleshooting
      3. Metrics trend monitoring

  4. Background
    ➤Who are we
    ➤Introduction to LINE Infrastructure
    ➤Issues faced in LINE’s Large Scale Infrastructure
    ➤What can be done to improve the situation

  5. Who Are We
    ➤ LINE Corp:
    ➤ Messenger
    ➤ Pay
    ➤ Games
    ➤ Music
    ➤ TV
    ➤ And many more services…

  6. Who Are We
    ➤ From Japan, to Taiwan, Indonesia and Korea
    Number of monthly active users of the LINE app, as of Jun. 2020:
    ➤ 4 major countries: 166m
    ➤ Japan: 84m
    Graph: number of monthly users in Japan (millions)
    Source: https://www.statista.com/statistics/560545/number-of-monthly-active-line-app-users-japan/

  7. Introduction to LINE Infrastructure (1/3)

  8. Introduction to LINE Infrastructure (2/3)
    2000+ Hypervisors
    50+ thousand Virtual Machines
    20000~ Physical Servers
    Figure labels: Hypervisors, Virtual Machine, Baremetal, LINE TV
    Data as of 7/13/2020

  9. Introduction to LINE Infrastructure (3/3)
    IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare metal, LB
    PaaS: Kubernetes, Kafka, Redis, MySQL, ElasticSearch
    FaaS: Function as a Service

  10. Outages Faced in LINE
    • Outages that motivated us to look inside:
      • RabbitMQ messages were lost
      • RabbitMQ messages were delayed in delivery
      • RPC server got an exception and stopped working
      • Time taken by the server to perform an RPC >> RPC timeout
      • RabbitMQ node went down due to high memory/CPU usage, a memory leak, or max socket connections (aka "Too many open files")
      • RabbitMQ split brain, unsynchronised queues

  11. What Can Be Done To Improve the Situation
    • Verify scalability issues
    • Identify bottlenecks and limits as scale increases
    • Tune parameters / modify the architecture
    • Improve OpenStack reliability
    • Monitor and track latency issues
    • Monitor the number of parallel connections and the time taken
    • Make troubleshooting investigations easier

  12. Introduction To oslo.metrics
    • What is oslo.metrics
    • Architecture

  13. What Is oslo.metrics
    • OSS library
    • Collects metrics exposed by the internal Oslo libraries (oslo.messaging, oslo.db)
    • Lets admins and operators monitor usage of the Oslo libraries
    • Provides the number of RPC calls / API calls
    • Shows the change (delta) in RPC calls and RPC time as the number of API servers/clients increases or decreases
    • Similar to rpc_monitor’s implementation for oslo.messaging, but adaptable to other Oslo libraries as well
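The kind of data named on the "What Is oslo.metrics" slide (number of RPC calls, RPC time) can be pictured with a small sketch. This is not the oslo.metrics API: `RpcStats` and `timed_call` are hypothetical names used only for illustration, and the collector lives in-process for simplicity.

```python
import time
from collections import defaultdict


class RpcStats:
    """Hypothetical in-process collector; NOT the oslo.metrics API."""

    def __init__(self):
        self.calls = defaultdict(int)      # method name -> invocation count
        self.seconds = defaultdict(float)  # method name -> total processing time

    def observe(self, method, elapsed):
        """Record one finished call and how long it took, in seconds."""
        self.calls[method] += 1
        self.seconds[method] += elapsed

    def average(self, method):
        """Average processing time in seconds, or 0.0 if never called."""
        count = self.calls[method]
        return self.seconds[method] / count if count else 0.0


def timed_call(stats, method, func, *args, **kwargs):
    """Run func and record its duration under the given RPC method name."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        # Recorded even when func raises, so failed calls are counted too.
        stats.observe(method, time.monotonic() - start)
```

Comparing `stats.calls` and `stats.average(...)` before and after adding API servers or clients gives the kind of delta the slide refers to.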

  14. Architecture(1/3)
    • Oslo libraries -> oslo.metrics -> Prometheus [1]

  15. Architecture(2/3)
    • Collect metrics from both the RPC layer and the AMQP driver layer.

  16. Architecture(3/3)
    • Should work in an isolated internal network (security)
    • Uses a Unix datagram (UDP-style) socket to communicate
    • All processes send data to the same Unix socket (e.g. nova-api and neutron-server on a controller)
    • Oslo libraries (e.g. oslo.messaging) expose the data in oslo.metrics format
    • oslo.metrics exposes the data on an HTTP endpoint in Prometheus exposition format (changeable to other standards)
    • The required data format is shared by the Oslo library with oslo.metrics
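The flow on the Architecture (3/3) slide (several processes writing to one Unix socket, a collector rendering Prometheus text) can be sketched with the standard library alone. The JSON payload, the metric name `rpc_client_invocation_total` and the socket file name are assumptions made for this illustration; the real oslo.metrics wire format is not shown in the slides. The HTTP endpoint is left out, and only the text rendering is shown.

```python
import json
import os
import socket
import tempfile


def send_metric(sock_path, name, labels, value):
    """Client side: send one JSON datagram (payload format is an assumption)."""
    payload = json.dumps({"name": name, "labels": labels, "value": value})
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        client.sendto(payload.encode("utf-8"), sock_path)
    finally:
        client.close()


def render_exposition(counters):
    """Render {(name, labels_tuple): value} as Prometheus text exposition."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def demo():
    """Two simulated processes report to one socket; return the rendered text."""
    with tempfile.TemporaryDirectory() as tmp:
        sock_path = os.path.join(tmp, "metrics.sock")  # hypothetical file name
        server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        server.bind(sock_path)
        server.settimeout(2.0)
        counters = {}
        try:
            for process in ("nova-api", "neutron-server"):
                send_metric(sock_path, "rpc_client_invocation_total",
                            {"process": process}, 1)
            for _ in range(2):
                msg = json.loads(server.recv(65535).decode("utf-8"))
                key = (msg["name"], tuple(sorted(msg["labels"].items())))
                counters[key] = counters.get(key, 0) + msg["value"]
        finally:
            server.close()
    return render_exposition(counters)
```

A datagram socket suits this design because senders never block waiting for the collector, and a Unix socket keeps the traffic on the local host. `AF_UNIX` datagram sockets are assumed to be available, which is the case on Linux.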

  17. How Are We Using oslo.metrics?
    • Metrics visualization
    • Troubleshooting
    • Metrics trend monitoring

  18. Metrics Visualization With Prometheus and Grafana

  19. Metrics Change: Instance Build
    Graphs: RPC server invocation count; RPC server average processing time (s)

  20. Metrics Change: Scheduling Instances
    Graph: average select_destination processing time (s) for 10, 100 and 500 instances

  21. Troubleshoot With oslo.metrics
    Instance had duplicated volumes_attached entries

  22. Did We Get an Error?
    Graph: RPC client exception count
    nova, reserve_block_device_name: Messaging Timeout

  23. How Was the RPC Processing Time?
    Increased gradually and exceeded the timeout threshold
    Graph: reserve_block_device_name processing time (s)

  24. How Many Times Did We Have This Issue? How Likely Will It Happen?
    Processing time did not exceed the timeout threshold in the last 2 weeks
    Graph: reserve_block_device_name processing time (s)

  25. Metrics Trend Monitoring
    Graph: RPC processing time (axis 0 to 120) per quarter, 2019 Q3 to 2020 Q4, for RPC X, RPC Y and RPC Z, against the RPC timeout threshold
    Likely to exceed timeout threshold next quarter
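The "likely to exceed timeout threshold next quarter" judgement on the trend slide can be approximated with a straight-line fit over past samples. This is a minimal sketch of one possible approach, a least-squares extrapolation; the slides do not say how the trend was actually estimated, and the function names are hypothetical.

```python
def linear_fit(values):
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def will_exceed_next_period(values, threshold):
    """Extrapolate one period ahead; return (exceeds_threshold, predicted_value)."""
    slope, intercept = linear_fit(values)
    predicted = slope * len(values) + intercept
    return predicted > threshold, predicted
```

Feeding in one processing-time value per quarter for a given RPC, together with the RPC timeout as `threshold`, flags the RPCs worth tuning before they start timing out. A linear fit is the simplest choice and will miss sudden jumps.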

  26. What’s in Progress?
    • Integration with other Oslo libraries
    • Bottleneck analysis with large scale hypervisor emulation

  27. Other Oslo Libraries Integration
    • oslo.db
      • Query time
      • How many queries are executed?
    • oslo.concurrency
      • Time to acquire lock
      • How long is the lock held?
    • oslo.service
      • Periodic job processing time
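The two lock measurements listed for oslo.concurrency (time to acquire, time held) can be illustrated with a plain `threading.Lock`. This sketch does not use oslo.concurrency itself, and `timed_lock` is a hypothetical helper, not part of any Oslo library.

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def timed_lock(lock, record):
    """Hold lock for the with-block; append (wait_seconds, held_seconds) to record."""
    requested = time.monotonic()
    lock.acquire()
    acquired = time.monotonic()
    try:
        yield
    finally:
        lock.release()
        # Wait time is request-to-acquire; held time is acquire-to-release.
        record.append((acquired - requested, time.monotonic() - acquired))
```

Usage: wrap the critical section in `with timed_lock(lock, samples): ...`, then read the `(wait, held)` pairs from `samples`. A long wait with a short hold points at contention, while a long hold points at slow work inside the critical section.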

  28. Hypervisor Emulation
    Can our cluster handle 10000 hypervisors?
    Graph: Our Hypervisor Scale, number of HV (axis 0 to 2200) per year, 2017 to 2020

  29. Hypervisor Emulation
    Diagram components: OpenStack C-plane; Hypervisor Emulator; Emulator Runtime (Kubernetes) hosting multiple Hypervisor Emulation Containers, each with neutron-fake-agent and nova-compute fake-driver; Testing tool
    Diagram labels: Need 2000HV, Need 5000HV, Spawn, Register emulated HV, Run Integration Test, Run Benchmark Test

  30. Upstreaming
    • oslo.metrics
      • https://specs.openstack.org/openstack/oslo-specs/specs/ussuri/oslo-metrics.html
      • https://review.opendev.org/730753
    • Integration with oslo.messaging
      • https://review.opendev.org/730755

  31. Related Presentation
    • How we used RabbitMQ in wrong way at scale
      • https://www.openstack.org/summit/shanghai-2019/summit-schedule/events/23983/how-we-used-rabbitmq-in-wrong-way-at-a-scale