
How we used RabbitMQ in wrong way at a scale

LINE Developers

November 04, 2019

Transcript

  1. How we used RabbitMQ in wrong way at a scale

    LINE Corp Yuki Nishiwaki Bhor Dinesh
  2. What’s this talk about • We hadn’t seriously gotten to grips

    with RabbitMQ • We experienced a few outages because of that • Let us share what we faced and improved, one by one. Take away • Tips for operating RabbitMQ • Don’t make the same mistakes we made
  3. Key of LINE Infrastructure 2,000 cloud users in 10 different

    locations • Centralized management infrastructure • Multiple regions
  4. Scale of Our Whole Infrastructure 20,000 43,000 4,400 600 1,800

    2,500 840 720 ※ This information comes from the infra team’s configuration management system and counts only active servers
  5. 20,000 43,000 4,400 600 Other Solutions vs Private Cloud (incl. OpenStack)

    5,500 37,000 4,100 594 714 670 1,800 2,500 840 720 There are still many servers managed by other solutions (VMware, in-house automation, manual)
  6. OpenStack in LINE started in 2016 Components ※omit details

    Versions Mitaka + Customization Scale Developers 4-6 Members
  7. OpenStack Deployment as of 2018/12 Cluster 2 Cluster 3 Same

    Architecture Region 3 Region 1 Cluster 1 Region 2 mysql mysql mysql mysql Regional Component Global Component Site Disaster Recovery 1 2
  8. RabbitMQ Deployment/Configuration as of 2018 Region 3 Region 1 Cluster

    1 Region 2 ha-mode:all rabbitmq_management rabbitmq_shovel notification_designate.info@regionN+1 => notification_designate.info@region1 rabbitmq_management Monitoring 1 2 3
  9. Several Outages 1. 2019-05-16: stage-prod stopped working due to 'notifications_designate.info'

    queue param inconsistency 2. 2018-11-20: RabbitMQ adding dedicated statistics-gathering nodes 3. 2019-02-23: One RabbitMQ node crashed because of high memory usage (Verda-Prod-Tokyo) 4. OUTAGE2019-03-19 <Whole Verda-dev> - <RabbitMQ has been broken> 5. Human error: OUTAGE2019-05-16 <Whole Verda-Prod> - <RabbitMQ has been broken>
  10. Region 1 Region 2 Region 3 mysql Site Disaster Recovery

    RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@RabbitMQ-1
  11. Region 1 Region 2 Region 3 mysql Site Disaster Recovery

    RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@region1-data-nodes
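
For reference, a shovel like the one above can be defined as a dynamic shovel through the management HTTP API. This is only a sketch: the endpoint, credentials and URIs below are placeholders, not our actual configuration.

```python
import requests

# Hypothetical endpoint/credentials for illustration only.
MGMT_API = "http://region2-rabbitmq:15672/api"
AUTH = ("guest", "guest")

# Dynamic shovel: forward designate notifications from the local (source)
# cluster to one of the region 1 data nodes (destination).
shovel = {
    "value": {
        "src-uri": "amqp://",                               # local cluster
        "src-queue": "notification_designate.info",
        "dest-uri": "amqp://openstack:secret@region1-host1:5672",
        "dest-queue": "notification_designate.info",
        "ack-mode": "on-confirm",
    }
}

# Shovels are runtime parameters; %2F is the URL-encoded default vhost "/".
resp = requests.put(
    f"{MGMT_API}/parameters/shovel/%2F/designate-notifications-to-region1",
    json=shovel,
    auth=AUTH,
)
resp.raise_for_status()
```
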
  12. 1 2

  13. ⅕ What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details ※omit details ※omit details Prepare dedicated RabbitMQ clusters for Nova and Neutron
  14. ⅕ What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details ※omit details ※omit details
  15. ⅖ What our RabbitMQ looks like now: ha-mode: nodes ha-params:

    host1,host2,host3 ha-sync-mode: automatic queue-master-locator: min-masters rabbitmq_management_agent Region 2 rabbitmq_management Region 3 ※omit details ※omit details host1 host2 host3 host5 host4 Stats db Region 1 Data Nodes Management Nodes Plugins Plugins HA Configuration 1 3 2 Configure socket limit To tolerate 1 node failure 5 Only Data Nodes 4
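
The HA settings on this slide translate into a RabbitMQ policy. Below is a minimal sketch using the management HTTP API; the host, credentials and queue pattern are assumptions, and node names usually take the rabbit@<host> form rather than the bare host names shown on the slide.

```python
import requests

# Hypothetical host/credentials; the definition keys mirror the slide above.
MGMT_API = "http://host1:15672/api"
AUTH = ("admin", "secret")

policy = {
    "pattern": ".*",                      # assumed: apply to every queue
    "apply-to": "queues",
    "priority": 0,
    "definition": {
        "ha-mode": "nodes",               # mirror only on the listed data nodes
        "ha-params": ["rabbit@host1", "rabbit@host2", "rabbit@host3"],
        "ha-sync-mode": "automatic",
        "queue-master-locator": "min-masters",
    },
}

resp = requests.put(f"{MGMT_API}/policies/%2F/ha-data-nodes", json=policy, auth=AUTH)
resp.raise_for_status()
```
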
  16. ⅗ What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 rabbitmq_shovel Plugins host1 host2 host3 Data Nodes host5 host4 Stats db Management Nodes notification_designate.info@regionN+1 => notification_designate.info@host1(region1) => notification_designate.info@host2(region1) => notification_designate.info@host3(region1)
  17. ⅘ What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ^notifications_designate.error$ message-ttl:1 expires:10 max-length:1 ^guestagent.* expires:7200000 ※omit details Policies
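
The two policies on this slide can be written as policy definitions and applied the same way as the HA policy sketch earlier (PUT /api/policies/<vhost>/<name>). The values come from the slide; the policy names are placeholders.

```python
# Drop designate error notifications almost immediately: 1 ms message TTL,
# queue auto-expires after 10 ms of being unused, at most one message kept.
error_policy = {
    "pattern": "^notifications_designate.error$",
    "apply-to": "queues",
    "definition": {"message-ttl": 1, "expires": 10, "max-length": 1},
}

# Expire unused guestagent queues after 2 hours (7,200,000 ms).
guestagent_policy = {
    "pattern": "^guestagent.*",
    "apply-to": "queues",
    "definition": {"expires": 7200000},
}
```
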
  18. 5/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details Automation Script Detect Non-Consumer & Fanout Queue => It means “XXXX-Agent may have been restarted” Trigger Delete
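
A minimal sketch of such an automation script, assuming the management HTTP API with placeholder endpoint and credentials: it looks for fanout queues with no consumer and deletes them.

```python
import requests
from urllib.parse import quote

# Hypothetical endpoint/credentials. The logic follows the slide: a queue whose
# name contains "_fanout_" but has no consumers usually belongs to an agent
# that was restarted, so it only accumulates messages and can be deleted.
MGMT_API = "http://host1:15672/api"
AUTH = ("admin", "secret")

queues = requests.get(f"{MGMT_API}/queues", auth=AUTH).json()

for q in queues:
    if "_fanout_" in q["name"] and q.get("consumers", 0) == 0:
        vhost = quote(q["vhost"], safe="")
        name = quote(q["name"], safe="")
        # No if-empty/if-unused guard on purpose: the whole point is to drop
        # the orphaned queue together with its backlog.
        requests.delete(f"{MGMT_API}/queues/{vhost}/{name}", auth=AUTH)
```
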
  19. Compute Node Agents exchange XXX_fanout_<random uuid> Routing Key: YYY Controller

    Node API XXX_fanout_<random uuid> Declare and Consume Backport important oslo.messaging patch to Mitaka 1. Delete fanout queues on graceful shutdown Compute Node Controller Node Agents API exchange XXX_fanout_<random uuid> Routing Key: YYY Declare and Consume message message message message message message message message message Non-Consumer Fanout Queue - Keeps getting messages from the exchange - Many messages left on the RabbitMQ cluster message
  20. Compute Node Agents exchange XXX_fanout_<random uuid> Routing Key: YYY Controller

    Node API XXX_fanout_<random uuid> Backport important oslo.messaging patch to Mitaka 1. Delete fanout queues on graceful shutdown Deleted when the process is killed gracefully No more “Non-Consumer Fanout Queue” 2016/07/26 merged into master
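
This is not the oslo.messaging patch itself, but a small pika sketch of the idea it implements: the per-agent fanout queue is deleted when the consumer shuts down gracefully, so no non-consumer fanout queue is left behind. Exchange and queue names are placeholders mirroring the slide.

```python
import signal
import uuid

import pika

# Hypothetical names; oslo.messaging generates its own XXX_fanout_<uuid> queues.
EXCHANGE = "XXX_fanout"
QUEUE = f"XXX_fanout_{uuid.uuid4().hex}"

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange=EXCHANGE, exchange_type="fanout")
channel.queue_declare(queue=QUEUE)
channel.queue_bind(queue=QUEUE, exchange=EXCHANGE)

# On SIGTERM, stop consuming so the cleanup below runs.
signal.signal(signal.SIGTERM, lambda signum, frame: channel.stop_consuming())

channel.basic_consume(
    queue=QUEUE,
    on_message_callback=lambda ch, method, props, body: ch.basic_ack(method.delivery_tag),
)
try:
    channel.start_consuming()
finally:
    # The essence of the backported fix: delete the per-agent fanout queue on
    # graceful shutdown so it cannot keep accumulating messages with no consumer.
    channel.queue_delete(queue=QUEUE)
    connection.close()
```
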
  21. method: “XXX” _reply_q: reply ... Backport important oslo.messaging patch to

    Mitaka 2. Use default exchange for direct messaging Compute Node Agents Controller Node API When performing RPC execution Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> Find reply queue in message message
  22. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> message Controller Node API Failed to send reply when a node got broken: sometimes exchange migration failed reply_<random uuid> Exchange Missing
  23. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master
  24. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master From: https://www.rabbitmq.com/tutorials/amqp-concepts.html Default Exchange The default exchange is a direct exchange with no name (empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key which is the same as the queue name.
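
A minimal pika sketch of the property quoted above: publishing to the nameless default exchange with the routing key set to the queue name delivers the message without declaring or binding any custom exchange. The queue name mimics oslo.messaging’s reply_<uuid> but is otherwise arbitrary.

```python
import uuid

import pika

reply_queue = f"reply_{uuid.uuid4().hex}"   # stands in for oslo.messaging's reply_<uuid>

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=reply_queue)

# "RPC server" side: send the reply through the default exchange ("").
channel.basic_publish(exchange="", routing_key=reply_queue, body=b"rpc reply payload")

# "RPC client" side: the reply arrives even though no dedicated reply exchange exists.
method, properties, body = channel.basic_get(queue=reply_queue, auto_ack=True)
print(body)

connection.close()
```
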
  25. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master Even if the exchange is gone, RPC replies are correctly delivered
  26. Monitoring: RabbitMQ • Use https://github.com/kbudde/rabbitmq_exporter • Check the following ◦ Network

    Partition Check ◦ Queue Synchronization for each Queue ◦ Number of Consumers for each Queue ◦ Number of Messages Left ◦ Number of Connections in RabbitMQ vs Number of Connections in the OS ◦ Number of Exchanges ◦ Number of Queue Bindings for each Exchange ◦ idle_since for each Queue (not yet) ◦ Average Message Stay Time (not yet)
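
Two of the checks in the list above, sketched directly against the management HTTP API that rabbitmq_exporter also scrapes; endpoint and credentials are placeholders.

```python
import requests

# Hypothetical endpoint/credentials for a management node.
MGMT_API = "http://host1:15672/api"
AUTH = ("monitor", "secret")

# Network partition check: any non-empty "partitions" list means trouble.
for node in requests.get(f"{MGMT_API}/nodes", auth=AUTH).json():
    if node.get("partitions"):
        print(f"ALERT: {node['name']} sees partitions: {node['partitions']}")

# Consumers and messages left, per queue: backlog with no consumer is suspicious.
for q in requests.get(f"{MGMT_API}/queues", auth=AUTH).json():
    if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0:
        print(f"ALERT: {q['name']} has {q['messages']} messages and no consumer")
```
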
  27. There are 2 types of Monitoring for RPC, Notification From

    Message Broker Point of View (“RabbitMQ works well”) From OpenStack Point of View (“RPC, Notification work fine”)
  28. Why we need oslo.messaging Monitoring method: “XXX” _reply_q: reply ...

    RPC Client RPC Server There are a bunch of reasons for “RPC Reply Timeout” - RabbitMQ message lost - RabbitMQ message delayed in delivery - RPC server got an exception - RPC server took a long time to perform the RPC None of these can be seen from RabbitMQ
  29. Experimenting “Metrics Support” in oslo.messaging Controller Nodes Compute Nodes oslo.metrics

    /var/run/metrics.sock nova-compute neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched
  30. Experimenting “Metrics Support” in oslo.messaging Controller Nodes Compute Nodes oslo.metrics

    /var/run/metrics.sock nova-compute neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched • RPC Server ◦ rpc_server_count_for_exception ◦ rpc_server_count_for_invocation_start ◦ rpc_server_count_for_invocation_finish ◦ rpc_server_processing_time • RPC Client ◦ rpc_client_count_for_exception ◦ rpc_client_count_for_invocation_start ◦ rpc_client_count_for_invocation_finish ◦ rpc_client_processing_time
  31. Sample of Metric Controller Nodes Compute Nodes oslo.metrics /var/run/metrics.sock nova-compute

    neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched rpc_server_invocation_start_count_total { endpoint="ConductorManager", method="object_class_action_versions", namespace="None", version="3.0", exchange="None", server="host1", topic="conductor", } 155.0
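
A sketch of how a sample like the one above could be exposed with prometheus_client; the real oslo.metrics implementation is experimental and may collect and expose metrics differently. Label names and values come from the slide.

```python
from prometheus_client import Counter, start_http_server

# prometheus_client appends "_total" to counters, matching the sample name above.
RPC_SERVER_INVOCATION_START = Counter(
    "rpc_server_invocation_start_count",
    "RPC server invocations started",
    ["endpoint", "method", "namespace", "version", "exchange", "server", "topic"],
)

start_http_server(9000)   # expose /metrics for Prometheus to scrape (port assumed)

# Incremented each time the patched oslo.messaging reports an RPC invocation start.
RPC_SERVER_INVOCATION_START.labels(
    endpoint="ConductorManager",
    method="object_class_action_versions",
    namespace="None",
    version="3.0",
    exchange="None",
    server="host1",
    topic="conductor",
).inc()
```
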