
How we used RabbitMQ in wrong way at a scale


LINE Developers

November 04, 2019

Transcript

  1. How we used RabbitMQ in wrong way at a scale

    LINE Corp Yuki Nishiwaki Bhor Dinesh
  2. What’s this talk about • We haven’t seriously gotten along

    with RabbitMQ • We experienced a few outages because of that • Let us share what we faced and improved, one by one Takeaway • Tips for Operating RabbitMQ • Don’t make the same mistakes we made
  3. Who are We…. LINE?

  4. MAU: 164 Million 81 Million 44 Million 21 Million 18

    Million
  5. None
  6. Key of LINE Infrastructure 2,000 Cloud Users in 10 different

    locations Centralized Management Infrastructure Multiple Regions
  7. Scale of Our Whole Infrastructure 20,000 43,000 4,400 600 1,800

    2,500 840 720 ※ This information is from the infra team's configuration management system and counts only active servers
  8. 20,000 43,000 4,400 600 Other Solutions vs Private Cloud (incl. OpenStack)

    5,500 37,000 4,100 594 714 670 1,800 2,500 840 720 There are still many servers which are managed by other solutions (VMware, in-house automation, manual)
  9. Many Workloads moving to… Private Cloud 35,000 15,000 +20,000 VMs

    in a year
  10. OpenStack in LINE started in 2016 • Components • Versions:

    Mitaka + Customization • Scale • Developers: 4-6 Members
  11. OpenStack Deployment as of 2018/12 Cluster 2 Cluster 3 Same

    Architecture Region 3 Region 1 Cluster 1 Region 2 mysql mysql mysql mysql Regional Component Global Component Site Disaster Recovery 1 2
  12. RabbitMQ Deployment/Configuration as of 2018 Region 3 Region 1 Cluster

    1 Region 2 ha-mode:all rabbitmq_management rabbitmq_shovel notification_designate.info@regionN+1 => notification_designate.info@region1 rabbitmq_management Monitoring 1 2 3
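
    For reference, a blanket mirroring policy like the "ha-mode: all" shown above can be declared through the RabbitMQ management HTTP API. The sketch below only illustrates that setting; the endpoint, credentials and policy name are placeholders, not our actual configuration.

      # Sketch: declare the 2018-style "mirror everything" policy via the
      # management HTTP API (management plugin must be enabled).
      # Endpoint, credentials and policy name are placeholders.
      import requests

      MGMT = "http://rabbitmq-host:15672"
      AUTH = ("guest", "guest")

      policy = {
          "pattern": "^",                        # match every queue
          "definition": {"ha-mode": "all"},      # mirror queues to all cluster nodes
          "apply-to": "queues",
      }
      requests.put(f"{MGMT}/api/policies/%2F/ha-all",   # %2F = default vhost "/"
                   json=policy, auth=AUTH).raise_for_status()
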
  13. Where was wrong?

  14. Several Outages 1. 2019-05-16: stage-prod stopped working due to 'notifications_designate.info'

    queue param inconsistency 2. 2018-11-20: RabbitMQ adding dedicated statistics-gathering nodes 3. 2019-02-23: One of the RabbitMQ nodes crashed because of high memory usage (Verda-Prod-Tokyo) 4. OUTAGE 2019-03-19 <Whole Verda-dev> - <RabbitMQ has been broken> 5. Human error: OUTAGE 2019-05-16 <Whole Verda-Prod> - <RabbitMQ has been broken>
  15. None
  16. None
  17. None
  18. None
  19. Region 1 Region 2 Region 3 mysql Site Disaster Recovery

    RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@RabbitMQ-1
  20. Region 1 Region 2 Region 3 mysql Site Disaster Recovery

    RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@region1-data-nodes
  21. 1 2

  22. Current Architecture of RabbitMQ

  23. 1/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details ※omit details ※omit details Prepare dedicated RabbitMQ clusters for Nova and Neutron
  24. 1/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details ※omit details ※omit details
  25. 2/5 What our RabbitMQ looks like now: ha-mode: nodes ha-params:

    host1,host2,host3 ha-sync-mode: automatic queue-master-locator: min-masters rabbitmq_management_agent Region 2 rabbitmq_management Region 3 ※omit details ※omit details host1 host2 host3 host5 host4 Stats db Region 1 Data Nodes Management Nodes Plugins Plugins HA Configuration 1 3 2 Configure socket limit to tolerate 1 node failure 5 Only Data Nodes 4
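
    The HA settings listed above (ha-mode: nodes, explicit ha-params, automatic sync, min-masters) can be combined into a single policy. A minimal sketch against the management HTTP API follows; the node names, endpoint and credentials are placeholders.

      # Sketch: pin mirrors to the three data nodes and spread queue masters.
      # Node names, endpoint and credentials are placeholders.
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("admin", "secret")

      policy = {
          "pattern": "^",
          "definition": {
              "ha-mode": "nodes",
              "ha-params": ["rabbit@host1", "rabbit@host2", "rabbit@host3"],
              "ha-sync-mode": "automatic",
              "queue-master-locator": "min-masters",
          },
          "apply-to": "queues",
      }
      requests.put(f"{MGMT}/api/policies/%2F/ha-data-nodes",
                   json=policy, auth=AUTH).raise_for_status()
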
  26. 3/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 rabbitmq_shovel Plugins host1 host2 host3 Data Nodes host5 host4 Stats db Management Nodes notification_designate.info@regionN+1 => notification_designate.info@host1(region1) => notification_designate.info@host2(region1) => notification_designate.info@host3(region1)
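
    The shovel above forwards designate notifications from each region to the region-1 data nodes. A dynamic shovel with a list of destination URIs (used as failover candidates) could look like the sketch below; hosts, credentials and the parameter name are placeholders.

      # Sketch: dynamic shovel parameter declared via the management HTTP API
      # (rabbitmq_shovel plugin enabled). All hosts, credentials and names are
      # placeholders.
      import requests

      MGMT = "http://regionN-rabbit:15672"
      AUTH = ("admin", "secret")

      shovel = {
          "value": {
              "src-uri": "amqp://user:pass@localhost:5672",
              "src-queue": "notification_designate.info",
              "dest-uri": [                      # tried in order until one works
                  "amqp://user:pass@host1.region1:5672",
                  "amqp://user:pass@host2.region1:5672",
                  "amqp://user:pass@host3.region1:5672",
              ],
              "dest-queue": "notification_designate.info",
          }
      }
      requests.put(f"{MGMT}/api/parameters/shovel/%2F/designate-to-region1",
                   json=shovel, auth=AUTH).raise_for_status()
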
  27. 4/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ^notifications_designate.error$ message-ttl:1 expires:10 max-length:1 ^guestagent.* expires:7200000 ※omit details Policies
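
    The two policies above can be expressed as follows; the endpoint and credentials are placeholders, while the patterns and values mirror the slide.

      # Sketch: drop designate error notifications immediately and expire idle
      # guestagent queues after 2 hours (7,200,000 ms). Endpoint/credentials
      # are placeholders.
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("admin", "secret")

      policies = {
          "designate-error": {
              "pattern": "^notifications_designate.error$",
              "definition": {"message-ttl": 1, "expires": 10, "max-length": 1},
              "apply-to": "queues",
          },
          "guestagent-expire": {
              "pattern": "^guestagent.*",
              "definition": {"expires": 7200000},
              "apply-to": "queues",
          },
      }
      for name, body in policies.items():
          requests.put(f"{MGMT}/api/policies/%2F/{name}",
                       json=body, auth=AUTH).raise_for_status()
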
  28. 5/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details Automation Script Detect Non-Consumer Fanout Queues => It means “XXXX-Agent may have been restarted” Trigger Delete
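
    The automation script is roughly this idea: list queues through the management HTTP API, pick fanout queues that have no consumer, and delete them. The sketch below is a simplified illustration, not our production script; endpoint and credentials are placeholders and a real version needs allow-listing and dry-run safety.

      # Sketch: delete fanout queues left behind by restarted agents.
      import urllib.parse
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("admin", "secret")

      for q in requests.get(f"{MGMT}/api/queues", auth=AUTH).json():
          if "_fanout_" in q["name"] and q.get("consumers", 0) == 0:
              vhost = urllib.parse.quote(q["vhost"], safe="")
              name = urllib.parse.quote(q["name"], safe="")
              print(f"deleting stale fanout queue {q['name']} "
                    f"({q.get('messages', 0)} messages left)")
              requests.delete(f"{MGMT}/api/queues/{vhost}/{name}",
                              auth=AUTH).raise_for_status()
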
  29. oslo.messaging backport into Mitaka

  30. Compute Node Agents exchange XXX_fanout_<random uuid> Routing Key: YYY Controller

    Node API XXX_fanout_<random uuid> Declare and Consume Backport important oslo.messaging patch to Mitaka 1. Delete fanout queues on graceful shutdown Compute Node Controller Node Agents API exchange XXX_fanout_<random uuid> Routing Key: YYY Declare and Consume message message message message message message message message message Non-Consumer Fanout Queue - Keeps getting messages from the exchange - Many messages left on the RabbitMQ cluster message
  31. Compute Node Agents exchange XXX_fanout_<random uuid> Routing Key: YYY Controller

    Node API XXX_fanout_<random uuid> Backport important oslo.messaging patch to Mitaka 1. Delete fanout queues on graceful shutdown Delete when the process is killed gracefully No more “Non-Consumer Fanout Queue” 2016/07/26 merged into master
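
    The idea of the backport ("delete fanout queues on graceful shutdown") can be illustrated with a plain pika consumer. This is a minimal sketch, not the actual oslo.messaging patch, and all names are placeholders.

      # Sketch: a consumer that removes its own fanout queue when it is
      # terminated gracefully, so no consumer-less queue keeps filling up.
      import signal
      import uuid
      import pika

      conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
      ch = conn.channel()

      queue_name = f"demo_fanout_{uuid.uuid4().hex}"   # mimics XXX_fanout_<random uuid>
      ch.exchange_declare(exchange="demo_fanout", exchange_type="fanout")
      ch.queue_declare(queue=queue_name)
      ch.queue_bind(queue=queue_name, exchange="demo_fanout")

      def request_stop(signum, frame):
          raise SystemExit(0)                          # break out of start_consuming()

      signal.signal(signal.SIGTERM, request_stop)
      ch.basic_consume(queue=queue_name, auto_ack=True,
                       on_message_callback=lambda c, m, p, b: None)
      try:
          ch.start_consuming()
      finally:
          ch.queue_delete(queue=queue_name)            # cleanup on graceful shutdown
          conn.close()
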
  32. method: “XXX” _reply_q: reply ... Backport important oslo.messaging patch to

    Mitaka 2. Use default exchange for direct messaging Compute Node Agents Controller Node API When Perform RPC execution Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> Find reply queue in Message message
  33. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> message Controller Node API Failed to send reply when a node got broken Sometimes exchange migration failed reply_<random uuid> Exchange Missing
  34. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master
  35. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master From: https://www.rabbitmq.com/tutorials/amqp-concepts.html Default Exchange The default exchange is a direct exchange with no name (empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key which is the same as the queue name.
  36. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master Even if the exchange is gone, RPC replies are correctly delivered
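
    The default-exchange property quoted above is easy to demonstrate with pika: publishing to the nameless exchange ("") with the queue name as routing key reaches the queue without any extra exchange, which is why the reply still arrives even if a dedicated reply exchange has disappeared. A minimal sketch (host and queue name are placeholders):

      # Sketch: deliver a "reply" through the default exchange only.
      import pika

      conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
      ch = conn.channel()

      reply_queue = "reply_demo"                 # stands in for reply_<random uuid>
      ch.queue_declare(queue=reply_queue)

      # No exchange_declare needed: every queue is bound to the default
      # exchange under its own name.
      ch.basic_publish(exchange="", routing_key=reply_queue,
                       body=b"rpc reply payload")

      method, props, body = ch.basic_get(queue=reply_queue, auto_ack=True)
      print(body)                                # b'rpc reply payload'
      conn.close()
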
  37. Monitoring of RabbitMQ and oslo.messaging

  38. Monitoring: RabbitMQ • Use https://github.com/kbudde/rabbitmq_exporter • Check the following ◦ Network

    Partition Check ◦ Queue Synchronization for each Queue ◦ Number of Consumers for each Queue ◦ Number of Messages Left ◦ Number of Connections in RabbitMQ vs Number of Connections in the OS ◦ Number of Exchanges ◦ Number of Queue Bindings for each Exchange ◦ idle_since for each Queue (not yet) ◦ Average Message Stay Time (not yet)
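
    Most of these numbers come straight from the management HTTP API, which is also what rabbitmq_exporter scrapes. As an illustration, two of the checks (network partitions, consumer/message counts) could be done directly like this; the endpoint and credentials are placeholders.

      # Sketch: partition check and "messages without consumer" check.
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("monitor", "secret")

      # 1. Network partition check: any node listing peers in "partitions" is bad.
      for node in requests.get(f"{MGMT}/api/nodes", auth=AUTH).json():
          if node.get("partitions"):
              print(f"ALERT: {node['name']} partitioned from {node['partitions']}")

      # 2. Consumers / messages left for each queue.
      for q in requests.get(f"{MGMT}/api/queues", auth=AUTH).json():
          if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0:
              print(f"WARN: {q['name']} has {q['messages']} messages and no consumer")
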
  39. There are 2 types of Monitoring for RPC / Notification: From

    the Message Broker Point of View (“RabbitMQ works well”) and From the OpenStack Point of View (“RPC / Notification works fine”)
  40. Why we need oslo.messaging Monitoring method: “XXX” _reply_q: reply ...

    RPC Client RPC Server There are a bunch of reasons for “RPC Reply Timeout” - RabbitMQ Message Lost - RabbitMQ Message Delayed in Delivery - RPC Server got an exception - RPC Server took a long time to perform the RPC None of RabbitMQ’s business
  41. Experimenting “Metrics Support” in oslo.messaging Controller Nodes Compute Nodes oslo.metrics

    /var/run/metrics.sock nova-compute neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched
  42. Experimenting “Metrics Support” in oslo.messaging Controller Nodes Compute Nodes oslo.metrics

    /var/run/metrics.sock nova-compute neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched • RPC Server ◦ rpc_server_count_for_exception ◦ rpc_server_count_for_invocation_start ◦ rpc_server_count_for_invocation_finish ◦ rpc_server_processing_time • RPC Client ◦ rpc_client_count_for_exception ◦ rpc_client_count_for_invocation_start ◦ rpc_client_count_for_invocation_finish ◦ rpc_client_processing_time
  43. Sample of Metric Controller Nodes Compute Nodes oslo.metrics /var/run/metrics.sock nova-compute

    neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched rpc_server_invocation_start_count_total{ endpoint="ConductorManager", method="object_class_action_versions", namespace="None", version="3.0", exchange="None", server="host1", topic="conductor" } 155.0
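
    For illustration only, a counter like the sample above could be exposed with prometheus_client as sketched below. This is not the actual oslo.metrics code and it skips the unix-socket transport shown in the diagram; metric and label names follow the sample, everything else is an assumption.

      # Sketch: expose an RPC-server invocation counter in Prometheus format.
      from prometheus_client import Counter, start_http_server

      # prometheus_client appends "_total" on exposition, matching the sample.
      rpc_server_invocation_start_count = Counter(
          "rpc_server_invocation_start_count",
          "Number of RPC server invocations started",
          ["endpoint", "method", "namespace", "version", "exchange", "server", "topic"],
      )

      def on_rpc_invocation_start(endpoint, method, server, topic):
          # Hypothetical hook, called when an RPC method starts executing.
          rpc_server_invocation_start_count.labels(
              endpoint=endpoint, method=method, namespace="None",
              version="3.0", exchange="None", server=server, topic=topic,
          ).inc()

      if __name__ == "__main__":
          start_http_server(9090)                # placeholder port for /metrics
          on_rpc_invocation_start("ConductorManager",
                                  "object_class_action_versions",
                                  "host1", "conductor")
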
  44. What’s Next around RabbitMQ? • Production-Ready oslo.messaging Monitoring • RabbitMQ Disaster

    Testing • Benchmark Emulation & RPC Tuning