
How we used RabbitMQ in wrong way at a scale


LINE Developers

November 04, 2019

Transcript

  1. How we used RabbitMQ in wrong way at a scale

    LINE Corp Yuki Nishiwaki Bhor Dinesh
  2. What’s this talk about • We haven’t seriously gotten along

    with RabbitMQ • We experienced a few outages because of that • Let us share what we faced and improved, one by one Takeaway • Tips for Operating RabbitMQ • Don’t make the same mistakes we made
  3. Who are We…. LINE?

  4. MAU: 164 Million 81 Million 44 Million 21 Million 18

    Million
  5. None
  6. Key of LINE Infrastructure 2,000 Cloud Users in 10 different

    locations Centralized Management Infrastructure Multiple Regions
  7. Scale of Our Whole Infrastructure 20,000 43,000 4,400 600 1,800

    2,500 840 720 ※ This information is from the infra team's configuration management system and counts only active servers
  8. 20,000 43,000 4,400 600 Other Solutions vs Private Cloud (incl. OpenStack)

    5,500 37,000 4,100 594 714 670 1,800 2,500 840 720 There are still many servers which are managed by other solutions (VMware, in-house automation, manual)
  9. Many Workloads moving to… Private Cloud 35,000 15,000 +20,000 VMs

    in a year
  10. OpenStack in LINE started in 2016 • Components • Versions:

    Mitaka + Customization • Scale • Developers: 4-6 Members
  11. OpenStack Deployment as of 2018/12 Cluster 2 Cluster 3 Same

    Architecture Region 3 Region 1 Cluster 1 Region 2 mysql mysql mysql mysql Regional Component Global Component Site Disaster Recovery 1 2
  12. RabbitMQ Deployment/Configuration as of 2018 Region 3 Region 1 Cluster

    1 Region 2 ha-mode:all rabbitmq_management rabbitmq_shovel notification_designate.info@regionN+1 => notification_designate.info@region1 rabbitmq_management Monitoring 1 2 3
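
    For reference, a blanket mirroring policy like the "ha-mode: all" shown above can be declared through the RabbitMQ management HTTP API. The sketch below only illustrates that setting; the endpoint, credentials and policy name are placeholders, not our actual configuration.

      # Sketch: declare the 2018-style "mirror everything" policy via the
      # management HTTP API (management plugin must be enabled).
      # Endpoint, credentials and policy name are placeholders.
      import requests

      MGMT = "http://rabbitmq-host:15672"
      AUTH = ("guest", "guest")

      policy = {
          "pattern": "^",                        # match every queue
          "definition": {"ha-mode": "all"},      # mirror queues to all cluster nodes
          "apply-to": "queues",
      }
      requests.put(f"{MGMT}/api/policies/%2F/ha-all",   # %2F = default vhost "/"
                   json=policy, auth=AUTH).raise_for_status()
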
  13. Where was wrong?

  14. Several Outages 1. 2019-05-16: stage-prod stopped working due to 'notifications_designate.info'

    queue param inconsistency 2. 2018-11-20: RabbitMQ adding dedicated statistics-gathering nodes 3. 2019-02-23: One of the RabbitMQ nodes crashed because of high memory usage (Verda-Prod-Tokyo) 4. OUTAGE 2019-03-19 <Whole Verda-dev> - <RabbitMQ has been broken> 5. Human error: OUTAGE 2019-05-16 <Whole Verda-Prod> - <RabbitMQ has been broken>
  15. None
  16. None
  17. None
  18. None
  19. Region 1 Region 2 Region 3 mysql Site Disaster Recovery

    RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@RabbitMQ-1
  20. Region 1 Region 2 Region 3 mysql Site Disaster Recovery

    RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@region1-data-nodes
  21. 1 2

  22. Current Architecture of RabbitMQ

  23. 1/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details ※omit details ※omit details Prepare dedicated RabbitMQ clusters for Nova and Neutron
  24. 1/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details ※omit details ※omit details
  25. 2/5 What our RabbitMQ looks like now: ha-mode: nodes ha-params:

    host1,host2,host3 ha-sync-mode: automatic queue-master-locator: min-masters rabbitmq_management_agent Region 2 rabbitmq_management Region 3 ※omit details ※omit details host1 host2 host3 host5 host4 Stats db Region 1 Data Nodes Management Nodes Plugins Plugins HA Configuration 1 3 2 Configure socket limit to tolerate 1 node failure 5 Only Data Nodes 4
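
    The HA settings listed above (ha-mode: nodes, explicit ha-params, automatic sync, min-masters) can be combined into a single policy. A minimal sketch against the management HTTP API follows; the node names, endpoint and credentials are placeholders.

      # Sketch: pin mirrors to the three data nodes and spread queue masters.
      # Node names, endpoint and credentials are placeholders.
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("admin", "secret")

      policy = {
          "pattern": "^",
          "definition": {
              "ha-mode": "nodes",
              "ha-params": ["rabbit@host1", "rabbit@host2", "rabbit@host3"],
              "ha-sync-mode": "automatic",
              "queue-master-locator": "min-masters",
          },
          "apply-to": "queues",
      }
      requests.put(f"{MGMT}/api/policies/%2F/ha-data-nodes",
                   json=policy, auth=AUTH).raise_for_status()
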
  26. 3/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 rabbitmq_shovel Plugins host1 host2 host3 Data Nodes host5 host4 Stats db Management Nodes notification_designate.info@regionN+1 => notification_designate.info@host1(region1) => notification_designate.info@host2(region1) => notification_designate.info@host3(region1)
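
    The shovel above forwards designate notifications from each region to the region-1 data nodes. A dynamic shovel with a list of destination URIs (used as failover candidates) could look like the sketch below; hosts, credentials and the parameter name are placeholders.

      # Sketch: dynamic shovel parameter declared via the management HTTP API
      # (rabbitmq_shovel plugin enabled). All hosts, credentials and names are
      # placeholders.
      import requests

      MGMT = "http://regionN-rabbit:15672"
      AUTH = ("admin", "secret")

      shovel = {
          "value": {
              "src-uri": "amqp://user:pass@localhost:5672",
              "src-queue": "notification_designate.info",
              "dest-uri": [                      # tried in order until one works
                  "amqp://user:pass@host1.region1:5672",
                  "amqp://user:pass@host2.region1:5672",
                  "amqp://user:pass@host3.region1:5672",
              ],
              "dest-queue": "notification_designate.info",
          }
      }
      requests.put(f"{MGMT}/api/parameters/shovel/%2F/designate-to-region1",
                   json=shovel, auth=AUTH).raise_for_status()
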
  27. 4/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ^notifications_designate.error$ message-ttl:1 expires:10 max-length:1 ^guestagent.* expires:7200000 ※omit details Policies
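
    The two policies above can be expressed as follows; the endpoint and credentials are placeholders, while the patterns and values mirror the slide.

      # Sketch: drop designate error notifications immediately and expire idle
      # guestagent queues after 2 hours (7,200,000 ms). Endpoint/credentials
      # are placeholders.
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("admin", "secret")

      policies = {
          "designate-error": {
              "pattern": "^notifications_designate.error$",
              "definition": {"message-ttl": 1, "expires": 10, "max-length": 1},
              "apply-to": "queues",
          },
          "guestagent-expire": {
              "pattern": "^guestagent.*",
              "definition": {"expires": 7200000},
              "apply-to": "queues",
          },
      }
      for name, body in policies.items():
          requests.put(f"{MGMT}/api/policies/%2F/{name}",
                       json=body, auth=AUTH).raise_for_status()
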
  28. 5/5 What our RabbitMQ looks like now: Region 2 Region

    3 ※omit details ※omit details Region 1 ※omit details Automation Script Detect Non-Consumer Fanout Queues => It means “XXXX-Agent may have been restarted” Trigger Delete
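
    The automation script is roughly this idea: list queues through the management HTTP API, pick fanout queues that have no consumer, and delete them. The sketch below is a simplified illustration, not our production script; endpoint and credentials are placeholders and a real version needs allow-listing and dry-run safety.

      # Sketch: delete fanout queues left behind by restarted agents.
      import urllib.parse
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("admin", "secret")

      for q in requests.get(f"{MGMT}/api/queues", auth=AUTH).json():
          if "_fanout_" in q["name"] and q.get("consumers", 0) == 0:
              vhost = urllib.parse.quote(q["vhost"], safe="")
              name = urllib.parse.quote(q["name"], safe="")
              print(f"deleting stale fanout queue {q['name']} "
                    f"({q.get('messages', 0)} messages left)")
              requests.delete(f"{MGMT}/api/queues/{vhost}/{name}",
                              auth=AUTH).raise_for_status()
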
  29. oslo.messaging backport into Mitaka

  30. Compute Node Agents exchange XXX_fanout_<random uuid> Routing Key: YYY Controller

    Node API XXX_fanout_<random uuid> Declare and Consume Backport important oslo.messaging patch to Mitaka 1. Delete fanout queues on graceful shutdown Compute Node Controller Node Agents API exchange XXX_fanout_<random uuid> Routing Key: YYY Declare and Consume message message message message message message message message message Non-Consumer Fanout Queue - Keeps getting messages from the exchange - Many messages left on the RabbitMQ cluster message
  31. Compute Node Agents exchange XXX_fanout_<random uuid> Routing Key: YYY Controller

    Node API XXX_fanout_<random uuid> Backport important oslo.messaging patch to Mitaka 1. Delete fanout queues on graceful shutdown Delete when the process is killed gracefully No more “Non-Consumer Fanout Queue” 2016/07/26 merged into master
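
    The idea of the backport ("delete fanout queues on graceful shutdown") can be illustrated with a plain pika consumer. This is a minimal sketch, not the actual oslo.messaging patch, and all names are placeholders.

      # Sketch: a consumer that removes its own fanout queue when it is
      # terminated gracefully, so no consumer-less queue keeps filling up.
      import signal
      import uuid
      import pika

      conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
      ch = conn.channel()

      queue_name = f"demo_fanout_{uuid.uuid4().hex}"   # mimics XXX_fanout_<random uuid>
      ch.exchange_declare(exchange="demo_fanout", exchange_type="fanout")
      ch.queue_declare(queue=queue_name)
      ch.queue_bind(queue=queue_name, exchange="demo_fanout")

      def request_stop(signum, frame):
          raise SystemExit(0)                          # break out of start_consuming()

      signal.signal(signal.SIGTERM, request_stop)
      ch.basic_consume(queue=queue_name, auto_ack=True,
                       on_message_callback=lambda c, m, p, b: None)
      try:
          ch.start_consuming()
      finally:
          ch.queue_delete(queue=queue_name)            # cleanup on graceful shutdown
          conn.close()
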
  32. method: “XXX” _reply_q: reply ... Backport important oslo.messaging patch to

    Mitaka 2. Use default exchange for direct messaging Compute Node Agents Controller Node API When Perform RPC execution Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> Find reply queue in Message message
  33. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> message Controller Node API Failed to send reply when a node got broken Sometimes exchange migration failed reply_<random uuid> Exchange Missing
  34. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master
  35. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master From: https://www.rabbitmq.com/tutorials/amqp-concepts.html Default Exchange The default exchange is a direct exchange with no name (empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key which is the same as the queue name.
  36. Backport important oslo.messaging patch to Mitaka 2. Use default exchange

    for direct messaging Compute Node Controller Node Agents API reply_<random uuid> reply_<random uuid> Routing Key: reply_<random uuid> default message 2018/11/02 merged into master Even if the exchange is gone, RPC replies are correctly delivered
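
    The default-exchange property quoted above is easy to demonstrate with pika: publishing to the nameless exchange ("") with the queue name as routing key reaches the queue without any extra exchange, which is why the reply still arrives even if a dedicated reply exchange has disappeared. A minimal sketch (host and queue name are placeholders):

      # Sketch: deliver a "reply" through the default exchange only.
      import pika

      conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
      ch = conn.channel()

      reply_queue = "reply_demo"                 # stands in for reply_<random uuid>
      ch.queue_declare(queue=reply_queue)

      # No exchange_declare needed: every queue is bound to the default
      # exchange under its own name.
      ch.basic_publish(exchange="", routing_key=reply_queue,
                       body=b"rpc reply payload")

      method, props, body = ch.basic_get(queue=reply_queue, auto_ack=True)
      print(body)                                # b'rpc reply payload'
      conn.close()
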
  37. Monitoring of RabbitMQ and oslo.messaging

  38. Monitoring: RabbitMQ • Use https://github.com/kbudde/rabbitmq_exporter • Check the following ◦ Network

    Partition Check ◦ Queue Synchronization for each Queue ◦ Number of Consumers for each Queue ◦ Number of Messages Left ◦ Number of Connections in RabbitMQ vs Number of Connections in the OS ◦ Number of Exchanges ◦ Number of Queue Bindings for each Exchange ◦ idle_since for each Queue (not yet) ◦ Average Message Stay Time (not yet)
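
    Most of these numbers come straight from the management HTTP API, which is also what rabbitmq_exporter scrapes. As an illustration, two of the checks (network partitions, consumer/message counts) could be done directly like this; the endpoint and credentials are placeholders.

      # Sketch: partition check and "messages without consumer" check.
      import requests

      MGMT = "http://host1:15672"
      AUTH = ("monitor", "secret")

      # 1. Network partition check: any node listing peers in "partitions" is bad.
      for node in requests.get(f"{MGMT}/api/nodes", auth=AUTH).json():
          if node.get("partitions"):
              print(f"ALERT: {node['name']} partitioned from {node['partitions']}")

      # 2. Consumers / messages left for each queue.
      for q in requests.get(f"{MGMT}/api/queues", auth=AUTH).json():
          if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0:
              print(f"WARN: {q['name']} has {q['messages']} messages and no consumer")
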
  39. There are 2 types of Monitoring for RPC / Notification: From

    the Message Broker Point of View (“RabbitMQ works well”) and From the OpenStack Point of View (“RPC / Notification works fine”)
  40. Why we need oslo.messaging Monitoring method: “XXX” _reply_q: reply ...

    RPC Client RPC Server There are a bunch of reasons for “RPC Reply Timeout” - RabbitMQ Message Lost - RabbitMQ Message Delayed in Delivery - RPC Server got an exception - RPC Server took a long time to perform the RPC None of RabbitMQ’s business
  41. Experimenting “Metrics Support” in oslo.messaging Controller Nodes Compute Nodes oslo.metrics

    /var/run/metrics.sock nova-compute neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched
  42. Experimenting “Metrics Support” in oslo.messaging Controller Nodes Compute Nodes oslo.metrics

    /var/run/metrics.sock nova-compute neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched • RPC Server ◦ rpc_server_count_for_exception ◦ rpc_server_count_for_invocation_start ◦ rpc_server_count_for_invocation_finish ◦ rpc_server_processing_time • RPC Client ◦ rpc_client_count_for_exception ◦ rpc_client_count_for_invocation_start ◦ rpc_client_count_for_invocation_finish ◦ rpc_client_processing_time
  43. Sample of Metric Controller Nodes Compute Nodes oslo.metrics /var/run/metrics.sock nova-compute

    neutron-agent nova-conductor nova-api oslo.metrics /var/run/metrics.sock ・ ・ New New oslo.messaging oslo.messaging ・ Patched Patched rpc_server_invocation_start_count_total{ endpoint="ConductorManager", method="object_class_action_versions", namespace="None", version="3.0", exchange="None", server="host1", topic="conductor" } 155.0
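
    For illustration only, a counter like the sample above could be exposed with prometheus_client as sketched below. This is not the actual oslo.metrics code and it skips the unix-socket transport shown in the diagram; metric and label names follow the sample, everything else is an assumption.

      # Sketch: expose an RPC-server invocation counter in Prometheus format.
      from prometheus_client import Counter, start_http_server

      # prometheus_client appends "_total" on exposition, matching the sample.
      rpc_server_invocation_start_count = Counter(
          "rpc_server_invocation_start_count",
          "Number of RPC server invocations started",
          ["endpoint", "method", "namespace", "version", "exchange", "server", "topic"],
      )

      def on_rpc_invocation_start(endpoint, method, server, topic):
          # Hypothetical hook, called when an RPC method starts executing.
          rpc_server_invocation_start_count.labels(
              endpoint=endpoint, method=method, namespace="None",
              version="3.0", exchange="None", server=server, topic=topic,
          ).inc()

      if __name__ == "__main__":
          start_http_server(9090)                # placeholder port for /metrics
          on_rpc_invocation_start("ConductorManager",
                                  "object_class_action_versions",
                                  "host1", "conductor")
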
  44. What’s Next around RabbitMQ? • Production-Ready oslo.messaging Monitoring • RabbitMQ Disaster

    Testing • Benchmark Emulation & RPC Tuning