with RabbitMQ
• We experienced a few outages because of that
• Let us share what we faced and what we improved, one by one
Take away
• Tips for operating RabbitMQ
• Don’t make the same mistakes we made
5500 37,000 4,100 594 714 670 1,800 2,500 840 720
Many servers are still managed by other solutions (VMware, in-house automation, manual).
queue param inconsistency
2. 2018-11-20: RabbitMQ: adding dedicated statistics-gathering nodes
3. 2019-02-23: One of the RabbitMQ nodes crashed because of high memory usage (Verda-Prod-Tokyo)
4. OUTAGE2019-03-19 <Whole Verda-dev> - <RabbitMQ has been broken>
5. Human error: OUTAGE2019-05-16 <Whole Verda-Prod> - <RabbitMQ has been broken>
[Diagram: regions, ※details omitted]
Automation Script
• Detect fanout queues that have no consumer => this means “XXXX-Agent may have been restarted”
• Trigger deletion of those queues
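The detection step of such an automation script could look roughly like the sketch below. It is not the script from the talk. It assumes the queue list has already been fetched from the RabbitMQ management API (`GET /api/queues`), whose entries carry `name`, `vhost` and `consumers` fields, and it assumes fanout queues can be recognised by `_fanout_` in their name, as in `XXX_fanout_<random uuid>`. The function names are hypothetical.

```python
from urllib.parse import quote


def find_orphan_fanout_queues(queues):
    """Return queues that look like fanout queues and have no consumer.

    `queues` is a list of dicts shaped like entries of GET /api/queues.
    Matching on "_fanout_" in the name is an assumption for this sketch.
    """
    return [
        q for q in queues
        if "_fanout_" in q["name"] and q.get("consumers", 0) == 0
    ]


def delete_path(queue):
    """Build the management API path used to delete one queue."""
    # vhost and queue name are percent-encoded, so "/" becomes "%2F".
    return "/api/queues/{}/{}".format(
        quote(queue["vhost"], safe=""), quote(queue["name"], safe="")
    )


if __name__ == "__main__":
    # Hand-written sample data instead of a live broker.
    sample = [
        {"name": "XXX_fanout_1234", "vhost": "/", "consumers": 0},
        {"name": "XXX_fanout_5678", "vhost": "/", "consumers": 1},
        {"name": "reply_abcd", "vhost": "/", "consumers": 0},
    ]
    for q in find_orphan_fanout_queues(sample):
        print("DELETE", delete_path(q))
```

A real script would send an HTTP `DELETE` to that path with management credentials, and would probably re-check the consumer count right before deleting, since an agent may reconnect in between.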
[Diagram: Node, API, queue XXX_fanout_<random uuid>]
Backport important oslo.messaging patches to Mitaka
1. Delete fanout queues on graceful shutdown
• Fanout queues are deleted when the process is killed gracefully
• No more “Non Consumer Fanout Queue”
• 2016/07/26 merged into master
for direct messaging
[Diagram: Compute Node, Controller Node, Agents, API; queue reply_<random uuid>; Routing Key: reply_<random uuid>; default; message]
2018/11/02 merged into master
From: https://www.rabbitmq.com/tutorials/amqp-concepts.html
Default Exchange
The default exchange is a direct exchange with no name (empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key which is the same as the queue name.
Even if the exchange is gone, RPC replies are delivered correctly.
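The effect of routing replies through the default exchange can be illustrated with a toy in-memory model. This is only a sketch of the routing rule quoted from the RabbitMQ documentation, not RabbitMQ or oslo.messaging code. The `ToyBroker` class and its methods are hypothetical, and publishing to a missing exchange simply returns `False` here, whereas a real broker would raise a channel error.

```python
class ToyBroker:
    """Minimal model of direct routing, for illustration only."""

    def __init__(self):
        self.queues = {}     # queue name -> list of delivered messages
        self.exchanges = {}  # exchange name -> {routing key -> set of queues}

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def declare_exchange(self, name):
        self.exchanges.setdefault(name, {})

    def bind(self, exchange, routing_key, queue):
        self.exchanges[exchange].setdefault(routing_key, set()).add(queue)

    def delete_exchange(self, name):
        self.exchanges.pop(name, None)

    def publish(self, exchange, routing_key, message):
        """Route one message; return True if it reached at least one queue."""
        if exchange == "":
            # Default exchange: every queue is implicitly bound to it with
            # a routing key equal to the queue name.
            targets = {routing_key} if routing_key in self.queues else set()
        else:
            targets = self.exchanges.get(exchange, {}).get(routing_key, set())
        for queue in targets:
            self.queues[queue].append(message)
        return bool(targets)


if __name__ == "__main__":
    broker = ToyBroker()
    broker.declare_queue("reply_1234")
    broker.declare_exchange("reply_1234")
    broker.bind("reply_1234", "reply_1234", "reply_1234")

    broker.delete_exchange("reply_1234")  # the named exchange disappears
    print(broker.publish("reply_1234", "reply_1234", "result"))  # not routed
    print(broker.publish("", "reply_1234", "result"))            # delivered
```

The named exchange is an extra object that can go missing independently of the queue. The default exchange's implicit binding lasts as long as the reply queue itself exists.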
Partition Check
◦ Queue synchronization for each queue
◦ Number of consumers for each queue
◦ Number of messages left
◦ Number of connections in RabbitMQ vs number of connections in the OS
◦ Number of exchanges
◦ Number of queue bindings for each exchange
◦ idle_since for each queue (not yet)
◦ Average message stay time (not yet)
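A few of the per-queue checks above can be sketched as a small function over one queue entry from the management API (`GET /api/queues`). This is an illustrative sketch, not the monitoring used in the talk. It assumes the `consumers` and `messages` fields, and, for classic mirrored queues, the `slave_nodes` and `synchronised_slave_nodes` fields. The `max_messages` threshold and the function name are hypothetical.

```python
def check_queue(queue, max_messages=1000):
    """Return a list of human-readable problems for one queue entry."""
    problems = []

    # Number of consumers for each queue.
    if queue.get("consumers", 0) == 0:
        problems.append("no consumer")

    # Number of messages left (threshold is an arbitrary example value).
    if queue.get("messages", 0) > max_messages:
        problems.append("messages left: {}".format(queue["messages"]))

    # Queue synchronization: mirrors that are not yet synchronised.
    # These field names are assumed to be present for mirrored queues.
    mirrors = set(queue.get("slave_nodes", []))
    synced = set(queue.get("synchronised_slave_nodes", []))
    if mirrors - synced:
        problems.append(
            "unsynchronised mirrors: {}".format(sorted(mirrors - synced))
        )

    return problems


if __name__ == "__main__":
    sample = {
        "name": "q1",
        "consumers": 0,
        "messages": 5,
        "slave_nodes": ["rabbit@b", "rabbit@c"],
        "synchronised_slave_nodes": ["rabbit@b"],
    }
    for problem in check_queue(sample):
        print(sample["name"], problem)
```

Checks that compare two sources, such as connections seen by RabbitMQ versus connections seen by the OS, need data from outside the management API and are left out of this sketch.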
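[Diagram: RPC Client, RPC Server]
There are a bunch of possible reasons for “RPC Reply Timeout”:
- RabbitMQ message lost
- RabbitMQ message delivery delayed
- RPC server raised an exception
- RPC server took a long time to perform the RPC
Some of these are none of RabbitMQ’s business (for example, an exception or slow processing on the RPC server).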