Slide 1

How we used RabbitMQ in wrong way at a scale
LINE Corp
Yuki Nishiwaki, Bhor Dinesh

Slide 2

What’s this talk about
● We hadn’t seriously gotten to grips with RabbitMQ
● We experienced a few outages because of that
● Let us share what we faced and improved, one by one
Take away
● Tips for operating RabbitMQ
● Don’t make the same mistakes we made

Slide 3

Who are We…. LINE?

Slide 4

MAU: 164 Million 81 Million 44 Million 21 Million 18 Million

Slide 5

No content

Slide 6

Key of LINE Infrastructure
● 2,000 cloud users in 10 different locations
● Centralized management infrastructure
● Multiple regions

Slide 7

Scale of Our Whole Infrastructure
20,000 / 43,000 / 4,400 / 600 / 1,800 / 2,500 / 840 / 720
※ This information is from the infra team's configuration management system and counts only active servers

Slide 8

Other Solution vs Private Cloud (inc. OpenStack)
(Chart: 20,000 / 43,000 / 4,400 / 600 vs 5,500 / 37,000 / 4,100 / 594 / 714 / 670; 1,800 / 2,500 / 840 / 720)
There are still many servers which are managed by other solutions (VMware, in-house automation, manual)

Slide 9

Many workloads moving to… Private Cloud
From 15,000 to 35,000 VMs: +20,000 VMs in a year

Slide 10

OpenStack in LINE, started in 2016
● Components: ※omit details
● Versions: Mitaka + Customization
● Scale: ※omit details
● Developers: 4-6 members

Slide 11

OpenStack Deployment as of 2018/12
(Diagram: Cluster 1 / Cluster 2 / Cluster 3 share the same architecture across Region 1 / Region 2 / Region 3; regional components and global components, each backed by mysql; site disaster recovery)

Slide 12

RabbitMQ Deployment/Configuration as of 2018
(Diagram: Region 1 / Region 2 / Region 3, Cluster 1)
● ha-mode: all
● Plugins: rabbitmq_management, rabbitmq_shovel
● Shovel: notification_designate.info@regionN+1 => notification_designate.info@region1
● Monitoring via rabbitmq_management

Slide 13

Where did we go wrong?

Slide 14

Several Outages
1. 2019-05-16: stage-prod stopped working due to 'notifications_designate.info' queue param inconsistency
2. 2018-11-20: RabbitMQ adding dedicated statistics-gathering nodes
3. 2019-02-23: One RabbitMQ node crashed because of high memory usage (Verda-Prod-Tokyo)
4. OUTAGE 2019-03-19 -
5. Human error: OUTAGE 2019-05-16 -

Slide 15

No content

Slide 16

No content

Slide 17

No content

Slide 18

No content

Slide 19

(Diagram: Region 1 / Region 2 / Region 3, mysql, Site Disaster Recovery)
RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@RabbitMQ-1

Slide 20

(Diagram: Region 1 / Region 2 / Region 3, mysql, Site Disaster Recovery)
RabbitMQ-Shovel Policy: notification_designate.info@regionN+1 => notification_designate.info@region1-data-nodes

Slide 21

No content

Slide 22

Current Architecture of RabbitMQ

Slide 23

How our RabbitMQ looks now (1/5):
(Diagram: Region 1 / Region 2 / Region 3, ※omit details)
Prepare Nova- and Neutron-dedicated RabbitMQ clusters

Slide 24

How our RabbitMQ looks now (1/5):
(Diagram: Region 1 / Region 2 / Region 3, ※omit details)

Slide 25

How our RabbitMQ looks now (2/5):
(Diagram: Region 1, with Region 2 / Region 3 ※omit details)
● Data Nodes: host1, host2, host3 (plugin: rabbitmq_management_agent)
● Management Nodes: host4, host5, holding the stats db (plugin: rabbitmq_management)
● HA Configuration, data nodes only (see the sketch below): ha-mode: nodes / ha-params: host1,host2,host3 / ha-sync-mode: automatic / queue-master-locator: min-masters
● Configure socket limit to tolerate 1 node failure
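A minimal sketch of how an HA policy like the one above could be applied through the RabbitMQ management HTTP API; the endpoint, credentials and queue pattern are assumptions for illustration, not our exact values:

# Sketch: apply an "ha-mode: nodes" policy via the RabbitMQ management HTTP API.
# Host name, credentials and the queue pattern are illustrative assumptions.
import json
import requests

MGMT_URL = "http://host4:15672"          # a management node (assumption)
AUTH = ("guest", "guest")                # replace with real credentials
VHOST = "%2F"                            # default vhost, URL-encoded

policy = {
    "pattern": ".*",                     # which queues to mirror (assumption)
    "apply-to": "queues",
    "definition": {
        "ha-mode": "nodes",
        "ha-params": ["host1", "host2", "host3"],   # data nodes only
        "ha-sync-mode": "automatic",
        "queue-master-locator": "min-masters",
    },
}

resp = requests.put(
    f"{MGMT_URL}/api/policies/{VHOST}/ha-data-nodes",
    auth=AUTH,
    headers={"content-type": "application/json"},
    data=json.dumps(policy),
)
resp.raise_for_status()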

Slide 26

How our RabbitMQ looks now (3/5):
(Diagram: Region 1 data nodes host1-host3, management nodes host4-host5 with stats db; Region 2 / Region 3 ※omit details)
● Plugin: rabbitmq_shovel
● Shovel (see the sketch below): notification_designate.info@regionN+1
   => notification_designate.info@host1(region1)
   => notification_designate.info@host2(region1)
   => notification_designate.info@host3(region1)
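A hedged sketch of how such a dynamic shovel could be defined through the management HTTP API; the URIs, credentials and parameter name are assumptions:

# Sketch: create a dynamic shovel that forwards designate notifications from
# another region into the local region1 data nodes. URIs and names are assumptions.
import json
import requests

MGMT_URL = "http://host4:15672"      # management node in region1 (assumption)
AUTH = ("guest", "guest")
VHOST = "%2F"

shovel = {
    "value": {
        "src-uri": "amqp://user:pass@rabbitmq.region2:5672",   # region N+1 (assumption)
        "src-queue": "notification_designate.info",
        # a list of broker URIs gives failover across the region1 data nodes
        "dest-uri": [
            "amqp://user:pass@host1:5672",
            "amqp://user:pass@host2:5672",
            "amqp://user:pass@host3:5672",
        ],
        "dest-queue": "notification_designate.info",
        "ack-mode": "on-confirm",
    }
}

resp = requests.put(
    f"{MGMT_URL}/api/parameters/shovel/{VHOST}/designate-region2-to-region1",
    auth=AUTH,
    headers={"content-type": "application/json"},
    data=json.dumps(shovel),
)
resp.raise_for_status()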

Slide 27

How our RabbitMQ looks now (4/5):
(Diagram: Region 1; Region 2 / Region 3 ※omit details)
Policies (see the sketch below):
● ^notifications_designate.error$ : message-ttl: 1, expires: 10, max-length: 1
● ^guestagent.* : expires: 7200000
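As an illustration, the two policies above could be set through the management HTTP API roughly as follows; the endpoint and credentials are assumptions:

# Sketch: drop messages on the unconsumed notifications_designate.error queue
# almost immediately (message-ttl + queue expiry + max-length), and expire idle
# guestagent queues after 2 hours. Endpoint and credentials are assumptions.
import json
import requests

MGMT_URL = "http://host4:15672"
AUTH = ("guest", "guest")
VHOST = "%2F"

policies = {
    "drop-designate-error": {
        "pattern": "^notifications_designate.error$",
        "apply-to": "queues",
        "definition": {"message-ttl": 1, "expires": 10, "max-length": 1},
    },
    "expire-guestagent": {
        "pattern": "^guestagent.*",
        "apply-to": "queues",
        "definition": {"expires": 7200000},   # 2 hours in milliseconds
    },
}

for name, policy in policies.items():
    resp = requests.put(
        f"{MGMT_URL}/api/policies/{VHOST}/{name}",
        auth=AUTH,
        headers={"content-type": "application/json"},
        data=json.dumps(policy),
    )
    resp.raise_for_status()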

Slide 28

How our RabbitMQ looks now (5/5):
(Diagram: Region 1; Region 2 / Region 3 ※omit details)
● Automation Script: detect fanout queues with no consumer => it means "XXXX-Agent may have been restarted" => trigger delete
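A minimal sketch of such a cleanup script, assuming the management HTTP API is reachable; the endpoint, credentials and the fanout-queue naming heuristic are assumptions, and the real script may apply more safety checks:

# Sketch: delete fanout queues that have no consumer (their agent was likely
# restarted and re-declared a fresh fanout queue). Endpoint, credentials and
# the "_fanout_" naming heuristic are assumptions.
import requests

MGMT_URL = "http://host4:15672"
AUTH = ("guest", "guest")

queues = requests.get(f"{MGMT_URL}/api/queues", auth=AUTH).json()

for q in queues:
    is_fanout = "_fanout_" in q["name"]          # oslo.messaging-style fanout queue
    no_consumer = q.get("consumers", 0) == 0
    if is_fanout and no_consumer:
        vhost = requests.utils.quote(q["vhost"], safe="")
        print(f"deleting orphaned fanout queue {q['name']} ({q['messages']} messages)")
        requests.delete(f"{MGMT_URL}/api/queues/{vhost}/{q['name']}", auth=AUTH)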

Slide 29

oslo.messaging backport into Mitaka

Slide 30

Backport important oslo.messaging patch to Mitaka
1. Delete fanout queues on graceful shutdown
(Diagram: agents on the compute node declare and consume an XXX_fanout_ queue bound to an exchange with routing key YYY; the API on the controller node publishes to that exchange)
Non-consumer fanout queue:
- keeps getting messages from the exchange
- many messages left on the RabbitMQ cluster

Slide 31

Backport important oslo.messaging patch to Mitaka
1. Delete fanout queues on graceful shutdown
(Diagram: same topology as above)
- Delete the fanout queue when the process is killed gracefully
- No more "non-consumer fanout queue"
- 2016/07/26 merged into master
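To illustrate the idea only (this is not the actual oslo.messaging patch), a consumer that declares its own fanout queue could delete it on graceful shutdown roughly like this with pika; the host, exchange and queue names are made up:

# Sketch of the concept behind the patch: the consumer removes its fanout queue
# when it shuts down gracefully, so no orphaned queue keeps collecting messages.
# Host, exchange and queue names are made up; oslo.messaging does this internally.
import uuid
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq-host"))
channel = connection.channel()

exchange = "compute_fanout"                      # made-up exchange name
queue = f"compute_fanout_{uuid.uuid4().hex}"     # per-process fanout queue

channel.exchange_declare(exchange=exchange, exchange_type="fanout")
channel.queue_declare(queue=queue)
channel.queue_bind(exchange=exchange, queue=queue)

def handle(ch, method, properties, body):
    ch.basic_ack(method.delivery_tag)            # process the message, then ack

channel.basic_consume(queue=queue, on_message_callback=handle)

try:
    channel.start_consuming()
except KeyboardInterrupt:
    pass
finally:
    # The key point: remove the fanout queue on graceful shutdown so it does
    # not linger and keep accumulating messages with no consumer.
    channel.queue_delete(queue=queue)
    connection.close()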

Slide 32

Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
(Diagram: when an RPC is performed, the request message carries method: "XXX" and _reply_q: reply_…; the RPC server finds the reply queue in the message and sends the reply through a reply_ exchange with routing key reply_…)

Slide 33

Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
(Diagram: reply_ exchange missing)
- When a node gets broken, the exchange migration sometimes fails
- The reply_ exchange goes missing and the reply fails to be sent

Slide 34

Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
(Diagram: the reply is published to the default exchange with routing key reply_…)
- 2018/11/02 merged into master

Slide 35

Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
(Diagram: same as the previous slide)
- 2018/11/02 merged into master
From: https://www.rabbitmq.com/tutorials/amqp-concepts.html
"Default Exchange: The default exchange is a direct exchange with no name (empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key which is the same as the queue name."
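A small sketch of the property quoted above: publishing to the nameless default exchange with the queue name as routing key reaches the queue without declaring any custom exchange; the host and queue name are made up:

# Sketch: the default exchange (empty string) routes a message straight to the
# queue whose name matches the routing key, with no custom exchange involved.
# Host and queue name are made up for illustration.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq-host"))
channel = connection.channel()

reply_queue = "reply_3b7c1a"          # made-up oslo.messaging-style reply queue
channel.queue_declare(queue=reply_queue)

# No exchange_declare needed: exchange="" is the pre-declared default exchange.
channel.basic_publish(exchange="",
                      routing_key=reply_queue,
                      body=b'{"result": "ok"}')

connection.close()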

Slide 36

Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
(Diagram: same as the previous slides)
- Even if the reply_ exchange is gone, RPC replies are correctly delivered via the default exchange
- 2018/11/02 merged into master

Slide 37

Monitoring of RabbitMQ and oslo.messaging

Slide 38

Monitoring: RabbitMQ
● Use https://github.com/kbudde/rabbitmq_exporter
● Check the following:
○ Network partition check
○ Queue synchronization for each queue
○ Number of consumers for each queue
○ Number of messages left
○ Number of connections in RabbitMQ vs number of connections in the OS
○ Number of exchanges
○ Number of queue bindings for each exchange
○ idle_since for each queue (not yet)
○ Average message stay time (not yet)
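Independent of the exporter, a quick manual check of two of the items above (network partitions, queues without consumers) could be done against the management HTTP API roughly like this; the endpoint and credentials are assumptions:

# Sketch: quick manual check of network partitions and consumer-less queues via
# the management HTTP API. Endpoint and credentials are assumptions; in
# production these signals come from rabbitmq_exporter metrics instead.
import requests

MGMT_URL = "http://host4:15672"
AUTH = ("guest", "guest")

# Network partition check: every node should report an empty "partitions" list.
for node in requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH).json():
    if node.get("partitions"):
        print(f"PARTITION on {node['name']}: {node['partitions']}")

# Messages left on queues that have no consumer.
for q in requests.get(f"{MGMT_URL}/api/queues", auth=AUTH).json():
    if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0:
        print(f"queue {q['name']} has {q['messages']} messages and no consumer")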

Slide 39

There are 2 types of monitoring for RPC / Notification
● From the message broker point of view: "RabbitMQ works well"
● From the OpenStack point of view: "RPC / Notification work fine"

Slide 40

Why we need oslo.messaging Monitoring
(Diagram: the RPC client sends method: "XXX" with _reply_q: reply_… to the RPC server)
There are a bunch of reasons for "RPC Reply Timeout":
- RabbitMQ message lost
- RabbitMQ message delayed in delivery
- RPC server got an exception
- RPC server took a long time to perform the RPC
None of RabbitMQ's business

Slide 41

Experimenting with "Metrics Support" in oslo.messaging
(Diagram: on controller nodes (nova-conductor, nova-api) and compute nodes (nova-compute, neutron-agent), a patched oslo.messaging reports to a new oslo.metrics process via /var/run/metrics.sock)

Slide 42

Experimenting with "Metrics Support" in oslo.messaging
(Diagram: same as above)
● RPC Server
○ rpc_server_count_for_exception
○ rpc_server_count_for_invocation_start
○ rpc_server_count_for_invocation_finish
○ rpc_server_processing_time
● RPC Client
○ rpc_client_count_for_exception
○ rpc_client_count_for_invocation_start
○ rpc_client_count_for_invocation_finish
○ rpc_client_processing_time
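The slides do not show the wire format between the patched oslo.messaging and oslo.metrics; purely as a sketch of the idea, with the socket path, event format and field names all assumed, a hook around an RPC server invocation might push events like this:

# Rough sketch of the idea only: a hook in patched oslo.messaging pushes small
# JSON events to the local oslo.metrics process over a unix domain socket.
# The socket path, event format and field names are assumptions.
import json
import socket
import time

METRICS_SOCK = "/var/run/metrics.sock"

def emit(event: dict) -> None:
    s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        s.sendto(json.dumps(event).encode(), METRICS_SOCK)
    finally:
        s.close()

# Around an RPC server invocation:
start = time.monotonic()
emit({"metric": "rpc_server_count_for_invocation_start",
      "topic": "conductor", "method": "object_class_action_versions"})
try:
    pass  # ... actually dispatch the RPC method here ...
except Exception:
    emit({"metric": "rpc_server_count_for_exception",
          "topic": "conductor", "method": "object_class_action_versions"})
    raise
finally:
    emit({"metric": "rpc_server_count_for_invocation_finish",
          "topic": "conductor", "method": "object_class_action_versions",
          "duration": time.monotonic() - start})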

Slide 43

Sample of Metric
(Diagram: same as above)
rpc_server_invocation_start_count_total{
  endpoint="ConductorManager",
  method="object_class_action_versions",
  namespace="None",
  version="3.0",
  exchange="None",
  server="host1",
  topic="conductor",
} 155.0

Slide 44

What’s Next around RabbitMQ?
● Production-ready oslo.messaging monitoring
● RabbitMQ disaster testing
● Benchmark emulation & RPC tuning