How we used RabbitMQ
in the wrong way at scale
LINE Corp
Yuki Nishiwaki
Bhor Dinesh
Slide 2
Slide 2 text
What’s this talk about
● We hadn't taken RabbitMQ seriously enough
● We experienced a few outages because of that
● Let us share what we faced and improved, one by one
Takeaways
● Tips for operating RabbitMQ
● Don't make the same mistakes we did
Slide 3
Slide 3 text
Who are We…. LINE?
Slide 4
Slide 4 text
MAU: 164 million
[Figure: per-country breakdown of 81 million / 44 million / 21 million / 18 million]
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
Key of LINE Infrastructure
● 2,000 cloud users
● In 10 different locations
● Centralized management infrastructure
● Multiple regions
Slide 7
Slide 7 text
Scale of Our Whole Infrastructure
[Figure: 20,000 / 43,000 / 4,400 / 600 / 1,800 / 2,500 / 840 / 720 active servers across categories]
※ This information comes from the infra team's configuration management system and counts only active servers
Slide 8
Slide 8 text
Other Solutions vs Private Cloud (incl. OpenStack)
[Figure: the 20,000 / 43,000 / 4,400 / 600 totals split between the two approaches (5,500, 37,000, 4,100, 594, 714, 670, 1,800, 2,500, 840, 720)]
There are still many servers which are managed
by other solutions
(VMware, in-house automation, manual)
Slide 9
Slide 9 text
Many Workloads moving to… Private Cloud
[Figure: from 15,000 to 35,000 VMs, i.e. +20,000 VMs in a year]
Slide 10
Slide 10 text
OpenStack in LINE started in 2016
● Components: (shown as project logos; details omitted)
● Versions: Mitaka + customization
● Scale: (details omitted)
● Developers: 4-6 members
Slide 11
Slide 11 text
OpenStack Deployment as of 2018/12
[Diagram: three regions (Region 1-3) and three clusters (Cluster 1-3) sharing the same architecture, each with its own mysql; regional components and global components; site disaster recovery between regions]
Slide 12
Slide 12 text
RabbitMQ Deployment/Configuration as of 2018
[Diagram: Regions 1-3, Cluster 1]
1. ha-mode: all (a policy sketch follows below)
2. Plugins: rabbitmq_management, rabbitmq_shovel
   Shovel: notification_designate.info@regionN+1 => notification_designate.info@region1
3. Monitoring through rabbitmq_management
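For reference, mirroring every queue to every node (the ha-mode: all above) can be expressed as a classic-mirroring policy through the RabbitMQ management HTTP API. This is only a minimal sketch: the endpoint, credentials, and policy name are assumptions, not our exact setup.

```python
import requests

# Assumed management endpoint and credentials; adjust to your environment.
MGMT = "http://rabbitmq.example.com:15672"
AUTH = ("guest", "guest")

# Mirror every queue to all nodes (the "ha-mode: all" configuration above).
policy = {
    "pattern": ".*",
    "apply-to": "queues",
    "definition": {"ha-mode": "all"},
}
resp = requests.put(f"{MGMT}/api/policies/%2F/ha-all", json=policy, auth=AUTH)
resp.raise_for_status()
```

Mirroring everything to all nodes is the simplest policy to write, which is exactly why it is easy to adopt without thinking about its cost, as the later slides show.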
Slide 13
Slide 13 text
Where did it go wrong?
Slide 14
Slide 14 text
Several Outages
1. 2019-05-16: stage-prod stopped working due to 'notifications_designate.info' queue parameter inconsistency
2. 2018-11-20: RabbitMQ: adding dedicated statistics-gathering nodes
3. 2019-02-23: one RabbitMQ node crashed because of high memory usage (Verda-Prod-Tokyo)
4. 2019-03-19: outage
5. 2019-05-16: outage caused by human error
Slide 15
Slide 15 text
No content
Slide 16
Slide 16 text
No content
Slide 17
Slide 17 text
No content
Slide 18
Slide 18 text
No content
Slide 19
Slide 19 text
[Diagram: Regions 1-3, mysql, site disaster recovery]
RabbitMQ-Shovel Policy:
notification_designate.info@regionN+1 => notification_designate.info@RabbitMQ-1
Slide 20
Slide 20 text
Region 1 Region 2
Region 3
mysql
Site Disaster Recovery
RabbitMQ-Shovel Policy:
notification_designate.info@regionN+1 => notification_designate.info@region1-data-nodes
notification_designate.info)
Slide 21
Slide 21 text
No content
Slide 22
Slide 22 text
Current Architecture of RabbitMQ
Slide 23
Slide 23 text
1/5 How our RabbitMQ looks now:
[Diagram: Regions 1-3, details omitted]
Prepare dedicated RabbitMQ clusters for Nova and Neutron
Slide 24
Slide 24 text
1/5 How our RabbitMQ looks now:
[Diagram: Regions 1-3, details omitted]
Slide 25
Slide 25 text
2/5 How our RabbitMQ looks now (Region 1; other regions omit details):
● Data nodes: host1, host2, host3 (plugin: rabbitmq_management-agent)
● Management nodes: host4, host5, holding the stats db (plugin: rabbitmq_management)
● HA configuration (a policy sketch follows below):
  ha-mode: nodes
  ha-params: host1,host2,host3
  ha-sync-mode: automatic
  queue-master-locator: min-masters
● Only data nodes hold the queues (ha-params points at the data nodes only)
● Socket limit configured to tolerate one node failure
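The HA configuration listed above can be expressed as a single policy. A minimal sketch via the management HTTP API follows; the node names come from the slide, but the rabbit@ prefixes, endpoint, credentials, and policy name are assumptions.

```python
import requests

MGMT = "http://rabbitmq-region1.example.com:15672"   # assumed endpoint
AUTH = ("admin", "secret")                            # assumed credentials

# Pin queue mirrors to the three data nodes and keep them synced automatically,
# matching the ha-mode/ha-params/ha-sync-mode/queue-master-locator values above.
policy = {
    "pattern": ".*",
    "apply-to": "queues",
    "definition": {
        "ha-mode": "nodes",
        "ha-params": ["rabbit@host1", "rabbit@host2", "rabbit@host3"],
        "ha-sync-mode": "automatic",
        "queue-master-locator": "min-masters",
    },
}
resp = requests.put(f"{MGMT}/api/policies/%2F/ha-data-nodes", json=policy, auth=AUTH)
resp.raise_for_status()
```

Restricting mirrors to the data nodes keeps the management nodes free to run the stats db, and min-masters spreads queue masters across the three data nodes.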
Slide 26
Slide 26 text
3/5 How our RabbitMQ looks now (Region 1; other regions omit details):
● Plugin: rabbitmq_shovel
● Data nodes: host1, host2, host3; management nodes: host4, host5 (stats db)
● Shovel (a sketch follows below):
  notification_designate.info@regionN+1
  => notification_designate.info@host1(region1)
  => notification_designate.info@host2(region1)
  => notification_designate.info@host3(region1)
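The per-host targets above suggest the shovel can fail over between the region-1 data nodes, which a dynamic shovel expresses as a list of destination URIs. A minimal sketch via the management HTTP API, with illustrative URIs, credentials, and shovel name:

```python
import requests

MGMT = "http://rabbitmq-region2.example.com:15672"   # assumed source-side endpoint
AUTH = ("admin", "secret")                            # assumed credentials

# Move designate notifications from region N+1 to the region-1 data nodes;
# listing several dest URIs lets the shovel fail over between host1/host2/host3.
shovel = {
    "value": {
        "src-uri": "amqp://user:pass@localhost:5672",
        "src-queue": "notification_designate.info",
        "dest-uri": [
            "amqp://user:pass@host1.region1.example.com:5672",
            "amqp://user:pass@host2.region1.example.com:5672",
            "amqp://user:pass@host3.region1.example.com:5672",
        ],
        "dest-queue": "notification_designate.info",
    }
}
resp = requests.put(
    f"{MGMT}/api/parameters/shovel/%2F/designate-notifications-to-region1",
    json=shovel, auth=AUTH,
)
resp.raise_for_status()
```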
Slide 27
Slide 27 text
4/5 How our RabbitMQ looks now (Region 1; other regions omit details):
Policies (a sketch follows below):
● ^notifications_designate.error$
  message-ttl: 1
  expires: 10
  max-length: 1
● ^guestagent.*
  expires: 7200000
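The two policies above use the values shown on the slide; the endpoint, credentials, and policy names in this sketch are assumptions.

```python
import requests

MGMT = "http://rabbitmq-region1.example.com:15672"   # assumed endpoint
AUTH = ("admin", "secret")                            # assumed credentials

# Drop designate error notifications almost immediately: keep at most one
# message, expire messages after 1 ms, and delete the unused queue after 10 ms.
resp = requests.put(f"{MGMT}/api/policies/%2F/designate-error", auth=AUTH, json={
    "pattern": "^notifications_designate.error$",
    "apply-to": "queues",
    "definition": {"message-ttl": 1, "expires": 10, "max-length": 1},
})
resp.raise_for_status()

# Delete idle guestagent queues after 2 hours (7,200,000 ms).
resp = requests.put(f"{MGMT}/api/policies/%2F/guestagent-expire", auth=AUTH, json={
    "pattern": "^guestagent.*",
    "apply-to": "queues",
    "definition": {"expires": 7200000},
})
resp.raise_for_status()
```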
Slide 28
Slide 28 text
5/5 How our RabbitMQ looks now (Region 1; other regions omit details):
Automation script (a sketch follows below):
● Detect queues that are fanout queues and have no consumer
  => it means "the XXXX-agent may have been restarted"
● Trigger: delete the queue
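A minimal sketch of what such an automation script could look like against the management HTTP API; the endpoint, credentials, and the "_fanout_" name filter are assumptions about the script, not its actual code.

```python
import requests

MGMT = "http://rabbitmq-region1.example.com:15672"   # assumed endpoint
AUTH = ("admin", "secret")                            # assumed credentials

# Find fanout queues that no longer have a consumer (the agent that declared
# them was probably restarted) and delete them so messages stop piling up.
queues = requests.get(f"{MGMT}/api/queues", auth=AUTH).json()
for q in queues:
    if "_fanout_" in q["name"] and q.get("consumers", 0) == 0:
        vhost = requests.utils.quote(q["vhost"], safe="")
        name = requests.utils.quote(q["name"], safe="")
        print(f"deleting non-consumer fanout queue {q['name']}")
        requests.delete(f"{MGMT}/api/queues/{vhost}/{name}", auth=AUTH).raise_for_status()
```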
Slide 29
Slide 29 text
oslo.messaging backport into Mitaka
Slide 30
Slide 30 text
Backport important oslo.messaging patch to Mitaka
1. Delete fanout queues on graceful shutdown
[Diagram: a Compute Node agent declares and consumes an XXX_fanout_ queue bound to an exchange with routing key YYY; the Controller Node API publishes to that exchange]
Non-consumer fanout queue:
- Keeps getting messages from the exchange
- Many messages are left on the RabbitMQ cluster
Slide 31
Slide 31 text
Backport important oslo.messaging patch to Mitaka
1. Delete fanout queues on graceful shutdown
[Diagram: the XXX_fanout_ queue is deleted when the agent process is killed gracefully]
No more "non-consumer fanout queues"
2016/07/26: merged into master (a conceptual sketch follows below)
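Conceptually, the patch makes the consumer clean up its own fanout queue when it exits gracefully. A minimal kombu sketch of that idea, not the actual oslo.messaging code, with illustrative exchange, queue, and broker names:

```python
from kombu import Connection, Exchange, Queue

# Each agent declares its own uniquely named queue on a fanout exchange.
exchange = Exchange("neutron_fanout", type="fanout")       # illustrative name
queue = Queue("neutron_fanout_1234abcd", exchange=exchange) # illustrative name

with Connection("amqp://user:pass@rabbitmq.example.com:5672//") as conn:
    bound = queue(conn.default_channel)
    bound.declare()      # agent starts with its own fanout queue bound to the exchange
    try:
        pass             # ... consume messages while the agent is running ...
    finally:
        # On graceful shutdown, remove the queue so it does not keep
        # collecting messages with no consumer attached.
        bound.delete()
```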
Slide 32
Slide 32 text
Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
[Diagram: when an RPC is executed, the API sends a message containing method: "XXX" and _reply_q: reply_...; the agent finds the reply queue name in the message and publishes the reply to the reply_ exchange with routing key reply_]
Slide 33
Slide 33 text
Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
[Diagram: when a node goes down, migration of the reply_ exchange sometimes fails; with the exchange missing, the agent fails to send the RPC reply]
Slide 34
Slide 34 text
Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
[Diagram: the reply message is published to the default exchange with routing key reply_, instead of a dedicated reply_ exchange]
2018/11/02: merged into master
Slide 35
Slide 35 text
Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
[Diagram: replies go through the default exchange with routing key reply_]
2018/11/02: merged into master
Default Exchange (from https://www.rabbitmq.com/tutorials/amqp-concepts.html):
"The default exchange is a direct exchange with no name (empty string) pre-declared by the broker. It has one special property that makes it very useful for simple applications: every queue that is created is automatically bound to it with a routing key which is the same as the queue name." (a small example follows below)
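To illustrate the quoted property, a small pika sketch: publishing to the empty-string exchange with the queue name as routing key reaches the queue without any explicitly declared exchange. The broker address and reply queue name are illustrative.

```python
import pika

# Assumed broker address; the reply queue name is illustrative.
conn = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.com"))
channel = conn.channel()

channel.queue_declare(queue="reply_9b2c")  # auto-bound to the default exchange
channel.basic_publish(
    exchange="",                # default exchange: the empty-string name
    routing_key="reply_9b2c",   # routing key is the same as the queue name
    body=b"rpc reply payload",
)
conn.close()
```

Because the default exchange always exists on the broker, a reply published this way cannot be lost just because a dedicated reply_ exchange failed to migrate.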
Slide 36
Slide 36 text
Backport important oslo.messaging patch to Mitaka
2. Use default exchange for direct messaging
[Diagram: replies go through the default exchange with routing key reply_]
2018/11/02: merged into master
Even if the exchange is gone, RPC replies are still delivered correctly
Slide 37
Slide 37 text
Monitoring of RabbitMQ and oslo.messaging
Slide 38
Slide 38 text
Monitoring: RabbitMQ
● Use https://github.com/kbudde/rabbitmq_exporter
● Check the following:
○ Network partition check
○ Queue synchronization for each queue
○ Number of consumers for each queue
○ Number of messages left
○ Number of connections in RabbitMQ vs number of connections in the OS (see the sketch below)
○ Number of exchanges
○ Number of queue bindings for each exchange
○ idle_since for each queue (not yet)
○ Average message stay time (not yet)
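In practice these checks are driven by rabbitmq_exporter as noted above, but the RabbitMQ-vs-OS connection comparison can be sketched as a small standalone check. The endpoint, credentials, the assumption that it runs on each RabbitMQ node, and the use of ss are all assumptions.

```python
import subprocess
import requests

MGMT = "http://localhost:15672"    # assumed: run on each RabbitMQ node
AUTH = ("monitoring", "secret")    # assumed credentials

# Connections RabbitMQ itself knows about ...
rabbit_conns = len(requests.get(f"{MGMT}/api/connections", auth=AUTH).json())

# ... versus established AMQP sockets the OS sees on port 5672.
out = subprocess.run(["ss", "-tn"], capture_output=True, text=True, check=True).stdout
os_conns = sum(1 for line in out.splitlines() if "ESTAB" in line and ":5672" in line)

# A large gap suggests connections the broker has lost track of (or vice versa).
print(f"rabbitmq={rabbit_conns} os={os_conns} diff={os_conns - rabbit_conns}")
```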
Slide 39
Slide 39 text
There are 2 types of monitoring for RPC and Notification:
● From the message broker's point of view: "RabbitMQ works well"
● From OpenStack's point of view: "RPC and Notification work fine"
Slide 40
Slide 40 text
Why we need oslo.messaging monitoring
[Diagram: the RPC client sends a message (method: "XXX", _reply_q: reply_...) to the RPC server]
There are a bunch of reasons for "RPC reply timeout":
- RabbitMQ lost the message
- RabbitMQ delayed delivering the message
- The RPC server hit an exception
- The RPC server took a long time to perform the RPC
Some of these are none of RabbitMQ's business, so broker-side monitoring alone cannot see them.
Slide 41
Slide 41 text
Experimenting with "Metrics Support" in oslo.messaging
[Diagram: on Controller Nodes, nova-conductor and nova-api run a patched oslo.messaging; on Compute Nodes, nova-compute and neutron-agent do the same; each node runs a new oslo.metrics process listening on /var/run/metrics.sock] (a conceptual sketch follows below)
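A conceptual sketch of the idea: the patched oslo.messaging emits a sample per RPC to the local oslo.metrics process over the Unix socket shown above. The JSON payload and field names here are guesses for illustration, not the real oslo.metrics wire format.

```python
import json
import socket
import time

SOCK_PATH = "/var/run/metrics.sock"   # the socket shown in the diagram


def emit_rpc_metric(target, method, duration, status):
    """Send one RPC timing sample to the local oslo.metrics process.

    The payload layout is an assumption made for this sketch; the real
    oslo.metrics format may differ.
    """
    sample = {
        "target": target,
        "method": method,
        "duration": duration,
        "status": status,
        "timestamp": time.time(),
    }
    s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        s.sendto(json.dumps(sample).encode(), SOCK_PATH)
    finally:
        s.close()


# Example: record a successful RPC call that took 120 ms.
emit_rpc_metric("nova-conductor", "build_instances", 0.12, "ok")
```

The point of the socket indirection is that every OpenStack process on the node can report RPC-level timings without each of them running its own metrics endpoint; oslo.metrics aggregates and exposes them.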