the information tsunami
RabbitQoS - How Base achieves tenant isolation and SLOs in multi-tenant messaging systems
Bogusław Miśta, Backend Platform Team Leader @Zendesk
Customer impact
• Regular traffic has to wait for the abnormal traffic to be processed first
• Customers don’t get notifications
• They miss appointments
• They make decisions based on incorrect, inconsistent data
• They can lose money
Goals
We can do better!
• Handle large spikes automatically, with no manual intervention
• Get visibility: quickly and easily assess the impact of a delay on customers
• No single account absorbs all the resources
• Tenants are isolated and don’t affect each other
• Different types of messages (batch, import, VIP, regular) don’t affect each other
• SLOs are met: messages are delivered on time, within 10 seconds
“Quality of Service is the ability to provide different priority to different applications, users, or data flows, or to guarantee a certain level of performance to a data flow.”
The Scalability way
“All problems in computer science can be solved by another level of indirection” (“… except of course for the problem of too many indirections”)
RabbitQoS: RabbitMQ Quality of Service
• A proxy, a very fast water mill
• Acts as a man-in-the-middle (MITM), so we didn’t have to implement the AMQP protocol
• Sifts the water drops (messages)
• Classifies the messages
• Detects offending accounts
• Sets offenders’ messages aside in persistent storage
• Throttles offending accounts
• Makes sure the service queue delay stays below the SLO
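
The slides do not say how offending accounts are throttled. One common way to do it is a token bucket per account, sketched below under that assumption. The class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical and only illustrate the idea; they are not taken from RabbitQoS itself.

```python
import time


class TokenBucket:
    """Hypothetical per-account throttle: `rate` messages per second,
    with bursts of up to `capacity` messages."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the sketch can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if one more message may be forwarded right now."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A proxy built this way would keep one bucket per offending account and forward a message to the service queue only when `allow()` returns True, leaving the rest in the account's own storage until tokens are available again.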
Offenders Detection
• Kicks in when the queue delay rises above the SLO (the “delay threshold”)
• Based on a 10-second sliding window
• Picks the top K accounts that sent the most messages
• K is configurable via the UI by the service owner
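
The detection rule above can be sketched as a sliding-window counter with a top-K query. This is a minimal illustration, not the RabbitQoS implementation: the class `OffenderDetector`, its method names, and the choice to pass timestamps in explicitly are all assumptions made for the example.

```python
from collections import Counter, deque


class OffenderDetector:
    """Hypothetical sketch: counts messages per account over a sliding
    window and reports the top K senders once the delay exceeds the SLO."""

    def __init__(self, window_seconds=10.0, delay_threshold=10.0, k=3):
        self.window_seconds = window_seconds
        self.delay_threshold = delay_threshold
        self.k = k
        self.events = deque()    # (timestamp, account_id), oldest first
        self.counts = Counter()  # messages per account inside the window

    def record(self, account_id, now):
        """Register one message from `account_id` seen at time `now`."""
        self.events.append((now, account_id))
        self.counts[account_id] += 1
        self._expire(now)

    def _expire(self, now):
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_account = self.events.popleft()
            self.counts[old_account] -= 1
            if self.counts[old_account] == 0:
                del self.counts[old_account]

    def offenders(self, queue_delay, now):
        """Return up to K top senders, or [] while the delay is within the SLO."""
        if queue_delay <= self.delay_threshold:
            return []
        self._expire(now)
        return [account for account, _ in self.counts.most_common(self.k)]
```

Keeping the per-account counts alongside the event queue makes both recording a message and expiring an old one constant-time, which matters for a proxy that sits on the hot path of every message.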
RabbitQoS Offenders Queues
• A separate queue and exchange for each offender
• Offenders are isolated
• Queues are created and removed on demand
• Considered Kafka: it can store many more messages without interfering with RabbitMQ performance, its partitions are HA, and messages can be browsed
• Didn’t want to complicate the solution, so we use the same broker
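
Creating and removing a queue and exchange per offender on demand could look roughly like the sketch below. It assumes a channel object with the method names used by pika's `BlockingChannel` (`exchange_declare`, `queue_declare`, `queue_bind`, `queue_delete`, `exchange_delete`). The class `OffenderQueues`, the `offender.<account>` naming scheme, and the use of a direct exchange are hypothetical choices for illustration; the slides do not specify them.

```python
class OffenderQueues:
    """Hypothetical sketch: one durable queue and exchange per offending
    account, declared on first use and deleted once no longer needed."""

    def __init__(self, channel, prefix="offender"):
        self.channel = channel  # assumed to be a pika-style channel
        self.prefix = prefix
        self.active = set()     # accounts that currently have a queue

    def name_for(self, account_id):
        # Assumed naming scheme, shared by the queue and the exchange.
        return f"{self.prefix}.{account_id}"

    def ensure(self, account_id):
        """Declare the offender's exchange and queue if they do not exist yet."""
        name = self.name_for(account_id)
        if account_id in self.active:
            return name
        self.channel.exchange_declare(
            exchange=name, exchange_type="direct", durable=True
        )
        self.channel.queue_declare(queue=name, durable=True)
        self.channel.queue_bind(queue=name, exchange=name, routing_key=name)
        self.active.add(account_id)
        return name

    def release(self, account_id):
        """Remove the offender's queue and exchange after the queue is drained."""
        if account_id not in self.active:
            return
        name = self.name_for(account_id)
        # if_empty=True is a safety guard: a real broker rejects the delete
        # if messages are still waiting, so call this only after draining.
        self.channel.queue_delete(queue=name, if_empty=True)
        self.channel.exchange_delete(exchange=name)
        self.active.discard(account_id)
```

The proxy would call `ensure()` when an account is flagged as an offender and `release()` once that account's backlog has been delivered, which keeps the number of extra queues proportional to the number of current offenders.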
Conclusion
Happy firefighters’ day!
• Customers get their notifications on time
• SLOs are met automatically
• Metrics and visibility into asynchronous communication
• Clustered RabbitMQ brokers
• 70% fewer alerts