RabbitQoS - Taming the Information Tsunami (HackKrk)

How Base achieves tenant isolation and meets its SLOs in RabbitMQ using RabbitQoS, an internal tool that we built.

Bogusław Miśta

October 17, 2018

Transcript

  1. Taming the information tsunami
    RabbitQoS - How Base achieves tenant isolation, and SLOs, in multi-tenant messaging systems
    Bogusław Miśta, Backend Platform Team Leader @Zendesk
    The All-In-One Sales Platform getbase.com
  2. Base
    • Service Oriented Architecture
    • 150+ microservices
    • Communication: synchronous, asynchronous, batch, …
    • Asynchronous: data replication, notifications, …
    • RabbitMQ, Kafka
    • Multi-tenant
  3. Multi-tenant - Heterogeneous
    • Tenants used to be homogeneous
    • Enterprise customers
    • Abnormal data volume
    • Irregular traffic patterns due to automation
  4. Butterfly effect - Floods of messages
    • Single change generates millions of messages
    • Uncontrollable flood of messages
    • Impact on all customers
  5. Customer impact
    • Regular traffic has to wait until the abnormal traffic is processed
    • Customers don’t get notifications
    • Miss appointments
    • Make decisions based on incorrect, inconsistent data
    • Can lose money ($)
  6. Who you call? - Duty’s calling
    • 70% of all alerts were related to RabbitMQ floods
    • All of these incidents were High Severity
    • Not actionable
  7. We can do better! - Goals
    • Handle large spikes automatically, with no manual intervention
    • Get visibility: quickly and easily assess the impact of the delay on customers
    • No one absorbs all the resources
    • Tenants are isolated and don’t affect each other
    • Different types of messages (batch, import, vip, regular) don’t affect each other
    • SLOs are met (10 seconds) and messages are delivered on time
  8. “Quality of Service is the ability to provide different priority to different applications, users, or data flows, or to guarantee a certain level of performance to a data flow.”
  9. The Scalability way
    “All problems in computer science can be solved by another level of indirection” (“… except of course for the problem of too many indirections”)
  10. RabbitQoS - RabbitMQ Quality of Service
    • A proxy, a very fast water mill
    • MITM - didn’t have to implement the AMQP protocol
    • Sifts the water drops (messages)
    • Classifies the messages
    • Detects offending accounts
    • Sets their messages aside in persistent storage
    • Throttles offending accounts
    • Makes sure the service queue delay stays below the SLO
  11. RabbitQoS - Offenders Detection
    • Kicks in when the delay rises above the SLO (the “delay threshold”)
    • Based on a 10 s sliding window
    • Picks the top K accounts that sent the most messages
    • K is configurable via the UI by the service owner
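
The detection rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RabbitQoS implementation (which, per the technologies slide, is built on Akka): the class name `OffenderDetector`, its parameters, and the use of caller-supplied timestamps are all hypothetical. It counts messages per account over a sliding window and reports the top K senders only while the observed queue delay is above the threshold.

```python
from collections import Counter, deque


class OffenderDetector:
    """Hypothetical sketch: per-account message counts over a sliding window."""

    def __init__(self, window_seconds=10.0, delay_threshold=10.0, top_k=3):
        self.window_seconds = window_seconds    # the slide mentions a 10 s window
        self.delay_threshold = delay_threshold  # assumed to equal the SLO, in seconds
        self.top_k = top_k                      # "K", set by the service owner
        self._events = deque()                  # (timestamp, account_id), oldest first
        self._counts = Counter()                # account_id -> messages in the window

    def record(self, account_id, now):
        """Register one message seen at time `now` (seconds)."""
        self._events.append((now, account_id))
        self._counts[account_id] += 1
        self._evict(now)

    def _evict(self, now):
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, account_id = self._events.popleft()
            self._counts[account_id] -= 1
            if self._counts[account_id] == 0:
                del self._counts[account_id]

    def offenders(self, queue_delay, now):
        """Return the top-K senders, but only while the delay exceeds the threshold."""
        self._evict(now)
        if queue_delay <= self.delay_threshold:
            return []
        return [account for account, _ in self._counts.most_common(self.top_k)]
```

Passing `now` explicitly keeps the sketch deterministic; a real proxy would more likely read a monotonic clock and measure the queue delay itself.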
  12. RabbitQoS - Offenders Queues
    • Separate queue & exchange for each offender
    • Offenders isolation
    • Queues created and removed on demand
    • Considered Kafka:
        • Can store many more messages without interfering with RabbitMQ performance
        • Partitions are HA
        • Possible to browse messages
    • Didn’t want to complicate the solution - use the same broker
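
The idea of per-offender queues that appear and disappear on demand can be illustrated with a small in-memory model. This is a sketch under assumptions, not RabbitQoS code: plain Python deques stand in for RabbitMQ queues and exchanges, and the names `OffenderRouter`, `mark_offender`, `route`, and `release` are hypothetical.

```python
from collections import deque


class OffenderRouter:
    """Hypothetical in-memory stand-in for per-offender queues on a broker."""

    def __init__(self):
        self.service_queue = deque()  # what the downstream service consumes
        self.offender_queues = {}     # account_id -> deque, created on demand

    def mark_offender(self, account_id):
        # Create the isolation queue the first time an account is flagged.
        self.offender_queues.setdefault(account_id, deque())

    def route(self, account_id, message):
        # Offenders' messages are set aside; everyone else goes straight through.
        queue = self.offender_queues.get(account_id, self.service_queue)
        queue.append(message)

    def release(self, account_id, max_messages):
        """Move up to `max_messages` back to the service queue.

        The offender queue is removed once it is drained (a simplification).
        """
        queue = self.offender_queues.get(account_id)
        if queue is None:
            return 0
        moved = 0
        while queue and moved < max_messages:
            self.service_queue.append(queue.popleft())
            moved += 1
        if not queue:
            del self.offender_queues[account_id]
        return moved
```

In this model, how many messages `release` is allowed to move per call is the knob that a throttling policy would control.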
  13. RabbitQoS - Offenders Throttling
    • Core aspect of QoS
    • Slowly consume offender queues
    • Token Bucket algorithm
    • Token rate and burst size adjustable by service owners
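
For readers unfamiliar with the Token Bucket algorithm named on this slide, here is a minimal generic sketch. It is not taken from RabbitQoS; the class name `TokenBucket`, the method `try_consume`, and the explicit `now` argument are assumptions made for illustration. `rate` and `burst` correspond to the two knobs the slide says service owners can adjust.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second, holds at most `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # bucket capacity
        self._tokens = float(burst)  # start with a full bucket
        self._last = now             # time of the last refill

    def try_consume(self, now, tokens=1.0):
        """Refill for the elapsed time, then take `tokens` if enough are available."""
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        self._last = max(self._last, now)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A consumer of an offender queue would call `try_consume` before forwarding each message and wait when it returns `False`, so a long-idle account can send a short burst but its sustained throughput is capped at `rate`.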
  14. RabbitQoS - Technologies
    • Akka (Java)
    • Apache Helix
    • Apache Zookeeper
    • RabbitMQ
    • InfluxDB
  15. Conclusion - Happy firefighters day!
    • Customers get their notifications on time
    • SLOs are met automatically
    • Metrics and visibility into asynchronous communication
    • Clustered RabbitMQ brokers
    • 70% fewer alerts
  16. Q&A
    Because asking questions is a good way to find out things