
Building and Scaling a WebSockets Pubsub System

kapil
March 31, 2018


I will talk about how we built and maintain a WebSockets platform on AWS infrastructure.
You can expect insights about:

How to build and evolve a WebSockets platform on AWS
How we made the platform more resilient to failures, known and unknown
How we saved costs by using the right strategy for auto-scaling and load balancing
How to monitor a WebSockets platform


Transcript

  1. About me - Kapil, Staff Engineer @ Helpshift. Interests: Clojure, distributed

    systems, games, music, books/comics, football.
  2. Overview • Use case and Scale • Overview of PubSub

    platform • Evolution of platform architecture • Scaling issues and solutions
  3. Helpshift is a mobile CRM SaaS product. We help connect

    app developers with their customers, since everything is now on mobile.
  4. Scale • ~2 TB data broadcast / day • Outgoing

    - 75k msg/sec • Incoming - 1.5k msg/sec • Concurrency - 3.5k Here are some scale numbers for the platform we have built.
  5. Mobile SDK • In-app customer support • Creates a one-to-one

    channel between app user and app owner • ~1 billion unique app installs
  6. Once a ticket is filed, it shows up in the agent's

    portal, which is a Helpshift web app.
  7. Topics and Messages The parts lit up in green are all

    the parts that need to be updated dynamically.
  8. PubSub Platform We built a generic Publish and Subscribe platform.

    Subscribers of these messages are JavaScript clients listening on WebSockets connections, and publishers are any backend servers using ZMQ to publish the messages. We call this platform Dirigent.
  9. A simplified version of the platform's architecture. Again: browsers (subscribers)

    connect to Dirigent using WebSockets, and backend servers (publishers) connect to Dirigent using ZMQ. It's a simplified view right now.
  10. WebSockets • Bi-directional transport between client and server • Supports

    text or binary data • The protocol starts as an HTTP connection and is upgraded to a WebSockets connection
  11. ZMQ • ZeroMQ - a broker-less messaging queue • Supports IPC, TCP,

    multicast as the transport layer • A tiny library that lets one use different messaging patterns like PubSub and Request-Response • The network is the only overhead affecting performance
  12. Consumer Protocol • Transport - WebSockets • Inspired by the WAMP

    WebSockets subprotocol • Implements methods like, • Subscribe • Ex. [“subscribe”, “topic_foo”] • Unsubscribe • Ex. [“unsubscribe”, “topic_foo”] This is what the consumer protocol looks like; it is built on top of the WAMP WebSockets subprotocol. It does not implement WAMP completely (WAMP also includes RPC), but the PubSub part is heavily inspired by the WAMP spec. A server-side sketch of this protocol follows.
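
    A minimal server-side sketch of this protocol in Clojure with http-kit (the stack named on slide 34); the namespace, handler names, in-memory subscription registry, and delivery frame shape are illustrative assumptions, not the actual Dirigent code:

      (ns dirigent.consumer-sketch
        (:require [org.httpkit.server :as http]
                  [cheshire.core :as json]))

      ;; assumed in-memory registry: topic -> set of subscribed channels
      (defonce subscriptions (atom {}))

      (defn handle-frame [channel frame]
        (let [[method topic] (json/parse-string frame)]
          (case method
            "subscribe"   (swap! subscriptions update topic (fnil conj #{}) channel)
            "unsubscribe" (swap! subscriptions update topic (fnil disj #{}) channel)
            nil)))                                 ; ignore unknown methods

      (defn ws-handler [request]
        (http/with-channel request channel
          (http/on-receive channel #(handle-frame channel %))
          (http/on-close channel
                         (fn [_status]              ; drop the channel from every topic
                           (swap! subscriptions
                                  (fn [subs]
                                    (into {} (for [[t chs] subs]
                                               [t (disj chs channel)]))))))))

      ;; deliver a published message to every subscriber of a topic
      (defn broadcast! [topic payload]
        (doseq [ch (get @subscriptions topic)]
          (http/send! ch (json/generate-string [topic payload]))))

      (defn -main []
        (http/run-server ws-handler {:port 8080}))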
  13. Producer Protocol • ZMQ • Implements publish • Ex. [“topic_foo”,

    {unix_ts}, {JSON payload}] The publisher connects to Dirigent by means of ZMQ; the underlying connection is TCP.
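
    A sketch of what a publisher could look like in Clojure over JeroMQ's Java API; the endpoint is an assumption, and the three frames follow the layout on this slide:

      (ns dirigent.producer-sketch
        (:import [org.zeromq ZMQ]))

      (defn publish! [endpoint topic json-payload]
        (let [ctx  (ZMQ/context 1)
              sock (.socket ctx ZMQ/PUB)]
          (.connect sock endpoint)   ; e.g. "tcp://dirigent-proxy:5556" (assumed)
          ;; a real producer would keep the socket open; PUB sockets drop
          ;; messages sent before the connection is fully established
          (Thread/sleep 100)
          (.sendMore sock topic)                                        ; topic_foo
          (.sendMore sock (str (quot (System/currentTimeMillis) 1000))) ; unix_ts
          (.send sock json-payload)                                     ; JSON payload
          (.close sock)
          (.term ctx)))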
  14. *diagram slide showing message “bar”*

  15. *diagram slide showing message “bar”*

  16. In v1 of the platform we used different transport mechanisms: HTTP

    streaming for delivering messages to browsers, and HTTP for delivering messages to Dirigent servers. The HTTP mechanism posed problems and had a coupling effect with the backend servers: whenever the Dirigent platform went down due to load, the HTTP connections timed out and created cascading failures in the backend servers. We switched to ZMQ there.
  17. X_X Whenever something went wrong with the Dirigent cluster, the

    API servers' POST requests to broadcast messages would time out.
  18. X_X Queued Publish “bar” on topic “foo” Each timed-out request

    left another publish queued up on the backend server.
  19. X_X Queued 20 publishes As the outage continued, the queued

    publishes kept piling up.
  20. X_X X_X X_X Eventually the backend servers started to conk out.

    Instead of HTTP we started using ZMQ as the delivery mechanism. It creates loose coupling, so there are no cascading failures in the application servers.
  21. Problems with HTTP producers • Cascading failures due to queued

    requests • Manual fanout to all Dirigent servers
  22. ZMQ • De-couples producer (backend server) and consumer (WebSockets server)

    • Retries and buffering are done out of the box, so the library handles that • Topic broadcast is supported out of the box, so manual broadcast is not required
  23. HTTP Streaming Now this is one side of the story.

    There were problems on the consumer side as well.
  24. X_X X_X Since each HTTP connection gets everything that

    is generated for a subdomain, the connections start choking. We quickly ran out of the network card limit for a single machine. On a few bad days we saw client-side bandwidth running out as well.
  25. Problems with HTTP streaming • The request for a topic is

    made only once, and to choose a new set of topics the client needs to make a new request • Modelling topics becomes difficult, so you end up creating one giant topic per subdomain/customer • Clients receive everything, and the WebSockets server has to push everything • The server's network bandwidth limit (~120 MBps) runs out • Client bandwidth is limited as well
  26. Meaning: subscribe to everything, forever. The problem with this is you

    quickly run out of the network card limit on a single machine. And we did. To solve this we had to move to a WebSockets connection where the client can ask for only the things it wants and nothing more.
  27. WebSockets • The bi-directional protocol lets the client pick and choose

    only the topics required for rendering the UI • Network traffic is hugely reduced • Creating more granular topics is possible
  28. ZMQ XSUB The part highlighted is the XSUB pattern of ZeroMQ,

    where many producers connect to one of the Dirigent proxy servers. Every backend server connects to one Dirigent proxy server. Even if there are more proxy servers, the backend just needs to connect to one; the load balancer takes care of this.
  29. ZMQ XPUB Every Dirigent server connects to every Dirigent proxy

    server, since each proxy server has only a subset of the received messages.
  30. Just to zoom in on what I mean by 1-N connections:

    every Dirigent WebSockets server connects to every Dirigent proxy server. The way it discovers all active proxy servers is through ZooKeeper. This is the XPUB part of ZMQ.
  31. ZMQ XSUB / ZMQ XPUB This is how the high-level

    architecture looks. If there are no subscribers for a ZMQ topic, the producers drop messages for that topic.
  32. ZMQ XPub-XSub • Broadcast of messages on a particular topic

    is handled out of the box • Supports back-propagation of subscriptions to ZMQ producers: if there are no subscribers, message production does not happen, saving network hops (see the proxy sketch below)
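
    The proxy in the middle can be a thin XSUB/XPUB forwarder. A minimal sketch in Clojure over JeroMQ (ports are assumptions); subscription frames propagate from the XPUB side back to the XSUB side, which is what lets producers drop messages for idle topics:

      (ns dirigent.proxy-sketch
        (:import [org.zeromq ZMQ]))

      (defn -main []
        (let [ctx      (ZMQ/context 1)
              frontend (.socket ctx ZMQ/XSUB)    ; faces publishers
              backend  (.socket ctx ZMQ/XPUB)]   ; faces WebSockets servers
          (.bind frontend "tcp://*:5556")
          (.bind backend  "tcp://*:5557")
          ;; blocks, shuttling messages downstream and subscriptions upstream
          (ZMQ/proxy frontend backend nil)))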
  33. We also have multiple clusters, and they can talk

    to each other. Each has its own set of subscribers, and publishers can come from another cluster.
  34. Under the hood • Clojure (JVM) • Http-kit (Java NIO

    based WebSockets server) • ZMQ • Zookeeper
  35. Monitoring All the messages we publish are important data

    and need to be rendered in time. The nature of this data is ephemeral: we don't store it anywhere, so auditing is hard. That made monitoring crucial for us.
  36. Under the hood • StatsD protocol • Graphite - Storage

    • Grafana - Frontend • Sensu - Alerts on trends
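
    As an illustration, a metric can be pushed by writing a StatsD datagram over UDP. This tiny helper is an assumption about how such instrumentation could look, not Helpshift's actual code (metric name and agent address are made up):

      (ns dirigent.metrics-sketch
        (:import [ DatagramSocket DatagramPacket InetAddress]))

      ;; emit a StatsD counter increment, e.g. "dirigent.ws.messages_out:1|c"
      (defn statsd-incr! [metric n]
        (let [payload (.getBytes (str metric ":" n "|c"))
              packet  (DatagramPacket. payload (alength payload)
                                       (InetAddress/getByName "localhost") 8125)]
          (with-open [sock (DatagramSocket.)]
            (.send sock packet))))

      ;; e.g. count every message pushed out on a WebSockets connection:
      ;; (statsd-incr! "dirigent.ws.messages_out" 1)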
  37. *example of monitoring comparison across different stages* Since auditing this kind

    of data is hard, we compare metrics for the data at different stages of the platform. But since the numbers are big, it's hard to spot an anomaly. What we are looking for is variance.
  38. Problems with absolute numbers • Relatively small differences are not

    easy to spot • Alerts would need arbitrary thresholds, which tend to fire false positives
  39. Message variance is easy to parse visually. If the variance is

    low, some stage of the platform is dropping data. In fact we have also set up alerts on this same query.
  40. Variance • Variance here is the percentage of messages seen across different stages

    • Sensu alerts work on the same Graphite queries we use for plots in Grafana • Alerts are now reliable, since the threshold is on a percentage, not on an arbitrary number • The only non-actionable alerts we get are from network blips; even those help us understand the outage/degraded window
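
    In other words, the alert fires on a ratio between stages rather than on raw counts. A trivial sketch of the idea (the 95% threshold and the names are assumptions):

      ;; messages seen at a downstream stage as a percentage of an upstream stage
      (defn stage-percentage [msgs-upstream msgs-downstream]
        (if (pos? msgs-upstream)
          (* 100.0 (/ msgs-downstream msgs-upstream))
          0.0))

      ;; alert when the percentage drops below, say, 95%
      (defn dropping-messages? [msgs-upstream msgs-downstream]
        (< (stage-percentage msgs-upstream msgs-downstream) 95.0))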
  41. Another important metric is the time taken to publish a message

    to a WebSocket connection. Since the near-real-time SLA is so important, we look at p99s for anomalies. We have set up alerts on these as well.
  42. Cost saving • Network bandwidth used to push messages out

    on the internet • Number of machines used by the cluster Costs are always a concern for us! These are the two important factors that add up to the cost: outgoing bandwidth usage and the number of machines.
  43. Compression First we started using gzip compression for WebSockets. It's

    a standard compression mechanism supported by browsers, but as usual with browsers there are quirks here.
  44. Re-visiting features • Feature ‘x’ is taking too much bandwidth;

    can we do a subset of feature ‘x’? • Is feature ‘x’ adding value for the cost we are paying? • Some payloads carry just extra information • Renaming field names The biggest change you can make to save costs is to re-visit the features/business logic itself and try to optimise there. This reduced our bandwidth usage by a significant amount.
  45. Auto scaling To save on the number of machines used,

    we started investigating how to do auto scaling. Auto scaling was not straightforward, since all the connections are long-running and can usually stay alive for as long as 8 hours.
  46. Auto scaling pre-reqs • Before scaling out, each server's utilisation

    should be the maximum possible; load balancing should distribute the load evenly • No server should be overloaded, given that increases in scale are mostly gradual
  47. HAProxy with least conn We went with the obvious choice

    of least connection with HAProxy doing the load balancing.
  48. Even though, going by connected clients, the servers seem to be doing the

    same amount of work, it's clear from the load averages that they are not.
  49. Least connection works. Sometimes. Num connections != Load The

    problem with least connection is the assumption that the number of connections a server is handling is directly proportional to the amount of work it's doing. This was a wrong assumption, and it led us to uneven distribution, server crashes, and just bad sleepless nights.
  50. We looked at another metric: the number of messages pushed to

    browsers from a Dirigent WS server. But to use this we needed to do feedback load balancing.
  51. Feedback load balancing Feedback load balancing is something we started

    to do with Herald, an internal tool we built at Helpshift. It helps HAProxy decide which server to choose when routing a new connection. All the servers expose the current load they are under to Herald, which in turn tells HAProxy which server to choose. If all servers are loaded we scale out; if all servers are under-loaded we scale in. (A sketch of a per-server load endpoint follows.)
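
    Herald itself is internal, but the per-server side of the loop can be as simple as an HTTP endpoint reporting the server's current load. A sketch with http-kit; the metric names, port, and JSON shape are assumptions, not Herald's actual protocol:

      (ns dirigent.load-report-sketch
        (:require [org.httpkit.server :as http]
                  [cheshire.core :as json]))

      ;; counters the WebSockets server would keep updated as it works
      (defonce connection-count (atom 0))
      (defonce messages-out-rate (atom 0))

      ;; endpoint a feedback balancer like Herald could poll
      (defn load-handler [_request]
        {:status  200
         :headers {"Content-Type" "application/json"}
         :body    (json/generate-string {:connections       @connection-count
                                         :messages-out-rate @messages-out-rate})})

      (defn -main []
        (http/run-server load-handler {:port 9777}))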
  52. Traffic patterns • Support agents work in shifts, so there are sudden

    rises in activity • Activity is 24x7, because most support teams run multiple shifts • Usually no sudden increase in traffic, unless there is a glaring bug in an app
  53. Goals • Decide instance types for the cluster • Set

    thresholds for Auto Scaling • Repeatable setup
  54. Benchmark Setup • Replay subscribes and publishes • Control over

    the rate and scale of: • Subscribes • Publishes • Repeatable setup • Record and replay
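
    A toy version of such a replay driver, opening n subscriber connections with the gniazdo WebSocket client library and counting everything received; the URL, topic, and counting scheme are assumptions:

      (ns dirigent.bench-sketch
        (:require [gniazdo.core :as ws]))

      (defonce received (atom 0))

      (defn spawn-subscriber [url topic]
        (let [sock (ws/connect url :on-receive (fn [_msg] (swap! received inc)))]
          ;; speak the consumer protocol from slide 12
          (ws/send-msg sock (str "[\"subscribe\", \"" topic "\"]"))
          sock))

      ;; e.g. (run-bench "ws://localhost:8080" "topic_foo" 1000)
      (defn run-bench [url topic n]
        (doall (repeatedly n #(spawn-subscriber url topic))))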
  55. Benchmark results • Concrete threshold numbers found for various

    loads • Number of publishes per second • Number of subscribers per machine, with each subscriber receiving a certain number of messages • Instance type: CPU-heavy with moderate memory; no disk usage
  56. Reliability • Dirigent proxies produce dummy messages to verify messages

    are being received on the WebSockets servers • Each Dirigent server exposes a health check; if it fails the health check, it's taken out of rotation. The health check verifies that each external dependency is healthy, as well as the service itself • Each WebSocket client expects a periodic ping from the server; if it doesn't receive a ping, it assumes the connection is not active and starts a new connection
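
    The server side of that last point could look like the loop below; the channel registry and interval are assumptions, and a real server would also track missed pongs:

      (ns dirigent.ping-sketch
        (:require [org.httpkit.server :as http]))

      ;; assumed registry of open channels, added on connect, removed on close
      (defonce open-channels (atom #{}))

      ;; periodically send an application-level ping on every open channel;
      ;; clients that miss a ping assume the connection is dead and reconnect
      (defn start-ping-loop! [interval-ms]
        (future
          (loop []
            (doseq [ch @open-channels]
              (http/send! ch "[\"ping\"]"))
            (Thread/sleep interval-ms)
            (recur))))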
  57. Future • Benchmarking tool repurposed as a verification tool for the production

    system; it treats the system as a black box and verifies availability • Latency of the round trip of a ping from WebSockets server to browser, with a drill-down of this data per subdomain to address on-call reports • Evaluation of NanoMsg
  58. Summary • Building a WebSockets infrastructure on EC2 is

    possible, but it has quirks • Use feedback load balancing for WebSockets / long-running connection traffic • ZMQ and the JVM are solid building blocks for a realtime pubsub platform • Instrumentation at multiple stages of the platform is a good way to keep track of a real-time system
  59. Fin