Building and Scaling a WebSockets Pubsub System

Building and Scaling a WebSockets PubSub System Kapil Reddy @
Helpshift

About me - Kapil Staff Engineer @ Helpshift Clojure Distributed
Systems Games Music Books/Comics Football

Overview • Use case and Scale • Overview of PubSub
platform • Evolution of platform architecture • Scaling issues and solutions

Helpshift is a Mobile CRM SaaS product. We help connect
app developers with their customers. Since everything is now on mobile.

Scale • ~2 TB data broadcast / day • Outgoing
- 75 k msg/sec • Incoming - 1.5 k msg/sec • Concurrency - 3.5k Here are some scale numbers for the Platform we have built

Mobile SDK •In-app customer support •Creates a one to one
channel between app user and app owner • ~1 Billion unique app installs

Once a ticket is ﬁled it shows up in agent’s
portal which is a Helpshift web app.

Topics and Messages The part lit up in green all
the parts that require to be updated dynamically.

PubSub Platform We built a generic Pubish and Subscribe platform.
Subscribers of these messages are Javascript clients listening on Websockets connection and Publishers are any backend server using ZMQ to publish the messages. We call this platform dirigent.

A simpliﬁed version of the platform’s architecture. Again browsers (Subscribers)
connect to Dirigent using WebSockets and Backend servers (Publishers) connect to Diligent using ZMQ. It’s a simpliﬁed view right now.

WebSockets • Bi-Direction transport between client and server • Supports
text or binary data • Protocol starts with HTTP connection and upgraded to a WebSockets connection

ZMQ • Zero Messaging broker Queue • Supports IPC, TCP,
Multicast as transport layer • Tiny library that let’s one use different messaging patterns like PubSub, Request-Response • Network is the only overhead affecting performance

Consumer Protocol • Transport - WebSockets • Inspired by WAMP
WebSockets sub protocol • Implements methods like, • Subscribe • Ex. [“subscribe”, “topic_foo”] • Unsubscribe • Ex. [“unsubscribe”, “topic_foo”] This is how consumer protocol looks like which is built on top of a WebSockets subprotocol WAMP. It does not completely implement WAMP since it also has abilities to do RPC. But the PubSub part is heavily inspired by WAMP sepc.

Producer Protocol • ZMQ • Implements publish • Ex. [“topic_foo”,
{unix_ts}, {JSON payload}] Publisher connects to Dirigent by means of ZMQ. Underlying connection is TCP .

Give me topic “foo”

Give me topic “foo” Send message “bar” on topic “foo”

Stop sending me topic “foo”

bar Nothing happens this time around

bar Topics for different customers [“cust1.tickets”,”cust1.feed”] [“cust2.tickets”,”cust2.feed”] Nothing happens this
time around

Think of each green box as a diﬀerent topic

Evolution

v1 of the platform we used diﬀerent transport mechanism. HTTP
streaming for delivering messages to browsers and HTTP to deliver messages to Dirigent servers. HTTP mechanism posed problem and it had coupling eﬀect with backend server. Whenever dirigent platform went down due to load the HTTP connections timed out and created a cascading failure in backend servers. We switched ZMQ there.

X_X Whenever something went wrong with Dirigent cluster. Since the
API servers were making Post requests to broadcast to messages the request used to time out.

X_X Queued Publish “bar” on topic “foo” Whenever something went
wrong with Dirigent cluster. Since the API servers were making Post requests to broadcast to messages the request used to time out.

X_X Queued 20 publishes Whenever something went wrong with Dirigent
cluster. Since the API servers were making Post requests to broadcast to messages the request used to time out.

X_X X_X Eventually backend servers started to conk oﬀ.

X_X X_X X_X Eventually backend servers started to conk oﬀ.
Instead of HTTP we started using ZMQ as delivery mechanism. It creates loose coupling so there is no cascading failures with application server.

Problems with HTTP producers • Cascading failures due to queued
requests • Manual fanout to all dirigent servers

ZMQ • De-couples producer (Backend server) and consumer (WebSockets server)
• Retries, buffering is done out of the box so library handles that • Topic broadcast is supported out of the box so manual broadcast is not required

HTTP Streaming Now this is one side of the story.
There were problems at consumer side as well.

[“cust1”] [“cust2”]

So whole dashboard becomes a single topic.

[“cust1”] Cust 1 - Ticket 100 Looking at Ticket 1

Cust 1 - Ticket 500 Staph!!!

Cust 1 - Ticket 500 Staph!!! X_X

X_X X_X Since each http connection will get everything that
is generated for a subdomain. The connection will start choking. We quickly run out of network card limit for a single machine. On few bad days we saw client-side bandwidth was running out as well.

Problems with HTTP streaming • Request for a topic is
one time and to choose new set of topics client needs to make a new request • Modelling topics becomes difﬁcult so you end up creating a giant topic for one subdomain/customer • Clients receive everything and WebSockets server has to push everything. • Server network bandwidth limit (~120MBps) runs out • Client bandwidth is limited as well

Meaning subscribe to everything forever. Problem with this is you
quickly run out of network card limit on a single machine. And we did. To solve this we had to move to a web sockets connection where client can ask for only things it wants and nothing more.

So this is how we eventually moved to using WebSockts
for subscription

WebSockets • Bi-Directional protocol let’s client pick and choose topics
only required for rendering the UI. • Network trafﬁc is hugely reduced • Creating more granular topics is possible

Architecture

ZMQ XPUB The part highlighted is XSUB pattern of zeroMQ
where many producer connect to one of the diligent proxy server.  Every backend server connects to one diligent proxy server. Even if there are more server the backend just needs to connect to one. The load balancer takes care of this

ZMQ XSUB Every diligent server connects to every diligent proxy
server. Since all proxy servers have subset of received messages.

Just to zoom on what I mean by 1-N connections.
Every diligent web socket server connects to every diligent proxy server. The way it discovers all active proxy servers is through zookeeper. This is the XPUB part of ZMQ

ZMQ  XSUB ZMQ  XPUB This is how the high level
architecture looks like. If there are no subscribers for a ZMQ topic the producers drop the message for a particular topic.

ZMQ XPub-XSub • Broadcast of messages on a particular topic
is handled out of the box • Support back propagation of subscribers to ZMQ producers. If there are no subscribers message production does not happen saving network hops

We also we have multiple clusters and they can talk
to each other. They have their diﬀerent set of subscribers. Publishers can come from another cluster.

Under the hood • Clojure (JVM) • Http-kit (Java NIO
based web sockets server) • ZMQ • Zookeeper

Maintenance Numbers • Production - ~4 years • Active maintainers
- 1 • On-call - 1

Monitoring All the messages we are publishing is important data
and needs to rendered in time. The nature of this data is ephemeral. We don’t store it anywhere so auditing is hard. So utilising monitoring was crucial for us.

Under the hood • StatsD protocol • Graphite - Storage
• Grafana - Frontend • Sensu - Alerts on trends

*example of monitoring comparison different stages* Since auditing this kind
of data is hard. We compare metrics of data in diﬀerent stages of the platform. But since the numbers are big it’s hard to spot any anomaly. What we are looking for is variance.

Problem with absolute numbers • Relatively small differences are not
easy to spot • Alerts will need to have arbitrary numbers which tend to ﬁre false positives

Message variance is easy to parse visually. If variance is
low some stage of the platform is dropping data. In fact we also have setup alerts on this same query.

Variance • Variance is percentage of message in different stages
• Sensu alerts work on the same Graphite queries we use to plot in Grafana. • Alerts are now reliable since the threshold is on a percentage not on a arbitrary number • Only non-actionable alert we get is when there is a network blip. However this helps us understand the outage/degraded window.

Another important metric is time taken to publish a message
to WebSocket connection. Since near real time SLA is so important we look at p99s for anomalies. We have setup alerts on these as well.

Cost saving • Network bandwidth used to push messages out
on internet • Number of machines used by cluster Costs are a concern for us always! There are two important factors that add up to the cost. Outgoing bandwidth usage and number of machines

Compression First we started using gzip compression for websockets. It’s
a standard compression mechanism supported by browsers but as with browsers there are quirks here.

Re-visiting features • ‘x’ feature is taking too much bandwidth
can we do subset of ‘x’ feature. • Is ‘x’ feature adding value for the cost we are paying? • Some payloads have just extra information • Renaming ﬁeld names Biggest change you can do to save costs is to re-visit the features/business logic itself and try to optimise there. This reduced the bandwidth usage by signiﬁcant amount.

Auto scaling To save up on number of machines used.
We started investigating in how to do auto scaling. Auto scaling was not a straight forward thing since all the connections are long running and usually can stay alive for as long as 8 hours.

Auto scaling pre-reqs • Before scaling out each server’s utilisation
should be maximum possible. Load balancing should distribute the load evenly • No server should be overloaded given the increase in scale is mostly gradual

HAProxy with least conn We went with the obvious choice
of least connection with HAProxy doing the load balancing.

For a while it worked perfectly. But it one little
problem.

Even though from connected clients it seems like doing the
same amount of work. It’s actually clear from load average that it’s not.

Least load connection works. Sometimes. Num connections != Load The
problem with least load connection is assumption that number of connections a server is handling is directly proportional to amount of work it’s doing. This was a wrong assumption and it just lead us to uneven distribution. Server crashes and just bad sleepless nights.

We looked at another metric. Number of messages pushed to
browsers from a Dirigent WS server. But to do this we needed to do feedback load balancing

Feedback load balancing Feedback load balancing is something we started
to do with Herald an internal tool we built at Helpshift. This helps HAProxy decide which server to choose when routing a new connection. All the servers can expose the current load they are under to Herald which in turns tells HAproxy which server to choose. If all servers are loaded we scale out. If all servers are under loaded we scale in.

We can same almost similar patterns

Trafﬁc patterns • Support Agents work in shifts. So sudden
rise in activity • Activity 24x7 because most of the support has multiple shifts • No sudden increase in trafﬁc usually unless there is a glaring bug in an app.

Benchmarking

Goals • Decide instance types for the cluster • Set
thresholds for Auto Scaling • Repeatable setup

Benchmarking data • Produced messages • Payloads • Rate •
Subscribe requests • Payloads • Rate

Collect topic subscribes

Collect publishes

Benchmark Setup • Replay subscribe and publishes • Control over
rate and scale of, • Subscribes • Publishes • Repeatable setup • Record and replay

Benchmark results • Concrete number of thresholds found for various
loads • Number of publishes per second • Number of subscribers per machine with per subscribers receiving a certain number of messages • Instance type which requires CPU heavy and moderate memory. No disk usage.

Reliability • Dirigent proxies produce dummy messages to verify messages
are being received on WebSockets server. • Each Dirigent server exposes a health check. If it fails the health check it’s taken out of rotation. Health check veriﬁes each external dependency is healthy as well the service itself. • Each WebSocket client expects a periodic ping from server. If it doesn’t receive a ping it assumes connection is not active and starts a new connection.

Future • Benchmarking tool repurposed as veriﬁcation tool for production
system. Treats the system as black box and veriﬁes availability • Latency of round trip of a ping from WebSocket sever to browser. Drill down of this data per subdomain to address on call reports. • Evaluation of NanoMsg

Summary • Building a web sockets infrastructure on EC2 is
possible but it has quirks • Use feedback load balancing for WebSockets / Long running connection trafﬁc • ZMQ, JVM are solid building blocks for building a realtime pubsub platform • Instrumentation in multiple stages of platform is a good way to keep track of a real time system

Building and Scaling a WebSockets Pubsub System

Building and Scaling a WebSockets Pubsub System

More Decks by kapil

Other Decks in Technology

Featured

Transcript