Slide 1

Slide 1 text

Building and Scaling a WebSockets PubSub System Kapil Reddy @ Helpshift

Slide 2

Slide 2 text

About me - Kapil, Staff Engineer @ Helpshift. Clojure, Distributed Systems, Games, Music, Books/Comics, Football

Slide 3

Slide 3 text

Overview • Use case and Scale • Overview of PubSub platform • Evolution of platform architecture • Scaling issues and solutions

Slide 4

Slide 4 text

Helpshift is a mobile CRM SaaS product. We help connect app developers with their customers, since everything is now on mobile.

Slide 5

Slide 5 text

Scale • ~2 TB data broadcast / day • Outgoing - 75k msg/sec • Incoming - 1.5k msg/sec • Concurrency - 3.5k Here are some scale numbers for the platform we have built.

Slide 6

Slide 6 text

Mobile SDK • In-app customer support • Creates a one-to-one channel between app user and app owner • ~1 billion unique app installs

Slide 7

Slide 7 text

Once a ticket is filed it shows up in the agent’s portal, which is a Helpshift web app.

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Topics and Messages The parts lit up in green are all the parts that need to be updated dynamically.

Slide 10

Slide 10 text

PubSub Platform We built a generic Publish and Subscribe platform. Subscribers of these messages are JavaScript clients listening on a WebSockets connection, and publishers are any backend servers using ZMQ to publish the messages. We call this platform Dirigent.

Slide 11

Slide 11 text

A simplified version of the platform’s architecture. Again, browsers (subscribers) connect to Dirigent using WebSockets and backend servers (publishers) connect to Dirigent using ZMQ. It’s a simplified view right now.

Slide 12

Slide 12 text

WebSockets • Bi-directional transport between client and server • Supports text or binary data • The protocol starts as an HTTP connection and is upgraded to a WebSockets connection

Slide 13

Slide 13 text

ZMQ • ZeroMQ - a brokerless messaging queue • Supports IPC, TCP, Multicast as transport layers • Tiny library that lets one use different messaging patterns like PubSub and Request-Response • Network is the only overhead affecting performance

Slide 14

Slide 14 text

Consumer Protocol • Transport - WebSockets • Inspired by the WAMP WebSockets subprotocol • Implements methods like, • Subscribe • Ex. [“subscribe”, “topic_foo”] • Unsubscribe • Ex. [“unsubscribe”, “topic_foo”] This is how the consumer protocol looks; it is built on top of the WAMP WebSockets subprotocol. It does not completely implement WAMP, since WAMP also has the ability to do RPC, but the PubSub part is heavily inspired by the WAMP spec.
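To make the subscribe/unsubscribe flow concrete, here is a minimal sketch of a server-side WebSockets handler that parses such frames, assuming http-kit and Cheshire. The actual Dirigent handler is not shown in the talk, so the names and structure below are illustrative only.

```clojure
(ns dirigent.consumer-sketch
  "Illustrative sketch, not the actual Dirigent implementation."
  (:require [org.httpkit.server :as http]
            [cheshire.core :as json]))

;; topic -> set of open WebSocket channels subscribed to it
(def subscriptions (atom {}))

(defn broadcast!
  "Push a payload to every channel currently subscribed to topic."
  [topic payload]
  (doseq [ch (get @subscriptions topic)]
    (http/send! ch payload)))

(defn ws-handler [req]
  (http/with-channel req ch
    (http/on-receive ch
      (fn [frame]
        ;; frames look like ["subscribe" "topic_foo"] / ["unsubscribe" "topic_foo"]
        (let [[op topic] (json/parse-string frame)]
          (case op
            "subscribe"   (swap! subscriptions update topic (fnil conj #{}) ch)
            "unsubscribe" (swap! subscriptions update topic disj ch)
            nil))))
    (http/on-close ch
      (fn [_status]
        ;; drop the channel from every topic it was subscribed to
        (swap! subscriptions
               (fn [subs] (into {} (for [[t chs] subs] [t (disj chs ch)]))))))))

(defn -main []
  (http/run-server ws-handler {:port 8080}))
```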

Slide 15

Slide 15 text

Producer Protocol • ZMQ • Implements publish • Ex. [“topic_foo”, {unix_ts}, {JSON payload}] The publisher connects to Dirigent by means of ZMQ. The underlying connection is TCP.
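A minimal producer sketch, assuming the JeroMQ bindings and Cheshire. The slide does not say whether Dirigent sends the topic, timestamp and payload as one frame or as a multipart message; the sketch uses multipart with the topic first so that ZMQ prefix matching can route it, and the proxy address is a placeholder.

```clojure
(ns dirigent.producer-sketch
  (:require [cheshire.core :as json])
  (:import [org.zeromq ZMQ]))

(defn publish-bar-on-foo!
  "Publish one message on topic_foo via a PUB socket (hypothetical proxy address)."
  []
  (let [ctx (ZMQ/context 1)
        pub (.socket ctx ZMQ/PUB)]
    (.connect pub "tcp://dirigent-proxy:5556")
    (Thread/sleep 200)                               ;; let the connection settle before publishing
    (.sendMore pub "topic_foo")                      ;; topic frame, used for ZMQ prefix matching
    (.sendMore pub (str (quot (System/currentTimeMillis) 1000))) ;; unix timestamp
    (.send pub (json/generate-string {:msg "bar"}))  ;; JSON payload
    (.close pub)
    (.term ctx)))
```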

Slide 16

Slide 16 text

Give me topic “foo”

Slide 17

Slide 17 text

Give me topic “foo” Send message “bar” on topic “foo”

Slide 18

Slide 18 text

bar

Slide 19

Slide 19 text

bar

Slide 20

Slide 20 text

Stop sending me topic “foo”

Slide 21

Slide 21 text

bar Nothing happens this time around

Slide 22

Slide 22 text

bar Topics for different customers [“cust1.tickets”,”cust1.feed”] [“cust2.tickets”,”cust2.feed”] Nothing happens this time around

Slide 23

Slide 23 text

Think of each green box as a different topic

Slide 24

Slide 24 text

Evolution

Slide 25

Slide 25 text

In v1 of the platform we used different transport mechanisms: HTTP streaming for delivering messages to browsers, and HTTP to deliver messages to the Dirigent servers. The HTTP mechanism posed problems and had a coupling effect with the backend servers: whenever the Dirigent platform went down under load, the HTTP connections timed out and created a cascading failure in the backend servers. We switched to ZMQ there.

Slide 26

Slide 26 text

HTTP

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

X_X Whenever something went wrong with the Dirigent cluster, the POST requests the API servers were making to broadcast messages used to time out.

Slide 29

Slide 29 text

X_X Queued Publish “bar” on topic “foo” Whenever something went wrong with the Dirigent cluster, the POST requests the API servers were making to broadcast messages used to time out.

Slide 30

Slide 30 text

X_X Queued 20 publishes Whenever something went wrong with the Dirigent cluster, the POST requests the API servers were making to broadcast messages used to time out.

Slide 31

Slide 31 text

X_X X_X Eventually the backend servers started to conk off.

Slide 32

Slide 32 text

X_X X_X X_X Eventually the backend servers started to conk off. Instead of HTTP we started using ZMQ as the delivery mechanism. It creates loose coupling, so there are no cascading failures with the application servers.

Slide 33

Slide 33 text

Problems with HTTP producers • Cascading failures due to queued requests • Manual fanout to all Dirigent servers

Slide 34

Slide 34 text

ZMQ • De-couples producer (backend server) and consumer (WebSockets server) • Retries and buffering are handled out of the box by the library • Topic broadcast is supported out of the box, so manual broadcast is not required

Slide 35

Slide 35 text

HTTP Streaming Now, this is one side of the story. There were problems on the consumer side as well.

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

[“cust1”] [“cust2”]

Slide 38

Slide 38 text

So the whole dashboard becomes a single topic.

Slide 39

Slide 39 text

[“cust1”] Cust 1 - Ticket 100 Looking at Ticket 1

Slide 40

Slide 40 text

[“cust1”] Cust 1 - Ticket 101 Looking at Ticket 1

Slide 41

Slide 41 text

[“cust1”] Cust 1 - Ticket 103 Looking at Ticket 1

Slide 42

Slide 42 text

[“cust1”] Cust 1 - Ticket 500 Looking at Ticket 1

Slide 43

Slide 43 text

Cust 1 - Ticket 500 Staph!!!

Slide 44

Slide 44 text

Cust 1 - Ticket 500 Staph!!! X_X

Slide 45

Slide 45 text

X_X X_X Since each HTTP connection gets everything that is generated for a subdomain, the connection starts choking. We quickly ran out of the network card limit for a single machine. On a few bad days we saw client-side bandwidth running out as well.

Slide 46

Slide 46 text

Problems with HTTP streaming • The request for a topic is one-time; to choose a new set of topics the client needs to make a new request • Modelling topics becomes difficult, so you end up creating a giant topic for one subdomain/customer • Clients receive everything and the WebSockets server has to push everything • The server network bandwidth limit (~120 MBps) runs out • Client bandwidth is limited as well

Slide 47

Slide 47 text

Meaning: subscribe to everything, forever. The problem with this is that you quickly run out of the network card limit on a single machine, and we did. To solve this we had to move to a WebSockets connection where the client can ask for only the things it wants and nothing more.

Slide 48

Slide 48 text

So this is how we eventually moved to using WebSockets for subscriptions.

Slide 49

Slide 49 text

WebSockets • The bi-directional protocol lets the client pick and choose only the topics required for rendering the UI • Network traffic is hugely reduced • Creating more granular topics is possible

Slide 50

Slide 50 text

Architecture

Slide 51

Slide 51 text

ZMQ XPUB The part highlighted is the XSUB pattern of ZeroMQ, where many producers connect to one of the Dirigent proxy servers.
Every backend server connects to one Dirigent proxy server. Even if there are more servers, the backend just needs to connect to one; the load balancer takes care of this.

Slide 52

Slide 52 text

ZMQ XSUB Every Dirigent server connects to every Dirigent proxy server, since each proxy server has only a subset of the received messages.

Slide 53

Slide 53 text

Just to zoom in on what I mean by 1-N connections: every Dirigent WebSocket server connects to every Dirigent proxy server. It discovers all active proxy servers through ZooKeeper. This is the XPUB part of ZMQ.
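As an illustration of that 1-N fan-in on the subscriber side, here is a minimal sketch of a SUB socket connecting to several proxies and subscribing to a topic prefix, assuming JeroMQ. The addresses are placeholders; in the real system they would be discovered via ZooKeeper.

```clojure
(ns dirigent.ws-subscriber-sketch
  (:import [org.zeromq ZMQ]))

(defn connect-to-proxies
  "Open one SUB socket, connect it to every known proxy address and
   subscribe to a topic prefix. Addresses are passed in explicitly here."
  [proxy-addrs topic-prefix]
  (let [ctx (ZMQ/context 1)
        sub (.socket ctx ZMQ/SUB)]
    (doseq [addr proxy-addrs]
      (.connect sub addr))
    (.subscribe sub (.getBytes topic-prefix))
    sub))

;; e.g. (def sub (connect-to-proxies ["tcp://proxy-1:5557" "tcp://proxy-2:5557"] "cust1."))
;; then loop over (.recvStr sub) and push each frame out to WebSocket subscribers
```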

Slide 54

Slide 54 text

ZMQ XSUB, ZMQ XPUB. This is how the high-level architecture looks. If there are no subscribers for a ZMQ topic, the producers drop the messages for that topic.

Slide 55

Slide 55 text

ZMQ XPub-XSub • Broadcast of messages on a particular topic is handled out of the box • Supports back-propagation of subscriptions to the ZMQ producers: if there are no subscribers, message production does not happen, saving network hops
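The proxy in the middle is essentially ZMQ's built-in XPUB/XSUB forwarder. A minimal sketch assuming JeroMQ; ports are illustrative.

```clojure
(ns dirigent.proxy-sketch
  (:import [org.zeromq ZMQ]))

(defn run-proxy
  "XSUB side faces the producers (PUB sockets); XPUB side faces the
   WebSocket servers (SUB sockets). ZMQ/proxy forwards messages downstream
   and subscriptions upstream, which is what lets producers skip topics
   nobody is listening to."
  []
  (let [ctx  (ZMQ/context 1)
        xsub (.socket ctx ZMQ/XSUB)
        xpub (.socket ctx ZMQ/XPUB)]
    (.bind xsub "tcp://*:5556")
    (.bind xpub "tcp://*:5557")
    (ZMQ/proxy xsub xpub nil)))   ;; blocks, shuttling traffic between the two sides
```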

Slide 56

Slide 56 text

We also have multiple clusters and they can talk to each other. They have their own, different sets of subscribers, and publishers can come from another cluster.

Slide 57

Slide 57 text

Under the hood • Clojure (JVM) • Http-kit (Java NIO based WebSockets server) • ZMQ • ZooKeeper

Slide 58

Slide 58 text

Maintenance Numbers • Production - ~4 years • Active maintainers - 1 • On-call - 1

Slide 59

Slide 59 text

Monitoring All the messages we publish are important data and need to be rendered in time. The nature of this data is ephemeral; we don’t store it anywhere, so auditing is hard. So monitoring was crucial for us.

Slide 60

Slide 60 text

Under the hood • StatsD protocol • Graphite - Storage • Grafana - Frontend • Sensu - Alerts on trends
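To give an idea of what instrumentation at each stage looks like, here is a minimal sketch of emitting a StatsD counter over UDP from Clojure. The metric name and port are illustrative, not the actual Dirigent metrics.

```clojure
(ns dirigent.statsd-sketch
  (:import [java.net DatagramSocket DatagramPacket InetAddress]))

(defn statsd-incr!
  "Send a single StatsD counter increment (\"<metric>:1|c\") over UDP."
  [host port metric]
  (let [payload (.getBytes (str metric ":1|c"))
        packet  (DatagramPacket. payload (alength payload)
                                 (InetAddress/getByName host) port)]
    (with-open [sock (DatagramSocket.)]
      (.send sock packet))))

;; e.g. (statsd-incr! "localhost" 8125 "dirigent.ws.messages.out")
```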

Slide 61

Slide 61 text

*Example of monitoring comparisons across different stages* Since auditing this kind of data is hard, we compare metrics of the data at different stages of the platform. But since the numbers are big, it’s hard to spot any anomaly. What we are looking for is variance.

Slide 62

Slide 62 text

Problems with absolute numbers • Relatively small differences are not easy to spot • Alerts need arbitrary threshold numbers, which tend to fire false positives

Slide 63

Slide 63 text

Message variance is easy to parse visually. If the variance is low, some stage of the platform is dropping data. In fact, we have also set up alerts on this same query.

Slide 64

Slide 64 text

Variance • Variance here is the percentage of messages seen in the different stages • Sensu alerts work on the same Graphite queries we use to plot in Grafana • Alerts are now reliable, since the threshold is on a percentage, not on an arbitrary number • The only non-actionable alert we get is when there is a network blip; however, this helps us understand the outage/degraded window

Slide 65

Slide 65 text

Another important metric is the time taken to publish a message to a WebSocket connection. Since the near-real-time SLA is so important, we look at p99s for anomalies. We have set up alerts on these as well.

Slide 66

Slide 66 text

Cost saving • Network bandwidth used to push messages out on the internet • Number of machines used by the cluster Costs are always a concern for us! There are two important factors that add up to the cost: outgoing bandwidth usage and the number of machines.

Slide 67

Slide 67 text

Compression First, we started using gzip compression for WebSockets. It’s a standard compression mechanism supported by browsers, but as with browsers, there are quirks here.

Slide 68

Slide 68 text

Re-visiting features • ‘x’ feature is taking too much bandwidth; can we do a subset of ‘x’? • Is ‘x’ feature adding value for the cost we are paying? • Some payloads just have extra information • Renaming field names The biggest change you can make to save costs is to re-visit the features/business logic itself and try to optimise there. This reduced the bandwidth usage by a significant amount.

Slide 69

Slide 69 text

Auto scaling To save on the number of machines used, we started investigating how to do auto scaling. Auto scaling was not a straightforward thing, since all the connections are long-running and can usually stay alive for as long as 8 hours.

Slide 70

Slide 70 text

Auto scaling pre-reqs • Before scaling out, each server’s utilisation should be the maximum possible; load balancing should distribute the load evenly • No server should be overloaded, given that the increase in scale is mostly gradual

Slide 71

Slide 71 text

HAProxy with least conn We went with the obvious choice of least-connection balancing, with HAProxy doing the load balancing.

Slide 72

Slide 72 text

For a while it worked perfectly. But it had one little problem.

Slide 73

Slide 73 text

Even though, judging from connected clients, each server seems to be doing the same amount of work, it’s actually clear from the load average that it’s not.

Slide 74

Slide 74 text

Least-connection balancing works. Sometimes. Num connections != Load The problem with least-connection balancing is the assumption that the number of connections a server is handling is directly proportional to the amount of work it’s doing. This was a wrong assumption, and it led us to uneven distribution, server crashes and just bad, sleepless nights.

Slide 75

Slide 75 text

We looked at another metric: the number of messages pushed to browsers from a Dirigent WS server. But to use this we needed to do feedback load balancing.

Slide 76

Slide 76 text

Feedback load balancing Feedback load balancing is something we started to do with Herald, an internal tool we built at Helpshift. It helps HAProxy decide which server to choose when routing a new connection. All the servers expose the current load they are under to Herald, which in turn tells HAProxy which server to choose. If all servers are loaded, we scale out; if all servers are under-loaded, we scale in.
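Herald's actual interface isn't described in the talk, so here is an assumption-laden sketch of the idea: each WebSockets server counts the messages it pushes and exposes that number over a plain HTTP endpoint for the feedback balancer to poll. The names, port and choice of metric are hypothetical.

```clojure
(ns dirigent.load-report-sketch
  (:require [org.httpkit.server :as http]))

;; messages pushed out in the current window; the WS push path would swap!/inc this
(def messages-pushed (atom 0))

(defn load-handler
  "Expose this server's current load as plain text, so a feedback balancer
   (Herald in our case) can poll it and weight HAProxy routing decisions."
  [_req]
  {:status  200
   :headers {"Content-Type" "text/plain"}
   :body    (str @messages-pushed)})

(defn -main []
  (http/run-server load-handler {:port 9090}))
```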

Slide 77

Slide 77 text

We can see almost similar patterns.

Slide 78

Slide 78 text

Traffic patterns • Support agents work in shifts, so there are sudden rises in activity • Activity is 24x7, because most support teams have multiple shifts • Usually no sudden increase in traffic, unless there is a glaring bug in an app

Slide 79

Slide 79 text

Benchmarking

Slide 80

Slide 80 text

Goals • Decide instance types for the cluster • Set thresholds for Auto Scaling • Repeatable setup

Slide 81

Slide 81 text

Benchmarking data • Produced messages • Payloads • Rate • Subscribe requests • Payloads • Rate

Slide 82

Slide 82 text

Collect topic subscribes

Slide 83

Slide 83 text

Collect publishes

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

Benchmark Setup • Replay subscribes and publishes • Control over the rate and scale of • Subscribes • Publishes • Repeatable setup • Record and replay

Slide 86

Slide 86 text

Benchmark results • Concrete threshold numbers found for various loads • Number of publishes per second • Number of subscribers per machine, with each subscriber receiving a certain number of messages • Instance type required: CPU-heavy with moderate memory; no disk usage

Slide 87

Slide 87 text

Reliability • Dirigent proxies produce dummy messages to verify messages are being received on the WebSockets servers. • Each Dirigent server exposes a health check; if it fails the health check, it’s taken out of rotation. The health check verifies that each external dependency is healthy, as well as the service itself. • Each WebSocket client expects a periodic ping from the server. If it doesn’t receive a ping, it assumes the connection is not active and starts a new connection.
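For the last point, here is a minimal sketch of the server-side half of that heartbeat, assuming http-kit: a scheduled task pings every open channel so that clients which stop hearing pings can reconnect. The interval and the ping frame format are illustrative, not the actual Dirigent protocol.

```clojure
(ns dirigent.ping-sketch
  (:require [org.httpkit.server :as http])
  (:import [java.util.concurrent Executors TimeUnit]))

;; all currently open WebSocket channels (populated by the connection handler)
(def channels (atom #{}))

(defn start-pinger!
  "Every 30 seconds, send an application-level ping frame to every open
   channel; clients that stop receiving pings tear down and reconnect."
  []
  (let [scheduler (Executors/newSingleThreadScheduledExecutor)]
    (.scheduleAtFixedRate scheduler
                          (fn [] (doseq [ch @channels] (http/send! ch "[\"ping\"]")))
                          30 30 TimeUnit/SECONDS)))
```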

Slide 88

Slide 88 text

Future • The benchmarking tool repurposed as a verification tool for the production system; it treats the system as a black box and verifies availability • Round-trip latency of a ping from WebSocket server to browser, with a drill-down of this data per subdomain to address on-call reports • Evaluation of NanoMsg

Slide 89

Slide 89 text

Summary • Building a WebSockets infrastructure on EC2 is possible, but it has quirks • Use feedback load balancing for WebSockets / long-running connection traffic • ZMQ and the JVM are solid building blocks for building a realtime PubSub platform • Instrumentation at multiple stages of the platform is a good way to keep track of a real-time system

Slide 90

Slide 90 text

Fin