Scaling Node to 50 Million Requests

Andy Kent
October 22, 2011

Presented with Ryan Greenhall at nodejsconf.it

This talk discusses some of the additional tooling Forward have employed to allow us to deploy node applications at large scale with very high availability.

Transcript

  1. Introduction • Who are Forward? • Story of how we became node adopters • How did we scale? • Production statistics and monitoring • Open source tooling used • Open source contributions

  2. Search Marketing • Custom redirect/click tracking service • The reliable collection of click data is essential for affiliates • Need to be web scale ;) • High volume – low latency

  3. Introducing the Redirect Service • Dynamically redirect users to the best pages based on criteria • Record click interactions to disk • Aim for consistently low latencies worldwide • ~50 million req/day

  4. A long time ago in a galaxy far away… • Our redirect service was originally written in Ruby • Ruby process-based scale-out is harsh • Queue based architecture allowed us to scale quite well

  5. Ruby Architecture (diagram) • EC2 ELB • Instances 1A-1, 1A-2, 1B-1, 1B-2 • Queue A, Queue B • Logger A, Logger B • Data Storage

  6. Problems with the Ruby Implementation • Queues added complexity • Several points of failure • Erratic latency under heavy load

  7. The Query Tracker • About 18 months ago we had the opportunity to place JS on the landing page of a major client • Allowed us to track user actions of interest • Needed a way to track and persist the events

  8. The Query Tracker • Web service that writes the parameters passed to disk • This was a perfect match for node's non-blocking IO • No need for queues • Deployed to the same machines as the redirect service • Average latency was halved! (see the sketch below)

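A minimal sketch of the Query Tracker idea, not Forward's actual code; the file name and port are made up. Each request's query parameters are handed to a non-blocking write stream and the response goes out immediately, so no queue is needed:

      // Sketch only: log each request's query parameters with node's
      // non-blocking IO and acknowledge straight away.
      var http = require('http');
      var url  = require('url');
      var fs   = require('fs');

      var log = fs.createWriteStream('events.log', { flags: 'a' });

      http.createServer(function(req, res) {
        var params = url.parse(req.url, true).query;
        // write() returns before the disk write completes, keeping the event loop free
        log.write(JSON.stringify({ ts: Date.now(), params: params }) + '\n');
        res.writeHead(204);
        res.end();
      }).listen(8080);
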
  9. Redirect Rewrite • Learnings from the Query Tracker applied • Async disk IO replaced queues • Machine count halved (sketch below)

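A similarly rough sketch of the rewritten redirect handler: the click is recorded with an async write and the user is redirected straight away, with no queue in between. lookupTarget, the file name and the port are hypothetical stand-ins, not the real routing rules:

      var http = require('http');
      var fs   = require('fs');

      var clicks = fs.createWriteStream('clicks.log', { flags: 'a' });

      // Hypothetical stand-in for the real "best page" routing criteria.
      function lookupTarget(path) {
        return 'http://www.example.com' + path;
      }

      http.createServer(function(req, res) {
        clicks.write(Date.now() + ' ' + req.url + '\n'); // async click record
        res.writeHead(302, { Location: lookupTarget(req.url) });
        res.end();
      }).listen(8080);
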
  10. Examples of System Behaviour

      describe("r1 redirect", function() {
        it("Redirects to correct site", function() {
          get("http://google.com/", {}, function(response) {
            redirect = response.headers.location;
            expect(redirect).toEqual("http://www.google.com/");
          });
        });
      });

  11. Node is awesome, but I miss Ruby • CoffeeScript to the rescue • Reduced code base by 1/3 • But ...

  12. Examples of System Behaviour

      describe "r1 redirect", ->
        it "Redirects to correct site", ->
          get "http://google.com/", {}, (response) ->
            redirect = response.headers.location
            expect(redirect).toEqual("http://www.google.com/")

  13. Deployment Strategy • Blue-green deployment • Zero downtime • No staging environment • Deploy direct to production

  14. Any Given Sunday

      Region         Requests per second   Latency (s)
      EU             11,000                0.006
      US East        3,500                 0.005
      US West        5,000                 0.007
      Asia Pacific   3,000                 0.006

      32 million requests per day • 60 GB of web log data per day

  15. Monitoring • UltraDNS monitoring • ELB automatic failover • EC2 CloudWatch • Local + end-to-end probing • Airbrake • Realtime stream monitoring

  16. Screen Manager • Remote control for external displays • Powering 16 x 24” displays • Supports programming and scheduling • iPad remote control • Available publicly soon

  17. Data Processing • Realtime stream: NodeTail > ZeroMQ > Esper • Archival data: logrotate > SCP > Hadoop/Hive (sketch of the realtime leg below)

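A sketch of the realtime leg of that pipeline, assuming the `zmq` npm module and an arbitrary port and log path: follow the web log and publish each new line on a ZeroMQ PUB socket for a downstream consumer such as Esper (partial-line buffering is omitted for brevity):

      var spawn = require('child_process').spawn;
      var zmq   = require('zmq');

      var pub = zmq.socket('pub');
      pub.bindSync('tcp://*:5555');

      // `tail -F` keeps following the log across rotations
      var tail = spawn('tail', ['-F', '/var/log/redirector/access.log']);
      tail.stdout.on('data', function(chunk) {
        chunk.toString().split('\n').forEach(function(line) {
          if (line.length > 0) pub.send(line); // publish each log line
        });
      });
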
  18. Creek • Realtime aggregation on unbounded data streams • Allowed us to build dashboard displays pulling from vast streams of data • https://github.com/andykent/creek

  19. Creek Aggregators • count.alltime, count.timeboxed • distinct.alltime, distinct.timeboxed • max.alltime, max.timeboxed • mean.alltime, mean.timeboxed • min.alltime, min.timeboxed • sum.alltime, sum.timeboxed • popular.timeboxed • recent.limited (illustration below)

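To make the names concrete, here is a rough illustration of what a count.timeboxed-style aggregator does; this is not Creek's actual API, just the idea of counting per time bucket and summing the buckets still inside the window:

      function TimeboxedCount(windowMs, bucketMs) {
        this.windowMs = windowMs;
        this.bucketMs = bucketMs;
        this.buckets  = {}; // bucket start time -> count
      }

      TimeboxedCount.prototype.record = function() {
        var bucket = Math.floor(Date.now() / this.bucketMs) * this.bucketMs;
        this.buckets[bucket] = (this.buckets[bucket] || 0) + 1;
      };

      TimeboxedCount.prototype.value = function() {
        var cutoff = Date.now() - this.windowMs, total = 0;
        for (var b in this.buckets) {
          if (+b >= cutoff) total += this.buckets[b];
          else delete this.buckets[b]; // expire buckets outside the window
        }
        return total;
      };

      // e.g. clicks in the last minute, in 5 second buckets
      var clicksPerMinute = new TimeboxedCount(60 * 1000, 5 * 1000);
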
  20. Summary • Node is production ready • Simple architectures scale with ease • Blue/Green deploys avoid downtime • DNS routing can help