Scaling Node to 50 Million Requests

Andy Kent
October 22, 2011

Presented with Ryan Greenhall at nodejsconf.it

This talk discusses some of the additional tooling Forward have employed to allow us to deploy node applications at large scale with very high availability.

Transcript

  1. Introduction • Who are Forward? • Story of how we became node adopters • How did we scale? • Production statistics and monitoring • Open source tooling used • Open source contributions

  2. Search Marketing • Custom redirect/click tracking service • The reliable collection of click data is essential for affiliates • Need to be web scale ;) • High volume – low latency

  3. Introducing the Redirect Service • Dynamically redirect users to the best pages based on criteria • Record click interactions to disk • Aim for consistently low latencies worldwide • ~50 million req/day

  4. A long time ago in a galaxy far away… • Our redirect service was originally written in Ruby • Ruby process-based scale-out is harsh • Queue based architecture allowed us to scale quite well

  5. Ruby Architecture (diagram) • EC2 ELB • Instances 1A-1, 1A-2, 1B-1, 1B-2 • Queue A, Queue B • Logger A, Logger B • Data Storage

  6. Problems with the Ruby Implementation • Queues added complexity • Several points of failure • Erratic latency under heavy load

  7. The Query Tracker • About 18 months ago we had the opportunity to place JS on the landing page of a major client • Allowed us to track user actions of interest • Needed a way to track and persist the events

  8. The Query Tracker • Web service that writes the parameters passed to disk • This was a perfect match for node's non-blocking IO • No need for queues • Deployed to the same machines as the redirect service • Average latency was halved! (see the sketch below)

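A minimal sketch of the Query Tracker idea, not Forward's actual code; the file name and port are made up. Each request's query parameters are handed to a non-blocking write stream and the response goes out immediately, so no queue is needed:

      // Sketch only: log each request's query parameters with node's
      // non-blocking IO and acknowledge straight away.
      var http = require('http');
      var url  = require('url');
      var fs   = require('fs');

      var log = fs.createWriteStream('events.log', { flags: 'a' });

      http.createServer(function(req, res) {
        var params = url.parse(req.url, true).query;
        // write() returns before the disk write completes, keeping the event loop free
        log.write(JSON.stringify({ ts: Date.now(), params: params }) + '\n');
        res.writeHead(204);
        res.end();
      }).listen(8080);
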
  9. Redirect Rewrite • Learnings from the Query Tracker applied • Async disk IO replaced queues • Machine count halved (sketch below)

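A similarly rough sketch of the rewritten redirect handler: the click is recorded with an async write and the user is redirected straight away, with no queue in between. lookupTarget, the file name and the port are hypothetical stand-ins, not the real routing rules:

      var http = require('http');
      var fs   = require('fs');

      var clicks = fs.createWriteStream('clicks.log', { flags: 'a' });

      // Hypothetical stand-in for the real "best page" routing criteria.
      function lookupTarget(path) {
        return 'http://www.example.com' + path;
      }

      http.createServer(function(req, res) {
        clicks.write(Date.now() + ' ' + req.url + '\n'); // async click record
        res.writeHead(302, { Location: lookupTarget(req.url) });
        res.end();
      }).listen(8080);
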
  10. Examples of System Behaviour

      describe("r1 redirect", function() {
        it("Redirects to correct site", function() {
          get("http://google.com/", {}, function(response) {
            redirect = response.headers.location;
            expect(redirect).toEqual("http://www.google.com/");
          });
        });
      });

  11. Node is awesome, but I miss Ruby • CoffeeScript to the rescue • Reduced code base by 1/3 • But ...

  12. Examples of System Behaviour

      describe "r1 redirect", ->
        it "Redirects to correct site", ->
          get "http://google.com/", {}, (response) ->
            redirect = response.headers.location
            expect(redirect).toEqual("http://www.google.com/")

  13. Deployment Strategy • Blue-green deployment • Zero downtime • No staging environment • Deploy direct to production

  14. Any Given Sunday

      Region         Requests per second   Latency (s)
      EU             11,000                0.006
      US East        3,500                 0.005
      US West        5,000                 0.007
      Asia Pacific   3,000                 0.006

      32 million requests per day • 60 GB of web log data per day

  15. Monitoring • UltraDNS monitoring • ELB automatic failover • EC2 CloudWatch • Local + end-to-end probing • Airbrake • Realtime stream monitoring

  16. Screen Manager • Remote control for external displays • Powering 16 x 24” displays • Supports programming and scheduling • iPad remote control • Available publicly soon

  17. Data Processing • Realtime stream: NodeTail > ZeroMQ > Esper • Archival data: logrotate > SCP > Hadoop/Hive (sketch of the realtime leg below)

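A sketch of the realtime leg of that pipeline, assuming the `zmq` npm module and an arbitrary port and log path: follow the web log and publish each new line on a ZeroMQ PUB socket for a downstream consumer such as Esper (partial-line buffering is omitted for brevity):

      var spawn = require('child_process').spawn;
      var zmq   = require('zmq');

      var pub = zmq.socket('pub');
      pub.bindSync('tcp://*:5555');

      // `tail -F` keeps following the log across rotations
      var tail = spawn('tail', ['-F', '/var/log/redirector/access.log']);
      tail.stdout.on('data', function(chunk) {
        chunk.toString().split('\n').forEach(function(line) {
          if (line.length > 0) pub.send(line); // publish each log line
        });
      });
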
  18. Creek • Realtime aggregation on unbounded data streams • Allowed us to build dashboard displays pulling from vast streams of data • https://github.com/andykent/creek

  19. Creek Aggregators • count.alltime, count.timeboxed • distinct.alltime, distinct.timeboxed • max.alltime, max.timeboxed • mean.alltime, mean.timeboxed • min.alltime, min.timeboxed • sum.alltime, sum.timeboxed • popular.timeboxed • recent.limited (illustration below)

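To make the names concrete, here is a rough illustration of what a count.timeboxed-style aggregator does; this is not Creek's actual API, just the idea of counting per time bucket and summing the buckets still inside the window:

      function TimeboxedCount(windowMs, bucketMs) {
        this.windowMs = windowMs;
        this.bucketMs = bucketMs;
        this.buckets  = {}; // bucket start time -> count
      }

      TimeboxedCount.prototype.record = function() {
        var bucket = Math.floor(Date.now() / this.bucketMs) * this.bucketMs;
        this.buckets[bucket] = (this.buckets[bucket] || 0) + 1;
      };

      TimeboxedCount.prototype.value = function() {
        var cutoff = Date.now() - this.windowMs, total = 0;
        for (var b in this.buckets) {
          if (+b >= cutoff) total += this.buckets[b];
          else delete this.buckets[b]; // expire buckets outside the window
        }
        return total;
      };

      // e.g. clicks in the last minute, in 5 second buckets
      var clicksPerMinute = new TimeboxedCount(60 * 1000, 5 * 1000);
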
  20. Summary • Node is production ready • Simple architectures scale with ease • Blue/Green deploys avoid downtime • DNS routing can help