
Scaling Node at Errorception

The story of how Errorception was launched, and how it overcame its scaling challenges with Node.js. The result is what I think is a fantastic way of building large, complex apps with Node.

Rakesh Pai

October 19, 2012


  1. BACKSTORY

    First startup failed. Decided to launch a second one… in 15 days. No real backend experience.
  2. BACKSTORY

    Code was crap, no tests, no thought towards scaling. Single, monolithic node.js app talking to mongodb. It worked… mostly. Single machine, 512 MB RAM. #1 on HN for a couple of hours… What could go wrong with that?
  3. DAY 3

    Big Russian website decides to use Errorception. Over a hundred requests per second. Around 10 errors/second. That actually isn't too bad, right?
  4. NODE WAS THE PROBLEM!

    Well, actually, it was my code. Lesson 1: Never do anything CPU intensive in node. Re-evaluate even small loops in tight code paths.
  5. REWRITE HOT-PATHS

    Reduce/eliminate loops. No harm in being chatty with the DB. Rethink the logic for finding duplicates.
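
The deck does not say how the duplicate check was rewritten. One common way to avoid looping over previously seen errors is to derive a fixed fingerprint for each error and let the database match on it. The sketch below assumes that approach; the field names (`message`, `scriptUrl`, `line`, `when`) and both helper functions are hypothetical.

```javascript
const crypto = require("crypto");

// Hypothetical fingerprint: errors with the same message, script URL and
// line number are treated as duplicates. The chosen fields are an assumption.
function fingerprint(error) {
  const key = JSON.stringify([error.message, error.scriptUrl, error.line]);
  return crypto.createHash("sha1").update(key).digest("hex");
}

// Hypothetical upsert description keyed on the fingerprint, so the database,
// not a loop in node, decides whether this error has been seen before.
function toUpsert(error) {
  return {
    query: { _id: fingerprint(error) },
    update: { $inc: { count: 1 }, $set: { lastSeen: error.when } },
  };
}
```

With a MongoDB-style store, `query` and `update` would be passed to an update call with the upsert option enabled, which keeps the node process free of per-error scans.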
  6. SOME MONTHS LATER…

    Big advertising company decides to use Errorception. Ads go on the Yahoo! homepage. Yahoo's traffic hits Errorception's server. Server: 768 MB RAM machine.
  7. UNSEXY SCALING SOLUTION

    Fine-tune nginx:

    worker_processes 4;
    worker_cpu_affinity 0001 0010 0100 1000;
  8. MOST UNSEXY SOLUTION

    Call up the advertising company. Apologize. Ask them for some time. Server recovered in 15 mins.
  9. ENOUGH FIREFIGHTING

    Time to think about the architecture seriously. Throwing money at the problem isn't going to work for long.
  10. PROBLEMS

    No confidence about the code quality. Needed tests. Needed better visibility and monitoring. Needed to remove, not add, complexity. Deployment.
  11. THE DEPLOYMENT PROBLEM

    Deployment needs a restart. Sometimes, deployments need downtime for DB migrations. When down, errors aren't collected. Also, an app that's down looks bad. But minimizing deployments is not a good idea.
  12. SOLUTIONS

    Break up the application into multiple pieces. Each piece as small as necessary. Deploy each piece independently, version them independently. Get them to talk to each other through some message passing system.
  13. server

    Very, VERY versatile. Great for temporary, ephemeral data. Excellent atomic primitives. Not just memcached on steroids.
  14. QUEUES

    Queues let each part of the application prepare tasks for other parts. If a component dies, the queue will fill up. However, it lets us kill parts of the app at will.
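
A rough sketch of the queue idea follows. To stay runnable without a Redis server, it uses an in-memory stand-in for a Redis list; `ListQueue` and `runStage` are hypothetical names, and the `push`/`pop` pairing only mimics the LPUSH/RPOP pattern often used for FIFO queues. It is not the deck's actual code.

```javascript
// In-memory stand-in for a Redis list, so the sketch runs without a server.
class ListQueue {
  constructor() {
    this.items = [];
  }
  // Add a job at the head, similar in spirit to LPUSH.
  push(job) {
    this.items.unshift(JSON.stringify(job));
  }
  // Take a job from the tail, similar in spirit to RPOP; null when empty.
  pop() {
    const raw = this.items.pop();
    return raw === undefined ? null : JSON.parse(raw);
  }
  // Number of waiting jobs, similar in spirit to LLEN.
  get length() {
    return this.items.length;
  }
}

// One processing stage: drain the input queue, transform each job,
// and hand the result to the next stage's queue.
function runStage(input, output, transform) {
  let job;
  while ((job = input.pop()) !== null) {
    output.push(transform(job));
  }
}
```

If the stage is not running, jobs simply accumulate in `input`; once it is started again, `runStage` picks up where the queue left off, which is what makes it safe to stop and redeploy one piece at a time.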
  15. ERRORCEPTION ISN'T ONE APP

    The UI server deals with serving HTTP (ExpressJS). A super lightweight (90 LOC) pure-node HTTP server collects errors: uses node's cluster to split the task across processes; collects errors and simply dumps them into a redis queue. 3 micro-apps process the errors from queue to queue in Redis. Finally, a single small app writes to MongoDB.
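
The collector described above might look roughly like the sketch below, which uses only node's built-in `cluster`, `http` and `os` modules. The `enqueue` callback is a hypothetical stand-in for "push this string onto a Redis list", and the payload fields are assumptions; the real 90-line server is not shown in the deck.

```javascript
const cluster = require("cluster");
const http = require("http");
const os = require("os");

// Build the request handler. It does no processing at all: it reads the
// body, hands it to `enqueue` (hypothetical), and replies immediately.
function makeHandler(enqueue) {
  return function handle(req, res) {
    let body = "";
    req.on("data", (chunk) => {
      body += chunk;
    });
    req.on("end", () => {
      enqueue(JSON.stringify({
        receivedAt: Date.now(),
        userAgent: req.headers["user-agent"],
        body: body,
      }));
      res.writeHead(204);
      res.end();
    });
  };
}

// Fork one worker per CPU; every worker listens on the same port.
function start(port, enqueue) {
  // Newer node versions call this isPrimary; older ones only have isMaster.
  const isPrimary =
    cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;
  if (isPrimary) {
    os.cpus().forEach(() => cluster.fork());
    cluster.on("exit", () => cluster.fork()); // replace a worker that dies
  } else {
    http.createServer(makeHandler(enqueue)).listen(port);
  }
}
```

Calling something like `start(8080, pushToQueue)` at the bottom of the script would bring the collector up, where `pushToQueue` is whatever function writes to the queue. Keeping the handler this thin is in line with the earlier lesson about avoiding CPU-intensive work in the request path.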
  16. ONE SMALL PROBLEM REMAINS

    What about shared logic? Need a way to have a SOA of some sort. Tried several solutions. All failed in interesting ways.
  17. THIS WAS EXHAUSTING

    Frequent errors and downtime. Restarts every couple of hours due to memory leaks. Even wrote a script to restart applications every couple of hours!

    do it? Aha moment: node_modules of course!
  19. SHARING CODE WITH SYMLINKS

    is now a module, rather than a service. The module folder is simply symlinked into every app's node_modules folder. Still has drawbacks, but works.
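
As an illustration of the symlink approach, the sketch below links a shared module folder into an app's `node_modules` using node's `fs` module. The function name, directory layout and module name are hypothetical; the deck does not show how the links were actually created.

```javascript
const fs = require("fs");
const path = require("path");

// Symlink a shared module folder into an app's node_modules folder so that
// require(name) inside the app resolves to the shared code.
function linkSharedModule(sharedDir, appDir, name) {
  const modulesDir = path.join(appDir, "node_modules");
  fs.mkdirSync(modulesDir, { recursive: true });

  const linkPath = path.join(modulesDir, name);
  try {
    // Remove a stale link left over from a previous deploy, if any.
    fs.unlinkSync(linkPath);
  } catch (err) {
    if (err.code !== "ENOENT") throw err;
  }

  // "dir" only matters on Windows; it is ignored on other platforms.
  fs.symlinkSync(path.resolve(sharedDir), linkPath, "dir");
  return linkPath;
}
```

A deploy script could call this once per app, for example `linkSharedModule("/srv/shared-logic", "/srv/collector", "shared-logic")`, with those paths being placeholders.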
  20. CURRENT STACK

    ExpressJS for the website. Pure, straight-up node for the error catching server. Redis everywhere. Mongoose on top of MongoDB. forever as a process watcher. 24 node processes. Still one primary machine and one failover.