Scaling Node at Errorception

Rakesh Pai
October 19, 2012

The story of how Errorception was launched and how it overcame scaling challenges with Node.js. The result is what I think is a fantastic way of building large, complex apps with Node.




  2. ABOUT ME
    Founder, developer at Errorception
    JS developer for 9+

  3. Catches client-side JS errors
    Over a thousand active projects
    Over a million errors caught per month
  4. BACKSTORY
    First startup failed
    Decided to launch a second one…
    …in 15 days
    No real backend experience
  5. BACKSTORY
    Code was crap, no tests, no thought towards scaling
    Single, monolithic node.js app talking to MongoDB
    It worked… mostly
    Single machine, 512 MB RAM
    #1 on HN for a couple of hours…
    What could go wrong with that?
  6. DAY 3
    Big Russian website decides to use Errorception
    Over a hundred requests per second
    Around 10 errors/second
    That actually isn't too bad, right?

  8. NODE WAS THE PROBLEM!
    Well, actually, it was my code
    Lesson 1: Never do anything CPU intensive in node
    Re-evaluate even small loops in tight code paths
  9. HOT CODE PATHS
    De-duplication of errors

  10. REWRITE HOT-PATHS
    Reduce/eliminate loops
    No harm in being chatty with the DB
    Rethink the logic for finding duplicates
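
The slides do not show how duplicates were found, so the following is only a minimal sketch of one way to take a loop out of the hot path. It assumes that two errors count as duplicates when their message, script URL and line number match, which is an illustrative rule and not necessarily Errorception's. The names `fingerprint` and `record` are hypothetical, and the `Map` stands in for a keyed write to the database.

```javascript
const crypto = require('crypto');

// Hypothetical definition of "duplicate": same message, script URL and line.
function fingerprint(err) {
  return crypto
    .createHash('sha1')
    .update([err.message, err.url, err.line].join('\n'))
    .digest('hex');
}

// Stand-in for the database: fingerprint -> occurrence count.
// With a real store this would be one upsert keyed on the fingerprint.
const occurrences = new Map();

// Records one error and returns how many times it has been seen so far.
function record(err) {
  const key = fingerprint(err);
  const count = (occurrences.get(key) || 0) + 1;
  occurrences.set(key, count);
  return count;
}
```

The cost per incoming error is one hash and one keyed lookup, instead of a comparison against every error already stored.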
  11. SOME MONTHS LATER…
    Big advertising company decides to use Errorception
    Ads go on the Yahoo! homepage.
    Yahoo's traffic hits Errorception's server
    Server: 768 MB RAM machine
  12. — Mike Krieger, Instagram

  13. UNSEXY SCALING SOLUTION
    ulimit -

  14. UNSEXY SCALING SOLUTION
    Fine-tune nginx
    worker_processes 4;
    worker_cpu_affinity 0001 0010 0100 1000;
  15. MOST UNSEXY SOLUTION
    Call up the advertising company. Apologize. Ask them for some time.
    Server recovered in 15 mins.
  16. UNSEXY SCALING SOLUTION
    Use a CDN!
    Marketed as being geo-optimized.
    In reality, my server was f***ed
  17. ENOUGH FIREFIGHTING
    Time to think about the architecture seriously
    Throwing money at the problem isn't going to work for long
  18. PROBLEMS
    No confidence about the code quality. Needed tests.
    Needed better visibility and monitoring
    Needed to remove, not add, complexity
    Deployment
  19. THE DEPLOYMENT PROBLEM
    Deployment needs a restart
    Sometimes, deployments need downtime for DB migrations
    When down, errors aren't collected
    Also, an app that's down looks bad
    But minimizing deployments is not a good idea
  20. FINDING SOLUTIONS
    Rich Hickey - Simple Made Easy

  21. SOLUTIONS
    Break up the application into multiple pieces
    Each piece as small as necessary
    Deploy each piece independently, version them independently
    Get them to talk to each other through some message passing system
  22. server
    Very, VERY versatile
    Great for temporary, ephemeral data
    Excellent atomic primitives
    Not just memcached on steroids
  23. QUEUES
    Queues let each part of the application prepare tasks for other parts
    If a component dies, queue will fill up
    However, it lets us kill parts of the app at will
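
To make the queue idea concrete, here is a small sketch that uses an in-memory class as a stand-in for a Redis list, so it runs without a Redis server. `Queue` and `drain` are hypothetical names; with Redis, the equivalent operations would be a push onto one end of a list and a pop from the other.

```javascript
// In-memory stand-in for a Redis list. Tasks are stored as JSON strings,
// the way they would be if they travelled through an external queue.
class Queue {
  constructor() { this.items = []; }
  push(task) { this.items.push(JSON.stringify(task)); }
  pop() {
    const raw = this.items.shift();
    return raw === undefined ? null : JSON.parse(raw);
  }
  get length() { return this.items.length; }
}

// Moves every waiting task through `handler` and into the next queue.
// Returns the number of tasks processed.
function drain(from, to, handler) {
  let processed = 0;
  let task;
  while ((task = from.pop()) !== null) {
    to.push(handler(task));
    processed += 1;
  }
  return processed;
}

const rawErrors = new Queue();
const parsedErrors = new Queue();

// The producer keeps working while the consumer is "down": tasks pile up.
rawErrors.push({ message: 'x is undefined' });
rawErrors.push({ message: 'y is not a function' });

// When the consumer comes back, it works through the backlog in order.
const handled = drain(rawErrors, parsedErrors, (task) =>
  Object.assign({}, task, { processed: true })
);
```

Because the producer only ever talks to the queue, the consumer can be stopped, redeployed and restarted without the producer noticing anything other than a longer backlog.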
  24. ERRORCEPTION ISN'T ONE APP
    The UI server deals with serving HTTP (ExpressJS)
    A super lightweight (90 LOC) pure-node HTTP server collects errors:
      Uses node's cluster to split the task across processes
      Collects errors and simply dumps them into a redis queue
    3 micro-apps process the errors from queue to queue in Redis
    Finally, a single small app writes to MongoDB
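
The collector itself is not shown in the slides. The sketch below is one possible shape for it, under two assumptions that are not from the talk: reports arrive as query-string parameters on a plain HTTP request, and `enqueue` is a hypothetical callback that would push the report onto a Redis list. `parseReport`, `createCollector` and `start` are illustrative names.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');
const { URL } = require('url');

// Turns the query string of a request into a plain object.
// Parameter names are whatever the client sends; none are assumed here.
function parseReport(requestUrl) {
  const { searchParams } = new URL(requestUrl, 'http://localhost');
  return Object.fromEntries(searchParams);
}

// Builds the HTTP server. It does no processing of its own: it hands the
// report to `enqueue` and replies immediately.
function createCollector(enqueue) {
  return http.createServer((req, res) => {
    enqueue(parseReport(req.url));
    res.writeHead(204); // nothing to send back
    res.end();
  });
}

// Forks one worker per CPU. Each worker runs its own collector and the
// cluster module shares the listening port between them.
function start(port, enqueue) {
  // `isPrimary` is the newer name; older Node versions only have `isMaster`.
  const isPrimary =
    cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;
  if (isPrimary) {
    os.cpus().forEach(() => cluster.fork());
    cluster.on('exit', () => cluster.fork()); // replace a worker that dies
  } else {
    createCollector(enqueue).listen(port);
  }
}
```

Calling `start(8080, enqueue)` from the entry script would bring up the workers; the port number is arbitrary. Keeping the request handler this thin is what lets the heavier work move to the queue consumers.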
  25. MULTIPLE INSTANCES OF AN APP
    Hard to maintain consistency


  27. ONE SMALL PROBLEM REMAINS
    What about shared logic?
    Need a way to have a SOA of some sort
    Tried several solutions. All failed in interesting ways
  28. THIS WAS EXHAUSTING
    Frequent errors and downtimes
    Restarts every couple of hours due to memory leaks
    Even wrote a script to restart applications every couple of hours!

    do it? Aha moment: node_modules of course!
  30. SHARING CODE WITH SYMLINKS
    is now a module, rather than a service
    The module folder is simply symlinked into every app's node_modules folder
    Still has drawbacks, but works
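
As a rough illustration of the symlink approach, the sketch below links a shared folder into an app's node_modules and then loads it by name. Everything runs in a throwaway temp directory, and the folder names (`shared-logic`, `ui-server`) and the helper `linkSharedModule` are made up for the example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Links `moduleDir` into `<appDir>/node_modules/<name>` so that the app
// can load the shared folder by its module name.
function linkSharedModule(moduleDir, appDir, name) {
  const nodeModules = path.join(appDir, 'node_modules');
  fs.mkdirSync(nodeModules, { recursive: true });
  const link = path.join(nodeModules, name);
  if (!fs.existsSync(link)) {
    // The 'junction' type only matters on Windows; elsewhere it is ignored.
    fs.symlinkSync(moduleDir, link, 'junction');
  }
  return link;
}

// Demo layout: one shared module and one app, side by side.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'shared-'));
const shared = path.join(root, 'shared-logic');
const app = path.join(root, 'ui-server');
fs.mkdirSync(shared);
fs.mkdirSync(app);
fs.writeFileSync(
  path.join(shared, 'index.js'),
  'module.exports = { version: 1 };\n'
);

const link = linkSharedModule(shared, app, 'shared-logic');

// Resolve the module the way code inside the app would, by bare name.
const resolved = require.resolve('shared-logic', { paths: [app] });
const sharedLogic = require(resolved);
```

Every app that receives the link sees the same files, so a change to the shared folder reaches all of them at once, which may be part of the drawbacks the slide alludes to.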
  31. CURRENT STACK
    ExpressJS for the website
    Pure, straight-up node for the error catching server
    Redis everywhere
    Mongoose on top of MongoDB
    forever as a process watcher
    24 node processes
    Still one primary machine and one failover
  32. NEXT STEPS
    Improve code sharing
    Some form of service-oriented application distribution mechanism
  33. THANK YOU @errorception @rakesh314