Scaling Node at Errorception

SCALING NODE AT ERRORCEPTION

ABOUT ME
Founder, developer at Errorception
JS developer for 9+ years

Catches client-side JS errors
Over a thousand active projects
Over a million errors caught per month

BACKSTORY
First startup failed
Decided to launch a second one…
…in 15 days
No real backend experience

BACKSTORY
Code was crap, no tests, no thought towards scaling
Single, monolithic node.js app talking to MongoDB
It worked… mostly
Single machine, 512 MB RAM
#1 on HN for a couple of hours…
What could go wrong with that?

DAY 3
Big Russian website decides to use Errorception
Over a hundred requests per second
Around 10 errors/second
That actually isn't too bad, right?

LINUX / NGINX MANAGES TO SHOULDER THE LOAD
Node doesn't

NODE WAS THE PROBLEM!
Well, actually, it was my code
Lesson 1: Never do anything CPU intensive in node
Re-evaluate even small loops in tight code paths

HOT CODE PATHS
De-duplication of errors

REWRITE HOT-PATHS
Reduce/eliminate loops
No harm in being chatty with the DB
Rethink the logic for finding duplicates
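One way to take the loop out of duplicate detection, sketched under assumptions: derive a fingerprint from the fields that make two errors "the same", then let a keyed store decide whether the error is new. The talk doesn't say how Errorception actually finds duplicates, so the fields (message, file, line), the function names, and the Map standing in for the database are all hypothetical.

```javascript
const crypto = require('node:crypto');

// Hypothetical fingerprint: a hash of the fields that identify "the same" error.
function fingerprint(err) {
  const key = [err.message, err.file, err.line].join('\u0000');
  return crypto.createHash('sha1').update(key).digest('hex');
}

// `store` is a Map standing in for the database. The lookup is by key,
// so no loop over previously stored errors is needed.
function recordError(store, err) {
  const id = fingerprint(err);
  const existing = store.get(id);
  if (existing) {
    existing.count += 1; // duplicate: just bump the counter
    return existing;
  }
  const entry = { id, sample: err, count: 1 };
  store.set(id, entry);
  return entry;
}
```

Against a real database the get/set pair would become a single keyed upsert that increments a counter, which is one more round trip per error but keeps the comparison work out of the node process.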

SOME MONTHS LATER…
Big advertising company decides to use Errorception
Ads go on the Yahoo! homepage. Yahoo's traffic hits Errorception's server
Server: 768 MB RAM machine

— Mike Krieger, Instagram

UNSEXY SCALING SOLUTION
ulimit -n

UNSEXY SCALING SOLUTION
Fine-tune nginx
worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;

MOST UNSEXY SOLUTION
Call up the advertising company. Apologize. Ask them for some time.
Server recovered in 15 mins.

UNSEXY SCALING SOLUTION
Use a CDN!
Marketed as being geo-optimized.
In reality, my server was f***ed

ENOUGH FIREFIGHTING
Time to think about the architecture seriously
Throwing money at the problem isn't going to work for long

PROBLEMS
No confidence about the code quality. Needed tests.
Needed better visibility and monitoring
Needed to remove, not add, complexity
Deployment

THE DEPLOYMENT PROBLEM
Deployment needs a restart
Sometimes, deployments need a downtime for DB migrations
When down, errors aren't collected
Also, an app that's down looks bad
But minimizing deployments is not a good idea

FINDING SOLUTIONS
Rich Hickey - Simple Made Easy

SOLUTIONS
Break up the application into multiple pieces
Each piece as small as necessary
Deploy each piece independently, version them independently
Get them to talk to each other through some message passing system

REDIS
Data structure server
Very, VERY versatile
Great for temporary, ephemeral data
Excellent atomic primitives
Not just memcached on steroids

QUEUES
Queues let each part of the application prepare tasks for other parts
If a component dies, queue will fill up
However, it lets us kill parts of the app at will
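The queue behaviour above can be sketched without a running Redis, as a rough illustration only. `ListQueue` is a hypothetical in-memory stand-in for a Redis list; the comments name the Redis commands each method would map to. `drain` is likewise an illustrative worker, not Errorception's code.

```javascript
// In-memory stand-in for a Redis list used as a FIFO queue (hypothetical class).
class ListQueue {
  constructor() { this.items = []; }
  // Add a job at the head, like LPUSH.
  push(job) { this.items.unshift(JSON.stringify(job)); }
  // Take a job from the tail, like RPOP; null when the queue is empty.
  pop() {
    const raw = this.items.pop();
    return raw === undefined ? null : JSON.parse(raw);
  }
  // Number of waiting jobs, like LLEN.
  get length() { return this.items.length; }
}

// A worker: take jobs from its input queue, transform them, and
// prepare the results on the next stage's queue.
function drain(input, output, transform) {
  let processed = 0;
  let job;
  while ((job = input.pop()) !== null) {
    output.push(transform(job));
    processed += 1;
  }
  return processed;
}
```

While a worker is stopped for a deploy, producers keep calling `push` and the queue simply grows; when the worker comes back, `drain` picks up the backlog in arrival order.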

ERRORCEPTION ISN'T ONE APP
The UI server deals with serving HTTP (ExpressJS)
A super lightweight (90 LOC) pure-node HTTP server collects errors:
  Uses node's cluster to split the task across processes
  Collects errors and simply dumps them into a redis queue
3 micro-apps process the errors from queue to queue in Redis
Finally, a single small app writes to MongoDB
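A collector of this shape might look roughly like the following. This is a sketch, not the actual 90 lines: it assumes errors arrive as a request body, and `enqueue` is a hypothetical callback that would push the raw payload onto a Redis list in production. `cluster.isPrimary` assumes Node 16 or later (older versions call it `isMaster`).

```javascript
const http = require('node:http');
const cluster = require('node:cluster');
const os = require('node:os');

// Build the request handler. It does no parsing or validation:
// it hands the raw payload to `enqueue` and answers immediately.
function createHandler(enqueue) {
  return function handler(req, res) {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      enqueue(body);
      res.writeHead(204); // empty response; the client doesn't need anything back
      res.end();
    });
  };
}

// Fork one worker per CPU; all workers share the same listening port.
function start(enqueue, port) {
  if (cluster.isPrimary) {
    for (let i = 0; i < os.cpus().length; i++) cluster.fork();
    cluster.on('exit', () => cluster.fork()); // replace a worker that dies
  } else {
    http.createServer(createHandler(enqueue)).listen(port);
  }
}
```

Calling `start(enqueue, 8080)` from the entry script would bring up the workers. Keeping the handler this thin means the CPU-heavy work happens later, in the queue consumers, not in the process that faces traffic.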

MULTIPLE INSTANCES OF AN APP
Hard to maintain consistency

REDIS-LOCK
github.com/errorception/redis-lock
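The general idea behind a lock shared by several instances can be sketched as below. This is not redis-lock's implementation or API; `LockStore` is a hypothetical in-memory stand-in. With Redis, `tryAcquire` would correspond to `SET key token NX PX ttl`, and `release` to deleting the key only if it still holds the same token.

```javascript
const crypto = require('node:crypto');

// In-memory stand-in for the shared store (hypothetical class).
// `now` is injectable so expiry can be exercised without waiting.
class LockStore {
  constructor(now = Date.now) {
    this.locks = new Map();
    this.now = now;
  }

  // Returns a token if the lock was free (or expired), otherwise null.
  tryAcquire(name, ttlMs) {
    const held = this.locks.get(name);
    if (held && held.expiresAt > this.now()) return null; // someone else holds it
    const token = crypto.randomBytes(8).toString('hex');
    this.locks.set(name, { token, expiresAt: this.now() + ttlMs });
    return token;
  }

  // Releases only if the caller still owns the lock.
  release(name, token) {
    const held = this.locks.get(name);
    if (!held || held.token !== token) return false; // expired and retaken, or never ours
    this.locks.delete(name);
    return true;
  }
}
```

The expiry is what keeps a crashed instance from holding the lock forever, and the token check is what keeps a slow instance from releasing a lock that has since passed to someone else.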

ONE SMALL PROBLEM REMAINS
What about shared logic?
Need a way to have a SOA of some sort
Tried several solutions. All failed in interesting ways

THIS WAS EXHAUSTING
Frequent errors and downtimes
Restarts every couple of hours due to memory leaks
Even wrote a script to restart applications every couple of hours!

HOW DO WE SHARE CODE ACROSS APPLICATIONS?
How does node do it?
Aha moment: node_modules of course!

SHARING CODE WITH SYMLINKS
Shared logic is now a module, rather than a service
The module folder is simply symlinked into every app's node_modules folder
Still has drawbacks, but works
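The symlinking step could be scripted along these lines. A minimal sketch: the function name, the directory layout, and the module name are hypothetical, and it assumes a platform where directory symlinks are permitted.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Link a shared module folder into one app's node_modules directory,
// so the app can load it with a plain require(name).
function linkShared(sharedDir, appDir, name) {
  const modulesDir = path.join(appDir, 'node_modules');
  fs.mkdirSync(modulesDir, { recursive: true });
  const linkPath = path.join(modulesDir, name);
  fs.rmSync(linkPath, { force: true }); // drop a stale link; the target is untouched
  fs.symlinkSync(path.resolve(sharedDir), linkPath, 'dir');
  return linkPath;
}
```

Running `linkShared('./shared', './apps/collector', 'shared')` for each app means every app sees the same folder, so a change to the shared code reaches all of them at once. That is convenient, and it is also one of the drawbacks: the apps can no longer pin different versions of the shared logic.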

CURRENT STACK
ExpressJS for the website
Pure, straight-up node for the error catching server
Redis everywhere
Mongoose on top of MongoDB
forever as a process watcher
24 node processes
Still one primary machine and one failover

NEXT STEPS
Improve code sharing
Some form of service-oriented application distribution mechanism

THANK YOU
github.com/errorception
errorception.com
@errorception
@rakesh314


Rakesh Pai
