Slide 1

Slide 1 text

SCALING NODE AT ERRORCEPTION

Slide 2

Slide 2 text

ABOUT ME
Founder, developer at Errorception
JS developer for 9+ years

Slide 3

Slide 3 text

Catches client-side JS errors
Over a thousand active projects
Over a million errors caught per month

Slide 4

Slide 4 text

BACKSTORY
First startup failed
Decided to launch a second one…
…in 15 days
No real backend experience

Slide 5

Slide 5 text

BACKSTORY
Code was crap, no tests, no thought towards scaling
Single, monolithic Node.js app talking to MongoDB
It worked… mostly
Single machine, 512 MB RAM
#1 on HN for a couple of hours…
What could go wrong with that?

Slide 6

Slide 6 text

DAY 3
Big Russian website decides to use Errorception
Over a hundred requests per second
Around 10 errors/second
That actually isn't too bad, right?

Slide 7

Slide 7 text

LINUX / NGINX MANAGES TO SHOULDER THE LOAD
Node doesn't

Slide 8

Slide 8 text

NODE WAS THE PROBLEM!
Well, actually, it was my code
Lesson 1: Never do anything CPU intensive in node
Re-evaluate even small loops in tight code paths

Slide 9

Slide 9 text

HOT CODE PATHS
De-duplication of errors

Slide 10

Slide 10 text

REWRITE HOT-PATHS
Reduce/eliminate loops
No harm in being chatty with the DB
Rethink the logic for finding duplicates
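One way to find duplicates without looping over known errors is to reduce each error to a fingerprint and look it up by key. The sketch below is an illustration only, not Errorception's actual logic: the fields that make up the fingerprint and the names `fingerprint`, `record`, and `seen` are assumptions, and the `Map` stands in for a database collection with an index on the fingerprint.

```javascript
const crypto = require("crypto");

// Hypothetical fingerprint: which fields identify "the same error" is an assumption.
function fingerprint(error) {
  return crypto
    .createHash("sha1")
    .update([error.message, error.file, error.line].join("\n"))
    .digest("hex");
}

// In-memory stand-in for a DB collection indexed by fingerprint.
const seen = new Map();

// Records an error with a single keyed lookup instead of a scan.
// Returns true when the error was already known.
function record(error) {
  const key = fingerprint(error);
  const existing = seen.get(key);
  if (existing) {
    existing.count += 1;
    return true;
  }
  seen.set(key, { error, count: 1 });
  return false;
}
```

With a real database, the lookup and the counter increment would each be one indexed query, which is where being chatty with the DB replaces the in-process loop.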

Slide 11

Slide 11 text

SOME MONTHS LATER…
Big advertising company decides to use Errorception
Ads go on the Yahoo! homepage.
Yahoo's traffic hits Errorception's server
Server: 768 MB RAM machine

Slide 12

Slide 12 text

— Mike Krieger, Instagram

Slide 13

Slide 13 text

UNSEXY SCALING SOLUTION
ulimit -n

Slide 14

Slide 14 text

UNSEXY SCALING SOLUTION
Fine-tune nginx
worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;

Slide 15

Slide 15 text

MOST UNSEXY SOLUTION
Call up the advertising company.
Apologize.
Ask them for some time.
Server recovered in 15 mins.

Slide 16

Slide 16 text

UNSEXY SCALING SOLUTION
Use a CDN!
Marketed as being geo-optimized.
In reality, my server was f***ed

Slide 17

Slide 17 text

ENOUGH FIREFIGHTING
Time to think about the architecture seriously
Throwing money at the problem isn't going to work for long

Slide 18

Slide 18 text

PROBLEMS
No confidence about the code quality. Needed tests.
Needed better visibility and monitoring
Needed to remove, not add, complexity
Deployment

Slide 19

Slide 19 text

THE DEPLOYMENT PROBLEM
Deployment needs a restart
Sometimes, deployments need a downtime for DB migrations
When down, errors aren't collected
Also, an app that's down looks bad
But minimizing deployments is not a good idea

Slide 20

Slide 20 text

FINDING SOLUTIONS
Rich Hickey - Simple Made Easy

Slide 21

Slide 21 text

SOLUTIONS
Break up the application into multiple pieces
Each piece as small as necessary
Deploy each piece independently, version them independently
Get them to talk to each other through some message passing system

Slide 22

Slide 22 text

REDIS server
Very, VERY versatile
Great for temporary, ephemeral data
Excellent atomic primitives
Not just memcached on steroids

Slide 23

Slide 23 text

QUEUES
Queues let each part of the application prepare tasks for other parts
If a component dies, queue will fill up
However, it lets us kill parts of the app at will
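The queue behaviour described above can be sketched without a running Redis. Everything here is an illustrative assumption: `Queue` is an in-memory stand-in for a Redis list (its `push` and `pop` play the roles of LPUSH and RPOP), and `drain` is a hypothetical stage that moves tasks from one queue to the next.

```javascript
// In-memory stand-in for a Redis list; tasks are stored as JSON strings,
// as they would be in Redis.
class Queue {
  constructor() {
    this.items = [];
  }
  // Like LPUSH: add at the head, return the new length.
  push(task) {
    this.items.unshift(JSON.stringify(task));
    return this.items.length;
  }
  // Like RPOP: take from the tail, or null when empty.
  pop() {
    const raw = this.items.pop();
    return raw === undefined ? null : JSON.parse(raw);
  }
  get length() {
    return this.items.length;
  }
}

// One processing stage: consume everything waiting on `input`,
// transform it, and hand it to `output` for the next stage.
function drain(input, output, transform) {
  let processed = 0;
  let task;
  while ((task = input.pop()) !== null) {
    output.push(transform(task));
    processed += 1;
  }
  return processed;
}
```

While a stage is stopped for a deployment, producers keep pushing and the input queue simply grows; calling `drain` again after the restart picks up the backlog in arrival order.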

Slide 24

Slide 24 text

ERRORCEPTION ISN'T ONE APP
The UI server deals with serving HTTP (ExpressJS)
A super lightweight (90 LOC) pure-node HTTP server collects errors:
  Uses node's cluster to split the task across processes
  Collects errors and simply dumps them into a redis queue
3 micro-apps process the errors from queue to queue in Redis
Finally, a single small app writes to MongoDB
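A collector of this shape might look roughly like the sketch below. It is not the actual 90-line server: `createCollector`, `start`, and the `enqueue` callback are hypothetical names, and `enqueue` stands in for whatever pushes the raw payload onto a Redis list.

```javascript
const http = require("http");
const cluster = require("cluster");
const os = require("os");

// Builds an HTTP server that does as little as possible per request:
// read the body, hand it to `enqueue` untouched, reply with 204.
function createCollector(enqueue) {
  return http.createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => {
      body += chunk;
    });
    req.on("end", () => {
      enqueue(body); // no parsing or de-duplication on this path
      res.writeHead(204);
      res.end();
    });
  });
}

// Forks one worker per CPU; every worker listens on the same port.
function start(enqueue, port) {
  // `isPrimary` replaced `isMaster` in newer Node versions.
  const isPrimary =
    cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;
  if (isPrimary) {
    os.cpus().forEach(() => cluster.fork());
  } else {
    createCollector(enqueue).listen(port);
  }
}
```

Keeping parsing and de-duplication out of the request handler leaves the expensive work to the queue consumers, so the collector stays cheap under load.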

Slide 25

Slide 25 text

MULTIPLE INSTANCES OF AN APP
Hard to maintain consistency

Slide 26

Slide 26 text

REDIS-LOCK
github.com/errorception/redis-lock
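The general idea behind a Redis-based lock is an atomic "set this key only if it does not exist, with an expiry". The sketch below illustrates that idea only; it is not redis-lock's implementation or API. `FakeStore` is an in-memory stand-in for Redis, and `tryLock` and `unlock` are hypothetical helper names.

```javascript
// In-memory stand-in for Redis. `setIfAbsent` mimics an atomic
// "SET key value NX PX ttl": it succeeds only if the key is free or expired.
class FakeStore {
  constructor() {
    this.keys = new Map();
  }
  setIfAbsent(key, value, ttlMs, now = Date.now()) {
    const entry = this.keys.get(key);
    if (entry && entry.expiresAt > now) return false;
    this.keys.set(key, { value, expiresAt: now + ttlMs });
    return true;
  }
  // Deletes the key only if it still holds `value`, so one process
  // cannot release a lock that another process has since acquired.
  deleteIfValue(key, value) {
    const entry = this.keys.get(key);
    if (entry && entry.value === value) {
      this.keys.delete(key);
      return true;
    }
    return false;
  }
}

// Returns true if `owner` now holds the lock named `name`.
function tryLock(store, name, owner, ttlMs) {
  return store.setIfAbsent("lock." + name, owner, ttlMs);
}

// Returns true if `owner` held the lock and it was released.
function unlock(store, name, owner) {
  return store.deleteIfValue("lock." + name, owner);
}
```

The expiry means a crashed instance cannot hold a lock forever, which matters when parts of the app are killed at will.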

Slide 27

Slide 27 text

ONE SMALL PROBLEM REMAINS
What about shared logic?
Need a way to have a SOA of some sort
Tried several solutions. All failed in interesting ways

Slide 28

Slide 28 text

THIS WAS EXHAUSTING
Frequent errors and downtimes
Restarts every couple of hours due to memory leaks
Even wrote a script to restart applications every couple of hours!

Slide 29

Slide 29 text

HOW DO WE SHARE CODE ACROSS APPLICATIONS?
How does node do it?
Aha moment: node_modules of course!

Slide 30

Slide 30 text

SHARING CODE WITH SYMLINKS
is now a module, rather than a service
The module folder is simply symlinked into every app's node_modules folder
Still has drawbacks, but works
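The symlinking step could be scripted along these lines. This is a sketch under assumptions: `linkSharedModule` is a hypothetical helper, and the directory layout and module name passed to it are illustrative rather than Errorception's actual setup.

```javascript
const fs = require("fs");
const path = require("path");

// Symlinks `moduleDir` into `<appDir>/node_modules/<name>` so that
// `require(name)` inside the app resolves to the shared folder.
function linkSharedModule(moduleDir, appDir, name) {
  const nodeModules = path.join(appDir, "node_modules");
  fs.mkdirSync(nodeModules, { recursive: true });
  const link = path.join(nodeModules, name);
  try {
    fs.lstatSync(link); // something is already there: leave it alone
  } catch (err) {
    if (err.code !== "ENOENT") throw err;
    fs.symlinkSync(path.resolve(moduleDir), link, "dir");
  }
  return link;
}
```

Running the helper once per app gives every app the same copy of the shared code, so an edit to the module folder is visible to all of them the next time they start.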

Slide 31

Slide 31 text

CURRENT STACK
ExpressJS for the website
Pure, straight-up node for the error catching server
Redis everywhere
Mongoose on top of MongoDB
forever as a process watcher
24 node processes
Still one primary machine and one failover

Slide 32

Slide 32 text

NEXT STEPS
Improve code sharing
Some form of service-oriented application distribution mechanism

Slide 33

Slide 33 text

THANK YOU
github.com/errorception
errorception.com
@errorception
@rakesh314