ABOUT ME
Founder, developer at Errorception
JS developer for 9+ years
Slide 3
Catches client-side JS errors
Over a thousand active projects
Over a million errors caught per month
Slide 4
BACKSTORY
First startup failed
Decided to launch a second one…
…in 15 days
No real backend experience
Slide 5
BACKSTORY
Code was crap, no tests, no thought towards scaling
Single, monolithic Node.js app talking to MongoDB
It worked… mostly
Single machine, 512 MB RAM
#1 on HN for a couple of hours… What could go
wrong with that?
Slide 6
DAY 3
Big Russian website decides to use Errorception
Over a hundred requests per second
Around 10 errors/second
That actually isn't too bad, right?
Slide 7
LINUX / NGINX MANAGES TO SHOULDER
THE LOAD
Node doesn't
Slide 8
NODE WAS THE PROBLEM!
Well, actually, it was my code
Lesson 1: Never do anything CPU intensive in node
Re-evaluate even small loops in tight code paths
Slide 9
HOT CODE PATHS
De-duplication of errors
Slide 10
REWRITE HOT-PATHS
Reduce/eliminate loops
No harm in being chatty with the DB
Rethink the logic for finding duplicates
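One way to read "reduce/eliminate loops" for de-duplication: instead of looping over known errors and comparing each one, derive a key from the error and do a single keyed lookup. The sketch below is an assumption, not Errorception's actual logic; the fields used for the fingerprint and the in-memory `Map` (standing in for an indexed DB collection) are hypothetical.

```javascript
const crypto = require("crypto");

// Hypothetical fingerprint: which fields identify "the same" error is an assumption.
function fingerprint(err) {
  return crypto
    .createHash("sha1")
    .update([err.message, err.script, err.line].join("\n"))
    .digest("hex");
}

// Stand-in for a DB collection keyed by fingerprint (for example, an upsert on an indexed field).
const seen = new Map();

// Records one occurrence with a single keyed lookup, with no loop over known errors.
function record(err) {
  const key = fingerprint(err);
  const count = (seen.get(key) || 0) + 1;
  seen.set(key, count);
  return { key, count, isNew: count === 1 };
}
```

With a real database, `record` would become one extra round trip per error, which fits the "no harm in being chatty with the DB" point: the lookup cost moves out of the Node process.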
Slide 11
SOME MONTHS LATER…
Big advertising company decides to use Errorception
Ads go on the Yahoo! homepage. Yahoo's traffic hits
Errorception's server
Server: 768 MB RAM machine
Slide 12
— Mike Krieger, Instagram
Slide 13
UNSEXY SCALING SOLUTION
ulimit -n
Slide 14
UNSEXY SCALING SOLUTION
Fine-tune nginx
worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;
Slide 15
MOST UNSEXY SOLUTION
Call up the advertising company. Apologize. Ask them
for some time.
Server recovered in 15 mins.
Slide 16
UNSEXY SCALING SOLUTION
Use a CDN!
Marketed as being geo-optimized.
In reality, my server was f***ed
Slide 17
ENOUGH FIREFIGHTING
Time to think about the architecture seriously
Throwing money at the problem isn't going to work for
long
Slide 18
PROBLEMS
No confidence about the code quality. Needed tests.
Needed better visibility and monitoring
Needed to remove, not add, complexity
Deployment
Slide 19
THE DEPLOYMENT PROBLEM
Deployment needs a restart
Sometimes, deployments need a downtime for DB
migrations
When down, errors aren't collected
Also, an app that's down looks bad
But minimizing deployments is not a good idea
Slide 20
FINDING SOLUTIONS
Rich Hickey - Simple Made Easy
Slide 21
SOLUTIONS
Break up the application into multiple pieces
Each piece as small as necessary
Deploy each piece independently, version them
independently
Get them to talk to each other through some
message passing system
Slide 22
REDIS SERVER
Very, VERY versatile
Great for temporary, ephemeral data
Excellent atomic primitives
Not just memcached on steroids
Slide 23
QUEUES
Queues let each part of the application prepare tasks
for other parts
If a component dies, queue will fill up
However, it lets us kill parts of the app at will
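The decoupling described here can be sketched with a tiny in-memory queue. This is only an illustration: the `Queue` class, `collect`, and `drain` are hypothetical names, and the array stands in for a Redis list (where `push` would roughly correspond to LPUSH and `pop` to RPOP or BRPOP).

```javascript
// In-memory stand-in for a Redis list; not a real Redis client.
class Queue {
  constructor() {
    this.items = [];
  }
  // Add to the head, like LPUSH.
  push(task) {
    this.items.unshift(JSON.stringify(task));
  }
  // Take from the tail, like RPOP; returns null when the queue is empty.
  pop() {
    const raw = this.items.pop();
    return raw === undefined ? null : JSON.parse(raw);
  }
  get length() {
    return this.items.length;
  }
}

const incoming = new Queue();

// Producer: keeps enqueueing even if no consumer is currently running.
function collect(error) {
  incoming.push(error);
}

// Consumer: drains whatever piled up while it was down or being redeployed.
function drain(queue, handle) {
  let processed = 0;
  for (let task = queue.pop(); task !== null; task = queue.pop()) {
    handle(task);
    processed++;
  }
  return processed;
}
```

If the consumer is killed, `collect` keeps working and the queue simply grows; once the consumer comes back, `drain` processes the backlog in arrival order.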
Slide 24
ERRORCEPTION ISN'T ONE APP
The UI server deals with serving HTTP (ExpressJS)
A super lightweight (90 LOC) pure-node HTTP
server collects errors:
Uses node's cluster to split the task across
processes
Collects errors and simply dumps them into a redis
queue
3 micro-apps process the errors from queue to
queue in Redis
Finally, a single small app writes to MongoDB
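A rough sketch of what such a lightweight collector could look like, using only Node's built-in `http` and `cluster` modules. The talk does not show the real code, so everything here is an assumption: the query-string report format, the `enqueue` callback (which in production might wrap a Redis LPUSH), and the function names are all hypothetical.

```javascript
const http = require("http");
const cluster = require("cluster");
const os = require("os");

// Builds a request handler that only parses the report and hands it to `enqueue`.
// All heavier processing is left to other apps reading from the queue.
function createCollector(enqueue) {
  return function handle(req, res) {
    const url = new URL(req.url, "http://localhost");
    const report = Object.fromEntries(url.searchParams);
    report.receivedAt = Date.now();
    enqueue(JSON.stringify(report));
    res.writeHead(204); // nothing to send back; respond as cheaply as possible
    res.end();
  };
}

// Forks one worker per CPU; each worker runs the same tiny HTTP server.
// `cluster.isPrimary` is named `cluster.isMaster` before Node 16.
function start(enqueue, port) {
  if (cluster.isPrimary) {
    os.cpus().forEach(() => cluster.fork());
  } else {
    http.createServer(createCollector(enqueue)).listen(port);
  }
}
```

Calling `start(enqueue, 8080)` would bring up the workers; nothing is started by the block itself. Keeping the handler this small is what lets the collector stay up while the downstream apps are redeployed.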
Slide 25
MULTIPLE INSTANCES OF AN APP
Hard to maintain consistency
Slide 26
REDIS-LOCK
github.com/errorception/redis-lock
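The general idea behind a lock shared across processes can be sketched as "set if absent, with an expiry". The code below is not the redis-lock implementation or its API; it is an in-memory illustration with hypothetical names (`acquire`, `release`), where the `Map` stands in for Redis and a real version would depend on an atomic Redis command so that only one process can win.

```javascript
// In-memory stand-in for Redis: lock name -> { token, expiresAt }.
const locks = new Map();

// Tries to take the lock; returns a token on success, or null if it is held.
// `now` is injectable so the expiry behaviour can be exercised deterministically.
function acquire(name, ttlMs, now = Date.now()) {
  const held = locks.get(name);
  if (held && held.expiresAt > now) return null; // still held and not expired
  const token = `${now}-${Math.random()}`;
  locks.set(name, { token, expiresAt: now + ttlMs });
  return token;
}

// Releases the lock only if the caller still owns it, so a holder whose
// lock already expired cannot release someone else's lock.
function release(name, token) {
  const held = locks.get(name);
  if (!held || held.token !== token) return false;
  locks.delete(name);
  return true;
}
```

The expiry matters when several instances of an app run side by side: if the holder crashes, the lock frees itself instead of blocking every other instance.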
Slide 27
ONE SMALL PROBLEM REMAINS
What about shared logic?
Need a way to have a SOA of some sort
Tried several solutions. All failed in interesting ways
Slide 28
THIS WAS EXHAUSTING
Frequent errors and downtimes
Restarts every couple of hours due to memory leaks
Even wrote a script to restart applications every
couple of hours!
Slide 29
HOW DO WE SHARE CODE ACROSS
APPLICATIONS?
How does node do it?
Aha moment: node_modules of course!
Slide 30
SHARING CODE WITH SYMLINKS
Shared logic is now a module, rather than a service
The module folder is simply symlinked into every
app's node_modules folder
Still has drawbacks, but works
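The symlinking step could be scripted roughly as follows. The function name and directory layout are hypothetical, since the talk does not show the actual setup; the sketch only uses Node's built-in `fs` and `path` modules.

```javascript
const fs = require("fs");
const path = require("path");

// Links a shared module folder into an app's node_modules so require() can find it by name.
// Both paths are supplied by the caller; the layout is an assumption.
function linkSharedModule(sharedDir, appDir) {
  const modulesDir = path.join(appDir, "node_modules");
  fs.mkdirSync(modulesDir, { recursive: true });
  const linkPath = path.join(modulesDir, path.basename(sharedDir));
  if (!fs.existsSync(linkPath)) {
    // The "junction" type only matters on Windows; other platforms ignore it.
    fs.symlinkSync(path.resolve(sharedDir), linkPath, "junction");
  }
  return linkPath;
}
```

After `linkSharedModule("/srv/shared-logic", "/srv/ui-app")` (example paths), code inside the app could load the module with `require("shared-logic")`. Note that Node resolves a symlinked module to its real path by default, so the shared module's own dependencies are looked up relative to the shared folder rather than the app.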
Slide 31
CURRENT STACK
ExpressJS for the website
Pure, straight-up node for the error catching server
Redis everywhere
Mongoose on top of MongoDB
forever as a process watcher
24 node processes
Still one primary machine and one failover
Slide 32
NEXT STEPS
Improve code sharing
Some form of service-oriented application
distribution mechanism
Slide 33
THANK YOU
github.com/errorception
errorception.com
@errorception
@rakesh314