Scaling Node at Errorception

Rakesh Pai
October 19, 2012

The story of how Errorception was launched and how it overcame scaling challenges with Node.js. The result is what I think is a fantastic way of building large, complex apps with Node.

Transcript

  1. SCALING NODE AT
    ERRORCEPTION

  2. ABOUT ME
    Founder, developer at Errorception
    JS developer for 9+ years

  3. Catches client-side JS errors
    Over a thousand active projects
    Over a million errors caught per month

  4. BACKSTORY
    First startup failed
    Decided to launch a second one…
    …in 15 days
    No real backend experience

  5. BACKSTORY
    Code was crap, no tests, no thought towards scaling
    Single, monolithic node.js app talking to mongodb
    It worked… mostly
    Single machine, 512 MB RAM
    #1 on HN for a couple of hours… What could go
    wrong with that?

  6. DAY 3
    Big Russian website decides to use Errorception
    Over a hundred requests per second
    Around 10 errors/second
    That actually isn't too bad, right?

  7. LINUX / NGINX MANAGES TO SHOULDER
    THE LOAD
    Node doesn't

  8. NODE WAS THE PROBLEM!
    Well, actually, it was my code
    Lesson 1: Never do anything CPU intensive in node
    Re-evaluate even small loops in tight code paths

  9. HOT CODE PATHS
    De-duplication of errors

  10. REWRITE HOT-PATHS
    Reduce/eliminate loops
    No harm in being chatty with the DB
    Rethink the logic for finding duplicates
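
One way to read "reduce/eliminate loops" and "rethink the logic for finding duplicates" is to swap a scan over known errors for a keyed lookup. The sketch below illustrates that idea only; it is not Errorception's actual de-duplication logic. The `fingerprint` fields (`message`, `scriptUrl`, `line`) and the in-memory `Map` are hypothetical stand-ins for whatever key and DB lookup a real system would use.

```javascript
const crypto = require('crypto');

// Hypothetical fingerprint: the fields used here are an assumption,
// not the rule Errorception actually applies.
function fingerprint(err) {
  return crypto
    .createHash('sha1')
    .update([err.message, err.scriptUrl, err.line].join('\n'))
    .digest('hex');
}

// In-memory stand-in for a keyed store. In production this could be a
// single indexed query or upsert per error ("chatty with the DB")
// instead of a loop over previously seen errors inside node.
function createDeduper() {
  const counts = new Map();
  return function record(err) {
    const key = fingerprint(err);
    const occurrences = (counts.get(key) || 0) + 1;
    counts.set(key, occurrences);
    return { key, isDuplicate: occurrences > 1, occurrences };
  };
}
```

Each incoming error costs one hash and one lookup, however many distinct errors have been seen before.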

  11. SOME MONTHS LATER…
    Big advertising company decides to use Errorception
    Ads go on the Yahoo! homepage. Yahoo's traffic hits
    Errorception's server
    Server: 768 MB RAM machine

  12. — Mike Krieger, Instagram

  13. UNSEXY SCALING SOLUTION
    ulimit -n

  14. UNSEXY SCALING SOLUTION
    Fine-tune nginx
    worker_processes 4;
    worker_cpu_affinity 0001 0010 0100 1000;

  15. MOST UNSEXY SOLUTION
    Call up the advertising company. Apologize. Ask them
    for some time.
    Server recovered in 15 mins.

  16. UNSEXY SCALING SOLUTION
    Use a CDN!
    Marketed as being geo-optimized.
    In reality, my server was f***ed

  17. ENOUGH FIREFIGHTING
    Time to think about the architecture seriously
    Throwing money at the problem isn't going to work for
    long

  18. PROBLEMS
    No confidence about the code quality. Needed tests.
    Needed better visibility and monitoring
    Needed to remove, not add, complexity
    Deployment

  19. THE DEPLOYMENT PROBLEM
    Deployment needs a restart
    Sometimes, deployments need a downtime for DB
    migrations
    When down, errors aren't collected
    Also, an app that's down looks bad
    But minimizing deployments is not a good idea

  20. FINDING SOLUTIONS
    Rich Hickey - Simple Made Easy

  21. SOLUTIONS
    Break up the application into multiple pieces
    Each piece as small as necessary
    Deploy each piece independently, version them
    independently
    Get them to talk to each other through some
    message passing system

  22. Redis server
    Very, VERY versatile
    Great for temporary, ephemeral data
    Excellent atomic primitives
    Not just memcached on steroids

  23. QUEUES
    Queues let each part of the application prepare tasks
    for other parts
    If a component dies, queue will fill up
    However, it lets us kill parts of the app at will
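
The queue behaviour described on this slide can be sketched as follows. This is a minimal illustration, not Errorception's code: `FakeList` is a hypothetical in-memory stand-in that mimics only the push-on-one-end, pop-from-the-other behaviour of a Redis list, and `enqueue` and `drain` are names made up for this example.

```javascript
// Hypothetical in-memory stand-in for a Redis list. Only the three
// operations this sketch needs are modelled.
class FakeList {
  constructor() { this.items = []; }
  lpush(value) { this.items.unshift(value); return this.items.length; }
  rpop() { return this.items.length ? this.items.pop() : null; }
  llen() { return this.items.length; }
}

// Producer side: serialize a task and leave it for another component.
function enqueue(queue, task) {
  return queue.lpush(JSON.stringify(task));
}

// Consumer side: process whatever is waiting, oldest task first.
// Returns how many tasks were handled.
function drain(queue, handle) {
  let processed = 0;
  let raw;
  while ((raw = queue.rpop()) !== null) {
    handle(JSON.parse(raw));
    processed += 1;
  }
  return processed;
}
```

While the consumer is stopped (for a deploy, say), `enqueue` keeps succeeding and the list simply grows; when the consumer comes back, `drain` picks up where it left off. Against a real Redis server the consumer would more likely use a blocking pop than a polling loop.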

  24. ERRORCEPTION ISN'T ONE APP
    The UI server deals with serving HTTP (ExpressJS)
    A super lightweight (90 LOC) pure-node HTTP
    server collects errors:
    Uses node's cluster to split the task across
    processes
    Collects errors and simply dumps them into a redis
    queue
    3 micro-apps process the errors from queue to
    queue in Redis
    Finally, a single small app writes to MongoDB
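
A collector along the lines described above might look roughly like this. It is a sketch under assumptions, not the actual 90-line server: `enqueue` is a hypothetical callback that would push the payload onto a Redis queue, and reading the error from the request body is a guess (the slides do not say how errors arrive).

```javascript
const http = require('http');
const cluster = require('cluster');
const os = require('os');

// `enqueue` is a hypothetical callback; in the setup described above it
// would push the raw payload onto a Redis list.
function createCollector(enqueue) {
  return function onRequest(req, res) {
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', () => {
      // No parsing or validation here: the hot path only hands the
      // payload off, and downstream workers do the real processing.
      enqueue(Buffer.concat(chunks).toString('utf8'));
      res.writeHead(204);
      res.end();
    });
  };
}

// Fork one worker per CPU; every worker listens on the same port.
function start(enqueue, port) {
  // `isPrimary` is the newer name for `isMaster`; check both.
  if (cluster.isPrimary || cluster.isMaster) {
    os.cpus().forEach(() => cluster.fork());
  } else {
    http.createServer(createCollector(enqueue)).listen(port);
  }
}
```

Calling `start(enqueue, 8080)` from the entry script would run the forking branch in the primary process and the listening branch in each worker. The port number is arbitrary.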

  25. MULTIPLE INSTANCES OF AN APP
    Hard to maintain consistency

  26. REDIS-LOCK
    github.com/errorception/redis-lock
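
The idea behind a named lock can be shown without Redis. The sketch below is not the redis-lock library and does not use its API. It is a hypothetical in-process stand-in (`createLock` is a made-up name) that serializes callbacks sharing a lock name; a Redis-backed lock would provide the same guarantee across separate processes.

```javascript
// Hypothetical in-process lock: callbacks that share a name run one at
// a time, in arrival order. Each callback receives a `done` function
// and must call it to release the lock.
function createLock() {
  // name -> callbacks waiting for that lock; a present key means "held".
  const waiting = new Map();

  function release(name) {
    const queue = waiting.get(name);
    if (!queue) return; // already released; ignore a stray done()
    if (queue.length === 0) {
      waiting.delete(name);
      return;
    }
    const next = queue.shift();
    next(() => release(name));
  }

  return function lock(name, criticalSection) {
    if (waiting.has(name)) {
      waiting.get(name).push(criticalSection);
      return;
    }
    waiting.set(name, []);
    criticalSection(() => release(name));
  };
}
```

Two instances of an app updating the same record would both wrap the update in `lock('some-name', ...)`; the second update starts only after the first calls `done`.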

  27. ONE SMALL PROBLEM REMAINS
    What about shared logic?
    Need a way to have a SOA of some sort
    Tried several solutions. All failed in interesting ways

  28. THIS WAS EXHAUSTING
    Frequent errors and downtimes
    Restarts every couple of hours due to memory leaks
    Even wrote a script to restart applications every
    couple of hours!

  29. HOW DO WE SHARE CODE ACROSS
    APPLICATIONS?
    How does node do it?
    Aha moment: node_modules of course!

  30. SHARING CODE WITH SYMLINKS
    The shared code is now a module, rather than a service
    The module folder is simply symlinked into every
    app's node_modules folder
    Still has drawbacks, but works
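
The symlink approach can be sketched in a few lines of node. This is an illustration under assumptions rather than the script Errorception uses: `linkSharedModule` and its parameters are hypothetical names, and the directory layout is made up.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Link `sharedDir` into `<appDir>/node_modules/<name>` so that
// require('<name>') inside the app resolves to the shared folder.
function linkSharedModule(sharedDir, appDir, name) {
  const modulesDir = path.join(appDir, 'node_modules');
  fs.mkdirSync(modulesDir, { recursive: true });
  const linkPath = path.join(modulesDir, name);
  // The 'junction' type only matters on Windows; other platforms ignore it.
  fs.symlinkSync(path.resolve(sharedDir), linkPath, 'junction');
  return linkPath;
}
```

Running this once per app gives every app the same module under the same name, and a change in the shared folder is visible to all of them. That is convenient, but it is also one of the drawbacks: the apps can no longer pin different versions of the shared code.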

  31. CURRENT STACK
    ExpressJS for the website
    Pure, straight-up node for the error catching server
    Redis everywhere
    Mongoose on top of MongoDB
    forever as a process watcher
    24 node processes
    Still one primary machine and one failover

  32. NEXT STEPS
    Improve code sharing
    Some form of service-oriented application
    distribution mechanism

  33. THANK YOU
    github.com/errorception
    errorception.com
    @errorception
    @rakesh314
