Schemas for Real Time Analytics with Node.js - Eric Lubow and Russell Bradberry, SimpleReach

mongodb
January 20, 2012

NYC MongoDB User Group 1-17-2011

SimpleReach powers the Slide, a recommendation-powered content discovery technology for websites. Eric Lubow will discuss how SimpleReach builds schemas in MongoDB and Node.js for powerful, real-time data delivery. Topics will include:

* Dynamic collection creation
* Updating (Increments over sets)
* Schema
* Next gen schemas for dynamic filters


Transcript

  1. Problems
    •  Data is real time, so why shouldn’t the reports be?
    •  Need to be able to filter reports
    •  Must be able to show aggregates in different ways
    •  Load time must be sub-second
    •  Also, need to have pretty graphs
    •  Reporting on web page analytics, so that’s a lot of data
    •  Over 8mm pages indexed so far
    •  On over 12,000 active sites
    •  Some sites are getting over 100mm page views per month
    •  Overall we need to handle billions of page views per month
    •  And we need to do it on the cheap (we are a start up btw)
  2. The old way of doing things
    •  Write events to flat file
    •  Load flat file into Hadoop
    •  MapReduce
    •  MapReduce
    •  MapReduce
    •  Load MapReduced data into MySQL
    •  ????
    •  Profit
  3. Why the old way doesn’t work
    •  Requires a lot of storage space
    •  Requires long-running background tasks
    •  Need quite a few servers and plenty of RAM
    •  It’s not real time
    •  It costs a lot for infrastructure and maintenance
    •  WTF is a zookeeper anyway?
  4. Why MongoDB Works
    •  It’s web scale
    •  No background tasks that can break or need monitoring
    •  No need to load data into different intermediary data stores
    •  It’s real time
    •  We are currently doing 3000+ writes per second and 800+ reads per second (at peak times), on a single machine on Amazon EC2.
    •  We have enough room to grow by an order of magnitude before we have to shard
  5. The Collections
    •  There are 3 main reports
        •  By Account
        •  By Day
        •  By URL
    •  There are 2 main collections
        •  By Account
        •  By URL
    •  Others may include
        •  By $criteria (Dynamic)

    The goal is to aggregate everything ahead of time, as much as possible, so our Rails app doesn’t have to do heavy aggregations.
  6. The Collections cont.
    •  The account report is the most viewed report
        •  The collection is one document per account
        •  Holds all activity for an account
        •  We get a subset of the document to represent the date range requested and do a simple aggregation
    •  The content report is viewed when users want a bit more detail
        •  The collection is one document per URL
        •  Very large, can’t do real-time aggregations
        •  Great for viewing a report by URL because there is minimal aggregating required
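The "subset of the document plus a simple aggregation" step could look roughly like the sketch below. It assumes the nested year/month/day `stats` layout shown on slide 7 and treats `p`, `c`, and `i` as opaque counters, since the slides do not say what they stand for. The function name `sumRange` is hypothetical and not taken from the talk.

```javascript
// Sum the per-day counters of a pre-aggregated stats object over an
// inclusive UTC date range. Days with no data are simply skipped.
function sumRange(stats, start, end) {
  const total = { p: 0, c: 0, i: 0 };
  const oneDay = 24 * 60 * 60 * 1000;
  for (let t = start.getTime(); t <= end.getTime(); t += oneDay) {
    const d = new Date(t);
    const year = String(d.getUTCFullYear());
    const month = String(d.getUTCMonth() + 1).padStart(2, "0");
    const day = String(d.getUTCDate()).padStart(2, "0");
    const daySum = stats?.[year]?.[month]?.[day]?.sum;
    if (!daySum) continue;
    for (const key of Object.keys(total)) {
      total[key] += daySum[key] || 0;
    }
  }
  return total;
}
```

Because the days are already summed at write time, the read side only adds up one small object per day in the requested range.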
  7. The Document

    {
      "_id" : ObjectId("000000000000000000000000"),
      "account_id" : ObjectId("000000000000000000000001"),
      "content_id" : ObjectId("000000000000000000000002"),
      "stats" : {
        "sum" : { "p" : 24690, "c" : 24, "i" : 2468 },
        "2011" : {
          "sum" : { "p" : 24690, "c" : 24, "i" : 2468 },
          "06" : {
            "13" : { "sum" : { "p" : 12345, "c" : 12, "i" : 1234 } },
            "14" : { "sum" : { "p" : 12345, "c" : 12, "i" : 1234 } }
          }
        }
      }
    }
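The abstract mentions "increments over sets". One way to keep a document of this shape current is to build a single `$inc` update that bumps the same counter at every level (overall, year, day). The sketch below is an assumption about how that could be done, not the authors' code; `buildIncrement` is a hypothetical helper, and the counter names are treated as opaque.

```javascript
// Build a MongoDB-style $inc update for one event. The same counter is
// incremented in the overall sum, the year sum, and the day sum, matching
// the levels that carry a "sum" object in the document on slide 7.
function buildIncrement(counter, date, amount = 1) {
  const year = String(date.getUTCFullYear());
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return {
    $inc: {
      [`stats.sum.${counter}`]: amount,
      [`stats.${year}.sum.${counter}`]: amount,
      [`stats.${year}.${month}.${day}.sum.${counter}`]: amount,
    },
  };
}
```

With the current Node.js driver this object could be passed as the update argument of `collection.updateOne(filter, update, { upsert: true })`. Since `$inc` creates missing fields, new days and years would appear without a separate insert step.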
  8. What about filtering?
    •  If the data is pre-aggregated, how do I filter out what I don’t want to see?
    •  There are many ways; the one that suits you might be different
    •  One way is to create a hash for filters
    •  If you have 5 different criteria, that’s (2^5)-1 = 31 documents to update
    •  At 1000 requests per second, that would be 31,000 documents to update per second
    •  That’s less than half of what we have benchmarked on 1 server
    •  It will do.
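The (2^5)-1 figure is the number of non-empty subsets of five criteria. A minimal sketch of enumerating them and deriving one key per subset follows. It reads "hash" as a digest used as a document key, which is an assumption (the slide could equally mean a hash map), and the criteria names in the usage note are made up for illustration.

```javascript
const crypto = require("node:crypto");

// Enumerate every non-empty subset of the given criteria using a bitmask.
// For n criteria this yields 2^n - 1 subsets.
function filterCombinations(criteria) {
  const names = Object.keys(criteria).sort();
  const combos = [];
  for (let mask = 1; mask < (1 << names.length); mask++) {
    const subset = {};
    names.forEach((name, bit) => {
      if (mask & (1 << bit)) subset[name] = criteria[name];
    });
    combos.push(subset);
  }
  return combos;
}

// Derive a stable key for one subset. Keys are sorted first so the same
// filter always maps to the same digest regardless of property order.
function filterHash(subset) {
  const canonical = Object.keys(subset)
    .sort()
    .map((k) => `${k}=${subset[k]}`)
    .join("&");
  return crypto.createHash("md5").update(canonical).digest("hex");
}
```

For an event carrying five hypothetical criteria such as `{ browser, country, device, referrer, section }`, `filterCombinations` returns 31 subsets, and each subset's `filterHash` could identify the pre-aggregated document to increment.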
  9. Why NodeJS
    •  It’s non-blocking!
    •  This means if we are updating 31,000 documents at once, we can do it in parallel
    •  It’s in JavaScript (something all of our developers know)
    •  It can be developed faster than Java/Scala/C++
    •  It’s much faster than Ruby + EventMachine
    •  We thought about using Twisted Python, but we lack the expertise in our team
    •  We can respond to the user before logging the event
    •  Does all the heavy lifting so our Rails app can easily display it to the users.
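The two points about responding before logging and issuing updates in parallel can be sketched as below. This is an illustration under stated assumptions, not the talk's implementation: `respond`, `buildUpdates`, and `writeOne` are hypothetical callbacks supplied by the caller, and promises are used here although a 2012 codebase would more likely have used plain callbacks.

```javascript
// Start every write without waiting for the previous one to finish, then
// report how many of them failed once all have settled.
async function applyUpdates(updates, writeOne) {
  const results = await Promise.allSettled(updates.map((u) => writeOne(u)));
  return results.filter((r) => r.status === "rejected").length;
}

// Reply to the caller first, then kick off the writes. The returned
// promise resolves to the number of failed writes.
function handleEvent(event, respond, buildUpdates, writeOne) {
  respond();
  return applyUpdates(buildUpdates(event), writeOne);
}
```

The trade-off of replying first is that the client never learns about a failed write, so the failure count would need to be logged or monitored on the server side.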
  10. It powers The Slide by SimpleReach
    •  300mm+ slides to date
    •  10mm page views per day
    •  20k+ accounts
    •  8mm+ articles indexed
    •  Sub 50ms response times
    •  Instant real-time reporting
  11. is Hiring!
    •  If you would like to do awesome things with real-time analytics, drop us a line:
        @SimpleReach [email protected]
        Eric Lubow (CTO) @elubow [email protected]
        Russ Bradberry (Principal Architect) @devdazed [email protected]