Realtime Analytics using MongoDB, Python, Gevent, and ZeroMQ

rick446
October 03, 2011

With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is a system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.


Transcript

  1. SourceForge ♥ MongoDB
     - Tried CouchDB – liked the dev model, not so much the performance
     - Migrated consumer-facing pages (summary, browse, download) to MongoDB, and it worked great (on MongoDB 0.8, no less!)
     - Built an entirely new tool platform around MongoDB (Allura)
  2. The Problem We're Trying to Solve
     - We have lots of users (good)
     - We have lots of projects (good)
     - We don't know what those users and projects are doing (not so good)
     - We have tons of code in PHP, Perl, and Python (not so good)
  3. Introducing Zarkov 0.0.1
     - Asynchronous TCP server for event logging with gevent (see the sketch after this list)
     - Turn OFF "safe" writes; turn OFF Ming validation (or do it in the client)
     - Incrementally calculate aggregate stats based on the event log, using mapreduce with {'out': 'reduce'}
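
     A minimal sketch of such a gevent event logger, assuming a one-JSON-event-per-line wire format and 2011-era pymongo (the collection name and port are made up, not Zarkov's actual protocol):

         import json

         from gevent.server import StreamServer
         import pymongo

         conn = pymongo.Connection()        # pre-2.4 pymongo API
         events = conn.zarkov.event

         def handle(socket, address):
             # One JSON event per line; "unsafe" writes skip the
             # getLastError round-trip for throughput
             for line in socket.makefile():
                 events.insert(json.loads(line), safe=False)

         StreamServer(('0.0.0.0', 6543), handle).serve_forever()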
  4. Zarkov Architecture
     [Architecture diagram: events arrive as BSON over ZeroMQ, flow through a journal greenlet and a commit greenlet (each backed by a write-ahead log) into MongoDB; aggregation runs as a cron job]
  5. Technologies
     - MongoDB – fast (10k+ inserts/s single-threaded)
     - ZeroMQ – built-in buffering; PUSH/PULL sockets (push never blocks, easy to distribute work – see the sketch after this list)
     - BSON – fast Python/C implementation; more types than JSON
     - Gevent – "green threads" for Python
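
     A minimal PUSH/PULL sketch with BSON-encoded messages (the endpoint and document shape are assumptions):

         import bson     # ships with pymongo
         import zmq

         ctx = zmq.Context()

         # Consumer: PULL fair-queues messages among connected workers
         pull = ctx.socket(zmq.PULL)
         pull.bind('tcp://127.0.0.1:6544')

         # Producer: PUSH queues locally and never blocks the caller
         # (until the high-water mark is reached)
         push = ctx.socket(zmq.PUSH)
         push.connect('tcp://127.0.0.1:6544')
         push.send(bson.BSON.encode({'type': 'page_view', 'project': 'allura'}))

         doc = bson.BSON(pull.recv()).decode()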
  6. "Wow, it's really fast; can it replace…"
     - Download statistics?
     - Google Analytics?
     - Project realtime statistics?
     "Probably, but it'll take some work…"
  7. Moving towards production…
     - MongoDB MapReduce: convenient, but not so fast (see the sketch after this list)
       - Global JS interpreter lock per mongod
       - Lots of writing to temp collections (high lock %)
       - JavaScript without libraries (ick!)
     - Hadoop? Painful to configure, high latency, non-seamless integration with MongoDB
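
     For context, the kind of server-side JavaScript map/reduce Zarkov 0.0.1 relied on might look roughly like this in pymongo (collection and field names are hypothetical):

         import pymongo
         from bson.code import Code

         db = pymongo.Connection().zarkov

         mapf = Code('''function () {
             emit(this.project, {count: 1});     // one event per document
         }''')

         reducef = Code('''function (key, values) {
             var total = 0;
             values.forEach(function (v) { total += v.count; });
             return {count: total};
         }''')

         # out={'reduce': ...} folds new results into an existing stats
         # collection – the incremental aggregation mentioned on slide 3 –
         # but it all runs under mongod's global JS interpreter lock
         db.event.map_reduce(mapf, reducef, out={'reduce': 'stats'})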
  8. Zarkov's already doing a lot… so we added a lightweight map/reduce framework
     - Write your map/reduce jobs in Python
     - Input/output is MongoDB
     - Intermediate files are local .bson files
     - Use ZeroMQ for job distribution
  9. Quick Map/reduce Refresher (a hypothetical usage example follows)

         import itertools
         import operator

         def map_reduce(input_collection, query, output_collection, map, reduce):
             objects = input_collection.find(query)
             map_results = list(map(objects))
             # Group the (key, value) pairs emitted by the mapper
             map_results.sort(key=operator.itemgetter(0))
             for key, kv_pairs in itertools.groupby(
                     map_results, operator.itemgetter(0)):
                 value = reduce(key, [v for k, v in kv_pairs])
                 output_collection.save({"_id": key, "value": value})
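
     A hypothetical job for this refresher – counting events per project (the names and db handle are made up):

         def count_map(objects):
             for obj in objects:
                 yield obj['project'], 1      # emit (key, value)

         def count_reduce(key, values):
             return sum(values)

         map_reduce(db.event, {'type': 'page_view'}, db.stats,
                    count_map, count_reduce)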
  10. Quick Map/reduce Refresher – Parallel
      Same code as slide 9, highlighting the parts that can run in parallel (the map and reduce steps).
  11. Zarkov Map/Reduce
      - Phases managed by greenlets
      - Map and reduce jobs parceled out to remote workers via zmq PUSH/PULL (see the worker sketch after this list)
      - Adaptive timeout/retry to support dead workers
      - "Sort" is performed by the mapper, generating a fixed # of reduce jobs
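
      A hypothetical worker loop illustrating the PUSH/PULL job distribution (the endpoints, run_job, and the timeout are all made up; not Zarkov's actual code):

          import bson
          import zmq

          ctx = zmq.Context()
          jobs = ctx.socket(zmq.PULL)        # receive map/reduce jobs
          jobs.connect('tcp://master:6545')
          results = ctx.socket(zmq.PUSH)     # send results back
          results.connect('tcp://master:6546')

          poller = zmq.Poller()
          poller.register(jobs, zmq.POLLIN)

          while True:
              # Poll with a timeout (milliseconds) so a dead master doesn't
              # hang the worker; the master, in turn, re-queues jobs whose
              # results never arrive
              if poller.poll(timeout=5000):
                  job = bson.BSON(jobs.recv()).decode()
                  results.send(bson.BSON.encode(run_job(job)))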
  12. Zarkov Web Service
      - We've got the data in; now how do we get it out?
      - Zarkov includes a tiny HTTP server:

            $ curl -d foo='{"c":"sfweb", "b":"date/2011-07-01/", "e":"date/2011-07-04"}' http://localhost:8081/q
            {"foo": {"sflogo": [[1309579200000.0, 12774],
                                [1309665600000.0, 13458],
                                [1309752000000.0, 13967]],
                     "hits": [[1309579200000.0, 69357],
                              [1309665600000.0, 68514],
                              [1309752000000.0, 68494]]}}

      - Values come out tweaked for use in flot (see the note after this list)
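
      The x-values are epoch milliseconds, which is what flot's time axis expects. Converting from Python might look like this (the helper name is hypothetical):

          import calendar
          import datetime

          def to_flot_ms(dt):
              # flot timestamps are milliseconds since the Unix epoch
              return calendar.timegm(dt.timetuple()) * 1000.0

          to_flot_ms(datetime.datetime(2011, 7, 2))   # 1309564800000.0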
  13. MongoDB Tricks
      - Autoincrement integers are harder than in MySQL, but not impossible
      - Unsafe writes, insert > update (a usage example follows)

            class IdGen(object):
                @classmethod
                def get_ids(cls, inc=1):
                    # Atomically reserve a block of `inc` ids
                    obj = cls.query.find_and_modify(
                        query={'_id': 0},
                        update={'$inc': dict(inc=inc)},
                        upsert=True,
                        new=True)
                    return range(obj.inc - inc, obj.inc)
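
      Usage might look like this, assuming IdGen is a Ming mapped class whose counter document lives at _id 0:

          ids = IdGen.get_ids(5)       # e.g. [37, 38, 39, 40, 41]
          next_id = IdGen.get_ids()[0]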
  14. MongoDB Pitfalls
      - Use databases, not collections, to partition really big data (e.g. events) – a sketch follows this list
      - Avoid JavaScript (mapreduce, group, $where)
      - Indexing is nice, but it slows writes down; use _id when you can
      - mongorestore is fast, but locks a lot
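
      A sketch of the database-per-partition idea (the monthly naming scheme is made up):

          import datetime

          import pymongo

          conn = pymongo.Connection()

          def event_collection(when):
              # One database per month: dropping or dumping a partition
              # then touches a whole database, not one huge collection
              db = conn['zarkov_%s' % when.strftime('%Y_%m')]
              return db.event

          event_collection(datetime.datetime.utcnow()).insert({'type': 'page_view'})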
  15. Future Work
      - Remove SPoF
      - Better way of expressing aggregates ("ZQL")
      - Better web integration: WebSockets/Socket.io
      - Maybe trigger aggregations based on event activity?