Slide 1

Slide 1 text

© 2011 Geeknet Inc
Realtime Analytics using MongoDB, Python, Gevent, and ZeroMQ
Rick Copeland, @rick446, [email protected]

Slide 2

Slide 2 text

SourceForge ♥ MongoDB
- Tried CouchDB: liked the dev model, not so much the performance
- Migrated consumer-facing pages (summary, browse, download) to MongoDB and it worked great (on MongoDB 0.8, no less!)
- Built an entirely new tool platform around MongoDB (Allura)

Slide 3

Slide 3 text

The Problem We're Trying to Solve
- We have lots of users (good)
- We have lots of projects (good)
- We don't know what those users and projects are doing (not so good)
- We have tons of code in PHP, Perl, and Python (not so good)

Slide 4

Slide 4 text

Introducing Zarkov 0.0.1
- Asynchronous TCP server for event logging with gevent
- Turn OFF "safe" writes; turn OFF Ming validation (or do it in the client)
- Incrementally calculate aggregate stats based on the event log, using mapreduce with {'out': {'reduce': ...}}
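MongoDB's {'out': {'reduce': ...}} output mode is what makes the aggregation incremental: each run's freshly reduced results are re-reduced against the values already stored in the output collection, so only new events need to be processed. A pure-Python sketch of those semantics (the dict standing in for the output collection and the `reduce_counts` function are illustrative, not Zarkov's actual code):

```python
# Pure-Python sketch of MongoDB's {'out': {'reduce': <coll>}} semantics:
# new map/reduce output is re-reduced against values already stored in
# the output collection, giving incremental aggregation.

def reduce_counts(key, values):
    # A typical reduce function: sum the partial counts for one key.
    return sum(values)

def incremental_out_reduce(output, new_results, reduce_fn):
    """Merge freshly reduced results into the existing output dict."""
    for key, value in new_results.items():
        if key in output:
            # Re-reduce the stored aggregate with the new partial result.
            output[key] = reduce_fn(key, [output[key], value])
        else:
            output[key] = value
    return output

totals = {}
# Day 1: first batch of per-project hit counts.
incremental_out_reduce(totals, {"allura": 2, "zarkov": 1}, reduce_counts)
# Day 2: only the new events are mapped/reduced, then merged in.
incremental_out_reduce(totals, {"allura": 3}, reduce_counts)
print(totals)  # {'allura': 5, 'zarkov': 1}
```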

Slide 5

Slide 5 text

Zarkov Architecture (diagram: events arrive as BSON over ZeroMQ -> journal greenlet -> write-ahead log -> commit greenlet -> MongoDB; aggregation runs as a cron job)

Slide 6

Slide 6 text

Technologies
- MongoDB: fast (10k+ inserts/s single-threaded)
- ZeroMQ: built-in buffering; PUSH/PULL sockets (push never blocks, easy to distribute work)
- BSON: fast Python/C implementation; more types than JSON
- Gevent: "green threads" for Python
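The PUSH/PULL pattern is what makes work distribution cheap: a PUSH socket round-robins messages across connected PULL workers and queues them in ZeroMQ's own buffers, so the sender never blocks. A minimal sketch with pyzmq (the `inproc://jobs` address is illustrative; in production the workers would connect over TCP):

```python
# Minimal PUSH/PULL sketch with pyzmq (assumes pyzmq is installed).
import zmq

ctx = zmq.Context.instance()

# PUSH side: the job producer. bind() must happen before inproc connects.
push = ctx.socket(zmq.PUSH)
push.bind("inproc://jobs")

# PULL side: a worker. With several workers, jobs are round-robined.
pull = ctx.socket(zmq.PULL)
pull.connect("inproc://jobs")

# send never blocks here: messages queue in ZeroMQ's internal buffers.
for i in range(3):
    push.send_json({"job": i})

jobs = [pull.recv_json() for _ in range(3)]
print(jobs)  # [{'job': 0}, {'job': 1}, {'job': 2}]

push.close(); pull.close(); ctx.term()
```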

Slide 7

Slide 7 text

"Wow, it's really fast; can it replace..."
- Download statistics?
- Google Analytics?
- Project realtime statistics?
"Probably, but it'll take some work...."

Slide 8

Slide 8 text

Moving towards production....
- MongoDB MapReduce: convenient, but not so fast
  - Global JS interpreter lock per mongod
  - Lots of writing to temp collections (high lock %)
  - JavaScript without libraries (ick!)
- Hadoop? Painful to configure, high latency, non-seamless integration with MongoDB

Slide 9

Slide 9 text

Zarkov's already doing a lot... so we added a lightweight map/reduce framework
- Write your map/reduce jobs in Python
- Input/output is MongoDB
- Intermediate files are local .bson files
- Use ZeroMQ for job distribution

Slide 10

Slide 10 text

Quick Map/Reduce Refresher

    import itertools
    import operator

    def map_reduce(input_collection, query, output_collection, map, reduce):
        # map: iterable of docs -> iterable of (key, value) pairs
        # reduce: (key, [values]) -> a single combined value
        objects = input_collection.find(query)
        map_results = list(map(objects))
        map_results.sort(key=operator.itemgetter(0))
        for key, kv_pairs in itertools.groupby(
                map_results, operator.itemgetter(0)):
            value = reduce(key, [v for k, v in kv_pairs])
            output_collection.save({"_id": key, "value": value})

Slide 11

Slide 11 text

Quick Map/Reduce Refresher

    import itertools
    import operator

    def map_reduce(input_collection, query, output_collection, map, reduce):
        objects = input_collection.find(query)
        map_results = list(map(objects))                      # Parallel
        map_results.sort(key=operator.itemgetter(0))
        for key, kv_pairs in itertools.groupby(
                map_results, operator.itemgetter(0)):
            value = reduce(key, [v for k, v in kv_pairs])     # Parallel
            output_collection.save({"_id": key, "value": value})

(The map call and the per-key reduce calls are the phases that can be fanned out to workers.)

Slide 12

Slide 12 text

Zarkov Map/Reduce Architecture (diagram: a job manager drives Query -> Map -> Reduce -> Commit phases, with map_in_#.bson and map_out_#.bson as the intermediate files)

Slide 13

Slide 13 text

Zarkov Map/Reduce
- Phases managed by greenlets
- Map and reduce jobs parceled out to remote workers via zmq PUSH/PULL
- Adaptive timeout/retry to support dead workers
- "Sort" is performed by the mapper, generating a fixed # of reduce jobs
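The timeout/retry point deserves a sketch: since PUSH/PULL gives no delivery acknowledgment, the job manager has to track outstanding jobs itself and re-queue any that stay silent too long. A hypothetical version of that bookkeeping (class and method names are illustrative, not Zarkov's actual API):

```python
# Hypothetical sketch of the timeout/retry bookkeeping a job manager
# can use to survive dead workers.
import time

class JobTracker:
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.outstanding = {}   # job_id -> time the job was handed out

    def dispatch(self, job_id, now=None):
        # Record when the job was pushed to a worker.
        self.outstanding[job_id] = now if now is not None else time.time()

    def complete(self, job_id):
        # A worker reported a result; stop tracking the job.
        self.outstanding.pop(job_id, None)

    def timed_out(self, now=None):
        """Jobs whose worker has been silent too long; re-queue these."""
        now = now if now is not None else time.time()
        return [j for j, t in self.outstanding.items()
                if now - t > self.timeout]

tracker = JobTracker(timeout=5.0)
tracker.dispatch("map-0", now=100.0)
tracker.dispatch("map-1", now=100.0)
tracker.complete("map-0")
print(tracker.timed_out(now=106.0))  # ['map-1']
```

An adaptive version would adjust `timeout` from observed job durations rather than using a fixed constant.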

Slide 14

Slide 14 text

Zarkov Web Service
- We've got the data in; now how do we get it out?
- Zarkov includes a tiny HTTP server:

    $ curl -d foo='{"c":"sfweb", "b":"date/2011-07-01/", "e":"date/2011-07-04"}' \
        http://localhost:8081/q
    {"foo": {"sflogo": [[1309579200000.0, 12774], [1309665600000.0, 13458],
                        [1309752000000.0, 13967]],
             "hits": [[1309579200000.0, 69357], [1309665600000.0, 68514],
                      [1309752000000.0, 68494]]}}

- Values come out tweaked for use in flot
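"Tweaked for use in flot" means each series is a list of [timestamp, value] pairs with JavaScript-style timestamps, i.e. milliseconds since the epoch, which is why the numbers in the response are so large. A small sketch of that conversion (the `to_flot` helper is illustrative):

```python
# Flot expects [timestamp_ms, value] pairs, with timestamps in
# JavaScript-style milliseconds since the Unix epoch.
from datetime import datetime, timezone

def to_flot(dt, value):
    # Treat the naive datetime as UTC and scale seconds -> milliseconds.
    ms = int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)
    return [ms, value]

point = to_flot(datetime(2011, 7, 2), 12774)
print(point)  # [1309564800000, 12774]
```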

Slide 15

Slide 15 text

Zarkov Deployment at SF.net

Slide 16

Slide 16 text

Lessons learned at SF.net

Slide 17

Slide 17 text

MongoDB Tricks
- Autoincrement integers are harder than in MySQL, but not impossible
- Unsafe writes; insert > update

    class IdGen(object):
        @classmethod
        def get_ids(cls, inc=1):
            obj = cls.query.find_and_modify(
                query={'_id': 0},
                update={'$inc': dict(inc=inc)},
                upsert=True,
                new=True)
            return range(obj.inc - inc, obj.inc)

Slide 18

Slide 18 text

MongoDB Pitfalls
- Use databases to partition really big data (e.g. events), not collections
- Avoid JavaScript (mapreduce, group, $where)
- Indexing is nice, but slows things down; use _id when you can
- mongorestore is fast, but locks a lot
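One common way to apply the database-per-partition advice is time-based routing: give each month of events its own database, so an expired month can be dropped as a whole database, which deletes its files outright. A sketch of that routing (the `events_YYYY_MM` naming scheme is illustrative, not Zarkov's actual layout):

```python
# Sketch of partitioning big event data by database rather than by
# collection, using per-month database names.
from datetime import datetime

def events_db_name(event_time):
    """Route an event to a per-month database, e.g. 'events_2011_07'."""
    return "events_%04d_%02d" % (event_time.year, event_time.month)

print(events_db_name(datetime(2011, 7, 4)))  # events_2011_07
```

With pymongo, the write would then go to `client[events_db_name(ts)].events`.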

Slide 19

Slide 19 text

Open Source
- Ming: http://sf.net/projects/merciless/ (MIT License)
- Allura: http://sf.net/p/allura/ (Apache License)
- Zarkov: http://sf.net/p/zarkov/ (Apache License)

Slide 20

Slide 20 text

Future Work
- Remove SPoF
- Better way of expressing aggregates
  - "ZQL"
- Better web integration
  - WebSockets/Socket.io
- Maybe trigger aggregations based on event activity?

Slide 21

Slide 21 text

Questions?
Rick Copeland, @rick446, [email protected]

Slide 22

Slide 22 text

Credits
- http://www.flickr.com/photos/jprovost/5733297977/in/photostream/