Text Analysis Using MongoDB - Aaron Cordova, The Interllective Inc.

mongodb
October 07, 2011

MongoBoston 2011

In this talk we present our experiences building a text analysis system on top of MongoDB. The Interllective is building research tools to discover conceptually related documents in technical collections (patents, journal articles, etc.). We use sharded MongoDB to scale out our document collections. We discuss our experiences with text clustering, along with collection design and indexing strategies for using Hadoop MapReduce over MongoDB. We also discuss building a real-time distributed relevance scoring engine on top of MongoDB. And since our data is too large to fit in memory, we address the trade-offs to consider when scaling with a disk-based approach.

Transcript

  1. DATA CHALLENGE
     • Mostly unstructured data (body, title, authors)
     • Deep relevance calculations
     • Low latency
     • Many documents, more all the time
  2. TEXT ANALYSIS
     • Calculate document features
     • Similarity calculations - relevance scoring
     • Clustering and compression to help cut down on work
     • Ranking
  3. MONGO DB
     • Flexible schemata
     • Scalable - auto-sharding
     • Fast - indexing
     • Easy - BSON from Mongo to Python and back (sketched below)
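
     A minimal sketch of the BSON-to-Python round trip this slide refers to, using PyMongo. The
     database, collection, and field names are illustrative, and the calls use the current PyMongo
     API rather than the 2011-era driver:

        from pymongo import ASCENDING, MongoClient

        client = MongoClient("mongodb://localhost:27017")   # a mongos router in a sharded deployment
        coll = client.textdb.documents                      # hypothetical database/collection

        # BSON documents map directly to Python dicts, so flexible,
        # per-document schemata need no extra mapping layer.
        coll.insert_one({
            "title": "Example patent",
            "body": "Full unstructured text ...",
            "authors": ["A. Cordova"],
            "features": {"mongodb": 3, "text": 7},
        })

        # A secondary index keeps lookups fast as the collection grows.
        coll.create_index([("authors", ASCENDING)])

        for doc in coll.find({"authors": "A. Cordova"}, {"title": 1}):
            print(doc["title"])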
  4. CLUSTERING
     • Pre-computed clusters cut down on work done at query time (see the sketch below)
     • Compression from clustering results in an 85% reduction in feature storage
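
     The query-time saving can look roughly like this: each document carries a pre-computed
     cluster id, and only members of the query's cluster are fetched for scoring. The field and
     collection names are assumptions, not the talk's actual schema:

        from pymongo import ASCENDING, MongoClient

        docs = MongoClient("mongodb://localhost:27017").textdb.documents
        docs.create_index([("cluster_id", ASCENDING)])

        def candidate_docs(query_cluster_id):
            # Score only the documents in the query's pre-computed cluster,
            # not the whole collection.
            return docs.find({"cluster_id": query_cluster_id},
                             {"features": 1, "title": 1})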
  5. MAP REDUCE
     • Algorithms are written in Python
     • BSON to Python is very elegant
     • Use Hadoop Streaming to run Python (sketched below)
     • Can optimize Python with C extensions
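
     As a rough illustration of the Hadoop Streaming pattern, the mapper and reducer are plain
     Python scripts that read stdin and write tab-separated key/value lines. The term-counting job
     below is a stand-in for the actual feature-extraction algorithms:

        #!/usr/bin/env python
        # mapper.py - emit (term, 1) for every term on stdin
        import sys

        for line in sys.stdin:
            for term in line.strip().lower().split():
                print("%s\t1" % term)

        #!/usr/bin/env python
        # reducer.py - sum the counts for each term (input arrives sorted by key)
        import sys

        current, total = None, 0
        for line in sys.stdin:
            term, count = line.rstrip("\n").split("\t")
            if term != current and current is not None:
                print("%s\t%d" % (current, total))
                total = 0
            current = term
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

     A typical invocation (jar path and dataset names vary by installation):

        hadoop jar hadoop-streaming.jar -input docs -output term_counts \
            -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py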
  6. MONGO REDUCE
     • Simple adapters for Hadoop to MongoDB
     • Each shard is an input split
     • Output automatically pre-split, balanced
     • Streaming supported
  7. MONGO REDUCE
     • Input takes advantage of indexes
     • Each mapper submits queries to a MongoDB shard (sketched below)
     • Reduce output can go to HDFS or MongoDB
     • Mappers can also write to the global db via a local mongos
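
     A sketch of what one mapper's work could look like under this design, assuming the split
     description (shard address and query predicate) is handed to the script. Host names, database,
     and fields are made up for illustration:

        import sys
        from pymongo import MongoClient

        # One input split == one shard: connect straight to that shard's mongod.
        shard_host = sys.argv[1] if len(sys.argv) > 1 else "shard0.example.com:27018"
        shard = MongoClient(shard_host, directConnection=True)

        # Writes go through the local mongos so they land in the global, sharded db.
        mongos = MongoClient("mongodb://localhost:27017")
        out = mongos.textdb.doc_features

        # The find() below can use an index on the shard, so the mapper only
        # reads documents that match its split's predicate.
        for doc in shard.textdb.documents.find({"cluster_id": 42}, {"body": 1}):
            features = {"length": len(doc.get("body", ""))}   # stand-in for real feature extraction
            out.replace_one({"_id": doc["_id"]},
                            {"_id": doc["_id"], "features": features},
                            upsert=True)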
  8. CO-OCCURRENCE
     • Too many intermediate pair-counts to write to HDFS
     • Even using Hadoop combiners doesn't help - they use memory to find opportunities to aggregate
     • With MongoDB, use find() and update() calls in the map and reduce functions to avoid materializing intermediate values (sketched below)
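
     The pattern in the last bullet can be sketched like this: each worker folds its pair counts
     straight into a MongoDB collection with upserts, so the (term, term) counts are aggregated in
     place instead of being written out as intermediate records. Collection and field names are
     illustrative:

        from itertools import combinations
        from pymongo import MongoClient

        cooc = MongoClient("mongodb://localhost:27017").textdb.cooccurrence

        def count_pairs(terms):
            """Increment the co-occurrence count for each unordered pair of terms."""
            for a, b in combinations(sorted(set(terms)), 2):
                # upsert=True creates the pair document the first time it is seen;
                # $inc aggregates in place, so no intermediate pair list is materialized.
                cooc.update_one({"_id": {"a": a, "b": b}},
                                {"$inc": {"count": 1}},
                                upsert=True)

        count_pairs(["mongodb", "text", "analysis", "text"])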
  9. RELEVANCE
     • Algorithms determine unique relevance of concepts to documents in collections
     • Need to score not all, but many related documents
     • Use clustering to identify groups of high-scoring documents
  10. SCORING
      • Score servers use Python and Thrift
      • Package up algorithms
      • Each query is sent to all shards (see the scatter/gather sketch below)
      • Scoring servers do heavy calculations on a subset of docs identified using indexes from pre-computation steps
      • Results gathered and presented to the user in real time
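
      The scatter/gather step might look like the following sketch; score_shard stands in for the
      real Thrift RPC, and the server list and top_n default are assumptions:

        from concurrent.futures import ThreadPoolExecutor

        SCORE_SERVERS = ["score1:9090", "score2:9090", "score3:9090"]   # hypothetical

        def score_shard(server, query):
            # Placeholder for the Thrift call: each score server runs the heavy
            # relevance calculation over the candidate docs its indexes identify.
            return []   # list of (doc_id, score) pairs

        def score(query, top_n=20):
            # Fan the query out to every shard's score server in parallel ...
            with ThreadPoolExecutor(max_workers=len(SCORE_SERVERS)) as pool:
                partials = list(pool.map(lambda server: score_shard(server, query),
                                         SCORE_SERVERS))
            # ... then merge and rank the partial results for the user.
            merged = [hit for part in partials for hit in part]
            return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_n]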
  11. DEPLOYMENT
      • We use Amazon Web Services (AWS) Spot Instances
      • Rely on replication for availability
      • Use more shards on local (ephemeral) disks rather than keeping all data in memory
      • Keep indexes in memory
      • Easy to scale for new document corpora
  12. TRADE-OFFS
      • Our scoring performance is dominated by CPU
      • We can read data from disk fast enough to keep the CPU busy
      • Built-in caching does a good job of keeping portions of the data in memory
      • For these reasons, more machines (cores) are better for us than more memory