Text Analysis Using MongoDB - Aaron Cordova, The Interllective Inc.

mongodb
October 07, 2011

MongoBoston 2011

In this talk we present our experiences building a text analysis system on top of MongoDB. The Interllective is building research tools to discover conceptually related documents in technical collections (patents, journal articles, etc.). We use sharded MongoDB to scale out our document collections. We discuss our experiences with text clustering, along with collection design and indexing strategies for using Hadoop MapReduce over MongoDB. We also discuss building a real-time distributed relevance scoring engine on top of MongoDB. And since our data is too large to fit in memory, we address the trade-offs to consider when scaling with a disk-based approach.

Transcript

  1. DATA CHALLENGE
     • Mostly unstructured data (body, title, authors)
     • Deep relevance calculations
     • Low latency
     • Many documents, more all the time
  2. TEXT ANALYSIS
     • Calculate document features
     • Similarity calculations - relevance scoring
     • Clustering and compression to help cut down on work
     • Ranking
  3. MONGO DB
     • Flexible schemata
     • Scalable - auto-sharding
     • Fast - indexing
     • Easy - BSON from Mongo to Python and back (sketched below)
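
     A minimal sketch of the BSON-to-Python round trip this slide refers to, using PyMongo. The
     database, collection, and field names are illustrative, and the calls use the current PyMongo
     API rather than the 2011-era driver:

        from pymongo import ASCENDING, MongoClient

        client = MongoClient("mongodb://localhost:27017")   # a mongos router in a sharded deployment
        coll = client.textdb.documents                      # hypothetical database/collection

        # BSON documents map directly to Python dicts, so flexible,
        # per-document schemata need no extra mapping layer.
        coll.insert_one({
            "title": "Example patent",
            "body": "Full unstructured text ...",
            "authors": ["A. Cordova"],
            "features": {"mongodb": 3, "text": 7},
        })

        # A secondary index keeps lookups fast as the collection grows.
        coll.create_index([("authors", ASCENDING)])

        for doc in coll.find({"authors": "A. Cordova"}, {"title": 1}):
            print(doc["title"])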
  4. CLUSTERING
     • Pre-computed clusters cut down on work done at query time (see the sketch below)
     • Compression from clustering results in an 85% reduction in feature storage
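
     The query-time saving can look roughly like this: each document carries a pre-computed
     cluster id, and only members of the query's cluster are fetched for scoring. The field and
     collection names are assumptions, not the talk's actual schema:

        from pymongo import ASCENDING, MongoClient

        docs = MongoClient("mongodb://localhost:27017").textdb.documents
        docs.create_index([("cluster_id", ASCENDING)])

        def candidate_docs(query_cluster_id):
            # Score only the documents in the query's pre-computed cluster,
            # not the whole collection.
            return docs.find({"cluster_id": query_cluster_id},
                             {"features": 1, "title": 1})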
  5. MAP REDUCE
     • Algorithms are written in Python
     • BSON to Python is very elegant
     • Use Hadoop Streaming to run Python (sketched below)
     • Can optimize Python with C extensions
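
     As a rough illustration of the Hadoop Streaming pattern, the mapper and reducer are plain
     Python scripts that read stdin and write tab-separated key/value lines. The term-counting job
     below is a stand-in for the actual feature-extraction algorithms:

        #!/usr/bin/env python
        # mapper.py - emit (term, 1) for every term on stdin
        import sys

        for line in sys.stdin:
            for term in line.strip().lower().split():
                print("%s\t1" % term)

        #!/usr/bin/env python
        # reducer.py - sum the counts for each term (input arrives sorted by key)
        import sys

        current, total = None, 0
        for line in sys.stdin:
            term, count = line.rstrip("\n").split("\t")
            if term != current and current is not None:
                print("%s\t%d" % (current, total))
                total = 0
            current = term
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

     A typical invocation (jar path and dataset names vary by installation):

        hadoop jar hadoop-streaming.jar -input docs -output term_counts \
            -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py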
  6. MONGO REDUCE
     • Simple adapters for Hadoop to MongoDB
     • Each shard is an input split
     • Output automatically pre-split, balanced
     • Streaming supported
  7. MONGO REDUCE
     • Input takes advantage of indexes
     • Each mapper submits queries to a MongoDB shard (sketched below)
     • Reduce output can go to HDFS or MongoDB
     • Mappers can also write to the global db via a local mongos
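
     A sketch of what one mapper's work could look like under this design, assuming the split
     description (shard address and query predicate) is handed to the script. Host names, database,
     and fields are made up for illustration:

        import sys
        from pymongo import MongoClient

        # One input split == one shard: connect straight to that shard's mongod.
        shard_host = sys.argv[1] if len(sys.argv) > 1 else "shard0.example.com:27018"
        shard = MongoClient(shard_host, directConnection=True)

        # Writes go through the local mongos so they land in the global, sharded db.
        mongos = MongoClient("mongodb://localhost:27017")
        out = mongos.textdb.doc_features

        # The find() below can use an index on the shard, so the mapper only
        # reads documents that match its split's predicate.
        for doc in shard.textdb.documents.find({"cluster_id": 42}, {"body": 1}):
            features = {"length": len(doc.get("body", ""))}   # stand-in for real feature extraction
            out.replace_one({"_id": doc["_id"]},
                            {"_id": doc["_id"], "features": features},
                            upsert=True)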
  8. CO-OCCURRENCE
     • Too many intermediate pair-counts to write to HDFS
     • Even using Hadoop combiners doesn't help - they use memory to find opportunities to aggregate
     • With MongoDB, use find() and update() calls in the map and reduce functions to avoid materializing intermediate values (sketched below)
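
     The pattern in the last bullet can be sketched like this: each worker folds its pair counts
     straight into a MongoDB collection with upserts, so the (term, term) counts are aggregated in
     place instead of being written out as intermediate records. Collection and field names are
     illustrative:

        from itertools import combinations
        from pymongo import MongoClient

        cooc = MongoClient("mongodb://localhost:27017").textdb.cooccurrence

        def count_pairs(terms):
            """Increment the co-occurrence count for each unordered pair of terms."""
            for a, b in combinations(sorted(set(terms)), 2):
                # upsert=True creates the pair document the first time it is seen;
                # $inc aggregates in place, so no intermediate pair list is materialized.
                cooc.update_one({"_id": {"a": a, "b": b}},
                                {"$inc": {"count": 1}},
                                upsert=True)

        count_pairs(["mongodb", "text", "analysis", "text"])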
  9. RELEVANCE
     • Algorithms determine unique relevance of concepts to documents in collections
     • Need to score not all, but many related documents
     • Use clustering to identify groups of high-scoring documents
  10. SCORING
      • Score servers use Python and Thrift
      • Package up algorithms
      • Each query is sent to all shards (see the scatter/gather sketch below)
      • Scoring servers do heavy calculations on a subset of docs identified using indexes from pre-computation steps
      • Results gathered and presented to the user in real time
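
      The scatter/gather step might look like the following sketch; score_shard stands in for the
      real Thrift RPC, and the server list and top_n default are assumptions:

        from concurrent.futures import ThreadPoolExecutor

        SCORE_SERVERS = ["score1:9090", "score2:9090", "score3:9090"]   # hypothetical

        def score_shard(server, query):
            # Placeholder for the Thrift call: each score server runs the heavy
            # relevance calculation over the candidate docs its indexes identify.
            return []   # list of (doc_id, score) pairs

        def score(query, top_n=20):
            # Fan the query out to every shard's score server in parallel ...
            with ThreadPoolExecutor(max_workers=len(SCORE_SERVERS)) as pool:
                partials = list(pool.map(lambda server: score_shard(server, query),
                                         SCORE_SERVERS))
            # ... then merge and rank the partial results for the user.
            merged = [hit for part in partials for hit in part]
            return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_n]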
  11. DEPLOYMENT
      • We use Amazon Web Services (AWS) Spot Instances
      • Rely on replication for availability
      • Use more shards on local (ephemeral) disks rather than keeping all data in memory
      • Keep indexes in memory
      • Easy to scale for new document corpora
  12. TRADE-OFFS
      • Our scoring performance is dominated by CPU
      • We can read data from disk fast enough to keep the CPU busy
      • Built-in caching does a good job of keeping portions of the data in memory
      • For these reasons, more machines (cores) are better for us than more memory