• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics

Hadoop
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.”
http://hadoop.apache.org

Operational: MongoDB
• First-Level Analytics
• Internet of Things
• Social
• Mobile Apps
• Product/Asset Catalog
• Security & Fraud
• Single View
• Customer Data Management
• Churn Analysis
• Risk Modeling
• Sentiment Analysis
• Trade Surveillance
• Recommender
• Warehouse & ETL
• Ad Targeting
• Predictive Analytics

Analytical: Hadoop

Operational & Analytical: Lifecycle

Insurance
Applications powered by MongoDB
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
MongoDB Connector for Hadoop

Connector Features and Functionality
• Computes splits to read data
  - Single Node, Replica Sets, Sharded Clusters
• Mappings for Pig and Hive
  - MongoDB as a standard data source/destination
• Support for
  - Filtering data with MongoDB queries
  - Authentication
  - Reading from Replica Set tags
  - Appending to existing collections

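To make "computes splits to read data" concrete, here is a toy sketch of the general idea: a key space is cut into contiguous ranges so that each range can be read by a separate task with its own range query. This is an illustration only and not the connector's actual algorithm; the function names `compute_splits` and `split_query` and the `max_docs_per_split` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def compute_splits(sorted_keys: List[int], max_docs_per_split: int) -> List[Tuple[int, int]]:
    """Cut a sorted list of keys into inclusive (min_key, max_key) ranges.

    Toy illustration of input splitting; NOT the connector's implementation.
    """
    if max_docs_per_split < 1:
        raise ValueError("max_docs_per_split must be at least 1")
    splits = []
    for start in range(0, len(sorted_keys), max_docs_per_split):
        chunk = sorted_keys[start:start + max_docs_per_split]
        splits.append((chunk[0], chunk[-1]))
    return splits


def split_query(bounds: Tuple[int, int]) -> Dict[str, Dict[str, int]]:
    """Build the MongoDB range filter one task would use to read its split."""
    low, high = bounds
    return {"_id": {"$gte": low, "$lte": high}}


if __name__ == "__main__":
    for bounds in compute_splits(list(range(10)), max_docs_per_split=4):
        print(bounds, split_query(bounds))
```

Each printed range is independent of the others, which is what lets the reads run in parallel.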
Pig Mappings
• Input: BSONLoader and MongoLoader
    data = LOAD 'mongodb://mydb:27017/db.collection'
        USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
    STORE records INTO 'hdfs:///output.bson'
        USING com.mongodb.hadoop.pig.BSONStorage;

Hive Support
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
    TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

Spark
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API

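The Spark bullets above can be sketched in PySpark roughly as follows. This is a minimal sketch under assumptions: the connector jar is on the Spark classpath, the property names `mongo.input.uri` and `mongo.input.query` and the format class `com.mongodb.hadoop.MongoInputFormat` match the connector version in use, and the key/value class names follow commonly published PySpark examples and may need adjusting. The helper names `mongo_input_conf` and `load_collection` are hypothetical, and the URI is the one from the Hive slide.

```python
import json
from typing import Dict, Optional

# Assumed connector class name; verify against the connector version in use.
MONGO_INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"


def mongo_input_conf(uri: str, query: Optional[dict] = None) -> Dict[str, str]:
    """Build the Hadoop configuration entries for reading a collection."""
    conf = {"mongo.input.uri": uri}
    if query is not None:
        # The filter is passed to the connector as a JSON string.
        conf["mongo.input.query"] = json.dumps(query)
    return conf


def load_collection(sc, uri: str, query: Optional[dict] = None):
    """Load a collection as an RDD via SparkContext's Hadoop API.

    `sc` is an existing pyspark.SparkContext; this call needs a Spark
    runtime with the connector jar available.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass=MONGO_INPUT_FORMAT,
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=mongo_input_conf(uri, query),
    )


if __name__ == "__main__":
    print(mongo_input_conf("mongodb://host:27017/test.users", {"age": {"$gt": 30}}))
```

Keeping the configuration in a plain dictionary makes it easy to inspect or unit-test without a running cluster.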
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries to MongoDB
  - Dynamic queries with most recent data
  - Puts load on operational database
• BSON snapshots in HDFS
  - Snapshots move load to Hadoop
  - Snapshots add predictable load to MongoDB

MovieWeb Web Application
• Browse
  - Top movies by ratings count
  - Top genres by movie count
• Log in to
  - See My Ratings
  - Rate movies
• Recommendations
  - Movies You May Like
  - Recommendations

MovieWeb Workflow
• Snapshot db as BSON
• Store BSON in HDFS
• Read BSON into Spark App
• Train Model from existing ratings
• Create user movie pairing
• Predict ratings for all pairings
• Write Prediction to MongoDB collection
• Web Application exposes recommendations
• Repeat Process

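The train, pair, and predict steps of the workflow can be illustrated without a cluster. The sketch below swaps in a simple baseline predictor (global mean plus per-user and per-movie offsets) purely for illustration; the slides do not say which model the Spark app trains. The function names and the document field names `userid`, `movieid`, and `rating` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Rating = Tuple[str, str, float]  # (user, movie, rating)


def train_baseline(ratings: List[Rating]):
    """Train model from existing ratings: global mean plus user/movie offsets."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    user_dev, movie_dev = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        user_dev[user].append(rating - global_mean)
        movie_dev[movie].append(rating - global_mean)
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    movie_bias = {m: sum(d) / len(d) for m, d in movie_dev.items()}
    return global_mean, user_bias, movie_bias


def predict_all(ratings: List[Rating]) -> List[Dict[str, object]]:
    """Create every user/movie pairing and predict a rating for each."""
    global_mean, user_bias, movie_bias = train_baseline(ratings)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    # Each dict stands in for a document to write to a predictions collection.
    return [
        {"userid": u, "movieid": m, "rating": global_mean + user_bias[u] + movie_bias[m]}
        for u, m in product(users, movies)
    ]


if __name__ == "__main__":
    sample = [("u1", "m1", 4.0), ("u1", "m2", 2.0), ("u2", "m1", 5.0)]
    for doc in predict_all(sample):
        print(doc)
```

In the workflow on the slide, the resulting documents would be written to a MongoDB collection for the web application to read, and the whole process would be repeated on the next snapshot.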
Business First!
• First-Level Analytics
• Internet of Things
• Social
• Mobile Apps
• Product/Asset Catalog
• Security & Fraud
• Single View
• Customer Data Management
• Churn Analysis
• Risk Modeling
• Sentiment Analysis
• Trade Surveillance
• Recommender
• Warehouse & ETL
• Ad Targeting
• Predictive Analytics
What/Why
How