MongoDB and Hadoop

Why and how do MongoDB and Hadoop work together?

This presentation explains.

This presentation was delivered during MongoDB Day Paris 2014

Tugdual Grall

October 28, 2014

Transcript

  1. Paris
    Tugdual Grall
    Technical Evangelist
    [email protected]
    @tgrall

  2. MongoDB & Hadoop
    Tugdual Grall
    Technical Evangelist
    [email protected]
    @tgrall

  3. Agenda
    Evolving Data Landscape
    MongoDB & Hadoop Use Cases
    MongoDB Connector Features
    Demo

  4. Evolving Data Landscape

  5. Hadoop
    • Terabyte and Petabyte datasets
    • Data warehousing
    • Advanced analytics
    "The Apache Hadoop software library is a framework that allows for the
    distributed processing of large data sets across clusters of computers
    using simple programming models."
    http://hadoop.apache.org

  6. Enterprise IT Stack

  7. Operational vs. Analytical: Enrichment
    Warehouse, Analytics
    Applications, Interactions

  8. Operational: MongoDB
    First-Level Analytics, Internet of Things, Social, Mobile Apps,
    Product/Asset Catalog, Security & Fraud, Single View,
    Customer Data Management

  9. Analytical: Hadoop
    Churn Analysis, Risk Modeling, Sentiment Analysis, Trade Surveillance,
    Recommender, Warehouse & ETL, Ad Targeting, Predictive Analytics

  10. Operational & Analytical: Lifecycle
    The same use cases span both sides: operational workloads feed the
    analytical ones, and analytical results flow back into the applications.

  11. MongoDB & Hadoop Use Cases

  12. Commerce
    Applications powered by MongoDB:
    Products & Inventory, Recommended Products, Customer Profile,
    Session Management
    Analysis powered by Hadoop:
    Elastic Pricing, Recommendation Models, Predictive Analytics,
    Clickstream History
    Connected by the MongoDB Connector for Hadoop

  13. Insurance
    Applications powered by MongoDB:
    Customer Profiles, Insurance Policies, Session Data, Call Center Data
    Analysis powered by Hadoop:
    Customer Action Analysis, Churn Analysis, Churn Prediction, Policy Rates
    Connected by the MongoDB Connector for Hadoop

  14. Fraud Detection
    MongoDB Connector for Hadoop
    Payments
    Nightly Analysis
    3rd Party Data Sources
    Results Cache
    Fraud Detection (Query Only)

  15. MongoDB Connector for Hadoop

  16. Connector Overview
    DATA
    • Read/Write MongoDB
    • Read/Write BSON

    TOOLS
    • MapReduce
    • Pig
    • Hive
    • Spark

    PLATFORMS
    • Apache Hadoop
    • Cloudera CDH
    • Hortonworks HDP
    • MapR
    • Amazon EMR


  17. Connector Features and Functionality
    • Computes splits to read data
    • Single Node, Replica Sets, Sharded Clusters
    • Mappings for Pig and Hive
    • MongoDB as a standard data source/destination
    • Support for
    • Filtering data with MongoDB queries
    • Authentication
    • Reading from Replica Set tags
    • Appending to existing collections
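The "filtering data with MongoDB queries" bullet means the filter is handed to the connector through its configuration as a JSON string under the `mongo.input.query` key, so MongoDB itself discards non-matching documents before they reach the Hadoop job. A minimal sketch; the URI and the filter are illustrative, not taken from the demo:

```python
import json

def connector_conf(input_uri, query=None):
    """Build the Hadoop Configuration entries read by the connector."""
    conf = {"mongo.input.uri": input_uri}
    if query is not None:
        # The query is pushed down to MongoDB as a JSON string, so only
        # matching documents are sent to the Hadoop job.
        conf["mongo.input.query"] = json.dumps(query)
    return conf

conf = connector_conf(
    "mongodb://mydb:27017/db1.collection1",
    query={"rating": {"$gte": 4}},  # hypothetical filter
)
```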

  18. MapReduce Configuration
    • MongoDB input/output
    mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
    mongo.input.uri = mongodb://mydb:27017/db1.collection1
    mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
    mongo.output.uri = mongodb://mydb:27017/db1.collection2
    • BSON input/output
    mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
    mapred.input.dir = hdfs:///tmp/database.bson
    mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
    mapred.output.dir = hdfs:///tmp/output.bson

  19. Pig Mappings
    • Input: BSONLoader and MongoLoader
    data = LOAD 'mongodb://mydb:27017/db.collection'
           USING com.mongodb.hadoop.pig.MongoLoader;
    • Output: BSONStorage and MongoInsertStorage
    STORE records INTO 'hdfs:///output.bson'
          USING com.mongodb.hadoop.pig.BSONStorage;

  20. Hive Support
    • Access collections as Hive tables
    • Use with MongoStorageHandler or BSONStorageHandler
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
    TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

  21. Spark
    • Use with the MapReduce input/output formats
    • Create Configuration objects with the input/output formats and the data URI
    • Load/save data using the SparkContext Hadoop file API
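The steps above can be sketched in PySpark. This is a non-authoritative outline that assumes pyspark is installed and the mongo-hadoop and Java driver jars are on the classpath; the collection names are made up. The functions are defined but only the configuration is built here, since actually loading requires a running cluster:

```python
def mongo_conf(input_uri, output_uri):
    # The same configuration keys used for MapReduce jobs drive the
    # Spark load/save paths.
    return {"mongo.input.uri": input_uri, "mongo.output.uri": output_uri}

def load_collection(sc, conf):
    # SparkContext's Hadoop file API bridges the connector's InputFormat
    # into an RDD of (_id, document) pairs.
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=conf,
    )

def save_collection(rdd, conf):
    # The Mongo OutputFormat takes its target from mongo.output.uri;
    # the path argument is not used for the MongoDB destination.
    rdd.saveAsNewAPIHadoopFile(
        path="file:///unused",
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        conf=conf,
    )

conf = mongo_conf(
    "mongodb://127.0.0.1:27017/movielens.ratings",
    "mongodb://127.0.0.1:27017/movielens.predictions",
)
```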

  22. Data Movement
    Dynamic queries to MongoDB vs. BSON snapshots in HDFS
    Dynamic queries:
    - Work with the most recent data
    - Put load on the operational database
    BSON snapshots:
    - Move the processing load to Hadoop
    - Add predictable load to MongoDB

  23. Demo: Recommendation Platform

  24. Movie Web

  25. MovieWeb Web Application
    • Browse
    - Top movies by ratings count
    - Top genres by movie count
    • Log in to
    - See My Ratings
    - Rate movies
    • Recommendations
    - Movies You May Like
    - Recommendations

  26. MovieWeb Components
    • MovieLens dataset
    – 10M ratings, 10K movies, 70K users
    – http://grouplens.org/datasets/movielens/
    • Python web app to browse movies, recommendations
    – Flask, PyMongo
    • Spark app computes recommendations
    – MLlib collaborative filtering
    • Predicted ratings are exposed in web app
    – New predictions collection
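As an illustration of the browse queries, the "top movies by ratings count" page could be served by a MongoDB aggregation like the one below. The pipeline shape is standard aggregation syntax, but the collection and field names (`ratings`, `movie_id`) are assumptions, not taken from the demo code:

```python
# Aggregation pipeline for "top movies by ratings count"
# (field names such as movie_id are illustrative).
top_movies_pipeline = [
    {"$group": {"_id": "$movie_id", "num_ratings": {"$sum": 1}}},
    {"$sort": {"num_ratings": -1}},
    {"$limit": 10},
]

# With PyMongo this would run as:
#   db.ratings.aggregate(top_movies_pipeline)
```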

  27. Spark Recommender
    • Apache Hadoop (2.3)
    - HDFS & YARN
    • Spark (1.0)
    - Execute within YARN
    - Assign executor resources
    • Data
    - From HDFS, MongoDB
    - To MongoDB

  28. MovieWeb Workflow
    1. Snapshot the database as BSON
    2. Store the BSON in HDFS
    3. Read the BSON into the Spark app
    4. Train the model from existing ratings
    5. Create user-movie pairings
    6. Predict ratings for all pairings
    7. Write predictions to a MongoDB collection
    8. Web application exposes the recommendations
    9. Repeat the process
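The "create user-movie pairings" step pairs each user with the movies they have not rated yet, so the trained model scores exactly those gaps. A plain-Python sketch of the idea; in the Spark app this would be done on RDDs, and the names here are illustrative:

```python
def unrated_pairs(user_ids, movie_ids, rated):
    """Return (user, movie) pairs not present in the existing ratings.

    rated is a set of (user_id, movie_id) tuples already in the dataset.
    """
    return [
        (u, m)
        for u in user_ids
        for m in movie_ids
        if (u, m) not in rated
    ]

pairs = unrated_pairs([1, 2], [10, 20], rated={(1, 10), (2, 20)})
# pairs == [(1, 20), (2, 10)]
```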

  29. Execution
    $ spark-submit --master local \
        --driver-memory 2G --executor-memory 2G \
        --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
        --class com.mongodb.workshop.SparkExercise \
        ./target/spark-1.0-SNAPSHOT.jar \
        hdfs://localhost:9000 \
        mongodb://127.0.0.1:27017/movielens \
        predictions

  30. Should I use MongoDB or Hadoop?

  31. Business First!
    What/Why: First-Level Analytics, Internet of Things, Social, Mobile Apps,
    Product/Asset Catalog, Security & Fraud, Single View, Customer Data
    Management, Churn Analysis, Risk Modeling, Sentiment Analysis,
    Trade Surveillance, Recommender, Warehouse & ETL, Ad Targeting,
    Predictive Analytics
    How: choose the technology after the use case.

  32. The right tool for the task
    • Dataset size
    • Data processing complexity
    • Continuous improvement
    V1.0

  33. The right tool for the task
    • Dataset size
    • Data processing complexity
    • Continuous improvement
    V2.0

  34. Resources / Questions
    • MongoDB Connector for Hadoop
    - http://github.com/mongodb/mongo-hadoop
    • Getting Started with MongoDB and Hadoop
    - http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
    • MongoDB-Spark Demo
    - https://github.com/crcsmnky/mongodb-hadoop-workshop

  35. MongoDB & Hadoop
    Tugdual Grall
    Technical Evangelist
    [email protected]
    @tgrall
