Slide 1

Paris
Tugdual Grall, Technical Evangelist, [email protected], @tgrall

Slide 2

MongoDB & Hadoop
Tugdual Grall, Technical Evangelist, [email protected], @tgrall

Slide 3

Agenda
• Evolving Data Landscape
• MongoDB & Hadoop Use Cases
• MongoDB Connector Features
• Demo

Slide 4

Evolving Data Landscape

Slide 5

Hadoop
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.” (http://hadoop.apache.org)
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics

Slide 6

Enterprise IT Stack

Slide 7

Operational vs. Analytical: Enrichment
Analytical side: warehouse and analytics. Operational side: applications and interactions.

Slide 8

Operational: MongoDB
Use cases across the spectrum: First-Level Analytics, Internet of Things, Social, Mobile Apps, Product/Asset Catalog, Security & Fraud, Single View, Customer Data Management, Churn Analysis, Risk Modeling, Sentiment Analysis, Trade Surveillance, Recommender, Warehouse & ETL, Ad Targeting, Predictive Analytics

Slide 9

Analytical: Hadoop
The same use-case spectrum, viewed from the analytical side.

Slide 10

Operational & Analytical: Lifecycle
The same use-case spectrum; most workloads cycle between the operational and analytical sides.

Slide 11

MongoDB & Hadoop Use Cases

Slide 12

Commerce
Applications powered by MongoDB: products & inventory, recommended products, customer profile, session management
Analysis powered by Hadoop: elastic pricing, recommendation models, predictive analytics, clickstream history
Linked by the MongoDB Connector for Hadoop

Slide 13

Insurance
Applications powered by MongoDB: customer profiles, insurance policies, session data, call center data
Analysis powered by Hadoop: customer action analysis, churn analysis, churn prediction, policy rates
Linked by the MongoDB Connector for Hadoop

Slide 14

Fraud Detection
Payments and 3rd-party data sources feed a nightly analysis job in Hadoop through the MongoDB Connector for Hadoop. Results are written to a results cache in MongoDB, which the fraud detection service queries (query only).

Slide 15

MongoDB Connector for Hadoop

Slide 16

Connector Overview
• Data: read/write MongoDB, read/write BSON
• Tools: MapReduce, Pig, Hive, Spark
• Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, MapR, Amazon EMR

Slide 17

Connector Features and Functionality
• Computes splits to read data from single nodes, replica sets, or sharded clusters
• Mappings for Pig and Hive: MongoDB as a standard data source/destination
• Support for:
  - Filtering data with MongoDB queries (see the example below)
  - Authentication
  - Reading from replica set tags
  - Appending to existing collections
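
For example, a filter can be pushed down to MongoDB with the mongo.input.query property, so that only matching documents are read by the job; the query value below is illustrative:

mongo.input.query = {"age": {"$gte": 18}}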

Slide 18

MapReduce Configuration
• MongoDB input/output:
  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
  mongo.input.uri = mongodb://mydb:27017/db1.collection1
  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
  mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output:
  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
  mapred.input.dir = hdfs:///tmp/database.bson
  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
  mapred.output.dir = hdfs:///tmp/output.bson
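
The same properties can also be set programmatically when building a job. A minimal driver sketch in Java, assuming Hadoop 2.x and mongo-hadoop-core on the classpath (the mapper and reducer are hypothetical and elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read from one collection, write results to another
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "mongo-hadoop sketch");
        job.setJarByClass(MongoJobDriver.class);
        job.setInputFormatClass(MongoInputFormat.class);   // splits are computed against MongoDB
        job.setOutputFormatClass(MongoOutputFormat.class); // results are written back to MongoDB
        // job.setMapperClass(...); job.setReducerClass(...); // hypothetical job logic
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}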

Slide 19

Pig Mappings
• Input: BSONLoader and MongoLoader
  data = LOAD 'mongodb://mydb:27017/db.collection'
         USING com.mongodb.hadoop.pig.MongoLoader();
• Output: BSONStorage and MongoInsertStorage
  STORE records INTO 'hdfs:///output.bson'
        USING com.mongodb.hadoop.pig.BSONStorage();

Slide 20

Hive Support
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
  CREATE TABLE mongo_users (id int, name string, age int)
  STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
  WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
  TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")

Slide 21

Spark
• Use the MapReduce input/output formats
• Create Configuration objects with the input/output formats and data URIs
• Load/save data using the SparkContext Hadoop file API
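
A minimal Java sketch of that pattern, assuming Spark 1.x with the mongo-hadoop core and MongoDB Java driver JARs on the classpath (the database and collection names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoSparkSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("MongoSparkSketch"));

        // Input: read a collection as (id, document) pairs
        Configuration inputConf = new Configuration();
        inputConf.set("mongo.input.uri", "mongodb://127.0.0.1:27017/movielens.ratings");
        JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
                inputConf, MongoInputFormat.class, Object.class, BSONObject.class);

        // ... transformations on docs would go here ...

        // Output: write (key, document) pairs to another collection;
        // MongoOutputFormat ignores the file path, so a dummy value is passed
        Configuration outputConf = new Configuration();
        outputConf.set("mongo.output.uri", "mongodb://127.0.0.1:27017/movielens.out");
        docs.saveAsNewAPIHadoopFile("file:///unused", Object.class, BSONObject.class,
                MongoOutputFormat.class, outputConf);

        sc.stop();
    }
}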

Slide 22

Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
• Dynamic queries work against the most recent data, but put load on the operational database
• Snapshots move processing load to Hadoop and add only predictable load to MongoDB

Slide 23

Demo: Recommendation Platform

Slide 24

MovieWeb

Slide 25

MovieWeb Web Application
• Browse
  - Top movies by ratings count
  - Top genres by movie count
• Log in to
  - See My Ratings
  - Rate movies
• Recommendations
  - Movies You May Like

Slide 26

MovieWeb Components
• MovieLens dataset
  - 10M ratings, 10K movies, 70K users
  - http://grouplens.org/datasets/movielens/
• Python web app to browse movies and recommendations
  - Flask, PyMongo
• Spark app computes recommendations
  - MLlib collaborative filtering
• Predicted ratings exposed in the web app
  - New predictions collection

Slide 27

Spark Recommender
• Apache Hadoop (2.3)
  - HDFS & YARN
• Spark (1.0)
  - Executes within YARN
  - Assigned executor resources
• Data
  - From HDFS and MongoDB
  - To MongoDB

Slide 28

MovieWeb Workflow
1. Snapshot the database as BSON
2. Store the BSON in HDFS
3. Read the BSON into the Spark app
4. Train the model from existing ratings (steps 4-6 are sketched below)
5. Create user-movie pairings
6. Predict ratings for all pairings
7. Write predictions to a MongoDB collection
8. The web application exposes the recommendations
9. Repeat the process
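
Steps 4-6 are a standard collaborative-filtering pass with MLlib's ALS. A minimal Java sketch, assuming Spark 1.x MLlib (the rank, iteration, and regularization values are illustrative, and building the two input RDDs from the BSON snapshot is elided):

import scala.Tuple2;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class RecommenderSketch {
    // ratings: existing user ratings; userMovies: all (user, movie) pairs to score
    public static JavaRDD<Rating> predictAll(JavaRDD<Rating> ratings,
                                             JavaRDD<Tuple2<Object, Object>> userMovies) {
        int rank = 10;        // number of latent factors (illustrative)
        int iterations = 10;  // ALS iterations (illustrative)
        double lambda = 0.01; // regularization parameter (illustrative)

        // Train the model from the existing ratings
        MatrixFactorizationModel model =
                ALS.train(JavaRDD.toRDD(ratings), rank, iterations, lambda);

        // Predict a rating for every user/movie pairing
        return model.predict(JavaRDD.toRDD(userMovies)).toJavaRDD();
    }
}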

Slide 29

Execution
$ spark-submit --master local \
    --driver-memory 2G --executor-memory 2G \
    --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
    --class com.mongodb.workshop.SparkExercise \
    ./target/spark-1.0-SNAPSHOT.jar \
    hdfs://localhost:9000 \
    mongodb://127.0.0.1:27017/movielens \
    predictions

Slide 30

Should I use MongoDB or Hadoop?

Slide 31

Business First!
Decide the what and why before the how: start from the use case (the same spectrum shown earlier), then choose the technology.

Slide 32

The right tool for the task, v1.0
• Dataset size
• Data processing complexity
• Continuous improvement

Slide 33

The right tool for the task, v2.0
• Dataset size
• Data processing complexity
• Continuous improvement

Slide 34

Resources / Questions
• MongoDB Connector for Hadoop: http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-hadoop-workshop

Slide 35

MongoDB & Hadoop
Tugdual Grall, Technical Evangelist, [email protected], @tgrall