MongoDB and Hadoop, Austin MUG

MongoDB and Hadoop: Driving Business Insights
Sandeep Parikh, Senior Solutions Architect, MongoDB
#mongodb #hadoop

Agenda
• Introduction
• Use Cases
• Components
• Connector
• Demo
• Questions

Introduction

Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics

Enterprise IT Stack
[Diagram: layered stack. Applications: CRM, ERP, Collaboration, Mobile, BI. Data Management: Operational and Analytical (RDBMS, EDW). Infrastructure: OS & Virtualization, Compute, Storage, Network. Alongside the layers: Management & Monitoring, Security & Auditing.]

Operational vs. Analytical: Enrichment
[Diagram labels: Operational (Applications, Interactions); Analytical (Warehouse, Analytics)]

Operational: MongoDB
[Diagram labels: First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis]

Analytical: Hadoop
[Diagram labels: First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis]

Operational vs. Analytical: Lifecycle
[Diagram labels: First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis]

Use Cases

Commerce
Applications powered by MongoDB:
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Linked by the MongoDB Connector for Hadoop

Insurance
Applications powered by MongoDB:
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop:
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
Linked by the MongoDB Connector for Hadoop

Fraud Detection
[Diagram labels: Online payments processing, Payments, 3rd Party Data Sources, MongoDB Connector for Hadoop, Fraud modeling, Nightly Analysis, Results Cache, Fraud Detection (query only)]

Components

Overview
HDFS, YARN, MapReduce, Pig, Hive, Spark

HDFS and YARN
• Hadoop Distributed File System (HDFS)
  – Distributed file system that stores data on commodity machines in a Hadoop cluster
• YARN
  – Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster

MapReduce
• Parallel, distributed computation across a Hadoop cluster
• Process and/or generate large datasets
• Simple model for individual tasks
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
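The Map and Reduce signatures on this slide can be sketched in plain Java, without Hadoop. This is only a single-process illustration under stated assumptions: the class name MiniMapReduce, the run and wordCount helpers, and the sample input lines are hypothetical and not part of Hadoop or the connector. It shows the three steps the model implies: map each record to (k2, v2) pairs, group the pairs by k2, then reduce each group.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Local, single-process illustration of the MapReduce model; no Hadoop involved.
class MiniMapReduce {

    // Runs the mapper over every input record, groups the emitted pairs by key
    // (the "shuffle"), then runs the reducer once per key.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, List<V3>> reducer) {

        // Map phase plus shuffle: Map(k1, v1) -> list(k2, v2), grouped by k2.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input.entrySet()) {
            for (Map.Entry<K2, V2> pair : mapper.apply(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce phase: Reduce(k2, list(v2)) -> list(v3).
        Map<K2, List<V3>> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count: the mapper emits (word, 1); the reducer sums the ones.
    static Map<String, List<Integer>> wordCount(Map<Integer, String> lines) {
        return MiniMapReduce.<Integer, String, String, Integer, Integer>run(
                lines,
                (lineNo, text) -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : text.toLowerCase().split("\\s+")) {
                        if (!word.isEmpty()) {
                            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                        }
                    }
                    return pairs;
                },
                (word, ones) -> {
                    int sum = 0;
                    for (int one : ones) {
                        sum += one;
                    }
                    return Collections.singletonList(sum);
                });
    }

    public static void main(String[] args) {
        // Hypothetical input: line number -> line text.
        Map<Integer, String> lines = new LinkedHashMap<>();
        lines.put(1, "mongodb and hadoop");
        lines.put(2, "hadoop and spark");
        System.out.println(wordCount(lines)); // {and=[2], hadoop=[2], mongodb=[1], spark=[1]}
    }
}
```

On a real cluster the grouping step is the distributed shuffle between mappers and reducers; here a sorted in-memory map stands in for it.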

Pig
• High-level platform for creating MapReduce programs
• Pig Latin abstracts Java into easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)

Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is a SQL-like query language (roughly a subset of SQL, with Hive-specific extensions)
• Support for user-defined functions (UDFs)

Spark
Spark is a fast and powerful engine for processing Hadoop data. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
  – Transformations: map, reduceByKey, union, distinct, sample, intersection, and more
  – Actions: foreach, count, collect, take, and many more

MongoDB Connector for Hadoop

Connector Overview
• Data: Read/Write MongoDB, Read/Write BSON
• Tools: MapReduce, Pig, Hive, Spark
• Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR

Features and Functionality
• MongoDB and BSON
  – Input and Output formats
• Computes splits to read data
• Support for
  – Filtering data with MongoDB queries
  – Authentication
  – Reading directly from shard Primaries
  – ReadPreferences and Replica Set tags
  – Appending to existing collections

MapReduce Configuration
• MongoDB input
  – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
  – mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
  – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
  – mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
  – mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
  – mapred.input.dir = hdfs:///tmp/database.bson
  – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
  – mapred.output.dir = hdfs:///tmp/output.bson

Mapper Example
Input documents:
    { _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] }
    { _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] }
    { _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }
Mapper:
    public class Map extends Mapper<Object, BSONObject, Text, IntWritable> {
        @Override
        public void map(Object key, BSONObject doc, Context context)
                throws IOException, InterruptedException {
            // Emit (genre, 1) for every genre listed on the movie document.
            List<String> genres = (List<String>) doc.get("genres");
            for (String genre : genres) {
                context.write(new Text(genre), new IntWritable(1));
            }
        }
    }

Reducer Example
Output documents:
    { _id: ObjectId(…), genre: "Action", count: 1370 }
    { _id: ObjectId(…), genre: "Adventure", count: 957 }
    { _id: ObjectId(…), genre: "Animation", count: 258 }
Reducer:
    public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s emitted by the mapper for this genre.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Build the output document { genre: ..., count: ... }.
            DBObject object = BasicDBObjectBuilder.start()
                .add("genre", key.toString())
                .add("count", sum)
                .get();
            BSONWritable doc = new BSONWritable(object);
            context.write(NullWritable.get(), doc);
        }
    }

Pig – Mappings
Read:
– BSONLoader and MongoLoader
    data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;
– Map schema, _id, datatypes
Insert:
– BSONStorage and MongoInsertStorage
    STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;
– Map output id, schema
Update:
– MongoUpdateStorage
– Specify query, update operations, schema, update options

Pig Specifics
• Fixed or dynamic schema with Loader
• Types auto-mapped
  – Embedded documents → Map
  – Arrays → Tuple
• Supply alias for "_id"
  – "_id" is not a legal Pig variable name

Hive – Tables
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
    TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler

Hive Particulars
• Queries are not (currently) pushed down to MongoDB
• WHERE predicates are evaluated after reading data from MongoDB
• Types auto-mapped
  – Embedded documents (mixed types) → STRUCT
  – Embedded documents (single type) → MAP
  – Arrays → ARRAY
  – ObjectId → STRUCT
• Use EXTERNAL when creating tables; otherwise dropping the Hive table drops the underlying collection

Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file or RDD APIs

Spark Input Example
BSON:
    Configuration bsonDataConfig = new Configuration();
    bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
    JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
        "hdfs://namenode:9000/data/test/foo.bson",
        BSONFileInputFormat.class, Object.class, BSONObject.class, bsonDataConfig);
MongoDB:
    Configuration inputDataConfig = new Configuration();
    inputDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo");
    JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD(
        inputDataConfig, MongoInputFormat.class, Object.class, BSONObject.class);

Data Movement: dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries to MongoDB
  – Work with the most recent data
  – Put load on the operational database
• BSON snapshots in HDFS
  – Move load to Hadoop
  – Add predictable load to MongoDB

MovieWeb

MovieWeb Components
• MovieLens dataset
  – 10M ratings, 10K movies, 70K users
• Python web app to browse movies, recommendations
  – Flask, PyMongo
• Spark app computes recommendations
  – MLlib collaborative filtering
• Predicted ratings are exposed in web app
  – New predictions collection

MovieWeb Web Application
• Browse
  – Top movies by ratings count
  – Top genres by movie count
• Log in to
  – See My Ratings
  – Rate movies
• What's missing?
  – Movies You May Like
  – Recommendations

Spark Recommender
• Apache Hadoop 2.3.0
  – HDFS
• Spark 1.0
  – Execute locally
  – Assign executor resources
• Data
  – From HDFS
  – To MongoDB

MovieWeb Workflow
1. Snapshot database as BSON
2. Store BSON in HDFS
3. Read BSON into Spark app
4. Train model from existing ratings
5. Create user-movie pairings
6. Predict ratings for all pairings
7. Write predictions to MongoDB collection
8. Web application exposes recommendations
9. Repeat the process weekly
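The "create user-movie pairings" step in this workflow is, in effect, a cross product of users and movies. The slides do not show how the demo builds it, so the following is only a plain-Java sketch under stated assumptions: the class PairingSketch, its methods, and the sample ids are hypothetical, and no Spark is involved. It also makes the size of "all pairings" concrete for the approximate dataset sizes quoted in the deck (70K users, 10K movies).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of building user-movie pairings; not the demo's code.
class PairingSketch {

    // Number of (user, movie) pairs a full cross product produces.
    static long pairingCount(long users, long movies) {
        return users * movies;
    }

    // Every (userId, movieId) combination, each as a two-element array.
    static List<int[]> allPairings(List<Integer> userIds, List<Integer> movieIds) {
        List<int[]> pairs = new ArrayList<>();
        for (int userId : userIds) {
            for (int movieId : movieIds) {
                pairs.add(new int[] {userId, movieId});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical ids, for illustration only.
        List<int[]> pairs = allPairings(Arrays.asList(1, 2), Arrays.asList(10, 20, 30));
        System.out.println(pairs.size()); // 6

        // Approximate sizes from the deck: 70K users, 10K movies.
        System.out.println(pairingCount(70_000L, 10_000L)); // 700000000
    }
}
```

At roughly 700 million pairs, the full cross product is far larger than the 10M existing ratings, which is one reason to run this step as a batch job on the Hadoop side and write back only the predictions.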

Execution
    $ export SPARK_HOME=~/spark-1.0.0-bin-hadoop2
    $ bin/spark-submit --master local \
        --class com.mongodb.hadoop.demo.Recommender \
        --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \
        --driver-memory 2G --executor-memory 1G \
        demo-1.0.jar [insert job args here]

Questions?
• MongoDB Connector for Hadoop
  – http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop
  – http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo
  – http://github.com/crcsmnky/mongodb-spark-demo
