Slide 1

MongoDB and Hadoop: Driving Business Insights
Sandeep Parikh, Senior Solutions Architect, MongoDB
#mongodb #hadoop

Slide 2

Agenda
•  Introduction
•  Use Cases
•  Components
•  Connector
•  Demo
•  Questions

Slide 3

Introduction

Slide 4

Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
•  Terabyte and Petabyte datasets
•  Data warehousing
•  Advanced analytics

Slide 5

Enterprise IT Stack
[Diagram: the enterprise IT stack. Applications (CRM, ERP, Collaboration, Mobile, BI) sit on top; Data Management is split into Operational (RDBMS) and Analytical (RDBMS, EDW); Infrastructure (OS & Virtualization, Compute, Storage, Network) is at the base; Management & Monitoring and Security & Auditing span all layers.]

Slide 6

Operational vs. Analytical: Enrichment
[Diagram: applications and interactions on the operational side; warehouse and analytics on the analytical side, with enrichment flowing between them.]

Slide 7

Operational: MongoDB
[Diagram: a spectrum of workloads (First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis), with MongoDB covering the operational workloads.]

Slide 8

Analytical: Hadoop
[Diagram: the same workload spectrum, with Hadoop covering the analytical workloads.]

Slide 9

Operational vs. Analytical: Lifecycle
[Diagram: the same workload spectrum, showing data moving between the operational and analytical sides over its lifecycle.]

Slide 10

Use Cases

Slide 11

Commerce
Applications powered by MongoDB:
•  Products & Inventory
•  Recommended products
•  Customer profile
•  Session management
Analysis powered by Hadoop:
•  Elastic pricing
•  Recommendation models
•  Predictive analytics
•  Clickstream history
Data moves between the two via the MongoDB Connector for Hadoop.

Slide 12

Insurance
Applications powered by MongoDB:
•  Customer profiles
•  Insurance policies
•  Session data
•  Call center data
Analysis powered by Hadoop:
•  Customer action analysis
•  Churn analysis
•  Churn prediction
•  Policy rates
Data moves between the two via the MongoDB Connector for Hadoop.

Slide 13

Fraud Detection
[Diagram: online payments processing writes payments to MongoDB; the MongoDB Connector for Hadoop feeds nightly analysis and fraud modeling in Hadoop, alongside 3rd-party data sources; results flow back into a results cache in MongoDB, which the fraud detection service accesses as query only.]

Slide 14

Components

Slide 15

Overview
•  HDFS
•  YARN
•  MapReduce
•  Pig
•  Hive
•  Spark

Slide 16

HDFS and YARN
•  Hadoop Distributed File System (HDFS)
–  Distributed file system that stores data on commodity machines in a Hadoop cluster
•  YARN
–  Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster
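Not from the deck: a minimal sketch of writing and checking a file in HDFS through the standard Hadoop FileSystem Java API (the path is illustrative; fs.defaultFS, e.g. hdfs://namenode:9000, is normally picked up from core-site.xml on the classpath).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop configuration on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across the cluster
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from hdfs");
        }
        System.out.println("exists: " + fs.exists(path));
    }
}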

Slide 17

MapReduce
•  Parallel, distributed computation across a Hadoop cluster
•  Process and/or generate large datasets
•  Simple programming model for individual tasks:
–  Map(k1, v1) → list(k2, v2)
–  Reduce(k2, list(v2)) → list(v3)
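A minimal word-count illustration of this model using the standard Hadoop MapReduce API, not from the deck (class names are illustrative; the deck's own genre-count Mapper and Reducer appear on slides 25 and 26, and a driver sketch follows slide 26).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Map(k1, v1) -> list(k2, v2): emit (word, 1) for each token in the line
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) ctx.write(new Text(word), ONE);
        }
    }
}

class TokenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        // Reduce(k2, list(v2)) -> list(v3): sum the per-word counts
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(word, new IntWritable(sum));
    }
}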

Slide 18

Pig
•  High-level platform for creating MapReduce
•  Pig Latin abstracts Java into easier-to-use notation
•  Executed as a series of MapReduce applications
•  Supports user-defined functions (UDFs)

Slide 19

Hive
•  Data warehouse infrastructure built on top of Hadoop
•  Provides data summarization, query, and analysis
•  HiveQL is a subset of SQL
•  Support for user-defined functions (UDFs)

Slide 20

Spark
Spark is a fast and powerful engine for processing Hadoop data. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
•  Powerful built-in transformations and actions
–  map, reduceByKey, union, distinct, sample, intersection, and more
–  foreach, count, collect, take, and many more
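A small illustration of those transformations and actions using the Spark 1.x Java API with Java 8 lambdas, not from the deck (class name and input data are illustrative).

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GenreCounts {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("genre-counts").setMaster("local"));

        JavaRDD<String> genres = sc.parallelize(Arrays.asList(
                "Animation", "Children", "Action", "Children", "Action"));

        // Transformations are lazy: mapToPair tags each genre with a 1,
        // reduceByKey sums the tags per genre
        JavaPairRDD<String, Integer> counts = genres
                .mapToPair(g -> new Tuple2<>(g, 1))
                .reduceByKey((a, b) -> a + b);

        // Actions trigger execution: count and collect pull results to the driver
        System.out.println("distinct genres: " + counts.count());
        for (Tuple2<String, Integer> t : counts.collect()) {
            System.out.println(t._1() + ": " + t._2());
        }
        sc.stop();
    }
}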

Slide 21

MongoDB Connector for Hadoop

Slide 22

Connector Overview
Data:
•  Read/Write MongoDB
•  Read/Write BSON
Tools:
•  MapReduce
•  Pig
•  Hive
•  Spark
Platforms:
•  Apache Hadoop
•  Cloudera CDH
•  Hortonworks HDP
•  Amazon EMR

Slide 23

Features and Functionality
•  MongoDB and BSON
–  Input and Output formats
•  Computes splits to read data
•  Support for
–  Filtering data with MongoDB queries
–  Authentication
–  Reading directly from shard primaries
–  ReadPreferences and replica set tags
–  Appending to existing collections

Slide 24

MapReduce Configuration
•  MongoDB input
–  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
–  mongo.input.uri = mongodb://mydb:27017/db1.collection1
•  MongoDB output
–  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
–  mongo.output.uri = mongodb://mydb:27017/db1.collection2
•  BSON input/output
–  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
–  mapred.input.dir = hdfs:///tmp/database.bson
–  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
–  mapred.output.dir = hdfs:///tmp/output.bson

Slide 25

Mapper Example

Input documents:
{ _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] }
{ _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] }
{ _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class Map extends Mapper<Object, BSONObject, Text, IntWritable> {
    @Override
    public void map(Object key, BSONObject doc, Context context)
            throws IOException, InterruptedException {
        // Emit (genre, 1) for each genre listed on the movie document
        List<String> genres = (List<String>) doc.get("genres");
        for (String genre : genres) {
            context.write(new Text(genre), new IntWritable(1));
        }
    }
}

Slide 26

Reducer Example

Output documents:
{ _id: ObjectId(…), genre: "Action", count: 1370 }
{ _id: ObjectId(…), genre: "Adventure", count: 957 }
{ _id: ObjectId(…), genre: "Animation", count: 258 }

import java.io.IOException;
import com.mongodb.BasicDBObjectBuilder;
import com.mongodb.DBObject;
import com.mongodb.hadoop.io.BSONWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the per-genre counts emitted by the mapper
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Build the output document: { genre: ..., count: ... }
        DBObject object = BasicDBObjectBuilder.start()
                .add("genre", key.toString())
                .add("count", sum)
                .get();
        context.write(NullWritable.get(), new BSONWritable(object));
    }
}
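Not from the deck: a minimal driver sketch showing how the slide 24 properties tie the Map and Reduce classes above into a job (job and class names are illustrative; setInputFormatClass/setOutputFormatClass are the programmatic counterpart of the mongo.job.*.format properties, and mongo.input.query is the connector's key for filtering input documents).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class GenreCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");
        // Optional: filter the input collection server-side
        conf.set("mongo.input.query", "{\"genres\": {\"$exists\": true}}");

        Job job = Job.getInstance(conf, "genre-count");
        job.setJarByClass(GenreCountJob.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}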

Slide 27

Pig – Mappings
Read:
–  BSONLoader and MongoLoader
   data = LOAD 'mongodb://mydb:27017/db.collection'
          USING com.mongodb.hadoop.pig.MongoLoader;
–  Map schema, _id, datatypes
Insert:
–  BSONStorage and MongoInsertStorage
   STORE records INTO 'hdfs:///output.bson'
         USING com.mongodb.hadoop.pig.BSONStorage;
–  Map output id, schema
Update:
–  MongoUpdateStorage
–  Specify query, update operations, schema, update options

Slide 28

Pig Specifics
•  Fixed or dynamic schema with Loader
•  Types auto-mapped
–  Embedded documents → Map
–  Arrays → Tuple
•  Supply an alias for "_id"
–  not a legal Pig variable name

Slide 29

Hive – Tables

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

•  Access collections as Hive tables
•  Use with MongoStorageHandler or BSONStorageHandler

Slide 30

Hive Particulars
•  Queries are not (currently) pushed down to MongoDB
•  WHERE predicates are evaluated after reading data from MongoDB
•  Types auto-mapped
–  Embedded documents (mixed types) → STRUCT
–  Embedded documents (single type) → MAP
–  Arrays → ARRAY
–  ObjectId → STRUCT
•  Use EXTERNAL when creating tables; otherwise, dropping the Hive table drops the underlying collection

Slide 31

Spark Usage
•  Use with the MapReduce input/output formats
•  Create Configuration objects with input/output formats and data URI
•  Load/save data using the SparkContext Hadoop file or RDD APIs

Slide 32

Spark Input Example

BSON:
Configuration bsonDataConfig = new Configuration();
bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
        "hdfs://namenode:9000/data/test/foo.bson",
        BSONFileInputFormat.class, Object.class,
        BSONObject.class, bsonDataConfig);

MongoDB:
Configuration inputDataConfig = new Configuration();
inputDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo");
JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD(
        inputDataConfig, MongoInputFormat.class, Object.class,
        BSONObject.class);
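The connector also works as an output format. A minimal write-back sketch, not from the deck (the output collection name is an assumption; saveAsNewAPIHadoopFile is the standard Spark pair-RDD API, and its path argument is required by that API but unused by MongoOutputFormat):

Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1/test.output");

// Reusing inputData from above as a stand-in for the job's real output RDD
inputData.saveAsNewAPIHadoopFile(
        "file:///not-used-by-mongo-output-format",
        Object.class, BSONObject.class,
        MongoOutputFormat.class, outputConfig);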

Slide 33

Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
•  Dynamic queries work against the most recent data, but put load on the operational database.
•  BSON snapshots move the processing load to Hadoop, and add only predictable (snapshot-time) load to MongoDB.

Slide 34

Demo

Slide 35

MovieWeb

Slide 36

MovieWeb Components
•  MovieLens dataset
–  10M ratings, 10K movies, 70K users
•  Python web app to browse movies and recommendations
–  Flask, PyMongo
•  Spark app computes recommendations
–  MLlib collaborative filtering
•  Predicted ratings are exposed in the web app
–  New predictions collection

Slide 37

MovieWeb Web Application
•  Browse
–  Top movies by ratings count
–  Top genres by movie count
•  Log in to
–  See My Ratings
–  Rate movies
•  What's missing?
–  Movies You May Like
–  Recommendations

Slide 38

Spark Recommender
•  Apache Hadoop 2.3.0
–  HDFS
•  Spark 1.0
–  Execute locally
–  Assign executor resources
•  Data
–  From HDFS
–  To MongoDB

Slide 39

MovieWeb Workflow
1. Snapshot the database as BSON
2. Store the BSON in HDFS
3. Read the BSON into the Spark app
4. Train a model from existing ratings
5. Create user-movie pairings
6. Predict ratings for all pairings
7. Write predictions to a MongoDB collection
8. Web application exposes recommendations
9. Repeat the process weekly
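Not from the demo code: a condensed sketch of the train/predict/write-back steps above, assuming Spark's MLlib ALS for collaborative filtering. The field names (user_id, movie_id, rating), the predictions URI, the model parameters, and bsonData (a JavaPairRDD loaded from BSON as on slide 32) are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.MongoOutputFormat;
import scala.Tuple2;

// Parse ratings out of the BSON documents read from HDFS
JavaRDD<Rating> ratings = bsonData.values().map(doc -> new Rating(
        ((Number) doc.get("user_id")).intValue(),
        ((Number) doc.get("movie_id")).intValue(),
        ((Number) doc.get("rating")).doubleValue()));

// Train a collaborative-filtering model (rank 10, 10 iterations, lambda 0.01)
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);

// Pair every user with every movie, then predict a rating for each pairing
JavaPairRDD<Object, Object> pairings = ratings.map(r -> (Object) r.user()).distinct()
        .cartesian(ratings.map(r -> (Object) r.product()).distinct());
JavaRDD<Rating> predictions = model.predict(pairings.rdd()).toJavaRDD();

// Convert predictions to BSON documents and write them to the predictions collection
Configuration predictionsConfig = new Configuration();
predictionsConfig.set("mongo.output.uri", "mongodb://127.0.0.1/movielens.predictions");
predictions.mapToPair(p -> {
    BSONObject doc = new BasicBSONObject();
    doc.put("user_id", p.user());
    doc.put("movie_id", p.product());
    doc.put("rating", p.rating());
    return new Tuple2<Object, BSONObject>(NullWritable.get(), doc);
}).saveAsNewAPIHadoopFile("file:///not-used", Object.class, BSONObject.class,
        MongoOutputFormat.class, predictionsConfig);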

Slide 40

Execution

$ export SPARK_HOME=~/spark-1.0.0-bin-hadoop2
$ bin/spark-submit --master local \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \
    --driver-memory 2G --executor-memory 1G \
    demo-1.0.jar [insert job args here]

Slide 41

Questions?
•  MongoDB Connector for Hadoop
–  http://github.com/mongodb/mongo-hadoop
•  Getting Started with MongoDB and Hadoop
–  http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
•  MongoDB-Spark Demo
–  http://github.com/crcsmnky/mongodb-spark-demo