MongoDB and Hadoop

Why and how do MongoDB and Hadoop work together?

This presentation explains.

This presentation was delivered during MongoDB Day Paris 2014

Tugdual Grall

October 28, 2014

Transcript

  1. Paris
    Tugdual Grall
    Technical Evangelist
    [email protected]
    @tgrall

  2. MongoDB & Hadoop
    Tugdual Grall
    Technical Evangelist
    [email protected]
    @tgrall

  3. Agenda
    Evolving Data Landscape
    MongoDB & Hadoop Use Cases
    MongoDB Connector Features
    Demo

  4. Evolving Data Landscape

  5. Hadoop
    • Terabyte and Petabyte datasets
    • Data warehousing
    • Advanced analytics
    "The Apache Hadoop software library is a framework that allows for the
    distributed processing of large data sets across clusters of computers
    using simple programming models."
    http://hadoop.apache.org

  6. Enterprise IT Stack

  7. Operational vs. Analytical: Enrichment
    Warehouse, Analytics
    Applications, Interactions

  8. Operational: MongoDB
    First-Level Analytics, Internet of Things, Social, Mobile Apps,
    Product/Asset Catalog, Security & Fraud, Single View,
    Customer Data Management

  9. Analytical: Hadoop
    Churn Analysis, Risk Modeling, Sentiment Analysis, Trade Surveillance,
    Recommender, Warehouse & ETL, Ad Targeting, Predictive Analytics

  10. Operational & Analytical: Lifecycle
    The same use cases span both sides: operational workloads feed the
    analytical ones, and analytical results flow back into the applications.

  11. MongoDB & Hadoop Use Cases

  12. Commerce
    Applications powered by MongoDB:
    Products & Inventory, Recommended Products, Customer Profile,
    Session Management
    Analysis powered by Hadoop:
    Elastic Pricing, Recommendation Models, Predictive Analytics,
    Clickstream History
    Connected by the MongoDB Connector for Hadoop

  13. Insurance
    Applications powered by MongoDB:
    Customer Profiles, Insurance Policies, Session Data, Call Center Data
    Analysis powered by Hadoop:
    Customer Action Analysis, Churn Analysis, Churn Prediction, Policy Rates
    Connected by the MongoDB Connector for Hadoop

  14. Fraud Detection
    MongoDB Connector for Hadoop
    Payments
    Nightly Analysis
    3rd Party Data Sources
    Results Cache
    Fraud Detection (Query Only)

  15. MongoDB Connector for Hadoop

  16. Connector Overview
    DATA
    • Read/Write MongoDB
    • Read/Write BSON

    TOOLS
    • MapReduce
    • Pig
    • Hive
    • Spark

    PLATFORMS
    • Apache Hadoop
    • Cloudera CDH
    • Hortonworks HDP
    • MapR
    • Amazon EMR


  17. Connector Features and Functionality
    • Computes splits to read data
    • Single Node, Replica Sets, Sharded Clusters
    • Mappings for Pig and Hive
    • MongoDB as a standard data source/destination
    • Support for
    • Filtering data with MongoDB queries
    • Authentication
    • Reading from Replica Set tags
    • Appending to existing collections
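The "filtering data with MongoDB queries" bullet means the filter is handed to the connector through its configuration as a JSON string under the `mongo.input.query` key, so MongoDB itself discards non-matching documents before they reach the Hadoop job. A minimal sketch; the URI and the filter are illustrative, not taken from the demo:

```python
import json

def connector_conf(input_uri, query=None):
    """Build the Hadoop Configuration entries read by the connector."""
    conf = {"mongo.input.uri": input_uri}
    if query is not None:
        # The query is pushed down to MongoDB as a JSON string, so only
        # matching documents are sent to the Hadoop job.
        conf["mongo.input.query"] = json.dumps(query)
    return conf

conf = connector_conf(
    "mongodb://mydb:27017/db1.collection1",
    query={"rating": {"$gte": 4}},  # hypothetical filter
)
```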

  18. MapReduce Configuration
    • MongoDB input/output
    mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
    mongo.input.uri = mongodb://mydb:27017/db1.collection1
    mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
    mongo.output.uri = mongodb://mydb:27017/db1.collection2
    • BSON input/output
    mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
    mapred.input.dir = hdfs:///tmp/database.bson
    mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
    mapred.output.dir = hdfs:///tmp/output.bson

  19. Pig Mappings
    • Input: BSONLoader and MongoLoader
    data = LOAD 'mongodb://mydb:27017/db.collection'
           USING com.mongodb.hadoop.pig.MongoLoader;
    • Output: BSONStorage and MongoInsertStorage
    STORE records INTO 'hdfs:///output.bson'
          USING com.mongodb.hadoop.pig.BSONStorage;

  20. Hive Support
    • Access collections as Hive tables
    • Use with MongoStorageHandler or BSONStorageHandler
    CREATE TABLE mongo_users (id int, name string, age int)
    STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
    WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
    TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");

  21. Spark
    • Use with the MapReduce input/output formats
    • Create Configuration objects with the input/output formats and the data URI
    • Load/save data using the SparkContext Hadoop file API
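The steps above can be sketched in PySpark. This is a non-authoritative outline that assumes pyspark is installed and the mongo-hadoop and Java driver jars are on the classpath; the collection names are made up. The functions are defined but only the configuration is built here, since actually loading requires a running cluster:

```python
def mongo_conf(input_uri, output_uri):
    # The same configuration keys used for MapReduce jobs drive the
    # Spark load/save paths.
    return {"mongo.input.uri": input_uri, "mongo.output.uri": output_uri}

def load_collection(sc, conf):
    # SparkContext's Hadoop file API bridges the connector's InputFormat
    # into an RDD of (_id, document) pairs.
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=conf,
    )

def save_collection(rdd, conf):
    # The Mongo OutputFormat takes its target from mongo.output.uri;
    # the path argument is not used for the MongoDB destination.
    rdd.saveAsNewAPIHadoopFile(
        path="file:///unused",
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        conf=conf,
    )

conf = mongo_conf(
    "mongodb://127.0.0.1:27017/movielens.ratings",
    "mongodb://127.0.0.1:27017/movielens.predictions",
)
```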

  22. Data Movement
    Dynamic queries to MongoDB vs. BSON snapshots in HDFS
    Dynamic queries:
    - Work with the most recent data
    - Put load on the operational database
    BSON snapshots:
    - Move the processing load to Hadoop
    - Add predictable load to MongoDB

  23. Demo: Recommendation Platform

  24. Movie Web

  25. MovieWeb Web Application
    • Browse
    - Top movies by ratings count
    - Top genres by movie count
    • Log in to
    - See My Ratings
    - Rate movies
    • Recommendations
    - Movies You May Like
    - Recommendations

  26. MovieWeb Components
    • MovieLens dataset
    – 10M ratings, 10K movies, 70K users
    – http://grouplens.org/datasets/movielens/
    • Python web app to browse movies, recommendations
    – Flask, PyMongo
    • Spark app computes recommendations
    – MLlib collaborative filtering
    • Predicted ratings are exposed in web app
    – New predictions collection
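As an illustration of the browse queries, the "top movies by ratings count" page could be served by a MongoDB aggregation like the one below. The pipeline shape is standard aggregation syntax, but the collection and field names (`ratings`, `movie_id`) are assumptions, not taken from the demo code:

```python
# Aggregation pipeline for "top movies by ratings count"
# (field names such as movie_id are illustrative).
top_movies_pipeline = [
    {"$group": {"_id": "$movie_id", "num_ratings": {"$sum": 1}}},
    {"$sort": {"num_ratings": -1}},
    {"$limit": 10},
]

# With PyMongo this would run as:
#   db.ratings.aggregate(top_movies_pipeline)
```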

  27. Spark Recommender
    • Apache Hadoop (2.3)
    - HDFS & YARN
    • Spark (1.0)
    - Execute within YARN
    - Assign executor resources
    • Data
    - From HDFS, MongoDB
    - To MongoDB

  28. MovieWeb Workflow
    1. Snapshot the database as BSON
    2. Store the BSON in HDFS
    3. Read the BSON into the Spark app
    4. Train the model from existing ratings
    5. Create user-movie pairings
    6. Predict ratings for all pairings
    7. Write predictions to a MongoDB collection
    8. Web application exposes the recommendations
    9. Repeat the process
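The "create user-movie pairings" step pairs each user with the movies they have not rated yet, so the trained model scores exactly those gaps. A plain-Python sketch of the idea; in the Spark app this would be done on RDDs, and the names here are illustrative:

```python
def unrated_pairs(user_ids, movie_ids, rated):
    """Return (user, movie) pairs not present in the existing ratings.

    rated is a set of (user_id, movie_id) tuples already in the dataset.
    """
    return [
        (u, m)
        for u in user_ids
        for m in movie_ids
        if (u, m) not in rated
    ]

pairs = unrated_pairs([1, 2], [10, 20], rated={(1, 10), (2, 20)})
# pairs == [(1, 20), (2, 10)]
```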

  29. Execution
    $ spark-submit --master local \
        --driver-memory 2G --executor-memory 2G \
        --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
        --class com.mongodb.workshop.SparkExercise \
        ./target/spark-1.0-SNAPSHOT.jar \
        hdfs://localhost:9000 \
        mongodb://127.0.0.1:27017/movielens \
        predictions

  30. Should I use MongoDB or Hadoop?

  31. Business First!
    What/Why: First-Level Analytics, Internet of Things, Social, Mobile Apps,
    Product/Asset Catalog, Security & Fraud, Single View, Customer Data
    Management, Churn Analysis, Risk Modeling, Sentiment Analysis,
    Trade Surveillance, Recommender, Warehouse & ETL, Ad Targeting,
    Predictive Analytics
    How: choose the technology after the use case.

  32. The right tool for the task
    • Dataset size
    • Data processing complexity
    • Continuous improvement
    V1.0

  33. The right tool for the task
    • Dataset size
    • Data processing complexity
    • Continuous improvement
    V2.0

  34. Resources / Questions
    • MongoDB Connector for Hadoop
    - http://github.com/mongodb/mongo-hadoop
    • Getting Started with MongoDB and Hadoop
    - http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
    • MongoDB-Spark Demo
    - https://github.com/crcsmnky/mongodb-hadoop-workshop

  35. MongoDB & Hadoop
    Tugdual Grall
    Technical Evangelist
    [email protected]
    @tgrall
