
MongoDB and Hadoop, Austin MUG


Sandeep Parikh

August 25, 2014

Transcript

  1. Hadoop
     The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
     •  Terabyte and petabyte datasets
     •  Data warehousing
     •  Advanced analytics
  2. Enterprise IT Stack
     Layered diagram: Applications (CRM, ERP, Collaboration, Mobile, BI); Data Management, split into Operational and Analytical (RDBMS, EDW); Infrastructure (OS & Virtualization, Compute, Storage, Network); Management & Monitoring and Security & Auditing span the stack.
  3. Operational: MongoDB
     Workload map: First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis
  4. Analytical: Hadoop
     Same workload map, viewed from the analytical side.
  5. Operational vs. Analytical: Lifecycle
     Same workload map, showing how workloads span the operational-to-analytical lifecycle.
  6. Commerce
     Applications powered by MongoDB:
     •  Products & Inventory
     •  Recommended products
     •  Customer profile
     •  Session management
     Analysis powered by Hadoop:
     •  Elastic pricing
     •  Recommendation models
     •  Predictive analytics
     •  Clickstream history
     Linked by the MongoDB Connector for Hadoop
  7. Insurance
     Applications powered by MongoDB:
     •  Customer profiles
     •  Insurance policies
     •  Session data
     •  Call center data
     Analysis powered by Hadoop:
     •  Customer action analysis
     •  Churn analysis
     •  Churn prediction
     •  Policy rates
     Linked by the MongoDB Connector for Hadoop
  8. Fraud Detection
     Architecture diagram: online payments processing, a Payments store, 3rd-party data sources, nightly analysis for fraud modeling via the MongoDB Connector for Hadoop, a Results Cache, and a Fraud Detection service (query only).
  9. HDFS and YARN
     •  Hadoop Distributed File System
        –  Distributed file system that stores data on commodity machines in a Hadoop cluster
     •  YARN
        –  Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster
  10. MapReduce
     •  Parallel, distributed computation across a Hadoop cluster
     •  Process and/or generate large datasets
     •  Simplistic model for individual tasks:
        Map(k1, v1) → list(k2, v2)
        Reduce(k2, list(v2)) → list(v3)
  11. Pig
     •  High-level platform for creating MapReduce programs
     •  Pig Latin abstracts Java into easier-to-use notation
     •  Executed as a series of MapReduce applications
     •  Supports user-defined functions (UDFs)
  12. Hive
     •  Data warehouse infrastructure built on top of Hadoop
     •  Provides data summarization, query, and analysis
     •  HiveQL is a subset of SQL
     •  Support for user-defined functions (UDFs)
  13. Spark
     Spark is a fast and powerful engine for processing Hadoop data. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
     •  Powerful built-in transformations and actions
        –  map, reduceByKey, union, distinct, sample, intersection, and more
        –  foreach, count, collect, take, and many more
  14. Connector Overview
     •  Data: MongoDB (read/write), BSON (read/write)
     •  Tools: MapReduce, Pig, Hive, Spark
     •  Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
  15. Features and Functionality
     •  MongoDB and BSON
        –  Input and output formats
     •  Computes splits to read data
     •  Support for
        –  Filtering data with MongoDB queries (see the sketch below)
        –  Authentication
        –  Reading directly from shard primaries
        –  ReadPreferences and replica set tags
        –  Appending to existing collections
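     A minimal sketch of the query-filtering feature (not from the deck): it assumes the connector's mongo.input.query property; the URI and query values are illustrative.

     import org.apache.hadoop.conf.Configuration;

     // Sketch: restrict the connector's input to documents matching a MongoDB query.
     // The URI and query here are illustrative, not from the deck.
     Configuration conf = new Configuration();
     conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
     conf.set("mongo.input.query", "{\"genres\": \"Action\"}");   // only Action movies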
  16. MapReduce Configuration
     •  MongoDB input
        –  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
        –  mongo.input.uri = mongodb://mydb:27017/db1.collection1
     •  MongoDB output
        –  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
        –  mongo.output.uri = mongodb://mydb:27017/db1.collection2
     •  BSON input/output
        –  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
        –  mapred.input.dir = hdfs:///tmp/database.bson
        –  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
        –  mapred.output.dir = hdfs:///tmp/output.bson
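     For context, a minimal job-driver sketch (not from the deck) showing one way these properties might be wired up in Java; it assumes the Map and Reduce classes from the next two slides, and the job name is hypothetical.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.NullWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import com.mongodb.hadoop.MongoInputFormat;
     import com.mongodb.hadoop.MongoOutputFormat;
     import com.mongodb.hadoop.io.BSONWritable;

     public class GenreCountDriver {
         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             // Read from one collection, write the aggregated results to another
             conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
             conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

             Job job = Job.getInstance(conf, "genre-count");   // job name is hypothetical
             job.setJarByClass(GenreCountDriver.class);
             job.setMapperClass(Map.class);                    // mapper from slide 17
             job.setReducerClass(Reduce.class);                // reducer from slide 18
             job.setMapOutputKeyClass(Text.class);
             job.setMapOutputValueClass(IntWritable.class);
             job.setOutputKeyClass(NullWritable.class);
             job.setOutputValueClass(BSONWritable.class);
             job.setInputFormatClass(MongoInputFormat.class);
             job.setOutputFormatClass(MongoOutputFormat.class);
             System.exit(job.waitForCompletion(true) ? 0 : 1);
         }
     }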
  17. Mapper Example
     import java.io.IOException;
     import java.util.List;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.bson.BSONObject;

     public class Map extends Mapper<Object, BSONObject, Text, IntWritable> {
         public void map(Object key, BSONObject doc, Context context)
                 throws IOException, InterruptedException {
             // Emit (genre, 1) for every genre listed in the movie document
             List<String> genres = (List<String>) doc.get("genres");
             for (String genre : genres) {
                 context.write(new Text(genre), new IntWritable(1));
             }
         }
     }

     Input documents:
     { _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] }
     { _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] }
     { _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }
  18. Reducer Example
     Output documents:
     { _id: ObjectId(…), genre: "Action", count: 1370 }
     { _id: ObjectId(…), genre: "Adventure", count: 957 }
     { _id: ObjectId(…), genre: "Animation", count: 258 }

     import java.io.IOException;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.NullWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Reducer;
     import com.mongodb.BasicDBObjectBuilder;
     import com.mongodb.DBObject;
     import com.mongodb.hadoop.io.BSONWritable;

     public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> {
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
             // Sum the per-genre counts emitted by the mapper
             int sum = 0;
             for (IntWritable value : values) {
                 sum += value.get();
             }
             DBObject object = BasicDBObjectBuilder.start()
                     .add("genre", key.toString())
                     .add("count", sum)
                     .get();
             BSONWritable doc = new BSONWritable(object);
             context.write(NullWritable.get(), doc);
         }
     }
  19. Pig – Mappings
     Read:
     –  BSONLoader and MongoLoader
           data = LOAD 'mongodb://mydb:27017/db.collection'
                  using com.mongodb.hadoop.pig.MongoLoader
     –  Map schema, _id, datatypes
     Insert:
     –  BSONStorage and MongoInsertStorage
           STORE records INTO 'hdfs:///output.bson'
                 using com.mongodb.hadoop.pig.BSONStorage
     –  Map output id, schema
     Update:
     –  MongoUpdateStorage
     –  Specify query, update operations, schema, update options
  20. Pig Specifics
     •  Fixed or dynamic schema with Loader
     •  Types auto-mapped
        –  Embedded documents → Map
        –  Arrays → Tuple
     •  Supply an alias for "_id" – not a legal Pig variable name
  21. Hive – Tables
     CREATE TABLE mongo_users (id int, name string, age int)
     STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
     WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
     TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
     •  Access collections as Hive tables
     •  Use with MongoStorageHandler or BSONStorageHandler
  22. Hive Particulars
     •  Queries are not (currently) pushed down to MongoDB
     •  WHERE predicates are evaluated after reading data from MongoDB
     •  Types auto-mapped
        –  Embedded documents (mixed types) → STRUCT
        –  Embedded documents (single type) → MAP
        –  Arrays → ARRAY
        –  ObjectId → STRUCT
     •  Use EXTERNAL when creating tables; otherwise dropping the Hive table drops the underlying collection
  23. Spark Usage
     •  Use with MapReduce input/output formats
     •  Create Configuration objects with input/output formats and data URI
     •  Load/save data using SparkContext Hadoop file or RDD APIs
  24. Spark Input Example
     BSON:
     Configuration bsonDataConfig = new Configuration();
     bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
     JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
             "hdfs://namenode:9000/data/test/foo.bson",
             BSONFileInputFormat.class, Object.class,
             BSONObject.class, bsonDataConfig);

     MongoDB:
     Configuration inputDataConfig = new Configuration();
     inputDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
     inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo");
     JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD(
             inputDataConfig, MongoInputFormat.class, Object.class,
             BSONObject.class);
  25. Data Movement
     Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
     •  Dynamic queries work with the most recent data but put load on the operational database
     •  Snapshots move load to Hadoop and add predictable load to MongoDB
  26. MovieWeb Components
     •  MovieLens dataset
        –  10M ratings, 10K movies, 70K users
     •  Python web app to browse movies and recommendations
        –  Flask, PyMongo
     •  Spark app computes recommendations
        –  MLlib collaborative filter
     •  Predicted ratings are exposed in the web app
        –  New predictions collection
  27. MovieWeb Web Application
     •  Browse
        –  Top movies by ratings count
        –  Top genres by movie count
     •  Log in to
        –  See My Ratings
        –  Rate movies
     •  What's missing?
        –  Movies You May Like
        –  Recommendations
  28. Spark Recommender
     •  Apache Hadoop 2.3.0
        –  HDFS
     •  Spark 1.0
        –  Execute locally
        –  Assign executor resources
     •  Data
        –  From HDFS
        –  To MongoDB
  29. MovieWeb Workflow
     •  Snapshot database as BSON
     •  Store BSON in HDFS
     •  Read BSON into Spark app
     •  Train model from existing ratings
     •  Create user-movie pairings
     •  Predict ratings for all pairings
     •  Write predictions to MongoDB collection
     •  Web application exposes recommendations
     •  Repeat the process weekly
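     To make the train/predict steps concrete, a minimal, self-contained sketch (not the demo's actual Recommender class): it trains MLlib ALS on a few in-memory ratings and scores one user-movie pairing. In the demo, the ratings RDD would instead come from the BSON snapshot in HDFS and predictions would be written back to MongoDB; the hyperparameters and values below are illustrative assumptions.

     import java.util.Arrays;
     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaRDD;
     import org.apache.spark.api.java.JavaSparkContext;
     import org.apache.spark.mllib.recommendation.ALS;
     import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
     import org.apache.spark.mllib.recommendation.Rating;

     public class RecommenderSketch {
         public static void main(String[] args) {
             SparkConf conf = new SparkConf().setMaster("local").setAppName("recommender-sketch");
             JavaSparkContext sc = new JavaSparkContext(conf);

             // (userId, movieId, rating) triples; illustrative values only
             JavaRDD<Rating> ratings = sc.parallelize(Arrays.asList(
                     new Rating(1, 100, 5.0), new Rating(1, 101, 3.0),
                     new Rating(2, 100, 4.0), new Rating(2, 102, 1.0)));

             // rank = 10 latent factors, 10 iterations (illustrative hyperparameters)
             MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10);

             // Predict the rating user 2 would give movie 101
             double predicted = model.predict(2, 101);
             System.out.println("predicted rating: " + predicted);

             sc.stop();
         }
     }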
  30. Execution
     $ export SPARK_HOME=~/spark-1.0.0-bin-hadoop2
     $ bin/spark-submit --master local \
         --class com.mongodb.hadoop.demo.Recommender \
         --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \
         --driver-memory 2G --executor-memory 1G \
         demo-1.0.jar [insert job args here]
  31. Questions?
     •  MongoDB Connector for Hadoop
        –  http://github.com/mongodb/mongo-hadoop
     •  Getting Started with MongoDB and Hadoop
        –  http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
     •  MongoDB-Spark Demo
        –  http://github.com/crcsmnky/mongodb-spark-demo