MongoDB and Hadoop

cj_harris5
March 23, 2012

Presented at the London Hadoop User Group


Transcript

  1. Terminology

    RDBMS            MongoDB
    Table            Collection
    Row(s)           JSON Document
    Index            Index
    Join             Embedding & Linking
    Partition        Shard
    Partition Key    Shard Key

    Friday, 23 March 12
  2. Here is a “simple” SQL Model

        mysql> select * from book;
        +----+----------------------------------------------------------+
        | id | title                                                    |
        +----+----------------------------------------------------------+
        |  1 | The Demon-Haunted World: Science as a Candle in the Dark |
        |  2 | Cosmos                                                   |
        |  3 | Programming in Scala                                     |
        +----+----------------------------------------------------------+
        3 rows in set (0.00 sec)

        mysql> select * from bookauthor;
        +---------+-----------+
        | book_id | author_id |
        +---------+-----------+
        |       1 |         1 |
        |       2 |         1 |
        |       3 |         2 |
        |       3 |         3 |
        |       3 |         4 |
        +---------+-----------+
        5 rows in set (0.00 sec)

        mysql> select * from author;
        +----+-----------+------------+-------------+-------------+---------------+
        | id | last_name | first_name | middle_name | nationality | year_of_birth |
        +----+-----------+------------+-------------+-------------+---------------+
        |  1 | Sagan     | Carl       | Edward      | NULL        |          1934 |
        |  2 | Odersky   | Martin     | NULL        | DE          |          1958 |
        |  3 | Spoon     | Lex        | NULL        | NULL        |          NULL |
        |  4 | Venners   | Bill       | NULL        | NULL        |          NULL |
        +----+-----------+------------+-------------+-------------+---------------+
        4 rows in set (0.00 sec)
  3. The Same Data in MongoDB

        {
          "_id" : ObjectId("4dfa6baa9c65dae09a4bbda5"),
          "title" : "Programming in Scala",
          "author" : [
            { "first_name" : "Martin", "last_name" : "Odersky", "nationality" : "DE", "year_of_birth" : 1958 },
            { "first_name" : "Lex", "last_name" : "Spoon" },
            { "first_name" : "Bill", "last_name" : "Venners" }
          ]
        }
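The step from the three SQL tables to the single document is "embedding": the join through bookauthor is resolved once, and the matching author rows are stored inside the book. A minimal sketch of that transformation in plain Java, using in-memory copies of the tables above. The class and method names (EmbedAuthors, embed, row, sample) are made up for illustration, and dropping NULL columns is an assumption based on how the document on the slide looks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: folds book, bookauthor and author rows into one nested document. */
final class EmbedAuthors {

    /** Builds an author row as a map, leaving out columns whose value is NULL. */
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            if (keyValues[i + 1] != null) {
                m.put((String) keyValues[i], keyValues[i + 1]);
            }
        }
        return m;
    }

    /** Resolves the bookauthor join for one book and embeds the author rows in the result. */
    static Map<String, Object> embed(int bookId,
                                     Map<Integer, String> book,
                                     List<int[]> bookAuthor,
                                     Map<Integer, Map<String, Object>> author) {
        List<Map<String, Object>> authors = new ArrayList<>();
        for (int[] link : bookAuthor) {            // link[0] = book_id, link[1] = author_id
            if (link[0] == bookId) {
                authors.add(author.get(link[1]));
            }
        }
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", book.get(bookId));
        doc.put("author", authors);
        return doc;
    }

    /** The rows from the SQL slide, embedded for book 3. */
    static Map<String, Object> sample() {
        Map<Integer, String> book = new LinkedHashMap<>();
        book.put(1, "The Demon-Haunted World: Science as a Candle in the Dark");
        book.put(2, "Cosmos");
        book.put(3, "Programming in Scala");

        List<int[]> bookAuthor = Arrays.asList(
                new int[] {1, 1}, new int[] {2, 1},
                new int[] {3, 2}, new int[] {3, 3}, new int[] {3, 4});

        Map<Integer, Map<String, Object>> author = new LinkedHashMap<>();
        author.put(1, row("first_name", "Carl", "middle_name", "Edward",
                "last_name", "Sagan", "nationality", null, "year_of_birth", 1934));
        author.put(2, row("first_name", "Martin", "last_name", "Odersky",
                "nationality", "DE", "year_of_birth", 1958));
        author.put(3, row("first_name", "Lex", "last_name", "Spoon"));
        author.put(4, row("first_name", "Bill", "last_name", "Venners"));

        return embed(3, book, bookAuthor, author);
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Reading the book back then needs no join, at the cost of repeating an author's details in every book that author wrote.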
  4. Example Question: How many unique words are there in the following quote?

    “MongoDB wasn’t designed in a lab. We built MongoDB from our own experiences building large scale, high availability, robust systems. We didn’t start from scratch, we really tried to figure out what was broken, and tackle that. So the way I think about MongoDB is that if you take MySql, and change the data model from relational to document based, you get a lot of great features: embedded docs for speed, manageability, agile development with schema-less databases, easier horizontal scalability because joins aren’t as important. There are lots of things that work great in relational databases: indexes, dynamic queries and updates to name a few, and we haven’t changed much there. For example, the way you design your indexes in MongoDB should be exactly the way you do it in MySql or Oracle, you just have the option of indexing an embedded field.”

    – Eliot Horowitz, 10gen CTO and Co-founder
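For a text this small the answer can be computed in memory. A minimal sketch, assuming a "word" is a whitespace-delimited, case-sensitive token with punctuation left attached; the class and method names are illustrative and not from the deck.

```java
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

/** Hypothetical helper: collects the distinct whitespace-delimited tokens of a text. */
final class UniqueWords {

    /** Returns the distinct tokens in sorted order; "We" and "we" count as different words. */
    static Set<String> uniqueWords(String text) {
        Set<String> words = new TreeSet<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // Pass the quote as the first argument, or fall back to a short sample.
        String text = args.length > 0 ? args[0] : "We built MongoDB and we use MongoDB";
        Set<String> words = uniqueWords(text);
        System.out.println(words.size() + " unique words: " + words);
    }
}
```

This approach stops being practical once the input no longer fits on one machine, which is where a distributed word count becomes interesting.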
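  5. Inserting Words in MongoDB

        import java.io.BufferedReader;
        import java.io.InputStream;
        import java.io.InputStreamReader;

        import com.mongodb.BasicDBObject;
        import com.mongodb.DBCollection;
        import com.mongodb.Mongo;

        public class App {
            public static void main(String[] args) throws Exception {
                // Connect to the default local mongod and select the words.in collection
                Mongo m = new Mongo();
                DBCollection coll = m.getDB("words").getCollection("in");

                // Read the file line by line
                InputStream is = App.class.getResourceAsStream("text.txt");
                BufferedReader br = new BufferedReader(new InputStreamReader(is));
                String strLine;
                while ((strLine = br.readLine()) != null) {
                    // Save each line as its own document
                    BasicDBObject dbo = new BasicDBObject();
                    dbo.put("line", strLine);
                    coll.save(dbo);
                }

                // Close the reader (which also closes the input stream) and the connection
                br.close();
                m.close();
            }
        }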
  6. Input Data in MongoDB

        >db.in.find()
        { "_id" : ObjectId("4f675806a0eee430f07db48d"), "line" : "MongoDB wasn’t designed in a lab. We built MongoDB from our own experiences building large scale, high availability, robust systems. " }
        { "_id" : ObjectId("4f675807a0eee430f07db48e"), "line" : "We didn’t start from scratch, we really tried to figure out what was broken, and tackle that. So the way I think about MongoDB is that " }
        { "_id" : ObjectId("4f675807a0eee430f07db48f"), "line" : "if you take MySql, and change the data model from relational to document based, you get a lot of great features: embedded docs for speed, " }
        { "_id" : ObjectId("4f675807a0eee430f07db490"), "line" : "manageability, agile development with schema-less databases, easier horizontal scalability because joins aren’t as important. There are " }
        ....
  7. Hadoop Job

        public static void main( String[] args ) throws Exception{
            final Configuration conf = new Configuration();
            MongoConfigUtil.setInputURI( conf, "mongodb://localhost/words.in" );
            MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/words.out" );

            final Job job = new Job( conf, "word count" );
            job.setJarByClass( WordCount.class );
            job.setMapperClass( TokenizerMapper.class );
            job.setCombinerClass( IntSumReducer.class );
            job.setReducerClass( IntSumReducer.class );
            job.setOutputKeyClass( Text.class );
            job.setOutputValueClass( IntWritable.class );
            job.setInputFormatClass( MongoInputFormat.class );
            job.setOutputFormatClass( MongoOutputFormat.class );

            System.exit( job.waitForCompletion( true ) ? 0 : 1 );
        }
  8. Classic Hadoop Word Count - Map

        public void map(LongWritable key, Text value, Context context) throws ..{
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }

    MongoDB Hadoop Adapter

        public void map(Object key, BSONObject value, Context context ) throws ....{
            // The input value is a BSON document; tokenize its "line" field
            StringTokenizer itr = new StringTokenizer( value.get( "line" ).toString() );
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
  9. Classic Hadoop Word Count - Reduce

        public void reduce( Text key, Iterable<IntWritable> values, Context context )
                throws IOException, InterruptedException{
            int sum = 0;
            for ( final IntWritable val : values ){
                sum += val.get();
            }
            context.write( key, new IntWritable(sum));
        }

    MongoDB Hadoop Adapter is the same code!
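To see how the map and reduce methods fit together without a cluster, the two phases can be imitated in plain Java. This is only a sketch of the data flow, with an in-memory grouping step standing in for Hadoop's shuffle; LocalWordCount and its methods are hypothetical names, and no Hadoop or MongoDB classes are involved.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical single-process imitation of the word count job's map and reduce phases. */
final class LocalWordCount {

    /** Map phase: emit (token, 1) for every whitespace-delimited token in one line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    /** Reduce phase: sum all values emitted for one key. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    /** Driver: map every line, group the pairs by key (the shuffle), then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("MongoDB and Hadoop", "Hadoop and MongoDB and more")));
    }
}
```

Because tokens are emitted unchanged, a word followed by a comma and the same word without one end up under different keys, as do capitalised and lower-case forms.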
  10. Word Count - Results

        >db.out.find()
        { "_id" : "For", "value" : 1 }
        { "_id" : "I", "value" : 1 }
        { "_id" : "MongoDB", "value" : 4 }
        { "_id" : "MySql", "value" : 1 }
        { "_id" : "Oracle,", "value" : 1 }
        { "_id" : "So", "value" : 1 }
        { "_id" : "There", "value" : 1 }
        { "_id" : "We", "value" : 2 }
        { "_id" : "a", "value" : 3 }
        { "_id" : "about", "value" : 1 }
        { "_id" : "agile", "value" : 1 }
        { "_id" : "an", "value" : 1 }
        { "_id" : "and", "value" : 4 }
        { "_id" : "are", "value" : 1 }
        { "_id" : "aren’t", "value" : 1 }
        { "_id" : "as", "value" : 1 }
        { "_id" : "availability,", "value" : 1 }
        { "_id" : "based,", "value" : 1 }
        { "_id" : "be", "value" : 1 }
        .......
  11. Balancing

    [Diagram: Shard 1, Shard 2, Shard 3 and Shard 4 holding chunks 1-48 in four groups of twelve, together with mongos, the balancer and three config servers. Callout: "Chunks!"]
  12. Balancing

    [Diagram: mongos, the balancer, three config servers and Shards 1-4. The shards now show chunks 1-12, 21-24, 33-36 and 45-48. Callout, shown twice: "Imbalance"]
  13. Balancing

    [Diagram: same cluster as the previous slide (chunks 1-12, 21-24, 33-36 and 45-48 across Shards 1-4). The balancer issues: "Move chunk 1 to Shard 2"]
  14. Balancing

    [Diagram: same cluster, with chunk 1 drawn apart from chunks 2-12 while it is being moved.]
  15. Balancing

    [Diagram: same content as slide 14, with chunk 1 still drawn apart from chunks 2-12.]
  16. Balancing

    [Diagram: mongos, three config servers and Shards 1-4 with chunks 1-12, 21-24, 33-36 and 45-48. Callout next to mongos: "Chunk 1 now lives on Shard 2"]
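The balancing sequence in the diagrams can be imitated with a toy model: find the shard with the most chunks and the shard with the fewest, and move one chunk between them until the counts are close. The sketch below is a deliberate simplification for illustration and does not claim to be MongoDB's actual balancer policy. The starting layout (chunks 1-12 on Shard 1, four chunks on each of the others) is read off the diagrams and is an assumption, as are the names ToyBalancer, balance and slideTwelve and the threshold parameter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy balancer: evens out chunk counts one migration at a time. */
final class ToyBalancer {

    /**
     * Repeatedly moves the first chunk of the fullest shard to the emptiest shard
     * until their chunk counts differ by at most {@code threshold}. Returns the moves made.
     */
    static List<String> balance(Map<String, Deque<Integer>> shards, int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        List<String> moves = new ArrayList<>();
        while (!shards.isEmpty()) {
            String fullest = null;
            String emptiest = null;
            for (String name : shards.keySet()) {
                if (fullest == null || shards.get(name).size() > shards.get(fullest).size()) {
                    fullest = name;
                }
                if (emptiest == null || shards.get(name).size() < shards.get(emptiest).size()) {
                    emptiest = name;
                }
            }
            if (shards.get(fullest).size() - shards.get(emptiest).size() <= threshold) {
                break;                                   // balanced enough
            }
            int chunk = shards.get(fullest).removeFirst();
            shards.get(emptiest).addLast(chunk);
            moves.add("Move chunk " + chunk + " to " + emptiest);
        }
        return moves;
    }

    /** Chunk numbers from..to inclusive, in order. */
    static Deque<Integer> range(int from, int to) {
        Deque<Integer> chunks = new ArrayDeque<>();
        for (int i = from; i <= to; i++) {
            chunks.addLast(i);
        }
        return chunks;
    }

    /** Assumed layout of the "Imbalance" slide. */
    static Map<String, Deque<Integer>> slideTwelve() {
        Map<String, Deque<Integer>> shards = new LinkedHashMap<>();
        shards.put("Shard 1", range(1, 12));
        shards.put("Shard 2", range(21, 24));
        shards.put("Shard 3", range(33, 36));
        shards.put("Shard 4", range(45, 48));
        return shards;
    }

    public static void main(String[] args) {
        Map<String, Deque<Integer>> shards = slideTwelve();
        for (String move : balance(shards, 1)) {
            System.out.println(move);
        }
        System.out.println(shards);
    }
}
```

With this layout the first move the toy model chooses is chunk 1 to Shard 2, the same migration the diagrams show.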
  17. MongoDB & Hadoop

    [Diagram: three config servers and Shards 1-4 with chunks 1-12, 21-24, 33-36 and 45-48. Two Hadoop mappers, each labelled "read config".]
  18. MongoDB & Hadoop

    [Diagram: two Hadoop mappers next to the config servers and Shards 1-4. The shard grid lists chunks 3-12, 21-24, 33-36 and 45-48, while the numbers 1 and 2 appear beside the mappers.]
  19. MongoDB & Hadoop

    [Diagram: same content as slide 18, with the numbers 1 and 2 shown beside the two Hadoop mappers.]
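The last diagrams suggest that the mappers first read the chunk layout from the config servers and then each pull individual chunks from the shards. A minimal sketch of that idea, assuming one unit of work per chunk and a simple round-robin assignment to mappers. The deck does not show how the adapter really computes its splits, so ChunkSplits, Chunk, assign and sampleConfig are hypothetical names, and the sample layout is invented apart from chunk 1 living on Shard 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical planner: turns chunk metadata into per-mapper work lists. */
final class ChunkSplits {

    /** One entry of the chunk layout: which shard currently owns which chunk. */
    static final class Chunk {
        final int id;
        final String shard;

        Chunk(int id, String shard) {
            this.id = id;
            this.shard = shard;
        }

        @Override
        public String toString() {
            return "chunk " + id + " @ " + shard;
        }
    }

    /** Deals the chunks out to the mappers in round-robin order; index = mapper number. */
    static List<List<Chunk>> assign(List<Chunk> chunks, int mappers) {
        if (mappers < 1) {
            throw new IllegalArgumentException("need at least one mapper");
        }
        List<List<Chunk>> plan = new ArrayList<>();
        for (int m = 0; m < mappers; m++) {
            plan.add(new ArrayList<>());
        }
        for (int i = 0; i < chunks.size(); i++) {
            plan.get(i % mappers).add(chunks.get(i));
        }
        return plan;
    }

    /** Assumed extract of the chunk layout after chunk 1 has moved to Shard 2. */
    static List<Chunk> sampleConfig() {
        return Arrays.asList(
                new Chunk(1, "Shard 2"),
                new Chunk(2, "Shard 1"),
                new Chunk(3, "Shard 1"),
                new Chunk(4, "Shard 1"));
    }

    public static void main(String[] args) {
        List<List<Chunk>> plan = assign(sampleConfig(), 2);
        for (int m = 0; m < plan.size(); m++) {
            System.out.println("mapper " + m + " reads " + plan.get(m));
        }
    }
}
```

Because each work item carries the owning shard, a mapper can read its chunk directly from that shard, so the reads spread across the cluster.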