Slide 1

Slide 1 text

MongoDB and Hadoop
Chris Harris
Email : [email protected]
Twitter : cj_harris5
Friday, 23 March 12

Slide 2

Slide 2 text

Terminology

RDBMS          MongoDB
Table          Collection
Row(s)         JSON Document
Index          Index
Join           Embedding & Linking
Partition      Shard
Partition Key  Shard Key

Slide 3

Slide 3 text

Here is a “simple” SQL Model

mysql> select * from book;
+----+----------------------------------------------------------+
| id | title                                                    |
+----+----------------------------------------------------------+
|  1 | The Demon-Haunted World: Science as a Candle in the Dark |
|  2 | Cosmos                                                   |
|  3 | Programming in Scala                                     |
+----+----------------------------------------------------------+
3 rows in set (0.00 sec)

mysql> select * from bookauthor;
+---------+-----------+
| book_id | author_id |
+---------+-----------+
|       1 |         1 |
|       2 |         1 |
|       3 |         2 |
|       3 |         3 |
|       3 |         4 |
+---------+-----------+
5 rows in set (0.00 sec)

mysql> select * from author;
+----+-----------+------------+-------------+-------------+---------------+
| id | last_name | first_name | middle_name | nationality | year_of_birth |
+----+-----------+------------+-------------+-------------+---------------+
|  1 | Sagan     | Carl       | Edward      | NULL        |          1934 |
|  2 | Odersky   | Martin     | NULL        | DE          |          1958 |
|  3 | Spoon     | Lex        | NULL        | NULL        |          NULL |
|  4 | Venners   | Bill       | NULL        | NULL        |          NULL |
+----+-----------+------------+-------------+-------------+---------------+
4 rows in set (0.00 sec)

Slide 4

Slide 4 text

The Same Data in MongoDB

{
    "_id" : ObjectId("4dfa6baa9c65dae09a4bbda5"),
    "title" : "Programming in Scala",
    "author" : [
        {
            "first_name" : "Martin",
            "last_name" : "Odersky",
            "nationality" : "DE",
            "year_of_birth" : 1958
        },
        {
            "first_name" : "Lex",
            "last_name" : "Spoon"
        },
        {
            "first_name" : "Bill",
            "last_name" : "Venners"
        }
    ]
}
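The step from the three SQL tables to the single document can be sketched in plain Java. This is only an illustration of the "Join becomes Embedding" row in the terminology table: it uses nested maps and lists instead of a MongoDB driver, and the class and method names (`EmbedAuthors`, `author`, `embed`) are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: folds book, bookauthor and author rows into one
// nested, document-shaped map per book.
class EmbedAuthors {

    // Builds an author sub-document, leaving out NULL columns in the same
    // way the MongoDB document omits missing fields.
    static Map<String, Object> author(String first, String last,
                                      String nationality, Integer yearOfBirth) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("first_name", first);
        doc.put("last_name", last);
        if (nationality != null) doc.put("nationality", nationality);
        if (yearOfBirth != null) doc.put("year_of_birth", yearOfBirth);
        return doc;
    }

    // Replaces the join: every bookauthor row {book_id, author_id} that
    // matches bookId contributes one embedded author document.
    static Map<String, Object> embed(int bookId, String title, int[][] bookAuthor,
                                     Map<Integer, Map<String, Object>> authors) {
        List<Map<String, Object>> embedded = new ArrayList<>();
        for (int[] link : bookAuthor) {
            if (link[0] == bookId) embedded.add(authors.get(link[1]));
        }
        Map<String, Object> book = new LinkedHashMap<>();
        book.put("title", title);
        book.put("author", embedded);
        return book;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Object>> authors = new LinkedHashMap<>();
        authors.put(2, author("Martin", "Odersky", "DE", 1958));
        authors.put(3, author("Lex", "Spoon", null, null));
        authors.put(4, author("Bill", "Venners", null, null));
        int[][] bookAuthor = { {3, 2}, {3, 3}, {3, 4} };
        System.out.println(embed(3, "Programming in Scala", bookAuthor, authors));
    }
}
```

Reading the book back then needs no join: the authors travel with the document.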

Slide 5

Slide 5 text

MongoDB and Hadoop
The classic word count

Slide 6

Slide 6 text

Example

Question: How many unique words are there in the following quote?

“MongoDB wasn’t designed in a lab. We built MongoDB from our own experiences building large scale, high availability, robust systems. We didn’t start from scratch, we really tried to figure out what was broken, and tackle that. So the way I think about MongoDB is that if you take MySql, and change the data model from relational to document based, you get a lot of great features: embedded docs for speed, manageability, agile development with schema-less databases, easier horizontal scalability because joins aren’t as important. There are lots of things that work great in relational databases: indexes, dynamic queries and updates to name a few, and we haven’t changed much there. For example, the way you design your indexes in MongoDB should be exactly the way you do it in MySql or Oracle, you just have the option of indexing an embedded field.”

- Eliot Horowitz, 10gen CTO and Co-founder

Slide 7

Slide 7 text

Inserting Words in MongoDB

public static void main(String[] args) throws Exception {
    Mongo m = new Mongo();
    DBCollection coll = m.getDB("words").getCollection("in");

    // Read File Line By Line
    InputStream is = App.class.getResourceAsStream("text.txt");
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        // save line
        BasicDBObject dbo = new BasicDBObject();
        dbo.put("line", strLine);
        coll.save(dbo);
    }

    // Close the input stream
    is.close();
}

Slide 8

Slide 8 text

Input Data in MongoDB

>db.in.find()
{ "_id" : ObjectId("4f675806a0eee430f07db48d"), "line" : "MongoDB wasn’t designed in a lab. We built MongoDB from our own experiences building large scale, high availability, robust systems. " }
{ "_id" : ObjectId("4f675807a0eee430f07db48e"), "line" : "We didn’t start from scratch, we really tried to figure out what was broken, and tackle that. So the way I think about MongoDB is that " }
{ "_id" : ObjectId("4f675807a0eee430f07db48f"), "line" : "if you take MySql, and change the data model from relational to document based, you get a lot of great features: embedded docs for speed, " }
{ "_id" : ObjectId("4f675807a0eee430f07db490"), "line" : "manageability, agile development with schema-less databases, easier horizontal scalability because joins aren’t as important. There are " }
....

Slide 9

Slide 9 text

Hadoop Job

public static void main( String[] args ) throws Exception{
    final Configuration conf = new Configuration();
    MongoConfigUtil.setInputURI( conf, "mongodb://localhost/words.in" );
    MongoConfigUtil.setOutputURI( conf, "mongodb://localhost/words.out" );

    final Job job = new Job( conf, "word count" );
    job.setJarByClass( WordCount.class );
    job.setMapperClass( TokenizerMapper.class );
    job.setCombinerClass( IntSumReducer.class );
    job.setReducerClass( IntSumReducer.class );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    job.setInputFormatClass( MongoInputFormat.class );
    job.setOutputFormatClass( MongoOutputFormat.class );

    System.exit( job.waitForCompletion( true ) ? 0 : 1 );
}

Slide 10

Slide 10 text

Classic Hadoop Word Count - Map

public void map(LongWritable key, Text value, Context context) throws ..{
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

MongoDB Hadoop Adapter

public void map(Object key, BSONObject value, Context context ) throws ....{
    // The input value is a BSON document; the text lives in its "line" field.
    StringTokenizer tokenizer = new StringTokenizer( value.get( "line" ).toString() );
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

Slide 11

Slide 11 text

Classic Hadoop Word Count - Reduce

MongoDB Hadoop Adapter is the same code!

public void reduce( Text key, Iterable<IntWritable> values, Context context )
        throws IOException, InterruptedException{
    int sum = 0;
    for ( final IntWritable val : values ){
        sum += val.get();
    }
    context.write( key, new IntWritable(sum));
}
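To see what the map and reduce steps compute without a Hadoop cluster, the same logic can be run in memory. The sketch below is a plain-Java stand-in, not the adapter's code: the class `WordCountSketch` and its methods are names invented for this illustration, and the shuffle between map and reduce is reduced to a single sorted map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory stand-in for the word count job; no Hadoop or MongoDB involved.
class WordCountSketch {

    // Map phase: emit one (word, 1) pair per whitespace-delimited token,
    // mirroring the StringTokenizer loop in the map function.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle and reduce: group the pairs by word and sum the counts,
    // as the reduce function does for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs map over every input line, then reduces all emitted pairs.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "We built MongoDB from our own experiences",
            "the way I think about MongoDB");
        System.out.println(count(lines));
    }
}
```

Because the tokenizer splits on whitespace only, punctuation stays attached to a word, so a token such as `Oracle,` is counted separately from `Oracle`.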

Slide 12

Slide 12 text

Word Count - Results

>db.out.find()
{ "_id" : "For", "value" : 1 }
{ "_id" : "I", "value" : 1 }
{ "_id" : "MongoDB", "value" : 4 }
{ "_id" : "MySql", "value" : 1 }
{ "_id" : "Oracle,", "value" : 1 }
{ "_id" : "So", "value" : 1 }
{ "_id" : "There", "value" : 1 }
{ "_id" : "We", "value" : 2 }
{ "_id" : "a", "value" : 3 }
{ "_id" : "about", "value" : 1 }
{ "_id" : "agile", "value" : 1 }
{ "_id" : "an", "value" : 1 }
{ "_id" : "and", "value" : 4 }
{ "_id" : "are", "value" : 1 }
{ "_id" : "aren’t", "value" : 1 }
{ "_id" : "as", "value" : 1 }
{ "_id" : "availability,", "value" : 1 }
{ "_id" : "based,", "value" : 1 }
{ "_id" : "be", "value" : 1 }
.......
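The output keys keep their case and trailing punctuation (`Oracle,`, `based,`, `We`), so the number of result documents is not quite the number of unique words asked for in the question. One possible way to close that gap, not something shown on the slides, is to normalize each token before counting. The class `UniqueWords` and its normalization rule are assumptions made for this sketch.

```java
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

// Hypothetical post-processing step; not part of the job shown in the deck.
class UniqueWords {

    // Assumed rule: lower-case the token and trim any leading or trailing
    // characters that are not letters or digits ("Oracle," becomes "oracle").
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    // Collects the distinct normalized words of a text.
    static Set<String> unique(String text) {
        Set<String> words = new TreeSet<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String word = normalize(tokenizer.nextToken());
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = unique("We built it, we did. Oracle, Oracle");
        System.out.println(words.size() + " unique words: " + words);
    }
}
```

In a real job the same `normalize` call would sit inside the map function, just before the word is emitted.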

Slide 13

Slide 13 text

Scaling

Slide 14

Slide 14 text

http://community.qlikview.com/cfs-filesystemfile.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/theqlikviewblog/Cutting-Grass-with-Scissors-_2D00_-2.jpg

Slide 15

Slide 15 text

http://www.bitquill.net/blog/wp-content/uploads/2008/07/pack_of_harvesters.jpg

Slide 16

Slide 16 text

Balancing

[Diagram: a mongos with its balancer and three config servers, plus Shards 1 to 4. Chunks 1 to 48 are spread across the four shards. Callout: "Chunks!"]

Slide 17

Slide 17 text

Balancing

[Diagram: mongos, balancer, three config servers and Shards 1 to 4. The shards now hold chunks 1 to 12, 21 to 24, 33 to 36 and 45 to 48, and the layout is marked "Imbalance".]

Slide 18

Slide 18 text

Balancing

[Diagram: same cluster as the previous slide. The balancer issues the instruction "Move chunk 1 to Shard 2".]

Slide 19

Slide 19 text

Balancing

[Diagram: same cluster. Chunk 1 is drawn apart from the other chunks while it is being moved.]

Slide 20

Slide 20 text

Balancing

[Diagram: same cluster. Chunk 1 is still drawn apart from the other chunks while it is being moved.]

Slide 21

Slide 21 text

Balancing

[Diagram: mongos, three config servers and Shards 1 to 4 with their chunks. Callout: "Chunk 1 now lives on Shard 2".]
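The balancing sequence above (detect an imbalance, move one chunk, repeat) can be sketched as a small loop. This is a deliberately simplified model and not MongoDB's actual balancer: the class `BalancerSketch`, the `threshold` parameter and the chunk placement in `main` are all assumptions chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of chunk balancing; not MongoDB's real algorithm.
class BalancerSketch {

    // Moves one chunk at a time from the fullest shard to the emptiest one
    // until their chunk counts differ by at most `threshold`.
    // Returns the number of chunk migrations performed.
    static int balance(Map<String, Deque<Integer>> shards, int threshold) {
        if (threshold < 1) {
            // With a threshold of 0 an odd chunk would bounce back and forth forever.
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        int moves = 0;
        while (true) {
            String fullest = null;
            String emptiest = null;
            for (String name : shards.keySet()) {
                int size = shards.get(name).size();
                if (fullest == null || size > shards.get(fullest).size()) fullest = name;
                if (emptiest == null || size < shards.get(emptiest).size()) emptiest = name;
            }
            if (fullest == null
                    || shards.get(fullest).size() - shards.get(emptiest).size() <= threshold) {
                return moves;
            }
            // Migrate the first chunk of the fullest shard to the emptiest shard.
            shards.get(emptiest).addLast(shards.get(fullest).removeFirst());
            moves++;
        }
    }

    public static void main(String[] args) {
        // Made-up starting layout: Shard 1 is overloaded.
        Map<String, Deque<Integer>> shards = new LinkedHashMap<>();
        shards.put("Shard 1", new ArrayDeque<>(List.of(1, 5, 9, 21, 33, 45)));
        shards.put("Shard 2", new ArrayDeque<>(List.of(2, 6)));
        shards.put("Shard 3", new ArrayDeque<>(List.of(3, 7)));
        shards.put("Shard 4", new ArrayDeque<>(List.of(4, 8)));
        int moves = balance(shards, 1);
        System.out.println(moves + " migrations, result: " + shards);
    }
}
```

In the real cluster the new chunk location is recorded on the config servers, which is how mongos learns that chunk 1 now lives on Shard 2.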

Slide 22

Slide 22 text

MongoDB & Hadoop

[Diagram: three config servers and Shards 1 to 4 with their chunks. Two Hadoop mappers are each labeled "read config".]

Slide 23

Slide 23 text

MongoDB & Hadoop

[Diagram: three config servers and Shards 1 to 4 with their chunks. Chunks 1 and 2 are shown at the two Hadoop mappers, which read them from the shards.]

Slide 24

Slide 24 text

MongoDB & Hadoop

[Diagram: same layout as the previous slide. Chunks 1 and 2 are shown at the two Hadoop mappers.]
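The idea in these diagrams, that mappers first read the chunk layout from the config servers and then each read a chunk from the shard that holds it, can be sketched as a planning step. The code below is a hypothetical model only: `ChunkSplits`, `Chunk`, `Split` and the round-robin assignment are assumptions for this illustration, the chunk placement is made up, and in a real job Hadoop itself schedules input splits onto mappers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of turning chunk metadata into per-mapper work items.
class ChunkSplits {

    // Assumed view of one chunk entry as read from the config servers.
    static final class Chunk {
        final int id;
        final String shard;

        Chunk(int id, String shard) {
            this.id = id;
            this.shard = shard;
        }
    }

    // One unit of work: a mapper reads one chunk from the shard that holds it.
    static final class Split {
        final int mapper;
        final Chunk chunk;

        Split(int mapper, Chunk chunk) {
            this.mapper = mapper;
            this.chunk = chunk;
        }

        @Override
        public String toString() {
            return "mapper " + mapper + " reads chunk " + chunk.id + " from " + chunk.shard;
        }
    }

    // Creates one split per chunk and hands the splits out round-robin.
    static List<Split> plan(List<Chunk> chunks, int mappers) {
        if (mappers < 1) {
            throw new IllegalArgumentException("need at least one mapper");
        }
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            splits.add(new Split(i % mappers, chunks.get(i)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up chunk placement for illustration.
        List<Chunk> chunks = List.of(
            new Chunk(1, "Shard 2"),
            new Chunk(2, "Shard 2"),
            new Chunk(3, "Shard 3"),
            new Chunk(4, "Shard 4"));
        for (Split split : plan(chunks, 2)) {
            System.out.println(split);
        }
    }
}
```

Because each split names a single shard, the mappers can read in parallel from different shards instead of funnelling every read through one node.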

Slide 25

Slide 25 text

conferences, appearances http://www.10gen.com/events
and meetups http://www.meetup.com/London-MongoDB-User-Group
download at mongodb.org
We’re Hiring!
Chris Harris
Email : [email protected]
Twitter : cj_harris5