
MongoDB + Hadoop: Dublin MUG

Brendan McAdams

October 12, 2012
Transcript

  1. Brendan McAdams, 10gen, Inc. ([email protected], @rit)
     Taming The Elephant In The Room with MongoDB + Hadoop Integration
  2. Big Data at a Glance
     [diagram: a large dataset with "username" as the primary key]
     • Big Data can be gigabytes, terabytes, petabytes or exabytes
     • An ideal big data system scales up and down around various data sizes, while providing a uniform view
     • Major concerns:
       • Can I read & write this data efficiently at different scales?
       • Can I run calculations on large portions of this data?
  8. Big Data at a Glance
     [diagram: the large dataset, keyed by "username", broken into chunks]
     • Systems like Google File System (which inspired Hadoop's HDFS) and MongoDB's sharding handle the scale problem by chunking
     • Break up pieces of data into smaller chunks, spread across many data nodes
     • Each data node contains many chunks
     • If a chunk gets too large or a node becomes overloaded, data can be rebalanced
 13. Chunks Represent Ranges of Values
     • Initially, an empty collection has a single chunk covering the range from minimum (-∞) to maximum (+∞)
     • As we add data, more chunks are created for new ranges: after INSERT {USERNAME: "Bill"}, the diagram shows split points at -∞, "B", "C", +∞
     • Individual or partial letter ranges are one possible chunk value... but they can get smaller: INSERT {USERNAME: "Becky"} and INSERT {USERNAME: "Brendan"} produce split points such as "Ba", "Be", "Br"
     • The smallest possible chunk value is not a range, but a single possible value, as with INSERT {USERNAME: "Brad"}
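
     To make the range idea concrete, here is a minimal, illustrative Python sketch (not MongoDB's actual routing code; the split points are hypothetical) showing how a shard-key value maps onto exactly one chunk:

        # sketch: route a shard-key value to the chunk whose range contains it
        import bisect

        split_points = ["B", "Be", "Br", "C"]   # hypothetical split points between -inf and +inf
        chunks = ["(-inf, B)", "[B, Be)", "[Be, Br)", "[Br, C)", "[C, +inf)"]

        def chunk_for(key):
            # the number of split points <= key is the index of the owning chunk
            return chunks[bisect.bisect_right(split_points, key)]

        print(chunk_for("Becky"))    # -> [Be, Br)
        print(chunk_for("Brad"))     # -> [Br, C)
        print(chunk_for("Aaron"))    # -> (-inf, B)
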
 17. Big Data at a Glance
     [diagram: the dataset split into chunks labelled a b c d e f g h ... s t u v w x y z]
     • To simplify things, let's look at our dataset split into chunks by letter
     • Each chunk is represented by a single letter marking its contents
     • You could think of "B" as really being "Ba" → "Bz"
 19. Big Data at a Glance
     [diagram: chunks a-z of the dataset, keyed by "username"]
     • MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB)
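
     For context, a hedged sketch of turning on this chunking for a collection from Python with PyMongo (the mongos address and the database/collection names are illustrative assumptions):

        # sketch: enable sharding and chunk a collection by its "username" key
        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # assumed mongos router
        client.admin.command("enableSharding", "mydb")
        client.admin.command("shardCollection", "mydb.users",
                             key={"username": 1})
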
 21. Big Data at a Glance
     [diagram: chunks a-z spread across Data Nodes 1-4, each holding 25% of the chunks]
     • Representing data as chunks allows many levels of scale across n data nodes
 23. Scaling
     [diagram: chunks a-z evenly spread across Data Nodes 1-5]
     • The set of chunks can be evenly distributed across n data nodes
 25. Add Nodes: Chunk Rebalancing
     [diagram: chunks a-z redistributed across Data Nodes 1-5]
     • The goal is equilibrium: an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance.
 27. Writes Routed to Appropriate Chunk
     [diagram: a write to key "ziggy" lands on the data node holding the "z" chunk]
     • Writes are efficiently routed to the appropriate node & chunk
 31. Chunk Splitting & Balancing
     [diagram: continued writes to key "ziggy" grow the "z" chunk, which splits into chunks z1 and z2]
     • If a chunk gets too large (the default in MongoDB is 64 MB per chunk), it is split into two new chunks
 36. Chunk Splitting & Balancing
     [diagram: chunks z1 and z2 in place of the original z chunk]
     • Each new part of the z chunk (left & right) now contains half of the keys
 37. Chunk Splitting & Balancing
     [diagram: chunks, including z1 and z2, spread across Data Nodes 1-5]
     • As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server
 38. Reads with Key Routed Efficiently
     [diagram: a read of key "xavier" routed to the node holding the "x" chunk]
     • Reading a single value by primary key: the read is routed efficiently to the specific chunk containing that key
 41. Reads with Key Routed Efficiently
     [diagram: a read of keys "T" → "X" routed to the chunks t, u, v, w, x]
     • Reading multiple values by primary key: reads are routed efficiently to the specific chunks in the range
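
     As an illustrative sketch (assuming the mydb.users collection sharded on username from earlier), both kinds of read are just ordinary queries on the shard key; mongos does the routing:

        # sketch: shard-key reads that mongos can target to specific chunks
        from pymongo import MongoClient

        users = MongoClient("mongodb://localhost:27017").mydb.users   # via a mongos router

        # single-key read: routed to the one chunk containing "xavier"
        print(users.find_one({"username": "xavier"}))

        # range read: routed only to the chunks covering "T" through "X"
        for doc in users.find({"username": {"$gte": "T", "$lt": "X"}}):
            print(doc["username"])
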
 43. Processing Scalable Big Data
     • Just as we must be able to scale our storage of data (from gigabytes through exabytes and beyond), we must be able to process it
     • We had two questions, one of which we've answered...
       • Can I read & write this data efficiently at different scales?
       • Can I run calculations on large portions of this data?
 48. Don't Bite Off More Than You Can Chew...
     • The answer to calculating big data is much the same as storing it
     • We need to break our data into bite-sized pieces
     • Build functions which can be composed together repeatedly on partitions of our data
     • Process portions of the data across multiple calculation nodes
     • Aggregate the results into a final set of results
 53. Bite-Sized Pieces Are Easier to Swallow
     • These pieces are not chunks; rather, they are the individual data points that make up each chunk
     • Chunks make useful data-transfer units for processing as well
     • Transfer chunks as "Input Splits" to calculation nodes, allowing for scalable parallel processing
 56. MapReduce the Pieces
     • The most common application of these techniques is MapReduce
     • Based on a Google whitepaper, it works with two primary functions, map and reduce, to calculate against large datasets
 58. MapReduce to Calculate Big Data
     • MapReduce is designed to effectively process data at varying scales
     • Composable function units can be reused repeatedly for scaled results
     • MongoDB supports MapReduce with JavaScript
     • There are limitations on its scalability
 62. MapReduce to Calculate Big Data
     • In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation
     • MongoDB can be integrated to MapReduce data on Hadoop
     • No HDFS storage needed: data moves directly between MongoDB and Hadoop's MapReduce engine
 65. What is MapReduce?
     • MapReduce is made up of a series of phases, the primary of which are:
       • Map
       • Shuffle
       • Reduce
     • Let's look at a typical MapReduce job:
       • Email records
       • Count the # of times a particular user has received email
 72. MapReducing Email
     to: tyler    from: brendan  subject: Ruby Support
     to: brendan  from: tyler    subject: Re: Ruby Support
     to: mike     from: brendan  subject: Node Support
     to: brendan  from: mike     subject: Re: Node Support
     to: mike     from: tyler    subject: COBOL Support
     to: tyler    from: mike     subject: Re: COBOL Support (WTF?)
 73. Map Step
     • The map function breaks each document into a key (grouping) & value, calling emit(k, v):
       to: tyler,   from: brendan  →  key: tyler,   value: {count: 1}
       to: brendan, from: tyler    →  key: brendan, value: {count: 1}
       to: mike,    from: brendan  →  key: mike,    value: {count: 1}
       to: brendan, from: mike     →  key: brendan, value: {count: 1}
       to: mike,    from: tyler    →  key: mike,    value: {count: 1}
       to: tyler,   from: mike     →  key: tyler,   value: {count: 1}
 76. Group/Shuffle Step
     • Group like keys together, creating an array of their values (done automatically by M/R frameworks):
       key: brendan  values: [{count: 1}, {count: 1}]
       key: mike     values: [{count: 1}, {count: 1}]
       key: tyler    values: [{count: 1}, {count: 1}]
 79. Reduce Step
     • For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result):
       key: brendan  values: [{count: 1}, {count: 1}]  →  value: {count: 2}
       key: mike     values: [{count: 1}, {count: 1}]  →  value: {count: 2}
       key: tyler    values: [{count: 1}, {count: 1}]  →  value: {count: 2}
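
     To make the three phases concrete, here is a small self-contained Python sketch of the same walkthrough (plain Python, no Hadoop or MongoDB involved; the email list is the toy data from the slides):

        # sketch: map / shuffle / reduce over the toy email records
        from collections import defaultdict

        emails = [
            {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
            {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
            {"to": "mike",    "from": "brendan", "subject": "Node Support"},
            {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
            {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
            {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
        ]

        # map: emit one (key, value) pair per document, keyed by recipient
        mapped = [(doc["to"], {"count": 1}) for doc in emails]

        # shuffle: group values by key (done automatically by M/R frameworks)
        grouped = defaultdict(list)
        for key, value in mapped:
            grouped[key].append(value)

        # reduce: flatten each key's values to a single result
        results = {key: {"count": sum(v["count"] for v in values)}
                   for key, values in grouped.items()}

        print(results)   # {'tyler': {'count': 2}, 'brendan': {'count': 2}, 'mike': {'count': 2}}
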
 83. Processing Scalable Big Data
     • MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)
     • MapReduce is supported in many places, including MongoDB & Hadoop
     • We have effective answers for both of our concerns:
       • Can I read & write this data efficiently at different scales?
       • Can I run calculations on large portions of this data?
 89. Integrating MongoDB + Hadoop
     • Data storage and data processing are often separate concerns
     • MongoDB has limited ability to aggregate and process large datasets (JavaScript parallelism, alleviated somewhat by the new Aggregation Framework)
     • Hadoop is built for scalable processing of large datasets
 92. MapReducing in MongoDB - Single Server
     [diagram: the large dataset on a single mongod, keyed by "username"]
     • Only one MapReduce thread available
 93. MapReducing in MongoDB - Sharding
     [diagram: Data Nodes 1-5, each running one MapReduce thread per shard (no per-chunk parallelism)]
     • Architecturally, the number of processing nodes is limited to our number of data storage nodes
 95. The Right Tool for the Job
     • JavaScript isn't always the ideal language for many types of calculations:
       • Slow
       • Limited datatypes
       • No access to the complex analytics libraries available on the JVM
     • There is a rich, powerful ecosystem of tools on the JVM + Hadoop
     • Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB
  101. Being a Good Neighbor
     • Integration with customers' existing stacks & toolchains is crucial
     • Many users & customers already have Hadoop in their stacks
     • They want us to "play nicely" with their existing toolchains
     • Different groups within companies may mandate that all data be processable in Hadoop
  105. Introducing the MongoDB Hadoop Connector
     • We recently released v1.0.0 of this integration: the MongoDB Hadoop Connector
     • Read/write between MongoDB + Hadoop (core MapReduce) in Java
     • Write Pig (ETL) jobs' output to MongoDB
     • Write MapReduce jobs in Python via Hadoop Streaming
     • Collect massive amounts of logging output into MongoDB via Flume
  110. Hadoop Connector Capabilities
     • Split large datasets into smaller chunks ("Input Splits") for parallel Hadoop processing
     • Without splits, only one mapper can run
     • The connector can split both sharded & unsharded collections
       • Sharded: read individual chunks from the config server into Hadoop (see the sketch below)
       • Unsharded: create splits, similar to how sharding chunks are calculated
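
     A hedged sketch of the sharded case, reading chunk ranges from the cluster's config database with PyMongo (on MongoDB versions of that era, config.chunks records each chunk's namespace, min/max bounds, and owning shard; the namespace below is an assumption for illustration):

        # sketch: list the chunks (potential input splits) of a sharded collection
        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")   # a mongos router
        chunks = client.config.chunks.find({"ns": "enron_mail.messages"})

        for chunk in chunks:
            # each chunk document records its key range and the shard that owns it
            print(chunk["shard"], chunk["min"], chunk["max"])
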
  115. MapReducing MongoDB + Hadoop - Single Server
     [diagram: the large dataset on a single mongod, keyed by "username", feeding multiple Hadoop nodes]
     • Each Hadoop node runs a processing task per core
  117. MapReducing MongoDB + Hadoop - Sharding
     [diagram: chunks on Data Nodes 1-5 feeding multiple Hadoop nodes]
     • Each Hadoop node runs a processing task per core
  119. Parallel Processing of Splits
     • Ship "Splits" to mappers as hostname, database, collection, & query (see the sketch below)
     • Each mapper reads the relevant documents in
     • Parallel processing for high performance
     • Speaks BSON between all layers!
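
     A rough sketch of what that split hand-off looks like from the mapper's side (the split fields below are an illustrative stand-in, not the connector's exact wire format):

        # sketch: a mapper consuming one "split" described as host/db/collection/query
        from pymongo import MongoClient

        split = {
            "host": "shard0001.example.com:27018",        # hypothetical shard host
            "database": "enron_mail",
            "collection": "messages",
            "query": {"_id": {"$gte": "t", "$lt": "u"}},  # this split's key range
        }

        client = MongoClient(split["host"])
        cursor = client[split["database"]][split["collection"]].find(split["query"])

        for doc in cursor:
            pass   # hand each BSON document to the user's map function
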
  123. Python Streaming
     • The Hadoop Streaming interface is much easier to demo (it's also my favorite feature, and was the hardest to implement)
     • Java gets a bit... "verbose" on slides versus Python
     • Java Hadoop + MongoDB integrates cleanly, though, for those so inclined
     • Map functions get an initial key of type Object and a value of type BSONObject
       • These represent _id and the full document, respectively
     • We'll process 1.75 gigabytes of the Enron Email Corpus (501,513 emails)
       • I ran this test on a 6-node Hadoop cluster
       • Grab your own copy of this dataset at: http://goo.gl/fSleC
  131. A Sample Input Doc
     {
       "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
       "body" : "Here is our forecast\n\n ",
       "subFolder" : "allen-p/_sent_mail",
       "mailbox" : "maildir",
       "filename" : "1.",
       "headers" : {
         "X-cc" : "",
         "From" : "[email protected]",
         "Subject" : "",
         "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
         "Content-Transfer-Encoding" : "7bit",
         "X-bcc" : "",
         "To" : "[email protected]",
         "X-Origin" : "Allen-P",
         "X-FileName" : "pallen (Non-Privileged).pst",
         "X-From" : "Phillip K Allen",
         "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
         "X-To" : "Tim Belden ",
         "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
         "Content-Type" : "text/plain; charset=us-ascii",
         "Mime-Version" : "1.0"
       }
     }
  132. Setting up Hadoop Streaming
     • Install the Python support module on each Hadoop node:
       $ sudo pip install pymongo_hadoop
     • Build (or download) the Streaming module for the Hadoop adapter:
       $ git clone http://github.com/mongodb/mongo-hadoop.git
       $ ./sbt mongo-hadoop-streaming/assembly
  133. Mapper Code
     #!/usr/bin/env python
     import sys
     sys.path.append(".")
     from pymongo_hadoop import BSONMapper

     def mapper(documents):
         i = 0
         for doc in documents:
             i = i + 1
             if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                 from_field = doc['headers']['From']
                 to_field = doc['headers']['To']
                 recips = [x.strip() for x in to_field.split(',')]
                 for r in recips:
                     yield {'_id': {'f': from_field, 't': r}, 'count': 1}

     BSONMapper(mapper)
     print >> sys.stderr, "Done Mapping."
  134. Reducer Code
     #!/usr/bin/env python
     import sys
     sys.path.append(".")
     from pymongo_hadoop import BSONReducer

     def reducer(key, values):
         print >> sys.stderr, "Processing from/to %s" % str(key)
         _count = 0
         for v in values:
             _count += v['count']
         return {'_id': key, 'count': _count}

     BSONReducer(reducer)
  135. Running the MapReduce
     hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
         -mapper /home/ec2-user/enron_map.py \
         -reducer /home/ec2-user/enron_reduce.py \
         -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
         -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map
  136. Results!
     mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
     { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
     has more
  137. Parallelism is Good
     • The input data was split into 44 pieces for parallel processing...
     • ...coincidentally, there were exactly 44 chunks on my sharded setup
     • Even with an unsharded collection, Mongo-Hadoop can calculate splits!
  141. We Aren't Restricted to Python
     • For Mongo-Hadoop 1.0, Streaming only shipped with Python support
     • Currently in git master, and due to be released with 1.1, is support for two additional languages:
       • Ruby (Tyler Brock - @tylerbrock)
       • Node.JS (Mike O'Brien - @mpobrien)
     • The same Enron MapReduce job can be accomplished with either of these languages as well
  146. Ruby + Mongo-Hadoop Streaming
     • As there isn't an official release of the Ruby support yet, you'll need to build the gem by hand out of git
     • As with Python, make sure you install this gem on each of your Hadoop nodes
     • Once the gem is built & installed, you'll have access to the mongo-hadoop module from Ruby
  149. Enron Map from Ruby
     #!/usr/bin/env ruby
     require 'mongo-hadoop'

     MongoHadoop.map do |document|
       if document.has_key?('headers')
         headers = document['headers']
         if ['To', 'From'].all? { |header| headers.has_key?(header) }
           to_field = headers['To']
           from_field = headers['From']
           recipients = to_field.split(',').map { |recipient| recipient.strip }
           recipients.map { |recipient| {:_id => {:f => from_field, :t => recipient}, :count => 1} }
         end
       end
     end
  150. Enron Reduce from Ruby
     #!/usr/bin/env ruby
     require 'mongo-hadoop'

     MongoHadoop.reduce do |key, values|
       count = values.reduce(0) { |sum, current| sum + current['count'] }
       { :_id => key, :count => count }
     end
  151. Running the Ruby MapReduce
     hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
         -mapper examples/enron/enron_map.rb \
         -reducer examples/enron/enron_reduce.rb \
         -inputURI mongodb://127.0.0.1/enron_mail.messages \
         -outputURI mongodb://127.0.0.1/enron_mail.output
  152. Node.JS + Mongo-Hadoop Streaming
     • As there isn't an official release of the Node.JS support yet, you'll need to build the Node module by hand out of git
     • As with Python, make sure you install this package on each of your Hadoop nodes
     • Once the package is built & installed, you'll have access to the node_mongo_hadoop module from Node.JS
  155. Enron Map from Node.JS
     #!/usr/bin/env node
     var node_mongo_hadoop = require('node_mongo_hadoop')

     var trimString = function(str){
       return String(str).replace(/^\s+|\s+$/g, '');
     }

     function mapFunc(doc, callback){
       if(doc.headers && doc.headers.From && doc.headers.To){
         var from_field = doc['headers']['From']
         var to_field = doc['headers']['To']
         var recips = []
         to_field.split(',').forEach(function(to){
           callback( {'_id': {'f':from_field, 't':trimString(to)}, 'count': 1} )
         });
       }
     }

     node_mongo_hadoop.MapBSONStream(mapFunc);
  156. Enron Reduce from Node.JS
     #!/usr/bin/env node
     var node_mongo_hadoop = require('node_mongo_hadoop')

     function reduceFunc(key, values, callback){
       var count = 0;
       values.forEach(function(v){
         count += v.count
       });
       callback( {'_id': key, 'count': count } );
     }

     node_mongo_hadoop.ReduceBSONStream(reduceFunc);
  157. Running the Node.JS MapReduce
     hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
         -mapper examples/enron/enron_map.js \
         -reducer examples/enron/enron_reduce.js \
         -inputURI mongodb://127.0.0.1/enron_mail.messages \
         -outputURI mongodb://127.0.0.1/enron_mail.output
  158. Hive + MongoDB
     • We are always seeking to add new functionality, and recently began integrating Hive with MongoDB + Hadoop
  159. Hive + MongoDB
     • Hive is a Hadoop-based data warehousing system, providing a SQL-like language (dubbed "QL")
     • Designed for large datasets stored on HDFS
     • Lots of SQL-like facilities such as data summarization, aggregation and analysis; all compile down to Hadoop MapReduce tasks
     • Custom user-defined functions can even replace inefficient Hive queries with raw MapReduce
     • Many users have requested support for this with MongoDB data
  164. Sticking BSON in the Hive
     • Step 1 involved teaching Hive to read MongoDB backup files - essentially, raw BSON
     • While there are some APIs that we could use to talk directly to MongoDB, we haven't explored that yet
     • With this code, it is possible to load a .bson file (typically produced by mongodump) directly into Hive and query it
     • No conversion to a "native" Hive format is needed - BSON is read directly
     • It still needs some polish and tweaking, but this is now slated for inclusion in the upcoming 1.1 release
  169. Loading BSON into Hive
     • As Hive emulates a relational database, tables need a schema (we're evaluating ways to 'infer' the schema to make this more automatic)
     • Let's load some MongoDB collections into Hive and play with the data!
  170. Loading BSON into Hive
     • We have BSON files to load; now we need to instruct Hive about their schemas...
  171. Defining Hive Schemas
     • We've given Hive some instructions about the structure as well as the storage of our MongoDB files. Let's look at "scores" closely:

       CREATE TABLE scores ( student int, name string, score int )
       ROW FORMAT SERDE "com.mongodb.hadoop.hive.BSONSerde"
       STORED AS
         INPUTFORMAT "com.mongodb.hadoop.hive.input.BSONFileInputFormat"
         OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
       LOCATION "/Users/brendan/code/mongodb/mongo-hadoop/hive/demo/meta/scores";

     • The first line defines the structure: a column each for 'student', 'name' and 'score', each with a SQL-like datatype
     • ROW FORMAT SERDE instructs Hive to use a SerDe of 'BSONSerde'
       • In Hive, a SerDe is a special codec that explains how to read and write (serialize and deserialize) a custom data format containing Hive rows
     • We also tell Hive to use an INPUTFORMAT of 'BSONFileInputFormat', which tells it how to read BSON files off of disk into individual rows (the SerDe is the instructions for how to turn individual BSON documents into a Hive-friendly format)
     • Finally, we specify where Hive should store the metadata, etc. with LOCATION
  178. Loading Data to Hive
     • Finally, we need to load the data into the Hive table from our raw BSON file:

       hive> LOAD DATA LOCAL INPATH "dump/training/scores.bson" INTO TABLE scores;

     • Now we can query!
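
     As a quick sanity check before loading, you can peek at the raw dump from Python with the bson package that ships with PyMongo (a sketch; the path matches the LOAD DATA statement above, and the field names are assumed to match the Hive schema):

        # sketch: inspect a mongodump .bson file to confirm it matches the Hive schema
        import bson

        with open("dump/training/scores.bson", "rb") as dump:
            for i, doc in enumerate(bson.decode_file_iter(dump)):
                print(doc.get("student"), doc.get("name"), doc.get("score"))
                if i == 4:      # just show the first few documents
                    break
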
  179. Querying Hive
     • Most standard SQL-like queries work - though I'm not going to enumerate the ins and outs of HiveQL today
     • What we can do with Hive that we can't with MongoDB... is joins
     • In addition to the scores data, I also created a collection of student IDs and randomly generated names. Let's look at joining these to our scores in Hive
  182. Joins from BSON + Hive
     hive> SELECT u.firstName, u.lastName, u.sex, s.name, s.score FROM
         > scores s JOIN students u ON u.studentID = s.student
         > ORDER BY s.score DESC;
     DELPHIA    DOUIN      Female  exam   99
     DOMINIQUE  SUAZO      Male    essay  99
     ETTIE      BETZIG     Female  exam   99
     ADOLFO     PIRONE     Male    exam   99
     IVORY      NETTERS    Male    essay  99
     RAFAEL     HURLES     Male    essay  99
     KRISTEN    VALLERO    Female  exam   99
     CONNIE     KNAPPER    Female  quiz   99
     JEANNA     DIVELY     Female  exam   99
     TRISTAN    SEGAL      Male    exam   99
     WILTON     TRULOVE    Male    essay  99
     THAO       OTSMAN     Female  essay  99
     CLARENCE   STITZ      Male    quiz   99
     LUIS       GUAMAN     Male    exam   99
     WILLARD    RUSSAK     Male    quiz   99
     MARCOS     HOELLER    Male    quiz   99
     TED        BOTTCHER   Male    essay  99
     LAKEISHA   NAGAMINE   Female  essay  99
     ALLEN      HITT       Male    exam   99
     MADELINE   DAWKINS    Female  essay  99
  183. Looking Forward
     • The Mongo-Hadoop Connector 1.0.0 is released and available
     • Docs: http://api.mongodb.org/hadoop/
     • Downloads & Code: http://github.com/mongodb/mongo-hadoop
  184. Looking Forward
     • Lots more coming; 1.1.0 is in development:
       • Support for reading from multiple input collections ("MultiMongo")
       • Static BSON support... read from and write to Mongo backup files!
         • S3 / HDFS stored, mongodump format
         • Great for big offline batch jobs (this is how Foursquare does it)
       • Pig input (read from MongoDB into Pig)
       • Performance improvements (e.g. pipelining BSON for streaming)
       • Future: expanded ecosystem support (Cascading, Oozie, Mahout, etc.)
  185. Looking Forward
     • We are looking to grow our integration with Big Data
     • Not only Hadoop, but other data processing systems our users want, such as Storm, Disco and Spark
     • Initial Disco support (Nokia's Python MapReduce framework) is almost complete
     • If you have other data processing toolchains you'd like to see integration with, let us know!