JRuby and Big Data

This is an overview talk discussing what people mean by the term "Big Data" and, when you have a Big Data problem, some of the tools from the Hadoop ecosystem that may help you out.

Jeremy Hinegardner

September 30, 2011

Transcript

1. Officially... “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” -- Wikipedia (Big_Data)

2. Officially... “...increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources).” -- Gartner, Inc.

3-9. Unofficially... Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. It's wrong. And we'll need to run that every day, and it needs to be done by 8 a.m. On the previous day's data. “You need the number of WHAT by tomorrow?!? &#@!”

10-16. 2,500 bytes / tweet. 155,000,000 Tweets / day. 16,145,833,333 bytes / hour. 4,484,953 bytes / second. 4.2 Megabytes / second, or 33.8 Megabits / second, or the majority of an OC-1 SONET line. http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

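A quick check of that arithmetic in Ruby (the deck's last two figures are rounded a little differently; an OC-1 carries 51.84 Mbit/s, so roughly 36 Mbit/s is indeed the majority of one):

    bytes_per_tweet = 2_500
    tweets_per_day  = 155_000_000

    bytes_per_day  = bytes_per_tweet * tweets_per_day  # 387,500,000,000
    bytes_per_hour = bytes_per_day / 24                # 16,145,833,333
    bytes_per_sec  = bytes_per_day / 86_400            # 4,484,953

    puts bytes_per_sec / (1024.0 * 1024.0)             # ~4.3 MB/s
    puts bytes_per_sec * 8 / 1_000_000.0               # ~35.9 Mbit/s
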
17-19. Sampling works. Given a population of 155,000,000 things, how many things do you NEED to look at to make an analysis about the population? With a 1% error tolerance at 99% confidence: 16,588. http://www.custominsight.com/articles/random-sample-calculator.asp

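As a sketch of where that 16,588 comes from, the standard sample-size formula for a proportion, with a finite-population correction (z = 2.576 for 99% confidence, worst-case p = 0.5), reproduces the calculator's answer:

    z = 2.576             # z-score for 99% confidence
    e = 0.01              # 1% margin of error
    p = 0.5               # worst-case population proportion
    population = 155_000_000

    n0 = (z**2 * p * (1 - p)) / e**2       # sample size, infinite population
    n  = n0 / (1 + (n0 - 1) / population)  # finite-population correction
    puts n.ceil                            # => 16588
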
20. I do need to process large volumes of data in a timely manner.

21-23. [Diagram: HDFS. A file is split into blocks (1-9); the blocks are spread across the data nodes and replicated until three copies of each block exist.]

  24. HDFS & JRUBY? Not Really Maybe if you want to

    write files directly to HDFS Friday, September 30, 11
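For that one case, a minimal sketch of writing straight to HDFS from JRuby through the stock Hadoop Java API (this assumes the Hadoop client jars are already on the classpath; the namenode address is made up):

    require 'java'

    java_import 'org.apache.hadoop.conf.Configuration'
    java_import 'org.apache.hadoop.fs.FileSystem'
    java_import 'org.apache.hadoop.fs.Path'

    conf = Configuration.new
    conf.set('fs.default.name', 'hdfs://namenode:8020')  # assumed address

    fs  = FileSystem.get(conf)
    out = fs.create(Path.new('/tmp/hello.txt'))
    out.write_bytes("hello from JRuby\n")  # DataOutputStream#writeBytes
    out.close
    fs.close
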
25-28. Map/Reduce: Embarrassingly Parallel Problems.

    MapInput = [ [x0, y0], ... , [xi, yi] ]

    MapResult = []
    MapInput.each do |x, y|
      a, b = map(x, y)
      MapResult << [a, b]
    end

    ReduceInput = MapResult.group_by { |mr| mr[0] }
    Final = ReduceInput.collect { |g, list| reduce(g, list) }

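Making that shape concrete: a single-process word count in plain Ruby, following the same map / group_by / reduce pattern (no Hadoop involved; here map may emit several pairs per input record, a common generalization):

    def map(_key, line)
      line.split.map { |word| [word.downcase, 1] }
    end

    def reduce(word, pairs)
      [word, pairs.map { |_w, n| n }.reduce(:+)]
    end

    map_input  = { 1 => "the quick brown fox", 2 => "the lazy dog" }
    map_result = map_input.flat_map { |key, line| map(key, line) }

    reduce_input = map_result.group_by { |word, _n| word }
    final = reduce_input.map { |word, pairs| reduce(word, pairs) }

    p final.sort
    # => [["brown", 1], ["dog", 1], ["fox", 1], ["lazy", 1], ["quick", 1], ["the", 2]]
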
29-31. [Diagram: Map/Reduce. You submit a job ("for that file we loaded") to the Job Tracker, which hands map and reduce tasks to the data/task nodes that hold the file's blocks.]

  32. Map/Red & JRUBY? “It's complicated, you know. Lots of ins

    and outs, lots of what have yous” -- The Dude Friday, September 30, 11
33-35. Map/Red & JRUBY? Job submission/running details: 1. Build the job jar file. 2. Submit the jar file to the Job Tracker. 3. The Job Tracker gives the jar to each Task Tracker. 4. Finding the Mapper and Reducer classes is a runtime lookup starting from the Java side. Two distinct problems for JRuby, then: packaging (the jar) and runtime (the class lookup).

36-38. Map/Red & JRUBY? radoop (https://github.com/banshee/radoop): use the ‘radoop’ command line instead of ‘hadoop’ (packaging); inherit map/reduce classes from Java shims (runtime). Unfortunately, last updated September 2008.

39-41. Map/Red & JRUBY? jruby-on-hadoop (https://github.com/fujibee/jruby-on-hadoop): use the ‘joh’ command line instead of ‘hadoop’ (packaging); define map/reduce methods, which end up being backed by Java shims (runtime). Unfortunately, last updated May 2010.

42-44. Map/Red & JRUBY? hadoop-papyrus (https://github.com/fujibee/hadoop-papyrus): use the ‘papyrus’ command line instead of ‘hadoop’ (packaging); use a DSL implemented on top of jruby-on-hadoop (runtime). Unfortunately, last updated May 2010.

45-47. Map/Red & JRUBY? What I would like: use the normal ‘hadoop’ command line (packaging), and inherit from the Java classes (runtime). Unfortunately this is not written... Yet.

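Purely as an illustration of that wished-for style, a mapper might then be written by subclassing the stock Java class directly, something like the hypothetical sketch below (the imports are the real Hadoop API; the packaging that would let a Task Tracker load this is exactly the part that does not exist):

    require 'java'

    java_import 'org.apache.hadoop.mapreduce.Mapper'
    java_import 'org.apache.hadoop.io.Text'
    java_import 'org.apache.hadoop.io.IntWritable'

    # Hypothetical word-count mapper, inheriting from the Java class.
    class WordCountMapper < Mapper
      ONE = IntWritable.new(1)

      def map(key, value, context)
        value.to_s.split.each do |word|
          context.write(Text.new(word), ONE)
        end
      end
    end
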
48-54. Avro. Rich data structure, think ‘document’. Compact, fast, binary data format. Container file structure. RPC / Protocol Buffers / Thrift-like ability. No code generation. Record structure defined via a JSON schema. Map/Reduce friendly. Compression. Language neutral.

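A small sketch of the "JSON schema, no code generation" point, assuming the plain-Ruby avro gem (which also runs under JRuby):

    require 'avro'

    # Record structure defined via a JSON schema -- no generated classes.
    schema = Avro::Schema.parse(<<-JSON)
      { "type": "record", "name": "Tweet",
        "fields": [ { "name": "id",   "type": "long"   },
                    { "name": "text", "type": "string" } ] }
    JSON

    # Write a container file...
    writer = Avro::DataFile::Writer.new(File.open('tweets.avro', 'wb'),
                                        Avro::IO::DatumWriter.new(schema), schema)
    writer << { 'id' => 1, 'text' => 'hello avro' }
    writer.close

    # ...and read it back; the schema travels inside the file.
    reader = Avro::DataFile::Reader.new(File.open('tweets.avro', 'rb'),
                                        Avro::IO::DatumReader.new)
    reader.each { |record| p record }  # {"id"=>1, "text"=>"hello avro"}
    reader.close
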
55. [Diagram: Avro container file structure. The file is a sequence of independently readable blocks (1-9).]

56. [Diagram: Avro blocks line up with HDFS blocks, so task 1 reads block 1 on node 1 while task 2 reads block 2 on node 2.]

57. Avro sizes, 5,500 records:
      raw JSON                     12 MB
      .tgz                        2.0 MB
      Avro file (no compression)  3.6 MB
      Avro file (Snappy)          1.9 MB

58-60. ZooKeeper. A highly available quorum of servers providing “coordination services”: distributed locks, group membership registration, sequences, watches.

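As a sketch of the group-membership case from JRuby, using the stock Java client (assuming the zookeeper jar is on the classpath; hosts and paths are made up, and the /workers parent node is assumed to exist):

    require 'java'

    java_import 'org.apache.zookeeper.ZooKeeper'
    java_import 'org.apache.zookeeper.Watcher'
    java_import 'org.apache.zookeeper.CreateMode'
    java_import 'org.apache.zookeeper.ZooDefs'

    # JRuby can implement the Watcher interface with a plain class.
    class NullWatcher
      include Watcher
      def process(event); end
    end

    zk = ZooKeeper.new('zk1:2181,zk2:2181,zk3:2181', 5000, NullWatcher.new)

    # Group membership: the ephemeral node vanishes when this session dies.
    zk.create('/workers/worker-1', 'me'.to_java_bytes,
              ZooDefs::Ids::OPEN_ACL_UNSAFE, CreateMode::EPHEMERAL)

    puts zk.get_children('/workers', false).to_a.inspect
    zk.close
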
61. HBASE. Builds upon most of what we have just seen: files are stored on HDFS, and processes coordinate via ZooKeeper.

  62. HBASE & JRUBY Best spot for JRuby in the Hadoop

    Ecosystem Excellent! Friday, September 30, 11
  63. HBASE & JRUBY Best spot for JRuby in the Hadoop

    Ecosystem Excellent! On the Top of the Heap Friday, September 30, 11
  64. HBASE & JRUBY Best spot for JRuby in the Hadoop

    Ecosystem HBase Shell IS irb Excellent! On the Top of the Heap Friday, September 30, 11
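Because the HBase shell really is JRuby's irb, plain Ruby works right alongside the shell commands. An illustrative session (table and column names made up):

    hbase(main):001:0> create 'tweets', 'content'
    hbase(main):002:0> put 'tweets', 'row-1', 'content:text', 'hello hbase'
    hbase(main):003:0> # arbitrary Ruby, because this is irb:
    hbase(main):004:0> (1..3).each { |i| put 'tweets', "row-#{i}", 'content:n', i.to_s }
    hbase(main):005:0> get 'tweets', 'row-2'
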
  65. HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro

    RPC interfaces Friday, September 30, 11
  66. HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro

    RPC interfaces https://github.com/bmuller/hbaserb Friday, September 30, 11
  67. HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro

    RPC interfaces https://github.com/bmuller/hbaserb https://github.com/copiousfreetime/ashbe Friday, September 30, 11