Slide 1

Slide 1 text

“Big Data” and JRuby Jeremy Hinegardner 2011-09-29 RubyConf 2011 Friday, September 30, 11

Slide 2

Slide 2 text

What is Big Data?

Slide 3

Slide 3 text

Officially... “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” -- Wikipedia (Big_Data)

Slide 4

Slide 4 text

Officially... “...increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources).” -- Gartner, Inc.

Slide 5

Slide 5 text

Unofficially...

Slide 6

Slide 6 text

“You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 8

Slide 8 text

Uh, that data processing job. “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 9

Slide 9 text

Uh, that data processing job. The one that takes all day to run. “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 10

Slide 10 text

Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 11

Slide 11 text

Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. It’s wrong. “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 12

Slide 12 text

Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. It’s wrong. And we’ll need to run that every day, “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 13

Slide 13 text

Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. It’s wrong. And we’ll need to run that every day, and it needs to be done by 8 a.m. “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 14

Slide 14 text

Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. It’s wrong. And we’ll need to run that every day, and it needs to be done by 8 a.m. On the previous day’s data. “You need the number of WHAT by tomorrow?!? &#@!” Unofficially...

Slide 16

Slide 16 text

That’s a Lot of Data...

Slide 18

Slide 18 text

...headed this way FAST...

Slide 20

Slide 20 text

2,500 bytes / tweet http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 21

Slide 21 text

2,500 bytes / tweet 155,000,000 Tweets / day http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 22

Slide 22 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 23

Slide 23 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour 4,484,953 bytes / second http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 24

Slide 24 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour 4,484,953 bytes / second 4.2 Megabytes / second http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 25

Slide 25 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour 4,484,953 bytes / second 4.2 Megabytes / second or http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 26

Slide 26 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour 4,484,953 bytes / second 4.2 Megabytes / second or 33.8 Megabits / second http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 27

Slide 27 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour 4,484,953 bytes / second 4.2 Megabytes / second or 33.8 Megabits / second or http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

Slide 28

Slide 28 text

2,500 bytes / tweet 155,000,000 Tweets / day 16,145,833,333 bytes / hour 4,484,953 bytes / second 4.2 Megabytes / second or 33.8 Megabits / second or Majority of an OC-1 SONET line http://blog.gnip.com/handling-high-volume-realtime-big-social-data/
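The throughput figures on these slides are easy to check; a quick sketch in Ruby (integer division, so the per-hour and per-second figures truncate the same way the slides do):

```ruby
bytes_per_tweet = 2_500
tweets_per_day  = 155_000_000

bytes_per_day  = bytes_per_tweet * tweets_per_day  # 387,500,000,000
bytes_per_hour = bytes_per_day / 24                # 16,145,833,333
bytes_per_sec  = bytes_per_day / 86_400            # 4,484,953

mb_per_sec   = bytes_per_sec / (1024.0 * 1024.0)   # ~4.3 MB/s
mbit_per_sec = bytes_per_sec * 8 / 1_000_000.0     # ~35.9 Mbit/s
```

Depending on whether you use 1000- or 1024-based units, the last two figures land a little above or below the slides’ 4.2 MB/s and 33.8 Mbit/s, but the conclusion holds either way: a sizable fraction of an OC-1’s 51.84 Mbit/s.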

Slide 29

Slide 29 text

...AND I NEED TO DO WHAT WITH IT?

Slide 30

Slide 30 text

...And KEEP it available for how long?

Slide 31

Slide 31 text

That makes me feel Uncomfortable

Slide 33

Slide 33 text

How fast and how often can you boil the ocean?

Slide 34

Slide 34 text

Basic Instructions - Scott Meyer

Slide 35

Slide 35 text

You do not need “Big Data” to make good decisions.

Slide 36

Slide 36 text

WHY do you think you need “Big Data”?

Slide 37

Slide 37 text

http://evilmartini.com/post/7946263965/prob-w-bigdata

Slide 38

Slide 38 text

Sampling Works Given a Population of 155,000,000 Things http://www.custominsight.com/articles/random-sample-calculator.asp

Slide 39

Slide 39 text

Sampling Works Given a Population of 155,000,000 Things How many Things do you NEED to look at to make an analysis about the Population? http://www.custominsight.com/articles/random-sample-calculator.asp

Slide 40

Slide 40 text

Sampling Works Given a Population of 155,000,000 Things How many Things do you NEED to look at to make an analysis about the Population? 1% error tolerance 99% confidence http://www.custominsight.com/articles/random-sample-calculator.asp

Slide 41

Slide 41 text

Sampling Works Given a Population of 155,000,000 Things How many Things do you NEED to look at to make an analysis about the Population? 1% error tolerance 99% confidence 16,588 http://www.custominsight.com/articles/random-sample-calculator.asp
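That 16,588 comes from the standard sample-size formula for a proportion (worst case p = 0.5) with a finite-population correction; a sketch in Ruby (the method name is mine, not from the linked calculator):

```ruby
# n0 = z^2 * p(1-p) / e^2, then corrected for a finite population.
# Assumes worst-case variance, p = 0.5.
def sample_size(population, z, margin)
  n0 = (z**2 * 0.25) / margin**2             # infinite-population size
  (n0 / (1 + (n0 - 1) / population)).ceil    # finite-population correction
end

sample_size(155_000_000, 2.576, 0.01)  # z = 2.576 for 99% confidence => 16588
```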

Slide 42

Slide 42 text

Understand your Problem Domain

Slide 43

Slide 43 text

I do need to process large volumes of data in a timely manner

Slide 44

Slide 44 text

Big Happy Hadoop Family

Slide 45

Slide 45 text

Where to Store lots of data

Slide 46

Slide 46 text

HDFS [Diagram: a file split into blocks 1-9 across the Data Nodes, with block 1 stored as three replicas]

Slide 48

Slide 48 text

HDFS [Diagram: blocks 1-9 each stored as three replicas spread across the Data Nodes]

Slide 49

Slide 49 text

HDFS & JRUBY? Not really. Maybe, if you want to write files directly to HDFS.

Slide 50

Slide 50 text

How to process Lots of data

Slide 51

Slide 51 text

Map/Reduce

MapInput = [[x0, y0], ..., [xi, yi]]
MapInput.each do |x, y|
  a, b = map(x, y)
  MapResult << [a, b]
end
ReduceInput = MapResult.group_by { |mr| mr[0] }
Final = ReduceInput.collect { |g, list| reduce(g, list) }

Slide 52

Slide 52 text

Map/Reduce: Embarrassingly Parallel Problems

MapInput = [[x0, y0], ..., [xi, yi]]
MapInput.each do |x, y|
  a, b = map(x, y)
  MapResult << [a, b]
end
ReduceInput = MapResult.group_by { |mr| mr[0] }
Final = ReduceInput.collect { |g, list| reduce(g, list) }
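Filling that sketch in with a concrete map and reduce gives the canonical word-count example. Everything below (function names, input data) is illustrative, and flat_map replaces each because one map call can emit several pairs, but it runs as plain Ruby:

```ruby
# map emits one [word, 1] pair per word; reduce sums the counts per word.
def map_fn(_key, line)
  line.split.map { |word| [word, 1] }
end

def reduce_fn(word, pairs)
  [word, pairs.map { |_word, count| count }.sum]
end

map_input  = [[0, "big data and jruby"], [1, "jruby and big data"]]
map_result = map_input.flat_map { |key, value| map_fn(key, value) }

reduce_input = map_result.group_by { |word, _count| word }
final = reduce_input.map { |word, pairs| reduce_fn(word, pairs) }
# final => [["big", 2], ["data", 2], ["and", 2], ["jruby", 2]]
```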

Slide 55

Slide 55 text

Map/Reduce [Diagram: You, the Job Tracker, and the Data/Task Nodes holding the replicated blocks 1-9]

Slide 56

Slide 56 text

Map/Reduce [Diagram: You, the Job Tracker, and the Data/Task Nodes] You: “Submit Job for that file we loaded”

Slide 58

Slide 58 text

Map/Red & JRUBY?

Slide 59

Slide 59 text

Map/Red & JRUBY? “It's complicated, you know. Lots of ins and outs, lots of what have yous” -- The Dude

Slide 60

Slide 60 text

Map/Red & JRUBY? Job submission/running details:
1. Build the job jar file
2. Submit the jar file to the Job Tracker
3. The Job Tracker gives the jar to each Task Tracker
4. Finding the Mapper and Reducer classes is a runtime lookup starting from the Java side

Slide 61

Slide 61 text

Map/Red & JRUBY? Job submission/running details:
1. Build the job jar file
2. Submit the jar file to the Job Tracker
3. The Job Tracker gives the jar to each Task Tracker
4. Finding the Mapper and Reducer classes is a runtime lookup starting from the Java side
(packaging)

Slide 62

Slide 62 text

Map/Red & JRUBY? Job submission/running details:
1. Build the job jar file
2. Submit the jar file to the Job Tracker
3. The Job Tracker gives the jar to each Task Tracker
4. Finding the Mapper and Reducer classes is a runtime lookup starting from the Java side
(packaging, runtime)

Slide 63

Slide 63 text

Map/Red & JRUBY? https://github.com/banshee/radoop Use the ‘radoop’ command line instead of ‘hadoop’; inherit map/reduce classes from Java shims

Slide 64

Slide 64 text

Map/Red & JRUBY? https://github.com/banshee/radoop Use the ‘radoop’ command line instead of ‘hadoop’; inherit map/reduce classes from Java shims (packaging, runtime)

Slide 65

Slide 65 text

Map/Red & JRUBY? Unfortunately last updated Sep 2008. https://github.com/banshee/radoop Use the ‘radoop’ command line instead of ‘hadoop’; inherit map/reduce classes from Java shims (packaging, runtime)

Slide 66

Slide 66 text

Map/Red & JRUBY? https://github.com/fujibee/jruby-on-hadoop Use the ‘joh’ command line instead of ‘hadoop’; define map/reduce methods, which end up being supported by Java shims

Slide 67

Slide 67 text

Map/Red & JRUBY? https://github.com/fujibee/jruby-on-hadoop Use the ‘joh’ command line instead of ‘hadoop’; define map/reduce methods, which end up being supported by Java shims (packaging, runtime)

Slide 68

Slide 68 text

Map/Red & JRUBY? https://github.com/fujibee/jruby-on-hadoop Unfortunately last updated May 2010. Use the ‘joh’ command line instead of ‘hadoop’; define map/reduce methods, which end up being supported by Java shims (packaging, runtime)

Slide 69

Slide 69 text

Map/Red & JRUBY? https://github.com/fujibee/hadoop-papyrus Use the ‘papyrus’ command line instead of ‘hadoop’; use a DSL, which is implemented on top of jruby-on-hadoop

Slide 70

Slide 70 text

Map/Red & JRUBY? https://github.com/fujibee/hadoop-papyrus Use the ‘papyrus’ command line instead of ‘hadoop’; use a DSL, which is implemented on top of jruby-on-hadoop (packaging, runtime)

Slide 71

Slide 71 text

Map/Red & JRUBY? https://github.com/fujibee/hadoop-papyrus Unfortunately last updated May 2010. Use the ‘papyrus’ command line instead of ‘hadoop’; use a DSL, which is implemented on top of jruby-on-hadoop (packaging, runtime)

Slide 72

Slide 72 text

Map/Red & JRUBY? Other Approaches? https://github.com/etsy/cascading.jruby https://github.com/mrflip/wukong

Slide 73

Slide 73 text

Map/Red & JRUBY? What I would like: use the normal ‘hadoop’ command line; inherit from the Java classes

Slide 74

Slide 74 text

Map/Red & JRUBY? What I would like: use the normal ‘hadoop’ command line; inherit from the Java classes (packaging, runtime)

Slide 75

Slide 75 text

Map/Red & JRUBY? What I would like: use the normal ‘hadoop’ command line; inherit from the Java classes (packaging, runtime). Unfortunately this is not written... yet.

Slide 76

Slide 76 text

How to store lots of data

Slide 77

Slide 77 text

Avro

Slide 78

Slide 78 text

Avro Rich Data Structure, Think ‘Document’

Slide 79

Slide 79 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format

Slide 80

Slide 80 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format RPC/Protocol Buffer/Thrift-like ability

Slide 81

Slide 81 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format Container File Structure RPC/Protocol Buffer/Thrift-like ability

Slide 82

Slide 82 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format Container File Structure RPC/Protocol Buffer/Thrift-like ability No Code Generation

Slide 83

Slide 83 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format Container File Structure RPC/Protocol Buffer/Thrift-like ability No Code Generation Record Structure Defined via JSON Schema

Slide 84

Slide 84 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format Container File Structure RPC/Protocol Buffer/Thrift-like ability No Code Generation Record Structure Defined via JSON Schema Map/Reduce Friendly

Slide 85

Slide 85 text

Avro Rich Data Structure, Think ‘Document’ Compact, Fast, Binary Data Format Container File Structure RPC/Protocol Buffer/Thrift-like ability No Code Generation Record Structure Defined via JSON Schema Map/Reduce Friendly Compression

Slide 86

Slide 86 text

Avro
Rich Data Structure, Think ‘Document’
Compact, Fast, Binary Data Format
Container File Structure
RPC/Protocol Buffer/Thrift-like ability
No Code Generation
Record Structure Defined via JSON Schema
Map/Reduce Friendly
Compression
Language Neutral
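“Record structure defined via JSON Schema” looks like this in practice; the Tweet record and its fields here are made up for illustration:

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    { "name": "id",         "type": "long"   },
    { "name": "created_at", "type": "string" },
    { "name": "text",       "type": "string" }
  ]
}
```

Any Avro implementation can read data written against this schema without generated code, which is what makes it language neutral.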

Slide 87

Slide 87 text

Avro [Diagram: a container file as a sequence of blocks 1-9]

Slide 88

Slide 88 text

Avro [Diagram: an Avro container file split between Node 1 and Node 2]

Slide 90

Slide 90 text

Avro [Diagram: an Avro container file split between Node 1 and Node 2, with Task 1 and Task 2 each processing the split local to its node]

Slide 91

Slide 91 text

Avro 5,500 records: 12 MB raw JSON; 2.0 MB .tgz

Slide 92

Slide 92 text

Avro 5,500 records: 12 MB raw JSON; 2.0 MB .tgz; 3.6 MB Avro file (no compression); 1.9 MB (snappy compression)

Slide 93

Slide 93 text

Avro & JRUBY Happiness! via Java via Ruby

Slide 94

Slide 94 text

How to Coordinate Around lots of data

Slide 95

Slide 95 text

Zookeeper

Slide 96

Slide 96 text

Zookeeper Highly Available Quorum of Servers

Slide 97

Slide 97 text

Zookeeper Highly Available Quorum of Servers providing

Slide 98

Slide 98 text

Zookeeper Highly Available Quorum of Servers providing “coordination services”

Slide 99

Slide 99 text

Zookeeper Highly Available Quorum of Servers providing “coordination services” group membership registration

Slide 100

Slide 100 text

Zookeeper Highly Available Quorum of Servers providing “coordination services” group membership registration distributed locks

Slide 101

Slide 101 text

Zookeeper Highly Available Quorum of Servers providing “coordination services” group membership registration distributed locks sequences

Slide 102

Slide 102 text

Zookeeper
Highly Available Quorum of Servers providing “coordination services”:
group membership registration
distributed locks
sequences
watches

Slide 103

Slide 103 text

Zookeeper a high-availability “filesystem”
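That “filesystem” is a tree of znodes; a typical layout for the coordination services above might look like this (all paths are illustrative, not from the talk):

```
/app
    /app/workers                      group membership: one ephemeral znode
    /app/workers/worker-0000000001    per live worker, auto-removed on death
    /app/locks
    /app/locks/job-lock               distributed lock built from ephemeral
                                      sequential znodes; lowest sequence wins
    /app/config                       small shared config blobs; clients set
                                      watches to hear about changes
```

Ephemeral and sequential znodes plus watches are the primitives; the lock, membership, and leader-election recipes are built out of them.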

Slide 104

Slide 104 text

Zookeeper & JRUBY Outlook good! zookeeper wire protocol https://github.com/twitter/zookeeper

Slide 105

Slide 105 text

Low Latency Access To lots of data

Slide 106

Slide 106 text

HBASE Implementation of Google Big-Table (billions of rows, millions of columns)

Slide 107

Slide 107 text

HBASE
Builds upon most of what we have just seen:
HDFS friendly: files stored on HDFS
Processes coordinate via Zookeeper

Slide 108

Slide 108 text

HBASE & JRUBY Excellent!

Slide 109

Slide 109 text

HBASE & JRUBY Best spot for JRuby in the Hadoop Ecosystem Excellent!

Slide 110

Slide 110 text

HBASE & JRUBY Best spot for JRuby in the Hadoop Ecosystem Excellent! On the Top of the Heap

Slide 111

Slide 111 text

HBASE & JRUBY Best spot for JRuby in the Hadoop Ecosystem HBase Shell IS irb Excellent! On the Top of the Heap
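Because the HBase shell IS irb, a session is just (J)Ruby. A minimal sketch of standard shell commands (the table and column-family names are invented; this needs a running HBase, so nothing here is runnable standalone):

```ruby
# Inside `hbase shell` -- a JRuby irb session:
create 'tweets', 'content'                      # table with one column family
put 'tweets', 'row1', 'content:text', 'hello'   # write one cell
get 'tweets', 'row1'                            # read the row back
scan 'tweets', { LIMIT => 10 }                  # scan the first 10 rows
```

Since it is irb, you can mix in arbitrary Ruby: loops, helper methods, string munging, anything, right alongside the HBase commands.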

Slide 112

Slide 112 text

HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro RPC interfaces

Slide 113

Slide 113 text

HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro RPC interfaces https://github.com/bmuller/hbaserb

Slide 114

Slide 114 text

HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro RPC interfaces https://github.com/bmuller/hbaserb https://github.com/copiousfreetime/ashbe

Slide 115

Slide 115 text

“Big Data” and JRuby Jeremy Hinegardner 2011-09-29 RubyConf 2011