JRuby and Big Data

This is an overview talk discussing what people mean by the term "Big Data" and, when you have a Big Data problem, some of the tools from the Hadoop ecosystem that may help you out.

Jeremy Hinegardner

September 30, 2011

Transcript

1. Officially... “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” -- Wikipedia (Big_Data)

2. Officially... “...increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources).” -- Gartner, Inc.

3-9. Unofficially... Uh, that data processing job. The one that takes all day to run. Yeah, we need to do that over. It's wrong. And we'll need to run that every day, and it needs to be done by 8 a.m. On the previous day's data. “You need the number of WHAT by tomorrow?!? &#@!”

10-16. 2,500 bytes / tweet. 155,000,000 Tweets / day. 16,145,833,333 bytes / hour. 4,484,953 bytes / second. 4.2 Megabytes / second, or 33.8 Megabits / second, or the majority of an OC-1 SONET line. http://blog.gnip.com/handling-high-volume-realtime-big-social-data/

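A quick check of that arithmetic in Ruby (the deck's last two figures are rounded a little differently; an OC-1 carries 51.84 Mbit/s, so roughly 36 Mbit/s is indeed the majority of one):

    bytes_per_tweet = 2_500
    tweets_per_day  = 155_000_000

    bytes_per_day  = bytes_per_tweet * tweets_per_day  # 387,500,000,000
    bytes_per_hour = bytes_per_day / 24                # 16,145,833,333
    bytes_per_sec  = bytes_per_day / 86_400            # 4,484,953

    puts bytes_per_sec / (1024.0 * 1024.0)             # ~4.3 MB/s
    puts bytes_per_sec * 8 / 1_000_000.0               # ~35.9 Mbit/s
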
17-19. Sampling works. Given a population of 155,000,000 things, how many things do you NEED to look at to make an analysis about the population? With a 1% error tolerance at 99% confidence: 16,588. http://www.custominsight.com/articles/random-sample-calculator.asp

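As a sketch of where that 16,588 comes from, the standard sample-size formula for a proportion, with a finite-population correction (z = 2.576 for 99% confidence, worst-case p = 0.5), reproduces the calculator's answer:

    z = 2.576             # z-score for 99% confidence
    e = 0.01              # 1% margin of error
    p = 0.5               # worst-case population proportion
    population = 155_000_000

    n0 = (z**2 * p * (1 - p)) / e**2       # sample size, infinite population
    n  = n0 / (1 + (n0 - 1) / population)  # finite-population correction
    puts n.ceil                            # => 16588
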
20. I do need to process large volumes of data in a timely manner.

21-23. [Diagram: HDFS. A file is split into blocks (1-9); the blocks are spread across the data nodes and replicated until three copies of each block exist.]

  24. HDFS & JRUBY? Not Really Maybe if you want to

    write files directly to HDFS Friday, September 30, 11
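For that one case, a minimal sketch of writing straight to HDFS from JRuby through the stock Hadoop Java API (this assumes the Hadoop client jars are already on the classpath; the namenode address is made up):

    require 'java'

    java_import 'org.apache.hadoop.conf.Configuration'
    java_import 'org.apache.hadoop.fs.FileSystem'
    java_import 'org.apache.hadoop.fs.Path'

    conf = Configuration.new
    conf.set('fs.default.name', 'hdfs://namenode:8020')  # assumed address

    fs  = FileSystem.get(conf)
    out = fs.create(Path.new('/tmp/hello.txt'))
    out.write_bytes("hello from JRuby\n")  # DataOutputStream#writeBytes
    out.close
    fs.close
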
25-28. Map/Reduce: Embarrassingly Parallel Problems.

    MapInput = [ [x0, y0], ... , [xi, yi] ]

    MapResult = []
    MapInput.each do |x, y|
      a, b = map(x, y)
      MapResult << [a, b]
    end

    ReduceInput = MapResult.group_by { |mr| mr[0] }
    Final = ReduceInput.collect { |g, list| reduce(g, list) }

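Making that shape concrete: a single-process word count in plain Ruby, following the same map / group_by / reduce pattern (no Hadoop involved; here map may emit several pairs per input record, a common generalization):

    def map(_key, line)
      line.split.map { |word| [word.downcase, 1] }
    end

    def reduce(word, pairs)
      [word, pairs.map { |_w, n| n }.reduce(:+)]
    end

    map_input  = { 1 => "the quick brown fox", 2 => "the lazy dog" }
    map_result = map_input.flat_map { |key, line| map(key, line) }

    reduce_input = map_result.group_by { |word, _n| word }
    final = reduce_input.map { |word, pairs| reduce(word, pairs) }

    p final.sort
    # => [["brown", 1], ["dog", 1], ["fox", 1], ["lazy", 1], ["quick", 1], ["the", 2]]
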
29-31. [Diagram: Map/Reduce. You submit a job ("for that file we loaded") to the Job Tracker, which hands map and reduce tasks to the data/task nodes that hold the file's blocks.]

  32. Map/Red & JRUBY? “It's complicated, you know. Lots of ins

    and outs, lots of what have yous” -- The Dude Friday, September 30, 11
33-35. Map/Red & JRUBY? Job submission/running details: 1. Build the job jar file. 2. Submit the jar file to the Job Tracker. 3. The Job Tracker gives the jar to each Task Tracker. 4. Finding the Mapper and Reducer classes is a runtime lookup starting from the Java side. Two distinct problems for JRuby, then: packaging (the jar) and runtime (the class lookup).

36-38. Map/Red & JRUBY? radoop (https://github.com/banshee/radoop): use the ‘radoop’ command line instead of ‘hadoop’ (packaging); inherit map/reduce classes from Java shims (runtime). Unfortunately, last updated September 2008.

39-41. Map/Red & JRUBY? jruby-on-hadoop (https://github.com/fujibee/jruby-on-hadoop): use the ‘joh’ command line instead of ‘hadoop’ (packaging); define map/reduce methods, which end up being backed by Java shims (runtime). Unfortunately, last updated May 2010.

42-44. Map/Red & JRUBY? hadoop-papyrus (https://github.com/fujibee/hadoop-papyrus): use the ‘papyrus’ command line instead of ‘hadoop’ (packaging); use a DSL implemented on top of jruby-on-hadoop (runtime). Unfortunately, last updated May 2010.

45-47. Map/Red & JRUBY? What I would like: use the normal ‘hadoop’ command line (packaging), and inherit from the Java classes (runtime). Unfortunately this is not written... Yet.

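Purely as an illustration of that wished-for style, a mapper might then be written by subclassing the stock Java class directly, something like the hypothetical sketch below (the imports are the real Hadoop API; the packaging that would let a Task Tracker load this is exactly the part that does not exist):

    require 'java'

    java_import 'org.apache.hadoop.mapreduce.Mapper'
    java_import 'org.apache.hadoop.io.Text'
    java_import 'org.apache.hadoop.io.IntWritable'

    # Hypothetical word-count mapper, inheriting from the Java class.
    class WordCountMapper < Mapper
      ONE = IntWritable.new(1)

      def map(key, value, context)
        value.to_s.split.each do |word|
          context.write(Text.new(word), ONE)
        end
      end
    end
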
48-54. Avro. Rich data structure, think ‘document’. Compact, fast, binary data format. Container file structure. RPC / Protocol Buffers / Thrift-like ability. No code generation. Record structure defined via a JSON schema. Map/Reduce friendly. Compression. Language neutral.

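A small sketch of the "JSON schema, no code generation" point, assuming the plain-Ruby avro gem (which also runs under JRuby):

    require 'avro'

    # Record structure defined via a JSON schema -- no generated classes.
    schema = Avro::Schema.parse(<<-JSON)
      { "type": "record", "name": "Tweet",
        "fields": [ { "name": "id",   "type": "long"   },
                    { "name": "text", "type": "string" } ] }
    JSON

    # Write a container file...
    writer = Avro::DataFile::Writer.new(File.open('tweets.avro', 'wb'),
                                        Avro::IO::DatumWriter.new(schema), schema)
    writer << { 'id' => 1, 'text' => 'hello avro' }
    writer.close

    # ...and read it back; the schema travels inside the file.
    reader = Avro::DataFile::Reader.new(File.open('tweets.avro', 'rb'),
                                        Avro::IO::DatumReader.new)
    reader.each { |record| p record }  # {"id"=>1, "text"=>"hello avro"}
    reader.close
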
55. [Diagram: Avro container file structure. The file is a sequence of independently readable blocks (1-9).]

56. [Diagram: Avro blocks line up with HDFS blocks, so task 1 reads block 1 on node 1 while task 2 reads block 2 on node 2.]

57. Avro sizes, 5,500 records:
      raw JSON                     12 MB
      .tgz                        2.0 MB
      Avro file (no compression)  3.6 MB
      Avro file (Snappy)          1.9 MB

58-60. ZooKeeper. A highly available quorum of servers providing “coordination services”: distributed locks, group membership registration, sequences, watches.

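As a sketch of the group-membership case from JRuby, using the stock Java client (assuming the zookeeper jar is on the classpath; hosts and paths are made up, and the /workers parent node is assumed to exist):

    require 'java'

    java_import 'org.apache.zookeeper.ZooKeeper'
    java_import 'org.apache.zookeeper.Watcher'
    java_import 'org.apache.zookeeper.CreateMode'
    java_import 'org.apache.zookeeper.ZooDefs'

    # JRuby can implement the Watcher interface with a plain class.
    class NullWatcher
      include Watcher
      def process(event); end
    end

    zk = ZooKeeper.new('zk1:2181,zk2:2181,zk3:2181', 5000, NullWatcher.new)

    # Group membership: the ephemeral node vanishes when this session dies.
    zk.create('/workers/worker-1', 'me'.to_java_bytes,
              ZooDefs::Ids::OPEN_ACL_UNSAFE, CreateMode::EPHEMERAL)

    puts zk.get_children('/workers', false).to_a.inspect
    zk.close
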
61. HBASE. Builds upon most of what we have just seen: files are stored on HDFS, and processes coordinate via ZooKeeper.

  62. HBASE & JRUBY Best spot for JRuby in the Hadoop

    Ecosystem Excellent! Friday, September 30, 11
  63. HBASE & JRUBY Best spot for JRuby in the Hadoop

    Ecosystem Excellent! On the Top of the Heap Friday, September 30, 11
  64. HBASE & JRUBY Best spot for JRuby in the Hadoop

    Ecosystem HBase Shell IS irb Excellent! On the Top of the Heap Friday, September 30, 11
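Because the HBase shell really is JRuby's irb, plain Ruby works right alongside the shell commands. An illustrative session (table and column names made up):

    hbase(main):001:0> create 'tweets', 'content'
    hbase(main):002:0> put 'tweets', 'row-1', 'content:text', 'hello hbase'
    hbase(main):003:0> # arbitrary Ruby, because this is irb:
    hbase(main):004:0> (1..3).each { |i| put 'tweets', "row-#{i}", 'content:n', i.to_s }
    hbase(main):005:0> get 'tweets', 'row-2'
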
  65. HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro

    RPC interfaces Friday, September 30, 11
  66. HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro

    RPC interfaces https://github.com/bmuller/hbaserb Friday, September 30, 11
  67. HBASE & JRUBY HBase has Thrift, Protocol Buffers and Avro

    RPC interfaces https://github.com/bmuller/hbaserb https://github.com/copiousfreetime/ashbe Friday, September 30, 11