
Build your first MapReduce with Hadoop and Ruby

Supplementary code I used in the "Build your first MapReduce with Hadoop and Ruby" talk at the BRUG March meet-up.

The code for the live demo resides here: https://github.com/swanandp/MapReduce_with_hadoop

Swanand Pagnis

March 16, 2013

Transcript

  1. Tweet @_swanand  GitHub @swanandp  StackOverflow @18678  Work @Kaverisoft  Make { DispatchTrack }  mailto:swanand@pagnis.in

     Who am I? Ruby, Coffeescript, Java, Rails, Sinatra, Android, TextMate, Emacs, Minitest, MySQL, Cassandra, Hadoop, Mountain Lion, Curl, Zsh, GMail, Solarized, Oscar Wilde, Robert Jordan, Quentin Tarantino, Charlize Theron
  2. Tell 'em what you're going to tell 'em:
     • MapReduce! Wait, what?
     • Enter the Hadoop. *gong*
     • Convention over Configuration? You wish.
     • Instant Gratification. Now you're talkin'
     • Further Reading. Go forth and read!
  3. MapReduce! Wait, what?
     • Map: Given a set of values (or key-values), output another set of values (or key-values)
     • [K1, V1] -> map -> [K2, V2]
     • Map each value into a new value
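     In plain Ruby (nothing Hadoop-specific yet), a map turns each value into a new value or key-value pair; the sample data here is made up for illustration:

       words = ["hadoop", "ruby", "mapreduce"]
       pairs = words.map { |word| [word, word.length] }
       # => [["hadoop", 6], ["ruby", 4], ["mapreduce", 9]]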
  4. MapReduce! Wait, what?
     • Reduce: Given a set of values for a key, come up with a summarized version
     • K1[V1, V2 ... Vn] -> reduce -> K1[Y]
     • Reduce given values into 1 value
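     Again in plain Ruby, a reduce collapses all the values seen for one key into a single summary value; the key "e" and its counts are invented sample data:

       key, values = "e", [1, 4, 2, 7]
       summary = [key, values.reduce(0) { |sum, count| sum + count }]
       # => ["e", 14]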
  5. MapReduce! Um.. hmm..
     Q: What is the single biggest takeaway from mapping?
     A: The map operation is stateless, i.e. one iteration doesn't depend on the previous one.
     Q: What is the single biggest takeaway from reducing?
     A: Reduce represents an operation for a particular key.
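     Statelessness is exactly what makes map trivially parallel: mapping each chunk of the input independently gives the same result as mapping the whole thing. A toy check in Ruby:

       data   = ("a".."j").to_a
       double = ->(x) { x * 2 }
       whole  = data.map(&double)
       chunks = data.each_slice(3).map { |chunk| chunk.map(&double) }.flatten(1)
       whole == chunks # => true, so each chunk could run on a different machine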
  6. Enter the Hadoop. *gong*
     "The really interesting thing I want you to notice, here, is that as soon as you think of map and reduce as functions that everybody can use, and they use them, you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant." - Joel Spolsky
  7. MapReduce! Oh, yeah!
     1. Convert raw data into a readable format
     2. Iterate over data chunks, converting each chunk into meaningful key-value pairs
     3. Do this for all your data using massive parallelization
     4. Group all the keys and their respective values
     5. Take the values for a key and convert them into the desired meaningful format
     6. Step 2 is called the mapper
     7. Step 5 is called the reducer
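     A minimal sketch of steps 2-5 in plain Ruby, simulating in memory what Hadoop does in a distributed fashion (the input string is invented sample data):

       raw    = "it was the best of times it was the worst of times"
       pairs  = raw.split.map { |word| [word, 1] }          # step 2: mapper
       groups = pairs.group_by { |key, _| key }             # step 4: group by key
       counts = groups.map do |key, kvs|                    # step 5: reducer
         [key, kvs.map { |_, v| v }.reduce(:+)]
       end
       counts # => [["it", 2], ["was", 2], ["the", 2], ["best", 1], ...]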
  8. Enter the Hadoop. *gong*
     The same process now becomes:
     1. Put data into Hadoop
     2. Define your mapper
     3. Define your reducer
     4. Run your jobs
     5. Read the processed data from Hadoop
     Other advantages:
     • Encapsulation of common problems like large files, process management, and disk / node failure
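     Driven from a Ruby script, that workflow looks roughly like this; the streaming jar name and the HDFS paths are assumptions that vary by installation:

       # 1. put data into HDFS
       system "hadoop fs -put books/ /user/demo/input"
       # 2-4. run the job with your mapper and reducer (Streaming API, see slide 11)
       system "hadoop jar hadoop-streaming.jar " \
              "-input /user/demo/input -output /user/demo/output " \
              "-mapper mapper.rb -reducer reducer.rb " \
              "-file mapper.rb -file reducer.rb"
       # 5. read the processed data back out
       system "hadoop fs -cat /user/demo/output/part-*"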
  9. Convention over Configuration? You wish.
     • Job: the top-level descriptor; a job has_many tasks
     • Task: the units of work a job is split into
     • NameNode: the HDFS boss, configured in core-site.xml
     • DataNode: an HDFS slave, listed in the slaves file
     • JobTracker: the MapReduce boss, configured in mapred-site.xml
     • TaskTracker: a MapReduce slave, configured in mapred-site.xml
     • Client: the user's window into Hadoop, through the command hadoop
  10. Convention over Configuration? You wish.
      • Configuration in XML & shell scripts. Yuck!
      • Respite:
        ◦ Option for specifying a configuration directory
        ◦ Shell script configuration is mostly ENV variables
      • Which means:
        ◦ Configuration can be written in YML or JSON or Ruby and exported to XML (see the sketch after this list)
        ◦ ENV variables can be set using rake, thor or just plain Ruby
      • Caveats:
        ◦ No standard wrapper to do this (Go write one!)
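      There is no standard wrapper, but the export idea could look like this minimal sketch; the YAML file name and keys are hypothetical, while the <configuration>/<property> layout is Hadoop's actual XML format:

        require "yaml"

        # hypothetical config/core-site.yml:
        #   fs.default.name: hdfs://localhost:9000
        #   hadoop.tmp.dir: /tmp/hadoop
        settings = YAML.load_file("config/core-site.yml")

        properties = settings.map do |name, value|
          "  <property>\n    <name>#{name}</name>\n    <value>#{value}</value>\n  </property>"
        end

        File.write("conf/core-site.xml",
                   "<configuration>\n#{properties.join("\n")}\n</configuration>")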
  11. Convention over Configuration? You wish.
      • Default mappers and reducers are defined in Java
      • Other languages are supported through the Streaming API
      • The Streaming API uses STDIN and STDOUT to read and write data, and executable binaries for processing
      • Caveats:
        ◦ No dependency management; we are on our own
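      Concretely, a streaming mapper is any executable that reads raw lines on STDIN and emits tab-separated key-value lines on STDOUT. A word-count mapper as a sketch of the protocol:

        #!/usr/bin/env ruby
        # mapper.rb: one "key\tvalue" line per word seen on STDIN
        STDIN.each_line do |line|
          line.split.each { |word| puts "#{word}\t1" }
        end

      Hadoop sorts the mapper's output by key and feeds it, in the same tab-separated format, to the reducer's STDIN.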
  12. Instant Gratification. Now you're talkin'
      GOAL:
      1. Take a couple of books in txt format
      2. Find the total usage of each character in the English alphabet
      3. Establish that e is the most used
      4. Why this example?
         a. Perfect use case for MapReduce
         b. The algorithm is simple
         c. The results are simple to analyze
         d. Txt-formatted books are easily available from Project Gutenberg
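      The code from the live demo is in the repository linked above; a minimal sketch of what the streaming mapper and reducer for this job could look like (file names are illustrative):

        #!/usr/bin/env ruby
        # letter_mapper.rb: emit "letter\t1" for every alphabetic character
        STDIN.each_line do |line|
          line.downcase.scan(/[a-z]/).each { |letter| puts "#{letter}\t1" }
        end

        #!/usr/bin/env ruby
        # letter_reducer.rb: input arrives sorted by key, so keep a running
        # total per letter and flush it whenever the key changes
        current, total = nil, 0
        STDIN.each_line do |line|
          letter, count = line.chomp.split("\t")
          if letter == current
            total += count.to_i
          else
            puts "#{current}\t#{total}" if current
            current, total = letter, count.to_i
          end
        end
        puts "#{current}\t#{total}" if current

      If the slide's claim holds, the e line should carry the largest count in the output.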
  13. Further Reading and Watching
      • Official Documentation
      • Wiki: http://wiki.apache.org/hadoop/
      • Hadoop examples that ship with Hadoop
      • http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop
      • http://www.youtube.com/watch?v=d2xeNpfzsYI