Cascading Through Hadoop for Boulder JUG

An exploration of Cascading, a more fluent API on top of Hadoop.

Matthew McCullough

October 13, 2011

Transcript

  1. Cascading through Hadoop: simpler MapReduce through data flows, by Matthew McCullough, Ambient Ideas, LLC
  2. Matthew McCullough

  3. ✓ Using Hadoop?
     ✓ Work with Big Data?
     ✓ Familiar with MapReduce?
  4. None
  5. http://delicious.com/matthew.mccullough/cascading

  6. http://delicious.com/matthew.mccullough/hadoop

  7. http://github.com/matthewmccullough/cascading-course

  8. None
  9. MapReduce

  10. a quick review...

  11. classical Map & Reduce

  12. now MapReduce®

  13. Raw Data → Split → Map → Shuffle → Reduce → Processed Data
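
      In word-count terms: map emits a (word, 1) pair for each token in its split, the shuffle groups those pairs by word, and reduce sums each group into a (word, total) pair. That is exactly the shape of the Hadoop code on the following slides.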

  14. Hadoop Java API implementation...

  15. Raw Data → Split → Map → Shuffle → Reduce → Processed Data

  16. // The WordCount Mapper
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
  17. Raw Data → Split → Map → Shuffle → Reduce → Processed Data

  18. // The WordCount Reducer
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
  19. but wait...

  20. // The WordCount main()
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  21. and how about multiple files?

  22. package org.apache.hadoop.examples;

      import java.io.BufferedReader;
      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.WritableComparable;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.MultiFileInputFormat;
      import org.apache.hadoop.mapred.MultiFileSplit;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.RecordReader;
  23. // set the InputFormat of the job to our InputFormat
      job.setInputFormat(MyInputFormat.class);
      // the keys are words (strings)
      job.setOutputKeyClass(Text.class);
      // the values are counts (ints)
      job.setOutputValueClass(IntWritable.class);
      // use the defined mapper
      job.setMapperClass(MapClass.class);
      // use the WordCount Reducer
      job.setCombinerClass(LongSumReducer.class);
      job.setReducerClass(LongSumReducer.class);
      FileInputFormat.addInputPaths(job, args[0]);
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      JobClient.runJob(job);
      return 0;
    }

    public static void main(String[] args) throws Exception {
      int ret = ToolRunner.run(new MultiFileWordCount(), args);
      System.exit(ret);
    }
    }
  24. None
  25. // The WordCount main() public static void main(String[] arg

  26. None
  27. Coding a Java Flow

  28. public class SimplestPipe1Flip {
        public static void main(String[] args) {
          String inputPath = "data/babynamedefinitions.csv";
          String outputPath = "output/simplestpipe1";

          // Source tap: read comma-delimited (name, definition) records
          Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
          Tap source = new Hfs( sourceScheme, inputPath );

          // Sink tap: write the fields flipped, delimited by " ++ "
          Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ++ " );
          Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

          Pipe assembly = new Pipe( "flip" );

          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
          FlowConnector flowConnector = new FlowConnector( properties );
          Flow flow = flowConnector.connect( "flipflow", source, sink, assembly );
          flow.complete();
        }
      }
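
      Given an illustrative input record such as "Ada,noble" (name, then definition), this flow writes "noble ++ Ada": the sink scheme names the same two fields in the opposite order, so the pipe itself needs no explicit operations.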
  29. None
  30. The Author

  31. Ignoring that Hadoop is as much about analytics as it is about integration leads to a fair number of compromises, including, but not exclusive to, a loss in quality of life (in trade for a false sense of accomplishment). -Chris Wensel, Cascading Inventor
  32. http://cascading.org

  33. http://concurrentinc.com

  34. citizen of the big data domain

  35. proper level of abstraction for Hadoop

  36. None
  37. Hadoop: 2011

  38. who's using Hadoop?

  39. -Meetup.com -AOL -Bing -Facebook -Netflix -Yahoo -Twitter

  40. Hadoop is as much about analytics as it is about integration. Ignoring that leads to crazy complex tool chains that typically involve XML. -Chris Wensel, Cascading Inventor
  41. None
  42. ✓Two humans ✓IBM cluster ✓Hadoop ✓Java Go!

  43. None
  44. None
  45. Hadoop DSLs

  46. Pig approximates ETL

  47. -- Pig Script
      Person = LOAD 'people.csv' using PigStorage(',');
      Names = FOREACH Person GENERATE $2 AS name;
      OrderedNames = ORDER Names BY name ASC;
      GroupedNames = GROUP OrderedNames BY name;
      NameCount = FOREACH GroupedNames GENERATE group, COUNT(OrderedNames);
      store NameCount into 'names.out';
  48. Hive approximates SQL

  49. -- Hive Script
      LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;
      SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;
  50. Cascading Groovy approximates MapReduce

  51. // Cascading Groovy Script
      def cascading = new Cascading()
      def builder = cascading.builder();
      Flow flow = builder.flow("wordcount") {
        source(input, scheme: text())
        tokenize(/[.,]*\s+/)
        group()
        count()
        group(["count"], reverse: true)
        sink(output, delete: true)
      }
  52. Cascalog approximates Datalog

  53. ;; Cascalog Script
      (?<- (stdout) [?person] (age ?person 25))
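
      Read as a query: emit to stdout every ?person for whom the age generator holds a tuple pairing that person with the value 25.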

  54. None
  55. Here's another faux DSL for you!

  56. Don't worry Martin. Cascading isn't a DSL. Really.

  57. None
  58. The Metaphor

  59. Divide & Conquer

  60. with a different metaphor

  61. Water

  62. Pipes

  63. Taps

  64. Source

  65. Sink

  66. Flows

  67. Planner

  68. Planner to optimize parallelism

  69. None
  70. Tuples

  71. Tuples

  72. ordered list of elements

  73. ["Matthew", 2, true]

  74. Tuple Stream

  75. ["Matthew", 2, true], ["Jay", 2, true], ["Peter", 0, false]

  76. ["Matthew", "Red"], ["Jay", "Grey"], ["Peter", "Brown"] ["Matthew", 2, true], ["Jay",

    2, true], ["Peter", 0, false] Co-Group
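
      A minimal Java sketch of the tuple model (the class names are Cascading's; the field names and values are illustrative, echoing the examples above):

      import cascading.tuple.Fields;
      import cascading.tuple.Tuple;
      import cascading.tuple.TupleEntry;

      public class TupleSketch {
        public static void main(String[] args) {
          // Fields names the positions; a Tuple is the ordered list of values
          Fields fields = new Fields( "name", "kids", "awake" );
          Tuple tuple = new Tuple( "Matthew", 2, true );

          // A TupleEntry pairs the field names with one tuple's values
          TupleEntry entry = new TupleEntry( fields, tuple );
          System.out.println( entry.getString( "name" ) );   // Matthew
          System.out.println( entry.getInteger( "kids" ) );  // 2
        }
      }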
  77. None
  78. The Process

  79. Source Tap → Pipe (Head → Tail) → Sink Tap

  80. Flow: Source Tap → Pipe (Head → Tail) → Pipe (Head → Tail) → Pipe (Head → Tail) → Sink Tap
  81. Late binding to taps

  82. public class SimplestPipe1Flip {
        public static void main(String[] args) {
          String inputPath = "data/babynamedefinitions.csv";
          String outputPath = "output/simplestpipe1";
          Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
          Tap source = new Hfs( sourceScheme, inputPath );
          Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ++ " );
          Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
          Pipe assembly = new Pipe( "flip" );
          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
          FlowConnector flowConnector = new FlowConnector( properties );
          Flow flow = flowConnector.connect( "flipflow", source, sink, assembly );
          flow.complete();
        }
      }
  83. Pipe Types: Each, GroupBy, CoGroup, Every, Sub-Assembly
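
      A hedged sketch of how these pipe types compose in Java (the word-count assembly below is illustrative and exercises Each, GroupBy, and Every; a CoGroup example appears on slide 87):

      import cascading.operation.aggregator.Count;
      import cascading.operation.regex.RegexSplitGenerator;
      import cascading.pipe.Each;
      import cascading.pipe.Every;
      import cascading.pipe.GroupBy;
      import cascading.pipe.Pipe;
      import cascading.tuple.Fields;

      public class PipeTypesSketch {
        public static Pipe wordCountAssembly() {
          Pipe assembly = new Pipe( "wordcount" );
          // Each: apply a function to every tuple (split each line into words)
          assembly = new Each( assembly, new Fields( "line" ),
              new RegexSplitGenerator( new Fields( "word" ), "\\s+" ) );
          // GroupBy: group the tuple stream on the "word" field
          assembly = new GroupBy( assembly, new Fields( "word" ) );
          // Every: run an aggregator over each group (count occurrences)
          assembly = new Every( assembly, new Count() );
          return assembly;
        }
      }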

  84. (diagram: a Flow assembled from Each, GroupBy, CoGroup, Every, and Sub-Assembly pipes)

  85. DAG

  86. (diagram: a Cascade chaining three Flows, each assembled from Each, GroupBy, CoGroup, Every, and Sub-Assembly pipes)
  87. public class SimplestPipe3CoGroup {
        public static void main(String[] args) {
          String inputPathDefinitions = "data/babynamedefinitions.csv";
          String inputPathCounts = "data/babynamecounts.csv";
          String outputPath = "output/simplestpipe3";

          Scheme sourceSchemeDefinitions = new TextDelimited( new Fields( "name", "definition" ), "," );
          Scheme sourceSchemeCounts = new TextDelimited( new Fields( "name", "count" ), "," );
          Tap sourceDefinitions = new Hfs( sourceSchemeDefinitions, inputPathDefinitions );
          Tap sourceCounts = new Hfs( sourceSchemeCounts, inputPathCounts );

          Scheme sinkScheme = new TextDelimited( new Fields( "dname", "count", "definition" ), " ^^^ " );
          Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

          Pipe definitionspipe = new Pipe( "definitionspipe" );
          Pipe countpipe = new Pipe( "countpipe" );

          // Join the tuple streams
          Fields commonfields = new Fields( "name" );
          Fields newfields = new Fields( "dname", "definition", "cname", "count" );
          Pipe joinpipe = new CoGroup( definitionspipe, commonfields, countpipe, commonfields, newfields, new InnerJoin() );

          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass( properties, SimplestPipe3CoGroup.class );
          FlowConnector flowConnector = new FlowConnector( properties );

          Map<String, Tap> sources = new HashMap<String, Tap>();
          sources.put( "definitionspipe", sourceDefinitions );
          sources.put( "countpipe", sourceCounts );
          Flow flow = flowConnector.connect( sources, sink, joinpipe );
          flow.complete();
        }
      }
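
      Both inputs carry a "name" column, so the CoGroup's declared fields rename the two copies to dname and cname to keep the joined tuple unambiguous; the sink then selects just dname, count, and definition.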
  88. None
  89. Motivations

  90. Big Data is a growing field
  91. MapReduce is the primary technique

  92. Cascading is becoming the MR standard

  93. Why a new MR toolkit? ㊌ Simpler coding ㊌ More logical processing abstractions ㊌ Run MapReduce locally ㊌ Debug jobs with ease
  94. easy debugging...

  95. public class SimplestPipe1Flip {
        public static void main(String[] args) {
          String inputPath = "data/babynamedefinitions.csv";
          String outputPath = "output/simplestpipe1";
          Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
          Tap source = new Hfs( sourceScheme, inputPath );
          Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ++ " );
          Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
          Pipe assembly = new Pipe( "flip" );

          //OPTIONAL: Debug the tuple
          //assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug() );

          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass( properties, SimplestPipe1Flip.class );
          FlowConnector flowConnector = new FlowConnector( properties );

          //OPTIONAL: Have the planner use or filter out the debugging statements
          //FlowConnector.setDebugLevel( properties, DebugLevel.VERBOSE );

          Flow flow = flowConnector.connect( "flipflow", source, sink, assembly );
          flow.complete();
        }
      }
  96. Cascading User Roles ㊌ Application executor ㊌ Process assembler ㊌ Operation developer
  97. Hadoop is never used alone. The dirty secret is that it is really a huge ETL tool. -Chris Wensel, Cascading Inventor
  98. 50gal Hot Water Heater

  99. Tankless Hot Water Heater

  100. None
  101. Building

  102. Let's prep the build

  103. Why?

  104. When in doubt, look at the Cascading source code. If something is not documented in this User Guide, the source code will give you clear instructions on what to do or expect. -Chris Wensel, Cascading Inventor
  105. https://github.com/cwensel

  106. None
  107. https://github.com/cwensel/cascading

  108. Ant 1.8.x

  109. Ivy 2.2.x

  110. # Verified Ant > 1.8.x
       # Verified Ivy > 2.2.x
       $ ant retrieve
  111. None
  112. Let's build it...

  113. $ ls -al
       drwxr-xr-x  15 mccm06  staff  510B Feb 21 14:31 ./
       drwxr-xr-x  20 mccm06  staff  680B Feb 17 15:39 ../
       drwxr-xr-x  10 mccm06  staff  340B Feb 19 01:40 cascading.groovy_git/
       drwxr-xr-x   7 mccm06  staff  238B Feb 19 01:40 cascading.hbase_git/
       drwxr-xr-x   8 mccm06  staff  272B Feb 19 01:40 cascading.jdbc_git/
       drwxr-xr-x   8 mccm06  staff  272B Feb 19 01:39 cascading.load_git/
       drwxr-xr-x   9 mccm06  staff  306B Feb 19 01:39 cascading.memcached_git/
       drwxr-xr-x   9 mccm06  staff  306B Feb 19 01:39 cascading.multitool_git/
       drwxr-xr-x  10 mccm06  staff  340B Feb 19 01:39 cascading.samples_git/
       drwxr-xr-x   8 mccm06  staff  272B Feb 19 01:39 cascading.work_git/
       drwxr-xr-x  14 mccm06  staff  476B Feb 21 14:26 cascading_git/
       drwxr-xr-x  11 mccm06  staff  374B Dec 31 16:16 cascalog_git/
       lrwxr-xr-x   1 mccm06  staff   45B Feb 21 14:31 hadoop -> /Applications/Dev/hadoop-family/hadoop-0.20.1
  114. # Trying Hadoop == 0.21.0
       # Verified 'hadoop' is neighbor to cascading
       $ ant compile
  115. [javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:52: cannot find symbol
       [javac] symbol  : class JobConf
       [javac] location: class cascading.tap.hadoop.TapIterator
       [javac]   private final JobConf conf;
       [javac]                 ^
       [javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:54: cannot find symbol
       [javac] symbol  : class InputSplit
       [javac] location: class cascading.tap.hadoop.TapIterator
       [javac]   private InputSplit[] splits;
       [javac]           ^
       [javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:56: cannot find symbol
       [javac] symbol  : class RecordReader
       [javac] location: class cascading.tap.hadoop.TapIterator
       [javac]   private RecordReader reader;
       [javac]           ^
       [javac] cascading_git/src/core/cascading/tap/hadoop/TapIterator.java:75: cannot find symbol
       [javac] symbol  : class JobConf
       [javac] location: class cascading.tap.hadoop.TapIterator
       [javac]   public TapIterator( Tap tap, JobConf conf ) throws IOException
       [javac]                                ^
       [javac] Note: Some input files use or override a deprecated API.
       [javac] Note: Recompile with -Xlint:deprecation for details.
       [javac] Note: Some input files use unchecked or unsafe operations.
       [javac] Note: Recompile with -Xlint:unchecked for details.
       [javac] 100 errors
  116. Hadoop 0.21.0

  117. Argh!

  118. Hadoop 0.20.1

  119. # Verified Hadoop == 0.20.1
       # Verified 'hadoop' is neighbor to cascading
       $ ant compile
  120. Buildfile: cascading_git/build.xml

       init:
       [echo] initializing cascading environment...
       [mkdir] Created dir: cascading_git/build/core
       [mkdir] Created dir: cascading_git/build/xml
       [mkdir] Created dir: cascading_git/build/test
       [mkdir] Created dir: cascading_git/build/testresults

       echo-compile-buildnum:

       compile:
       [echo] building cascading...
       [javac] Compiling 238 source files to cascading_git/build/core
       [javac] Note: Some input files use or override a deprecated API.
       [javac] Note: Recompile with -Xlint:deprecation for details.
       [javac] Note: Some input files use unchecked or unsafe operations.
       [javac] Note: Recompile with -Xlint:unchecked for details.
       [copy] Copying 1 file to cascading_git/build/core/cascading
       [javac] Compiling 5 source files to cascading_git/build/xml
       [javac] Compiling 85 source files to cascading_git/build/test
       [javac] Note: Some input files use or override a deprecated API.
       [javac] Note: Recompile with -Xlint:deprecation for details.
       [javac] Note: Some input files use unchecked or unsafe operations.
       [javac] Note: Recompile with -Xlint:unchecked for details.
       [copy] Copying 24 files to cascading_git/build/test

       BUILD SUCCESSFUL
       Total time: 7 seconds
  121. None
  122. Planner

  123. planner diagrams

  124. public class SimplestPipe2Sort {
         public static void main(String[] args) {
           String inputPath = "data/babynamedefinitions.csv";
           String outputPath = "output/simplestpipe2";
           Scheme sourceScheme = new TextDelimited( new Fields( "name", "definition" ), "," );
           Tap source = new Hfs( sourceScheme, inputPath );
           Scheme sinkScheme = new TextDelimited( new Fields( "definition", "name" ), " ^^^ " );
           Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

           Pipe assembly = new Pipe( "sortreverse" );
           Fields groupFields = new Fields( "name" );
           //OPTIONAL: Set the comparator
           //groupFields.setComparator("name", Collections.reverseOrder());
           assembly = new GroupBy( assembly, groupFields );

           Properties properties = new Properties();
           FlowConnector.setApplicationJarClass( properties, SimplestPipe2Sort.class );
           FlowConnector flowConnector = new FlowConnector( properties );
           Flow flow = flowConnector.connect( "sortflow", source, sink, assembly );
           flow.complete();

           //OPTIONAL: Output a debugging diagram
           //flow.writeDOT(outputPath + "/flowdiagram.dot");
         }
       }
  125. None
  126. Abstraction Levels

  127. a unique Java API

  128. similar to command abstractions in the core JVM

  129. CPU Instruction → Assembly Language → Class File → Java → Groovy DSL
       Hadoop → Cascading → Cascalog / Cascading Groovy
  130. None
  131. Builders

  132. a unique Java API

  133. but enhanced via...

  134. Jython

  135. JRuby

  136. Clojure

  137. Groovy

  138. None
  139. Coding a Groovy Flow

  140. Groovy

  141. setup...

  142. $ cd cascading.groovy
       $ ant dist
       $ cd dist
       $ groovy setup.groovy
  143. coding...

  144. def cascading = new Cascading()
       def builder = cascading.builder();
       Flow flow = builder.flow("wordcount") {
         source(input, scheme: text())
         // output new tuple for each split,
         // result replaces stream by default
         tokenize(/[.,]*\s+/)
         group() // group on stream
         // count values in group
         // creates 'count' field by default
         count()
         // group/sort on 'count', reverse the sort order
         group(["count"], reverse: true)
         sink(output, delete: true)
       }
  145. execution...

  146. $ groovy wordcount
       INFO - Concurrent, Inc - Cascading 1.2.1 [hadoop-0.19.2+]
       INFO - [wordcount] starting
       INFO - [wordcount]  source: Hfs["TextLine[['line']->[ALL]]"]["output/fetched/fetch.txt"]"]
       INFO - [wordcount]  sink: Hfs["TextLine[['line']->[ALL]]"]["output/counted"]"]
       INFO - [wordcount]  parallel execution is enabled: false
       INFO - [wordcount]  starting jobs: 2
       INFO - [wordcount]  allocating threads: 1
       INFO - [wordcount] starting step: (1/2) TempHfs["SequenceFile[[0, 'count']]"][wordcount/18750/]
       INFO - [wordcount] starting step: (2/2) Hfs["TextLine[['line']->[ALL]]"]["output/counted"]"]
       INFO - deleting temp path output/counted/_temporary
  147. None
  148. Cascalog

  149. Clojure

  150. None
  151. functional MR programming

  152. None
  153. None
  154. ㊌ Simple ㊌ Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.

  155. ㊌ Expressive ㊌ Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.

  156. ㊌ Interactive ㊌ Run queries from the Clojure REPL.

  157. ㊌ Scalable ㊌ Cascalog queries run as a series of MapReduce jobs.

  158. ㊌ Query Anything ㊌ Query HDFS data, database data, and/or local data by making use of Cascading's "Tap" abstraction.
  159. influenced by Datalog

  160. http://www.ccs.neu.edu/home/ramsdell/tools/datalog/datalog.html

  161. None
  162. query planner

  163. alternative to Pig, Hive

  164. read or write any data source

  165. higher density of code

  166. (?<- (stdout)         ;; Tap (sink)
            [?word ?count]   ;; Outputs
            (sentence ?s)    ;; Source
            (split ?s :> ?word)
            (c/count ?count))
  167. None
  168. Is It Fully Baked?

  169. Java is 16 years old

  170. None
  171. Hadoop is ~6 years old

  172. None
  173. Cascading is 4 years old

  174. None
  175. Cascading is MapReduce done right

  176. None
  177. None
  178. Cascading through Hadoop: simpler MapReduce through data flows, by Matthew McCullough, Ambient Ideas, LLC
  179. None