Slide 1

Slide 1 text

SELECT ALL THE THINGS!
Cascading Lingual
ANSI SQL for Apache Hadoop
TU-Berlin, January 13th 2014
André Kelpe
concurrentinc.com

Slide 2

Slide 2 text

Speaker
André Kelpe
Software Engineer at Concurrent, the company behind Cascading and Lingual
concurrentinc.com / @concurrent
@fs111

Slide 3

Slide 3 text

Agenda
Cascading and Lingual
Lingual: design goals
Lingual: features
Demo: Lingual in action
Q&A

Slide 4

Slide 4 text

Cascading http://cascading.org

Slide 5

Slide 5 text

Cascading terminology
Taps are sources and sinks for data
Schemes represent the format of the data
Pipes connect Taps

Slide 6

Slide 6 text

Cascading terminology
Tuples flow through Pipes
Fields describe the Tuples
Operations are executed on Tuples in TupleStreams
Flows get scheduled and executed

Slide 7

Slide 7 text

Word Count 1/4
// define source and sink Taps.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );

Slide 8

Slide 8 text

Word Count 2/4
// For each input Tuple
// parse out each word into a new Tuple with the field name "word"
// regular expressions are optional in Cascading
String regex = "(?

Slide 9

Slide 9 text

Word Count 3/4
// For every Tuple group
// count the number of occurrences of "word" and store result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

Slide 10

Slide 10 text

Word Count 4/4
// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
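
For orientation, the result that the Word Count flow produces can be sketched in plain JDK Java, without Cascading or Hadoop. This is only an illustration of the computation (split lines into words, group by word, count each group), not the way Cascading executes it. The class name `WordCountSketch`, the method `countWords`, and the whitespace tokenizer are assumptions made for this sketch; the slides use a regular expression that is not reproduced here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch: an in-memory equivalent of the
// "word" -> "count" result that the Word Count flow writes to its sink.
public class WordCountSketch {

    // Splits every line on whitespace (an assumed tokenizer, not the
    // regex from the slides) and counts how often each word occurs.
    public static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by a blank line
                }
                // comparable to grouping on "word" and applying a count aggregator
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            countWords(Arrays.asList("to be or not to be", "be quick"));
        // print tab-separated word/count pairs, one per line
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The Cascading version expresses the same steps as a pipe assembly so that the planner can schedule them as Hadoop jobs over data that does not fit in memory.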

Slide 11

Slide 11 text

Not just Java!

Slide 12

Slide 12 text

Scalding
Scala DSL on top of Cascading
Developed by Twitter
https://github.com/twitter/scalding/

Slide 13

Slide 13 text

Cascalog
Clojure DSL on top of Cascading
Inspired by Datalog
https://github.com/nathanmarz/cascalog

Slide 14

Slide 14 text

ANSI SQL via Lingual http://www.cascading.org/lingual/

Slide 15

Slide 15 text

Lingual – design goals 1/3
Immediate Data Access
SQL access via Shell or JDBC driver
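
A minimal sketch of what JDBC access could look like from Java, using only the standard `java.sql` API. The driver class name `cascading.lingual.jdbc.Driver` and the URL layout `jdbc:lingual:<platform>;schemas=<path>` are assumptions recalled from the Lingual 1.0 documentation and should be checked against the version in use. The class `LingualJdbcSketch` and its helper methods are hypothetical; the schema path and the `example.employee` table name are borrowed from the deck's SQL demo. Running `main` requires the Lingual JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch of querying Lingual through plain JDBC.
public class LingualJdbcSketch {

    // Builds a connection URL of the assumed form
    // jdbc:lingual:<platform>;schemas=<path>
    public static String lingualUrl(String platform, String schemaPath) {
        return "jdbc:lingual:" + platform + ";schemas=" + schemaPath;
    }

    // Runs a query and returns the number of rows in its result set.
    public static int countRows(String url, String sql) throws SQLException {
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(sql)) {
            int rows = 0;
            while (resultSet.next()) {
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed driver class name; needs the Lingual JDBC jar on the classpath.
        Class.forName("cascading.lingual.jdbc.Driver");
        String url = lingualUrl("local", "src/main/resources/data/example");
        String sql = "select * from \"example\".\"employee\"";
        System.out.println(countRows(url, sql) + " rows");
    }
}
```

Because the code depends only on `java.sql` interfaces, existing JDBC-based tools and applications can in principle be pointed at Lingual by swapping the driver and the connection URL.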

Slide 16

Slide 16 text

Lingual – design goals 2/3
Simplify SQL Migration
Move SQL workflows onto your Hadoop cluster via Cascading flows or the JDBC driver

Slide 17

Slide 17 text

Lingual – design goals 3/3
Simplify System & Data Integration
Read and write from HDFS, JDBC, memcached, HBase, Redshift...

Slide 18

Slide 18 text

Lingual – ANSI SQL on Cascading http://www.cascading.org/lingual/

Slide 19

Slide 19 text

Lingual and Cascading are about batch processing large amounts of data

Slide 20

Slide 20 text

Demo

Slide 21

Slide 21 text

SQL in Cascading 1/2
String statement = "select *\n"
  + "from \"example\".\"sales_fact_1997\" as s\n"
  + "join \"example\".\"employee\" as e\n"
  + "on e.\"EMPID\" = s.\"CUST_ID\"";
Tap empTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
  "src/main/resources/data/example/employee.tcsv", SinkMode.KEEP );
Tap salesTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
  "src/main/resources/data/example/sales_fact_1997.tcsv", SinkMode.KEEP );
Tap resultsTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
  "build/test/output/flow/results.tcsv", SinkMode.REPLACE );

Slide 22

Slide 22 text

SQL in Cascading 2/2
FlowDef flowDef = FlowDef.flowDef()
  .setName( "sql flow" )
  .addSource( "example.employee", empTap )
  .addSource( "example.sales_fact_1997", salesTap )
  .addSink( "results", resultsTap );
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( statement );
flowDef.addAssemblyPlanner( sqlPlanner );
Flow flow = new LocalFlowConnector().connect( flowDef );
flow.complete();

Slide 23

Slide 23 text

Lingual – ANSI SQL on Cascading http://www.cascading.org/lingual/

Slide 24

Slide 24 text

Q&A
@fs111 / http://concurrentinc.com
http://cascading.org/lingual

Slide 25

Slide 25 text

Link collection
http://www.cascading.org/lingual/
http://www.cascading.org/
http://docs.cascading.org/lingual/1.0/
https://github.com/Cascading/
http://concurrentinc.com
https://groups.google.com/forum/#!forum/lingual-user
https://groups.google.com/forum/#!forum/cascading-user
http://docs.cascading.org/impatient/
https://github.com/Cascading/vagrant-cascading-hadoop-cluster
