Slide 1

Slide 1 text

The Cascading (big) data application framework
André Kelpe | HUG France | Paris | 25 November 2014

Slide 2

Slide 2 text

Who am I?
André Kelpe
Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
[email protected] / @fs111

Slide 3

Slide 3 text

http://cascading.org
Apache-licensed Java framework for writing data-oriented applications
Production-ready, stable and battle-proven (SoundCloud, Twitter, Etsy, Climate Corp + many more)

Slide 4

Slide 4 text

Cascading goals
Developer productivity
Focus on business problems, not distributed systems knowledge
Useful abstractions over underlying "fabrics"

Slide 5

Slide 5 text

Cascading goals
Testability & robustness
Production-quality applications rather than a collection of scripts
(hooks into the core for experts)

Slide 6

Slide 6 text

https://www.flickr.com/photos/theilr/4283377543/sizes/l

Slide 7

Slide 7 text

Cascading terminology
Taps are sources and sinks for data
Schemes represent the format of the data
Pipes connect Taps

Slide 8

Slide 8 text

Cascading terminology
● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in TupleStreams
● FlowConnector uses QueryPlanner to translate a FlowDef into a Flow to run on a computational fabric

Slide 9

Slide 9 text

[Diagram: the QueryPlanner compared to a compiler. Compiler side: User Code, Translation, Optimization, Assembly, CPU Architecture. QueryPlanner side: FlowDef, targeting Hadoop, Tez or Spark.]

Slide 10

Slide 10 text

User APIs
● Fluid - A Fluent API for Cascading
  – Targeted at application writers
  – https://github.com/Cascading/fluid
● "Raw" Cascading API
  – Targeted at library writers, code generators, integration layers
  – https://github.com/Cascading/cascading

Slide 11

Slide 11 text

Counting words
    // configuration
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    ...

Slide 12

Slide 12 text

Counting words (cont.)
    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
    ...

Slide 13

Slide 13 text

Counting words (cont.)
    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.complete(); // ← runs the code
    }
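To make it easier to see what the word-count assembly computes, here is a minimal plain-Java sketch of the same three steps (split each line with the slide's regex, group by token, count per group). It does not use Cascading at all: the class name WordCountSketch and the method countTokens are made up for illustration, and java.util.regex.Pattern stands in for the RegexSplitGenerator operation, so edge-case behaviour may differ from the real operation.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Cascading.
class WordCountSketch {
    // Same delimiter set as on the slide: space, [ ] ( ) comma and period.
    static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Splits every line into tokens and counts how often each token occurs.
    // The TreeMap plays the role of "group by token, then count per group".
    static Map<String, Long> countTokens(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one "token<TAB>count" row per distinct token.
        Map<String, Long> counts =
            countTokens(Arrays.asList("rain shadow (dry area)", "dry, dry rain"));
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

In this sketch, two adjacent delimiters (for example a comma followed by a space) produce an empty-string token, because Pattern.split keeps empty strings between delimiters. A real application would probably filter such tokens out before counting.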

Slide 14

Slide 14 text

https://driven.cascading.io/driven/871A2C66DA1D4841B229CDD2B04B9FDA

Slide 15

Slide 15 text

Impatient
Cascading for the Impatient
http://docs.cascading.org/impatient/index.html

Slide 16

Slide 16 text

A full toolbox
● Operations
  – Function
  – Filter
  – Regex/Scripts
  – Boolean operators
  – Count/Limit/Last/First
  – Scripts
  – Unique
  – Asserts
  – Min/Max
  – …
● Splices
  – GroupBy
  – CoGroup
  – HashJoin
  – Merge
● Joins
  Left, right, outer, inner, mixed...
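As a reminder of how the join flavours listed above differ, here is a small plain-Java sketch contrasting an inner join with a left join. It is not Cascading code: JoinSketch and its join method are hypothetical names, and each side is simplified to a map with unique keys, which real tuple streams do not guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Cascading.
class JoinSketch {
    // Joins two keyed collections on their keys.
    // keepUnmatchedLeft = false gives an inner join (only keys present on both sides);
    // keepUnmatchedLeft = true gives a left join (every left key, with null for a missing right value).
    static List<String> join(Map<String, String> left,
                             Map<String, String> right,
                             boolean keepUnmatchedLeft) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey());
            if (r != null || keepUnmatchedLeft) {
                rows.add(l.getKey() + "\t" + l.getValue() + "\t" + r);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> left = new LinkedHashMap<>();
        left.put("a", "1");
        left.put("b", "2");
        Map<String, String> right = new LinkedHashMap<>();
        right.put("a", "x");

        System.out.println("inner: " + join(left, right, false));
        System.out.println("left:  " + join(left, right, true));
    }
}
```

A right join is the same idea with the two sides swapped, and an outer join keeps unmatched keys from both sides.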

Slide 17

Slide 17 text

A full toolbox
Data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra...
Data formats: Avro, Thrift, Protobuf, CSV, TSV...
Integration points: Cascading Lingual (SQL), Apache Hive, classical M/R apps...
Not Java?: Scalding (Scala), Cascalog (Clojure)

Slide 18

Slide 18 text

Status quo
● Cascading 2.6 – Production release
  – Hadoop 2.x
  – Hadoop 1.x
  – Local mode
● Cascading 3.0 – public WIP builds
  – Tez
  – Hadoop 2.x
  – Hadoop 1.x
  – Local mode
  – Others (Spark...)

Slide 19

Slide 19 text

Questions? [email protected]

Slide 20

Slide 20 text

Link Collection
http://www.cascading.org/
https://github.com/Cascading/
http://concurrentinc.com
http://cascading.io/driven/
https://groups.google.com/forum/#!forum/cascading-user
http://docs.cascading.org/impatient/
http://docs.cascading.org/cascading/2.6/userguide/html/

Slide 21

Slide 21 text

fin.