Cascading 3 and beyond

DRIVING INNOVATION THROUGH DATA
CASCADING 3 AND BEYOND
André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015

SPEAKER
• André Kelpe, Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
• http://concurrentinc.com / @concurrent
• andre@concurrentinc.com / @fs111

INTRODUCTION
• http://cascading.org
• Apache licensed Java framework for writing data oriented applications
• production ready, stable and battle proven

PHILOSOPHY
• developer productivity
• users focus on business problems, not distributed systems knowledge
• predictable runtime behaviour
• fail fast
• stable user APIs
• safe defaults with knobs for experts
• batch workloads
• testability & robustness
• production quality applications rather than a collection of scripts
• abstractions over interchangeable platforms

TERMINOLOGY

A SERIES OF PIPES
https://www.flickr.com/photos/theilr/4283377543/sizes/l

CASCADING TERMINOLOGY
• Taps are sources and sinks for data
• Schemes represent the format of the data
• Pipes connect Taps
• Tuples flow through Pipes
• Fields describe the Tuples
• Operations are executed on Tuples in TupleStreams
• Pipes can be merged, spliced, joined etc.
• Pipe-assemblies are reusable components
• FlowConnector uses QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform
• Flows can be orchestrated via a Cascade
• Applications are Directed Acyclic Graphs (DAG)
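Because an application is a directed acyclic graph, its steps can always be arranged so that every step runs after the steps feeding it. The following is a minimal plain-Java sketch of that idea (Kahn's topological sort). It does not use Cascading and is not how the QueryPlanner works internally; the class name `DagOrderSketch` and the node names are hypothetical, chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Cascading.
public class DagOrderSketch {
    // node -> list of downstream nodes
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    public void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Returns the nodes in an order where every node follows all of its inputs.
    public List<String> executionOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) {
            inDegree.put(node, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        // Nodes without inputs (the sources) are ready immediately.
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((node, degree) -> {
            if (degree == 0) {
                ready.add(node);
            }
        });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : edges.get(node)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        // If some nodes were never released, the graph contains a cycle.
        if (order.size() != edges.size()) {
            throw new IllegalStateException("cycle detected, not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagOrderSketch dag = new DagOrderSketch();
        dag.addEdge("docs", "tokenize");
        dag.addEdge("stopwords", "join");
        dag.addEdge("tokenize", "join");
        dag.addEdge("join", "count");
        System.out.println(dag.executionOrder());
    }
}
```

A graph with a cycle has no such order, which is why the sketch rejects it instead of returning a partial result.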

DAG

PLATFORMS

CASCADING PLATFORMS
• local
• change 1 line of code, recompile, done.

COMPILER ANALOGY
• Compiler: User Code, Translation, Optimisation, Assembly, CPU Architecture
• Cascading: FlowDef, QueryPlanner/RuleEngine, MR, Tez, Flink, others…

DAG

A DAG RUNNING ON A PLATFORM

REAL WORLD DAG
• https://github.com/cchepelov/wcplus
• https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1

CODE EXAMPLE

APIS
• Fluid - A Fluent API for Cascading
  − Targeted at application writers
  − https://github.com/Cascading/fluid
• "Raw" Cascading API
  − Targeted at library writers, code generators, integration layers
  − https://github.com/Cascading/cascading

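COUNTING WORDS

    // imports used by this fragment
    import java.util.Properties;
    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    ...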

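COUNTING WORDS (CONT.)

    // imports used by this fragment
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
    ...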

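COUNTING WORDS (CONT.)

    // imports used by this fragment
    import cascading.flow.Flow;
    import cascading.flow.FlowDef;

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "word count" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.complete(); // ← runs the code
    } // closes the enclosing main method, whose opening is not shown on the slides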
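To sanity-check what the word count flow should produce for a small input, the same split, group and count steps can be sketched in plain Java without Cascading or Hadoop. This is only an approximation under stated assumptions: the class name `WordCountSketch`, the method `countTokens` and the sample lines are hypothetical, and the sketch skips empty tokens, whereas the flow on the slides has no such filter and may emit an additional empty token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Cascading.
public class WordCountSketch {
    // Same delimiter character class as the RegexSplitGenerator on the slide.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split every line into tokens, group equal tokens and count each group.
    public static Map<String, Long> countTokens(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // assumption: empty tokens are dropped in this sketch
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow", "rain (dry) land.");
        // Print tab-separated token and count, one pair per line.
        countTokens(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

For the two sample lines the sketch counts "rain" twice and "dry", "land" and "shadow" once each.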

A FULL TOOLBOX
• Operations
  − Function
  − Filter
  − Regex/Scripts
  − Boolean operators
  − Count/Limit/Last/First
  − Scripts
  − Unique
  − Asserts
  − Min/Max
• Splices
  − GroupBy
  − CoGroup
  − HashJoin
  − Merge
• Joins
  − Left, right, outer, inner, mixed, custom
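The difference between an inner and a left join, and the general idea behind a hash join, can be illustrated with a short plain-Java sketch. It shows the generic technique (index one side in memory, stream the other side past it), not Cascading's HashJoin implementation; the class name `HashJoinSketch`, the `join` method and the row layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Cascading.
public class HashJoinSketch {
    // Each row is a two-element array: { key, value }.
    // keepUnmatchedLeft = false gives an inner join, true gives a left join.
    public static List<String> join(List<String[]> left, List<String[]> right, boolean keepUnmatchedLeft) {
        // Build phase: index the right side by key in memory.
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : right) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: stream the left side and look up matches.
        List<String> out = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = index.get(row[0]);
            if (matches == null) {
                if (keepUnmatchedLeft) {
                    out.add(row[0] + "\t" + row[1] + "\tnull");
                }
                continue;
            }
            for (String match : matches) {
                out.add(row[0] + "\t" + row[1] + "\t" + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(new String[]{"a", "1"}, new String[]{"b", "2"});
        List<String[]> right = Arrays.asList(new String[]{"a", "x"}, new String[]{"a", "y"});
        System.out.println(join(left, right, false)); // inner join
        System.out.println(join(left, right, true));  // left join
    }
}
```

The build side has to fit in memory, which is the usual trade-off of a hash join compared with a grouping-based join.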

A FULL TOOLBOX
• data access: JDBC, HBase, elasticsearch, redshift, HDFS, S3, Cassandra, kinesis, accumulo …
• data formats: avro, parquet, ORC (+ACID), thrift, protobuf, CSV, TSV…
• integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom

OUTLOOK TO CASCADING 3.1+
• improved serialization through strong typing
• Cascading on Apache Flink
• Cascading on Hazelcast

DON’T LIKE JAVA?
• Clojure/logic programming: https://github.com/nathanmarz/cascalog
• Clojure: https://github.com/Netflix/PigPen
• Scala: https://github.com/twitter/scalding

QUESTIONS?

LINK COLLECTION
• http://www.cascading.org/
• https://github.com/Cascading/
• http://driven.io/
• http://concurrentinc.com
• https://groups.google.com/forum/#!forum/cascading-user
• http://docs.cascading.org/tutorials/etl-log/
• http://docs.cascading.org/cascading/3.0/userguide/html/

