Cascading 3 and beyond - Speaker Deck

Slide 1

Slide 1 text

DRIVING INNOVATION THROUGH DATA

CASCADING 3 AND BEYOND
André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015

Slide 2

Slide 2 text

SPEAKER

André Kelpe
Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
[email protected] / @fs111

Slide 3

Slide 3 text

INTRODUCTION

• http://cascading.org
• Apache licensed
• Java framework for writing data oriented applications
• production ready, stable and battle proven

Slide 4

Slide 4 text

PHILOSOPHY

Slide 5

Slide 5 text

PHILOSOPHY

• developer productivity
• users focus on business problems, not distributed systems knowledge
• predictable runtime behaviour
• fail fast

Slide 6

Slide 6 text

PHILOSOPHY

• stable user APIs
• safe defaults with knobs for experts
• batch workloads

Slide 7

Slide 7 text

PHILOSOPHY

• testability & robustness
• production quality applications rather than a collection of scripts
• abstractions over interchangeable platforms

Slide 8

Slide 8 text

TERMINOLOGY

Slide 9

Slide 9 text

A SERIES OF PIPES

Image: https://www.flickr.com/photos/theilr/4283377543/sizes/l

Slide 10

Slide 10 text

CASCADING TERMINOLOGY

• Taps are sources and sinks for data
• Schemes represent the format of the data
• Pipes connect Taps

Slide 11

Slide 11 text

CASCADING TERMINOLOGY

● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in TupleStreams
● Pipes can be merged, spliced, joined etc.
● Pipe-assemblies are reusable components

Slide 12

Slide 12 text

CASCADING TERMINOLOGY

• A FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform
• Flows can be orchestrated via a Cascade
• Applications are Directed Acyclic Graphs (DAGs)
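Because an application is a DAG, its steps can always be put into an order in which every step runs after the steps that feed it. The following is a minimal plain-Java sketch of that idea using Kahn's algorithm. It is an illustration only and is not Cascading's QueryPlanner; the class name `DagOrderSketch` and the node names used with it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a tiny DAG with a topological ordering.
// This is NOT how Cascading's QueryPlanner is implemented.
final class DagOrderSketch {

    // node -> downstream nodes; LinkedHashMap keeps insertion order stable
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    // Registers both nodes and records that data flows from 'from' to 'to'.
    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Kahn's algorithm: repeatedly emit nodes whose inputs are all done.
    List<String> executionOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) {
            inDegree.put(node, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : edges.get(node)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }

        // If some nodes were never emitted, the graph had a cycle.
        if (order.size() != edges.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }
}
```

Adding edges such as source tap to split, split to group, group to count, and count to sink tap yields exactly that order; a graph with a cycle is rejected, which is the "acyclic" part of DAG.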

Slide 13

Slide 13 text

DAG

Slide 14

Slide 14 text

PLATFORMS

Slide 15

Slide 15 text

CASCADING PLATFORMS

local

change 1 line of code, recompile, done.

Slide 16

Slide 16 text

COMPILER ANALOGY

[Diagram. Compiler side: User Code, Translation, Optimisation, Assembly, CPU Architecture. Cascading side: FlowDef, QueryPlanner/RuleEngine, platforms MR, Tez, Flink, others…]

Slide 17

Slide 17 text

DAG

Slide 18

Slide 18 text

A DAG RUNNING ON A PLATFORM

Slide 19

Slide 19 text

REAL WORLD DAG

https://github.com/cchepelov/wcplus
https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1

Slide 20

Slide 20 text

CODE EXAMPLE

Slide 21

Slide 21 text

APIS

● Fluid - A Fluent API for Cascading
  − Targeted at application writers
  − https://github.com/Cascading/fluid
● "Raw" Cascading API
  − Targeted at library writers, code generators, integration layers
  − https://github.com/Cascading/cascading

Slide 22

Slide 22 text

COUNTING WORDS

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
...

Slide 23

Slide 23 text

COUNTING WORDS (CONT.)

// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
...

Slide 24

Slide 24 text

COUNTING WORDS (CONT.)

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "word count" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the code
}
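To make the result of the word count flow easier to picture, here is a plain-Java sketch of the same logic (split each line with the slide's regex, group by token, count) that runs in a single JVM without Cascading. It is an assumption-laden illustration, not a replacement for the flow: the class name `WordCountSketch` is hypothetical, input is passed as plain lines rather than read through a Tap, and empty tokens are dropped, which the slides do not specify.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Illustration only: single-JVM equivalent of "split, group by token, count".
final class WordCountSketch {

    // Same delimiter regex as the RegexSplitGenerator on the slide.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Returns token -> number of occurrences, sorted by token.
    static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption: empty tokens are not counted
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

The point of the Cascading version is that the same split, group and count steps are expressed as pipes, so the planner can distribute them across a cluster instead of running them in one loop.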

Slide 25

Slide 25 text

A FULL TOOLBOX

● Operations
  − Function
  − Filter
  − Regex/Scripts
  − Boolean operators
  − Count/Limit/Last/First
  − Unique
  − Asserts
  − Min/Max
● Splices
  − GroupBy
  − CoGroup
  − HashJoin
  − Merge
● Joins
  − Left, right, outer, inner, mixed, custom
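The join types listed above differ only in what happens to rows without a match. As a rough plain-Java sketch (not Cascading's CoGroup or HashJoin API), the difference between an inner and a left join over two keyed data sets can be shown like this. The class `JoinSketch`, the `keepUnmatchedLeft` flag, and the comma-separated row format are all hypothetical choices made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: inner vs. left join over two key -> value maps.
final class JoinSketch {

    // Emits "key,leftValue,rightValue" for every left row.
    // keepUnmatchedLeft = false -> inner join (unmatched left rows are dropped)
    // keepUnmatchedLeft = true  -> left join (unmatched rows get "null" on the right)
    static List<String> join(Map<String, String> left,
                             Map<String, String> right,
                             boolean keepUnmatchedLeft) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> leftRow : left.entrySet()) {
            String rightValue = right.get(leftRow.getKey());
            if (rightValue != null || keepUnmatchedLeft) {
                rows.add(leftRow.getKey() + "," + leftRow.getValue() + "," + rightValue);
            }
        }
        return rows;
    }
}
```

A right join is the same operation with the two inputs swapped, and a full outer join additionally emits the right rows that no left row matched.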

Slide 26

Slide 26 text

A FULL TOOLBOX

• data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra, Kinesis, Accumulo …
• data formats: Avro, Parquet, ORC (+ACID), Thrift, Protobuf, CSV, TSV…
• integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom

Slide 27

Slide 27 text

OUTLOOK TO CASCADING 3.1+

• improved serialization through strong typing
• Cascading on Apache Flink
• Cascading on Hazelcast

Slide 28

Slide 28 text

DON’T LIKE JAVA?

• Clojure/logic programming: https://github.com/nathanmarz/cascalog
• Clojure: https://github.com/Netflix/PigPen
• Scala: https://github.com/twitter/scalding

Slide 29

Slide 29 text

QUESTIONS?

Slide 30

Slide 30 text

LINK COLLECTION

• http://www.cascading.org/
• https://github.com/Cascading/
• http://driven.io/
• http://concurrentinc.com
• https://groups.google.com/forum/#!forum/cascading-user
• http://docs.cascading.org/tutorials/etl-log/
• http://docs.cascading.org/cascading/3.0/userguide/html/
