Cascading 3 and beyond

André Kelpe
September 28, 2015

Introduction to Cascading 3 given at Apache Big Data Europe 2015 in Budapest.

Transcript

  1. DRIVING INNOVATION THROUGH DATA: CASCADING 3 AND BEYOND

    André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015
  2. SPEAKER

    André Kelpe, Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
    http://concurrentinc.com / @concurrent
    andre@concurrentinc.com / @fs111
  3. INTRODUCTION

    • http://cascading.org
    • Apache licensed Java framework for writing data oriented applications
    • production ready, stable and battle proven
  4. PHILOSOPHY

  5. PHILOSOPHY

    • developer productivity
    • users focus on business problems, not distributed systems knowledge
    • predictable runtime behaviour
    • fail fast
  6. PHILOSOPHY

    • stable user APIs
    • safe defaults with knobs for experts
    • batch workloads
  7. PHILOSOPHY

    • testability & robustness
    • production quality applications rather than a collection of scripts
    • abstractions over interchangeable platforms
  8. TERMINOLOGY

  9. A SERIES OF PIPES

    https://www.flickr.com/photos/theilr/4283377543/sizes/l
  10. CASCADING TERMINOLOGY

    • Taps are sources and sinks for data
    • Schemes represent the format of the data
    • Pipes connect Taps
  11. CASCADING TERMINOLOGY

    • Tuples flow through Pipes
    • Fields describe the Tuples
    • Operations are executed on Tuples in TupleStreams
    • Pipes can be merged, spliced, joined etc.
    • Pipe-assemblies are reusable components
  12. CASCADING TERMINOLOGY

    • FlowConnector uses QueryPlanner to translate a FlowDef into a Flow to run on a computational platform
    • Flows can be orchestrated via a Cascade
    • Applications are Directed Acyclic Graphs (DAG)
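
The statement that applications are DAGs, which a planner has to turn into something executable, can be illustrated with a small plain-Java sketch. It does not use the Cascading API and makes no claim about how QueryPlanner works internally; the class `DagSketch`, the node names and the choice of Kahn's algorithm for ordering are assumptions made purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Cascading.
final class DagSketch {
    // adjacency list: node -> downstream nodes, kept in insertion order
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void connect(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Kahn's algorithm: repeatedly emit nodes whose upstream nodes are all done.
    List<String> executionOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) inDegree.put(node, 0);
        for (List<String> targets : edges.values())
            for (String t : targets) inDegree.merge(t, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String t : edges.get(node))
                if (inDegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        if (order.size() != edges.size())
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch();
        dag.connect("source tap", "split into tokens");
        dag.connect("split into tokens", "group by token");
        dag.connect("group by token", "count");
        dag.connect("count", "sink tap");
        System.out.println(dag.executionOrder());
    }
}
```

A graph containing a cycle is rejected up front, which mirrors the "fail fast" idea from the philosophy slides: the problem surfaces before any data is processed.
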
  13. DAG

  14. PLATFORMS

  15. CASCADING PLATFORMS

    local
    change 1 line of code, recompile, done.
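
The "change 1 line of code" idea can be shown with a plain-Java analogy: the job is written once against an abstraction, and a single line selects how it is executed. This is not Cascading code. `PlatformSketch`, `Platform`, `LOCAL` and `PARALLEL` are hypothetical names, and sequential versus parallel Java streams merely stand in for different execution platforms.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only: these types are not part of Cascading.
final class PlatformSketch {
    // A "platform" decides how a collection of records is executed.
    interface Platform {
        Stream<String> open(List<String> records);
    }

    static final Platform LOCAL = records -> records.stream();
    static final Platform PARALLEL = records -> records.parallelStream();

    // The logical job definition is written once, against the abstraction.
    static Map<String, Long> run(Platform platform, List<String> records) {
        return platform.open(records)
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> records = List.of("a", "b", "a", "c", "a");
        Platform platform = LOCAL; // the one line to change: swap in PARALLEL
        System.out.println(run(platform, records));
    }
}
```

In the word-count code shown later in the deck, the line that names a platform is the one constructing the `FlowConnector`.
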
  16. COMPILER ANALOGY

    [Diagram labels: User Code, Translation, Optimisation, Assembly, CPU Architecture; FlowDef, QueryPlanner/RuleEngine, MR, Tez, Flink, others…]
  17. DAG

  18. A DAG RUNNING ON A PLATFORM

  19. REAL WORLD DAG

    https://github.com/cchepelov/wcplus
    https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1

  20. CODE EXAMPLE

  21. APIS

    • Fluid - A Fluent API for Cascading
      − Targeted at application writers
      − https://github.com/Cascading/fluid
    • "Raw" Cascading API
      − Targeted at library writers, code generators, integration layers
      − https://github.com/Cascading/cascading
  22. COUNTING WORDS

        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

        // create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
        ...
  23. COUNTING WORDS (CONT.)

        // specify a regex operation to split the "document" text lines into a token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
        ...
  24. COUNTING WORDS (CONT.)

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef()
          .setName( "word count" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        Flow wcFlow = flowConnector.connect( flowDef );
        wcFlow.complete(); // ← runs the code
      }
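
To make clear what the word-count flow computes, here is a plain-Java sketch of the same logic: split each line on the delimiter pattern from the slide, group by token, count. It is not Cascading code and runs in a single JVM. The class `WordCountSketch` is a hypothetical name, the input is passed as a list of lines rather than read through a Tap, and skipping empty tokens is an assumption of this sketch that the slide code does not make explicitly.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java illustration of the computation, not Cascading code.
final class WordCountSketch {
    // same delimiter pattern as the RegexSplitGenerator on the slide
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // roughly the role of Each + RegexSplitGenerator: one line in, many tokens out
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) continue; // assumption of this sketch: drop empty tokens
                // roughly the role of GroupBy + Every(Count): one counter per distinct token
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {dry=1, rain=2, shadow=2}
        System.out.println(countWords(List.of("rain shadow, rain", "shadow (dry)")));
    }
}
```
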
  25. A FULL TOOLBOX

    • Operations
      − Function
      − Filter
      − Regex/Scripts
      − Boolean operators
      − Count/Limit/Last/First
      − Scripts
      − Unique
      − Asserts
      − Min/Max
    • Splices
      − GroupBy
      − CoGroup
      − HashJoin
      − Merge
    • Joins
      − Left, right, outer, inner, mixed, custom
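
The difference between the join types listed above can be sketched in plain Java for the inner and left cases. This is not Cascading code: `JoinSketch` is a hypothetical name, and each side is modelled as a map with at most one value per key, which is a simplifying assumption (real joins handle many tuples per key).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of join semantics, not Cascading code.
final class JoinSketch {
    // Inner join: keep only keys present on both sides.
    static List<String> innerJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(lhs).entrySet())
            if (rhs.containsKey(e.getKey()))
                rows.add(e.getKey() + "," + e.getValue() + "," + rhs.get(e.getKey()));
        return rows;
    }

    // Left join: keep every left-hand key; a missing right-hand value shows up as "null".
    static List<String> leftJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(lhs).entrySet())
            rows.add(e.getKey() + "," + e.getValue() + "," + rhs.get(e.getKey()));
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("1", "ann", "2", "bob");
        Map<String, String> orders = Map.of("2", "book", "3", "pen");
        System.out.println(innerJoin(users, orders)); // [2,bob,book]
        System.out.println(leftJoin(users, orders));  // [1,ann,null, 2,bob,book]
    }
}
```

A right join is the left join with the arguments swapped, and an outer join keeps the keys of both sides.
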
  26. A FULL TOOLBOX

    • data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra, Kinesis, Accumulo …
    • data formats: Avro, Parquet, ORC (+ACID), Thrift, Protobuf, CSV, TSV…
    • integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom
  27. OUTLOOK TO CASCADING 3.1+

    • improved serialization through strong typing
    • Cascading on Apache Flink
    • Cascading on Hazelcast
  28. DON’T LIKE JAVA?

    • Clojure/logic programming: https://github.com/nathanmarz/cascalog
    • Clojure: https://github.com/Netflix/PigPen
    • Scala: https://github.com/twitter/scalding
  29. QUESTIONS?

  30. LINK COLLECTION

    • http://www.cascading.org/
    • https://github.com/Cascading/
    • http://driven.io/
    • http://concurrentinc.com
    • https://groups.google.com/forum/#!forum/cascading-user
    • http://docs.cascading.org/tutorials/etl-log/
    • http://docs.cascading.org/cascading/3.0/userguide/html/