Cascading 3 and beyond

André Kelpe
September 28, 2015

Introduction to Cascading 3 given at Apache Big Data Europe 2015 in Budapest.

Transcript

  1. DRIVING INNOVATION THROUGH DATA
     CASCADING 3 AND BEYOND
     André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015
  2. SPEAKER
     André Kelpe, Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
     http://concurrentinc.com / @concurrent, @fs111
  3. INTRODUCTION
     • http://cascading.org
     • Apache-licensed Java framework for writing data-oriented applications
     • production ready, stable and battle proven
  4. PHILOSOPHY
     • developer productivity
     • users focus on business problems, not distributed systems knowledge
     • predictable runtime behaviour
     • fail fast
  5. PHILOSOPHY
     • testability & robustness
     • production quality applications rather than a collection of scripts
     • abstractions over interchangeable platforms
  6. CASCADING TERMINOLOGY
     • Taps are sources and sinks for data
     • Schemes represent the format of the data
     • Pipes connect Taps
  7. CASCADING TERMINOLOGY
     • Tuples flow through Pipes
     • Fields describe the Tuples
     • Operations are executed on Tuples in TupleStreams
     • Pipes can be merged, spliced, joined etc.
     • Pipe-assemblies are reusable components
  8. CASCADING TERMINOLOGY
     • A FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform
     • Flows can be orchestrated via a Cascade
     • Applications are Directed Acyclic Graphs (DAGs)
  9. COMPILER ANALOGY
     [Diagram. Labels: User Code, Translation, Optimisation, Assembly, CPU Architecture; FlowDef, QueryPlanner/RuleEngine; MR, Tez, Flink, others…]
  10. APIS
     • Fluid - A Fluent API for Cascading
       − Targeted at application writers
       − https://github.com/Cascading/fluid
     • "Raw" Cascading API
       − Targeted at library writers, code generators, integration layers
       − https://github.com/Cascading/cascading
  11. COUNTING WORDS

         String docPath = args[ 0 ];
         String wcPath = args[ 1 ];

         Properties properties = new Properties();
         AppProps.setApplicationJarClass( properties, Main.class );
         FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

         // create source and sink taps
         Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
         Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
         ...
  12. COUNTING WORDS (CONT.)

         // specify a regex operation to split the "document" text lines into a token stream
         Fields token = new Fields( "token" );
         Fields text = new Fields( "text" );
         RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
         // only returns "token"
         Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

         // determine the word counts
         Pipe wcPipe = new Pipe( "wc", docPipe );
         wcPipe = new GroupBy( wcPipe, token );
         wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
         ...
  13. COUNTING WORDS (CONT.)

         // connect the taps, pipes, etc., into a flow
         FlowDef flowDef = FlowDef.flowDef()
           .setName( "word count" )
           .addSource( docPipe, docTap )
           .addTailSink( wcPipe, wcTap );

         Flow wcFlow = flowConnector.connect( flowDef );
         wcFlow.complete(); // ← runs the code
         }
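To sanity-check what the word-count flow on the slides should produce, it can help to have a tiny local reference. The following is a minimal plain-JDK sketch, not Cascading code: it reuses the delimiter regex from the `RegexSplitGenerator` on the slide and then groups and counts tokens in memory. The class name `WordCountSketch` and the method `countWords` are hypothetical, and skipping empty tokens is an assumption of this sketch; the Cascading flow itself may emit them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory reference for the slide's word count (no Cascading involved).
class WordCountSketch {
    // Same delimiter character class as the RegexSplitGenerator on the slide.
    static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Splits each line into tokens, then counts occurrences per token
    // (roughly the GroupBy + Count step of the flow).
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: ignore empty tokens
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "a rain shadow is a dry area",
            "rain (and more rain).");
        // Print tab-separated token/count pairs, similar in shape to a TSV sink.
        countWords(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

Running the same small input through the real flow and comparing the output against this reference is one way to check that the taps and pipes are wired as intended.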
  14. A FULL TOOLBOX
     • Operations: Function, Filter, Regex/Scripts, Boolean operators, Count/Limit/Last/First, Scripts, Unique, Asserts, Min/Max
     • Splices: GroupBy, CoGroup, HashJoin, Merge
     • Joins: left, right, outer, inner, mixed, custom
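For readers less familiar with the join types listed on this slide, the difference between an inner and a left join can be sketched in a few lines. This is a plain-JDK illustration, not Cascading's implementation: the class `JoinSketch` and its `join` method are hypothetical, and each side is assumed to hold at most one value per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of inner vs. left join semantics on keyed data.
class JoinSketch {
    // Joins left and right on their keys.
    // keepUnmatchedLeft = false: inner join (only keys present on both sides).
    // keepUnmatchedLeft = true:  left join (unmatched left rows get a null right value).
    static List<String> join(Map<String, String> left, Map<String, String> right,
                             boolean keepUnmatchedLeft) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> entry : left.entrySet()) {
            String rightValue = right.get(entry.getKey());
            if (rightValue != null) {
                rows.add(entry.getKey() + "\t" + entry.getValue() + "\t" + rightValue);
            } else if (keepUnmatchedLeft) {
                rows.add(entry.getKey() + "\t" + entry.getValue() + "\tnull");
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> users = new LinkedHashMap<>();
        users.put("1", "alice");
        users.put("2", "bob");
        Map<String, String> cities = new LinkedHashMap<>();
        cities.put("1", "Budapest");

        System.out.println("inner: " + join(users, cities, false));
        System.out.println("left:  " + join(users, cities, true));
    }
}
```

A right join is the same operation with the two sides swapped, and an outer join keeps unmatched rows from both sides.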
  15. A FULL TOOLBOX
     • data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra, Kinesis, Accumulo …
     • data formats: Avro, Parquet, ORC (+ACID), Thrift, Protobuf, CSV, TSV…
     • integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom
  16. OUTLOOK TO CASCADING 3.1+
     • improved serialization through strong typing
     • Cascading on Apache Flink
     • Cascading on Hazelcast
  17. LINK COLLECTION
     • http://www.cascading.org/
     • https://github.com/Cascading/
     • http://driven.io/
     • http://concurrentinc.com
     • https://groups.google.com/forum/#!forum/cascading-user
     • http://docs.cascading.org/tutorials/etl-log/
     • http://docs.cascading.org/cascading/3.0/userguide/html/