The Cascading (big) data application framework

André Kelpe
November 25, 2014

A talk about Cascading given at the Hadoop User Group France in Paris in November 2014

Transcript

  1. The Cascading (big) data application framework
    André Kelpe | HUG France | Paris | 25 November 2014
  2. Who am I?
    André Kelpe, Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
    http://concurrentinc.com / @concurrent
    andre@concurrentinc.com / @fs111
  3. http://cascading.org
    Apache-licensed Java framework for writing data-oriented applications
    production-ready, stable and battle-proven (SoundCloud, Twitter, Etsy, Climate Corp + many more)
  4. Cascading goals
    developer productivity
    focus on business problems, not distributed systems knowledge
    useful abstractions over underlying "fabrics"
  5. Cascading goals
    Testability & robustness
    production quality applications rather than a collection of scripts
    (hooks into the core for experts)
  6. https://www.flickr.com/photos/theilr/4283377543/sizes/l

  7. Cascading terminology
    Taps are sources and sinks for data
    Schemes represent the format of the data
    Pipes connect Taps
  8. Cascading terminology
    • Tuples flow through Pipes
    • Fields describe the Tuples
    • Operations are executed on Tuples in TupleStreams
    • The FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational fabric
  9. [Diagram comparing a compiler with the QueryPlanner. Labels: User Code, Translation, Optimization, Assembly, CPU Architecture; FlowDef, Hadoop, Tez, Spark]
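The diagram on this slide appears to set the QueryPlanner next to a compiler: one logical definition, several execution targets. A minimal plain-JDK sketch of that idea follows. It is not Cascading's API: `LogicalFlow`, `Fabric`, `SEQUENTIAL` and `PARALLEL` are hypothetical names invented for illustration, and the two toy "fabrics" are just sequential and parallel Java streams standing in for real backends.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class PlannerSketch {
    /** Hypothetical stand-in for a flow definition: a named, backend-neutral transformation. */
    static final class LogicalFlow {
        final String name;
        final Function<String, String> step;

        LogicalFlow(String name, Function<String, String> step) {
            this.name = name;
            this.step = step;
        }
    }

    /** Hypothetical stand-in for a computational fabric that can execute a LogicalFlow. */
    interface Fabric {
        List<String> run(LogicalFlow flow, List<String> input);
    }

    // Two toy fabrics: both execute the same logical flow, only the execution strategy differs.
    static final Fabric SEQUENTIAL =
        (flow, input) -> input.stream().map(flow.step).collect(Collectors.toList());
    static final Fabric PARALLEL =
        (flow, input) -> input.parallelStream().map(flow.step).collect(Collectors.toList());

    public static void main(String[] args) {
        LogicalFlow upper = new LogicalFlow("upper", String::toUpperCase);
        List<String> input = Arrays.asList("tap", "pipe", "tuple");
        // The application code above does not change when the fabric does.
        System.out.println(SEQUENTIAL.run(upper, input));
        System.out.println(PARALLEL.run(upper, input));
    }
}
```

The point of the sketch is only the separation of concerns: the flow is described once, and the choice of backend is made where it is run.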
  10. User-APIs
    • Fluid - A Fluent API for Cascading
      – Targeted at application writers
      – https://github.com/Cascading/fluid
    • "Raw" Cascading API
      – Targeted at library writers, code generators, integration layers
      – https://github.com/Cascading/cascading
  11. Counting words
    // configuration
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    ...
  12. Counting words (cont.)
    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
    ...
  13. Counting words (cont.)
    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.complete(); // ← runs the code
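To make the logic of the word-count flow easier to follow without a cluster, here is a rough plain-JDK sketch of the same three steps: split each line with the regex from the slide, group by token, count each group. It does not use Cascading at all; `WordCountSketch` and `countWords` are hypothetical names, and dropping empty tokens is a simplification of this sketch, not something the slides state about the Cascading flow.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class WordCountSketch {
    // Same delimiter pattern as the RegexSplitGenerator on the slide:
    // space, square brackets, parentheses, comma and period.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    /** Splits every line into tokens, groups equal tokens and counts each group. */
    static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: adjacent delimiters yield no token
                }
                counts.merge(token, 1L, Long::sum); // "group by token, then count"
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "A rain shadow is a dry area",
            "rain (in the shadow), rain.");
        // Print tab-separated token/count pairs, similar in shape to the TextDelimited sink.
        countWords(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

Counting is case-sensitive here, so "A" and "a" are separate tokens, just as they would be after a plain regex split.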
  14. https://driven.cascading.io/driven/871A2C66DA1D4841B229CDD2B04B9FDA

  15. Cascading for the Impatient
    http://docs.cascading.org/impatient/index.html

  16. A full toolbox
    • Operations
      – Function
      – Filter
      – Regex/Scripts
      – Boolean operators
      – Count/Limit/Last/First
      – Scripts
      – Unique
      – Asserts
      – Min/Max
      – …
    • Splices
      – GroupBy
      – CoGroup
      – HashJoin
      – Merge
    • Joins
      – Left, right, outer, inner, mixed...
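As a reminder of what two of the join flavours named on this slide mean, the following plain-JDK sketch contrasts an inner join with a left join on small in-memory lists. It is illustrative only and says nothing about how Cascading's CoGroup or HashJoin are implemented; `JoinSketch`, `join` and the `keepUnmatchedLeft` flag are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    /**
     * Joins two lists of {key, value} rows on the key.
     * keepUnmatchedLeft = false gives an inner join; true gives a left join,
     * where a left row without a partner is paired with the text "null".
     */
    static List<String> join(List<String[]> lhs, List<String[]> rhs, boolean keepUnmatchedLeft) {
        // Index the right-hand side by key so each left row can look up its partners.
        Map<String, List<String>> rightByKey = new HashMap<>();
        for (String[] row : rhs) {
            rightByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> joined = new ArrayList<>();
        for (String[] row : lhs) {
            List<String> partners = rightByKey.get(row[0]);
            if (partners != null) {
                for (String partner : partners) {
                    joined.add(row[0] + "," + row[1] + "," + partner);
                }
            } else if (keepUnmatchedLeft) {
                joined.add(row[0] + "," + row[1] + ",null");
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[] {"1", "alice"}, new String[] {"2", "bob"});
        List<String[]> cities = Arrays.asList(new String[] {"1", "paris"}, new String[] {"3", "lyon"});
        System.out.println(join(users, cities, false)); // inner join: only key 1 matches
        System.out.println(join(users, cities, true));  // left join: key 2 is kept with "null"
    }
}
```

A right join is the same operation with the two inputs swapped, and a full outer join additionally emits the right rows that found no partner.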
  17. A full toolbox
    data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra...
    data formats: Avro, Thrift, Protobuf, CSV, TSV...
    integration points: Cascading Lingual (SQL), Apache Hive, classical M/R apps...
    not Java?: Scalding (Scala), Cascalog (Clojure)
  18. Status quo
    • Cascading 2.6 – Production release
      • Hadoop 2.x
      • Hadoop 1.x
      • Local mode
    • Cascading 3.0 – public WIP builds
      • Tez
      • Hadoop 2.x
      • Hadoop 1.x
      • Local mode
      • Others (Spark...)
  19. Questions? andre@concurrentinc.com

  20. Link Collection
    http://www.cascading.org/
    https://github.com/Cascading/
    http://concurrentinc.com
    http://cascading.io/driven/
    https://groups.google.com/forum/#!forum/cascading-user
    http://docs.cascading.org/impatient/
    http://docs.cascading.org/cascading/2.6/userguide/html/

  21. fin.