The Cascading (big) data application framework

André Kelpe
November 25, 2014

A talk about Cascading given at the Hadoop User Group France in Paris in November 2014


Transcript

  1. The Cascading
    (big) data
    application framework
    André Kelpe | HUG France | Paris | 25 November 2014

  2. Who am I?
    André Kelpe
    Senior Software Engineer at Concurrent,
    the company behind Cascading, Lingual and
    Driven
    http://concurrentinc.com / @concurrent
    [email protected] / @fs111

  3. http://cascading.org
    Apache-licensed Java framework for writing
    data-oriented applications
    production-ready, stable and battle-proven
    (SoundCloud, Twitter, Etsy, Climate Corp + many
    more)

  4. Cascading goals
    developer productivity
    focus on business problems, not distributed
    systems knowledge
    useful abstractions over underlying „fabrics“

  5. Cascading goals
    Testability & robustness
    production quality applications rather than a
    collection of scripts
    (hooks into the core for experts)

  6. https://www.flickr.com/photos/theilr/4283377543/sizes/l

  7. Cascading terminology
    Taps are sources and sinks for data
    Schemes represent the format of the data
    Pipes connect Taps

  8. Cascading terminology

    Tuples flow through Pipes

    Fields describe the Tuples

    Operations are executed on Tuples in
    TupleStreams

    A FlowConnector uses the QueryPlanner to
    translate a FlowDef into a Flow that runs on a
    computational fabric
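
The terminology above can be pictured with a toy analogy in plain Java. This is not the Cascading API: `ToyFlowSketch` and `run` are hypothetical names, the supplier and consumer only stand in for source and sink Taps, and the stream transformation stands in for a Pipe with its Operations. No planner or fabric is modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;
import java.util.stream.Stream;

// Toy analogy only: hypothetical names, no Cascading classes involved.
final class ToyFlowSketch {

    /** Reads records from the source, sends them through the pipe, hands each result to the sink. */
    static <T> void run(Supplier<Stream<T>> source, UnaryOperator<Stream<T>> pipe, Consumer<T> sink) {
        pipe.apply(source.get()).forEachOrdered(sink);
    }

    public static void main(String[] args) {
        List<String> collected = new ArrayList<>();
        ToyFlowSketch.<String>run(
            () -> Stream.of("a", "b", "c"),      // stands in for a source Tap
            s -> s.map(String::toUpperCase),     // stands in for a Pipe with one operation
            collected::add);                     // stands in for a sink Tap
        System.out.println(collected);           // prints [A, B, C]
    }
}
```

The point of the analogy is the separation of concerns: where data comes from and goes to is decided independently of what happens to it in between.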

  9. Compiler vs. QueryPlanner
    [Diagram comparing the QueryPlanner to a compiler]
    Compiler: User Code, Translation, Optimization, Assembly, CPU Architecture
    QueryPlanner: FlowDef, Hadoop, Tez, Spark

  10. User-APIs

    Fluid - A Fluent API for Cascading
    – Targeted at application writers
    – https://github.com/Cascading/fluid

    „Raw“ Cascading API
    – Targeted at library writers, code generators,
    integration layers
    – https://github.com/Cascading/cascading

  11. Counting words
    // configuration
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    ...

  12. Counting words (cont.)
    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter =
    new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
    ...
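
As a rough illustration of what the split, group and count steps above compute, here is a plain-Java sketch that uses only the standard library. It reuses the slide's delimiter pattern but none of the Cascading classes. `WordCountSketch` and `countTokens` are hypothetical names, and dropping empty tokens is an assumption of this sketch, not something the slide states.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Plain-Java illustration of the word count logic; not Cascading code.
final class WordCountSketch {

    // Same character class as on the slide: space, [ ] ( ) , and .
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    /** Splits every line into tokens, groups equal tokens and counts each group. */
    static Map<String, Long> countTokens(List<String> lines) {
        return lines.stream()
            .flatMap(DELIMITERS::splitAsStream)          // split step: one token per piece
            .filter(token -> !token.isEmpty())           // assumption: ignore empty pieces
            .collect(Collectors.groupingBy(              // group step, sorted by token
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be, or not to be", "be (quick).");
        System.out.println(countTokens(lines)); // prints {be=3, not=1, or=1, quick=1, to=2}
    }
}
```

The sketch runs in a single JVM and holds everything in memory. The pipe assembly on the slide describes the same steps declaratively so that a planner can map them onto a cluster.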

  13. Counting words (cont.)
    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.complete(); // ← runs the code
    }

  14. https://driven.cascading.io/driven/871A2C66DA1D4841B229CDD2B04B9FDA

  15. Impatient
    Cascading for the Impatient
    http://docs.cascading.org/impatient/index.html

  16. A full toolbox
    Operations
    – Function
    – Filter
    – Regex/Scripts
    – Boolean operators
    – Count/Limit/Last/First
    – Scripts
    – Unique
    – Asserts
    – Min/Max
    – …

    Splices
    – GroupBy
    – CoGroup
    – HashJoin
    – Merge

    Joins
    Left, right, outer, inner, mixed...
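
To make the join kinds listed above concrete, the following plain-Java sketch shows which keys survive an inner, left, right and outer join. It is a simplified illustration, not Cascading code: `JoinSketch`, `Kind` and `join` are hypothetical names, each side holds at most one value per key, and a missing side is rendered as `null`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java illustration of join semantics; hypothetical names, no Cascading classes.
final class JoinSketch {

    enum Kind { INNER, LEFT, RIGHT, OUTER }

    /** Joins two keyed inputs and returns one "key=left|right" row per surviving key, sorted by key. */
    static List<String> join(Map<String, String> left, Map<String, String> right, Kind kind) {
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> rows = new ArrayList<>();
        for (String key : keys) {
            boolean inLeft = left.containsKey(key);
            boolean inRight = right.containsKey(key);
            // A key present on both sides always survives; one-sided keys depend on the join kind.
            boolean keep = (inLeft && inRight)
                || (inLeft && (kind == Kind.LEFT || kind == Kind.OUTER))
                || (inRight && (kind == Kind.RIGHT || kind == Kind.OUTER));
            if (keep) {
                rows.add(key + "=" + left.get(key) + "|" + right.get(key));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> left = new HashMap<>();
        left.put("a", "1");
        left.put("b", "2");
        Map<String, String> right = new HashMap<>();
        right.put("b", "x");
        right.put("c", "y");
        for (Kind kind : Kind.values()) {
            System.out.println(kind + " " + join(left, right, kind));
        }
    }
}
```

With the sample data in `main`, the inner join keeps only `b`, the left join adds `a` with a missing right side, the right join adds `c` with a missing left side, and the outer join keeps all three keys.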

  17. A full toolbox
    data access: JDBC, HBase, Elasticsearch,
    Redshift, HDFS, S3, Cassandra...
    data formats: Avro, Thrift, Protobuf, CSV, TSV...
    integration points: Cascading Lingual (SQL),
    Apache Hive, classic MapReduce apps...
    not Java?: Scalding (Scala), Cascalog (Clojure)

  18. Status quo
    Cascading 2.6
    – Production release
    – Hadoop 2.x
    – Hadoop 1.x
    – Local mode

    Cascading 3.0
    – Public WIP builds
    – Tez
    – Hadoop 2.x
    – Hadoop 1.x
    – Local mode
    – Others (Spark...)

  19. Questions?
    [email protected]

  20. Link Collection
    http://www.cascading.org/
    https://github.com/Cascading/
    http://concurrentinc.com
    http://cascading.io/driven/
    https://groups.google.com/forum/#!forum/cascading-user
    http://docs.cascading.org/impatient/
    http://docs.cascading.org/cascading/2.6/userguide/html/

  21. fin.
