Cascading 3 and beyond

André Kelpe
September 28, 2015

Introduction to Cascading 3 given at Apache Big Data Europe 2015 in Budapest.

Transcript

  1. DRIVING INNOVATION THROUGH DATA
    CASCADING 3 AND BEYOND
    André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015

  2. SPEAKER
    André Kelpe
    Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
    http://concurrentinc.com / @concurrent
    @fs111

  3. INTRODUCTION
    http://cascading.org
    Apache-licensed Java framework for writing data-oriented applications
    production-ready, stable and battle-proven

  4. PHILOSOPHY

  5. PHILOSOPHY
    • developer productivity
    • users focus on business problems, not distributed systems knowledge
    • predictable runtime behaviour
    • fail fast

  6. PHILOSOPHY
    • stable user APIs
    • safe defaults with knobs for experts
    • batch workloads

  7. PHILOSOPHY
    • testability & robustness
    • production-quality applications rather than a collection of scripts
    • abstractions over interchangeable platforms

  8. TERMINOLOGY

  9. A SERIES OF PIPES
    https://www.flickr.com/photos/theilr/4283377543/sizes/l

  10. CASCADING TERMINOLOGY
    • Taps are sources and sinks for data
    • Schemes represent the format of the data
    • Pipes connect Taps

  11. CASCADING TERMINOLOGY
    ● Tuples flow through Pipes
    ● Fields describe the Tuples
    ● Operations are executed on Tuples in TupleStreams
    ● Pipes can be merged, spliced, joined etc.
    ● Pipe assemblies are reusable components

  12. CASCADING TERMINOLOGY
    ● A FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform
    ● Flows can be orchestrated via a Cascade
    ● Applications are Directed Acyclic Graphs (DAGs)
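
To make the DAG idea concrete, here is a minimal plain-Java sketch of ordering the steps of a directed acyclic graph so that each step runs only after all of its inputs (Kahn's algorithm). It does not use the Cascading API and is not how the QueryPlanner is implemented; the class name DagSketch, the method topologicalOrder and the node names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a toy DAG, not Cascading's planner.
class DagSketch {

    /** Returns the nodes in an order where every node comes after all of its inputs. */
    static List<String> topologicalOrder(Map<String, List<String>> edges) {
        // Count incoming edges for every node that appears as a source or a target.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String target : e.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Nodes without inputs (the sources) can run first.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            // Finishing a node releases one input of each downstream node.
            for (String target : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }

        // If some node was never released, the graph had a cycle and is not a DAG.
        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("graph contains a cycle");
        }
        return order;
    }
}
```

Passing a map such as sourceA → [join], sourceB → [join], join → [sink] yields an order in which both sources come before the join and the join before the sink.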

  13. DAG

  14. PLATFORMS

  15. CASCADING PLATFORMS
    local
    change 1 line of code, recompile, done.

  16. COMPILER ANALOGY
    [Diagram. Compiler side: User Code, Translation, Optimisation, Assembly, CPU Architecture. Cascading side: FlowDef, QueryPlanner/RuleEngine, and the platforms MR, Tez, Flink, others…]

  17. DAG

  18. A DAG RUNNING ON A PLATFORM

  19. REAL WORLD DAG
    https://github.com/cchepelov/wcplus
    https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1

  20. CODE EXAMPLE

  21. APIS
    ● Fluid - A Fluent API for Cascading
    − Targeted at application writers
    − https://github.com/Cascading/fluid
    ● "Raw" Cascading API
    − Targeted at library writers, code generators, integration layers
    − https://github.com/Cascading/cascading

  22. COUNTING WORDS
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    ...

  23. COUNTING WORDS (CONT.)
    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
    ...

  24. COUNTING WORDS (CONT.)
    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
    .setName( "word count" )
    .addSource( docPipe, docTap )
    .addTailSink( wcPipe, wcTap );
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.complete(); // ← runs the code
    }
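
For readers who want to see what the flow on the previous three slides computes without a cluster, the following is a rough single-JVM sketch in plain Java. It reuses the delimiter pattern from the slides but none of the Cascading classes. The class name WordCountSketch and the method countWords are hypothetical, and dropping empty tokens is an assumption of this sketch rather than something the slides state about RegexSplitGenerator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, no Cascading classes involved.
class WordCountSketch {

    // Same delimiter pattern as the RegexSplitGenerator on the slides.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    /** Splits each line into tokens, groups equal tokens and counts each group. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            // corresponds to the Each + RegexSplitGenerator step: one token per output element
            .flatMap(DELIMITERS::splitAsStream)
            // assumption of this sketch: empty tokens are dropped
            .filter(token -> !token.isEmpty())
            // corresponds to GroupBy on "token" followed by Every + Count
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (rain)", "shadow, rain.");
        countWords(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

The point of the Cascading version is that the same three logical steps (split, group, count) are declared once and then planned onto whichever platform the FlowConnector targets.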

  25. A FULL TOOLBOX
    ● Operations
    − Function
    − Filter
    − Regex/Scripts
    − Boolean operators
    − Count/Limit/Last/First
    − Unique
    − Asserts
    − Min/Max
    ● Splices
    − GroupBy
    − CoGroup
    − HashJoin
    − Merge
    ● Joins
    − Left, right, outer, inner, mixed, custom
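
As a reminder of what two of the join types listed above mean, here is a small plain-Java sketch of an inner join and a left join using the general hash-join technique: index one side by key, then probe it with the other side. It does not use the Cascading API, and the slides do not say whether Cascading's HashJoin works this way internally. The names JoinSketch and hashJoin and the tab-separated row format are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: join semantics in plain Java, not Cascading code.
class JoinSketch {

    /**
     * Joins rows of the form {key, value} on their key.
     * keepUnmatchedLeft = false gives an inner join, true gives a left join.
     */
    static List<String> hashJoin(List<String[]> left, List<String[]> right, boolean keepUnmatchedLeft) {
        // Build phase: index the right-hand rows by key.
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : right) {
            index.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
        }

        // Probe phase: look up every left-hand row in the index.
        List<String> joined = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = index.get(row[0]);
            if (matches != null) {
                for (String match : matches) {
                    joined.add(row[0] + "\t" + row[1] + "\t" + match);
                }
            } else if (keepUnmatchedLeft) {
                // A left join keeps the row and marks the missing right-hand value.
                joined.add(row[0] + "\t" + row[1] + "\tnull");
            }
        }
        return joined;
    }
}
```

With left rows (1, a) and (2, b) and a single right row (1, x), the inner join returns only the row for key 1, while the left join additionally keeps key 2 with a null placeholder.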

  26. A FULL TOOLBOX
    • data access: JDBC, HBase, Elasticsearch, Redshift, HDFS, S3, Cassandra, Kinesis, Accumulo …
    • data formats: Avro, Parquet, ORC (+ACID), Thrift, Protobuf, CSV, TSV …
    • integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom

  27. OUTLOOK TO CASCADING 3.1+
    • improved serialization through strong typing
    • Cascading on Apache Flink
    • Cascading on Hazelcast

  28. DON’T LIKE JAVA?
    • Clojure/logic programming: https://github.com/nathanmarz/cascalog
    • Clojure: https://github.com/Netflix/PigPen
    • Scala: https://github.com/twitter/scalding

  29. QUESTIONS?

  30. LINK COLLECTION
    • http://www.cascading.org/
    • https://github.com/Cascading/
    • http://driven.io/
    • http://concurrentinc.com
    • https://groups.google.com/forum/#!forum/cascading-user
    • http://docs.cascading.org/tutorials/etl-log/
    • http://docs.cascading.org/cascading/3.0/userguide/html/

  31. DRIVING INNOVATION THROUGH DATA
    CASCADING 3 AND BEYOND
    André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015