SELECT ALL THE THINGS - Cascading Lingual, ANSI SQL for Apache Hadoop

Slide deck of a talk I gave about Cascading and Lingual at the TU Berlin.

André Kelpe

January 13, 2014

Transcript

  1. SELECT ALL THE THINGS!
    Cascading Lingual
    ANSI SQL for Apache Hadoop
    TU-Berlin, January 13th 2014
    André Kelpe
    concurrentinc.com

  2. Speaker
    André Kelpe
    Software Engineer at Concurrent
    The company behind Cascading and Lingual
    concurrentinc.com / @concurrent
    @fs111

  3. Agenda
    Cascading and Lingual
    Lingual: design goals
    Lingual: features
    Demo: Lingual in action
    Q&A

  4. Cascading
    http://cascading.org

  5. Cascading terminology
    Taps are sources and sinks for data
    Schemes represent the format of the data
    Pipes connect Taps

  6. Cascading terminology
    Tuples flow through Pipes
    Fields describe the Tuples
    Operations are executed on Tuples in TupleStreams
    Flows get scheduled and executed

  7. Word Count 1/4
    // define source and sink Taps.
    Scheme sourceScheme = new TextLine( new Fields( "line" ) );
    Tap source = new Hfs( sourceScheme, inputPath );
    Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
    // the 'head' of the pipe assembly
    Pipe assembly = new Pipe( "wordcount" );

  8. Word Count 2/4
    // For each input Tuple
    // parse out each word into a new Tuple with the field name "word"
    // regular expressions are optional in Cascading
    String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function function = new RegexGenerator( new Fields( "word" ), regex );
    assembly = new Each( assembly, new Fields( "line" ), function );
    // group the Tuple stream by the "word" value
    assembly = new GroupBy( assembly, new Fields( "word" ) );

  9. Word Count 3/4
    // For every Tuple group
    // count the number of occurrences of "word" and store result in
    // a field named "count"
    Aggregator count = new Count( new Fields( "count" ) );
    assembly = new Every( assembly, count );
    // initialize app properties, tell Hadoop which jar file to use
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, Main.class );

  10. Word Count 4/4
    // plan a new Flow from the assembly using the source and sink Taps
    // with the above properties
    FlowConnector flowConnector = new FlowConnector( properties );
    Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
    // execute the flow, block until complete
    flow.complete();
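
For readers without a Hadoop setup, the logic the word-count flow expresses (split each line into words, group by word, count each group) can be sketched in plain Java with no Cascading dependency. This is only an in-memory illustration, not how Cascading executes a flow: the class name `WordCountSketch` and the method `countWords` are made up for this example, and the pattern `\p{L}+` is a simplified stand-in for the regular expression used on the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCountSketch {

    // Simplified word pattern (an assumption): one or more Unicode letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Hypothetical helper: returns word -> number of occurrences, sorted by word.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {                     // one input "tuple" per line
            Matcher matcher = WORD.matcher(line);
            while (matcher.find()) {                    // emit one "word" per match
                // grouping by word and counting collapse into a single map update
                counts.merge(matcher.group(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(List.of("to be or not to be")));
    }
}
```

The Cascading version describes the same three steps as a pipe assembly, which lets the planner turn them into jobs that run across a cluster instead of a single loop in one JVM.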

  11. Not just Java!

  12. Scalding
    Scala DSL on top of Cascading
    Developed by Twitter
    https://github.com/twitter/scalding/

  13. Cascalog
    Clojure DSL on top of Cascading
    Inspired by Datalog
    https://github.com/nathanmarz/cascalog

  14. ANSI SQL via Lingual
    http://www.cascading.org/lingual/

  15. Lingual – design goals 1/3
    Immediate Data Access
    SQL access via Shell or JDBC driver
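
As a rough illustration of the JDBC route, the sketch below uses only the standard `java.sql` API. Several details are assumptions rather than statements from the talk: the URL shape `jdbc:lingual:<platform>;schemas=<path>`, the example data directory, and the expectation that the Lingual JDBC driver jar is on the classpath. The class `LingualJdbcSketch` and its helpers `jdbcUrl` and `countRows` are hypothetical names for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class LingualJdbcSketch {

    // Assumed URL shape: jdbc:lingual:<platform>;schemas=<directory holding the table files>
    static String jdbcUrl(String platform, String schemaDir) {
        return "jdbc:lingual:" + platform + ";schemas=" + schemaDir;
    }

    // Runs one query through plain JDBC and returns the number of rows read.
    static int countRows(String url, String sql) throws SQLException {
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            int n = 0;
            while (rows.next()) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) {
        // Path and table name are illustrative assumptions.
        String url = jdbcUrl("local", "src/main/resources/data/example");
        String sql = "select * from \"example\".\"employee\"";
        try {
            System.out.println(countRows(url, sql) + " rows");
        } catch (SQLException e) {
            // Reached when no matching JDBC driver is on the classpath or the query fails.
            System.err.println("query failed: " + e.getMessage());
        }
    }
}
```

Because the code depends only on `java.sql`, existing JDBC-based tools should be able to talk to Lingual in the same way once the driver is available.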

  16. Lingual – design goals 2/3
    Simplify SQL Migration
    Move SQL workflows onto your Hadoop cluster
    via Cascading flows or JDBC driver

  17. Lingual – design goals 3/3
    Simplify System & Data Integration
    Read from and write to HDFS, JDBC, memcached,
    HBase, Redshift...

  18. Lingual – ANSI SQL on Cascading
    http://www.cascading.org/lingual/

  19. Lingual and Cascading are about batch processing large amounts of data

  20. Demo

  21. SQL in Cascading 1/2
    String statement = "select *\n"
    + "from \"example\".\"sales_fact_1997\" as s\n"
    + "join \"example\".\"employee\" as e\n"
    + "on e.\"EMPID\" = s.\"CUST_ID\"";
    Tap empTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "src/main/resources/data/example/employee.tcsv", SinkMode.KEEP );
    Tap salesTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "src/main/resources/data/example/sales_fact_1997.tcsv",
    SinkMode.KEEP );
    Tap resultsTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "build/test/output/flow/results.tcsv", SinkMode.REPLACE );

  22. SQL in Cascading 2/2
    FlowDef flowDef = FlowDef.flowDef()
    .setName( "sql flow" )
    .addSource( "example.employee", empTap )
    .addSource( "example.sales_fact_1997", salesTap )
    .addSink( "results", resultsTap );
    SQLPlanner sqlPlanner = new SQLPlanner()
    .setSql( statement );
    flowDef.addAssemblyPlanner( sqlPlanner );
    Flow flow = new LocalFlowConnector().connect( flowDef );
    flow.complete();

  23. Lingual – ANSI SQL on Cascading
    http://www.cascading.org/lingual/

  24. Q&A
    @fs111 / http://concurrentinc.com
    http://cascading.org/lingual

  25. Link collection
    http://www.cascading.org/lingual/
    http://www.cascading.org/
    http://docs.cascading.org/lingual/1.0/
    https://github.com/Cascading/
    http://concurrentinc.com
    https://groups.google.com/forum/#!forum/lingual-user
    https://groups.google.com/forum/#!forum/cascading-user
    http://docs.cascading.org/impatient/
    https://github.com/Cascading/vagrant-cascading-hadoop-cluster
