Cascalog 2.0: Datalog in Realtime

Sam Ritchie
November 15, 2013

Cascalog is a logic programming library for Clojure that lets users run queries on enormous datasets using Apache Hadoop. Cascalog's 2.0 release introduces two new query compilation modes: a local mode for in-memory analysis and a Trident mode for running Cascalog queries on Storm. These new features make Cascalog extremely powerful at every corner of the big data trinity: testing, batch processing and realtime streaming.

This talk covers the design of Cascalog's flexible, functional logic DSL. I'll discuss its protocol-based design and show how Clojure's dynamic dispatch made it easy to add these new compilation modes. After you learn Cascalog, you'll wonder how you ever did data processing any other way.

Transcript

  1. CASCALOG 2.0
    Sam Ritchie :: @sritchie :: Clojure/Conj 2013
    Datalog in Realtime

  2. CASCALOG 2.0
    Sam Ritchie :: @sritchie :: Clojure/Conj 2013
    Datalog in Realtime

  3. AGENDA

  4. AGENDA
    • What is Cascalog? + Examples

  5. AGENDA
    • What is Cascalog? + Examples
    • Why Logic Programming?

  6. AGENDA
    • What is Cascalog? + Examples
    • Why Logic Programming?
    • How Cascalog Compiles Datalog to MapReduce

  7. AGENDA
    • What is Cascalog? + Examples
    • Why Logic Programming?
    • How Cascalog Compiles Datalog to MapReduce
    • Different Compilation Targets

  8. AGENDA
    • What is Cascalog? + Examples
    • Why Logic Programming?
    • How Cascalog Compiles Datalog to MapReduce
    • Different Compilation Targets
    • What’s Next?

  9. :)

  10. WHAT IS CASCALOG?

  11. WHAT IS CASCALOG?
    • Datalog DSL in [image] that helps you write [image]

  12. WHAT IS CASCALOG?
    • Datalog DSL in [image] that helps you write [image]
    • Tries to write analytics for you, using facts about your data

  13. WHAT IS CASCALOG?
    • Datalog DSL in [image] that helps you write [image]
    • Tries to write analytics for you, using facts about your data
    • Hadoop support enables petabyte scale ETL and analysis

  14. WHAT IS CASCALOG?
    • Datalog DSL in [image] that helps you write [image]
    • Tries to write analytics for you, using facts about your data
    • Hadoop support enables petabyte scale ETL and analysis
    • Batch and Hadoop only (until recently!)

  15. (slide has no text)

  16. (defn word-count [gen]
        (let [split (mapcatfn [^String sentence]
                      (.split sentence "\\s+"))]
          (<- [?word ?count]
              (gen ?text)
              (split ?text :> ?word)
              (c/count ?count))))

  17. MapReduce in Clojure
    (defn word-count [gen]
      (let [get-words (fn [^String sentence]
                        (.split sentence "\\s+"))]
        (->> gen
             (mapcat get-words)          ;; ("what" "does" "the" "fox" ....)
             (map (fn [word] [word 1]))  ;; (["what" 1] ["does" 1] ....)
             (group-by (fn [[word _]]
                         word))          ;; {"what" [["what" 1] ["what" 1]] ....}
             (map (fn [[k items]]
                    [k (reduce (fn [acc [_ count]]
                                 (+ acc count))
                               0
                               items)])))))
    ;; The above code produces:
    (word-count ["what does the fox say"
                 "what does that mean??"])
    ;;=> [["what" 2]
    ;;    ["does" 2]
    ;;    ["the" 1]
    ;;    ["fox" 1]
    ;;    ["say" 1]
    ;;    ["that" 1]
    ;;    ["mean??" 1]]

  18. MapReduce in Cascalog
    (defn word-count [gen]
      (let [split (mapcatfn [^String sentence]
                    (.split sentence "\\s+"))]
        (<- [?word ?count]
            (split ?text :> ?word)
            (gen ?text)
            (c/count ?count))))
    ;; The above code produces:
    (word-count ["what does the fox say"
                 "what does that mean??"])
    ;;=> [["what" 2]
    ;;    ["does" 2]
    ;;    ["the" 1]
    ;;    ["fox" 1]
    ;;    ["say" 1]
    ;;    ["that" 1]
    ;;    ["mean??" 1]]

  19. (defn modis-chunks
        "Takes a cascading source, and returns a number of tuples that fully
        describe chunks of MODIS data for the supplied datasets. Chunks are
        represented as seqs of floats. Be sure to convert chunks to vector
        before running any sort of data analysis, as seqs require linear
        time for lookups."
        [datasets chunk-size source]
        (let [ks ["SHORTNAME" "TileID" "RANGEBEGINNINGDATE"]
              chunkifier (p/chunkify chunk-size)]
          (<- [?datachunk]
              (source _ ?hdf)
              (unpack-modis [datasets] ?hdf :> ?dataset ?freetile)
              (raster-chunks [chunk-size] ?freetile :> ?chunkid ?chunk)
              (meta-values [ks] ?freetile :> ?productname ?tileid ?date)
              (split-id ?tileid :> ?mod-h ?mod-v)
              ((c/juxt #'spatial-res #'temporal-res) ?productname :> ?s-res ?t-res)
              (chunkifier ?dataset ?date ?s-res ?t-res ?mod-h ?mod-v ?chunkid ?chunk :> ?datachunk))))

  20. ;; Subquery structure
      (<- /* output-variables */
          /* 1-or-more predicates */)

  21. ;; Subquery structure
      (<- /* output-variables */
          /* 1-or-more predicates */)
      ;; outputs
      (<- [?word ?count]

  22. ;; Subquery structure
      (<- /* output-variables */
          /* 1-or-more predicates */)
      ;; outputs
      (<- [?word ?count]
          ;; generator
          (gen ?text)

  23. ;; Subquery structure
      (<- /* output-variables */
          /* 1-or-more predicates */)
      ;; outputs
      (<- [?word ?count]
          ;; generator
          (gen ?text)
          ;; operation
          (split ?text :> ?word)

  24. ;; Subquery structure
      (<- /* output-variables */
          /* 1-or-more predicates */)
      ;; outputs
      (<- [?word ?count]
          ;; generator
          (gen ?text)
          ;; operation
          (split ?text :> ?word)
          ;; aggregation
          (c/count ?count))

  25. ;; Cascalog Predicate
      (split ?text :> ?word)
      ;; "Operation": split
      ;; Inputs:      ?text
      ;; Outputs:     ?word

  26. ;; Cascalog Predicate
      [split "?text" :> "?word"]
      ;; "Operation": split
      ;; Inputs:      "?text"
      ;; Outputs:     "?word"
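
A minimal sketch of the idea on the two slides above: a predicate is just data, with the operation first and `:>` separating inputs from outputs. `parse-predicate` is a hypothetical helper written for illustration, not Cascalog's actual parser, and treating arguments with no `:>` as inputs is an assumption made here.

```clojure
(defn parse-predicate
  "Splits a predicate vector like [op in ... :> out ...] into its parts.
  Hypothetical helper for illustration; not Cascalog's real parser."
  [[op & args]]
  (let [;; everything before :> is an input, everything after is an output
        [inputs [_ & outputs]] (split-with #(not= :> %) args)]
    {:op      op
     :inputs  (vec inputs)
     :outputs (vec outputs)}))

(parse-predicate ['split "?text" :> "?word"])
;;=> {:op split, :inputs ["?text"], :outputs ["?word"]}
```

Because the predicate is plain data, a compiler can inspect and rearrange it before anything runs.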

  27. ;; Function from generator => subquery
      (defn word-count [gen]
        (let [split (mapcatfn [^String sentence]
                      (.split sentence "\\s+"))]
          (<- [?word ?count]          ;; <- outputs
              (gen ?text)             ;; <- generator
              (split ?text :> ?word)  ;; <- operation
              (c/count ?count))))     ;; <- aggregation

  28. Find the full name of every person following
    someone under 30.
    ;; [follower person]
    (def follows
      [["alice" "david"]
       ["alice" "bob"]
       ["alice" "emily"]
       ["bob" "david"]
       ["bob" "george"]
       ["bob" "luanne"]
       ["david" "alice"]
       ["david" "luanne"]
       ["emily" "alice"]
       ["emily" "bob"]
       ["emily" "george"]
       ["emily" "gary"]
       ["george" "gary"]
       ["harold" "bob"]
       ["luanne" "harold"]
       ["luanne" "gary"]])
    ;; [person age]
    (def age
      [["alice" 28]
       ["bob" 33]
       ["chris" 40]
       ["david" 25]
       ["emily" 25]
       ["george" 31]
       ["gary" 28]
       ["kumar" 27]
       ["luanne" 36]])
    ;; [follower full-name]
    (def full-names
      [["alice" "Alice Smith"]
       ["bob" "Bobby John Johnson"]
       ["chris" "CHRIS"]
       ["david" "A B C D E"]
       ["emily" "Emily Buchanan"]
       ["george" "George Jett"]])

  29. Find the full name of every person following
    someone under 30.
    (<- [?full-name]
        (follows ?follower ?person)
        (full-names ?follower ?full-name)
        (age ?person ?age)
        (:distinct true)
        (< ?age 30))

  30. Find the full name of every person following
    someone under 30.
    (<- [?full-name]
        (follows ?follower ?person)        ;; Gen
        (full-names ?follower ?full-name)  ;; Gen
        (age ?person ?age)                 ;; Gen
        (:distinct true)                   ;; Agg
        (< ?age 30))                       ;; Filter

  31. Find the full name of every person following
    someone under 30.
    (<- [?full-name]
        (follows ?follower ?person)        ;; Gen
        (full-names ?follower ?full-name)  ;; Gen
        (age ?person ?age)                 ;; Gen
        (:distinct true)                   ;; Agg
        (< ?age 30))                       ;; Filter
    ;; RESULTS
    ;; -----------------------
    ;; A B C D E
    ;; Alice Smith
    ;; Bobby John Johnson
    ;; Emily Buchanan
    ;; George Jett
    ;; -----------------------
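
For comparison, the same question can be answered in plain Clojure over the in-memory vectors from the dataset slide. This is a hand-written sketch of what the query means, not what Cascalog generates; `followers-of-under-30` is a hypothetical name, and the sort is added only to give a stable order.

```clojure
;; Data copied from the slides.
(def follows
  [["alice" "david"] ["alice" "bob"] ["alice" "emily"]
   ["bob" "david"] ["bob" "george"] ["bob" "luanne"]
   ["david" "alice"] ["david" "luanne"]
   ["emily" "alice"] ["emily" "bob"] ["emily" "george"] ["emily" "gary"]
   ["george" "gary"] ["harold" "bob"]
   ["luanne" "harold"] ["luanne" "gary"]])

(def age
  [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25] ["emily" 25]
   ["george" 31] ["gary" 28] ["kumar" 27] ["luanne" 36]])

(def full-names
  [["alice" "Alice Smith"] ["bob" "Bobby John Johnson"] ["chris" "CHRIS"]
   ["david" "A B C D E"] ["emily" "Emily Buchanan"] ["george" "George Jett"]])

(defn followers-of-under-30
  "Full names of everyone following someone under 30, sorted."
  []
  (let [age-of  (into {} age)          ; person   -> age
        name-of (into {} full-names)]  ; follower -> full name
    (->> follows
         ;; keep pairs whose followed person has a known age under 30
         (filter (fn [[_ person]]
                   (when-let [a (age-of person)] (< a 30))))
         ;; inner join with full-names: followers without a name drop out
         (keep (fn [[follower _]] (name-of follower)))
         distinct
         sort)))

(followers-of-under-30)
;;=> ("A B C D E" "Alice Smith" "Bobby John Johnson" "Emily Buchanan" "George Jett")
```

The hand-written version has to pick a join order and build lookup maps itself; the Cascalog query leaves those choices to the compiler.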

  32. ABSTRACTION LAYERS
    :’(

  33. ABSTRACTION LAYERS
    :’(
    :-\

  34. ABSTRACTION LAYERS
    :’(
    :-\
    └ʢ˒̾˒ʣ┐

  35. ABSTRACTION LAYERS
    └ʢ˒̾˒ʣ┐
    ?

  36. ABSTRACTION LAYERS
    └ʢ˒̾˒ʣ┐
    ?

  37. (slide has no text)

  38. Why Datalog?

  39. └ʢ˒̾˒ʣ┐

  40. “When you specify something to be
    designed, tell what properties you
    need, not how they are to be
    achieved.”
    -Fred Brooks, The Design of Design

  41. Compiling Cascalog

  42. Find the full name of every person following
    someone under 30.
    (<- [?full-name]
        (follows ?follower ?person)        ;; Gen
        (full-names ?follower ?full-name)  ;; Gen
        (age ?person ?age)                 ;; Gen
        (:distinct true)                   ;; Agg
        (< ?age 30))                       ;; Filter
    ;; RESULTS
    ;; -----------------------
    ;; A B C D E
    ;; Alice Smith
    ;; Bobby John Johnson
    ;; Emily Buchanan
    ;; George Jett
    ;; -----------------------

  43. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))

  44. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person

  45. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person
      ;; full-names: ?full-name ?person

  46. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person
      ;; full-names: ?full-name ?person
      ;; age: ?person ?age

  47. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person
      ;; full-names: ?full-name ?person
      ;; age: ?person ?age
      ;; insert 30: ?person ?age ?temp1

  48. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person
      ;; full-names: ?full-name ?person
      ;; age: ?person ?age
      ;; insert 30: ?person ?age ?temp1
      ;; (< ?age ?temp1): ?person ?age ?temp1

  49. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person
      ;; full-names: ?full-name ?person
      ;; age: ?person ?age
      ;; insert 30: ?person ?age ?temp1
      ;; (< ?age ?temp1): ?person ?age ?temp1
      ;; join on ?person: ?person ?follower ?full-name ?age ?temp1

  50. (<- [?full-name]
          (follows ?follower ?person)
          (full-names ?follower ?full-name)
          (age ?person ?age)
          (:distinct true)
          (< ?age 30))
      ;; follows: ?follower ?person
      ;; full-names: ?full-name ?person
      ;; age: ?person ?age
      ;; insert 30: ?person ?age ?temp1
      ;; (< ?age ?temp1): ?person ?age ?temp1
      ;; join on ?person: ?person ?follower ?full-name ?age ?temp1
      ;; distinct on ?full-name: ?full-name
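
The plan steps above can be approximated with `clojure.set` relations: insert the constant, filter, join, then project to distinct names. This is only a sketch under stated assumptions. It uses a trimmed subset of the slide data, joins `full-names` on `?follower` as the query text specifies, and the names `rel` and `run-plan` are hypothetical.

```clojure
(require '[clojure.set :as set])

(defn rel
  "Turns positional tuples into a set of maps keyed by field name."
  [fields rows]
  (set (map #(zipmap fields %) rows)))

;; Trimmed subset of the slide data (an assumption, to keep the sketch short).
(def follows
  (rel [:follower :person]
       [["alice" "david"] ["bob" "george"] ["harold" "bob"] ["luanne" "gary"]]))
(def age
  (rel [:person :age]
       [["david" 25] ["george" 31] ["bob" 33] ["gary" 28]]))
(def full-names
  (rel [:follower :full-name]
       [["alice" "Alice Smith"] ["bob" "Bobby John Johnson"]]))

(defn run-plan []
  (let [;; insert 30 as ?temp1 alongside ?person and ?age
        with-const (map #(assoc % :temp1 30) age)
        ;; (< ?age ?temp1)
        filtered   (set (filter #(< (:age %) (:temp1 %)) with-const))
        ;; natural joins: follows/age share :person, full-names shares :follower
        joined     (set/join (set/join follows filtered) full-names)]
    ;; distinct on ?full-name: projecting into a set removes duplicates
    (set/project joined [:full-name])))

(run-plan)
;;=> #{{:full-name "Alice Smith"}}
```

In this subset, luanne follows someone under 30 but has no entry in `full-names`, so the inner join drops her, just as the full query's results omit her.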

  51. Does this really look like [image]?

  52. (slide has no text)

  53. CUSTOM PLATFORMS IN 2.0
    Datalog
    Logical Plan
    Optimizers

  54. ;; we dispatch on type here so we can use function metadata.
      (defmulti to-predicate
        (fn [op input output]
          (type op)))
      (defprotocol ICouldFilter
        (filter? [this]
          "filter? returns true if, given no input or output signifier, the
          operation takes inputs by default, false if outputs by
          default."))
      (defprotocol IPlatform
        (generator? [this candidate]
          "Returns true if the supplied candidate can become a generator,
          false otherwise.")
        (generator [this candidate output-fields options]
          "Returns a tuple producer, in the world of the implementing
          Platform.")
        (plan [this query]
          "Accepts a Cascalog subquery and compiles it down into some notion
          of a plan in the target platform's world."))
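
To make the `IPlatform` contract concrete, here is a minimal sketch of a toy platform that implements it. `InMemoryPlatform` and its behaviour (tuples as maps, any sequential collection accepted as a generator, a stub `plan`) are assumptions for illustration; they are not one of Cascalog's real platforms.

```clojure
;; Protocol as shown on the slide, repeated so the sketch is self-contained.
(defprotocol IPlatform
  (generator? [this candidate]
    "Returns true if the supplied candidate can become a generator.")
  (generator [this candidate output-fields options]
    "Returns a tuple producer, in the world of the implementing Platform.")
  (plan [this query]
    "Compiles a subquery into some notion of a plan for this platform."))

;; Hypothetical platform: tuples are maps from field name to value.
(defrecord InMemoryPlatform []
  IPlatform
  (generator? [_ candidate]
    (sequential? candidate))
  (generator [_ candidate output-fields _options]
    (map #(zipmap output-fields %) candidate))
  (plan [_ query]
    ;; A real platform would compile the query; this stub only echoes it back.
    {:platform :in-memory :query query}))

(def platform (->InMemoryPlatform))

(generator? platform [["alice" 28]])
;;=> true
(generator platform [["alice" 28]] ["?person" "?age"] {})
;;=> ({"?person" "alice", "?age" 28})
```

The point of the design is that the query layer only talks to these three functions, so a new target needs a new record rather than changes to the DSL.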

  55. ;; default filters:
      (extend-protocol ICouldFilter
        Object
        (filter? [_] false)
        clojure.lang.Fn
        (filter? [_] true)
        clojure.lang.Var
        (filter? [v] (fn? @v))
        clojure.lang.MultiFn
        (filter? [_] true))

  56. ;; default filters:
      (extend-protocol ICouldFilter
        Object
        (filter? [_] false)
        clojure.lang.Fn
        (filter? [_] true)
        clojure.lang.Var
        (filter? [v] (fn? @v))
        clojure.lang.MultiFn
        (filter? [_] true))
      ;; Cascading extension:
      (extend-protocol p/ICouldFilter
        cascading.operation.Filter
        (filter? [_] true))
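
A self-contained sketch of how this dispatch behaves at the REPL. The protocol and default extensions are copied from the slides; `ThresholdFilter` is a hypothetical record standing in for a platform-specific class such as `cascading.operation.Filter`, so the example runs without Cascading on the classpath.

```clojure
(defprotocol ICouldFilter
  (filter? [this]
    "True if, given no input or output signifier, the operation takes
    inputs by default."))

;; default filters, as on the slide
(extend-protocol ICouldFilter
  Object
  (filter? [_] false)
  clojure.lang.Fn
  (filter? [_] true)
  clojure.lang.Var
  (filter? [v] (fn? @v))
  clojure.lang.MultiFn
  (filter? [_] true))

;; Hypothetical platform-specific operation type.
(defrecord ThresholdFilter [limit])

;; A platform opts its own types in without touching the defaults.
(extend-protocol ICouldFilter
  ThresholdFilter
  (filter? [_] true))

(filter? <)                       ;;=> true  (a plain function)
(filter? #'<)                     ;;=> true  (a var holding a function)
(filter? "not an operation")      ;;=> false (falls through to Object)
(filter? (->ThresholdFilter 30))  ;;=> true  (added by the extension)
```

Because the extension lives in the platform's own namespace, adding a compilation target does not require editing the core protocol definitions.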

  57. POSSIBLE PLATFORMS

  58. POSSIBLE PLATFORMS
    • “Cascalog in the Small”: Native Clojure

  59. POSSIBLE PLATFORMS
    • “Cascalog in the Small”: Native Clojure
    • ClojureScript?

  60. POSSIBLE PLATFORMS
    • “Cascalog in the Small”: Native Clojure
    • ClojureScript?
    • core.async?

  61. POSSIBLE PLATFORMS
    • “Cascalog in the Small”: Native Clojure
    • ClojureScript?
    • core.async?
    • Storm

  62. TAKEAWAYS

  63. TAKEAWAYS
    • Let the system reduce complexity for you.

  64. TAKEAWAYS
    • Let the system reduce complexity for you.
    • Use the properties of your data

  65. TAKEAWAYS
    • Let the system reduce complexity for you.
    • Use the properties of your data
    • Share data by sharing code!

  66. Sam Ritchie :: @sritchie :: Clojure/Conj 2013
    Questions?
