NYC Clojure Meetup Apr '14

Jon Sondag
April 14, 2014

Cascalog at Intent Media

Transcript

  1. Overview • About • Why we use Cascalog • Query execution, example queries • How we work with Cascalog • Alternatives Tuesday, April 15, 14
  2. About Intent Media • Ad tech • 70 people, 35 engineers, 8 data team • Site optimization for e-commerce
  3. Auction

       Advertiser   Bid   Est. Click Through Rate   Est. Revenue
       Airline1     $10   3%                        $0.30
       Airline2     $5    5%                        $0.25
       Airline3     $8    10%                       $0.80
       Airline4     $2    30%                       $0.60
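The estimated revenue column in the auction table is consistent with bid multiplied by estimated click-through rate. A minimal sketch of that arithmetic in plain Clojure follows. The names `bids`, `est-revenue-cents` and `rank-by-est-revenue` are hypothetical, and ranking advertisers by estimated revenue is an assumption; the slide only shows the numbers.

```clojure
;; Rows from the slide: [advertiser bid-in-dollars estimated-CTR-in-percent].
(def bids
  [["Airline1" 10 3]
   ["Airline2" 5 5]
   ["Airline3" 8 10]
   ["Airline4" 2 30]])

(defn est-revenue-cents
  "Estimated revenue per impression in cents (dollars x percent = cents).
  Integer arithmetic avoids floating-point rounding."
  [[_ bid-dollars ctr-percent]]
  (* bid-dollars ctr-percent))

(defn rank-by-est-revenue
  "Returns [advertiser est-revenue-cents] pairs, highest first.
  Ranking this way is an assumption, not something the slide states."
  [rows]
  (->> rows
       (map (fn [row] [(first row) (est-revenue-cents row)]))
       (sort-by second >)))

(def ranking (rank-by-est-revenue bids))
```

Here the highest bid (Airline1, $10) does not have the highest estimated revenue; Airline3 does, because its click-through rate is higher.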
  4. High Level Pipeline [figure: pipeline diagram with example numeric values, e.g. 1 0.5 0.8 0.1 ...]
  5. Operations • Pipes • Taps: FileTap, Hfs, GlobHfs, MultiSourceTap, MultiSinkTap • Schemes: TextLine, TextDelimited, SequenceFile, WritableSequenceFile
  6. Query Execution • Pre-Aggregation: Join generators, apply operators where possible • Aggregation: Fix outputs that we have already, aggregate the rest • Post-Aggregation: Apply remaining operators
  7. Query Execution Pre-Aggregation (Generators)

       [["impression-1" "buy this product"]
        ["impression-2" "great deal"]
        ["impression-3" "cheap sale"]
        ["impression-4" "cheap sale"]]

       [["click-1" "impression-3" 0]
        ["click-2" "impression-2" 0]
        ["click-3" "impression-2" 100]]
  8. Query Execution Pre-Aggregation (Join)

       [["impression-1" "buy this product" nil nil]
        ["impression-2" "great deal" "click-2" 0]
        ["impression-2" "great deal" "click-3" 100]
        ["impression-3" "cheap sale" "click-1" 0]
        ["impression-4" "cheap sale" nil nil]]
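The join output keeps impressions that have no clicks and pads them with nil, which is the behavior of a left outer join on the impression id. The following is a plain-Clojure sketch of that behavior using the slide data. It is not Cascalog code, and `left-outer-join` is a hypothetical helper name. In Cascalog itself, an outer join of this kind is typically expressed with `!!`-prefixed variables.

```clojure
;; Generators from the slides.
(def impressions
  [["impression-1" "buy this product"]
   ["impression-2" "great deal"]
   ["impression-3" "cheap sale"]
   ["impression-4" "cheap sale"]])

(def clicks
  [["click-1" "impression-3" 0]
   ["click-2" "impression-2" 0]
   ["click-3" "impression-2" 100]])

(defn left-outer-join
  "Pairs each impression with each of its clicks.
  Impressions without clicks yield one row with nil click fields."
  [impressions clicks]
  (let [by-impression (group-by second clicks)]
    (vec
     (for [[imp-id text] impressions
           [click-id _ value] (get by-impression imp-id [[nil nil nil]])]
       [imp-id text click-id value]))))

(def joined (left-outer-join impressions clicks))
```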
  9. Query Execution Pre-Aggregation (Operations)

       [["impression-1" ["buy" "this" "product"] nil nil]
        ["impression-2" ["great" "deal"] "click-2" 0]
        ["impression-2" ["great" "deal"] "click-3" 100]
        ["impression-3" ["cheap" "sale"] "click-1" 0]
        ["impression-4" ["cheap" "sale"] nil nil]]
  10. Query Execution Aggregation

       [["impression-1" ["buy" "this" "product"] 0]
        ["impression-2" ["great" "deal"] 100]
        ["impression-3" ["cheap" "sale"] 0]
        ["impression-4" ["cheap" "sale"] 0]]
  11. Query Execution Post-Aggregation (Operations)

       [["impression-1" ["buy" "this" "product"] -1]
        ["impression-2" ["great" "deal"] 1]
        ["impression-3" ["cheap" "sale"] -1]
        ["impression-4" ["cheap" "sale"] -1]]
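The stages after the join can be traced end to end in plain Clojure. This is a sketch, not Cascalog code. The slides do not name the operations, so splitting the text on spaces, summing click values with nil counted as 0, and labeling positive totals as 1 and everything else as -1 are assumptions chosen to reproduce the data shown. All function names are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Output of the join stage, as shown on the slides.
(def joined
  [["impression-1" "buy this product" nil nil]
   ["impression-2" "great deal" "click-2" 0]
   ["impression-2" "great deal" "click-3" 100]
   ["impression-3" "cheap sale" "click-1" 0]
   ["impression-4" "cheap sale" nil nil]])

(defn tokenize-text
  "Pre-aggregation operation: split the ad text into words."
  [rows]
  (mapv (fn [[imp-id text click-id value]]
          [imp-id (str/split text #" ") click-id value])
        rows))

(defn sum-click-values
  "Aggregation: group by the non-aggregated fields (impression id and words)
  and sum the click values, counting nil as 0."
  [rows]
  (->> rows
       (group-by (fn [[imp-id words]] [imp-id words]))
       (map (fn [[[imp-id words] group]]
              [imp-id words (reduce + (map #(or (nth % 3) 0) group))]))
       (sort-by first)
       vec))

(defn label-totals
  "Post-aggregation operation: 1 for a positive total, -1 otherwise."
  [rows]
  (mapv (fn [[imp-id words total]]
          [imp-id words (if (pos? total) 1 -1)])
        rows))

(def tokenized  (tokenize-text joined))
(def aggregated (sum-click-values tokenized))
(def labeled    (label-totals aggregated))
```

The tokenizing step runs before aggregation because it needs only per-row inputs, while the labeling step has to wait because it depends on the aggregated total.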
  12. Built-In Filter Ops • first-n • limit • limit-rank • fixed-sample • fixed-sample-agg • re-parse
  13. Built-In Agg Ops • avg • count/!count • distinct-count • max • min • sum
  14. Built-In Higher Order Functions • all • any • comp • each • negate • partial
  15. Custom Function Types • mapfn • mapcatfn • filterfn • bufferfn • aggregatefn
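These function types differ mainly in how many tuples they consume and emit. The following plain-Clojure analogy illustrates each shape. It does not use the Cascalog macros, and every function name in it is hypothetical.

```clojure
(require '[clojure.string :as str])

(def texts ["buy this product" "great deal" "cheap sale" "cheap sale"])

;; map-style: exactly one output per input.
(defn word-count [text]
  (count (str/split text #" ")))

;; mapcat-style: zero or more outputs per input.
(defn tokenize [text]
  (str/split text #" "))

;; filter-style: a predicate that keeps or drops each input.
(defn mentions-sale? [text]
  (str/includes? text "sale"))

;; buffer-style: sees all values of a group at once.
(defn first-two [group]
  (take 2 group))

;; aggregate-style: init (0 args), step (state + value), finish (state).
(defn count-agg
  ([] 0)
  ([state _value] (inc state))
  ([state] state))

(def counts      (mapv word-count texts))
(def words       (vec (mapcat tokenize texts)))
(def sale-texts  (filterv mentions-sale? texts))
(def head        (vec (first-two texts)))
(def total       (count-agg (reduce count-agg (count-agg) texts)))
```

The buffer and aggregate shapes both operate on groups. The aggregate shape processes one value at a time and keeps only a running state, whereas the buffer shape receives the whole group.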
  16. defparallelagg: Move execution from the reducer to the combiners • Reducer only: Map emits 1 2 3 5 8 11; Reduce computes (+ 1 2 3 5 8 11) • With combiners: Map emits 1 2 3 5 8 11, combined into 6 and 24; Reduce computes (+ 6 24)
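The arithmetic on this slide can be checked with a small plain-Clojure sketch. It assumes the six values are split across two map tasks as [1 2 3] and [5 8 11], which matches the partial sums 6 and 24. It simulates the idea only and does not call `defparallelagg`, which, as far as I know, is given an init function and a combine function (here the combine function would be `+`).

```clojure
;; Assumed split of the map output across two map tasks.
(def partitions [[1 2 3] [5 8 11]])

(defn reduce-only
  "All values travel to the reducer, which sums them in one pass."
  [partitions]
  (reduce + (apply concat partitions)))

(defn partial-sums
  "Combiner step: each map task sums its own values locally."
  [partitions]
  (mapv #(reduce + %) partitions))

(defn with-combiners
  "The reducer only has to add the partial sums."
  [partitions]
  (reduce + (partial-sums partitions)))

(def partials (partial-sums partitions))
(def total-a  (reduce-only partitions))
(def total-b  (with-combiners partitions))
```

This works because addition is associative and commutative, so summing partial sums gives the same result while sending two values to the reducer instead of six.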
  17. midje-cascalog • Macros for testing Cascalog queries • Use produces function to check outputs
  18. Workflow • On a sampled dataset: unit test the components, end-to-end test the workflow • Then, test on the cluster