NYC Clojure Meetup Apr '14

Jon Sondag
April 14, 2014

Cascalog at Intent Media

Transcript

  1. Overview
    • About
    • Why we use Cascalog
    • Query execution, Example queries
    • How we work with Cascalog
    • Alternatives
    Tuesday, April 15, 14
  2. About Intent Media
    • Ad tech
    • 70 people, 35 engineers, 8 on the data team
    • Site optimization for e-commerce
  3. Auction

    Advertiser   Bid   Est. Click Through Rate   Est. Revenue
    Airline1     $10   3%                        $0.30
    Airline2     $5    5%                        $0.25
    Airline3     $8    10%                       $0.80
    Airline4     $2    30%                       $0.60
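Every Est. Revenue value on this slide equals the bid multiplied by the estimated click-through rate. A minimal Python sketch of that arithmetic follows. The ranking by expected revenue, the reading of the bid as a per-click amount, and the names `Bid` and `expected_revenue` are assumptions for illustration and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    advertiser: str
    bid_usd: float   # bid in dollars (assumed to be a per-click amount)
    est_ctr: float   # estimated click-through rate, as a fraction


def expected_revenue(b: Bid) -> float:
    """Expected revenue per impression: bid times estimated CTR."""
    return b.bid_usd * b.est_ctr


bids = [
    Bid("Airline1", 10.0, 0.03),
    Bid("Airline2", 5.0, 0.05),
    Bid("Airline3", 8.0, 0.10),
    Bid("Airline4", 2.0, 0.30),
]

# Rank by expected revenue, highest first (assumed ranking rule).
ranked = sorted(bids, key=expected_revenue, reverse=True)
for b in ranked:
    print(f"{b.advertiser}: ${expected_revenue(b):.2f}")
```

Under that assumed rule Airline3 ranks first even though Airline1 bids more, because its estimated click-through rate is higher.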
  4. High Level Pipeline
    [Figure: pipeline diagram with example numeric values, e.g. 1 0.5 0.8 0.1 0.2 0.3 ...]
  5. • Operations
    • Pipes
    • Taps: FileTap, Hfs, GlobHfs, MultiSourceTap, MultiSinkTap
    • Schemes: TextLine, TextDelimited, SequenceFile, WritableSequenceFile
  6. Query Execution
    • Pre-Aggregation: Join generators, apply operators where possible
    • Aggregation: Fix outputs that we have already, aggregate the rest
    • Post-Aggregation: Apply remaining operators
  7. Query Execution
    Pre-Aggregation (Generators)
    [["impression-1" "buy this product"]
     ["impression-2" "great deal"]
     ["impression-3" "cheap sale"]
     ["impression-4" "cheap sale"]]

    [["click-1" "impression-3" 0]
     ["click-2" "impression-2" 0]
     ["click-3" "impression-2" 100]]
  8. Query Execution
    Pre-Aggregation (Join)
    [["impression-1" "buy this product" nil nil]
     ["impression-2" "great deal" "click-2" 0]
     ["impression-2" "great deal" "click-3" 100]
     ["impression-3" "cheap sale" "click-1" 0]
     ["impression-4" "cheap sale" nil nil]]
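The joined rows on this slide keep every impression, including those with no clicks, whose click fields are nil. A plain-Python sketch of that left outer join is shown below. It models the semantics only and is not Cascalog code; the name `left_join` is hypothetical, and `None` stands in for Clojure's nil.

```python
from collections import defaultdict

impressions = [
    ("impression-1", "buy this product"),
    ("impression-2", "great deal"),
    ("impression-3", "cheap sale"),
    ("impression-4", "cheap sale"),
]
clicks = [
    ("click-1", "impression-3", 0),
    ("click-2", "impression-2", 0),
    ("click-3", "impression-2", 100),
]


def left_join(impressions, clicks):
    """Join clicks onto impressions by impression id, keeping unmatched impressions."""
    clicks_by_impression = defaultdict(list)
    for click_id, impression_id, value in clicks:
        clicks_by_impression[impression_id].append((click_id, value))

    joined = []
    for impression_id, text in impressions:
        # An impression with no clicks still yields one row, padded with None.
        matches = clicks_by_impression.get(impression_id, [(None, None)])
        for click_id, value in matches:
            joined.append((impression_id, text, click_id, value))
    return joined


joined = left_join(impressions, clicks)
for row in joined:
    print(row)
```

An impression with two clicks, such as impression-2, appears twice in the result, once per matching click.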
  9. Query Execution
    Pre-Aggregation (Operations)
    [["impression-1" ["buy" "this" "product"] nil nil]
     ["impression-2" ["great" "deal"] "click-2" 0]
     ["impression-2" ["great" "deal"] "click-3" 100]
     ["impression-3" ["cheap" "sale"] "click-1" 0]
     ["impression-4" ["cheap" "sale"] nil nil]]
  10. Query Execution
    Aggregation
    [["impression-1" ["buy" "this" "product"] 0]
     ["impression-2" ["great" "deal"] 100]
     ["impression-3" ["cheap" "sale"] 0]
     ["impression-4" ["cheap" "sale"] 0]]
  11. Query Execution
    Post-Aggregation (Operations)
    [["impression-1" ["buy" "this" "product"] -1]
     ["impression-2" ["great" "deal"] 1]
     ["impression-3" ["cheap" "sale"] -1]
     ["impression-4" ["cheap" "sale"] -1]]
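Slides 9 to 11 can be traced end to end in plain Python, starting from the joined rows. This is a sketch of the semantics only, not Cascalog code. Three details are assumptions that happen to reproduce the slide values: the text is split on whitespace, the aggregation is a sum that treats nil as 0 (a max would give the same numbers here), and the final operation maps a positive total to 1 and anything else to -1.

```python
joined = [
    ("impression-1", "buy this product", None, None),
    ("impression-2", "great deal", "click-2", 0),
    ("impression-2", "great deal", "click-3", 100),
    ("impression-3", "cheap sale", "click-1", 0),
    ("impression-4", "cheap sale", None, None),
]

# Pre-aggregation operation: tokenize the text of every row (assumed whitespace split).
tokenized = [(imp, tuple(text.split()), click, value)
             for imp, text, click, value in joined]

# Aggregation: group by (impression, tokens) and sum the click values,
# counting a missing value (None) as 0. The choice of sum is an assumption.
totals = {}
for imp, tokens, _click, value in tokenized:
    key = (imp, tokens)
    totals[key] = totals.get(key, 0) + (value or 0)
aggregated = [(imp, list(tokens), total) for (imp, tokens), total in totals.items()]

# Post-aggregation operation: turn each total into a +1 / -1 label (assumed rule).
labeled = [(imp, tokens, 1 if total > 0 else -1) for imp, tokens, total in aggregated]

for row in labeled:
    print(row)
```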
  12. Built-In Filter Ops
    • first-n
    • limit
    • limit-rank
    • fixed-sample
    • fixed-sample-agg
    • re-parse
  13. Built-In Agg Ops
    • avg
    • count/!count
    • distinct-count
    • max
    • min
    • sum
  14. Built-In Higher Order Functions
    • all
    • any
    • comp
    • each
    • negate
    • partial
  15. Custom Function Types
    • mapfn
    • mapcatfn
    • filterfn
    • bufferfn
    • aggregatefn
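As a rough mental model of how these five kinds differ, the plain-Python analogues below show the shape of each: one tuple to one tuple, one tuple to many, a predicate, a function over a whole group, and a fold with separate init, step and finish phases. This is an analogy rather than Cascalog's API, and every function and class name here is hypothetical.

```python
def split_words(text):          # mapcatfn-like: one input, zero or more outputs
    return text.split()


def word_length(word):          # mapfn-like: one input, exactly one output
    return len(word)


def is_long(word):              # filterfn-like: keep the tuple when truthy
    return len(word) > 4


def top_two(group):             # bufferfn-like: sees a whole group at once
    return sorted(group, reverse=True)[:2]


class Mean:                     # aggregatefn-like: init, step, finish
    def init(self):
        return (0, 0)           # (running sum, count)

    def step(self, state, value):
        total, n = state
        return (total + value, n + 1)

    def finish(self, state):
        total, n = state
        return total / n if n else None


words = [w for text in ["cheap sale", "great deal today"] for w in split_words(text)]
lengths = [word_length(w) for w in words]
long_words = [w for w in words if is_long(w)]

agg = Mean()
state = agg.init()
for n in lengths:
    state = agg.step(state, n)

print(words)
print(long_words)
print(top_two(lengths))
print(agg.finish(state))
```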
  16. defparallelagg
    Move execution from the reducer to the combiners
    [Figure: two Map → Reduce diagrams over the inputs 1 2 3 5 8 11. In the first, Reduce evaluates (+ 1 2 3 5 8 11); in the second, Reduce evaluates (+ 6 24).]
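The numbers on this slide can be reproduced with a small plain-Python model of combiner-side aggregation. It is a sketch of the idea rather than Cascalog code: each mapper pre-sums its own partition, so the reducer only has to add the partial sums. The split of the input into two partitions is inferred from the partial sums 6 and 24 on the slide, and the function names are hypothetical.

```python
from functools import reduce
import operator

values = [1, 2, 3, 5, 8, 11]


def reduce_only(values):
    """No combiner: every value is shipped to the reducer, which sums them all."""
    return reduce(operator.add, values, 0)


def with_combiners(partitions):
    """Each mapper combines its partition locally; the reducer sums the partials."""
    partials = [reduce(operator.add, part, 0) for part in partitions]
    return partials, reduce(operator.add, partials, 0)


partitions = [values[:3], values[3:]]   # [1, 2, 3] and [5, 8, 11] (assumed split)
partials, total = with_combiners(partitions)

print(reduce_only(values))   # corresponds to (+ 1 2 3 5 8 11)
print(partials, total)       # corresponds to (+ 6 24)
```

Both paths give the same total only because addition is associative and commutative, so partial results can be merged in any grouping.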
  17. midje-cascalog
    • Macros for testing Cascalog queries
    • Use the produces function to check outputs
  18. Workflow
    • On a sampled dataset:
      • Unit test the components
      • End-to-end test the workflow
    • Then, test on the cluster