Jon Sondag
April 14, 2014
NYC Clojure Meetup Apr '14
Cascalog at Intent Media
Transcript
Cascalog at Intent Media Tuesday, April 15, 14
Overview
• About
• Why we use Cascalog
• Query execution, example queries
• How we work with Cascalog
• Alternatives

About Intent Media
• Ad tech
• 70 people, 35 engineers, 8 on the data team
• Site optimization for e-commerce

Auction
Advertiser   Bid   Est. Click-Through Rate   Est. Revenue
Airline1     $10   3%                        $0.30
Airline2     $5    5%                        $0.25
Airline3     $8    10%                       $0.80
Airline4     $2    30%                       $0.60

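The estimated revenue column is consistent with bid multiplied by estimated click-through rate (for example, $10 × 3% = $0.30). The following is a minimal Python sketch of that arithmetic. It assumes the auction ranks advertisers by that product, which the slide suggests but does not state; the function names are hypothetical.

```python
from decimal import Decimal

# Rows copied from the Auction slide: (advertiser, bid in dollars, estimated CTR).
BIDS = [
    ("Airline1", Decimal("10"), Decimal("0.03")),
    ("Airline2", Decimal("5"), Decimal("0.05")),
    ("Airline3", Decimal("8"), Decimal("0.10")),
    ("Airline4", Decimal("2"), Decimal("0.30")),
]


def estimated_revenue(bid, ctr):
    """Expected revenue per impression: bid times click-through rate."""
    return bid * ctr


def rank_by_estimated_revenue(bids):
    """Hypothetical ranking rule: highest estimated revenue first."""
    scored = [(name, estimated_revenue(bid, ctr)) for name, bid, ctr in bids]
    return sorted(scored, key=lambda row: row[1], reverse=True)


ranked = rank_by_estimated_revenue(BIDS)
for name, revenue in ranked:
    print(name, revenue)
```

Under this assumption Airline3 ranks first even though Airline1 has the highest bid, because its estimated click-through rate more than makes up the difference.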
High Level Pipeline
[Diagram with example numeric values; not recoverable from the transcript]

IM & Hadoop
• Hadoop
• Pig
• Cascalog

Pig: TupleVariance UDF

Cascalog

Operations
Pipes
Taps: FileTap, Hfs, GlobHfs, MultiSourceTap, MultiSinkTap
Schemes: TextLine, TextDelimited, SequenceFile, WritableSequenceFile

Word Count

Word Count, Cascalog

Word Count, Cascalog
• Generator
• Operation
• Aggregation
• Output Tap

Word Count, Cascalog
• Generator
• Operation
• Aggregation
• Output Tap
• Cascalog Predicates

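The query shown on the word count slides is not captured in the transcript. As a rough illustration of the four labelled roles, here is a plain Python sketch (not Cascalog, and not the speaker's code): a generator supplies tuples, an operation splits each sentence into words, an aggregation counts per word, and a stand-in for the output tap collects the results. All function names and the sample sentences are hypothetical.

```python
from collections import Counter


def generator():
    """Stand-in for a generator: yields one sentence per tuple."""
    yield "the quick brown fox"
    yield "the lazy dog"


def split_words(sentence):
    """Stand-in for a mapcat-style operation: one sentence in, many words out."""
    return sentence.split()


def word_count(sentences):
    """Aggregation: count occurrences grouped by word."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    return counts


def sink(counts):
    """Stand-in for an output tap: materialize sorted (word, count) tuples."""
    return sorted(counts.items())


result = sink(word_count(generator()))
for word, count in result:
    print(word, count)
```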
Query Execution
• Pre-Aggregation: Join generators, apply operators where possible
• Aggregation: Fix outputs that we have already, aggregate the rest
• Post-Aggregation: Apply remaining operators

Query Execution

Query Execution
Pre-Aggregation (Generators)
[["impression-1" "buy this product"]
 ["impression-2" "great deal"]
 ["impression-3" "cheap sale"]
 ["impression-4" "cheap sale"]]
[["click-1" "impression-3" 0]
 ["click-2" "impression-2" 0]
 ["click-3" "impression-2" 100]]

Query Execution
Pre-Aggregation (Join)
[["impression-1" "buy this product" nil nil]
 ["impression-2" "great deal" "click-2" 0]
 ["impression-2" "great deal" "click-3" 100]
 ["impression-3" "cheap sale" "click-1" 0]
 ["impression-4" "cheap sale" nil nil]]

Query Execution
Pre-Aggregation (Operations)
[["impression-1" ["buy" "this" "product"] nil nil]
 ["impression-2" ["great" "deal"] "click-2" 0]
 ["impression-2" ["great" "deal"] "click-3" 100]
 ["impression-3" ["cheap" "sale"] "click-1" 0]
 ["impression-4" ["cheap" "sale"] nil nil]]

Query Execution
Aggregation
[["impression-1" ["buy" "this" "product"] 0]
 ["impression-2" ["great" "deal"] 100]
 ["impression-3" ["cheap" "sale"] 0]
 ["impression-4" ["cheap" "sale"] 0]]

Query Execution
Post-Aggregation (Operations)
[["impression-1" ["buy" "this" "product"] -1]
 ["impression-2" ["great" "deal"] 1]
 ["impression-3" ["cheap" "sale"] -1]
 ["impression-4" ["cheap" "sale"] -1]]

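The four stages on the Query Execution slides can be traced end to end with a plain Python sketch (not Cascalog). The input tuples are copied from the slides. The slides do not name the aggregator or the final operation, so two assumptions are made here that happen to reproduce the values shown: the aggregation sums the click values with missing values counted as 0, and the post-aggregation operation maps a positive total to 1 and anything else to -1. All function names are hypothetical.

```python
from collections import defaultdict

IMPRESSIONS = [
    ("impression-1", "buy this product"),
    ("impression-2", "great deal"),
    ("impression-3", "cheap sale"),
    ("impression-4", "cheap sale"),
]
CLICKS = [
    ("click-1", "impression-3", 0),
    ("click-2", "impression-2", 0),
    ("click-3", "impression-2", 100),
]


def left_join(impressions, clicks):
    """Pre-aggregation join: keep every impression, pad missing clicks with None."""
    by_impression = defaultdict(list)
    for click_id, impression_id, value in clicks:
        by_impression[impression_id].append((click_id, value))
    rows = []
    for impression_id, text in impressions:
        matches = by_impression.get(impression_id) or [(None, None)]
        for click_id, value in matches:
            rows.append((impression_id, text, click_id, value))
    return rows


def tokenize(rows):
    """Pre-aggregation operation: split the ad text into a tuple of words."""
    return [(imp, tuple(text.split()), click, value)
            for imp, text, click, value in rows]


def aggregate(rows):
    """Aggregation (assumed to be a sum): group by impression and words."""
    totals = {}
    for imp, words, _click, value in rows:
        key = (imp, words)
        totals[key] = totals.get(key, 0) + (value or 0)
    return [(imp, words, total) for (imp, words), total in totals.items()]


def label(rows):
    """Post-aggregation operation (assumed): 1 if the total is positive, else -1."""
    return [(imp, words, 1 if total > 0 else -1) for imp, words, total in rows]


joined = left_join(IMPRESSIONS, CLICKS)
aggregated = aggregate(tokenize(joined))
result = label(aggregated)
for row in result:
    print(row)
```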
Built-In Filter Ops
• first-n
• limit
• limit-rank
• fixed-sample
• fixed-sample-agg
• re-parse

Built-In Agg Ops
• avg
• count/!count
• distinct-count
• max
• min
• sum

Built-In Higher Order Functions
• all
• any
• comp
• each
• negate
• partial

Custom Function Types
• mapfn
• mapcatfn
• filterfn
• bufferfn
• aggregatefn

defparallelagg
Move execution from the reducer to the combiners
Input: 1 2 3 5 8 11
Without combiners: Map, then Reduce (+ 1 2 3 5 8 11)
With combiners: Map, then Reduce (+ 6 24)

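The effect described on the defparallelagg slide can be sketched in plain Python (not Cascalog). It assumes the aggregate can be written as an init function applied to each value plus an associative, commutative combine function, so partial results can be folded on the map side before the reducer sees them. The split of the six inputs into two partitions is hypothetical, chosen to match the partial sums 6 and 24 on the slide.

```python
from functools import reduce


def init(value):
    """Turn one input value into a partial aggregate (identity for a sum)."""
    return value


def combine(left, right):
    """Merge two partial aggregates; must be associative and commutative."""
    return left + right


def run_combiner(partition):
    """Map-side combiner: fold one partition down to a single partial result."""
    return reduce(combine, (init(value) for value in partition))


def run_reducer(partials):
    """Reducer: merge the partial results from every combiner."""
    return reduce(combine, partials)


# Hypothetical split of the slide's inputs across two mappers.
PARTITIONS = [[1, 2, 3], [5, 8, 11]]

partials = [run_combiner(partition) for partition in PARTITIONS]
total = run_reducer(partials)
print(partials, total)
```

With the combiners in place the reducer receives two values instead of six, which is the saving the slide illustrates with (+ 6 24).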
midje-cascalog
• Macros for testing Cascalog queries
• Use the produces function to check outputs

Examples
• Feature variance
• Linear regression

Workflow
• On a sampled dataset:
  • Unit test the components
  • End-to-end test the workflow
• Then, test on the cluster

Checkpoint

Other Hadoop Projects
• PigPen
• Parkour

Other execution backends
• Storm
• Spark

Thanks