Job offer notification using elasticsearch and percolation

Elastic Paris Meetup #18, Wed, 3rd February 2016 @ Viadeo

Who are we?
- Grégoire Nicolle, Software Engineer, Viadeo (Github: greg-nicolle)
- Nicolas Colomer, Senior Software Engineer, Viadeo (Twitter: @n_colomer, Github: ncolomer)

The job alert feature
[Figure: BEFORE / AFTER]

Some figures
- 200k live job offers (20k for France)
- 16k new job offers per day (2k for France)
- 550k+ alerts registered (1.4 avg per user)
- 10k alerts created per month
- Alert frequency distribution: daily 480k, weekly 77k, monthly 1k

Legacy job offer/alert implementation
[Diagram labels: legacy daemon, legacy Solr index, legacy webapp, legacy job offer integration pipeline, legacy emails, legacy job alerts, legacy job offers]

The new job offer implementation
[Diagram labels: MQ event sourcing, API, Platform, Index, job offers, Sqoop; Batch: full reindexation; Streaming: incremental indexation]

Percolation = good feeling
- answers the question: “which alerts does this job offer match?”
- processes the stream of new job offers
- percolation fits by design!
- result sets are precalculated
- smooth load
[Diagram labels: queries, index, document, matching queries ids]
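
To make the inversion behind percolation concrete, here is a toy in-memory sketch in Python. It is not Elasticsearch code: the `ToyPercolator` class, the alert ids and the `country`/`sector` fields are hypothetical names chosen for illustration. It only shows the idea that queries are stored first, and each incoming document is then matched against them to get back query ids.

```python
from typing import Callable, Dict, List

# Toy stand-in for a percolator: alerts are stored as predicates, and each
# incoming job offer is tested against all of them. Hypothetical names only.
JobOffer = Dict[str, str]
AlertQuery = Callable[[JobOffer], bool]


class ToyPercolator:
    def __init__(self) -> None:
        self._queries: Dict[str, AlertQuery] = {}

    def register(self, alert_id: str, query: AlertQuery) -> None:
        """Store an alert query (the 'index the queries' step)."""
        self._queries[alert_id] = query

    def percolate(self, offer: JobOffer) -> List[str]:
        """Return the sorted ids of all stored queries matching the offer."""
        return sorted(aid for aid, q in self._queries.items() if q(offer))


percolator = ToyPercolator()
percolator.register("alert-1", lambda o: o["country"] == "FR" and o["sector"] == "IT")
percolator.register("alert-2", lambda o: o["country"] == "DE")

# A new job offer arrives on the stream: which alerts does it match?
matches = percolator.percolate({"country": "FR", "sector": "IT", "title": "Engineer"})
```

A real percolator avoids testing every stored query naively, but the contract is the same: document in, matching query ids out.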

Does it perf?
Goal: validate the streaming use case
- 1x EC2 c3.4xlarge instance (16 vCPU, 30 GB memory)
- 550k percolator queries extracted from prod data
- no multi percolate
- iterate over prod job offers
- measure the “took” time
GO!
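
One possible way to exploit the measured “took” values is sketched below in Python. The `summarize_took` helper is hypothetical, and the sample responses are made up for illustration; they are not figures from the benchmark.

```python
import statistics
from typing import Dict, List


def summarize_took(responses: List[Dict]) -> Dict[str, float]:
    """Summarize the 'took' field (milliseconds) of percolate responses."""
    took = sorted(r["took"] for r in responses)
    return {
        "count": len(took),
        "min": took[0],
        "median": statistics.median(took),
        "max": took[-1],
    }


# Made-up responses, for illustration only (not results from the talk).
sample = [{"took": 12}, {"took": 48}, {"took": 20}]
summary = summarize_took(sample)
```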

Batch modeling
[Diagram labels: Event Bus, created job offer events, percolation, job offer event flow]

Job offer match enqueuing
[Diagram labels: job offer, MQ event sourcing, Index, hydrating, job alert, Percolate index]

Job offer match dequeuing
[Diagram labels: Job Alert, Dequeue, hydrating, Scoop]

Optimizations
- percolate an existing document (but not available in 1.1.x)
  www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_percolating_an_existing_document
  - no need to rebuild the document
  - less network bandwidth
- use a dedicated percolation index
  www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_dedicated_percolator_index
  - different index lifecycle
  - optimized index settings (e.g. “all” replication => distributed workload)
  - needs the original index mapping!
- use the multi percolate API (aka. bulk processing)
  www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_multi_percolate_api
  - fewer network roundtrips
  - optimized computing on the cluster side
- retrieve only matched alert ids
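
The first three optimizations can be combined in a single multi percolate request. The Python sketch below only builds the newline-delimited request body, following the header/body format of the multi percolate documentation linked above as I understand it for the 1.x line. The index and type names (`job_offers`, `job_offer`, `job_alerts`) and the `build_mpercolate_body` helper are assumptions for illustration, not the names used at Viadeo.

```python
import json
from typing import Iterable


def build_mpercolate_body(doc_ids: Iterable[str], index: str, doc_type: str,
                          percolate_index: str) -> str:
    """Build a newline-delimited multi percolate body for existing documents."""
    lines = []
    for doc_id in doc_ids:
        header = {"percolate": {
            "index": index,                      # index holding the job offer
            "type": doc_type,
            "id": doc_id,                        # percolate the already-indexed doc
            "percolate_index": percolate_index,  # dedicated percolator index
        }}
        lines.append(json.dumps(header))
        lines.append(json.dumps({}))  # empty body: no inline "doc" to send
    return "\n".join(lines) + "\n"


# Hypothetical names, for illustration only.
body = build_mpercolate_body(["1", "2"], "job_offers", "job_offer", "job_alerts")
```

The body would then be sent to the `_mpercolate` endpoint; each header/body pair yields one entry in the response, in the same order.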

Bugs & caveats
- Don’t use “now” in a range query inside your percolate query (fixed in 1.7.2+)
  - the timestamp is stored at index time
  - thus won’t be dynamic at query time
- No score at percolation time!
  - forced us to requery percolated docs to get the top-n best job offers
  - would be nice to have… opened issue!
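
On affected versions, one way to guard against the “now” caveat is to reject such queries before registering them. The Python sketch below is a hypothetical pre-registration check, not something shown in the talk: it walks a query expressed as nested dicts and lists and reports whether any range clause has a bound starting with `now`. The field name `published_at` is an assumption.

```python
from typing import Any


def uses_relative_now(node: Any, inside_range: bool = False) -> bool:
    """Return True if any range clause contains a value starting with 'now'."""
    if isinstance(node, dict):
        return any(
            uses_relative_now(value, inside_range or key == "range")
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(uses_relative_now(item, inside_range) for item in node)
    # Leaf value: only flag strings found somewhere under a "range" key.
    return inside_range and isinstance(node, str) and node.startswith("now")


# Hypothetical alert queries, for illustration only.
relative = {"query": {"range": {"published_at": {"gte": "now-1d"}}}}
absolute = {"query": {"bool": {"must": [
    {"term": {"country": "FR"}},
    {"range": {"published_at": {"gte": "2016-01-01"}}},
]}}}
```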

Why this doesn’t fit
- warmup of the system with existing job offers is painful
  - acceptable for new job offer prod stream, not batch
- warmup stresses the backends a lot!
  - 3k writes/s on ElastiCache (redis)
  - 1:30m to percolate 7k job offers
  [Chart: Redis throughput, warmup vs. nominal, 200k set/min]
- alert queries do not filter enough
  - data were not validated
  - some of them have no criteria
  - should at least have a country and a sector
- the system produces too much data
  - 1k avg job offers per alert after warmup
  - some alerts got 1k+ more matches a day
  - we only need 10 representative job offers max per mail!
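
The two data issues above (unvalidated alerts, too many matches per alert) translate into two small rules, sketched here in Python. The field names `country` and `sector` come from the slide; the helper names and the assumption that offers arrive already ranked are hypothetical.

```python
from typing import Dict, List

REQUIRED_CRITERIA = ("country", "sector")  # minimum criteria named on the slide
MAX_OFFERS_PER_MAIL = 10                   # "10 representative job offers max"


def is_valid_alert(criteria: Dict[str, str]) -> bool:
    """An alert is kept only if every required criterion is present and non-empty."""
    return all(criteria.get(field) for field in REQUIRED_CRITERIA)


def offers_for_mail(ranked_offer_ids: List[str]) -> List[str]:
    """Keep only the first offers of an already-ranked list (ranking assumed upstream)."""
    return ranked_offer_ids[:MAX_OFFERS_PER_MAIL]
```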

Multi Search Refactoring
[Diagram labels: snapshot, Hydrating, EC2, ES, pop]
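
As a rough illustration of the multi search approach, the Python sketch below builds a newline-delimited `_msearch` body with one search per alert. It assumes each alert is replayed as a regular query capped at the 10 offers needed per mail; the index name `job_offers`, the sample queries and the `build_msearch_body` helper are hypothetical.

```python
import json
from typing import Dict, List


def build_msearch_body(alert_queries: List[Dict], index: str, size: int = 10) -> str:
    """Build a newline-delimited multi search body: one header + one search per alert."""
    lines = []
    for query in alert_queries:
        lines.append(json.dumps({"index": index}))                # header line
        lines.append(json.dumps({"query": query, "size": size}))  # search body line
    return "\n".join(lines) + "\n"


# Hypothetical alert queries, for illustration only.
alerts = [
    {"term": {"country": "FR"}},
    {"term": {"sector": "IT"}},
]
msearch_body = build_msearch_body(alerts, "job_offers")
```

Unlike percolation, a regular search returns scored hits, so the top-n best job offers per alert come back directly.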

Does it work?
For ~200k alert batch
- EC2 instances cost: $1.6 (2x r3.4xlarge, 16 CPU, 120 GB memory)
- Total duration: 17m
  - snapshot: 2m
  - initialize EC2 instances: 4m
  - index restoration: 20s
  - spark batch computation: 10m
  - dump result as AVRO file: 2s
  - notifying platform: 22s
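
As a quick sanity check, the step durations listed above can be summed in Python; they add up to a little under the 17m total quoted on the slide.

```python
# Step durations from the slide, converted to seconds.
steps_seconds = {
    "snapshot": 2 * 60,
    "initialize EC2 instances": 4 * 60,
    "index restoration": 20,
    "spark batch computation": 10 * 60,
    "dump result as AVRO file": 2,
    "notifying platform": 22,
}

total = sum(steps_seconds.values())
minutes, seconds = divmod(total, 60)
print(f"{minutes}m{seconds:02d}s")  # 16m44s, i.e. roughly 17m
```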

First runs feedback!

Take away
- Don’t over-design things
- Do representative benchmarks
- Keep the KISS principle in mind
- ES rocks for searching things massively!

Questions?
