
Job offer notifications with elasticsearch and percolation

February 03, 2016

Come and find out how we chose to use Spark and elasticsearch percolation to build a job offer alert system based on our users' criteria.

by Nicolas Colomer and Grégoire Nicolle, Full stack engineers @Viadeo





  1. Elastic Paris Meetup #18 Wed, 3rd February 2016 @ Viadeo

    Job offer notification using elasticsearch and percolation
  2. Who are we?

    - Grégoire Nicolle, Software Engineer, Viadeo. Github: greg-nicolle
    - Nicolas Colomer, Senior Software Engineer, Viadeo. Twitter: @n_colomer, Github: ncolomer
  3. Some figures

    - 200k live job offers (20k for France)
    - 16k new job offers per day (2k for France)
    - 550k+ alerts registered (1.4 avg per user)
    - 10k alerts created per month
    - Alert frequency distribution: daily 480k, weekly 77k, monthly 1k
  4. Legacy job offer/alert implementation

    (Architecture diagram. Labeled components: legacy daemon, legacy SolR index, legacy webapp, legacy joboffer integration pipeline, legacy emails, legacy job alerts, legacy job offers.)
  5. The new job offer implementation

    (Architecture diagram. Labels, in slide order: MQ, event sourcing, API Platform, Index, Batch (full reindexation), Streaming (incremental indexation), job offers, Sqoop.)
  6. Percolation = good feeling

    - answers the question: “which alerts does this job offer match?”
    - processes the stream of new job offers
    - percolation fits by design!
    - result sets are precalculated
    - smooth load

    (Diagram: queries are stored in the index; a document is submitted and the ids of the matching queries are returned.)
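To make the "queries are indexed, documents are submitted" idea concrete, here is a minimal in-memory sketch. It is not elasticsearch and not the implementation presented in the talk: the `ToyPercolator` class, the exact-match criteria and the alert ids are all hypothetical, chosen only to illustrate the direction of the lookup.

```python
from typing import Dict, List

# Hypothetical alert criteria: field name -> required value (exact match only).
Alert = Dict[str, str]


class ToyPercolator:
    """Stores queries (alerts) first, then matches incoming documents against them."""

    def __init__(self) -> None:
        self._alerts: Dict[str, Alert] = {}

    def register(self, alert_id: str, criteria: Alert) -> None:
        # Equivalent in spirit to indexing a percolator query.
        self._alerts[alert_id] = criteria

    def percolate(self, job_offer: Dict[str, str]) -> List[str]:
        # Return the ids of every alert whose criteria all hold for this offer.
        return sorted(
            alert_id
            for alert_id, criteria in self._alerts.items()
            if all(job_offer.get(field) == value for field, value in criteria.items())
        )


percolator = ToyPercolator()
percolator.register("alert-1", {"country": "FR", "sector": "IT"})
percolator.register("alert-2", {"country": "FR"})
percolator.register("alert-3", {"country": "US", "sector": "IT"})

offer = {"title": "Software Engineer", "country": "FR", "sector": "IT"}
matches = percolator.percolate(offer)
print(matches)  # ['alert-1', 'alert-2']
```

In this toy version, an alert registered with no criteria at all would match every job offer, since there is nothing to filter on.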
  7. Does it perf?

    Goal: validate the streaming use case

    - 1x EC2 c3.4xlarge instance (16 vCPU, 30 GB mem)
    - 550k percolator queries extracted from prod data
    - no multi percolate
    - iterate over prod job offers
    - measure the “took” time

    GO!
  8. Optimizations

    - percolate an existing document (but not available in 1.1.x) www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_percolating_an_existing_document
      - don’t have to re-build the document
      - less network bandwidth
    - use a dedicated percolation index www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_dedicated_percolator_index
      - different index lifecycle
      - optimized index settings (e.g. “all” replication => distributed workload)
      - needs original index mapping!
    - use multi percolate API (aka. bulk processing) www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_multi_percolate_api
      - fewer network round trips
      - optimized computing cluster side
    - retrieve only matched alert ids
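As an illustration of how the first three optimizations combine, the sketch below builds a multi percolate request body that percolates already-indexed documents against a dedicated percolator index. The body layout (a `percolate` header line followed by an empty JSON object for an existing document) follows the 1.x multi percolate documentation linked above and should be checked against the version in use. The index name `joboffers`, the type `joboffer`, the percolator index `alerts` and the helper `mpercolate_body` are assumptions, not names from the talk. No request is sent; the body would be posted to the `_mpercolate` endpoint.

```python
import json
from typing import Iterable, List


def mpercolate_body(index: str, doc_type: str, doc_ids: Iterable[str],
                    percolate_index: str) -> str:
    """Build a newline-delimited multi percolate body for existing documents."""
    lines: List[str] = []
    for doc_id in doc_ids:
        header = {
            "percolate": {
                "index": index,                      # where the job offer lives
                "type": doc_type,
                "id": doc_id,                        # percolate an existing document
                "percolate_index": percolate_index,  # dedicated percolator index
            }
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps({}))  # empty body: the document is fetched by id
    # Newline-delimited bodies are terminated by a final newline.
    return "\n".join(lines) + "\n"


body = mpercolate_body("joboffers", "joboffer", ["42", "43"], "alerts")
print(body, end="")
```

Batching several job offers into one body is what saves the round trips: one HTTP call carries as many percolate requests as there are header lines.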
  9. Bugs & caveats

    - Don’t use “now” in a range query in your percolate query (fixed in 1.7.2+)
      - the timestamp is stored at index time
      - thus won’t be dynamic at query time
    - No score at percolation time!
      - forced us to requery percolated docs to get the top-n best job offers
      - would be nice to have… opened issue!
  10. Why this doesn’t fit

    - warmup of the system with existing job offers is painful
      - acceptable for new job offer prod stream, not batch
    - warmup stresses the backends a lot!
      - 3k w/s on ElastiCache (redis)
      - 1:30m to percolate 7k job offers
      - (Chart: Redis throughput, warmup vs. nominal, with a 200k set/min mark)
    - alert queries do not filter enough
      - data were not validated
      - some of them have no criteria
      - should at least have a country and a sector
    - the system produces too much data
      - 1k avg job offers per alert after warmup
      - some alerts got 1k+ more matches a day
      - we only need 10 representative job offers max per mail!
  11. Does it work?

    For ~200k alert batch:

    - EC2 instances cost: $1.6 (2x r3.4xlarge, 16 CPU, 120G mem)
    - Total duration: 17m
      - snapshot: 2m
      - initialize EC2 instances: 4m
      - index restoration: 20s
      - spark batch computation: 10m
      - dump result as AVRO file: 2s
      - notifying platform: 22s
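As a quick sanity check, the step durations listed on this slide can be added up to confirm the 17m total. The snippet below uses only the figures from the slide; the dictionary name `steps_seconds` is just for illustration.

```python
# Step durations from the slide, converted to seconds.
steps_seconds = {
    "snapshot": 2 * 60,
    "initialize EC2 instances": 4 * 60,
    "index restoration": 20,
    "spark batch computation": 10 * 60,
    "dump result as AVRO file": 2,
    "notifying platform": 22,
}

total = sum(steps_seconds.values())
minutes, seconds = divmod(total, 60)
print(f"total: {minutes}m{seconds:02d}s")  # total: 16m44s

# Share of the run spent in the Spark computation itself.
spark_share = steps_seconds["spark batch computation"] / total
print(f"spark share: {spark_share:.0%}")  # spark share: 60%
```

The steps sum to 16m44s, which rounds to the 17m quoted, and roughly 60% of that is the Spark computation. The rest is mostly the snapshot and instance startup.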
  12. Take away

    - Don’t over-design things
    - Do representative benchmarks
    - Keep the KISS principle in mind
    - ES rocks for searching things massively!