Job offer notification with elasticsearch and percolation

Viadeo
February 03, 2016

Come and discover how we chose to use Spark and elasticsearch percolation to build a job offer alert system based on our users' criteria.

by Nicolas Colomer and Grégoire Nicolle, Full stack engineers @Viadeo

Transcript

  1. Elastic Paris Meetup #18 Wed, 3rd February 2016 @ Viadeo

    Job offer notification using elasticsearch and percolation
  2. Who are we?

    - Grégoire Nicolle, Software Engineer, Viadeo. Github: greg-nicolle
    - Nicolas Colomer, Senior Software Engineer, Viadeo. Twitter: @n_colomer, Github: ncolomer
  3. The job alert feature (screenshots: BEFORE, AFTER)

  4. Some figures

    - 200k live job offers (20k for France)
    - 16k new job offers per day (2k for France)
    - 550k+ alerts registered (1.4 avg per user)
    - 10k alerts created per month
    - Alert frequency distribution: daily 480k, weekly 77k, monthly 1k

  5. Legacy job offer/alert implementation

    Diagram labels: legacy daemon, legacy SolR index, legacy webapp, legacy joboffer integration pipeline, legacy emails, legacy job alerts, legacy job offers

  6. The new job offer implementation

    Diagram labels: MQ, event sourcing, API, Platform, Index, Batch full reindexation, Streaming incremental indexation, job offers, Sqoop

  7. Percolation = good feeling

    - answer the question: “which alerts does this job offer match?”
    - process the stream of new job offers
    - percolation fits by design!
    - result sets are precalculated
    - smooth load

    Diagram labels: queries, index, document, matching queries ids

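
The inversion described on slide 7 (store the queries, then match each incoming document against them) can be illustrated without a cluster. The following is a minimal in-memory sketch under stated assumptions, not elasticsearch's implementation: the `Alert` class, its `country` and `sector` criteria, and the `percolate` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Alert:
    """A stored query (hypothetical shape): every criterion is optional."""
    alert_id: str
    country: Optional[str] = None
    sector: Optional[str] = None

    def matches(self, offer: Dict[str, str]) -> bool:
        # A criterion left as None does not filter anything.
        if self.country is not None and offer.get("country") != self.country:
            return False
        if self.sector is not None and offer.get("sector") != self.sector:
            return False
        return True


def percolate(alerts: List[Alert], offer: Dict[str, str]) -> List[str]:
    """Return the ids of all stored alerts that match one incoming job offer."""
    return [alert.alert_id for alert in alerts if alert.matches(offer)]


alerts = [
    Alert("a1", country="FR", sector="it"),
    Alert("a2", country="DE"),
    Alert("a3"),  # no criteria at all: matches every offer
]
offer = {"title": "Backend engineer", "country": "FR", "sector": "it"}
matched = percolate(alerts, offer)
print(matched)
```

Note that the alert without any criteria matches every offer, which is the kind of under-filtering listed later under "Why this doesn’t fit".
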
  8. Does it perf?

    Goal: validate the streaming use case

    - 1x EC2 c3.4xlarge instance (16 vCPU, 30 GB mem)
    - 550k percolator queries extracted from prod data
    - no multi percolate
    - iterate over prod job offers
    - measure the “took” time
    - GO!

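
The measurement loop from slide 8 (percolate offers one at a time and record each response's “took” value) might look roughly like the sketch below. It is an illustration only: `collect_took`, `summarize` and `fake_percolate` are hypothetical helpers, and the fake client returns made-up numbers, not the speakers' measurements. In a real run `percolate_fn` would wrap an HTTP call to the percolate endpoint.

```python
import statistics
from typing import Callable, Dict, Iterable, List


def collect_took(percolate_fn: Callable[[dict], dict],
                 offers: Iterable[dict]) -> List[int]:
    """Percolate offers one by one (no multi percolate) and keep each "took" in ms."""
    return [percolate_fn(offer)["took"] for offer in offers]


def summarize(took_ms: List[int]) -> Dict[str, float]:
    """Summarize a non-empty list of "took" values."""
    ordered = sorted(took_ms)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), computed with integers.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "mean_ms": statistics.mean(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


def fake_percolate(offer: dict) -> dict:
    # Stand-in for a real percolate request; the "took" value is invented.
    return {"took": 10 + len(offer["title"]), "total": 0, "matches": []}


offers = [{"title": "dev"}, {"title": "sales"}, {"title": "data engineer"}]
took = collect_took(fake_percolate, offers)
stats = summarize(took)
print(took, stats)
```
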
  9. Diagram labels: Batch, Modelisation, Event Bus, created job offer events, percolation, job offer event flow

  10. Job offer match enqueuing

    Diagram labels: job offer, MQ, event sourcing, Index, hydrating, job alert, Percolate index

  11. Job offer match dequeuing

    Diagram labels: Job Alert, Dequeue, hydrating, Sqoop

  12. Optimizations

    - percolate an existing document (but not available in 1.1.x)
      www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_percolating_an_existing_document
      - don’t have to re-build the document
      - less network bandwidth
    - use a dedicated percolation index
      www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_dedicated_percolator_index
      - different index lifecycle
      - optimized index settings (e.g. “all” replication => distributed workload)
      - needs the original index mapping!
    - use the multi percolate API (aka bulk processing)
      www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_multi_percolate_api
      - fewer network roundtrips
      - optimized computation on the cluster side
    - retrieve only matched alert ids

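
To make the first three optimizations concrete, here is a hedged sketch of how a multi percolate request body could be assembled for documents that already exist in an index, with the queries stored in a dedicated percolator index. The header fields (`index`, `type`, `id`, `percolate_index`) follow the 1.x percolator documentation linked on the slide as best recalled, so they should be checked against the version in use. The index and type names (`joboffers`, `joboffer`, `jobalerts`) and the `build_mpercolate_body` helper are hypothetical.

```python
import json
from typing import Iterable, Optional


def build_mpercolate_body(index: str, doc_type: str, doc_ids: Iterable[str],
                          percolate_index: Optional[str] = None) -> str:
    """Build a newline-delimited body: one header line plus one body line per document."""
    lines = []
    for doc_id in doc_ids:
        header = {"index": index, "type": doc_type, "id": doc_id}
        if percolate_index is not None:
            # Assumed header option: the index in which the alert queries are stored.
            header["percolate_index"] = percolate_index
        lines.append(json.dumps({"percolate": header}, sort_keys=True))
        # Empty body: the document is looked up by id instead of being re-sent.
        lines.append("{}")
    return "\n".join(lines) + "\n"


body = build_mpercolate_body("joboffers", "joboffer", ["42", "43"],
                             percolate_index="jobalerts")
print(body)
```

Sending ids instead of full documents is what saves bandwidth, and batching several header/body pairs into one request is what saves roundtrips.
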
  13. Bugs & caveats

    - Don’t use “now” in a range query in your percolate query (fixed in 1.7.2+)
      - the timestamp is stored at index time
      - thus it won’t be dynamic at query time
    - No score at percolation time!
      - forced us to requery percolated docs to get the top-n best job offers
      - would be nice to have… opened issue!

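
The “now” caveat can be simulated in a few lines. This is a toy model of the behaviour described on the slide (the bound is resolved once when the query is stored), not elasticsearch code. Both helper functions and all dates are hypothetical.

```python
from datetime import datetime, timedelta


def register_range_query(days: int, registered_at: datetime):
    """Mimic the caveat: "now-<days>d" is resolved once, when the query is stored."""
    frozen_lower_bound = registered_at - timedelta(days=days)

    def matches(published_at: datetime) -> bool:
        return published_at >= frozen_lower_bound

    return matches


def dynamic_range_query(days: int):
    """What one would intuitively expect: the bound follows the percolation time."""
    def matches(published_at: datetime, now: datetime) -> bool:
        return published_at >= now - timedelta(days=days)

    return matches


registered_at = datetime(2016, 1, 1)   # alert stored with "published in the last 7 days"
percolated_at = datetime(2016, 2, 3)   # an offer is percolated a month later
published_at = datetime(2016, 1, 10)   # the offer is 24 days old at that point

stale_match = register_range_query(7, registered_at)(published_at)
fresh_match = dynamic_range_query(7)(published_at, percolated_at)
print(stale_match, fresh_match)
```

With the frozen bound the 24-day-old offer still matches, whereas a bound evaluated at percolation time would reject it.
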
  14. Why this doesn’t fit

    - warmup of the system with existing job offers is painful
      - acceptable for the new job offer prod stream, not for batch
    - warmup stresses the backends a lot!
      - 3k w/s on ElastiCache (redis)
      - 1:30m to percolate 7k job offers
      - chart: Redis throughput, warmup vs. nominal (label: 200k set/min)
    - alert queries do not filter enough
      - data were not validated
      - some of them have no criteria
      - should at least have a country and a sector
    - the system produces too much data
      - 1k avg job offers per alert after warmup
      - some alerts got 1k+ more matches a day
      - we only need 10 representative job offers max per mail!

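
A back-of-the-envelope calculation shows the scale of the “too much data” point. It combines figures from slides 4 and 14 and assumes, for illustration only, that the 1k average applies to every registered alert and that each alert produces one mail.

```python
ALERTS = 550_000               # "550k+ alerts registered" (slide 4)
AVG_MATCHES_PER_ALERT = 1_000  # "1k avg job offers per alert after warmup"
NEEDED_PER_MAIL = 10           # "10 representative job offers max per mail"

stored_matches = ALERTS * AVG_MATCHES_PER_ALERT  # precomputed by percolation
needed_matches = ALERTS * NEEDED_PER_MAIL        # actually sent to users
overhead_factor = stored_matches // needed_matches

print(f"stored: {stored_matches:,}")
print(f"needed: {needed_matches:,}")
print(f"overhead: {overhead_factor}x")
```

Under these assumptions the precomputed result sets hold about a hundred times more matches than the mails can ever use.
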
  15. Multi Search Refactoring

    Diagram labels: snapshot, Hydrating, EC2, ES, pop

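
The slide only names Multi Search, so the following is a guess at what the batch side could send: one search per alert, capped at the 10 offers a mail needs, bundled into a single `_msearch` body. The query shape, the `joboffers` index name and both helper functions are hypothetical. Only the alternating header/body line format is the standard Multi Search convention.

```python
import json
from typing import Dict, Iterable, List


def alert_to_query(criteria: Dict[str, str]) -> dict:
    """Turn alert criteria into a query (hypothetical mapping: one term clause per field)."""
    if not criteria:
        return {"match_all": {}}
    return {"bool": {"must": [{"term": {field: value}}
                              for field, value in sorted(criteria.items())]}}


def build_msearch_body(index: str, alerts: Iterable[Dict[str, str]],
                       size: int = 10) -> str:
    """Build a newline-delimited Multi Search body: header line, then search line."""
    lines: List[str] = []
    for criteria in alerts:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"query": alert_to_query(criteria), "size": size},
                                sort_keys=True))
    return "\n".join(lines) + "\n"


msearch_body = build_msearch_body(
    "joboffers",
    [{"country": "FR", "sector": "it"}, {}],
)
print(msearch_body)
```

Capping `size` at query time is one way to avoid materializing thousands of matches per alert.
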
  16. Does it work?

    For a ~200k alert batch:

    - EC2 instances cost: $1.6 (2x r3.4xlarge, 16 CPU, 120 GB mem)
    - Total duration: 17m
      - snapshot: 2m
      - initialize EC2 instances: 4m
      - index restoration: 20s
      - spark batch computation: 10m
      - dump result as AVRO file: 2s
      - notifying platform: 22s

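
As a quick consistency check, the step durations on slide 16 can be summed and compared with the quoted 17m total. The throughput figure at the end is a rough derived estimate that treats “~200k” as exactly 200,000 alerts.

```python
# Step durations in seconds, as listed on slide 16.
steps_s = {
    "snapshot": 2 * 60,
    "initialize EC2 instances": 4 * 60,
    "index restoration": 20,
    "spark batch computation": 10 * 60,
    "dump result as AVRO file": 2,
    "notifying platform": 22,
}

total_s = sum(steps_s.values())
minutes, seconds = divmod(total_s, 60)
spark_share = steps_s["spark batch computation"] / total_s

# Rough throughput of the Spark step, assuming exactly 200,000 alerts.
alerts_per_second = 200_000 // steps_s["spark batch computation"]

print(f"total: {minutes}m{seconds}s")
print(f"spark share: {spark_share:.0%}")
print(f"about {alerts_per_second} alerts/s during the Spark step")
```

The steps add up to 16m44s, which matches the rounded 17m total, and the Spark computation accounts for roughly 60% of it.
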
  17. First runs feedback!

  18. Take away

    - Don’t over-design things
    - Do representative benchmarks
    - Keep the KISS principle in mind
    - ES rocks for searching things massively!

  19. Elastic Paris Meetup #18 Wed, 3rd February 2016 @ Viadeo

    Questions?