
Job offer notification with elasticsearch and percolation

Viadeo
February 03, 2016


Come and see how we chose to use Spark and elasticsearch percolation to build a job offer alert system based on our users' criteria.

by Nicolas Colomer and Grégoire Nicolle, Full stack engineers @Viadeo



Transcript

  1. Elastic Paris Meetup #18
    Wed, 3rd February 2016 @ Viadeo
    Job offer notification using
    elasticsearch and percolation


  2. Who are we?
    Grégoire Nicolle
    Software Engineer
    Viadeo
    Github: greg-nicolle
    Nicolas Colomer
    Senior Software Engineer
    Viadeo
    Twitter: @n_colomer
    Github: ncolomer


  3. The job alert feature
    BEFORE AFTER


  4. Some figures
    - 200k live job offers (20k for France)
    - 16k new job offers per day (2k for France)
    - 550k+ alerts registered (1.4 avg per user)
    - 10k alerts created per month
    - Alert frequency distribution:
      daily 480k
      weekly 77k
      monthly 1k


  5. Legacy job offer/alert implementation
    [Diagram: legacy daemon, legacy Solr index, legacy webapp, legacy job offer integration pipeline, legacy emails, legacy job alerts, legacy job offers]


  6. The new job offer implementation
    [Diagram: job offers, Sqoop, MQ, event sourcing, API Platform, Index; Batch: full reindexation; Streaming: incremental indexation]


  7. Percolation = good feeling
    - answers the question: “which alerts does this job offer match?”
    - processes the stream of new job offers
    - percolation fits by design!
    - result sets are precalculated
    - smooth load
    [Diagram: queries are indexed; a document is submitted; the ids of the matching queries are returned]
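
The percolation flow described on this slide can be sketched as two HTTP requests: one that registers an alert as a percolator query, and one that submits a job offer to find the alerts it matches. This is only a minimal sketch assuming the elasticsearch 1.x percolator API (queries indexed under the reserved `.percolator` type, documents sent to `_percolate`). The index name `job_alerts`, the type `job_offer`, the alert layout and the field names are hypothetical and not taken from the talk.

```python
def alert_to_percolator_query(alert):
    """Turn a hypothetical alert dict into a percolator query body.

    Assumes each criterion maps to an exact-match `term` query on a
    not-analyzed field; a real alert would likely need richer queries.
    """
    filters = [
        {"term": {field: value}}
        for field, value in sorted(alert["criteria"].items())
    ]
    return {"query": {"bool": {"must": filters}}}


def register_request(index, alert):
    """Describe the request that stores an alert as a percolator query.

    In elasticsearch 1.x a percolator query is a regular document
    indexed under the reserved ".percolator" type.
    """
    path = "/%s/.percolator/%s" % (index, alert["id"])
    return "PUT", path, alert_to_percolator_query(alert)


def percolate_request(index, doc_type, job_offer):
    """Describe the request that asks which stored queries match a document."""
    path = "/%s/%s/_percolate" % (index, doc_type)
    return "GET", path, {"doc": job_offer}


# Hypothetical alert and job offer, for illustration only.
alert = {"id": "42", "criteria": {"country": "fr", "sector": "it"}}
offer = {"title": "Software Engineer", "country": "fr", "sector": "it"}

print(register_request("job_alerts", alert))
print(percolate_request("job_alerts", "job_offer", offer))
```

The functions only build request descriptions, so they can be plugged into any HTTP client; the response to the second request lists the ids of the matching alerts.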


  8. Does it perf?
    Goal: validate the streaming use case
    - 1x EC2 c3.4xlarge instance (16 vCPU, 30 GB memory)
    - 550k percolator queries extracted from prod data
    - no multi percolate
    - iterate over prod job offers
    - measure the “took” time
    GO!
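
A benchmark like this one boils down to collecting the `took` value (in milliseconds) from each percolate response and summarizing the distribution. The helper below is a hypothetical sketch, not the tooling used for the talk; it assumes each response is a dict with a `took` field, as the percolate API returns, and uses a nearest-rank 95th percentile.

```python
import math
import statistics


def summarize_took(responses):
    """Summarize the "took" times (ms) of a list of percolate responses."""
    tooks = sorted(response["took"] for response in responses)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(tooks)) - 1
    return {
        "count": len(tooks),
        "mean": statistics.mean(tooks),
        "median": statistics.median(tooks),
        "p95": tooks[p95_index],
        "max": tooks[-1],
    }


# Made-up responses, for illustration only.
sample = [{"took": t, "total": 3} for t in (10, 20, 30, 40, 100)]
print(summarize_took(sample))
```

Looking at the tail (p95, max) rather than only the mean matters for a streaming use case, since a few slow percolations can back up the queue.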


  9. Modeling
    [Diagram: job offer event flow: Batch, Event Bus, created job offer events, percolation]


  10. Job offer match enqueuing
    [Diagram: job offer, MQ, event sourcing, Index, hydrating, job alert, Percolate, index]


  11. Job offer match dequeuing
    [Diagram: Job Alert, Dequeue, hydrating, Sqoop]


  12. Optimizations
    - percolate an existing document (but not available in 1.1.x)
      www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_percolating_an_existing_document
      - no need to rebuild the document
      - less network bandwidth
    - use a dedicated percolation index
      www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_dedicated_percolator_index
      - different index lifecycle
      - optimized index settings (e.g. “all” replication => distributed workload)
      - needs the original index mapping!
    - use the multi percolate API (aka bulk processing)
      www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html#_multi_percolate_api
      - fewer network round trips
      - optimized computation on the cluster side
    - retrieve only matched alert ids
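
Several of these optimizations can be combined in one request. The sketch below builds a newline-delimited body for the multi percolate endpoint (`_mpercolate`) that percolates already-indexed job offers by id against a separate percolation index, then extracts only the matched alert ids from the response. It assumes the elasticsearch 1.x header fields `index`, `type`, `id` and `percolate_index`; the index and type names are hypothetical, and the response shape should be checked against the version in use.

```python
import json


def multi_percolate_body(index, doc_type, job_offer_ids, percolate_index=None):
    """Build the newline-delimited body of a multi percolate request.

    Each job offer is referenced by id (percolating an existing document),
    so the per-item body is empty and the document is not sent over the wire.
    """
    lines = []
    for offer_id in job_offer_ids:
        header = {"index": index, "type": doc_type, "id": offer_id}
        if percolate_index is not None:
            # Assumed header field: index holding the percolator queries.
            header["percolate_index"] = percolate_index
        lines.append(json.dumps({"percolate": header}, sort_keys=True))
        lines.append("{}")
    return "\n".join(lines) + "\n"


def matched_alert_ids(multi_response):
    """Return, for each percolated item, the list of matching query ids."""
    return [
        [match["_id"] for match in item.get("matches", [])]
        for item in multi_response["responses"]
    ]


body = multi_percolate_body("job_offers", "job_offer", ["1", "2"], "job_alerts")
print(body)
```

Keeping only the `_id` of each match is enough here because the alert itself is hydrated later from its own store.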


  13. Bugs & caveats
    - Don’t use “now” in a range query inside your percolator query (fixed in 1.7.2+)
      - the timestamp is resolved and stored at index time
      - thus it won’t be dynamic at query time
    - No score at percolation time!
      - forced us to re-query the percolated docs to get the top-n best job offers
      - would be nice to have… issue opened!


  14. Why this doesn’t fit
    - warmup of the system with existing job offers is painful
      - acceptable for the stream of new prod job offers, not for batch
    - warmup stresses the backends a lot!
      - 3k writes/s on ElastiCache (Redis)
      - 1:30m to percolate 7k job offers
      [Chart: Redis throughput, warmup vs nominal, 200k set/min]
    - alert queries do not filter enough
      - data was not validated
      - some of them have no criteria
      - should at least have a country and a sector
    - the system produces too much data
      - 1k avg job offers per alert after warmup
      - some alerts got 1k+ more matches a day
      - we only need 10 representative job offers max per mail!
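
The last point suggests that most stored matches are wasted. One way to bound the data, sketched below, is to keep only a fixed number of matches per alert while consuming the match stream. This is an illustration and not what the talk implemented: the `(alert_id, offer_id, published_at)` tuple layout is hypothetical, and "most recent" is used as a stand-in for "representative" since percolation returns no score.

```python
import heapq
from collections import defaultdict


def cap_matches(matches, limit=10):
    """Keep at most `limit` most recent job offers per alert.

    `matches` is an iterable of (alert_id, offer_id, published_at) tuples.
    A bounded min-heap per alert keeps memory at O(limit) per alert.
    """
    heaps = defaultdict(list)
    for alert_id, offer_id, published_at in matches:
        heap = heaps[alert_id]
        item = (published_at, offer_id)
        if len(heap) < limit:
            heapq.heappush(heap, item)
        else:
            # Push the new item, then drop the oldest of the limit + 1.
            heapq.heappushpop(heap, item)
    return {
        alert_id: [offer_id for _, offer_id in sorted(heap, reverse=True)]
        for alert_id, heap in heaps.items()
    }


# Made-up matches: alert "a" matches 12 offers, alert "b" matches one.
stream = [("a", "offer-%d" % day, day) for day in range(1, 13)] + [("b", "offer-1", 1)]
capped = cap_matches(stream)
print(len(capped["a"]), capped["b"])
```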


  15. Refactoring
    [Diagram: ES snapshot, EC2 instances running ES, Hydrating, Multi Search, pop]
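
The refactored design runs each alert as a regular search instead of percolating each job offer, which brings scoring back and lets the number of hits be capped. A minimal sketch of a multi search (`_msearch`) body is shown below: one header line and one body line per alert, with `size` limited to 10. The index name and the alert layout are hypothetical, and how the talk's Spark batch actually issues these requests is not described in the slides.

```python
import json


def multi_search_body(index, alerts, size=10):
    """Build the newline-delimited body of a multi search request.

    Each alert contributes a header line naming the index and a search
    body holding its query, limited to `size` scored hits.
    """
    lines = []
    for alert in alerts:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"query": alert["query"], "size": size}, sort_keys=True))
    return "\n".join(lines) + "\n"


# Hypothetical alerts, for illustration only.
alerts = [
    {"id": "1", "query": {"term": {"country": "fr"}}},
    {"id": "2", "query": {"term": {"sector": "it"}}},
]
print(multi_search_body("job_offers", alerts))
```

Responses come back in the same order as the requests, so the i-th response can be paired with the i-th alert.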


  16. Does it work?
    For ~200k alert batch
    - EC2 instances cost: $1.6 (2x r3.4xlarge, 16 vCPU, 120 GB memory)
    - Total duration: 17m
      - snapshot: 2m
      - initialize EC2 instances: 4m
      - index restoration: 20s
      - spark batch computation: 10m
      - dump result as Avro file: 2s
      - notifying platform: 22s
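
As a quick check of the figures on this slide, the step durations can be added up. The short script below uses only the numbers listed above and shows that they sum to just under the stated 17 minutes.

```python
# Step durations from the slide, converted to seconds.
steps_seconds = {
    "snapshot": 2 * 60,
    "initialize EC2 instances": 4 * 60,
    "index restoration": 20,
    "spark batch computation": 10 * 60,
    "dump result as Avro file": 2,
    "notifying platform": 22,
}

total = sum(steps_seconds.values())
minutes, seconds = divmod(total, 60)
print("%d min %d s" % (minutes, seconds))  # prints "16 min 44 s"
```

The Spark computation accounts for about 60% of the total, with instance startup as the next largest step.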


  17. First runs feedback!


  18. Take away
    - Don’t over-design things
    - Do representative benchmarks
    - Keep the KISS principle in mind
    - ES rocks for searching things massively!


  19. Elastic Paris Meetup #18
    Wed, 3rd February 2016 @ Viadeo
    Questions?
