[EN] Fast and relevant jobs with Elasticsearch

Jobrapido
February 05, 2016

Analytics and DevOps meetup organized by Kiratech -
Milan, 5 February 2016 - S. Vadacca

Transcript

  1. FAST AND RELEVANT JOBS WITH ELASTICSEARCH

    Salvatore Vadacca, Head of Product Development @ Jobrapido
    Analytics and Devops Meetup, Milan, February 5, 2016
  2. ABOUT ME

    •  NAME: Salvatore Vadacca
    •  ROLE: Head of Product Development @ Jobrapido
    •  COMPANY WEBSITE: www.jobrapido.com
    •  EMAIL: [email protected]
    •  TWITTER: @totovadacca
    •  LINKEDIN: https://it.linkedin.com/in/salvatorevadacca
  3. WHO WE ARE

    Jobrapido is the world's leading job search engine. It analyses and collects all job posts on the web and gives jobseekers all the available offers, ordered by relevance to the search they ran.

    Analysis, Aggregation, Response

    •  WEBSITES: in 58 countries
    •  OFFICES: head office in Milan + office in Amsterdam
    •  UNIQUE VISITORS: 35 million UVs / month
    •  SUBSCRIBERS: 60+ million subscribed users (current stock)
    •  PAGEVIEWS / CLICKS*: 280 million PVs / month & 130 million clicks / month
    •  JOBS: 20+ million jobs at any given time
    •  VISITORS: 1.0 billion visits / year
    •  PEOPLE: 100+

    * Clicks on job listings (organic + sponsored) and clicks on contextual ads
  4. OUR ELASTIC NUMBERS (MAIN CLUSTER)

    •  DOCUMENTS: >200 M
    •  DATA: 1TB
    •  AVG SEARCH RATE: 20K/sec
    •  AVG INDEX RATE: 400/sec
    •  MEMORY: >1TB
    •  NODES: 25 data, 3 masters, 65 client
    •  CURRENT VERSION: 1.7.3
  5. MOBILE APP

    [Screenshots: MY SEARCHES, MY JOBS, MENU, CNT SELECTION, SIGN UP, SIGN IN]
  6. SEARCH AND TECH PEOPLE @ JOBRAPIDO

    STEFANO MASSERA, MATTIA PICCINETTI, ARNALDO DE MAIO, FABIO ARCARI, STEFANO FRIGERIO, GIANLUCA TOMASINO, MAGDA SWIERCZEK, MICHELE PINTO, VALENTINO MIAZZO, FABIO RANFI, LORENZO SANTI, STEFANO ZANIN, VALENTINA MISTRANGELO, RAFFAELE SCOZZAFAVA, ANDREA CHIAROT, SHARATH PERAVALI, ANDREA VAGHI, GIUSEPPE LA TONA, ARTURO GATTO, LUIGI SAGGESE, MARCELLO GRECO, MAURIZIO BATTAGHINI, ALBERTO BASOLI, MARCO SIVIERO, MICHELE PATERNITI, PAOLO ZITELLI, MARCO LOCATELLI, ALINA CHELMUS, FABIO PIZZATO, GABRIELE TONINELLI, ANTONELLA CIPRESSO, FRANCESCO CARANTE
  7. A BIT OF HISTORY

    •  Jun 2006: Jobrapido founded
    •  Dec 2006: Jobrapido covers IT, UK, DE, AT, CH and ES
    •  Dec 2007: Jobrapido in Latin America (AR, CL and MX)
    •  Jul 2008: Jobrapido reaches 3M unique visitors
    •  Nov 2008: Jobrapido launches in France
    •  Aug 2009: Jobrapido reaches 10M unique visitors
    •  Dec 2009: Jobrapido launches in the US and AU
    •  Jun 2010: Jobrapido launches in Asia and Africa
    •  Sep 2010: Jobrapido serves 50 countries
    •  Jan 2011: Jobrapido counts 50 employees
    •  Jun 2011: New Jobrapido HQ in Milan
    •  Apr 2012: Jobrapido joins DMGT
    •  Jul 2013: Jobrapido websites migrated to Java
    •  Dec 2013: Search&Match team established
    •  Mar 2014: Jobrapido joins STG
    •  May 2014: 1st Elasticsearch spike live on ao.jobrapido.com
    •  Jul 2014: All Jobrapido websites moved to Elasticsearch
    •  Dec 2014: Locations moved to Elasticsearch
    •  Mar 2015: Full-text relevance on all our websites
    •  May 2015: Dynamic sitemaps live
    •  Oct 2015: Jobrapido counts 100 employees
    •  Dec 2015: Jobrapido Jobsearch API
  8. THE NEED FOR A NEW SEARCH ENGINE

    •  Result sorting limited to CPC and publish date
    •  Debug and troubleshooting nearly impossible
    •  Exact match was the only option
    •  Slow reindex time (up to 10 days)
    •  Custom and inaccurate language analysis
    •  No high availability
    •  Hard to scale
  9. A TWO-STEP MIGRATION

    •  Step 1: Elasticsearch as a key-value store
        •  Performance
        •  Scalability
        •  High availability
        •  Faster reindex
    •  Step 2: full-text search
        •  Relevance
        •  Built-in multi-language support
        •  Configurable sort options
  10. [Languages covered]

    ITALIAN, SPANISH, FRENCH, ENGLISH, POLISH, HUNGARIAN, JAPANESE, GERMAN, ROMANIAN, PORTUGUESE, SWEDISH, RUSSIAN, DANISH, DUTCH, CZECH, TURKISH, KOREAN, CHINESE
  11. JOB INDICES

    •  One index per country
    •  Two aliases with filter (organic vs. sponsored jobs)
    •  Each index implements country- and language-specific analysers
    •  A country may support more than one language
        •  e.g., Canada, Switzerland, etc.

    [Diagram: country indices IT, CH, FR, UK, DE, US behind IT.JOBRAPIDO.COM, CH.JOBRAPIDO.COM, DE.JOBRAPIDO.COM, UK.JOBRAPIDO.COM, FR.JOBRAPIDO.COM, US.JOBRAPIDO.COM]
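
The "one index per country, two filtered aliases" layout can be sketched as a request body for the Elasticsearch `_aliases` API. This is only an illustration under stated assumptions: the index name pattern `jobs_<country>`, the alias names and the boolean field `sponsored` are hypothetical and do not come from the deck.

```python
import json

def country_alias_actions(country, index_prefix="jobs_"):
    """Build an `_aliases` body with two filtered aliases on one country index.

    Index/alias naming and the `sponsored` field are hypothetical.
    """
    index = f"{index_prefix}{country}"

    def add(alias, sponsored):
        # A filtered alias behaves like a saved filter on top of the index.
        return {"add": {"index": index, "alias": alias,
                        "filter": {"term": {"sponsored": sponsored}}}}

    return {"actions": [add(f"{index}_organic", False),
                        add(f"{index}_sponsored", True)]}

body = country_alias_actions("it")
print(json.dumps(body, indent=2))
```

With such aliases the application can query the organic or the sponsored view without repeating the filter in every search request.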
  12. THE ANATOMY OF AN ANALYZER

    •  Strip HTML
    •  Tokenization
    •  Lowercase
    •  Stopwords
        1.  _german_, _french_, _english_, … (built-in)
        2.  Language-specific (file)
        3.  Country-specific (file)
    •  Stemming
        1.  light_german, light_french, english, … (built-in)
        2.  Language-specific exceptions (file)
        3.  Country-specific exceptions (file)
    •  Language-specific filters (e.g., elision, possessive)
    •  Synonyms
    •  Shingles
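
To make the order of these stages concrete, here is a toy, pure-Python imitation of such an analysis chain. It is not Elasticsearch code: the stopword set, the exception set and the naive plural-stripping "stemmer" are stand-ins invented for the example, the synonym stage is left out, and the shingle step ignores the position gaps that a real shingle filter would account for after stopword removal.

```python
import re
from html import unescape

STOPWORDS = {"the", "a", "for", "in", "and"}   # stand-in for a stopword list
STEM_EXCEPTIONS = {"sales"}                    # stand-in for an exceptions file

def strip_html(text):
    # Char-filter stage: drop tags, decode entities.
    return unescape(re.sub(r"<[^>]+>", " ", text))

def tokenize(text):
    return re.findall(r"\w+", text)

def stem(token):
    # Deliberately naive plural stripping; NOT a real light stemmer.
    if token in STEM_EXCEPTIONS or len(token) <= 3:
        return token
    return token[:-1] if token.endswith("s") else token

def shingles(tokens, size=2):
    # Word n-grams over adjacent surviving tokens.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def analyze(text):
    tokens = [t.lower() for t in tokenize(strip_html(text))]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [stem(t) for t in tokens]
    return tokens + shingles(tokens)

print(analyze("<b>Sales Managers</b> for the Milan office"))
```

The exception set shows why the slide lists stemming exceptions separately: without it, "sales" would be cut down to "sale".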
  13. MULTI-LANGUAGE PROPERTIES

    BODY property and its sub-fields (char filter / tokenizer / token filters):

    •  GERMAN: HTML STRIP / STANDARD / LOWERCASE, GERMAN STOPWORDS, GERMAN STEMMER
    •  STANDARD: HTML STRIP / STANDARD / STANDARD, LOWERCASE, GERMAN STOPWORDS
    •  ENGLISH: HTML STRIP / STANDARD / POSSESSIVE, LOWERCASE, ENGLISH STOPWORDS, ENGLISH STEMMER
    •  FRENCH: HTML STRIP / STANDARD / ELISION, LOWERCASE, FRENCH STOPWORDS, FRENCH STEMMER
    •  ITALIAN: HTML STRIP / STANDARD / ELISION, LOWERCASE, ITALIAN STOPWORDS, ITALIAN STEMMER
    •  SHINGLE: HTML STRIP / STANDARD / LOWERCASE, GERMAN STOPWORDS, GERMAN STEMMER, SHINGLE
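
A multi-field layout like this one could be declared roughly as follows, in the Elasticsearch 1.x style (`string` type with `fields`). This is a sketch, not the actual Jobrapido settings: every analyzer and filter name (`body_german`, `german_stop`, `pair_shingle`, and so on) is made up for the example, the shingle size is a guess, and the French and Italian sub-fields are omitted for brevity.

```python
def custom_analyzer(*filters):
    """Custom analyzer: html_strip char filter, standard tokenizer, given token filters."""
    return {"type": "custom", "char_filter": ["html_strip"],
            "tokenizer": "standard", "filter": list(filters)}

# Hypothetical index settings; names are illustrative only.
SETTINGS = {"analysis": {
    "filter": {
        "german_stop": {"type": "stop", "stopwords": "_german_"},
        "german_stem": {"type": "stemmer", "language": "light_german"},
        "english_stop": {"type": "stop", "stopwords": "_english_"},
        "english_stem": {"type": "stemmer", "language": "english"},
        "english_possessive": {"type": "stemmer", "language": "possessive_english"},
        "pair_shingle": {"type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2},
    },
    "analyzer": {
        "body_german": custom_analyzer("lowercase", "german_stop", "german_stem"),
        "body_standard": custom_analyzer("standard", "lowercase", "german_stop"),
        "body_english": custom_analyzer("english_possessive", "lowercase",
                                        "english_stop", "english_stem"),
        "body_shingle": custom_analyzer("lowercase", "german_stop",
                                        "german_stem", "pair_shingle"),
    },
}}

def body_mapping():
    """Main `body` field (German) plus one sub-field per additional analyzer."""
    def sub(name):
        return {"type": "string", "analyzer": f"body_{name}"}
    return {"body": {"type": "string", "analyzer": "body_german",
                     "fields": {n: sub(n) for n in ("standard", "english", "shingle")}}}

print(sorted(body_mapping()["body"]["fields"]))
```

Each sub-field indexes the same source text with a different analyzer, so a query can address `body`, `body.standard` or `body.english` and weight them independently.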
  14. MULTI-LANGUAGE PROPERTIES

    {
      "query": {
        "filtered": {
          "query": {
            "bool": {
              "must": {
                "multi_match": {
                  "query": "product manager",
                  "fields": [
                    "body^3",
                    "body.standard^3",
                    "body.english^2",
                    "body.french^1",
                    "body.italian^1"
                  ],
                  "type": "most_fields",
                  "operator": "AND"
                }
              }
            }
          },
          "filter": { … }
        }
      }
    }

    •  The un-suffixed body field is the German one
    •  Application-side configuration lets us define the search fields and their individual boosts
    •  We constantly run A/B tests to improve matching rate and tune relevance
  15. SITEMAP BY JOB TITLES

    •  Industry is a piece of information you cannot easily find in structured documents
    •  Only a few websites explicitly show job titles and industry
    •  What if we build a taxonomy of job titles/industry represented by queries?
    •  That would allow enriching documents at index time by means of percolators
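
The idea of a taxonomy "represented by queries" can be imitated in a few lines of plain Python. This is a toy stand-in for the percolator, not Elasticsearch code: each taxonomy entry plays the role of a registered query whose terms must all match (like `"operator": "AND"`), and carries the metadata to attach. The entries, the `classification` field and the "Product" sector are hypothetical; `jobtitle` and `sector` follow the naming of the deck's percolator example.

```python
import re

# Hypothetical taxonomy: one entry per registered "query" plus its metadata.
TAXONOMY = [
    {"terms": ["account", "director"], "jobtitle": "Account Director", "sector": "Sales"},
    {"terms": ["product", "manager"], "jobtitle": "Product Manager", "sector": "Product"},
]

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def percolate(doc, fields=("headline", "body")):
    """Return the taxonomy entries whose terms all occur in the document."""
    bag = set().union(*(tokens(doc.get(f, "")) for f in fields))
    return [q for q in TAXONOMY if set(q["terms"]) <= bag]

def enrich(doc):
    """Copy the document and attach the matching job titles and sectors."""
    return {**doc, "classification": [
        {"jobtitle": q["jobtitle"], "sector": q["sector"]} for q in percolate(doc)]}

job = {"headline": "Senior Account Director", "body": "Lead our sales team in Milan."}
print(enrich(job)["classification"])
```

The direction is reversed compared with a normal search: the queries are stored, and each incoming document is checked against them before it is indexed.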
  16. JOURNEY OF A JOB DOCUMENT

    [Diagram. Stages: CRAWL, ENRICH, INDEX. Components: CRAWLER, CRAWLED JOB DOCUMENT, ELASTICSEARCH (.PERCOLATOR, JOBS), ENRICHED JOB DOCUMENT, SITEMAP, WEBSITE]
  17. PERCOLATOR EXAMPLE

    {
      "query": {
        "filtered": {
          "query": {
            "bool": {
              "must": {
                "multi_match": {
                  "query": "Account Director",
                  "fields": [
                    "headline^2",
                    "headline.standard^1",
                    "body^1",
                    "body.standard^1",
                    "company_name^1"
                  ],
                  "type": "most_fields",
                  "operator": "AND"
                }
              }
            }
          },
          "filter": { "bool": { … } }
        }
      },
      "jobtitle": "Account Director",
      "sector": "Sales"
    }

    •  The percolator is a standard query (multi-match on the search fields)
    •  Jobtitle and sector are attached to the query and indexed together with the document (nested)
  18. PROS AND CONS

    •  Live document enrichment (+)
    •  Job classification based on keywords (+)
    •  Aggregate by industry and sub-aggregate by location (+)
    •  Slower reindex time (-)
        •  A full reindex is 10x slower
    •  Aggregations are heavy (-)
        •  Caching required
        •  Inaccurate since the population is dynamic
    •  Try to be consistent with your queries
        •  e.g., percolators do not support min_score, whereas queries do
  20. WHAT'S NEXT

    •  Sitemaps change frequently
        •  Job import and lifecycle cause link churn
    •  Sitemaps are heavy
        •  Tons of job titles and locations
    •  Google periodically crawls sitemaps
    •  Google allows pushing sitemap changes
    •  We do not want to push unstable changes
  21. SITEMAP CHANGES

    [Diagram. Sitemap snapshots: S1, S2, S3, SX, SY. Deltas: Δ1, Δ2, ΔX, ΔY. Labels: MEMORY, DAY 1, DAY N, CHANGES DAY 1, CHANGES DAY N, ELASTICSEARCH, WEEK 1, WEEKLY ALIAS, CHANGES ANALYZER, AGGREGATE CHANGES, SITEMAP PUSH, INDEX]
  22. CONSIDERATIONS

    •  Bucket aggregations can drop buckets that fall below min_doc_count
    •  Unfortunately there is no max_doc_count filter
    •  A small and separate use-case allowed us to test ES 2.0
    •  ES 2.0 provides pipeline aggregations (experimental)
    •  Pipeline aggregations work on the outputs produced by other aggregations rather than on document sets
    •  After the pipeline aggregation we query back the changes index to get only the documents we need
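
One way a pipeline aggregation can stand in for the missing max_doc_count is a `bucket_selector` nested under a `terms` aggregation. The deck does not say which pipeline aggregation was used, so this is an assumption, as are the field name `sitemap_url`, the aggregation names and both thresholds. The bare-variable script syntax follows the ES 2.x documentation (later versions use `params.count`), and inline scripting has to be enabled on the cluster for it to run.

```python
def changes_aggregation(min_docs=2, max_docs=100):
    """Terms aggregation whose buckets are capped by a bucket_selector.

    `sitemap_url` and the thresholds are hypothetical.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "by_url": {
                # Lower bound: supported natively by the terms aggregation.
                "terms": {"field": "sitemap_url", "min_doc_count": min_docs},
                "aggs": {
                    # Upper bound: emulated on the bucket's own doc count.
                    "not_too_many": {
                        "bucket_selector": {
                            "buckets_path": {"count": "_count"},
                            "script": f"count <= {max_docs}",
                        }
                    }
                },
            }
        },
    }

body = changes_aggregation(min_docs=3, max_docs=10)
print(body["aggs"]["by_url"]["aggs"]["not_too_many"])
```

The surviving bucket keys can then be used in a second request against the changes index, which matches the "query back" step on the slide.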
  23. [Location tree, one level per line]

    UNITED KINGDOM
      SCOTLAND, ENGLAND, WALES, NORTHERN IRELAND
        NORTH EAST ENGLAND, NORTH WEST ENGLAND, SOUTH EAST ENGLAND, SOUTH WEST ENGLAND, …
          BERKSHIRE, SURREY, LONDON, KENT, …
            NORTH EAST LONDON, NORTH LONDON, WEST LONDON, SOUTH EAST LONDON, …
              LONDON (GREENWICH), LONDON (LEWISHAM), LONDON (BROMLEY), LONDON (SOUTHWARK), …
                LONDON (CHARLTON), LONDON (WOOLWICH), LONDON (ELTHAM), LONDON (MAZE HILL), …

    Example paths:
    /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON/SOUTH EAST LONDON/LONDON (GREENWICH)/LONDON (WOOLWICH)
    /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON/SOUTH EAST LONDON/LONDON (GREENWICH)/LONDON (CHARLTON)
  24. LOCATION MAPPING

    LOCATION (example document):

    •  CANONICAL NAME: LONDON
    •  CANONICAL PATH: /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON
    •  GEO COORDINATES: POINT (-0.130714000141, 51.498555)
    •  LOCATION DEPTH: 3
    •  ORGANIC PATH: /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON
    •  SEARCH PATH: {…}
    •  SPECIAL PATH: /LONDON
    •  SYNONYMS: LONDRES, LONDRA, GREATER LONDON, SE1 1PP, EC2A 4JU, …
    •  WEAK SYNONYMS: []
  25. LOCATION SEARCH AND INDEXING

    [Diagram. Search side: WHERE, QUERY, LOCATIONS (SYNONYMS + WEAK SYNONYMS), SEARCH PATH + GEO COORDINATES, SEARCH on JOB. Index side: CRAWLED LOCATION (SYNONYMS + WEAK SYNONYMS), CRAWLED TEXT (SYNONYMS), LOCATIONS, CANONICAL PATH + GEO COORDINATES, PATH HIERARCHY ANALYZER, INDEX into JOB. Domains: LOCATION DOMAIN (e.g., LONDON), PATH DOMAIN (e.g., /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON)]
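
What a path hierarchy analyzer does to a canonical path can be shown with a small pure-Python imitation of Elasticsearch's `path_hierarchy` tokenizer with its default `/` delimiter. The function name is made up for the example; this is a sketch of the behaviour, not the tokenizer itself.

```python
def path_hierarchy(path, delimiter="/"):
    """Emit every ancestor prefix of a path, shortest first."""
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

for token in path_hierarchy("/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON"):
    print(token)
```

Because a job indexed under a deep path also carries every ancestor prefix as a term, an exact term filter on a broader path such as `/UNITED KINGDOM/ENGLAND` matches all the jobs located below it.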
  26. LOCATION SEARCH

    {
      "or": {
        "filters": [
          {
            "terms": {
              "location": [
                "/hartsville, sc",
                "/united states/southern united states/south atlantic/south carolina/darlington county, sc/hartsville, sc"
              ],
              "_cache": false
            }
          },
          {
            "geo_shape": {
              "geo_coordinates": {
                "indexed_shape": {
                  "id": "443498",
                  "type": "location",
                  "index": "us_geo_shapes",
                  "path": "geo_coordinates"
                },
                "relation": "within"
              },
              "_cache": false
            }
          }
        ],
        "_cache": true,
        "_cache_key": "443498"
      }
    }

    •  We search locations by path and coordinates
    •  Caching is performed only on the enclosing or filter (the sub-filters are only ever used as part of it, so we can skip caching them)
    •  Cache keys allow saving memory
  27. WHAT IS JOBSEEKER DELIVERY

    •  Customers partner with Jobrapido to get
        •  Applications (CVs)
        •  Traffic
    •  How do we provide qualified candidates on demand?
    •  We should notify only relevant jobseekers
        •  Interested
        •  Active
        •  No pressure
    •  We want to maximize the chance of delivering the right candidate
  28. ELASTIC TARGETING

    [Diagram labels: SEARCHES, JOBSEEKERS DB, JOBSEEKERS INDEX, JOB, KEYWORD EXTRACTION, KEYWORDS, EMAIL PROVIDER, INDEX, VISIT, EMAIL CAMPAIGN, CONTENT]
  29. HOW DO WE SEARCH JOBSEEKERS

    •  Full-text search
        •  Search keywords on the jobseeker's saved searches
        •  Apply the same mapping as the pure search scenario (be consistent with the user search experience)
    •  Synonyms
        •  Apply synonyms to increase matching rate
    •  Fuzziness
        •  Users often misspell words (and sometimes advertisers do)
    •  Aggregation
        •  Give more weight to jobseekers with more than one search matching
    •  First test with Elasticsearch 2.1.1
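
The full-text, fuzziness and aggregation points can be combined in a single request body, sketched here as a Python dict. It assumes one indexed document per saved search; the field names `search_text` and `jobseeker_id` and the aggregation name are hypothetical, and synonyms are assumed to live in the field's analyzer rather than in the query.

```python
def jobseeker_query(keywords, min_matching_searches=1):
    """Fuzzy match on saved searches, grouped by jobseeker.

    `search_text` and `jobseeker_id` are hypothetical field names.
    """
    return {
        "size": 0,  # only the per-jobseeker buckets are of interest
        "query": {"match": {"search_text": {
            "query": keywords,
            "operator": "and",
            "fuzziness": "AUTO",  # tolerate small misspellings
        }}},
        "aggs": {"by_jobseeker": {"terms": {
            "field": "jobseeker_id",
            # Buckets are ordered by doc count by default, so jobseekers
            # with more matching saved searches come first.
            "min_doc_count": min_matching_searches,
        }}},
    }

print(jobseeker_query("java developer", min_matching_searches=2))
```

Raising `min_matching_searches` is one simple way to keep only jobseekers with several matching searches; a weighted score would need more than this sketch.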
  30. CONCLUSIONS

    •  Jobrapido covers 58 countries and 18 languages
    •  Percolations and aggregations allow for document enrichment and dynamic sitemap creation
    •  Pipeline aggregations make it easier to push only significant sitemap changes to Google
    •  Path hierarchies cleverly represent the location structure
    •  We search not only jobs but also jobseekers
    •  Index and search like a pro :)
        •  Documentation: https://www.elastic.co/guide/index.html
        •  Training: thanks Luca and Karel :)
        •  Book: Elasticsearch: The Definitive Guide
        •  Support: Jobrapido is a proud platinum customer (production and development advice), thanks Antonio :)
    •  Thanks to Michele Solazzo and Kiratech for their commercial support
  31. WE ARE HIRING!

    •  Back-End Engineer
    •  Tracking Engineer
    •  Full-Stack Engineer
    •  ETL Engineer
    •  Search Specialist (German speaker)
    •  Job aggregation Product Manager
    •  http://corporate.jobrapido.com (Careers)