[EN] Fast and relevant jobs with Elasticsearch

FAST AND RELEVANT JOBS WITH ELASTICSEARCH
Salvatore Vadacca – Head of Product Development @ Jobrapido
Analytics and Devops Meetup – Milano – February 5, 2016

ABOUT ME
•  NAME: Salvatore Vadacca
•  ROLE: Head of Product Development @ Jobrapido
•  COMPANY WEBSITE: www.jobrapido.com
•  EMAIL: [email protected]
•  TWITTER: @totovadacca
•  LINKEDIN: https://it.linkedin.com/in/salvatorevadacca

WHO WE ARE
Jobrapido is the world's leading job search engine. It analyses and collects all job posts on the web and gives jobseekers every available offer, ordered by relevance to the search they have run.
•  Analysis, Aggregation, Response
•  WEBSITES: in 58 countries
•  OFFICES: head office in Milan + office in Amsterdam
•  UNIQUE VISITORS: 35 million UVs / month
•  SUBSCRIBERS: 60+ million subscribed users (current stock)
•  PAGEVIEWS / CLICKS*: 280 million PVs / month & 130 million clicks / month
•  JOBS: 20+ million jobs at any given time
•  VISITORS: 1.0 billion visits / year
•  PEOPLE: 100+
* Clicks on job listings (organic + sponsored) and clicks on contextual ads

OUR ELASTIC NUMBERS (MAIN CLUSTER)
•  DOCUMENTS: >200 M
•  DATA: 1 TB
•  AVG SEARCH RATE: 20K/sec
•  AVG INDEX RATE: 400/sec
•  MEMORY: >1 TB
•  NODES: 25 data, 3 masters, 65 client
•  CURRENT VERSION: 1.7.3

MOBILE APP
[Screenshot labels: MY SEARCHES, MY JOBS, MENU, CNT SELECTION, SIGN UP, SIGN IN]

WHERE WE ARE

SEARCH AND TECH PEOPLE @ JOBRAPIDO
STEFANO MASSERA, MATTIA PICCINETTI, ARNALDO DE MAIO, FABIO ARCARI, STEFANO FRIGERIO, GIANLUCA TOMASINO, MAGDA SWIERCZEK, MICHELE PINTO, VALENTINO MIAZZO, FABIO RANFI, LORENZO SANTI, STEFANO ZANIN, VALENTINA MISTRANGELO, RAFFAELE SCOZZAFAVA, ANDREA CHIAROT, SHARATH PERAVALI, ANDREA VAGHI, GIUSEPPE LA TONA, ARTURO GATTO, LUIGI SAGGESE, MARCELLO GRECO, MAURIZIO BATTAGHINI, ALBERTO BASOLI, MARCO SIVIERO, MICHELE PATERNITI, PAOLO ZITELLI, MARCO LOCATELLI, ALINA CHELMUS, FABIO PIZZATO, GABRIELE TONINELLI, ANTONELLA CIPRESSO, FRANCESCO CARANTE

A BIT OF HISTORY
•  Jun 2006 – Jobrapido founded
•  Dec 2006 – Jobrapido covers IT, UK, DE, AT, CH and ES
•  Dec 2007 – Jobrapido in Latin America (AR, CL and MX)
•  Jul 2008 – Jobrapido reaches 3M unique visitors
•  Nov 2008 – Jobrapido launches in France
•  Aug 2009 – Jobrapido reaches 10M unique visitors
•  Dec 2009 – Jobrapido launches in the US and AU
•  Jun 2010 – Jobrapido launches in Asia and Africa
•  Sep 2010 – Jobrapido serves 50 countries
•  Jan 2011 – Jobrapido counts 50 employees
•  Jun 2011 – New Jobrapido HQ in Milan
•  Apr 2012 – Jobrapido joins DMGT
•  Jul 2013 – Jobrapido websites migrated to Java
•  Dec 2013 – Search&Match team established
•  Mar 2014 – Jobrapido joins STG
•  May 2014 – 1st Elasticsearch spike live on ao.jobrapido.com
•  Jul 2014 – All Jobrapido websites moved to Elasticsearch
•  Dec 2014 – Locations moved to Elasticsearch
•  Mar 2015 – Full-text relevance on all our websites
•  May 2015 – Dynamic sitemaps live
•  Oct 2015 – Jobrapido counts 100 employees
•  Dec 2015 – Jobrapido Jobsearch API

THE NEED FOR A NEW SEARCH ENGINE
•  Result sorting limited to CPC and publish date
•  Debug and troubleshooting nearly impossible
•  Exact match was the only option
•  Slow reindex time (up to 10 days)
•  Custom and inaccurate language analysis
•  No high availability
•  Hard to scale

A TWO-STEP MIGRATION
•  Step 1: Elasticsearch as a key-value store
   •  Performance
   •  Scalability
   •  High availability
   •  Faster reindex
•  Step 2: full-text search
   •  Relevance
   •  Built-in multi-language support
   •  Configurable sort options

MULTI-LANGUAGE SUPPORT
58 countries, 18 languages

ITALIAN, SPANISH, FRENCH, ENGLISH, POLISH, HUNGARIAN, JAPANESE, GERMAN, ROMANIAN, PORTUGUESE, SWEDISH, RUSSIAN, DANISH, DUTCH, CZECH, TURKISH, KOREAN, CHINESE

JOB INDICES
•  One index per country
•  Two aliases with filter (organic vs. sponsored jobs)
•  Each index implements country and language-specific analysers
•  A country may support more than one language
   •  e.g., Canada, Switzerland, etc.
[Diagram: indices IT, CH, FR, UK, DE, US behind IT.JOBRAPIDO.COM, CH.JOBRAPIDO.COM, DE.JOBRAPIDO.COM, UK.JOBRAPIDO.COM, FR.JOBRAPIDO.COM, US.JOBRAPIDO.COM]
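The "two aliases with filter" idea can be sketched as a request body for the Elasticsearch `_aliases` API. This is only an illustration: the `jobs_<country>` naming scheme, the alias suffixes and the boolean `sponsored` field are assumptions, not names taken from the slides.

```python
def build_alias_actions(country, sponsored_field="sponsored"):
    """Build an `_aliases` request body with two filtered aliases for one country index.

    Index name, alias names and the `sponsored` field are hypothetical.
    """
    index = "jobs_%s" % country
    actions = []
    for suffix, is_sponsored in (("organic", False), ("sponsored", True)):
        actions.append({
            "add": {
                "index": index,
                "alias": "%s_%s" % (index, suffix),
                # A filtered alias only exposes the documents matching this filter.
                "filter": {"term": {sponsored_field: is_sponsored}},
            }
        })
    return {"actions": actions}


ALIASES_IT = build_alias_actions("it")
```

Searching through `jobs_it_organic` would then behave like searching `jobs_it` with the organic filter always applied.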

THE ANATOMY OF AN ANALYZER
•  Strip HTML
•  Tokenization
•  Lowercase
•  Stopwords
   1.  _german_, _french_, _english_, … (built-in)
   2.  Language-specific (file)
   3.  Country-specific (file)
•  Stemming
   1.  light_german, light_french, english, … (built-in)
   2.  Language-specific exceptions (file)
   3.  Country-specific exceptions (file)
•  Language-specific filters (e.g., elision, possessive)
•  Synonyms
•  Shingles
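The chain above could translate into index `analysis` settings roughly as follows. It is a minimal sketch that follows the slide's order of steps; the filter names, the analyzer name and the file paths are placeholders, and the actual Jobrapido configuration is not shown in the deck.

```python
def build_analysis_settings(language, stemmer_name, stopwords_path, synonyms_path):
    """Build `analysis` settings for one language-specific custom analyzer.

    All filter/analyzer names and file paths are hypothetical placeholders.
    """
    filters = {
        # Built-in stopword list, e.g. "_german_".
        "builtin_stop": {"type": "stop", "stopwords": "_%s_" % language},
        # Language- or country-specific stopwords loaded from a file.
        "custom_stop": {"type": "stop", "stopwords_path": stopwords_path},
        # Built-in stemmer, e.g. "light_german".
        "builtin_stemmer": {"type": "stemmer", "language": stemmer_name},
        "synonyms": {"type": "synonym", "synonyms_path": synonyms_path},
        "shingles": {"type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2},
    }
    analyzer = {
        "type": "custom",
        "char_filter": ["html_strip"],   # strip HTML
        "tokenizer": "standard",         # tokenization
        "filter": [
            "lowercase",
            "builtin_stop",
            "custom_stop",
            "builtin_stemmer",
            "synonyms",
            "shingles",
        ],
    }
    return {
        "analysis": {
            "filter": filters,
            "analyzer": {"%s_body" % language: analyzer},
        }
    }
```

Stemming exceptions loaded from a file are left out of the sketch to keep it short.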

MULTI-LANGUAGE PROPERTIES
FIELD          | CHAR FILTER | TOKENIZER | FILTER
BODY (GERMAN)  | HTML STRIP  | STANDARD  | LOWERCASE, GERMAN STOPWORDS, GERMAN STEMMER
STANDARD       | HTML STRIP  | STANDARD  | STANDARD, LOWERCASE, GERMAN STOPWORDS
ENGLISH        | HTML STRIP  | STANDARD  | POSSESSIVE, LOWERCASE, ENGLISH STOPWORDS, ENGLISH STEMMER
FRENCH         | HTML STRIP  | STANDARD  | ELISION, LOWERCASE, FRENCH STOPWORDS, FRENCH STEMMER
ITALIAN        | HTML STRIP  | STANDARD  | ELISION, LOWERCASE, ITALIAN STOPWORDS, ITALIAN STEMMER
SHINGLE        | HTML STRIP  | STANDARD  | LOWERCASE, GERMAN STOPWORDS, GERMAN STEMMER, SHINGLE
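One way to get several analyses of the same text is an Elasticsearch 1.x multi-field mapping, where `body` is indexed once per sub-field with a different analyzer. The sketch below assumes that approach; the analyzer names passed in are hypothetical.

```python
def build_body_mapping(default_analyzer, sub_analyzers):
    """Build an ES 1.x multi-field mapping for `body`.

    `sub_analyzers` maps a sub-field name (e.g. "english") to the name of the
    analyzer used for it; all analyzer names here are assumptions.
    """
    return {
        "body": {
            "type": "string",
            "analyzer": default_analyzer,
            "fields": {
                name: {"type": "string", "analyzer": analyzer}
                for name, analyzer in sub_analyzers.items()
            },
        }
    }


GERMAN_SITE_BODY = build_body_mapping(
    "german_body",
    {
        "standard": "standard_body",
        "english": "english_body",
        "french": "french_body",
        "italian": "italian_body",
        "shingle": "german_shingle_body",
    },
)
```

Each sub-field is then addressable at query time as `body.standard`, `body.english`, and so on.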

MULTI-LANGUAGE PROPERTIES
[Diagram labels: JOB, ELASTICSEARCH, LAVORO, EMPLOI, JOB]

MULTI-LANGUAGE PROPERTIES
    "query": {
      "filtered": {
        "query": {
          "bool": {
            "must": {
              "multi_match": {
                "query": "product manager",
                "fields": [
                  "body^3", //german
                  "body.standard^3",
                  "body.english^2",
                  "body.french^1",
                  "body.italian^1"
                ],
                "type": "most_fields",
                "operator": "AND"
              }
            }
          }
        },
        "filter": { … }
      }
    }
•  Application-side configuration lets us define the search fields and their individual boosts
•  We constantly run A/B tests to improve matching rate and tune relevance

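Keeping fields and boosts in application-side configuration might look like the sketch below: a list of (field, boost) pairs is turned into the `multi_match` clause, so an A/B variant only needs a different list. The helper name and the constant are hypothetical; the field names and boosts are the ones from the query on the slide.

```python
def build_multi_match(text, field_boosts):
    """Build a `multi_match` clause from (field, boost) pairs."""
    fields = ["%s^%d" % (field, boost) for field, boost in field_boosts]
    return {
        "multi_match": {
            "query": text,
            "fields": fields,
            "type": "most_fields",
            "operator": "AND",
        }
    }


# Field/boost configuration matching the slide's example.
GERMAN_SITE_FIELDS = [
    ("body", 3),
    ("body.standard", 3),
    ("body.english", 2),
    ("body.french", 1),
    ("body.italian", 1),
]

PRODUCT_MANAGER_QUERY = build_multi_match("product manager", GERMAN_SITE_FIELDS)
```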
SITEMAPS
Percolators and aggregations

SITEMAP BY JOB TITLES
•  Industry is information you cannot easily find in structured documents
•  Only a few websites explicitly show job titles and industry
•  What if we build a taxonomy of job titles/industries represented by queries?
•  That would allow enriching documents at index time by means of percolators

JOURNEY OF A JOB DOCUMENT
[Diagram: CRAWLER → (CRAWL) → CRAWLED JOB DOCUMENT → (ENRICH, via ELASTICSEARCH .PERCOLATOR) → ENRICHED JOB DOCUMENT → (INDEX) → JOBS → SITEMAP → WEBSITE]

PERCOLATOR EXAMPLE
    {
      "query": {
        "filtered": {
          "query": {
            "bool": {
              "must": {
                "multi_match": {
                  "query": "Account Director",
                  "fields": [
                    "headline^2",
                    "headline.standard^1",
                    "body^1",
                    "body.standard^1",
                    "company_name^1"
                  ],
                  "type": "most_fields",
                  "operator": "AND"
                }
              }
            }
          },
          "filter": { "bool": { … } }
        }
      },
      "jobtitle": "Account Director",
      "sector": "Sales"
    }
•  A percolator is a standard query (multi-match on the search fields)
•  Jobtitle and sector are attached to the query and indexed together with the document (nested)

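The enrichment step can be sketched in plain Python. The sketch assumes the Elasticsearch 1.x percolate response shape (a `matches` list of `{"_index", "_id"}` entries) and a hypothetical `taxonomy` lookup from percolator id to its `jobtitle`/`sector` metadata; the `classification` field name is also an assumption.

```python
def enrich_job(job, percolate_response, taxonomy):
    """Attach the jobtitle/sector of every matching percolator query to a job.

    `taxonomy` maps a percolator query id to its metadata, e.g.
    {"42": {"jobtitle": "Account Director", "sector": "Sales"}}.
    The `classification` field name is hypothetical.
    """
    enriched = dict(job)  # do not mutate the crawled document
    enriched["classification"] = [
        taxonomy[match["_id"]]
        for match in percolate_response.get("matches", [])
        if match["_id"] in taxonomy
    ]
    return enriched


example_job = {"headline": "Account Director, Milan", "body": "..."}
example_response = {"total": 1, "matches": [{"_index": "jobs_it", "_id": "42"}]}
example_taxonomy = {"42": {"jobtitle": "Account Director", "sector": "Sales"}}
example_enriched = enrich_job(example_job, example_response, example_taxonomy)
```

The enriched document is what would then be indexed, with `classification` mapped as a nested field.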
PROS AND CONS
•  Live document enrichment (+)
•  Job classification based on keywords (+)
•  Aggregate by industry and sub-aggregate by location (+)
•  Slower reindex time (-)
   •  Reindex all 10x slower
•  Aggregations are heavy (-)
   •  Caching required
   •  Inaccurate since the population is dynamic
•  Try to be consistent with your queries
   •  e.g., percolators do not support min_score, whereas queries do

WHAT’S NEXT
•  Sitemaps change frequently
   •  Job import and lifecycle cause link churn
•  Sitemaps are heavy
   •  Tons of job titles and locations
•  Google periodically crawls sitemaps
•  Google allows pushing sitemap changes
•  We do not want to push unstable changes

SITEMAP CHANGES
[Diagram labels: sitemap snapshots S1, S2, S3, … SX, SY; deltas Δ1, Δ2, … ΔX, ΔY; MEMORY; DAY 1 … DAY N; CHANGES DAY 1 … CHANGES DAY N; INDEX; ELASTICSEARCH; WEEK 1; WEEKLY ALIAS; CHANGES ANALYZER; AGGREGATE CHANGES; SITEMAP PUSH]

CONSIDERATIONS
•  Bucket aggregations can drop buckets with fewer than min_doc_count documents
•  Unfortunately there is no max_doc_count filter
•  A small and separate use-case allowed us to test ES 2.0
•  ES 2.0 provides pipeline aggregations (experimental)
•  Pipeline aggregations work on the outputs produced by other aggregations rather than on document sets
•  After the pipeline aggregation we query the changes index again to get only the documents we need
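A missing `max_doc_count` can be emulated in ES 2.0 with a `bucket_selector` pipeline aggregation nested under a `terms` aggregation. The slides do not say which pipeline aggregation was used, so treat this as one possible approach; the field name is hypothetical, and running the inline script depends on the cluster's scripting settings.

```python
def build_changes_aggregation(url_field, min_changes, max_changes):
    """Build a search body that keeps buckets with min_changes <= doc_count <= max_changes.

    `url_field` is a hypothetical field of the changes index.
    """
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "changes_per_url": {
                # Lower bound: supported natively by the terms aggregation.
                "terms": {"field": url_field, "min_doc_count": min_changes},
                "aggs": {
                    # Upper bound: emulated with a pipeline aggregation that
                    # filters the parent's buckets on their document count.
                    "at_most_max": {
                        "bucket_selector": {
                            "buckets_path": {"count": "_count"},
                            "script": "count <= %d" % max_changes,
                        }
                    }
                },
            }
        },
    }
```

The surviving bucket keys can then be used in a second query against the changes index to fetch the documents themselves.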

LOCATIONS
Implementing hierarchies by means of path analysers

UNITED KINGDOM
  SCOTLAND, ENGLAND, WALES, NORTHERN IRELAND
    NORTH EAST ENGLAND, NORTH WEST ENGLAND, SOUTH EAST ENGLAND, SOUTH WEST ENGLAND, …
      BERKSHIRE, SURREY, LONDON, KENT, …
        NORTH EAST LONDON, NORTH LONDON, WEST LONDON, SOUTH EAST LONDON, …
          LONDON (GREENWICH), LONDON (LEWISHAM), LONDON (BROMLEY), LONDON (SOUTHWARK), …
            LONDON (CHARLTON), LONDON (WOOLWICH), LONDON (ELTHAM), LONDON (MAZE HILL), …
/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON/SOUTH EAST LONDON/LONDON (GREENWICH)/LONDON (WOOLWICH)
/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON/SOUTH EAST LONDON/LONDON (GREENWICH)/LONDON (CHARLTON)

LOCATION MAPPING
LOCATION
•  CANONICAL NAME: LONDON
•  CANONICAL PATH: /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON
•  GEO COORDINATES: POINT (-0.130714000141, 51.498555)
•  LOCATION DEPTH: 3
•  ORGANIC PATH: /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON
•  SEARCH PATH: {…}
•  SPECIAL PATH: /LONDON
•  SYNONYMS: LONDRES, LONDRA, GREATER LONDON, SE1 1PP, EC2A 4JU, …
•  WEAK SYNONYMS: []

LOCATION SEARCH AND INDEXING
[Diagram labels: WHERE, CRAWLED LOCATION, CRAWLED TEXT; LOCATIONS; JOB; QUERY; SYNONYMS + WEAK SYNONYMS, SYNONYMS + WEAK SYNONYMS, SYNONYMS; SEARCH PATH + GEO COORDINATES; CANONICAL PATH + GEO COORDINATES; PATH HIERARCHY ANALYZER; LOCATION DOMAIN (e.g., LONDON); PATH DOMAIN (e.g., /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON); SEARCH, SEARCH, INDEX]
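To see why a path hierarchy analyzer fits location search, the small function below emulates what Elasticsearch's `path_hierarchy` tokenizer emits for an absolute path: one token per ancestor. It is a plain-Python illustration of that behaviour, not the tokenizer itself.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every ancestor prefix of an absolute path, shortest first."""
    parts = [part for part in path.split(delimiter) if part]
    tokens = []
    for depth in range(1, len(parts) + 1):
        tokens.append(delimiter + delimiter.join(parts[:depth]))
    return tokens


LONDON_TOKENS = path_hierarchy_tokens(
    "/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON"
)
```

Because a job indexed under a deeper path also carries the token for every ancestor, a single term lookup on `/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON` matches all jobs anywhere below London.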

LOCATION SEARCH
    "or": {
      "filters": [
        {
          "terms": {
            "location": [
              "/hartsville, sc",
              "/united states/southern united states/south atlantic/south carolina/darlington county, sc/hartsville, sc"
            ],
            "_cache": false
          }
        },
        {
          "geo_shape": {
            "geo_coordinates": {
              "indexed_shape": {
                "id": "443498",
                "type": "location",
                "index": "us_geo_shapes",
                "path": "geo_coordinates"
              },
              "relation": "within"
            },
            "_cache": false
          }
        }
      ],
      "_cache": true,
      "_cache_key": "443498"
    }
•  We search locations by path and by coordinates
•  Caching is performed only on the or filter (the sub-filters always depend on it, so we can avoid caching them)
•  Cache keys allow saving memory
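A parameterised version of the filter could be built per location as sketched below. The helper name and its arguments are hypothetical; the structure mirrors the Elasticsearch 1.x filter on the slide, with only the outer `or` cached under the location id.

```python
def build_location_filter(location_id, search_paths, shape_index):
    """Build the path-or-shape location filter for one location (ES 1.x filter syntax)."""
    return {
        "or": {
            "filters": [
                # Match by hierarchical path; not cached on its own.
                {"terms": {"location": list(search_paths), "_cache": False}},
                # Match by pre-indexed geo shape; not cached on its own.
                {
                    "geo_shape": {
                        "geo_coordinates": {
                            "indexed_shape": {
                                "id": location_id,
                                "type": "location",
                                "index": shape_index,
                                "path": "geo_coordinates",
                            },
                            "relation": "within",
                        },
                        "_cache": False,
                    }
                },
            ],
            # Only the combined result is cached, keyed by the short location id.
            "_cache": True,
            "_cache_key": location_id,
        }
    }
```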

TARGETING USERS
Increasing customer delivery by searching jobseekers

WHAT IS JOBSEEKER DELIVERY
•  Customers partner with Jobrapido to get
   •  Applications (CVs)
   •  Traffic
•  How do we provide qualified candidates on demand?
•  We should notify only relevant jobseekers
   •  Interested
   •  Active
   •  No pressure
•  We want to maximize the chance of delivering the right candidate

ELASTIC TARGETING
[Diagram labels: SEARCHES, JOBSEEKERS DB, JOBSEEKERS INDEX, JOB, KEYWORD EXTRACTION, KEYWORDS, EMAIL PROVIDER, INDEX, VISIT, EMAIL CAMPAIGN, CONTENT]

HOW DO WE SEARCH JOBSEEKERS
•  Full-text search
   •  Search keywords on the jobseeker’s saved searches
   •  Apply the same mapping as the pure search scenario (be consistent with the user search experience)
•  Synonyms
   •  Apply synonyms to increase matching rate
•  Fuzziness
   •  Users often misspell words (and sometimes advertisers do)
•  Aggregation
   •  Give more weight to jobseekers with more than one search matching
•  First test with Elasticsearch 2.1.1
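These ingredients might combine into a single search body as sketched below. It assumes one document per saved search carrying a `jobseeker_id` field (both hypothetical), uses `fuzziness` on `multi_match` to tolerate misspellings, and relies on the `terms` aggregation's default ordering by document count so jobseekers with more matching searches come first. Synonyms would live in the field analyzers and are not shown.

```python
def build_jobseeker_search(keywords, search_fields,
                           jobseeker_field="jobseeker_id",
                           min_matching_searches=1, max_jobseekers=100):
    """Build a search over saved searches, grouped by jobseeker.

    Field names and default sizes are assumptions made for this sketch.
    """
    return {
        "size": 0,  # only the per-jobseeker buckets are needed
        "query": {
            "multi_match": {
                "query": keywords,
                "fields": list(search_fields),
                "type": "most_fields",
                "operator": "AND",
                "fuzziness": "AUTO",  # tolerate misspelled words
            }
        },
        "aggs": {
            "jobseekers": {
                # Buckets are ordered by doc_count descending by default, so
                # jobseekers with several matching searches rank first.
                "terms": {
                    "field": jobseeker_field,
                    "min_doc_count": min_matching_searches,
                    "size": max_jobseekers,
                }
            }
        },
    }
```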

CONCLUSIONS
•  Jobrapido covers 58 countries and 18 languages
•  Percolators and aggregations allow for document enrichment and dynamic sitemap creation
•  Pipeline aggregations ease the push of significant sitemap changes to Google
•  Path hierarchies cleverly represent the location structure
•  We search not only jobs but also jobseekers
•  Index and search like a pro :)
•  Documentation: https://www.elastic.co/guide/index.html
•  Training: thanks Luca and Karel :)
•  Book: Elasticsearch – The Definitive Guide
•  Support: Jobrapido is a proud platinum customer (production and development advice) – thanks Antonio :)
•  Thanks to Michele Solazzo and Kiratech for their commercial support

WE ARE HIRING!
•  Back-End Engineer
•  Tracking Engineer
•  Full-Stack Engineer
•  ETL Engineer
•  Search Specialist – German speaker
•  Job aggregation Product Manager
•  http://corporate.jobrapido.com (Careers)

FAST AND RELEVANT JOBS WITH ELASTICSEARCH
CURIOUS? PLEASE ASK! OR JUST VISIT www.jobrapido.com
