Slide 1

Vector scoring for term embeddings in Elasticsearch 9 October 2019 Matt Williams Search Engineer, Cookpad @voxmjw

Slide 2

TL;DR As an experiment, we introduced a term embedding model (a machine learning technique) to substitute for our hand-curated search dictionary (synonyms ‘n’ stuff: eggplant = aubergine) for automated query expansion in our recipe search, to see how far we could go with a completely automated approach. We didn’t know if it would work!

Slide 3

Elasticsearch

Slide 6

Elasticsearch Cookpad 1997 Launched in Japan 2003 One million users in Japan 2013 Launched outside Japan 2017 Global HQ launched in Bristol 2019... 2014 Elasticsearch 100M users (60M in Japan) 5M recipes 75 countries 28 languages [Icon made by Smashicons from www.flaticon.com] the world’s most popular recipe search engine!

Slide 7

Query expansion for recipe search
[Diagram: user query “roast courgettes” → Elasticsearch text search with terms “roast courgettes” → hits include “Roast Courgettes with Spring Veg”, but miss “Spicy Roast Zucchini” ❌]
[Roast courgettes with spring veg | Cookpad | Chris Jacobs] [Spicy Roast Zucchini | Cookpad | Gaia Riva]

Slide 8

Query expansion for recipe search
[Diagram: user query “roast courgettes” → query expansion → expanded terms “roast courgettes zucchini” → Elasticsearch text search → hits include both “Roast Courgettes with Spring Veg” and “Spicy Roast Zucchini” ✅]
[Roast courgettes with spring veg | Cookpad | Chris Jacobs] [Spicy Roast Zucchini | Cookpad | Gaia Riva]

Slide 9

Query expansion: Search dictionary
An admin-curated knowledge graph of dishes, ingredients, skills, and other cooking-related concepts
● Used in query expansion to find synonyms, hyponyms/hypernyms, and other related terms
● Improves recall
● Also encodes some cultural rules for some countries; e.g., only suggest pork if explicitly searched for
Example entries: Zucchini [Ingr]: courgette, zucchini. Vegetable [Dish]: vegetable, veggie. Eggplant [Ingr]: aubergine, eggplant. Baba Ghanoush [Dish, Ingr]: baba ghanoush. Dip [Dish]: dip, dipping sauce.
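In code, this kind of dictionary lookup can be sketched as follows. This is a minimal illustration using the example entries above; the data structure and function names are hypothetical, not Cookpad's actual implementation.

```python
# Toy search dictionary: concept -> (labels, terms). Illustrative only.
SEARCH_DICTIONARY = {
    "Zucchini": (["Ingr"], ["courgette", "zucchini"]),
    "Vegetable": (["Dish"], ["vegetable", "veggie"]),
    "Eggplant": (["Ingr"], ["aubergine", "eggplant"]),
    "Baba Ghanoush": (["Dish", "Ingr"], ["baba ghanoush"]),
    "Dip": (["Dish"], ["dip", "dipping sauce"]),
}

def expand_term(term):
    """Return the term plus any synonyms from concepts that contain it."""
    expanded = {term}
    for _labels, terms in SEARCH_DICTIONARY.values():
        if term in terms:
            expanded.update(terms)
    return sorted(expanded)

print(expand_term("courgette"))  # ['courgette', 'zucchini']
```

The expanded term list is then what gets sent to Elasticsearch as the text-search terms, improving recall for synonyms the user didn't type.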

Slide 10

Where do “vectors” and “term embeddings” fit in all this?

Slide 11

First, a word game… Fill in the blank... ...internationally acclaimed best seller edition

Slide 12

Fill in the blank (internationally acclaimed best seller edition) “There are no shortcuts—everything is reps, reps, reps ” ? - Arnie From Total Recall: My Unbelievably True Life Story by Arnold Schwarzenegger B: naps D: jazz A: reps C: steps

Slide 13

Fill in the blank (internationally acclaimed best seller edition) “I kept my eyes down on the reading list the teacher had given me. It was fairly basic: Bronte, Shakespeare, Chaucer, and Faulkner. I’d already read everything.” ? - Bella From Twilight by Stephenie Meyer B: Shakespeare D: Xylophone A: Carrots C: Terrance

Slide 14

Fill in the blank (internationally acclaimed best seller edition) “Why are breakfast foods breakfast foods… Like, why don’t we have curry for breakfast?” ? - Hazel From The Fault in Our Stars by John Green B: Curry D: Jerry A: Shoes C: Cereal

Slide 15

Fill in the blank (internationally acclaimed best seller edition) “My inner goddess is doing the merengue with some salsa moves.” ? - Anastasia From Fifty Shades of Grey by E. L. James B: Banana D: Merenge A: Macarena C: Gangagam

Slide 16

Term embeddings (also known as word embeddings)
then drizzle the oil over the tomatoes ✅✅
then drizzle the lemon over the tomatoes ✅
then drizzle the chorizo over the tomatoes ❌
then drizzle the pasta over the tomatoes ❌❌
then drizzle the knife over the tomatoes ❌❌❌

Slide 17

● A self-supervised machine learning (ML) technique ● Model learns to discriminate words based on their context ● Output: n-dimensional vector for each term ● n is a constant, usually between 50 and 200 Term embeddings (also known as word embeddings) [https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/ | Nathan Rooy | 2018]
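To make the "words are characterised by their contexts" idea concrete, here is a toy sketch that builds context-count vectors from a tiny made-up corpus. Real models (word2vec, FastText, GloVe, ...) learn dense n-dimensional vectors by training a discriminative model rather than counting raw co-occurrences, but the intuition is the same: words that appear in similar contexts end up with similar vectors.

```python
from collections import defaultdict

# Tiny illustrative corpus (not real training data).
corpus = [
    "drizzle the oil over the tomatoes".split(),
    "drizzle the lemon over the tomatoes".split(),
    "slice the zucchini and the courgette".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-2 word window; each word's row is a
# (sparse) context vector. Real embedding models compress this signal
# into a dense vector of n dimensions (n typically 50-200).
vectors = defaultdict(lambda: [0] * len(vocab))
window = 2
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                vectors[w][index[sent[j]]] += 1
```

Because "oil" and "lemon" occur in identical contexts in this corpus, their context vectors come out identical, while "tomatoes" gets a different vector.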

Slide 18

Term embedding space
[Diagram: zucchini, courgette, eggplant, aubergine, baba ghanoush, dip, dipping sauce plotted as points]
Shown in two dimensions for visualisation. Usually vectors have many dimensions, e.g., 50 to 200.
The distance between two terms gives us an idea of how related those terms are.

Slide 19

Query expansion: Search dict vs term embeddings
Search dictionary: related terms are synonyms from a dictionary lookup, and/or neighbours in the knowledge graph. (Zucchini [Ingr]: courgette, zucchini. Vegetable [Dish]: vegetable, veggie. Eggplant [Ingr]: aubergine, eggplant. Baba Ghanoush [Dish, Ingr]: baba ghanoush. Dip [Dish]: dip, dipping sauce.)
Term embedding space: related terms are “nearest neighbours”, i.e., vectors that are geometrically close in the embedding space. (Shown in two dimensions for visualisation purposes.)
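The "nearest neighbours" idea can be sketched with cosine similarity over hand-made 2-D vectors. All names and numbers here are illustrative; real embeddings have 50-200 dimensions and come from a trained model.

```python
import math

# Hypothetical 2-D embeddings, chosen so related terms sit close together.
embeddings = {
    "zucchini":  [0.9, 0.1],
    "courgette": [0.85, 0.15],
    "aubergine": [0.2, 0.8],
    "dip":       [-0.7, 0.3],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbours(term, k=2):
    """Rank every other term by cosine similarity to `term`; keep the top k."""
    query = embeddings[term]
    others = [t for t in embeddings if t != term]
    return sorted(others, key=lambda t: cosine(query, embeddings[t]), reverse=True)[:k]

print(nearest_neighbours("zucchini", k=1))  # ['courgette']
```

In query expansion, the top-k neighbours of each query term would be appended to the expanded term list, playing the role the dictionary synonyms play today.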

Slide 20

Experiment: What if we replace the search dictionary with term embeddings? Let’s try it out in our English-language recipe search...

Slide 21

Embedding-driven search
[Diagram: Recipes → train a text embedding model (ML frameworks: TensorFlow, PyTorch, KubeFlow, Apache Spark, ...; word embedding models: FastText, word2vec, GloVe, BERT, ELMo, ...; optionally starting from a pre-trained model) → term embeddings → Recipe Search = text search + nearest neighbour search (Elasticsearch 7.3+)]

Slide 22

Nearest neighbour search
The task: For a given term t, find the k nearest term vectors within a distance of d.
Exact nearest neighbour tools:
● Dense vectors (Elasticsearch 7.3+)
● Fast cosine similarity plugin (Elasticsearch 6.4+) [Github: cookpad/fast-cosine-similarity] [Github: StaySense/fast-cosine-similarity]
Exact computation => scaling limit.
Approximate nearest neighbour (ANN) tools:
● Annoy (Spotify)
● Faiss (Facebook)
Approximate computation => scalable to large numbers of embeddings. No ANN features or plugins for Elasticsearch (yet?!)
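A brute-force sketch of the exact task stated above: the k nearest vectors to a query, restricted to those within distance d. The data is made up; the point is that exact NN must score every vector, so cost grows linearly with vocabulary size, which is the scaling limit mentioned above.

```python
import math

def knn_within(vectors, query, k, d):
    """Exact (brute-force) NN: the k vectors nearest to `query` among those
    within Euclidean distance d. O(N * dims) per query - every vector is scored."""
    scored = []
    for term, vec in vectors.items():
        dist = math.dist(query, vec)
        if dist <= d:
            scored.append((dist, term))
    return [term for _dist, term in sorted(scored)[:k]]

# Hypothetical term vectors for illustration.
vectors = {"courgette": [1.0, 0.0], "zucchini": [0.9, 0.1], "dip": [-1.0, 0.0]}
print(knn_within(vectors, query=[1.0, 0.0], k=2, d=0.5))  # ['courgette', 'zucchini']
```

ANN libraries like Annoy and Faiss avoid the full scan by building an index (random projection trees, quantisation, etc.), trading a little accuracy for much better scaling.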

Slide 23

Exact nearest neighbours in ES 7.3+
● Create index mapping
● Add vector document
● Search for nearby vectors
Note: Python Elasticsearch client code. See: https://github.com/mattjw/elasticsearch-nn-benchmarks
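The three steps above can be sketched as request bodies, based on the Elasticsearch 7.3 dense_vector documentation. The index and field names are illustrative, and the client calls are shown but not executed here; see the repository above for working benchmark code.

```python
# Request bodies for exact NN search with dense vectors in Elasticsearch 7.3+,
# for use with the official Python client. Index/field names are illustrative.

DIMS = 100

# 1. Create index mapping: a keyword field for the term, a dense_vector field
#    for its embedding.
mapping = {
    "mappings": {
        "properties": {
            "term": {"type": "keyword"},
            "embedding": {"type": "dense_vector", "dims": DIMS},
        }
    }
}

# 2. Add a vector document.
doc = {"term": "courgette", "embedding": [0.1] * DIMS}

# 3. Search for nearby vectors. script_score scans every document (exact,
#    not approximate); +1.0 keeps scores non-negative since cosine similarity
#    is in [-1, 1]. (In 7.3 the script references the field as doc['embedding'];
#    later 7.x releases also accept the field name as a plain string.)
def nn_query(query_vector, k):
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, doc['embedding']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }

# Usage with the client (not executed here):
# es.indices.create(index="terms", body=mapping)
# es.index(index="terms", body=doc)
# es.search(index="terms", body=nn_query([0.2] * DIMS, k=10))
```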

Slide 24

Nearest neighbour search performance
● Exact NN algorithms:
○ dense: Elasticsearch dense vector (7.3)
○ fcs: fast cosine similarity
● Note: single replica, single shard
● Performance OK for a typical vocabulary size (10k-100k terms): ~10ms. Our recipes experiment: 30k term embeddings.
● Not so good for 100k+
○ Scale the index: sharding / replication
○ Does this rule out per-recipe embeddings?
○ Good for re-ranking a results shortlist
[Github: mattjw/elasticsearch-nn-benchmarks]

Slide 25

Search dictionary vs term embeddings
● Construction: search dictionary = human effort; term embeddings = machine effort (train an ML model)
● Control and oversight: search dictionary = human-in-the-loop; term embeddings = automated (to change expansion behaviour, you need to tune the ML model)
● Explainability: search dictionary = strong; term embeddings = weak
● Prerequisites: search dictionary = dictionary administrators (Community Managers with regional cooking knowledge, supported by Search Engineers); term embeddings = a large corpus

Slide 26

When can you use DL/RL for search? A minimum requirement for deep learning / representation learning: how many docs do you have to train your models on?
● Learning word representations: 1k - 10k docs
● Learning document representations: 1k - 10k docs
● Text generation: 10k - 100k docs
● Machine translation: 10k - 100k docs
● Learning image representations: 10k - 100k docs
[Deep Learning for Search | Tommaso Teofili | 2019]

Slide 27

Summary
● Dense vectors allow us to do nearest neighbour searches in Elasticsearch. Note: it’s currently an experimental feature; we hope it will go GA.
● Can we scale NN search to many more documents? E.g., for recipe embeddings, a related-recipes feature?
○ Scale the cluster?
○ Implement an approximate NN algorithm in Elastic? ⏱
○ Use an external database (Annoy, Faiss)?
● Can a pure ML approach beat a curated search dictionary? In some cases! A lot more work is needed for a firm conclusion. A better approach: combine the two?
● Still unclear what factors affect how easy it is to beat the search dictionary. More investigation needed.

Slide 28

Further reading
● AI Powered Search | Trey Grainger | 2020 (early access avail.)
● Deep Learning for Search | Tommaso Teofili | 2019
● Using Word Embeddings for Automatic Query Expansion | Roy, Paul, Mitra, Garain | 2016 | SIGIR Workshop on NIR
We’re hiring! https://cookpad.workable.com
● Senior Search Engineer
● Machine Learning Infrastructure Engineer
● Senior Site Reliability Engineer
● Machine Learning Researcher
● ...and more!
Thanks for listening! Any questions?
