Vector scoring for term embeddings in Elasticsearch

Talk given at the October 2019 meetup of the South West Elasticsearch Community. Bristol, UK. 9 Oct 2019. https://www.meetup.com/South-West-Elastic-Fantastics/events/263778609/

Matt J Williams

October 09, 2019
Transcript

  1. Vector scoring for term embeddings
    in Elasticsearch
    9 October 2019
    Matt Williams
    Search Engineer, Cookpad
    @voxmjw


  2. TL;DR
    As an experiment, we replaced our hand-curated
    search dictionary with a term embedding model for
    automated query expansion in our recipe search, to
    see how far we could go with a completely
    automated approach.
    we didn’t know if it would work 🤷‍♂️
    a machine learning technique 🤷‍♂️
    synonyms ‘n’ stuff
    e.g., eggplant = aubergine


  3. Elasticsearch


  4. Elasticsearch Cookpad
    1997 Launched in Japan
    2003 One million users in Japan
    2013 Launched outside Japan
    2017 Global HQ launched in Bristol
    2019...
    2014 Elasticsearch


  5. Elasticsearch Cookpad
    1997 Launched in Japan
    2003 One million users in Japan
    2013 Launched outside Japan
    2017 Global HQ launched in Bristol
    2019...
    2014 Elasticsearch
    [Icon made by Smashicons from www.flaticon.com]
    the world’s most
    popular recipe
    search engine!


  6. Elasticsearch Cookpad
    1997 Launched in Japan
    2003 One million users in Japan
    2013 Launched outside Japan
    2017 Global HQ launched in Bristol
    2019...
    2014 Elasticsearch
    100M users (60M in Japan)
    5M recipes
    75 countries
    28 languages
    [Icon made by Smashicons from www.flaticon.com]
    the world’s most
    popular recipe
    search engine!


  7. Query expansion for recipe search
    [Roast courgettes with spring veg | Cookpad | Chris Jacobs] [Spicy Roast Zucchini | Cookpad | Gaia Riva]
    User query: “roast courgettes”
    Terms: roast courgettes
    Elasticsearch text search
    Hits: “Roast Courgettes with Spring Veg”, “Spicy Roast Zucchini”, ...


  8. Query expansion for recipe search
    [Roast courgettes with spring veg | Cookpad | Chris Jacobs] [Spicy Roast Zucchini | Cookpad | Gaia Riva]
    User query: “roast courgettes”
    Terms: roast courgettes
    Query expansion
    Expanded terms: roast courgettes zucchini
    Elasticsearch text search
    Hits: “Roast Courgettes with Spring Veg”, “Spicy Roast Zucchini”, ...


  9. Query expansion: Search dictionary
    Zucchini [Ingr]
    courgette
    zucchini
    Vegetable [Dish]
    vegetable
    veggie
    Eggplant [Ingr]
    aubergine
    eggplant
    Baba Ghanoush
    [Dish, Ingr]
    baba ghanoush
    Dip [Dish]
    dip
    dipping sauce
    An admin-curated knowledge graph of
    dishes, ingredients, skills, and other
    cooking-related concepts
    ● Used in query expansion to find
    synonyms, hyponyms/hypernyms, and
    other related terms
    ● Improves recall
    ● Also encodes some cultural rules for
    some countries; e.g., only suggest pork
    if explicitly searched for
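    The dictionary's role in expansion can be sketched as a plain mapping from concepts to their surface forms. This is an illustrative toy, not Cookpad's actual implementation; the concept and term names are taken from the slide, and everything else is assumed:

    ```python
    # Illustrative sketch of dictionary-based query expansion.
    # A concept maps to its surface forms; a query term is expanded to
    # every term that shares a concept with it.

    SEARCH_DICTIONARY = {
        "Zucchini": ["courgette", "zucchini"],
        "Eggplant": ["aubergine", "eggplant"],
        "Dip": ["dip", "dipping sauce"],
    }

    def expand(term):
        """Return the query term plus all synonyms from matching concepts."""
        expanded = {term}
        for synonyms in SEARCH_DICTIONARY.values():
            if term in synonyms:
                expanded.update(synonyms)
        return sorted(expanded)

    print(expand("courgette"))  # → ['courgette', 'zucchini']
    ```

    A real system layers more on top of this (hypernym/hyponym traversal in the knowledge graph, the per-country cultural rules mentioned above), but the core lookup is this simple.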


  10. Where do “vectors” and “term
    embeddings” fit in all this?


  11. First, a word game…
    Fill in the blank...
    ...internationally acclaimed best seller edition


  12. Fill in the blank (internationally acclaimed best seller edition)
    “There are no shortcuts—everything is reps, reps, reps ”
    ?
    - Arnie
    From Total Recall: My Unbelievably True Life Story
    by Arnold Schwarzenegger
    A: reps
    B: naps
    C: steps
    D: jazz


  13. Fill in the blank (internationally acclaimed best seller edition)
    “I kept my eyes down on the reading list the teacher
    had given me. It was fairly basic: Bronte, Shakespeare,
    Chaucer, and Faulkner. I’d already read everything.”
    ?
    - Bella
    From Twilight by Stephenie Meyer
    A: Carrots
    B: Shakespeare
    C: Terrance
    D: Xylophone


  14. Fill in the blank (internationally acclaimed best seller edition)
    “Why are breakfast foods breakfast foods… Like, why don’t we
    have curry for breakfast?”
    ?
    - Hazel
    From The Fault in Our Stars by John Green
    A: Shoes
    B: Curry
    C: Cereal
    D: Jerry


  15. Fill in the blank (internationally acclaimed best seller edition)
    “My inner goddess is doing the merengue with some salsa
    moves.”
    ?
    - Anastasia
    From Fifty Shades of Grey by E. L. James
    A: Macarena
    B: Banana
    C: Gangnam
    D: Merengue


  16. Term embeddings (also known as word embeddings)
    then drizzle the oil over the tomatoes ✅
    then drizzle the lemon over the tomatoes ✅
    then drizzle the chorizo over the tomatoes ❌
    then drizzle the pasta over the tomatoes ❌
    then drizzle the knife over the tomatoes ❌


  17. ● A self-supervised machine learning (ML)
    technique
    ● Model learns to discriminate words
    based on their context
    ● Output: n-dimensional vector for each
    term
    ● n is a constant, usually between 50
    and 200
    Term embeddings (also known as word embeddings)
    [https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/ | Nathan Rooy | 2018]


  18. Term embedding space
    zucchini, courgette, eggplant, aubergine,
    baba ghanoush, dip, dipping sauce
    Shown in two dimensions for visualisation;
    usually vectors have many dimensions, e.g., 50 to 200.
    Distance between two terms gives us an
    idea of how related those terms are.
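    A common way to measure that relatedness is cosine similarity between the two term vectors. A minimal illustration (the 3-d vectors are made up for the example; real embeddings have 50-200 dimensions):

    ```python
    import math

    def cosine_similarity(a, b):
        """Cosine of the angle between vectors a and b: 1.0 = same direction."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Toy vectors, not real embeddings.
    courgette = [0.90, 0.10, 0.20]
    zucchini = [0.85, 0.15, 0.25]
    knife = [0.05, 0.90, 0.10]

    # Related terms score close to 1.0; unrelated terms score much lower.
    print(cosine_similarity(courgette, zucchini) > cosine_similarity(courgette, knife))  # True
    ```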


  19. Query expansion: Search dict vs term embeddings
    Search dictionary
    Related terms: synonyms from dictionary look-up,
    and/or neighbours in the knowledge graph
    Zucchini [Ingr]: courgette, zucchini
    Vegetable [Dish]: vegetable, veggie
    Eggplant [Ingr]: aubergine, eggplant
    Baba Ghanoush [Dish, Ingr]: baba ghanoush
    Dip [Dish]: dip, dipping sauce
    Term embedding space
    Related terms: “nearest neighbours”, i.e. vectors
    that are geometrically close in the embedding space
    (zucchini, courgette, eggplant, aubergine,
    baba ghanoush, dip, dipping sauce;
    shown in two dimensions for visualisation purposes)


  20. Experiment: What if we replace the
    search dictionary with term
    embeddings?
    Let’s try it out in our English language recipe search...


  21. Embedding-driven search
    Pipeline: recipe text → train embedding model →
    term embeddings → recipe search (text search +
    nearest neighbour search; Elasticsearch 7.3+)
    ML frameworks: TensorFlow, PyTorch,
    KubeFlow, Apache Spark, ...
    Word embedding models: FastText,
    word2vec, GloVe, BERT, ELMo, ...
    Optionally, start from a pre-trained model.


  22. Nearest neighbour search
    The task: for a given term t, find the k nearest term
    vectors within a distance of d.
    Exact nearest neighbour tools:
    ● Dense vectors (Elasticsearch 7.3+)
    ● Fast cosine similarity plugin (Elasticsearch 6.4+)
    [Github: cookpad/fast-cosine-similarity] [Github: StaySense/fast-cosine-similarity]
    Exact computation => scaling limit.
    Approximate nearest neighbour (ANN) tools:
    ● Annoy (Spotify)
    ● Faiss (Facebook)
    Approximate computation => scalable to large numbers of
    embeddings.
    No ANN features or plugins for Elasticsearch (yet?!)
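    The task as stated can be written down as a brute-force sketch: this is exactly what the "exact" tools compute, and what ANN tools approximate for speed. The vectors and names below are illustrative toys:

    ```python
    import math

    def euclidean(a, b):
        """Euclidean distance between two equal-length vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def nearest_terms(query, embeddings, k, d):
        """For a query vector, return up to k (term, distance) pairs
        within distance d, nearest first."""
        candidates = [
            (term, euclidean(query, vec))
            for term, vec in embeddings.items()
            if euclidean(query, vec) <= d
        ]
        return sorted(candidates, key=lambda pair: pair[1])[:k]

    embeddings = {  # toy 2-d vectors, as in the slide's visualisation
        "courgette": [1.0, 1.0],
        "zucchini": [1.1, 0.9],
        "aubergine": [1.5, 1.2],
        "knife": [9.0, 9.0],
    }
    print(nearest_terms([1.0, 1.0], embeddings, k=2, d=2.0))
    ```

    This scan is O(vocabulary size) per query, which is why exact computation hits a scaling limit as the number of embeddings grows.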


  23. Exact nearest neighbours in ES 7.3+
    Create index mapping
    Add vector document
    Search for nearby vectors
    Note: Python Elasticsearch client code
    See: https://github.com/mattjw/elasticsearch-nn-benchmarks
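    The three steps on the slide might look roughly like this with the official Python client. This is a hedged sketch, not the talk's exact code: the index and field names ("terms", "embedding") are illustrative, and the `cosineSimilarity` script form shown is the 7.3-era syntax, which changed in later Elasticsearch releases:

    ```python
    # Sketch of exact nearest-neighbour search with Elasticsearch 7.3
    # dense vectors. Index/field names are illustrative assumptions.

    def term_mapping(dims):
        """Index mapping with a dense_vector field of the given dimensionality."""
        return {"mappings": {"properties": {
            "term": {"type": "keyword"},
            "embedding": {"type": "dense_vector", "dims": dims},
        }}}

    def knn_query(query_vector, k):
        """script_score query ranking documents by cosine similarity.

        cosineSimilarity returns values in [-1, 1]; adding 1.0 keeps
        scores non-negative, as Elasticsearch requires.
        """
        return {
            "size": k,
            "query": {"script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, doc['embedding']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }},
        }

    # Against a running cluster (not executed here):
    # from elasticsearch import Elasticsearch
    # es = Elasticsearch()
    # es.indices.create(index="terms", body=term_mapping(100))
    # es.index(index="terms", body={"term": "courgette", "embedding": [...]})
    # hits = es.search(index="terms", body=knn_query(some_vector, k=10))
    ```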


  24. Nearest neighbour search performance
    ● Exact NN algorithms:
    ○ dense: elasticsearch dense vector (7.3)
    ○ fcs: fast cosine similarity
    ● Note: Single replica, single shard
    ● Performance OK for typical vocabulary sizes
    (10k–100k terms): ~10 ms.
    Our recipes experiment: 30k term embeddings.
    ● Not so good for 100k+
    ○ Scale the index: sharding / replication
    ○ Does this rule out per-recipe
    embeddings?
    ○ Good for re-ranking a results shortlist
    Github: mattjw/elasticsearch-nn-benchmarks


  25. Search dictionary vs term embeddings
    ● Construction: human effort vs machine effort
    (train an ML model)
    ● Control and oversight: human-in-the-loop vs
    automated (to change expansion behaviour, you
    need to tune the ML model)
    ● Explainability: strong vs weak
    ● Prerequisites: dictionary administrators
    (Community Managers with regional cooking
    knowledge, supported by Search Engineers) vs
    a large corpus


  26. When can you use DL/RL for search?
    A minimum requirement for deep learning / representation learning:
    how many docs do you have to train your models on?
    ● Learning word representations: 1k–10k
    ● Learning document representations: 1k–10k
    ● Text generation: 10k–100k
    ● Machine translation: 10k–100k
    ● Learning image representations: 10k–100k
    [Deep Learning for Search | Tommaso Teofili | 2019]


  27. Summary
    ● Can we scale NN search to many more documents?
    E.g., for recipe embeddings (a related-recipes
    feature)?
    ○ Scale the cluster?
    ○ Implement an approximate NN algorithm in
    Elasticsearch? ⏱
    ○ Use an external database (Annoy, Faiss)?
    ● Dense vectors allow us to do nearest neighbour
    searches in Elasticsearch.
    Note: it’s currently an experimental feature. We
    hope it will go GA.
    ● Can a pure ML approach beat a curated search
    dictionary?
    In some cases!
    A lot more work is needed for a firm conclusion.
    A better approach: combine the two?
    ● It’s still unclear what factors affect how easy it is to
    beat the search dictionary.
    More investigation needed.


  28. Further reading
    ● AI Powered Search | Trey Grainger |
    2020 (early access avail.)
    ● Deep Learning for Search | Tommaso Teofili | 2019
    ● Using Word Embeddings for Automatic Query
    Expansion | Roy, Paul, Mitra, Garain | 2016 |
    SIGIR Workshop on Neural IR
    We’re hiring! https://cookpad.workable.com
    ● Senior Search Engineer
    ● Machine Learning Infrastructure Engineer
    ● Senior Site Reliability Engineer
    ● Machine Learning Researcher
    ● ...and more!
    Thanks for listening! Any questions?


  29. Vector scoring for term embeddings
    in Elasticsearch
    9 October 2019
    Matt Williams
    Search Engineer, Cookpad
    @voxmjw
