Vector search with Elastic

Abdon Pijpelink
September 06, 2022

Transcript

  1. Vector search with Elastic
    Arno van de Velde & Abdon Pijpelink
    Amsterdam Meetup - September 6th, 2022
  2. Agenda
    • Relevance in Elasticsearch
      ◦ a quick recap
    • Vector search
      ◦ what is it?
    • Vector search with Elasticsearch
      ◦ indexing dense vectors
      ◦ exact k-nearest neighbor (kNN) search
      ◦ approximate kNN (ANN) search
    • Demo
  3. Full text queries
    GET tweets/_search
    {
      "query": {
        "match": { "message": "elastic amsterdam" }
      }
    }

    "hits": [
      {
        "_score": 0.92848843,
        "_source": { "message": "Elastic #MeetUp ft. @elastic coming your way! 6th September 17:30, next to Amsterdam Central!" }
      },
      {
        "_score": 0.6133945,
        "_source": { "message": "I love visiting Amsterdam!" }
      },
      {
        "_score": 0.47697234,
        "_source": { "message": "Are you preparing for your Elastic Certified Engineer exam?" }
      }
    ]
  4. Score
    • Search results are ranked by score
      ◦ represents the relevance of a hit with respect to the query
    • Default scoring algorithm is BM25/BM25F
    • Three elements determine a document's score:
      ◦ TF (term frequency): the more a term appears in a field, the more important it is
      ◦ IDF (inverse document frequency): the more documents that contain the term, the less important the term is
      ◦ field length: shorter fields are more likely to be relevant than longer fields
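The three elements above can be sketched with the textbook form of BM25 for a single term. This is a simplified illustration, not Lucene's exact implementation; the function name and arguments are hypothetical, and k1 = 1.2 and b = 0.75 are the commonly used defaults.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field (illustrative only)."""
    # IDF: the more documents contain the term, the smaller this factor.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Field length normalization: fields longer than average are penalized.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # TF: grows with term frequency, but saturates.
    return idf * tf * (k1 + 1) / (tf + length_norm)

rare_term = bm25_term_score(tf=1, doc_len=10, avg_doc_len=10, n_docs=1000, doc_freq=5)
common_term = bm25_term_score(tf=1, doc_len=10, avg_doc_len=10, n_docs=1000, doc_freq=500)
short_field = bm25_term_score(tf=1, doc_len=5, avg_doc_len=10, n_docs=1000, doc_freq=5)
long_field = bm25_term_score(tf=1, doc_len=20, avg_doc_len=10, n_docs=1000, doc_freq=5)
```

With these inputs the rare term outscores the common one, and the short field outscores the long one, matching the bullets on the slide.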
  5. TF and IDF
    • TF: the more a term appears in a field, the more important it is
    • IDF: the more documents contain the term, the less important the term is
  6. Vector search: beyond full text queries
    • Semantic search using natural language processing (NLP)
    • Image search
    • Audio search
    • Finding similar users or products
      ◦ recommendation engines
      ◦ personalization (ads, etc.)
  7. Comparing vectors
    • What makes vectors similar or dissimilar?
    • Supported similarity measures in Elasticsearch:
      ◦ cosine similarity
      ◦ dot product
      ◦ Manhattan distance (L1 norm)
      ◦ Euclidean distance (L2 norm)
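The four measures listed above can be written out in a few lines of plain Python. This is a minimal sketch for intuition only; the function names are illustrative and are not the Elasticsearch script functions.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Angle-based: 1.0 for the same direction, 0.0 for orthogonal vectors.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def l1_distance(a, b):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    # Euclidean distance: straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

same_direction = cosine_similarity([3, 4], [6, 8])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Note that cosine similarity and dot product grow with similarity, while the two distances shrink.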
  8. Encode your data as vectors
    Using deep learning models for relevance ranking
    [Diagram: text passes through a natural language processing model and an image through a convolutional neural network, producing embeddings / feature vectors (a1, a2, …, an), for example 0.0167327…, 0.3458967…, 0.0547893…]
  9. Configure index for vector search
    • Map a field as type dense_vector
    • Introduced in Elasticsearch 7
    • Configure the number of dimensions
    • Field does not need to be indexed

    PUT product-index
    {
      "mappings": {
        "properties": {
          "product-vector": {
            "type": "dense_vector",
            "dims": 5,
            "index": false
          }
        }
      }
    }
  10. Index data
    POST product-index/_bulk
    { "index": { "_id": "1" } }
    { "product-vector": [230.0, 300.33, -34.8988, 15.555, -200.0] }
    { "index": { "_id": "2" } }
    { "product-vector": [-0.5, 100.0, -13.0, 14.8, -156.0] }
    { "index": { "_id": "3" } }
    { "product-vector": [0.5, 111.3, -13.0, 14.8, -156.0] }
  11. Query the data using a script score query
    POST product-index/_search
    {
      "query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.queryVector, 'product-vector') + 1.0",
            "params": {
              "queryVector": [-0.5, 90.0, -10, 14.8, -156.0]
            }
          }
        }
      }
    }
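To see what the script score query computes, here is a brute-force sketch in plain Python over the three product vectors indexed on the previous slide. It mirrors the script's `cosineSimilarity(...) + 1.0` expression; the `+ 1.0` keeps scores non-negative, since cosine similarity ranges from -1 to 1. The helper names are hypothetical.

```python
import math

docs = {
    "1": [230.0, 300.33, -34.8988, 15.555, -200.0],
    "2": [-0.5, 100.0, -13.0, 14.8, -156.0],
    "3": [0.5, 111.3, -13.0, 14.8, -156.0],
}
query_vector = [-0.5, 90.0, -10, 14.8, -156.0]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, documents, k):
    # Exact kNN: score every document, then keep the k best.
    scored = [(doc_id, cosine_similarity(query, vec) + 1.0)
              for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

ranking = exact_knn(query_vector, docs, k=3)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

Every document is scored, which is why this approach is exact but becomes expensive as the index grows.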
  12. Approximate kNN search (ANN)
    • Exact kNN does not scale to large datasets
      ◦ script score query needs to evaluate all documents
      ◦ cost is on the order of (size of dataset) × (number of vector dimensions)
    • Reduce computational complexity with approximate kNN search
      ◦ introduced in Elasticsearch 8
      ◦ retrieve a "good guess" of the k nearest neighbors
      ◦ sacrifice a little accuracy for large performance gains
      ◦ better performance on very large datasets:

    Approach      Queries per second   Recall (k=10)
    Exact         5.257                1.000
    Approximate   849.286              0.945
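The recall column can be read as the fraction of the true k nearest neighbors that the approximate search also returned. A minimal sketch, assuming both searches return ranked lists of document IDs (the function name and the example IDs are hypothetical):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / k

# Hypothetical result lists: the approximate search misses one true neighbor.
exact_result = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
approx_result = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]
recall = recall_at_k(approx_result, exact_result, k=10)
```

Averaged over many queries, a recall of 0.945 at k=10 means roughly 9.45 of the 10 true neighbors are found per query.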
  13. Hierarchical Navigable Small Worlds (HNSW)
    • Native ANN in Lucene 9
    • Performs very well on large datasets
  14. Configure the index
    • Use the dense_vector type
      ◦ set index to true
      ◦ configure similarity:
        - l2_norm
        - dot_product
        - cosine

    PUT image-index
    {
      "mappings": {
        "properties": {
          "image-vector": {
            "type": "dense_vector",
            "dims": 3,
            "index": true,
            "similarity": "l2_norm"
          },
          "title": { "type": "text" },
          "file-type": { "type": "keyword" }
        }
      }
    }
  15. Index the data
    POST image-index/_bulk
    { "index": { "_id": "1" } }
    { "image-vector": [1, 5, -20], "title": "moose family", "file-type": "jpg" }
    { "index": { "_id": "2" } }
    { "image-vector": [42, 8, -15], "title": "alpine lake", "file-type": "png" }
    { "index": { "_id": "3" } }
    { "image-vector": [15, 11, 23], "title": "full moon", "file-type": "jpg" }
  16. Query the data
    • Use _search endpoint with knn option
    • Returns top k nearest neighbors
    • Considers num_candidates per shard
      ◦ increase for more accurate but slower searches
      ◦ decrease for less accurate but faster searches

    POST image-index/_search
    {
      "knn": {
        "field": "image-vector",
        "query_vector": [-5, 9, -12],
        "k": 10,
        "num_candidates": 100
      }
    }
  17. Filtered kNN search
    • Returns the top k documents that also match the filter

    POST image-index/_search
    {
      "knn": {
        "field": "image-vector",
        "query_vector": [54, 10, -2],
        "k": 5,
        "num_candidates": 50,
        "filter": {
          "term": { "file-type": "png" }
        }
      }
    }
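The difference between filtering as part of the kNN search and filtering afterwards can be sketched with a brute-force scan over the three example images. This illustrates the semantics only ("top k among documents that match"), not how Elasticsearch implements it; all function names are hypothetical.

```python
images = [
    {"_id": "1", "image-vector": [1, 5, -20], "file-type": "jpg"},
    {"_id": "2", "image-vector": [42, 8, -15], "file-type": "png"},
    {"_id": "3", "image-vector": [15, 11, 23], "file-type": "jpg"},
]
query_vector = [54, 10, -2]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filtered_knn(query, documents, k, file_type):
    # Restrict to matching documents first, then take the k nearest.
    candidates = [d for d in documents if d["file-type"] == file_type]
    candidates.sort(key=lambda d: squared_l2(query, d["image-vector"]))
    return [d["_id"] for d in candidates[:k]]

def post_filtered_knn(query, documents, k, file_type):
    # Take the k nearest first, then drop non-matching ones (may return fewer than k).
    nearest = sorted(documents, key=lambda d: squared_l2(query, d["image-vector"]))[:k]
    return [d["_id"] for d in nearest if d["file-type"] == file_type]

png_hits = filtered_knn(query_vector, images, k=5, file_type="png")
jpg_hits = filtered_knn(query_vector, images, k=1, file_type="jpg")
jpg_post_filtered = post_filtered_knn(query_vector, images, k=1, file_type="jpg")
```

Here the nearest image overall is a png, so post-filtering for jpg with k=1 returns nothing, while the filtered search still returns the nearest jpg.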
  18. Hybrid retrieval
    • Combine query and knn
    • Combines top k vector matches with results from query
      ◦ behaves like a boolean OR
      ◦ score is a linear combination of the two subscores:
        score = 0.9 * match_score + 0.1 * knn_score

    POST image-index/_search
    {
      "query": {
        "match": {
          "title": {
            "query": "mountain lake",
            "boost": 0.9
          }
        }
      },
      "knn": {
        "field": "image-vector",
        "query_vector": [54, 10, -2],
        "k": 5,
        "num_candidates": 50,
        "boost": 0.1
      },
      "size": 10
    }
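The OR-like behavior and the linear combination can be sketched as follows. The subscores below are made-up numbers for illustration; a document that appears in only one of the two result sets contributes 0 for the missing subscore. The function name is hypothetical.

```python
def hybrid_scores(match_scores, knn_scores, match_boost=0.9, knn_boost=0.1):
    """Combine two result sets like a boolean OR, with boosted subscores."""
    combined = {}
    for doc_id in set(match_scores) | set(knn_scores):
        combined[doc_id] = (match_boost * match_scores.get(doc_id, 0.0)
                            + knn_boost * knn_scores.get(doc_id, 0.0))
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical subscores: "a" matches only the text query, "c" only the kNN search.
match_scores = {"a": 2.0, "b": 1.0}
knn_scores = {"b": 0.5, "c": 0.8}
ranked = hybrid_scores(match_scores, knn_scores)
```

Document "b" matches both and scores 0.9 * 1.0 + 0.1 * 0.5, while "a" and "c" each receive only one boosted subscore.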
  19. kNN and aggregations
    • Combine kNN searches with aggregations to create facets
      ◦ calculated on the top k nearest documents

    POST image-index/_search
    {
      "knn": {
        "field": "image-vector",
        "query_vector": [54, 10, -2],
        "k": 5,
        "num_candidates": 50,
        "boost": 0.1
      },
      "aggs": {
        "top-file-types": {
          "terms": {
            "field": "file-type",
            "size": 10
          }
        }
      }
    }
  20. Tune ANN searches
    • Ensure data nodes have enough memory
    • Warm up the filesystem cache
    • Reduce vector dimensionality
    • Use dot_product instead of cosine similarity
      ◦ requires normalizing the vectors to length 1
    • Exclude vector fields from _source
    • Force merge to one segment
    • Avoid heavy indexing during searches
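The dot_product tip works because, for vectors of length 1, the dot product equals the cosine similarity, so the per-comparison length computation can be skipped. A minimal sketch of normalizing vectors before indexing (the helper names are hypothetical):

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

a = [42, 8, -15]
b = [54, 10, -2]
unit_a, unit_b = normalize(a), normalize(b)
```

Both the indexed vectors and the query vector need to be normalized for the two measures to agree.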
  21. Try it yourself
    • Blog post with step-by-step tutorial:
      ◦ using a text embedding model to generate vector representations of text content
      ◦ vector similarity search on the generated vectors
    elastic.co/blog/how-to-deploy-nlp-text-embeddings-and-vector-search
  22. Useful links
    • Blog post: elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
    • YouTube video "Zalando: Enriching E-Commerce Search with Elasticsearch 8's k-Nearest Neighbours": youtube.com/watch?v=YthR1ROX2g8
    • kNN search: elastic.co/guide/en/elasticsearch/reference/current/knn-search.html
    • Dense vector field type: elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
    • Script score query: elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html
    • Vector functions: elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html#vector-functions
    • Search API knn option: elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-api-knn
    • Follow the work in the GitHub issue "ANN search improvements": github.com/elastic/elasticsearch/issues/84324