Itamar Syn-Hershko
February 26, 2026

Modernizing Search with OpenSearch: Beyond Keyword Matching

Transcript

  1. About Me
     • Father of 4 girls
     • 15+ years in the search space
     • Elasticsearch & OpenSearch Consultant
     • OpenSearch Ambassador https://opensearch.org/ambassadors/
     • Maintainer for the OpenSearch Kubernetes Operator
     • CTO & Founder of BigData Boutique https://bigdataboutique.com/
     • Creator of Pulse for OpenSearch https://pulse.support/
  2. Agenda
     • OpenSearch and Keyword Search
     • Vector Search
     • Semantic Search and Hybrid Search
     • Query Intent Understanding
     • Measuring and Tuning Relevance
  3. Why OpenSearch?
     • Current version 3.5
     • Origins in Elasticsearch (forked at 7.10)
     • 100% Open Source (Apache 2.0)
     • Notable OpenSearch Foundation members
     • AI and Vector Search Ready (and improving)
     • Rich Ecosystem & Tools
     • Well Understood and Supported
     • Managed Service Options: AWS, Aiven, Instaclustr
     https://docs.opensearch.org/latest/vector-search/ai-search/index/
  4. Keyword Search
     The basics:
     • Normalization and Stemming
     • TF/IDF and BM25
     • Fast
     Where it falls short:
     • Query has to exactly match terms in docs
       ◦ no semantic understanding
       ◦ synonyms
     • Scoring challenges
     • Can’t support AI / Agentic workloads
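The exact-match limitation can be seen in a few lines of Python. This is a minimal sketch of the textbook BM25 formula, assuming whitespace tokenization, no stemming, and the commonly used defaults k1=1.2 and b=0.75; the sample documents and the `bm25_scores` helper are made up for illustration and are not OpenSearch internals.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact term match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical corpus: the second document is relevant but shares no terms.
docs = [
    "affordable electric cars for city driving",
    "cheap ev with long battery range",
    "history of the steam engine",
]
scores = bm25_scores("affordable electric cars", docs)
print(scores)
```

The second document is about the same topic but uses different words ("cheap", "ev"), so it scores zero, which is the gap that synonyms and vector search try to close.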
  5. Vectors: Sparse vs Dense
     Sparse
     • Captures predefined attributes
     • Often faster to search on
     • Cheap to compute
     Dense
     • Generated by large models (LLMs)
     • Could be costly
     • Captures semantic meaning well
  6. Sparse Vectors - TF/IDF & BM25
     sparse_vector = {
         0: 0.35,  # "machine": appears in doc, moderate IDF
         1: 0.35,  # "learning": appears in doc, moderate IDF
         2: 0.72,  # "subset": rare term across corpus, high IDF
         3: 0.68,  # "artificial"
         4: 0.68,  # "intelligence"
         5: 0.55,  # "models"
         6: 0.50,  # "learn"
         7: 0.61,  # "patterns"
         8: 0.45,  # "training"
         9: 0.30,  # "data": very common, low IDF
     }
  7. Sparse Vectors - SPLADE
     # Conceptual SPLADE output for "affordable electric cars"
     sparse_vector = {
         "affordable": 2.1,
         "electric": 1.9,
         "cars": 1.8,
         "vehicle": 1.3,  # not in original text!
         "ev": 1.1,       # not in original text!
         "cheap": 0.9,    # not in original text!
         "price": 0.7,    # not in original text!
         "battery": 0.5,  # not in original text!
         "tesla": 0.3,    # not in original text!
     }
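To show why the expanded terms matter, here is a small sketch that scores documents with a sparse dot product. The query weights are the conceptual values from the slide; the two document vectors and the `sparse_dot` helper are hypothetical, and a real engine would use an inverted index rather than a Python loop.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict and look up terms in the larger one.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Conceptual SPLADE expansion of "affordable electric cars" (from the slide).
expanded_query = {
    "affordable": 2.1, "electric": 1.9, "cars": 1.8,
    "vehicle": 1.3, "ev": 1.1, "cheap": 0.9,
    "price": 0.7, "battery": 0.5, "tesla": 0.3,
}
# The same query without expansion: only the literal terms.
keyword_only_query = {"affordable": 2.1, "electric": 1.9, "cars": 1.8}

# Hypothetical document vectors, made up for this example.
doc_ev = {"cheap": 1.0, "ev": 1.2, "battery": 0.8}
doc_steam = {"steam": 1.5, "engine": 1.4}

expanded_score = sparse_dot(expanded_query, doc_ev)
keyword_score = sparse_dot(keyword_only_query, doc_ev)
unrelated_score = sparse_dot(expanded_query, doc_steam)
print(expanded_score, keyword_score, unrelated_score)
```

The expanded query matches the EV document through "cheap", "ev" and "battery", while the literal terms alone give it a score of zero.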
  8. Dense Vectors
     • Open-Source / Open-Weight Models
       ◦ Snowflake (arctic-embed), e5 from Microsoft, Alibaba and more
     • Proprietary or API-based
       ◦ Cohere Embed v4
       ◦ OpenAI
       ◦ AWS Titan
       ◦ …
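Whichever model produces them, dense vectors are usually compared with a similarity measure such as cosine similarity. The sketch below assumes tiny 3-dimensional vectors invented for illustration; real embedding models return hundreds or thousands of dimensions, and the values here do not come from any of the models listed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (toy values, not real model output).
query_vec = [0.9, 0.1, 0.0]      # "affordable electric cars"
doc_ev_vec = [0.8, 0.2, 0.1]     # "cheap ev with long battery range"
doc_steam_vec = [0.0, 0.1, 0.95] # "history of the steam engine"

sim_ev = cosine_similarity(query_vec, doc_ev_vec)
sim_steam = cosine_similarity(query_vec, doc_steam_vec)
print(sim_ev, sim_steam)
```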
  9. HyDE (Hypothetical Document Embeddings)
     User query: "How do black holes form?"
     Step 1: LLM generates a hypothetical document:
       "Black holes form when massive stars exhaust their nuclear fuel and undergo gravitational collapse. When a star with a mass greater than approximately 20-25 solar masses reaches the end of its life, the core collapses under its own gravity, and if the remaining mass exceeds the Tolman-Oppenheimer-Volkoff limit, no known force can prevent complete collapse into a singularity..."
     Step 2: Embed this fake document (not the original query)
     Step 3: Use that embedding to search your corpus via ANN
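The three steps can be sketched as a small pipeline. Everything below is a stand-in and an assumption: `fake_llm` returns a canned answer instead of calling a model, `toy_embed` counts a handful of vocabulary words instead of producing a real embedding, and `brute_force_search` replaces ANN with an exhaustive scan over a two-document corpus.

```python
def hyde_search(query, generate, embed, search, k=10):
    """HyDE: embed an LLM-written hypothetical answer instead of the query."""
    hypothetical_doc = generate(query)   # Step 1
    vector = embed(hypothetical_doc)     # Step 2
    return search(vector, k)             # Step 3


# --- Toy stand-ins (hypothetical, for illustration only) ---
VOCAB = ["stars", "collapse", "gravity", "recipe", "flour"]

CORPUS = {
    "stellar": "Massive stars end in core collapse under their own gravity.",
    "baking": "A bread recipe needs flour and water.",
}


def fake_llm(query):
    # A real system would prompt an LLM here.
    return ("Black holes form when massive stars exhaust their nuclear "
            "fuel and undergo gravitational collapse.")


def toy_embed(text):
    # Bag-of-words counts over a tiny vocabulary, not a real embedding.
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return [float(words.count(term)) for term in VOCAB]


def brute_force_search(vector, k):
    # Exhaustive dot-product scan standing in for an ANN index.
    def score(doc_id):
        return sum(x * y for x, y in zip(vector, toy_embed(CORPUS[doc_id])))
    return sorted(CORPUS, key=score, reverse=True)[:k]


hits = hyde_search("How do black holes form?", fake_llm, toy_embed,
                   brute_force_search, k=1)
print(hits)
```

In this toy setup the raw query embeds to an all-zero vector because it shares no vocabulary with the corpus, while the hypothetical document does, which is the intuition behind HyDE.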
  10. Precision and Recall
      Precision: "Of all retrieved items, how many were actually positive?" (focusing on avoiding false positives)
      Recall: "Of all actual positive items, how many did the search find?" (focusing on avoiding false negatives)
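Both definitions reduce to simple set arithmetic. The document IDs below are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result list and judgments.
retrieved = ["d1", "d2", "d3", "d4"]  # what the search returned
relevant = ["d1", "d3", "d7"]         # what a judge marked as relevant

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of about 0.67).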
  11. Hybrid Search
      • Combines Keyword + Vector Search
      • Helps with improving precision
      • Often involves additional methods
      • Reciprocal Rank Fusion (RRF) merges the ranked results from multiple search methods (like keyword and vector search) into a single, more relevant list. Documents that rank high in multiple lists get more weight, which makes it a good fit for hybrid search.
      • BTW, we really like Cohere Embed for text embeddings: https://bigdataboutique.com/blog/cohere-embed-4-reducing-memory-footprint-with-no-loss-in-search-quality-dfb1d7
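RRF itself is only a few lines. The sketch below assumes the commonly used rank constant k=60 and two made-up result lists; it illustrates the fusion formula only and is not how OpenSearch implements it internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retrieval method.
keyword_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents d1 and d3 appear in both lists, so they end up ahead of d2 and d4, which each appear in only one. Because RRF works on ranks rather than raw scores, it avoids having to normalize BM25 scores against vector similarities.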
  12. Semantic Search + Metadata
      • Filter by label, category, location, source, etc.
      • Scope to user, product, role
      • Not to be confused with Hybrid Search
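One way to express this in OpenSearch is a `knn` query with a `filter` clause. The sketch below only builds the request body as a Python dict; the field names `embedding`, `category` and `user_id` are assumptions about a hypothetical index mapping, and filter support depends on the k-NN engine configured for the index.

```python
import json


def filtered_knn_query(vector, k, category, user_id):
    """Build a k-NN search body restricted by metadata filters."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {          # hypothetical knn_vector field
                    "vector": vector,
                    "k": k,
                    "filter": {
                        "bool": {
                            "must": [
                                {"term": {"category": category}},
                                {"term": {"user_id": user_id}},
                            ]
                        }
                    },
                }
            }
        },
    }


body = filtered_knn_query([0.12, 0.48, 0.33], k=5,
                          category="manuals", user_id="u-42")
print(json.dumps(body, indent=2))
```

The filter narrows which documents are eligible, while the vector still decides the ranking, which is why this differs from hybrid search.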
  13. Business Signals and Boosting
      • Decay functions on numeric values
      • Use function_score to prioritize recent documents, boost on popularity, etc.
      https://docs.opensearch.org/latest/query-dsl/compound/function-score/
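A `function_score` query combining a recency decay with a popularity boost might look like the sketch below. The field names `description`, `published_at` and `popularity`, as well as the 30-day scale, are assumptions chosen for illustration and would need tuning against real data.

```python
import json


def recency_popularity_query(text):
    """Build a function_score body: text match, boosted by recency and popularity."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"description": text}},
                "functions": [
                    # Gaussian decay: the boost halves about 30 days from now.
                    {"gauss": {"published_at": {
                        "origin": "now", "scale": "30d", "decay": 0.5}}},
                    # Dampened popularity signal; missing values count as 1.
                    {"field_value_factor": {
                        "field": "popularity",
                        "modifier": "log1p",
                        "missing": 1}},
                ],
                "score_mode": "sum",        # add the two function results
                "boost_mode": "multiply",   # then multiply with the text score
            }
        }
    }


body = recency_popularity_query("red dress")
print(json.dumps(body, indent=2))
```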
  14. Query Intent Understanding
      User query: Red Dress
      Translates into OpenSearch query:
      -> color: red AND (
      ->   “dress” in category OR “dress” in description OR
      ->   category: Clothing
         )
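A rule-based starting point for this translation is sketched below. The color list, the term-to-category lookup, and the field names (`color`, `category`, `category.keyword`, `description`) are all hypothetical; a production system might use an LLM or a trained classifier to extract the attributes instead.

```python
KNOWN_COLORS = {"red", "blue", "green", "black", "white"}  # hypothetical list
TERM_TO_CATEGORY = {"dress": "Clothing", "shirt": "Clothing"}  # hypothetical


def build_intent_query(user_query):
    """Turn a free-text query into a structured bool query."""
    tokens = user_query.lower().split()
    colors = [t for t in tokens if t in KNOWN_COLORS]
    terms = [t for t in tokens if t not in KNOWN_COLORS]
    text = " ".join(terms)

    should = []
    if text:
        should.append({"match": {"category": text}})
        should.append({"match": {"description": text}})
    for term in terms:
        if term in TERM_TO_CATEGORY:
            should.append(
                {"term": {"category.keyword": TERM_TO_CATEGORY[term]}})

    # Recognized colors become hard filters; the rest is an OR group.
    bool_query = {"filter": [{"term": {"color": c}} for c in colors]}
    if should:
        bool_query["should"] = should
        bool_query["minimum_should_match"] = 1
    return {"query": {"bool": bool_query}}


red_dress = build_intent_query("Red Dress")
print(red_dress)
```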
  15. Search Relevance Evaluation
      • Once a Baseline is established
      • A Judgment List is needed:
      • Human Evaluation
        ◦ Crowd sourcing
        ◦ Click streams
        ◦ Thumbs up / down
      • Production Feedback
      • LLM as a judge
  16. Common Evaluation Metrics
      nDCG@10 is a typical choice, but different use cases have different concerns:
      • Internal knowledge base (freshness, good recall metric)
      • eCommerce (heavy on business signals, query intent understanding)
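For reference, nDCG@k can be computed from a judgment list in a few lines. This sketch assumes graded relevance labels (0 to 3) with linear gain, and it derives the ideal ordering from the judged results in the list itself; some implementations use exponential gain (2^rel - 1) or build the ideal ranking from every judged document for the query.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k results."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k=10):
    """DCG normalized by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical judgments for the results of one query, in ranked order.
perfect_order = [3, 2, 1, 0]
swapped_tail = [3, 2, 0, 1]

print(ndcg_at_k(perfect_order), ndcg_at_k(swapped_tail))
```

A perfectly ordered list scores 1.0, and pushing a relevant document further down lowers the score only slightly because of the logarithmic discount.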
  17. Rerank
      • Use an AI model to reorder results
      • Based on relevancy to the question
      • Learning To Rank (LTR) https://docs.opensearch.org/latest/search-plugins/ltr/index/
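Conceptually, reranking is a second pass over the first-stage hits with a more expensive scorer. In the sketch below `overlap_score` is a toy stand-in for that model (a real setup would call a cross-encoder or an LTR model), and the hits are made up.

```python
def rerank(query, hits, score_fn, top_n=None):
    """Reorder first-stage hits by a second-stage relevance score."""
    ordered = sorted(hits, key=lambda hit: score_fn(query, hit["text"]),
                     reverse=True)
    return ordered if top_n is None else ordered[:top_n]


def overlap_score(query, text):
    # Toy scorer: number of shared words. Stands in for an AI model.
    return len(set(query.lower().split()) & set(text.lower().split()))


# Hypothetical first-stage results, in their original order.
hits = [
    {"id": "a", "text": "steam engine history"},
    {"id": "b", "text": "how black holes form in space"},
]

reranked = rerank("how do black holes form", hits, overlap_score)
print([hit["id"] for hit in reranked])
```

Because the second-stage scorer only sees the top hits from the first stage, it can afford to be much slower per document than the retrieval step.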
  18. Summary
      • Keyword Search is not enough in 2026 search applications
      • Getting started with vectors has never been so easy
      • Use Hybrid Search and RRF
      • Don’t forget Relevance Evaluation
      • Query Intent understanding is easy to start with and highly recommended
      Tutorial you could follow: https://bigdataboutique.com/blog/recipes-to-vectors-using-opensearch-as-vector-database-aba607