Modernizing Search with OpenSearch: Beyond Keyword Matching

About Me
• Father of 4 girls
• 15+ years in the search space
• Elasticsearch & OpenSearch Consultant
• OpenSearch Ambassador https://opensearch.org/ambassadors/
• Maintainer for the OpenSearch Kubernetes Operator
• CTO & Founder of BigData Boutique https://bigdataboutique.com/
• Creator of Pulse for OpenSearch https://pulse.support/

bigdataboutique.com World Class OpenSearch Experts

Agenda
• OpenSearch and Keyword Search
• Vector Search
• Semantic Search and Hybrid Search
• Query Intent Understanding
• Measuring and Tuning Relevance

Why OpenSearch?
• Current version 3.5
• Origins in Elasticsearch (forked at 7.10)
• 100% Open Source (Apache 2.0)
• Notable OpenSearch Foundation members
• AI and Vector Search Ready (and improving)
• Rich Ecosystem & Tools
• Well Understood and Supported
• Managed Service Options: AWS, Aiven, Instaclustr
https://docs.opensearch.org/latest/vector-search/ai-search/index/

Keyword Search
The basics:
• Normalization and Stemming
• tf/idf and BM25
• Fast
Where it falls short:
• Query has to exactly match terms in docs
  ◦ no semantic understanding
  ◦ synonyms
• Scoring challenges
• Can’t support AI / Agentic workloads
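
To make the BM25 bullet and the exact-match limitation concrete, here is a minimal sketch of a textbook BM25 variant in plain Python. The toy documents and the whitespace tokenization are assumptions for illustration; OpenSearch's own scoring (via Lucene) differs in details such as analysis and norm encoding.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the query with a textbook BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # no exact term match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (assumed for illustration)
docs = [
    "cheap electric car".split(),
    "affordable ev with long battery range".split(),
    "electric guitar for sale".split(),
]
scores = bm25_scores(["electric", "car"], docs)
```

The second document is about the same thing as the query but shares no term with it, so it scores zero. That is the "no semantic understanding" gap the slide refers to.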

Semantic Search

Vector Search (ANN / kNN)
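
As a rough illustration of what kNN means here, the sketch below does an exact (brute-force) nearest-neighbor search with cosine similarity over a handful of made-up 3-dimensional vectors. ANN indexes approximate this result without scoring every vector; the vectors and document names are assumptions for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, vectors, k=2):
    """Exact kNN: score every vector, keep the k most similar ids."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]

# Made-up embeddings (real models produce hundreds or thousands of dimensions)
vectors = {
    "doc_cars": [0.9, 0.1, 0.0],
    "doc_evs": [0.8, 0.3, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}
nearest = knn([1.0, 0.2, 0.0], vectors, k=2)
```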

Vectors: Sparse vs Dense
Sparse
• Captures predefined attributes
• Often faster to search on
• Cheap to compute
Dense
• Generated by large models (LLMs)
• Could be costly
• Captures semantic meaning well

Sparse Vectors - TF/IDF & BM25
sparse_vector = {
    0: 0.35,  # "machine": appears in doc, moderate IDF
    1: 0.35,  # "learning": appears in doc, moderate IDF
    2: 0.72,  # "subset": rare term across corpus, high IDF
    3: 0.68,  # "artificial"
    4: 0.68,  # "intelligence"
    5: 0.55,  # "models"
    6: 0.50,  # "learn"
    7: 0.61,  # "patterns"
    8: 0.45,  # "training"
    9: 0.30,  # "data": very common, low IDF
}

Sparse Vectors - SPLADE
# Conceptual SPLADE output for "affordable electric cars"
sparse_vector = {
    "affordable": 2.1,
    "electric": 1.9,
    "cars": 1.8,
    "vehicle": 1.3,  # not in original text!
    "ev": 1.1,       # not in original text!
    "cheap": 0.9,    # not in original text!
    "price": 0.7,    # not in original text!
    "battery": 0.5,  # not in original text!
    "tesla": 0.3,    # not in original text!
}
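
To show why the expanded terms matter, here is a small sketch that scores a query against the conceptual SPLADE vector with a sparse dot product. The query weights are invented for illustration; a real setup would produce them with the same sparse encoder.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product over the terms the two sparse vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Conceptual SPLADE expansion of "affordable electric cars" (from the slide)
expanded_doc = {
    "affordable": 2.1, "electric": 1.9, "cars": 1.8, "vehicle": 1.3, "ev": 1.1,
    "cheap": 0.9, "price": 0.7, "battery": 0.5, "tesla": 0.3,
}
# The same document with only its literal terms, as plain keyword matching sees it
literal_doc = {"affordable": 2.1, "electric": 1.9, "cars": 1.8}

# Hypothetical query weights for "cheap ev"
query_vec = {"cheap": 1.0, "ev": 1.0}

expanded_score = sparse_dot(query_vec, expanded_doc)
literal_score = sparse_dot(query_vec, literal_doc)
```

The query shares no literal term with the document, so the literal score is zero, while the expanded vector still produces a match through "cheap" and "ev".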

Dense Vectors
• Open-Source / Open-Weight Models
  ◦ Snowflake (arctic-embed), E5 from Microsoft, Alibaba and more
• Proprietary or API-based
  ◦ Cohere Embed v4
  ◦ OpenAI
  ◦ AWS Titan
  ◦ …

HyDE (Hypothetical Document Embeddings)
User query: "How do black holes form?"
Step 1: LLM generates a hypothetical document: "Black holes form when massive stars exhaust their nuclear fuel and undergo gravitational collapse. When a star with a mass greater than approximately 20-25 solar masses reaches the end of its life, the core collapses under its own gravity, and if the remaining mass exceeds the Tolman-Oppenheimer-Volkoff limit, no known force can prevent complete collapse into a singularity..."
Step 2: Embed this fake document (not the original query)
Step 3: Use that embedding to search your corpus via ANN
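
The three HyDE steps can be sketched as a small function. The `generate`, `embed`, and `ann_search` callables are hypothetical placeholders for an LLM call, an embedding model, and a vector index; the toy stand-ins below exist only so the sketch runs on its own.

```python
def hyde_search(query, generate, embed, ann_search, k=10):
    """HyDE: search with the embedding of an LLM-written answer instead of the raw query."""
    # Step 1: ask the LLM for a hypothetical document that answers the query
    hypothetical_doc = generate(f"Write a short passage that answers: {query}")
    # Step 2: embed the fake document, not the original query
    vector = embed(hypothetical_doc)
    # Step 3: run the ANN search with that embedding
    return ann_search(vector, k)

# Toy stand-ins (assumptions, not real services)
def fake_llm(prompt):
    return "Black holes form when massive stars collapse under their own gravity."

def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]

def fake_ann(vector, k):
    return [{"id": "doc-1", "query_vector": vector}][:k]

results = hyde_search("How do black holes form?", fake_llm, fake_embed, fake_ann, k=5)
```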

Precision and Recall
Precision: "Of all retrieved items, how many were actually positive?" (focusing on avoiding false positives)
Recall: "Of all actual positive items, how many did the search find?" (focusing on avoiding false negatives)
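
A short worked example of both definitions, using made-up document ids for one query:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# 4 results returned, 2 of them relevant; 3 relevant docs exist in total
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

Here precision is 2/4 and recall is 2/3: half of what was returned is relevant, and one relevant document was missed.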

Hybrid Search
• Combines Keyword + Vector Search
• Helps with improving precision
• Often involves additional methods
• Reciprocal Rank Fusion (RRF) merges the ranked results from multiple search methods (like keyword and vector search) into a single, more relevant list. Documents that appear high in multiple lists get higher importance, which makes it a good fit for hybrid search.
• BTW, we really like Cohere Embed for text embeddings: https://bigdataboutique.com/blog/cohere-embed-4-reducing-memory-footprint-with-no-loss-in-search-quality-dfb1d7
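
For reference, RRF itself is only a few lines. This is a minimal sketch in plain Python; the two result lists are made up, and k=60 is the constant commonly used with RRF rather than anything specific to this talk.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["a", "b", "c"]  # e.g. BM25 order
vector_hits = ["c", "a", "d"]   # e.g. kNN order
fused = rrf([keyword_hits, vector_hits])
```

Documents "a" and "c" appear in both lists, so they end up ahead of "b" and "d", which each appear in only one.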

Semantic Search + Metadata
• Filter by label, category, location, source, etc.
• Scope to user, product, role
• Not to be confused with Hybrid Search
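
One way this could look in practice is a k-NN query with a metadata filter attached, written here as a Python dict that would be sent as the search request body. The field names (`embedding`, `category`, `allowed_users`) are assumptions for the example, and filter support inside the `knn` clause depends on the engine and OpenSearch version in use.

```python
def filtered_knn_query(vector, k, user_id, category):
    """Build a k-NN search body restricted to one category and one user's documents."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # hypothetical knn_vector field
                    "vector": vector,
                    "k": k,
                    "filter": {
                        "bool": {
                            "filter": [
                                {"term": {"category": category}},
                                {"term": {"allowed_users": user_id}},
                            ]
                        }
                    },
                }
            }
        },
    }

body = filtered_knn_query([0.1, 0.2, 0.3], k=5, user_id="u-42", category="manuals")
```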

The RAG Use-Case
• Pre Processing
• Retrieve
• Generate

Example: Image Search
https://bigdataboutique.com/blog/implementing-semantic-image-search-with-amazon-opensearch-service-2e365d

Business Signals and Boosting
• Decay functions on numeric values
• Use function_score to prioritize recent documents, boost on popularity, etc.
https://docs.opensearch.org/latest/query-dsl/compound/function-score/
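
A sketch of what such a function_score query might look like, as a Python dict for the request body. The field names (`title`, `published_at`, `popularity`) and the 30-day scale are assumptions; the right functions and weights depend on the data.

```python
def boosted_query(text):
    """Text match whose score is multiplied by recency and popularity signals."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": text}},
                "functions": [
                    # Recency: the score contribution halves for documents ~30 days old
                    {"gauss": {"published_at": {"origin": "now", "scale": "30d", "decay": 0.5}}},
                    # Popularity: dampened with log1p so very popular items do not dominate
                    {"field_value_factor": {"field": "popularity", "modifier": "log1p", "missing": 0}},
                ],
                "score_mode": "sum",       # how the function values are combined
                "boost_mode": "multiply",  # how the result is combined with the text score
            }
        }
    }

query = boosted_query("wireless headphones")
```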

Query Intent

Keyword Extraction

Example: Synonym Expansion with LLM
https://bigdataboutique.com/blog/innovating-search-experience-with-amazon-opensearch-and-amazon-bedrock-d045bc

Query Intent Understanding
User query: Red Dress
Translates into OpenSearch query:
-> color: red AND (
->   “dress” in category or “dress” in description OR
->   category: Clothing
)
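
The translation above could be sketched as a tiny rule-based parser that emits a bool query. The color vocabulary, the field names, and the fallback category are assumptions for illustration; in practice the attribute extraction might come from an LLM or a trained classifier instead of a lookup set.

```python
KNOWN_COLORS = {"red", "blue", "black"}  # hypothetical attribute vocabulary

def build_query(user_query, fallback_category="Clothing"):
    """Turn a free-text query into a structured bool query with a color filter."""
    tokens = user_query.lower().split()
    colors = [t for t in tokens if t in KNOWN_COLORS]
    rest = " ".join(t for t in tokens if t not in KNOWN_COLORS)
    bool_query = {
        # At least one of these must match (the OR group in the slide)
        "should": [
            {"match": {"category": rest}},
            {"match": {"description": rest}},
            {"term": {"category.keyword": fallback_category}},
        ],
        "minimum_should_match": 1,
    }
    if colors:
        # The recognized color becomes a hard filter (the AND in the slide)
        bool_query["filter"] = [{"term": {"color": colors[0]}}]
    return {"query": {"bool": bool_query}}

query = build_query("Red Dress")
```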

Search Relevance Evaluation
• Once a Baseline is established
• A Judgment List is needed:
• Human Evaluation
  ◦ Crowd sourcing
  ◦ Click streams
  ◦ Thumbs up / down
• Production Feedback
• LLM as a judge

Common Evaluation Metrics
nDCG@10 is a typical choice, but different use cases have different concerns:
• Internal knowledge base (freshness, good recall metric)
• eCommerce (heavy on business signals, query intent understanding)
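
For readers who have not computed it by hand, here is a minimal nDCG@k sketch. It assumes linear gains (some implementations use 2^rel - 1 instead) and, as a simplification, derives the ideal ordering from the judged results in the list itself rather than from every judged document for the query.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gains at lower positions count less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k for one query, given the judged relevance of each result in ranked order."""
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1])   # best possible order
reversed_ = ndcg_at_k([1, 2, 3])  # same documents, worst order
```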

The Relevance Workbench
https://docs.opensearch.org/latest/search-plugins/search-relevance/using-search-relevance-workbench/

Rerank
• Use an AI model to reorder results
• Based on relevancy to the question
• Learning To Rank (LTR)
https://docs.opensearch.org/latest/search-plugins/ltr/index/
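
The reranking step can be sketched as a second pass over the first-stage hits. The `score_fn` parameter stands in for whatever model does the scoring (a cross-encoder, an LTR model, or an LLM); the word-overlap scorer below is only a toy placeholder so the example runs.

```python
def rerank(query, hits, score_fn, top_n=3):
    """Re-score first-stage hits with a relevance model and return them reordered."""
    scored = [(score_fn(query, hit["text"]), hit) for hit in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored[:top_n]]

def overlap_score(query, text):
    # Toy stand-in for a real reranking model: share of query words found in the text
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)

# Made-up first-stage results, in their original order
hits = [
    {"id": 1, "text": "Stars and galaxies overview"},
    {"id": 2, "text": "How black holes form from collapsing stars"},
    {"id": 3, "text": "Black coffee brewing guide"},
]
reranked = rerank("how do black holes form", hits, overlap_score)
```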

Scaling Vector Search
https://bigdataboutique.com/blog/scaling-vector-search-performance-from-millions-to-billions-8d50a1

Summary
• Keyword Search is not enough in 2026 search applications
• Getting started with vectors has never been so easy
• Use Hybrid Search and RRF
• Don’t forget Relevance Evaluation
• Query Intent understanding is easy to start with and highly recommended
Tutorial you could follow: https://bigdataboutique.com/blog/recipes-to-vectors-using-opensearch-as-vector-database-aba607

bigdataboutique.com [email protected] Contact

Itamar Syn-Hershko
