
Reduce LLM Calls with Vector Search

Raphael De Lio
September 30, 2025


LLMs are powerful, but calling them for everything gets expensive, slow, and energy-hungry fast. What if you could handle common tasks like classification, routing, and caching without reaching for a massive model every time?

In this session, I’ll show you how to use vector search and semantic patterns to build smarter systems that skip unnecessary LLM calls and still deliver. We’ll cover:

• How semantic classification can match intent without tokens or prompts
• How to route requests based on meaning, not brittle rules
• How semantic caching helps you reuse answers and cut costs

You’ll see how to replace brute-force prompting with clean, efficient logic using embeddings, similarity, and lightweight decision-making. No complex ML pipelines, no GPU bills, just smart patterns that save time, money, and energy.

This session will help you do it better with fewer calls, less waste, and a lot more control.



Transcript

  1. © 2026 Redis Ltd. All rights reserved. Reducing LLM Calls with Vector Search. Raphael De Lio.
  2. The new stack for AI agents: Redis leads as the most-used tool for agent data and vector search. Check the full survey at: https://survey.stackoverflow.co/2025/
  3. Not all context is good context. GPT-5 API price: $1.25 / 1M input tokens, $10 / 1M output tokens. Source: https://openai.com/index/introducing-gpt-5-for-developers/
  4. Precision is not improving. GPT-5.4 API price: $1.50 / 1M input tokens, $15 / 1M output tokens. Source: https://openai.com/index/introducing-gpt-5-4/
  5. What we're covering: semantic classification, semantic tool calling, and semantic caching. Vector search patterns for faster, cheaper, and greener performance.
  6. What is a vector? A (-110, 500), B (465, -497), C (-167, -500), D (-178, -200), E (-195, -454)
  7. What is a vector? Each vector is a point in multi-dimensional space. [Scatter plot of points A through E on axes running from -500 to 500]
  8. Vector search: finding similarity means measuring the distance between vectors. [Same scatter plot of points A through E]
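To make the distance idea concrete, here is a small plain-Python sketch using the five 2-D points from the slide; it finds the nearest neighbour of a point by Euclidean distance (a real vector database would do this over hundreds of dimensions, with an index instead of a linear scan):

```python
import math

# The five 2-D points from the slide
points = {
    "A": (-110, 500),
    "B": (465, -497),
    "C": (-167, -500),
    "D": (-178, -200),
    "E": (-195, -454),
}

def euclidean(p, q):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(name):
    """Return the label of the point closest to the named point."""
    return min((k for k in points if k != name),
               key=lambda k: euclidean(points[k], points[name]))

print(nearest("C"))  # E is only ~54 units away, far closer than D or B
```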
  9. Embedding model: embedding models turn unstructured data (images, text) into vectors, e.g. [0.0234, -0.1456, 0.0891, -0.2143, 0.1678, ... continues for 384 total dimensions ..., -0.1789, 0.0345]
  10. Vector representations enable similarity search: "What's the capital of Spain?" is similar to "Which city is the capital of Spain?"
  11. Vector search (recap): finding similarity means measuring the distance between vectors. [Same scatter plot of points A through E]
  12. The essentials of vector search: learn the fundamentals on the Redis YouTube channel. Videos: "Exact vs approximate nearest neighbors in vector databases", "What is a vector database?", "What is an embedding model?"
  13. Approach #1: using an LLM. A social media post ("Every repeated LLM call is money on fire. Redis 8 semantic caching understands meaning, not just keys. open.substack.com/pub/systemde...") is combined with the prompt "Is this about Redis?" and sent to the LLM, which returns a true/false response. Every query runs through the model: simple, but expensive.
  14. Disadvantages: token consumption and time spent. High token consumption and wasted time add up quickly.
  15. Approach #2: using a vector database. First, prompt an LLM to "Generate 150 social media posts about Redis" as reference examples, e.g.: "Pro tip: Use SCAN instead of KEYS in production. KEYS blocks the entire server while SCAN is non-blocking."; "Remember when everyone said Redis is just a cache? Now it powers real-time leaderboards, pub/sub systems, full applications. Evolution in action."; "PostgreSQL vs Redis for caching debate misses the point. Use Redis as L1 cache, PG as source of truth. Why choose when you can have both?"; "Our Redis instance has been running 847 days without restart. Rock solid stability 💪 #redis #uptime" [...]
  16. Approach #2: using a vector database. The reference posts ("Redis is the fastest tool for performing semantic caching", "Remember when everyone said Redis is just a cache? [...]", "PostgreSQL vs Redis for caching debate misses the point. [...]", "Our Redis instance has been running 847 days without restart. [...]" [...]) are embedded with the embedding model and the embeddings are stored in the vector database.
  17. Approach #2: using a vector database. A new real post ("Every repeated LLM call is money on fire. Redis 8 semantic caching understands meaning, not just keys. open.substack.com/pub/systemde...") is embedded and run through a similarity search. Nearest reference: "Redis is the fastest tool for performing semantic caching", similarity score 0.2843. Is it similar enough?
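A minimal sketch of this classification flow, with hand-made toy embeddings and a hypothetical 0.3 threshold standing in for a real embedding model and vector database. Scores here are cosine distances, where lower means more similar, which is how the scores on the slides read:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy reference embeddings standing in for embedded reference posts in a vector DB
references = {
    "Redis is the fastest tool for performing semantic caching": [0.9, 0.1, 0.2],
    "PostgreSQL vs Redis for caching debate misses the point": [0.7, 0.4, 0.3],
}

def classify(post_embedding, threshold=0.3):
    """Return (is_about_redis, best_matching_reference) without calling an LLM."""
    best_text, best_dist = min(
        ((text, cosine_distance(post_embedding, emb))
         for text, emb in references.items()),
        key=lambda pair: pair[1])
    return best_dist <= threshold, best_text

# A new post whose (toy) embedding lands close to the first reference
matched, ref = classify([0.88, 0.15, 0.25])
```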
  18. Classification self-improvement: the same flow, but when the new post matches (similarity score 0.2843), it is added as a new route reference, so the reference set grows over time.
  19. Classification hybrid approach: a new real post ("Redis 8 can scale to 1 billion vectors while keeping a median latency of 200ms") is embedded and searched; the nearest reference ("Our Redis instance has been running 847 days without restart. Rock solid stability 💪 #redis #uptime") scores only 0.693. If it is not similar enough, fall back to classification with the LLM, then add the post as a new route reference.
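The hybrid flow could be sketched like this, with a toy in-memory reference store and a hypothetical `llm_classify` callable in place of a vector database and a real LLM call:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Toy reference store; a real system would keep these in a vector database
references = {
    "Our Redis instance has been running 847 days without restart": [0.8, 0.2, 0.1],
}

def hybrid_classify(text, embedding, llm_classify, threshold=0.3):
    """Vector search first; fall back to the LLM only on a weak match, then
    store the LLM-labelled post as a new route reference (self-improvement)."""
    best = min(cosine_distance(embedding, e) for e in references.values())
    if best <= threshold:
        return True
    verdict = llm_classify(text)        # expensive path, taken rarely
    if verdict:
        references[text] = embedding    # future similar posts skip the LLM
    return verdict
```

After one LLM fallback, a repeat of the same post is answered by vector search alone.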
  20. Approach #1: using an LLM. The agent sends the prompt plus "These are the available tools" to the LLM, and the LLM responds with "Call tool X".
  21. Approach #2: using a vector database. Example routes: "What's the weather like?" / "Will it rain today?" → get_weather_default_city; "Hello! What can you do?" / "Hello! How can you help me?" → greeting_and_help; "Do I have any notifications?" / "Read my notifications" → new_notifications; "Turn on the lights" / "Make the lights light" → turn_on_the_lights_room.
  22. Approach #2: using a vector database. The reference utterances and their tools ("What's the weather like?" → get_weather_default_city, "Will it rain today?" → get_weather_default_city, "Hello! What can you do?" → greeting_and_help, [...]) are embedded with the embedding model and stored in the vector database.
  23. Approach #2: using a vector database. The user prompt "Hey!! What are you capable of??" is embedded and searched; the nearest reference is "Hello! What can you do?" with similarity score 0.0459. Is it similar enough? Tool: greeting_and_help.
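A sketch of semantic tool calling under the same assumptions (toy embeddings, a hypothetical 0.3 distance threshold, lower score = closer):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Reference utterances as toy embeddings, each mapped to the tool it triggers
routes = [
    ([0.9, 0.1, 0.0], "get_weather_default_city"),  # "Will it rain today?"
    ([0.0, 0.9, 0.1], "greeting_and_help"),         # "Hello! What can you do?"
    ([0.1, 0.0, 0.9], "new_notifications"),         # "Read my notifications"
]

def route(query_embedding, threshold=0.3):
    """Pick the tool whose reference utterance is closest to the query, or
    None when nothing is similar enough (caller can fall back to an LLM)."""
    dist, tool = min((cosine_distance(query_embedding, emb), tool)
                     for emb, tool in routes)
    return tool if dist <= threshold else None
```

A prompt embedded near the greeting reference routes to `greeting_and_help`; an ambiguous one returns None.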
  24. Tool calling chunking: the user prompt "Hey. I had a bad day yesterday. The weather was terrible and I crashed my bike onto a tree. Anyway, will it also rain today?" is embedded whole and searched; the nearest reference, "Hello! What can you do?", scores only 0.928 (tool: greeting_and_help), a weak match that would mis-route the request.
  25. Tool calling chunking: the same prompt is first split into chunks ("Hey. Anyway, will it also rain today?" and "I had a bad day yesterday. The weather was terrible and I crashed my bike onto a tree."); the relevant chunk now matches "Will it rain today?" with similarity score 0.274. Tool: get_weather_default_city.
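The chunking step might look like this sketch: split the prompt into sentences, route each chunk separately, and take the best-scoring one. The `embed` function here is a deterministic bag-of-words stand-in just to keep the example self-contained and runnable, not a real embedding model:

```python
import math
import re

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0          # treat empty text as maximally distant
    return 1.0 - dot / (na * nb)

def embed(text):
    """Toy embedder: bucket words into an 8-dim count vector.
    A real system would call an embedding model here."""
    vec = [0.0] * 8
    for word in re.findall(r"[a-z]+", text.lower()):
        vec[sum(ord(c) for c in word) % 8] += 1.0
    return vec

routes = [
    (embed("Will it rain today?"), "get_weather_default_city"),
    (embed("Hello! What can you do?"), "greeting_and_help"),
]

def route_chunked(prompt, threshold=0.5):
    """Split into sentences, route each chunk, keep the strongest match."""
    chunks = [c for c in re.split(r"(?<=[.?!])\s+", prompt) if c.strip()]
    dist, tool = min((cosine_distance(embed(c), emb), tool)
                     for c in chunks for emb, tool in routes)
    return tool if dist <= threshold else None
```

The chatty filler sentences score poorly against every reference, while the "will it also rain today?" chunk lands close to the weather route.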
  26. Approach #1: regular flow (no caching). The chatbot reads the input, the LLM generates, and the output is returned.
  27. Is there room for improvement? Token consumption and time spent.
  28. Approach #2: using a vector database. The user prompt "What are the colors available for the Chevy Colorado?" is embedded and searched; the nearest cached question is "In what color is the Colorado available?" with similarity score 0.274. Is it similar enough? Yes: return the cached response to the user. No: run the regular agentic pipeline.
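A minimal in-memory sketch of a semantic cache with toy embeddings; a real deployment would use something like Redis LangCache with a proper embedding model:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

class SemanticCache:
    """Return a stored answer when a new question is close enough in
    embedding space, instead of re-running the whole agentic pipeline."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.entries = []                 # list of (embedding, response)

    def get(self, embedding):
        if not self.entries:
            return None
        dist, response = min((cosine_distance(embedding, e), r)
                             for e, r in self.entries)
        return response if dist <= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

On a miss, the caller runs the regular pipeline and calls `put` so the next paraphrase of the same question becomes a hit.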
  29. Spring AI Advisors: advisors are interceptors that handle requests and responses in our AI applications. We can use them to perform actions before and/or after the request is sent, like enriching the request with more information or even cancelling the request to the chat model entirely. The chain runs Prompt → Advisor 1 before(1) → ... → Advisor N before(N) → Chat Model → after(N) → ... → after(1) → Response, with advised prompts flowing in and advised responses flowing out.
  30. Spring AI Advisors: SemanticGuardrailAdvisor (before: checks if the prompt is allowed; after: does nothing) and SemanticCachingAdvisor (before: checks the cache; after: stores the response in the cache). The prompt reaches the chat model only if it is allowed and there is no cache hit.
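Spring AI's actual advisors are Java classes; as a language-agnostic illustration of the same interceptor idea, here is a small Python sketch in which any advisor can modify the prompt, post-process the response, or short-circuit the chain entirely (an exact-match cache stands in for the semantic one):

```python
def run_with_advisors(prompt, advisors, chat_model):
    """Run before() hooks in order; any advisor may short-circuit the call
    (cache hit, blocked prompt). after() hooks run in reverse order."""
    for advisor in advisors:
        prompt, short_circuit = advisor.before(prompt)
        if short_circuit is not None:
            return short_circuit
    response = chat_model(prompt)
    for advisor in reversed(advisors):
        response = advisor.after(prompt, response)
    return response

class GuardrailAdvisor:
    def __init__(self, blocked_words):
        self.blocked = blocked_words

    def before(self, prompt):
        if any(w in prompt.lower() for w in self.blocked):
            return prompt, "Sorry, I can't help with that."
        return prompt, None

    def after(self, prompt, response):
        return response                    # guardrail does nothing after

class CachingAdvisor:
    def __init__(self):
        self.cache = {}                    # exact-match stand-in for a semantic cache

    def before(self, prompt):
        return prompt, self.cache.get(prompt)

    def after(self, prompt, response):
        self.cache[prompt] = response      # store for next time
        return response
```

With both advisors installed, a blocked prompt never reaches the model, and a repeated prompt is answered from the cache without a second model call.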
  31. Retrieval optimizer: the retrieval optimizer is an open-source framework for systematically improving the performance of search and retrieval systems that run on Redis. It is designed to take you from "my search seems okay" to "I can prove this configuration is optimal" by combining benchmarking, experimentation, and automated optimization.
  32. Check the full article at: https://redis.io/blog/benchmarking-results-for-vector-databases/
  33. In January alone, Redis LangCache saved our customers more than 5 billion LLM tokens. According to ChatGPT, that's approximately two trees saved per day. Our goal is to save a million trees this year.