Slide 1

MLCon 2025
Beyond LLMs: Using Embedding Models for Input Guarding, Semantic Routing, and Tool Decisions
Marco Frodl, Principal Consultant for Generative AI
[email protected] | @marcofrodl

Slide 2

About Me
Marco Frodl, Principal Consultant for Generative AI, Thinktecture AG
X: @marcofrodl
E-mail: [email protected]
LinkedIn: https://www.linkedin.com/in/marcofrodl/
https://www.thinktecture.com/thinktects/marco-frodl/

Slide 3

Turbo 🚀
https://www.aurelio.ai/semantic-router
"Semantic Router is a superfast decision-making layer for your LLMs and agents. Rather than waiting for slow, unreliable LLM generations to make tool-use or safety decisions, we use the magic of semantic vector space — routing our requests using semantic meaning."

Slide 4

Turbo 🚀
https://www.aurelio.ai/semantic-router
"It’s perfect for: input guarding, topic routing, tool-use decisions."
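
As a minimal sketch of what this looks like in code, assuming the pre-1.0 semantic-router Python API (Route, RouteLayer; newer releases rename the layer to SemanticRouter) and an OpenAI key for the remote encoder; the route names and utterances are illustrative, not from the talk:

```python
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer

# Each route is defined by a handful of example utterances;
# the encoder embeds them once, up front.
chitchat = Route(
    name="chitchat",
    utterances=["how's the weather today?", "let's talk about sports"],
)
politics = Route(
    name="politics",  # a topic we want to catch for input guarding
    utterances=["who should I vote for?", "what do you think of the party?"],
)

encoder = OpenAIEncoder()  # remote embedding model, reads OPENAI_API_KEY
router = RouteLayer(encoder=encoder, routes=[chitchat, politics])

# At query time the input is embedded once and matched in vector space --
# no LLM generation is involved in the decision.
print(router("don't you love politics?").name)  # -> "politics" (or None)
```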

Slide 5

Turbo 🚀 in Numbers
In my RAG example, Semantic Router using remote services is 3.4 times faster than an LLM and 30 times less expensive. Semantic Router with a local model is 7.75 times faster and 60 times less expensive than the LLM.

Slide 6

Really? More Safety? Better Speed? Less Budget?

Slide 7

Refresher: What is RAG?

Slide 8

Refresher: What is RAG?
https://aws.amazon.com/what-is/retrieval-augmented-generation/
"Retrieval-Augmented Generation (RAG) extends the capabilities of LLMs to an organization's internal knowledge, all without the need to retrain the model. It references an authoritative knowledge base outside of its training data sources before generating a response."

Slide 9

Simple RAG ("Ask me anything")
[Diagram] Question → Prepare Search (Embedding Model turns the question into a vector) → Vector DB → Search Results + Question → LLM → Answer
Workflow terms: Retriever, Chain
Elements: Embedding Model, Vector DB, Python, LLM, LangChain

Slide 10

Demo: Simple RAG
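
In code, the chain behind this demo has roughly the following shape (a LangChain LCEL sketch with placeholder documents and model names, not the demo's actual code; assumes langchain-openai and faiss-cpu are installed):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Embedding model + vector DB: embed sample content and store the vectors.
docs = [
    "Semantic Router makes routing decisions in vector space.",
    "RAG retrieves relevant documents before the LLM answers.",
]
vectordb = FAISS.from_texts(docs, OpenAIEmbeddings())
retriever = vectordb.as_retriever()  # the "Retriever" from the diagram

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(retrieved) -> str:
    return "\n\n".join(d.page_content for d in retrieved)

# The "Chain": question -> retrieve -> search results + question -> LLM -> answer.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
print(chain.invoke("What does Semantic Router do?"))
```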

Slide 11

Simple RAG in a nutshell: our sample content

Slide 12

Multiple Retrievers: which retriever do you want?

Slide 13

Advanced RAG: best source determination before the search
[Diagram] Question → LLM (retriever selection) → Prepare Search (Embedding Model, question as vector) → Vector DB A or Vector DB B → 0-N Search Results + Question → LLM → Answer

Slide 14

Demo: Dynamic Retriever Selection with LLM
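
A hedged sketch of the routing step, not the demo code: the LLM is prompted to pick one of two retrievers before the search runs. Retriever contents, prompt wording, and model name are placeholders:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

emb = OpenAIEmbeddings()
retriever_products = FAISS.from_texts(["Product X costs 99 EUR."], emb).as_retriever()
retriever_hr = FAISS.from_texts(["Vacation requests go to HR."], emb).as_retriever()

router_prompt = ChatPromptTemplate.from_template(
    "Route the question to a knowledge base.\n"
    "Reply with exactly one word, 'products' or 'hr'.\n"
    "Question: {question}"
)
router_chain = (
    router_prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()
)

def select_retriever(question: str):
    # One extra LLM generation per question: this round-trip is the latency
    # and cost that the Semantic Router variant later removes.
    label = router_chain.invoke({"question": question}).strip().lower()
    return retriever_products if label == "products" else retriever_hr
```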

Slide 15

Embedding Model

Slide 16

Advanced RAG (recap): the same workflow as on slide 13, with an LLM making the retriever selection before the search.

Slide 17

Advanced RAG w/ Semantic Router: best source determination before the search
[Diagram] Question → Embedding Model (retriever selection) → Prepare Search (Embedding Model, question as vector) → Vector DB A or Vector DB B → 0-N Search Results + Question → LLM → Answer

Slide 18

Demo: Semantic Router with RAG
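
A sketch of the same selection step with semantic-router in place of the routing LLM (assumed API as in the earlier sketch; retriever_products and retriever_hr as built above; route names and utterances are illustrative):

```python
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer

products = Route(name="products", utterances=[
    "how much does product X cost?",
    "what are the specs of product Y?",
])
hr = Route(name="hr", utterances=[
    "how do I request vacation?",
    "who handles payroll questions?",
])
route_layer = RouteLayer(encoder=OpenAIEncoder(), routes=[products, hr])

def select_retriever(question: str):
    # One embedding call instead of an LLM generation: the question vector
    # is compared against the pre-embedded route utterances.
    choice = route_layer(question)
    if choice.name == "products":
        return retriever_products
    if choice.name == "hr":
        return retriever_hr
    return None  # nothing close enough -- this can double as input guarding
```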

Slide 19

Turbo 🐌: LLM as Router

Slide 20

Turbo 🚀: Semantic Router with remote embedding model

Slide 21

Demo: Semantic Router running locally
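
A sketch of the local variant, assuming semantic-router's HuggingFaceEncoder extra (e.g. pip install "semantic-router[local]") and an arbitrary sentence-transformers model; the talk does not say which local model it used:

```python
from semantic_router.encoders import HuggingFaceEncoder
from semantic_router.layer import RouteLayer

# Local embedding model: no network call at query time, which is where the
# additional speed-up over the remote encoder comes from.
encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")
local_layer = RouteLayer(encoder=encoder, routes=[products, hr])  # routes as above

print(local_layer("how do I request vacation?").name)  # -> "hr"
```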

Slide 22

Turbo 🚀: Semantic Router with local embedding model

Slide 23

Speed & Budget in Numbers
SR Remote is 3.4 times faster than the LLM (0.62 s vs 0.18 s)
SR Local is 7.75 times faster than the LLM (0.62 s vs 0.08 s)
SR Remote is 30 times cheaper than the LLM ($0.60 vs $0.02)
SR Local is 60 times cheaper than the LLM ($0.60 vs $0.01)
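
Numbers like these can be reproduced with a simple harness along the following lines (a sketch, not the talk's benchmark code; it reuses router_chain and route_layer from the sketches above and times only the routing decision):

```python
import time

def mean_latency(decide, question: str, runs: int = 20) -> float:
    """Average wall-clock seconds per routing decision."""
    start = time.perf_counter()
    for _ in range(runs):
        decide(question)
    return (time.perf_counter() - start) / runs

q = "how much does product X cost?"
llm_router = lambda s: router_chain.invoke({"question": s})
sr_router = lambda s: route_layer(s)
print(f"LLM as router:   {mean_latency(llm_router, q):.2f} s")
print(f"Semantic Router: {mean_latency(sr_router, q):.2f} s")
```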

Slide 24

Tool-use Decisions: Tool Calling Workflow with LLM
[Diagram] Question → LLM ("tool call needed?") → if yes: LLM prepares the tool call → execute tool call → "more tools?" loop → regular LLM call → Answer; if no: regular LLM call → Answer

Slide 25

Tool-use Decisions: Tool Calling Workflow with LLM + Semantic Router
[Diagram] Question → Embedding Model ("tool call needed?") → if yes: LLM prepares the tool call → execute tool call → "more tools?" loop → regular LLM call → Answer; if no: regular LLM call → Answer

Slide 26

Demo: Semantic Router tool-use decisions
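
A sketch of that gate (assumed semantic-router API as before; the weather route, its utterances, and the tool stub are hypothetical placeholders):

```python
from langchain_openai import ChatOpenAI
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer

weather = Route(name="get_weather", utterances=[
    "what's the weather in Berlin?",
    "will it rain tomorrow?",
])
tool_router = RouteLayer(encoder=OpenAIEncoder(), routes=[weather])
llm = ChatOpenAI(model="gpt-4o-mini")

def run_weather_tool(question: str) -> str:
    # Stub for the real flow: the LLM extracts arguments, the tool is
    # executed, and the LLM summarizes the result.
    return "It is sunny in Berlin."

def handle(question: str) -> str:
    # The embedding model decides "tool call needed?" before any LLM runs.
    if tool_router(question).name == "get_weather":
        return run_weather_tool(question)
    # No tool needed: fall through to a regular LLM call.
    return llm.invoke(question).content

print(handle("Will it rain tomorrow?"))
```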

Slide 27

Yes, please! More Safety! Better Speed! Less Budget!

Slide 28

Thank you! Any questions?
Marco Frodl, Principal Consultant for Generative AI
@marcofrodl