Slide 1

Slide 1 text

Beyond the Prompt: Evaluating, Testing, and Securing LLM Applications Mete Atamel Developer Advocate @ Google @meteatamel atamel.dev speakerdeck.com/meteatamel

Slide 2

Slide 2 text

Very easy To get LLMs generate content

Slide 3

Slide 3 text

Very difficult To make sure the LLM output is “good”

Slide 4

Slide 4 text

What is “good” LLM output? Structured Correct Relevant Grounded Non-toxic Unbiased …

Slide 5

Slide 5 text

Structure LLM outputs with Pydantic from pydantic import BaseModel class Recipe(BaseModel): name: str description: str ingredients: list[str] response = client.models.generate_content( model=MODEL_ID, contents="List a few popular cookie recipes and their ingredients.", config=GenerateContentConfig( response_mime_type="application/json", response_schema=Recipe))

Slide 6

Slide 6 text

Structure LLM outputs with Pydantic { "description": "Classic chocolate chip cookies with a soft chewy center and crisp edges.", "name": "Chocolate Chip Cookies", "ingredients": [ "1 cup (2 sticks) unsalted butter, softened", "3/4 cup granulated sugar", "3/4 cup packed brown sugar", "2 large eggs", "1 teaspoon vanilla extract", "2 1/4 cups all-purpose flour", "1 teaspoon baking soda", "1 teaspoon salt", "2 cups chocolate chips" ] }

Slide 7

Slide 7 text

How do you know the output is “good”? You need to measure What do you measure and how? Welcome to this talk :-)

Slide 8

Slide 8 text

LLM evaluation frameworks

Slide 9

Slide 9 text

Metrics RAGAS

Slide 10

Slide 10 text

https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

Slide 11

Slide 11 text

Deterministic/statistical metrics ● Equals, contains, contains-any, starts-with, … ● BLEU: How closely the response matches with the reference? ● ROUGE: How closely the response summarization matches the reference? Problems: 1. You need a reference dataset 2. Fall short in capturing semantic nuances

Slide 12

Slide 12 text

Model-graded metrics (general) ● Similar ● G-Eval ● Hallucination ● Answer relevancy ● Bias, Toxicity ● Json correctness ● Summarization ● Tool correctness ● Prompt alignment ● … Problem: Rely on one LLM to grade another LLM

Slide 13

Slide 13 text

Model-graded metrics (RAG)

Slide 14

Slide 14 text

Retriever metrics ● Contextual relevance - How relevant is the context for the input? ● Contextual recall* - Did it fetch all the relevant information? ● Contextual precision* - Do relevant nodes in the context rank higher than the irrelevant ones? Generator metrics ● Answer relevance - How relevant is the output to the input? ● Faithfulness / Groundedness - Does the output factually align with the context? Model-graded metrics (RAG) *require an expected output

Slide 15

Slide 15 text

RAG Triad

Slide 16

Slide 16 text

We now have an idea on how to measure Kind of :-) What about bad inputs and outputs? We need to detect and block them

Slide 17

Slide 17 text

OWASP Top 10 for LLM Applications 2025 https://owasp.org/www-project-top-10-for-large-language-model-applications/ LLM01: Prompt Injection LLM02: Sensitive Information Disclosure LLM03: Supply Chain LLM04: Data and Model Poisoning LLM05: Improper Output Handling LLM06: Excessive Agency LLM07: System Prompt Leakage LLM08: Vector and Embedding Weaknesses LLM09: Misinformation LLM10: Unbounded Consumption

Slide 18

Slide 18 text

LLM security frameworks LLM Guard Guardrails AI https://www.promptfoo.dev/docs/red-team/owasp-llm-top-10/

Slide 19

Slide 19 text

Thank you! Mete Atamel Developer Advocate at Google @meteatamel atamel.dev speakerdeck.com/meteatamel