Beyond_the_PromptEvaluatingTesting__and_Securing_LLM_Applications.pdf

Beyond the Prompt: Evaluating, Testing, and Securing LLM Applications Mete
Atamel Developer Advocate @ Google @meteatamel atamel.dev speakerdeck.com/meteatamel

Very easy To get LLMs generate content

Very difficult To make sure the LLM output is “good”

What is “good” LLM output? Structured Correct Relevant Grounded Non-toxic
Unbiased …

Structure LLM outputs with Pydantic from pydantic import BaseModel class
Recipe(BaseModel): name: str description: str ingredients: list[str] response = client.models.generate_content( model=MODEL_ID, contents="List a few popular cookie recipes and their ingredients.", config=GenerateContentConfig( response_mime_type="application/json", response_schema=Recipe))

Structure LLM outputs with Pydantic { "description": "Classic chocolate chip
cookies with a soft chewy center and crisp edges.", "name": "Chocolate Chip Cookies", "ingredients": [ "1 cup (2 sticks) unsalted butter, softened", "3/4 cup granulated sugar", "3/4 cup packed brown sugar", "2 large eggs", "1 teaspoon vanilla extract", "2 1/4 cups all-purpose flour", "1 teaspoon baking soda", "1 teaspoon salt", "2 cups chocolate chips" ] }

How do you know the output is “good”? You need
to measure What do you measure and how? Welcome to this talk :-)

LLM evaluation frameworks

Metrics RAGAS

https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

Deterministic/statistical metrics • Equals, contains, contains-any, starts-with, … • BLEU:
How closely the response matches with the reference? • ROUGE: How closely the response summarization matches the reference? Problems: 1. You need a reference dataset 2. Fall short in capturing semantic nuances

Model-graded metrics (general) • Similar • G-Eval • Hallucination •
Answer relevancy • Bias, Toxicity • Json correctness • Summarization • Tool correctness • Prompt alignment • … Problem: Rely on one LLM to grade another LLM

Model-graded metrics (RAG)

Retriever metrics • Contextual relevance - How relevant is the
context for the input? • Contextual recall* - Did it fetch all the relevant information? • Contextual precision* - Do relevant nodes in the context rank higher than the irrelevant ones? Generator metrics • Answer relevance - How relevant is the output to the input? • Faithfulness / Groundedness - Does the output factually align with the context? Model-graded metrics (RAG) *require an expected output

RAG Triad

We now have an idea on how to measure Kind
of :-) What about bad inputs and outputs? We need to detect and block them

OWASP Top 10 for LLM Applications 2025 https://owasp.org/www-project-top-10-for-large-language-model-applications/ LLM01: Prompt
Injection LLM02: Sensitive Information Disclosure LLM03: Supply Chain LLM04: Data and Model Poisoning LLM05: Improper Output Handling LLM06: Excessive Agency LLM07: System Prompt Leakage LLM08: Vector and Embedding Weaknesses LLM09: Misinformation LLM10: Unbounded Consumption

LLM security frameworks LLM Guard Guardrails AI https://www.promptfoo.dev/docs/red-team/owasp-llm-top-10/

Thank you! Mete Atamel Developer Advocate at Google @meteatamel atamel.dev
speakerdeck.com/meteatamel

Beyond_the_PromptEvaluatingTesting__and_Sec...

Beyond_the_PromptEvaluatingTesting__and_Securing_LLM_Applications.pdf

Mete Atamel

More Decks by Mete Atamel

Other Decks in Programming

Featured

Transcript

Beyond the Prompt: Evaluating, Testing, and Securing LLM Applications Mete

Very easy To get LLMs generate content

Very difficult To make sure the LLM output is “good”

What is “good” LLM output? Structured Correct Relevant Grounded Non-toxic

Structure LLM outputs with Pydantic from pydantic import BaseModel class

Structure LLM outputs with Pydantic { "description": "Classic chocolate chip

How do you know the output is “good”? You need

LLM evaluation frameworks

Metrics RAGAS

https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

Deterministic/statistical metrics • Equals, contains, contains-any, starts-with, … • BLEU:

Model-graded metrics (general) • Similar • G-Eval • Hallucination •

Model-graded metrics (RAG)

Retriever metrics • Contextual relevance - How relevant is the

RAG Triad

We now have an idea on how to measure Kind

OWASP Top 10 for LLM Applications 2025 https://owasp.org/www-project-top-10-for-large-language-model-applications/ LLM01: Prompt

LLM security frameworks LLM Guard Guardrails AI https://www.promptfoo.dev/docs/red-team/owasp-llm-top-10/

Thank you! Mete Atamel Developer Advocate at Google @meteatamel atamel.dev

Beyond_the_Prompt__Evaluating__Testing__and_Sec...

Beyond_the_Prompt__Evaluating__Testing__and_Securing_LLM_Applications.pdf

More Decks by Mete Atamel

Other Decks in Programming

Featured

Transcript

Beyond_the_PromptEvaluatingTesting__and_Sec...

Beyond_the_PromptEvaluatingTesting__and_Securing_LLM_Applications.pdf