Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Beyond_the_Prompt__Evaluating__Testing__and_Sec...

Mete Atamel
March 03, 2025
3

 Beyond_the_Prompt__Evaluating__Testing__and_Securing_LLM_Applications.pdf

Mete Atamel

March 03, 2025
Tweet

Transcript

  1. Beyond the Prompt: Evaluating, Testing, and Securing LLM Applications Mete

    Atamel Developer Advocate @ Google @meteatamel atamel.dev speakerdeck.com/meteatamel
  2. Structure LLM outputs with Pydantic from pydantic import BaseModel class

    Recipe(BaseModel): name: str description: str ingredients: list[str] response = client.models.generate_content( model=MODEL_ID, contents="List a few popular cookie recipes and their ingredients.", config=GenerateContentConfig( response_mime_type="application/json", response_schema=Recipe))
  3. Structure LLM outputs with Pydantic { "description": "Classic chocolate chip

    cookies with a soft chewy center and crisp edges.", "name": "Chocolate Chip Cookies", "ingredients": [ "1 cup (2 sticks) unsalted butter, softened", "3/4 cup granulated sugar", "3/4 cup packed brown sugar", "2 large eggs", "1 teaspoon vanilla extract", "2 1/4 cups all-purpose flour", "1 teaspoon baking soda", "1 teaspoon salt", "2 cups chocolate chips" ] }
  4. How do you know the output is “good”? You need

    to measure What do you measure and how? Welcome to this talk :-)
  5. Deterministic/statistical metrics • Equals, contains, contains-any, starts-with, … • BLEU:

    How closely the response matches with the reference? • ROUGE: How closely the response summarization matches the reference? Problems: 1. You need a reference dataset 2. Fall short in capturing semantic nuances
  6. Model-graded metrics (general) • Similar • G-Eval • Hallucination •

    Answer relevancy • Bias, Toxicity • Json correctness • Summarization • Tool correctness • Prompt alignment • … Problem: Rely on one LLM to grade another LLM
  7. Retriever metrics • Contextual relevance - How relevant is the

    context for the input? • Contextual recall* - Did it fetch all the relevant information? • Contextual precision* - Do relevant nodes in the context rank higher than the irrelevant ones? Generator metrics • Answer relevance - How relevant is the output to the input? • Faithfulness / Groundedness - Does the output factually align with the context? Model-graded metrics (RAG) *require an expected output
  8. We now have an idea on how to measure Kind

    of :-) What about bad inputs and outputs? We need to detect and block them
  9. OWASP Top 10 for LLM Applications 2025 https://owasp.org/www-project-top-10-for-large-language-model-applications/ LLM01: Prompt

    Injection LLM02: Sensitive Information Disclosure LLM03: Supply Chain LLM04: Data and Model Poisoning LLM05: Improper Output Handling LLM06: Excessive Agency LLM07: System Prompt Leakage LLM08: Vector and Embedding Weaknesses LLM09: Misinformation LLM10: Unbounded Consumption