
GitHub Universe: Evaluating RAG apps in GitHub Actions

Pamela Fox

October 29, 2024

Transcript

  1. Save the pass/fail for unit tests— there's a better way to evaluate AI apps
     Pamela Fox, Python Advocate, Microsoft
     pamelafox.org / @pamelafox
  2. Agenda
     • What is RAG?
     • Bulk evaluation
     • Try it yourself!
     • Best practices for evals
     • Going to production
  3. Generative AI apps: Two flavors
     • Prompt-only: primarily relies on customizing the prompt and few-shot examples to get answers in a certain tone/format. Example: https://github.com/f/awesome-chatgpt-prompts
     • RAG (Retrieval Augmented Generation): combines the prompt with domain data from a knowledge base to get grounded answers. Example: https://github.com/Azure-Samples/azure-search-openai-demo/
  4. RAG: Retrieval Augmented Generation
     User question: "Which GitHub services will be showcased in the Python sessions at GitHub Universe?"
     → Search retrieves matching documents from the knowledge base, e.g. title: "Save the pass/fail for unit tests— there's a better way to evaluate AI apps"; description: "We all know how to run unit tests on our apps in CI/CD workflows, but we're now in a new world of AI apps with non-deterministic output..."
     → Large Language Model generates a grounded answer: "In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways..."
     (See the sketch below.)
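
     A minimal sketch of that retrieve-then-generate flow in Python, for orientation only: it is
     not the code of the example app shown on the next slide. The search_sessions() helper and
     the "gpt-4o" deployment name are hypothetical stand-ins.

     from openai import AzureOpenAI

     client = AzureOpenAI(
         azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
         api_key="YOUR-KEY",
         api_version="2024-06-01",
     )

     def search_sessions(query: str) -> list[dict]:
         # Hypothetical placeholder for the retrieval step (keyword and/or vector search
         # against the knowledge base).
         return [{"title": "...", "description": "..."}]

     def answer(question: str) -> str:
         # Retrieve supporting documents, then ask the LLM to answer using only those sources.
         sources = search_sessions(question)
         context = "\n".join(f"{s['title']}: {s['description']}" for s in sources)
         response = client.chat.completions.create(
             model="gpt-4o",  # assumed deployment name
             messages=[
                 {"role": "system", "content": "Answer ONLY using the provided sources."},
                 {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
             ],
         )
         return response.choices[0].message.content
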
  5. Example RAG application: Azure OpenAI + PostgreSQL
     • Code: aka.ms/rag-postgres-ghu (based on upstream: aka.ms/rag-postgres)
     • Deployed demo: aka.ms/rag-postgres-ghu/demo
  6. Are the answers high quality?
     • Are they clear and understandable?
     • Are they correct? (relative to the knowledge base)
     • Are they formatted in the desired manner?
     Example question: "Which GitHub services will be showcased in the Python sessions at GitHub Universe?"
     Example answers to compare:
     • "You will see GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot for advanced data solutions. Another session will focus on developing and testing APIs faster with GitHub Copilot and Postman's Postbot. [Session10] [Session21]"
     • "In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot to enhance data engineering projects with advanced features to identify data-related solutions and create automations that increase high data quality. Another session will focus on the synergy between GitHub Copilot and Postman's Postbot, an extension that transforms API development and testing, allowing you to speed up the creation and testing of APIs. These sessions will provide valuable insights on how to integrate these tools to streamline your workflow and ensure the delivery of high-quality work."
     • "The Python sessions at GitHub Universe will showcase tools for AI app evaluation, leveraging GitHub Copilot for data engineering projects, and integrating GitHub Copilot with Postman's Postbot to speed up API development and testing."
  7. What affects the quality?
     Question → Knowledge Search → Large Language Model
     Knowledge search step:
     • Search engine/database type
     • Search mode (keyword, vector, ...)
     • # of results returned
     • Search query cleaning step
     • Data preparation
     Large Language Model step:
     • Model
     • Temperature
     • Max tokens
     • Message history
     • System prompt
     (See the sketch below.)
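
     To make those knobs concrete, here is a purely hypothetical sketch of how they might be
     expressed as overrides for an evaluation run. The field names are invented for illustration;
     they are not the schema of the repo's evals/eval_config.json.

     # Hypothetical parameter overrides to sweep during evaluation; field names are invented.
     experiment_overrides = {
         "retrieval": {
             "search_mode": "hybrid",      # keyword, vector, or hybrid
             "top": 5,                     # number of results returned
             "clean_search_query": True,   # whether to run the query cleaning step
         },
         "llm": {
             "model": "gpt-4o-mini",
             "temperature": 0.0,
             "max_tokens": 1024,
             "system_prompt_file": "prompts/answer.txt",
         },
     }
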
  8. Bulk evaluation
     1. Generate ground truth data (~200 QA pairs)
     2. Evaluate with different parameters
     3. Compare the metrics and answers across evaluations
  9. Step 1: Generate ground truth data
     The ground truth data is the ideal answer for a question.
     One option: generate synthetic ground truth data, then do manual curation (Database results → Azure OpenAI → Q/A pairs → Human review).
     Other options: generate based on help desk tickets, support emails, search requests, questions from beta testers, etc.
     (See the sketch below.)
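
     A sketch of the synthetic-generation option, assuming you can iterate over documents in
     your knowledge base. The fetch_documents() helper, prompt, output schema, and deployment
     name are all assumptions; human review of the generated pairs is still the essential step.

     import json

     from openai import AzureOpenAI

     client = AzureOpenAI(
         azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
         api_key="YOUR-KEY",
         api_version="2024-06-01",
     )

     def fetch_documents():
         # Hypothetical placeholder for reading rows/chunks from the database.
         yield {"title": "...", "description": "..."}

     with open("ground_truth.jsonl", "w") as f:
         for doc in fetch_documents():
             completion = client.chat.completions.create(
                 model="gpt-4o",  # assumed deployment name; must support JSON mode
                 response_format={"type": "json_object"},
                 messages=[
                     {"role": "system", "content": "Given a source document, write one realistic "
                         'user question and its ideal answer as JSON: {"question": ..., "truth": ...}'},
                     {"role": "user", "content": f"{doc['title']}: {doc['description']}"},
                 ],
             )
             qa = json.loads(completion.choices[0].message.content)
             f.write(json.dumps(qa) + "\n")
     # Then curate: have a human review, fix, and prune the Q/A pairs before treating them as truth.
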
  10. Step 2: Evaluate app against ground truth
      Compute LLM metrics and custom metrics for every question in ground truth: each question from truth.jsonl is sent to the app endpoint, and the response/answer is scored.
      • GPT metrics (azure-ai-evaluation SDK, using Azure OpenAI + a grading prompt): gpt_coherence, gpt_groundedness, gpt_relevance
      • Code metrics (Python code): length, citation_match, latency
      (See the sketch below.)
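
      A sketch of that bulk run with the azure-ai-evaluation SDK, assuming truth.jsonl already
      holds query, context, response, and ground_truth columns (i.e. the app's responses were
      collected beforehand). Exact parameter and column names vary a bit across SDK versions,
      so treat this as an outline rather than the talk's exact pipeline.

      from azure.ai.evaluation import (
          CoherenceEvaluator,
          GroundednessEvaluator,
          RelevanceEvaluator,
          evaluate,
      )

      model_config = {
          "azure_endpoint": "https://YOUR-RESOURCE.openai.azure.com",
          "api_key": "YOUR-KEY",
          "azure_deployment": "gpt-4o",  # assumed deployment name for the grading model
      }

      def answer_length(*, response: str, **kwargs):
          # Custom code metric: no LLM call, just a property of the response.
          return {"answer_length": len(response)}

      result = evaluate(
          data="truth.jsonl",
          evaluators={
              "coherence": CoherenceEvaluator(model_config),
              "groundedness": GroundednessEvaluator(model_config),
              "relevance": RelevanceEvaluator(model_config),
              "answer_length": answer_length,
          },
          output_path="eval_results.json",
      )
      print(result["metrics"])  # aggregate scores across every question in the ground truth
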
  11. azure-ai-evaluation: GPT evaluators
      Inputs: Question | Context | Answer | Ground truth
      Evaluators: RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator, SimilarityEvaluator
      aka.ms/azure-ai-eval-sdk
      (See the sketch below.)
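
      Each evaluator can also be called directly on a single example, which is handy for spot
      checks. A sketch, assuming an Azure OpenAI deployment; input field names (query, response,
      context) differ slightly across SDK versions.

      from azure.ai.evaluation import GroundednessEvaluator

      model_config = {
          "azure_endpoint": "https://YOUR-RESOURCE.openai.azure.com",
          "api_key": "YOUR-KEY",
          "azure_deployment": "gpt-4o",  # assumed deployment name
      }

      groundedness = GroundednessEvaluator(model_config)
      score = groundedness(
          response="The Python sessions will showcase GitHub Copilot and Postman's Postbot.",
          context="Session descriptions retrieved from the knowledge base...",
      )
      print(score)  # e.g. a dict with a 1-5 groundedness score and a reason
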
  12. Step 3: Compare metrics across evals
      Results can be stored in a repo, a storage container, or a specialized tool like Azure AI Studio.
      (See the sketch below.)
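
      One lightweight way to compare runs, assuming each evaluation saved a JSON summary of its
      mean metric scores; the file paths and structure here are hypothetical.

      import json

      with open("evals/baseline/summary.json") as f:
          baseline = json.load(f)
      with open("evals/candidate/summary.json") as f:
          candidate = json.load(f)

      # Print a side-by-side diff of every metric seen in either run.
      for metric in sorted(set(baseline) | set(candidate)):
          old = baseline.get(metric)
          new = candidate.get(metric)
          delta = round(new - old, 3) if old is not None and new is not None else "n/a"
          print(f"{metric:<25} baseline={old} candidate={new} delta={delta}")
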
  13. Evaluation in CI
      When a PR changes anything that could affect app quality, we run a bulk evaluation and compare the results to the checked-in baseline.
      The Evaluate action needs:
      • Azure OpenAI model access
      • Sufficient model capacity
      Example PR: https://github.com/pamelafox/rag-postgres-openai-python-ghu/pull/7
  14. Try it! Run evals on your PR
      1. Navigate to the repository: aka.ms/rag-postgres-ghu
      2. Make a change by editing one of the files in the repo:
         • evals/eval_config.json: change API parameter settings for retrieval and mode
         • src/backend/fastapi_app/prompts/answer.txt: change the main prompt
      3. Submit a PR with the code change
      4. Add a comment with "/evaluate" (don't expect it to autocomplete, since it's custom)
      5. Wait 1 minute for a confirmation that the evaluation has begun
      6. Wait ~10 minutes for a comment with a summary of your evaluation metrics
      While you wait: check other PRs, aka.ms/azure-ai-eval-sdk, aka.ms/rag/eval
  15. Best practices for evaluating
      • Evaluate data that you know; LLMs can be convincingly wrong.
      • What works better for 3 Qs doesn't always work better for 200.
      • Don't trust absolute metrics for GPT evals, but pay attention to any relative changes.
      • Remember that LLMs are non-deterministic.
  16. What moves the needle?
      • Good, complete retrieval = good, complete answers.
      • Irrelevant, distracting results = wrong answers.
      • Related: vector search can be noisy! Use wisely.
      • Model choice makes a huge difference.
      What doesn't move the needle?
      • Removing fluff from prompts; it's usually inconsequential.
  17. Productionizing a RAG app
      (Flowchart with three phases: 1. Ideating/exploring, 2. Building/augmenting, 3. Productionizing. Steps include: identify business use case, connect to your data, customize prompt for domain, run app against sample questions, improve the prompt and orchestration, run flow against larger dataset, evaluate answers, try different parameters, change defaults, deploy app to users, collect user feedback, run online evaluations, run A/B experiments, with "Satisfied?" yes/no decision points gating progress between phases.)
  18. Improving ground truth
      Add a thumbs-up/thumbs-down button with a feedback dialog to your live app. Then you can:
      • Manually debug the answers that got rated thumbs-down
      • Add questions to ground truth data
      (See the sketch below.)
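
      A minimal sketch of the feedback plumbing with FastAPI (the framework the example app
      uses); the route name, model fields, and file-based storage are assumptions, not the
      repo's actual code.

      from fastapi import FastAPI
      from pydantic import BaseModel

      app = FastAPI()

      class Feedback(BaseModel):
          question: str
          answer: str
          thumbs_up: bool
          comment: str | None = None

      @app.post("/feedback")
      async def record_feedback(feedback: Feedback):
          # Append to a local file for this sketch; a real app would write to its database so
          # thumbs-down answers can be debugged and good questions promoted into ground truth.
          with open("feedback.jsonl", "a") as f:
              f.write(feedback.model_dump_json() + "\n")
          return {"status": "ok"}
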
  19. Running online evaluations
      You can run evaluations on deployed applications, on live answers, as long as your metric doesn't require ground truth.
      Evaluation metric      | Type                              | Can run without ground truth?
      FluencyEvaluator       | GPT metric (azure-ai-evaluation)  | Yes
      CoherenceEvaluator     | GPT metric (azure-ai-evaluation)  | Yes
      RelevanceEvaluator     | GPT metric (azure-ai-evaluation)  | Yes
      GroundednessEvaluator  | GPT metric (azure-ai-evaluation)  | Yes
      SimilarityEvaluator    | GPT metric (azure-ai-evaluation)  | No (requires ground truth)
      F1ScoreEvaluator       | Code metric (azure-ai-evaluation) | No (requires ground truth)
      has_citation           | Code metric (custom)              | Yes
      citation_match         | Code metric (custom)              | No (requires ground truth)
      answer_length          | Code metric (custom)              | Yes
      (See the sketch below.)
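
      A sketch of scoring a live answer with two of the evaluators that do not need ground
      truth. The question/answer pair would come from production logs; field names may differ
      by SDK version, and the deployment name is an assumption.

      from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

      model_config = {
          "azure_endpoint": "https://YOUR-RESOURCE.openai.azure.com",
          "api_key": "YOUR-KEY",
          "azure_deployment": "gpt-4o",  # assumed deployment name
      }

      coherence = CoherenceEvaluator(model_config)
      fluency = FluencyEvaluator(model_config)

      # In production these come from request logs rather than a curated dataset.
      live_question = "Which GitHub services will be showcased in the Python sessions?"
      live_answer = "The Python sessions will showcase GitHub Copilot and Postman's Postbot."

      print(coherence(query=live_question, response=live_answer))
      print(fluency(response=live_answer))
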
  20. Conducting A/B experiments
      Excited by a change but not sure it will be best for users in production? Run an A/B experiment and measure metrics across experiment groups.
      Private preview sign-up: https://aka.ms/genAI-CI-CD-private-preview
  21. We want to hear from you! Take the session survey by visiting the attendee portal so we can continue to make your Universe experience cosmic!