Slide 1

Slide 1 text

Save the pass/fail for unit tests— there's a better way to evaluate AI apps
Pamela Fox, Python Advocate, Microsoft
pamelafox.org / @pamelafox

Slide 2

Slide 2 text

Agenda
• What is RAG?
• Bulk evaluation
• Try it yourself!
• Best practices for evals
• Going to production

Slide 3

Slide 3 text

Generative AI apps: Two flavors
• Prompt-only: Primarily relies on customizing the prompt and few-shot examples to get answers in a certain tone/format.
  Example: https://github.com/f/awesome-chatgpt-prompts
• RAG (Retrieval Augmented Generation): Combines the prompt with domain data from a knowledge base to get grounded answers.
  Example: https://github.com/Azure-Samples/azure-search-openai-demo/

Slide 4

Slide 4 text

RAG: Retrieval Augmented Generation
Diagram: the user question ("Which GitHub services will be showcased in the Python sessions at GitHub Universe?") goes to a search step, which retrieves matching documents from the knowledge base (e.g., a session with the title "Save the pass/fail for unit tests— there's a better way to evaluate AI apps" and the description "We all know how to run unit tests on our apps in CI/CD workflows, but we're now in a new world of AI apps with non-deterministic output..."). The question plus the retrieved results go to a large language model, which produces a grounded answer ("In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways...").

Slide 5

Slide 5 text

Example RAG application: Azure OpenAI + PostgreSQL
• Code: aka.ms/rag-postgres-ghu
• Based on upstream: aka.ms/rag-postgres
• Deployed demo: aka.ms/rag-postgres-ghu/demo

Slide 6

Slide 6 text

Are the answers high quality?
• Are they clear and understandable?
• Are they correct? (relative to the knowledge base)
• Are they formatted in the desired manner?

Sample question: Which GitHub services will be showcased in the Python sessions at GitHub Universe?

Sample answers:
• "You will see GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot for advanced data solutions. Another session will focus on developing and testing APIs faster with GitHub Copilot and Postman's Postbot. [Session10] [Session21]"
• "In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot to enhance data engineering projects with advanced features to identify data-related solutions and create automations that increase high data quality. Another session will focus on the synergy between GitHub Copilot and Postman's Postbot, an extension that transforms API development and testing, allowing you to speed up the creation and testing of APIs. These sessions will provide valuable insights on how to integrate these tools to streamline your workflow and ensure the delivery of high-quality work."
• "The Python sessions at GitHub Universe will showcase tools for AI app evaluation, leveraging GitHub Copilot for data engineering projects, and integrating GitHub Copilot with Postman's Postbot to speed up API development and testing."

Slide 7

Slide 7 text

What affects the quality?
Knowledge search:
• Search engine/database type
• Search mode (keyword, vector, ...)
• # of results returned
• Search query cleaning step
• Data preparation
Large language model:
• Model
• Temperature
• Max tokens
• Message history
• System prompt

Slide 8

Slide 8 text

Bulk evaluation
1. Generate ground truth data (~200 QA pairs)
2. Evaluate with different parameters
3. Compare the metrics and answers across evaluations

Slide 9

Slide 9 text

Step 1: Generate ground truth data
The ground truth data is the ideal answer for a question.
One option: generate synthetic ground truth data, then do manual curation.
Flow: Database → Azure OpenAI → Q/A pairs → Human review
Other options: generate based on help desk tickets, support emails, search requests, questions from beta testers, etc.
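A minimal sketch of that synthetic-generation flow, assuming an Azure OpenAI chat deployment and a hypothetical fetch_documents() helper that pulls document text out of the PostgreSQL knowledge base; the repo's own generation script works differently, so treat this as illustrative only.

```python
import json

from openai import AzureOpenAI  # pip install openai

# Placeholder endpoint/key/deployment values; JSON mode requires a model and
# API version that support it.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2024-06-01",
)

def generate_qa_pair(document_text: str) -> dict:
    """Ask the model to invent one realistic question plus the ideal answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # your chat deployment name
        messages=[
            {"role": "system", "content": "You write evaluation data for a RAG app."},
            {"role": "user", "content": (
                "Based only on this document, write one realistic user question "
                "and the ideal answer. Reply as JSON with keys 'question' and 'truth'.\n\n"
                + document_text
            )},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

with open("ground_truth.jsonl", "w") as f:
    for doc in fetch_documents():  # hypothetical: document text from the database
        f.write(json.dumps(generate_qa_pair(doc)) + "\n")
# The human-review step: read the file, delete or fix weak Q/A pairs by hand.
```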

Slide 10

Slide 10 text

Step 2: Evaluate app against ground truth
Compute LLM metrics and custom metrics for every question in the ground truth.
Flow: Python code (using the azure-ai-evaluation SDK) reads each question from truth.jsonl, sends it to the app endpoint, and scores the response. GPT metrics are computed by sending the answer plus a grading prompt to Azure OpenAI; code metrics are computed directly.
• GPT metrics: gpt_coherence, gpt_groundedness, gpt_relevance
• Code metrics: length, citation_match, latency
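A condensed sketch of this step with the azure-ai-evaluation SDK. Assumptions: truth.jsonl holds question/truth fields, ask_app() is a hypothetical helper that calls the deployed app endpoint and returns the answer plus retrieved context, and the intermediate/output file names are arbitrary; the repo's actual evaluation script has more metrics and configuration.

```python
import json

from azure.ai.evaluation import (  # pip install azure-ai-evaluation
    AzureOpenAIModelConfiguration,
    CoherenceEvaluator,
    GroundednessEvaluator,
    RelevanceEvaluator,
    evaluate,
)

# Placeholder endpoint/key/deployment for the judge model.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    azure_deployment="gpt-4o",
)

# 1. Run every ground-truth question through the app and record the responses.
with open("truth.jsonl") as truth, open("responses.jsonl", "w") as out:
    for line in truth:
        row = json.loads(line)  # expects {"question": ..., "truth": ...}
        answer, context = ask_app(row["question"])  # hypothetical app call
        out.write(json.dumps({
            "query": row["question"],
            "response": answer,
            "context": context,
            "ground_truth": row["truth"],
        }) + "\n")

# 2. Score the responses with GPT evaluators; custom code metrics can be added too.
result = evaluate(
    data="responses.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
    output_path="eval_results.json",
)
print(result["metrics"])  # aggregate scores per evaluator
```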

Slide 11

Slide 11 text

azure-ai-evaluation: GPT evaluators
Inputs: Question | Context | Answer | Ground truth
• RelevanceEvaluator
• CoherenceEvaluator
• FluencyEvaluator
• GroundednessEvaluator
• SimilarityEvaluator
aka.ms/azure-ai-eval-sdk
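Each evaluator can also be called directly on a single example, which is handy for spot-checking. A small sketch reusing the model_config shape from the previous snippet; the exact input names (query/response/context/ground_truth) can vary by SDK version, so check aka.ms/azure-ai-eval-sdk for the release you install.

```python
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    GroundednessEvaluator,
    SimilarityEvaluator,
)

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    azure_deployment="gpt-4o",
)

# Groundedness only needs the retrieved context and the answer.
groundedness = GroundednessEvaluator(model_config=model_config)
print(groundedness(
    context="...retrieved search results go here...",
    response="The Python sessions will showcase GitHub Copilot and Postman's Postbot.",
))

# Similarity is the evaluator that also needs the ground truth answer.
similarity = SimilarityEvaluator(model_config=model_config)
print(similarity(
    query="Which GitHub services will be showcased in the Python sessions?",
    response="The Python sessions will showcase GitHub Copilot and Postman's Postbot.",
    ground_truth="The sessions showcase GitHub Copilot and Postman's Postbot for APIs.",
))
```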

Slide 12

Slide 12 text

Step 3: Compare metrics across evals
Results can be stored in a repo, a storage container, or a specialized tool like Azure AI Studio.
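A small helper for comparing two runs side by side. It assumes each run wrote an eval_results.json whose "metrics" key maps metric names to mean scores (the shape used in the earlier sketch); adjust the loading code to whatever format your evaluation actually produces.

```python
import json

def load_metrics(path: str) -> dict:
    """Read the aggregate metrics from one evaluation run."""
    with open(path) as f:
        return json.load(f)["metrics"]

baseline = load_metrics("evals/baseline/eval_results.json")   # assumed path
candidate = load_metrics("evals/candidate/eval_results.json")  # assumed path

print(f"{'metric':<35}{'baseline':>10}{'candidate':>11}{'delta':>8}")
for name in sorted(baseline):
    if name in candidate:
        delta = candidate[name] - baseline[name]
        print(f"{name:<35}{baseline[name]:>10.2f}{candidate[name]:>11.2f}{delta:>+8.2f}")
```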

Slide 13

Slide 13 text

Evaluation in CI
When a PR changes anything that could affect app quality, we run a bulk evaluation and compare the results to the checked-in baseline.
Evaluate action needs:
• Azure OpenAI model access
• Sufficient model capacity
Example PR: https://github.com/pamelafox/rag-postgres-openai-python-ghu/pull/7
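One way to turn that comparison into a CI gate is a short script the workflow runs after the evaluation finishes, failing the job when scores regress. The file paths, threshold, and metrics-file structure below are assumptions for illustration, not the repo's actual CI configuration.

```python
import json
import sys

THRESHOLD = 0.5  # assumed: allow up to half a point of drop per metric

with open("evals/baseline/eval_results.json") as f:  # checked-in baseline (assumed path)
    baseline = json.load(f)["metrics"]
with open("eval_results.json") as f:  # results from this PR's evaluation run
    current = json.load(f)["metrics"]

regressions = {
    name: (baseline[name], current[name])
    for name in baseline
    if name in current and current[name] < baseline[name] - THRESHOLD
}

if regressions:
    print("Metric regressions detected:", regressions)
    sys.exit(1)  # a nonzero exit code fails the CI job
print("All metrics within threshold of the baseline.")
```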

Slide 14

Slide 14 text

Try it! Run evals on your PR
1. Navigate to the repository: aka.ms/rag-postgres-ghu
2. Make a change by editing one of the files in the repo:
   • evals/eval_config.json: Change API parameter settings for retrieval and mode
   • src/backend/fastapi_app/prompts/answer.txt: Change the main prompt
3. Submit a PR with the code change
4. Add a comment with "/evaluate" (don't expect it to autocomplete, since it's custom)
5. Wait 1 minute for a confirmation that the evaluation has begun
6. Wait ~10 minutes for a comment with a summary of your evaluation metrics
While you wait: check other PRs, aka.ms/azure-ai-eval-sdk, aka.ms/rag/eval

Slide 15

Slide 15 text

Best practices for evaluating
• Evaluate data that you know; LLMs can be convincingly wrong.
• What works better for 3 Qs doesn't always work better for 200.
• Don't trust absolute metrics for GPT evals, but pay attention to any relative changes.
• Remember that LLMs are non-deterministic.

Slide 16

Slide 16 text

What moves the needle?
• Good, complete retrieval = good, complete answers.
• Irrelevant, distracting results = wrong answers.
• Related: vector search can be noisy! Use wisely.
• Model choice makes a huge difference.
What doesn't move the needle?
• Remove fluff from prompts; it's usually inconsequential.

Slide 17

Slide 17 text

Productionizing a RAG app
Flowchart with three phases:
1. Ideating/exploring: identify the business use case, run the app against sample questions, and try different parameters; if not satisfied, keep iterating.
2. Building/augmenting: connect to your data, customize the prompt for your domain, run the flow against a larger dataset, and evaluate the answers; if not satisfied, improve the prompt and orchestration and change defaults.
3. Productionizing: deploy the app to users, collect user feedback, run online evaluations, and run A/B experiments.

Slide 18

Slide 18 text

Improving ground truth
Add a 👍/👎 button with a feedback dialog to your live app.
Then you can:
• Manually debug the answers that got rated 👎
• Add questions to ground truth data
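A minimal sketch of such a feedback endpoint for a FastAPI backend, not the repo's actual implementation: it just appends each rating to a JSONL file you can later mine for answers to debug and questions to promote into the ground truth data.

```python
import json
from datetime import datetime, timezone

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Feedback(BaseModel):
    question: str
    answer: str
    thumbs_up: bool
    comment: str | None = None

@app.post("/feedback")
async def record_feedback(feedback: Feedback):
    # Append the rating to a local JSONL log (a database table would work too).
    entry = feedback.model_dump()
    entry["timestamp"] = datetime.now(timezone.utc).isoformat()
    with open("feedback.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
    return {"status": "recorded"}
```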

Slide 19

Slide 19 text

Running online evaluations
You can run evaluations on deployed applications on live answers, as long as your metric doesn't require ground truth.

Evaluation metric | Type | Can run without ground truth?
FluencyEvaluator | GPT metric (azure-ai-evaluation) | Yes
CoherenceEvaluator | GPT metric (azure-ai-evaluation) | Yes
RelevanceEvaluator | GPT metric (azure-ai-evaluation) | Yes
GroundednessEvaluator | GPT metric (azure-ai-evaluation) | Yes
SimilarityEvaluator | GPT metric (azure-ai-evaluation) | No
F1ScoreEvaluator | Code metric (azure-ai-evaluation) | No
has_citation | Code metric (custom) | Yes
citation_match | Code metric (custom) | No
answer_length | Code metric (custom) | Yes
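For the custom code metrics, an online evaluation can be as simple as a script over logged production answers. A sketch of has_citation and answer_length, assuming a hypothetical live_responses.jsonl log and a [SessionNN]-style citation format like the sample answers earlier in this talk.

```python
import json
import re

def has_citation(response: str) -> bool:
    """True if the answer contains at least one bracketed citation like [Session10]."""
    return re.search(r"\[[^\]]+\]", response) is not None

def answer_length(response: str) -> int:
    """Answer length in characters."""
    return len(response)

with open("live_responses.jsonl") as f:  # hypothetical log of production answers
    rows = [json.loads(line) for line in f]

cited = sum(has_citation(row["response"]) for row in rows)
avg_length = sum(answer_length(row["response"]) for row in rows) / len(rows)
print(f"{cited}/{len(rows)} answers include a citation; average length {avg_length:.0f} chars")
# GPT metrics that don't need ground truth (fluency, coherence, relevance,
# groundedness) can be run over the same rows with the evaluators shown earlier.
```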

Slide 20

Slide 20 text

Conducting A/B experiments
Excited by a change but not sure it will be best for users in production? Run an A/B experiment and measure metrics across experiment groups.
Private preview sign-up: https://aka.ms/genAI-CI-CD-private-preview

Slide 21

Slide 21 text

Thank you!
Pamela Fox, Python Advocate, Microsoft
www.pamelafox.org / @pamelafox

Slide 22

Slide 22 text

We want to hear from you! Take the session survey by visiting the attendee portal so we can continue to make your Universe experience cosmic!