
GitHub Universe: Evaluating RAG apps in GitHub Actions

Pamela Fox

October 29, 2024

Transcript

  1. Save the pass/fail for unit tests— there's a better way to evaluate AI apps
     Pamela Fox, Python Advocate, Microsoft
     pamelafox.org / @pamelafox
  2. Agenda
     • What is RAG?
     • Bulk evaluation
     • Try it yourself!
     • Best practices for evals
     • Going to production
  3. Generative AI apps: Two flavors
     • Prompt-only: primarily relies on customizing the prompt and few-shot examples to get answers in a certain tone/format. Example: https://github.com/f/awesome-chatgpt-prompts
     • RAG (Retrieval Augmented Generation): combines the prompt with domain data from a knowledge base to get grounded answers. Example: https://github.com/Azure-Samples/azure-search-openai-demo/
  4. RAG: Retrieval Augmented Generation
     User question: "Which GitHub services will be showcased in the Python sessions at GitHub Universe?"
     → Search retrieves matching documents from the knowledge base, e.g. title: "Save the pass/fail for unit tests— there's a better way to evaluate AI apps"; description: "We all know how to run unit tests on our apps in CI/CD workflows, but we're now in a new world of AI apps with non-deterministic output..."
     → Large Language Model generates a grounded answer: "In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways..."
     (See the sketch below.)
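
     A minimal sketch of that retrieve-then-generate flow in Python, for orientation only: it is
     not the code of the example app shown on the next slide. The search_sessions() helper and
     the "gpt-4o" deployment name are hypothetical stand-ins.

     from openai import AzureOpenAI

     client = AzureOpenAI(
         azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
         api_key="YOUR-KEY",
         api_version="2024-06-01",
     )

     def search_sessions(query: str) -> list[dict]:
         # Hypothetical placeholder for the retrieval step (keyword and/or vector search
         # against the knowledge base).
         return [{"title": "...", "description": "..."}]

     def answer(question: str) -> str:
         # Retrieve supporting documents, then ask the LLM to answer using only those sources.
         sources = search_sessions(question)
         context = "\n".join(f"{s['title']}: {s['description']}" for s in sources)
         response = client.chat.completions.create(
             model="gpt-4o",  # assumed deployment name
             messages=[
                 {"role": "system", "content": "Answer ONLY using the provided sources."},
                 {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
             ],
         )
         return response.choices[0].message.content
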
  5. Example RAG application: Azure OpenAI + PostgreSQL
     • Code: aka.ms/rag-postgres-ghu (based on upstream: aka.ms/rag-postgres)
     • Deployed demo: aka.ms/rag-postgres-ghu/demo
  6. Are the answers high quality?
     • Are they clear and understandable?
     • Are they correct? (relative to the knowledge base)
     • Are they formatted in the desired manner?
     Example question: "Which GitHub services will be showcased in the Python sessions at GitHub Universe?"
     Example answers to compare:
     • "You will see GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot for advanced data solutions. Another session will focus on developing and testing APIs faster with GitHub Copilot and Postman's Postbot. [Session10] [Session21]"
     • "In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot to enhance data engineering projects with advanced features to identify data-related solutions and create automations that increase high data quality. Another session will focus on the synergy between GitHub Copilot and Postman's Postbot, an extension that transforms API development and testing, allowing you to speed up the creation and testing of APIs. These sessions will provide valuable insights on how to integrate these tools to streamline your workflow and ensure the delivery of high-quality work."
     • "The Python sessions at GitHub Universe will showcase tools for AI app evaluation, leveraging GitHub Copilot for data engineering projects, and integrating GitHub Copilot with Postman's Postbot to speed up API development and testing."
  7. What affects the quality?
     Question → Knowledge Search → Large Language Model
     Knowledge search step:
     • Search engine/database type
     • Search mode (keyword, vector, ...)
     • # of results returned
     • Search query cleaning step
     • Data preparation
     Large Language Model step:
     • Model
     • Temperature
     • Max tokens
     • Message history
     • System prompt
     (See the sketch below.)
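
     To make those knobs concrete, here is a purely hypothetical sketch of how they might be
     expressed as overrides for an evaluation run. The field names are invented for illustration;
     they are not the schema of the repo's evals/eval_config.json.

     # Hypothetical parameter overrides to sweep during evaluation; field names are invented.
     experiment_overrides = {
         "retrieval": {
             "search_mode": "hybrid",      # keyword, vector, or hybrid
             "top": 5,                     # number of results returned
             "clean_search_query": True,   # whether to run the query cleaning step
         },
         "llm": {
             "model": "gpt-4o-mini",
             "temperature": 0.0,
             "max_tokens": 1024,
             "system_prompt_file": "prompts/answer.txt",
         },
     }
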
  8. Bulk evaluation
     1. Generate ground truth data (~200 QA pairs)
     2. Evaluate with different parameters
     3. Compare the metrics and answers across evaluations
  9. Step 1: Generate ground truth data
     The ground truth data is the ideal answer for a question.
     One option: generate synthetic ground truth data, then do manual curation (Database results → Azure OpenAI → Q/A pairs → Human review).
     Other options: generate based on help desk tickets, support emails, search requests, questions from beta testers, etc.
     (See the sketch below.)
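
     A sketch of the synthetic-generation option, assuming you can iterate over documents in
     your knowledge base. The fetch_documents() helper, prompt, output schema, and deployment
     name are all assumptions; human review of the generated pairs is still the essential step.

     import json

     from openai import AzureOpenAI

     client = AzureOpenAI(
         azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
         api_key="YOUR-KEY",
         api_version="2024-06-01",
     )

     def fetch_documents():
         # Hypothetical placeholder for reading rows/chunks from the database.
         yield {"title": "...", "description": "..."}

     with open("ground_truth.jsonl", "w") as f:
         for doc in fetch_documents():
             completion = client.chat.completions.create(
                 model="gpt-4o",  # assumed deployment name; must support JSON mode
                 response_format={"type": "json_object"},
                 messages=[
                     {"role": "system", "content": "Given a source document, write one realistic "
                         'user question and its ideal answer as JSON: {"question": ..., "truth": ...}'},
                     {"role": "user", "content": f"{doc['title']}: {doc['description']}"},
                 ],
             )
             qa = json.loads(completion.choices[0].message.content)
             f.write(json.dumps(qa) + "\n")
     # Then curate: have a human review, fix, and prune the Q/A pairs before treating them as truth.
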
  10. Step 2: Evaluate app against ground truth
      Compute LLM metrics and custom metrics for every question in ground truth: each question from truth.jsonl is sent to the app endpoint, and the response/answer is scored.
      • GPT metrics (azure-ai-evaluation SDK, using Azure OpenAI + a grading prompt): gpt_coherence, gpt_groundedness, gpt_relevance
      • Code metrics (Python code): length, citation_match, latency
      (See the sketch below.)
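
      A sketch of that bulk run with the azure-ai-evaluation SDK, assuming truth.jsonl already
      holds query, context, response, and ground_truth columns (i.e. the app's responses were
      collected beforehand). Exact parameter and column names vary a bit across SDK versions,
      so treat this as an outline rather than the talk's exact pipeline.

      from azure.ai.evaluation import (
          CoherenceEvaluator,
          GroundednessEvaluator,
          RelevanceEvaluator,
          evaluate,
      )

      model_config = {
          "azure_endpoint": "https://YOUR-RESOURCE.openai.azure.com",
          "api_key": "YOUR-KEY",
          "azure_deployment": "gpt-4o",  # assumed deployment name for the grading model
      }

      def answer_length(*, response: str, **kwargs):
          # Custom code metric: no LLM call, just a property of the response.
          return {"answer_length": len(response)}

      result = evaluate(
          data="truth.jsonl",
          evaluators={
              "coherence": CoherenceEvaluator(model_config),
              "groundedness": GroundednessEvaluator(model_config),
              "relevance": RelevanceEvaluator(model_config),
              "answer_length": answer_length,
          },
          output_path="eval_results.json",
      )
      print(result["metrics"])  # aggregate scores across every question in the ground truth
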
  11. azure-ai-evaluation: GPT evaluators
      Inputs: Question | Context | Answer | Ground truth
      Evaluators: RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator, SimilarityEvaluator
      aka.ms/azure-ai-eval-sdk
      (See the sketch below.)
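
      Each evaluator can also be called directly on a single example, which is handy for spot
      checks. A sketch, assuming an Azure OpenAI deployment; input field names (query, response,
      context) differ slightly across SDK versions.

      from azure.ai.evaluation import GroundednessEvaluator

      model_config = {
          "azure_endpoint": "https://YOUR-RESOURCE.openai.azure.com",
          "api_key": "YOUR-KEY",
          "azure_deployment": "gpt-4o",  # assumed deployment name
      }

      groundedness = GroundednessEvaluator(model_config)
      score = groundedness(
          response="The Python sessions will showcase GitHub Copilot and Postman's Postbot.",
          context="Session descriptions retrieved from the knowledge base...",
      )
      print(score)  # e.g. a dict with a 1-5 groundedness score and a reason
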
  12. Step 3: Compare metrics across evals
      Results can be stored in a repo, a storage container, or a specialized tool like Azure AI Studio.
      (See the sketch below.)
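
      One lightweight way to compare runs, assuming each evaluation saved a JSON summary of its
      mean metric scores; the file paths and structure here are hypothetical.

      import json

      with open("evals/baseline/summary.json") as f:
          baseline = json.load(f)
      with open("evals/candidate/summary.json") as f:
          candidate = json.load(f)

      # Print a side-by-side diff of every metric seen in either run.
      for metric in sorted(set(baseline) | set(candidate)):
          old = baseline.get(metric)
          new = candidate.get(metric)
          delta = round(new - old, 3) if old is not None and new is not None else "n/a"
          print(f"{metric:<25} baseline={old} candidate={new} delta={delta}")
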
  13. Evaluation in CI
      When a PR changes anything that could affect app quality, we run a bulk evaluation and compare the results to the checked-in baseline.
      The Evaluate action needs:
      • Azure OpenAI model access
      • Sufficient model capacity
      Example PR: https://github.com/pamelafox/rag-postgres-openai-python-ghu/pull/7
  14. Try it! Run evals on your PR
      1. Navigate to the repository: aka.ms/rag-postgres-ghu
      2. Make a change by editing one of the files in the repo:
         • evals/eval_config.json: change API parameter settings for retrieval and mode
         • src/backend/fastapi_app/prompts/answer.txt: change the main prompt
      3. Submit a PR with the code change
      4. Add a comment with "/evaluate" (don't expect it to autocomplete, since it's custom)
      5. Wait 1 minute for a confirmation that the evaluation has begun
      6. Wait ~10 minutes for a comment with a summary of your evaluation metrics
      While you wait: check other PRs, aka.ms/azure-ai-eval-sdk, aka.ms/rag/eval
  15. Best practices for evaluating
      • Evaluate data that you know; LLMs can be convincingly wrong.
      • What works better for 3 Qs doesn't always work better for 200.
      • Don't trust absolute metrics for GPT evals, but pay attention to any relative changes.
      • Remember that LLMs are non-deterministic.
  16. What moves the needle?
      • Good, complete retrieval = good, complete answers.
      • Irrelevant, distracting results = wrong answers.
      • Related: vector search can be noisy! Use wisely.
      • Model choice makes a huge difference.
      What doesn't move the needle?
      • Removing fluff from prompts; it's usually inconsequential.
  17. Productionizing a RAG app
      (Flowchart with three phases: 1. Ideating/exploring, 2. Building/augmenting, 3. Productionizing. Steps include: identify business use case, connect to your data, customize prompt for domain, run app against sample questions, improve the prompt and orchestration, run flow against larger dataset, evaluate answers, try different parameters, change defaults, deploy app to users, collect user feedback, run online evaluations, run A/B experiments, with "Satisfied?" yes/no decision points gating progress between phases.)
  18. Improving ground truth
      Add a thumbs-up/thumbs-down button with a feedback dialog to your live app. Then you can:
      • Manually debug the answers that got rated thumbs-down
      • Add questions to ground truth data
      (See the sketch below.)
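
      A minimal sketch of the feedback plumbing with FastAPI (the framework the example app
      uses); the route name, model fields, and file-based storage are assumptions, not the
      repo's actual code.

      from fastapi import FastAPI
      from pydantic import BaseModel

      app = FastAPI()

      class Feedback(BaseModel):
          question: str
          answer: str
          thumbs_up: bool
          comment: str | None = None

      @app.post("/feedback")
      async def record_feedback(feedback: Feedback):
          # Append to a local file for this sketch; a real app would write to its database so
          # thumbs-down answers can be debugged and good questions promoted into ground truth.
          with open("feedback.jsonl", "a") as f:
              f.write(feedback.model_dump_json() + "\n")
          return {"status": "ok"}
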
  19. Running online evaluations
      You can run evaluations on deployed applications, on live answers, as long as your metric doesn't require ground truth.
      Evaluation metric      | Type                              | Can run without ground truth?
      FluencyEvaluator       | GPT metric (azure-ai-evaluation)  | Yes
      CoherenceEvaluator     | GPT metric (azure-ai-evaluation)  | Yes
      RelevanceEvaluator     | GPT metric (azure-ai-evaluation)  | Yes
      GroundednessEvaluator  | GPT metric (azure-ai-evaluation)  | Yes
      SimilarityEvaluator    | GPT metric (azure-ai-evaluation)  | No (requires ground truth)
      F1ScoreEvaluator       | Code metric (azure-ai-evaluation) | No (requires ground truth)
      has_citation           | Code metric (custom)              | Yes
      citation_match         | Code metric (custom)              | No (requires ground truth)
      answer_length          | Code metric (custom)              | Yes
      (See the sketch below.)
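
      A sketch of scoring a live answer with two of the evaluators that do not need ground
      truth. The question/answer pair would come from production logs; field names may differ
      by SDK version, and the deployment name is an assumption.

      from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

      model_config = {
          "azure_endpoint": "https://YOUR-RESOURCE.openai.azure.com",
          "api_key": "YOUR-KEY",
          "azure_deployment": "gpt-4o",  # assumed deployment name
      }

      coherence = CoherenceEvaluator(model_config)
      fluency = FluencyEvaluator(model_config)

      # In production these come from request logs rather than a curated dataset.
      live_question = "Which GitHub services will be showcased in the Python sessions?"
      live_answer = "The Python sessions will showcase GitHub Copilot and Postman's Postbot."

      print(coherence(query=live_question, response=live_answer))
      print(fluency(response=live_answer))
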
  20. Conducting A/B experiments
      Excited by a change but not sure it will be best for users in production? Run an A/B experiment and measure metrics across experiment groups.
      Private preview sign-up: https://aka.ms/genAI-CI-CD-private-preview
  21. We want to hear from you! Take the session survey by visiting the attendee portal so we can continue to make your Universe experience cosmic!