the prompt and few shots to get answers in a certain tone/format. https://github.com/f/awesome-chatgpt-prompts https://github.com/Azure-Samples/azure-search-openai-demo/ RAG (Retrieval Augmented Generation): Combines the prompt with domain data from a knowledge base to get grounded answers
unit tests— there’s a better way to evaluate AI apps description: We all know how to run unit tests on our apps in CI/CD workflows, but we're now in a new world of AI apps with non- deterministic output.... Large Language Model In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways... User Question Which GitHub services will be showcased in the python sessions at GitHub Universe?
the knowledge base) Are they formatted in the desired manner? Are the answers high quality? You will see GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot for advanced data solutions. Another session will focus on developing and testing APIs faster with GitHub Copilot and Postman's Postbot. [Session10] [Session21] In the Python sessions at GitHub Universe, you can expect to see a showcase of GitHub Copilot being used in various ways. In one session, you'll learn how to leverage GitHub Copilot to enhance data engineering projects with advanced features to identify data-related solutions and create automations that increase high data quality. Another session will focus on the synergy between GitHub Copilot and Postman’s Postbot, an extension that transforms API development and testing, allowing you to speed up the creation and testing of APIs. These sessions will provide valuable insights on how to integrate these tools to streamline your workflow and ensure the delivery of high-quality work. Which GitHub services will be showcased in the python sessions at GitHub Universe? The Python sessions at GitHub Universe will showcase tools for AI app evaluation, leveraging GitHub Copilot for data engineering projects, and integrating GitHub Copilot with Postman's Postbot to speed up API development and testing.
mode (keyword, vector, ...) • # of results returned • Search query cleaning step • Data preparation Knowledge Search Large Language Model Question • Model • Temperature • Max tokens • Message history • System prompt
is the ideal answer for a question. One option: generate synthetic ground truth data, then do manual curation. Azure OpenAI results Q/A pairs Database Human review Other options: generate based on help desk tickets, support emails, search requests, questions from beta testers, etc.
affect app quality, we run a bulk evaluation and compare the results to the checked in baseline. Evaluate action needs: • Azure OpenAI model access • Sufficient model capacity Example PR: https://github.com/pamelafox/rag- postgres-openai-python-ghu/pull/7
by editing one of the files in the repo: 1. evals/eval_config.json: Change API parameter settings for retrieval and mode 2. src/backend/fastapi_app/prompts/answer.txt: Change the main prompt 3. Submit a PR with the code change 4. Add comment with "/evaluate" (Don't expect it to autocomplete, since it's custom) 5. Wait 1 minute for a confirmation that the evaluation has begun 6. Wait ~10 minutes for a comment with a summary of your evaluation metrics While you wait: check other PRs, aka.ms/azure-ai-eval-sdk, aka.ms/rag/eval Try it! Run evals on your PR
wrong. What works better for 3 Qs doesn't always work better for 200. Don't trust absolute metrics for GPT evals, but pay attention to any relative changes. Remember that LLMs are non-deterministic. Best practices for evaluating
distracting results = wrong answers. Related: Vector search can be noisy! Use wisely. Model choice makes a huge difference. What moves the needle? What doesn't move the needle? • Remove fluff from prompts; it's usually inconsequential.
different parameters Satisfied? Run flow against larger dataset Evaluate answers Satisfied? Deploy app to users No No Yes Yes Change defaults Run online evaluations Improve the prompt and orchestration Connect to your data Customize prompt for domain Collect user feedback 1. Ideating/exploring 2. Building/augmenting 3. Productionizing Productionizing a RAG app Run A/B experiments
it will be best for users in production? Run an A/B experiment, and measure metrics across experiment groups. Private preview sign-up: https://aka.ms/genAI-CI-CD-private-preview