Testing OpenAI Applications - Sydney Testers

Adrian Cole
September 26, 2024

First talk on an approach to testing OpenAI applications from a developer perspective. A stroll through GenAI with some Python, Ollama, and advice! https://www.meetup.com/sydney-testers/events/303384472

Transcript

  1. Introductions! I’m Adrian from the Elastic Observability team; I mostly
    work on OpenTelemetry. I’m a baby in GenAI programming (<3 months
    full-time), but I’ve spent a lot of time testing it recently! I co-led
    wazero, a zero-dependency WebAssembly runtime for Go, a couple of years
    ago. There are tons of portability things in my open source history.
    github.com/codefromthecrypt x.com/adrianfcole
  2. Agenda
    • Introduction to OpenAI and ChatGPT
    • Using the OpenAI Playground to learn how to code
    • How to make a basic integration test
    • How to use recorded HTTP requests in unit tests
    • Introduction to Ollama for OpenAI API hosting
    • Demo of a real OpenAI test setup
    • Summing up and Q&A
  3. Introduction to OpenAI
    Generative models generate new data based on known data.
    • Causal language models are unidirectional: they predict words based on previous ones.
    • Large Language Models (LLMs) are typically trained for chat, code completion, etc.
    OpenAI is an organization and cloud platform, 49% owned by Microsoft.
    • OpenAI develops the GPT (Generative Pre-trained Transformer) LLM.
    • GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT.
  4. Introduction to ChatGPT
    ChatGPT is a natural language processing (NLP) application that uses the GPT LLM.
    • Like the GPT model family, ChatGPT is hosted by OpenAI and Azure.
    The LLM can answer correctly if geography was part of the text it was trained on.
  5. Context window
    The context window includes all messages considered when generating a
    response. In other words, any chat message replays all prior questions and
    answers.
    • It can include chat history, instructions, documents, images (if multimodal), etc.
    • Limits are not in characters or words, but in tokens; a token is roughly ¾ of a word.
    In the slide’s example: 41 tokens in the context window, 33 tokens generated.
  6. Key things about programming OpenAI
    (Callouts on the slide’s code: this part is required, this part is
    optional, this part reads ENV variables.)
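
    A minimal sketch of the kind of code those callouts describe, assuming the
    official openai Python client (the slide’s exact code isn’t in the
    transcript):

        from openai import OpenAI

        # Reads ENV variables: OpenAI() picks up OPENAI_API_KEY (and optionally
        # OPENAI_BASE_URL) from the environment.
        client = OpenAI()

        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # required: which model to use
            messages=[            # required: the context window as messages
                {"role": "user", "content": "What is the capital of Australia?"}
            ],
            temperature=0,        # optional: tuning parameters
        )
        print(completion.choices[0].message.content)
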
  7. Making a basic chat library
    Taking the code from the playground, we can separate input from
    implementation, as in the sketch below.
    • This hides parts we don’t know well.
    • This gives us a way to test the code.
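
    A minimal sketch of such a library, assuming the openai Python client (the
    function name and model choice are illustrative, not from the deck):

        from openai import OpenAI

        client = OpenAI()

        def chat(question: str, model: str = "gpt-4o-mini") -> str:
            """Send one question and return the model's text answer."""
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            return completion.choices[0].message.content
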
  8. First test goes against the real platform
    You minimally need to set OPENAI_API_KEY and choose a model.
    • Tests need access to the platform and will cost a little each time.
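
    A sketch of such an integration test with pytest, reusing the chat()
    helper above (the keyword assertion matches the deck’s Bouvet Island
    example):

        def test_chat_against_real_openai():
            # Requires OPENAI_API_KEY in the environment; each run costs a little.
            answer = chat("What is the name of the ocean that contains Bouvet Island?")
            assert "Atlantic" in answer
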
  9. Why can’t we test for an exact response?
    You cannot 100% control the output of an LLM; the answer may vary.
    • There are settings like seed and temperature to reduce creativity (see the sketch below).
    • Real runs may include the expected keywords, but not in an exact order.
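
    A sketch of those settings, reusing the client and question from the
    sketches above (seed is best-effort reproducibility, not a guarantee):

        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
            seed=42,        # ask the platform to sample reproducibly, when it can
            temperature=0,  # reduce creativity in sampling
        )
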
  10. Recording exact requests for exact tests
    OpenAI is an HTTP service, which makes it unit testable.
    • You can record real HTTP responses and play them back with VCR.
    • This allows us to make exact assertions.
    • pytest-vcr is an easy way to use VCR: pytest-vcr.readthedocs.io
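
    A sketch of a recorded unit test with pytest-vcr (cassette handling
    follows the plugin’s defaults; the asserted text is from the deck’s
    Ollama example):

        import pytest

        @pytest.mark.vcr
        def test_chat_replays_recorded_http():
            # First run records HTTP traffic into a cassette file; later runs
            # replay it, so the assertion can be exact.
            answer = chat("What is the name of the ocean that contains Bouvet Island?")
            assert answer == "The ocean that contains Bouvet Island is the Atlantic Ocean."
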
  11. There’s still a bit more work to do…
    OpenAI’s client is designed to require OPENAI_API_KEY!
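
    One workaround, sketched: give the client a fake key when a real one isn’t
    set, since VCR playback never lets the request reach OpenAI.

        import os

        # Must run before the OpenAI client is constructed.
        os.environ.setdefault("OPENAI_API_KEY", "sk-fake-key-for-playback")
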
  12. Oh!!! Even more work!
    VCR recordings don’t scrub any secrets by default. You must be careful
    about this data!
    • Do you have any auth keys?
    • Are you leaking org info or personal data?
  13. Scrub config
    A good VCR config:
    • makes visible what is logged
    • considers requests and responses
    • considers case sensitivity
    • still needs users to pay attention!
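
    A sketch of such a config as pytest-vcr’s vcr_config fixture (which
    headers appear is an assumption about OpenAI traffic; verify against your
    own recordings):

        import pytest

        @pytest.fixture(scope="module")
        def vcr_config():
            return {
                # Replace or drop request headers that carry secrets.
                "filter_headers": [
                    ("authorization", "Bearer sk-fake-key"),
                    ("openai-organization", None),
                ],
                # Scrub response headers that identify the account.
                "before_record_response": scrub_response,
            }

        def scrub_response(response):
            for header in ("Set-Cookie", "openai-organization"):
                response["headers"].pop(header, None)
            return response
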
  14. What about CI?
    VCR tests are just normal unit tests, so they should run in CI.
    • These only require real credentials on change, to produce new recordings.
    Integration tests require credentials for OpenAI, so some may choose
    against running them in CI.
    • Credentials can be misconfigured and leak the account.
    • Repetitive revisions or model choices can cost a considerable amount.
    Do we just skip integration tests?
    • Skipping in CI is better than deleting tests (one skip pattern is sketched below).
    • There are alternatives to running tests against OpenAI the platform.
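
    One way to skip, sketched with a pytest marker reusing the chat() helper
    (the names are illustrative):

        import os
        import pytest

        # Integration tests only run when the environment provides a real key.
        integration = pytest.mark.skipif(
            "OPENAI_API_KEY" not in os.environ,
            reason="set OPENAI_API_KEY to run tests against the real platform",
        )

        @integration
        def test_chat_against_real_openai():
            assert "Atlantic" in chat("What is the name of the ocean that contains Bouvet Island?")
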
  15. Not all LLMs are only available as a service
    OpenAI is an organization and cloud platform, 49% owned by Microsoft.
    • OpenAI develops the GPT (Generative Pre-trained Transformer) LLM.
    • GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT.
    Llama is an LLM developed by Meta.
    • You can download it for free from Meta and run it with llama-stack.
    • It is also hosted on platforms, and downloadable in llama.cpp’s GGUF format.
    Thousands of LLMs exist.
    • There are many types of LLMs, differing in source, size, training, and
      modality (photo, sound, etc.).
    • We’ll use the small and well-documented Qwen2.5 LLM developed by Alibaba.
  16. Let’s use Qwen 2.5! Where do I get this?
    • Dense, easy-to-use, decoder-only language models, available in 0.5B,
      1.5B, 3B, 7B, 14B, 32B, and 72B sizes, and base and instruct variants.
    • Pretrained on Alibaba’s latest large-scale dataset, encompassing up to
      18T tokens.
    • Significant improvements in instruction following, generating long texts
      (over 8K tokens), understanding structured data (e.g., tables), and
      generating structured outputs, especially JSON.
    • More resilient to the diversity of system prompts, enhancing role-play
      implementation and condition-setting for chatbots.
    • Context length support up to 128K tokens; can generate up to 8K tokens.
    • Multilingual support for over 29 languages, including Chinese, English,
      French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean,
      Vietnamese, Thai, Arabic, and more.
  17. Hello Ollama github.com/ollama/ollama
    • Written in Go
    • Docker-like experience
    • llama.cpp backend
    • Simplified model flow

    $ ollama serve
    $ ollama pull qwen2.5:0.5b
    $ ollama run qwen2.5:0.5b "What is the name of the ocean that contains Bouvet Island?"
    The ocean that contains Bouvet Island is the Atlantic Ocean.
  18. The OpenAI API is a de facto standard
    Ollama implements a lot of the OpenAI API, so you can reuse the same
    client (sketched below).
    • Change the base URL to localhost.
    • Add a fake API key.
    • Leave the rest alone! github.com/openai/openai-openapi
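
    A sketch of that change, assuming Ollama’s default OpenAI-compatible
    endpoint at http://localhost:11434/v1:

        from openai import OpenAI

        # The same client, pointed at a local Ollama server.
        client = OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="unused",  # required by the client, ignored by Ollama
        )

        completion = client.chat.completions.create(
            model="qwen2.5:0.5b",
            messages=[{"role": "user",
                       "content": "What is the name of the ocean that contains Bouvet Island?"}],
        )
        print(completion.choices[0].message.content)
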
  19. Hallucination
    LLMs can hallucinate, giving irrelevant, nonsensical, or factually
    incorrect answers.
    • LLMs rely on statistical correlation and may invent things to fill gaps.
    • Mitigate by selecting relevant models, prompt engineering, etc.

    User: What is the name of the ocean that contains Bouvet Island?
    Assistant: Ocean of icebergs.

    $ ollama ls
    qwen2.5:14b     7cdf5a0187d5    9.0 GB    3 days ago
    qwen2.5:latest  845dbda0ea48    4.7 GB    3 days ago
    qwen2.5:0.5b    a8b0c5157701    397 MB    5 days ago
    More parameters can be more expensive, but might give more relevant answers.
  20. For starters, maybe just retry the test
    Higher-parameter models demand a lot more resources.
    • As a human, you’d just retry if something flaked.
    • Use pytest-retry to do this for you (sketched below): pypi.org/project/pytest-retry
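
    A sketch using pytest-retry’s flaky marker (the retry count is an
    arbitrary choice):

        import pytest

        @pytest.mark.flaky(retries=2)
        def test_chat_against_local_model():
            # A small local model may occasionally miss; retry instead of failing fast.
            answer = chat("What is the name of the ocean that contains Bouvet Island?")
            assert "Atlantic" in answer
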
  21. One way to CI in GitHub
    Run unit and integration tests on pull requests.
    • Use separate jobs for unit and integration tests.
    Run integration tests against Ollama, not OpenAI (the steps are sketched below).
    • Always use the latest version of Ollama.
    • Pull the model in a separate step, before tests execute.
    Keep a CONTRIBUTING.md with instructions on how to do everything.
    • Don’t assume people will remember, or that the same people will always
      be around.
    • Keep documentation up to date with practice.
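
    The shell steps of such an integration job might look like this (a sketch;
    the deck’s actual workflow isn’t in the transcript):

        $ curl -fsSL https://ollama.com/install.sh | sh   # install the latest Ollama
        $ ollama serve &                                  # start the server in the background
        $ ollama pull qwen2.5:0.5b                        # pull the model before tests execute
        $ pytest                                          # then run the integration tests
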
  22. Takeaways and Thanks!
    OpenAI requires your best and most creative testing skills.
    • Unit tests should record real HTTP requests in whatever way is best for
      your language. If using Python, use pytest-vcr.
    • Integration tests should use OpenAI, but allow local model usage as
      well. Ollama is a very good option for local model hosting, and Qwen 2.5
      is a great model.
    • Tests themselves should be strict in unit tests and flexible in
      integration tests. LLM responses are not entirely predictable, and can
      sometimes miss. Be aware of this.
    github.com/codefromthecrypt x.com/adrianfcole www.linkedin.com/in/adrianfcole