Slide 1

Slide 1 text

Testing OpenAI Applications Sydney Testers

Slide 2

Slide 2 text

Introductions! I’m Adrian from the Elastic Observability team, where I mostly work on OpenTelemetry. I’m a baby in GenAI programming (<3 months full-time), but I’ve spent a lot of time testing it recently! A couple of years ago I co-led wazero, a zero-dependency WebAssembly runtime for Go. There are tons of portability projects in my open source history.
github.com/codefromthecrypt x.com/adrianfcole

Slide 3

Slide 3 text

Agenda
● Introduction to OpenAI and ChatGPT
● Using OpenAI Playground to learn how to code
● How to make a basic integration test
● How to use recorded HTTP requests in unit tests
● Introduction to Ollama for OpenAI API hosting
● Demo of a real OpenAI test setup
● Summing up and Q&A

Slide 4

Slide 4 text

Introduction to OpenAI
Generative models generate new data based on known data
● Causal language models are unidirectional: they predict words based on previous ones
● Large Language Models (LLMs) are typically trained for chat, code completion, etc.
OpenAI is an organization and cloud platform, 49% owned by Microsoft
● OpenAI develops the GPT (Generative Pre-trained Transformer) LLM
● GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT

Slide 5

Slide 5 text

Introduction to ChatGPT
ChatGPT is a natural language processing (NLP) application that uses the GPT LLM
● Like the GPT model family, ChatGPT is hosted by OpenAI and Azure
The LLM can answer correctly if geography was part of the text it was trained on

Slide 6

Slide 6 text

Context window
The context window includes all messages considered when generating a response. In other words, any chat message replays all prior questions and answers.
● It can include chat history, instructions, documents, images (if multimodal), etc.
● Limits are not measured in characters or words, but in tokens: a token is roughly ¾ of a word
Example: 41 tokens in the context window, 33 tokens generated
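
To make the token budget concrete, here is a minimal sketch using OpenAI’s tiktoken library. Treating cl100k_base as the encoding is an assumption; match it to the model you actually call.

import tiktoken

# cl100k_base is used by several OpenAI chat models; confirm the right
# encoding for your model before relying on the count.
encoding = tiktoken.get_encoding("cl100k_base")
text = "What is the name of the ocean that contains Bouvet Island?"
print(len(text.split()))           # 11 words
print(len(encoding.encode(text)))  # token count, a bit higher than words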

Slide 7

Slide 7 text

OpenAI playground teaches you the API
github.com/openai/openai-openapi
platform.openai.com/playground/chat

Slide 8

Slide 8 text

Key things about programming OpenAI
● This part is required
● This part is optional
● This part reads ENV variables
The sketch below maps those callouts onto code.
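
As a hedged illustration (the model name and optional values below are examples, not from the talk), playground-generated Python code has roughly this shape:

from openai import OpenAI

# This part reads ENV variables: OPENAI_API_KEY, and optionally others
# such as OPENAI_BASE_URL.
client = OpenAI()

response = client.chat.completions.create(
    # This part is required: a model and at least one message.
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "user",
         "content": "What is the name of the ocean that contains Bouvet Island?"},
    ],
    # This part is optional: sampling settings with sensible defaults.
    temperature=1.0,
    max_tokens=256,
)
print(response.choices[0].message.content)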

Slide 9

Slide 9 text

Making a basic chat library
Taking the code from the playground, we can separate input from implementation
● This hides parts we don’t know well
● This gives us a way to test the code
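
A minimal sketch of such a library; the module and function names (chat.py, chat()) are illustrative, not from the talk.

# chat.py - hypothetical module; the caller supplies the input (a message),
# the OpenAI client details stay hidden in the implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY and friends from ENV variables

def chat(message: str, model: str = "gpt-4o-mini") -> str:
    """Send a single user message and return the assistant's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content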

Slide 10

Slide 10 text

First test goes against the real platform
At minimum, you need to set OPENAI_API_KEY and choose a model
● Tests need access to the platform and will cost a little each time
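
An illustrative integration test against the real platform; it imports the hypothetical wrapper from the previous slide, needs OPENAI_API_KEY set, and costs a little per run.

# test_integration.py
from chat import chat  # hypothetical wrapper module from earlier

def test_chat():
    reply = chat("What is the name of the ocean that contains Bouvet Island?")
    assert "Atlantic" in reply  # keyword check, since exact wording varies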

Slide 11

Slide 11 text

Why can’t we test for an exact response?
You cannot 100% control the output of an LLM; the answer may vary
● There are settings like seed and temperature to reduce creativity
● Real runs may include the expected keywords, but not in an exact order
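
For example, a hedged sketch of dialing down variation while still asserting loosely; the model name and seed value are arbitrary choices.

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user",
               "content": "What is the name of the ocean that contains Bouvet Island?"}],
    temperature=0,  # least "creative" sampling
    seed=42,        # best-effort determinism; OpenAI does not guarantee it
)
# Even so, assert on keywords rather than the exact string.
assert "Atlantic" in response.choices[0].message.content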

Slide 12

Slide 12 text

Recording exact requests for exact tests
OpenAI is an HTTP service, which makes it unit testable
● You can record real HTTP responses and play them back with VCR
● This allows us to make exact assertions
● pytest-vcr is an easy way to use VCR: pytest-vcr.readthedocs.io
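
A minimal sketch of a recorded test with pytest-vcr; the asserted string is the reply seen later in this deck, and the cassette lands in a cassettes/ directory per pytest-vcr's defaults.

import pytest

from chat import chat  # hypothetical wrapper module from earlier

@pytest.mark.vcr  # records a cassette on the first run, replays it after
def test_chat():
    reply = chat("What is the name of the ocean that contains Bouvet Island?")
    # Playback is deterministic, so an exact assertion is now safe.
    assert reply == "The ocean that contains Bouvet Island is the Atlantic Ocean."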

Slide 13

Slide 13 text

Exact responses can be chatty, though…

Slide 14

Slide 14 text

There’s still a bit more work to do…
OpenAI’s client is designed to require OPENAI_API_KEY!
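
One workaround, as a sketch: give the client a placeholder key when only VCR playback will run.

import os

# Placeholder so OpenAI() can be constructed; during VCR playback no real
# request leaves the machine, so the value is never used.
os.environ.setdefault("OPENAI_API_KEY", "sk-fake")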

Slide 15

Slide 15 text

Oh!!! Even more work!
VCR recordings don’t scrub any secrets by default. You must be careful about this data!
● Do you have any auth keys?
● Are you leaking org info or personal data?

Slide 16

Slide 16 text

Scrub config
A good VCR config:
● Makes visible what is logged
● Considers requests and responses
● Considers case sensitivity
● Still needs users to pay attention!
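
One hedged sketch of such a config, using pytest-vcr's vcr_config fixture. Review cassettes by hand regardless: response headers such as openai-organization can also leak account details.

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        # Overwrite the bearer token before the cassette is written to disk.
        # Check the header-name casing your client actually sends.
        "filter_headers": [("authorization", "Bearer sk-fake")],
        # Decode compressed bodies so recordings are readable in review.
        "decode_compressed_response": True,
    }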

Slide 17

Slide 17 text

What about CI?
VCR tests are just normal unit tests, so they should run in CI
● These only require real credentials on change, to produce new recordings
Integration tests require credentials for OpenAI, so some may choose against running them
● Credentials can be misconfigured and leak the account
● Repeated runs or expensive model choices can cost a considerable amount
Do we just skip integration tests?
● Skipping in CI is better than deleting tests
● There are alternatives to running tests against OpenAI the platform

Slide 18

Slide 18 text

Not all LLMs are only available as a service
OpenAI is an organization and cloud platform, 49% owned by Microsoft
● OpenAI develops the GPT (Generative Pre-trained Transformer) LLM
● GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT
Llama is an LLM developed by Meta
● You can download it for free from Meta and run it with llama-stack
● It is also hosted on platforms, and downloadable in llama.cpp’s GGUF format
Thousands of LLMs exist
● Many types of LLMs, differing in source, size, training, and modality (photo, sound, etc.)
● We’ll use the small and well documented Qwen2.5 LLM developed by Alibaba

Slide 19

Slide 19 text

Let’s use Qwen 2.5! Where do I get this?
● Dense, easy-to-use, decoder-only language models, available in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes, and base and instruct variants.
● Pretrained on our latest large-scale dataset, encompassing up to 18T tokens.
● Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON.
● More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
● Context length support up to 128K tokens; can generate up to 8K tokens.
● Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Slide 20

Slide 20 text

Hello Ollama
github.com/ollama/ollama
● Written in Go
● Docker-like experience
● llama.cpp backend
● Simplified model flow

$ ollama serve
$ ollama pull qwen2.5:0.5b
$ ollama run qwen2.5:0.5b "What is the name of the ocean that contains Bouvet Island?"
The ocean that contains Bouvet Island is the Atlantic Ocean.

Slide 21

Slide 21 text

OpenAI API is a de facto standard
Ollama implements a large portion of the OpenAI API
● Change the base URL to localhost
● Add a fake API key
● Leave the rest alone!
github.com/openai/openai-openapi
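
A minimal sketch of that in Python, pointing the standard client at Ollama's documented OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default endpoint
    api_key="unused",  # the client insists on a key; Ollama ignores it
)

response = client.chat.completions.create(
    model="qwen2.5:0.5b",  # an Ollama model tag, not an OpenAI model name
    messages=[{"role": "user",
               "content": "What is the name of the ocean that contains Bouvet Island?"}],
)
print(response.choices[0].message.content)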

Slide 22

Slide 22 text

You may have to clarify your requests a bit

Slide 23

Slide 23 text

Integration tests look like they’ll always pass!

Slide 24

Slide 24 text

But sometimes, it doesn’t?! Ocean of Icebergs, huh…

Slide 25

Slide 25 text

Hallucination
LLMs can hallucinate, giving irrelevant, nonsensical, or factually incorrect answers
● LLMs rely on statistical correlation and may invent things to fill gaps
● Mitigate by selecting relevant models, prompt engineering, etc.

User: What is the name of the ocean that contains Bouvet Island?
Assistant: Ocean of icebergs.

$ ollama ls
qwen2.5:14b     7cdf5a0187d5    9.0 GB    3 days ago
qwen2.5:latest  845dbda0ea48    4.7 GB    3 days ago
qwen2.5:0.5b    a8b0c5157701    397 MB    5 days ago

More parameters can be more expensive, but might give more relevant answers

Slide 26

Slide 26 text

For starters, maybe just retry the test
Higher-parameter models need a lot more resources
● As a human, you’d just retry if something flaked
● Use pytest-retry to do this for you: pypi.org/project/pytest-retry
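
A sketch with pytest-retry's flaky marker; the retry counts are arbitrary, and the chat import refers to the hypothetical wrapper from earlier.

import pytest

from chat import chat  # hypothetical wrapper module from earlier

@pytest.mark.flaky(retries=2, delay=1)  # pytest-retry reruns on failure
def test_chat_against_ollama():
    reply = chat("What is the name of the ocean that contains Bouvet Island?")
    assert "Atlantic" in reply  # loose check; small models still miss sometimes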

Slide 27

Slide 27 text

One way to CI in GitHub
Run unit and integration tests on pull requests
● Use separate jobs for unit and integration
Run integration tests against ollama, not openai
● Always use the latest version of ollama
● Pull the model in a separate step, before tests execute
Keep a CONTRIBUTING.md with instructions on how to do everything
● Don’t assume people will remember, or that the same people will always be around
● Keep documentation up to date with practice
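
The integration job's steps could look roughly like this shell sketch; the install one-liner is Ollama's documented Linux installer, and the test path is an example.

$ curl -fsSL https://ollama.com/install.sh | sh  # install the latest ollama
$ ollama pull qwen2.5:0.5b                       # separate step, before tests
$ pytest tests/integration                       # then run integration tests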

Slide 28

Slide 28 text

Real life example! https://github.com/square/exchange

Slide 29

Slide 29 text

Takeaways and Thanks!
● OpenAI requires your best and most creative testing skills
● Unit tests should record real HTTP requests in whatever way is best for your language. If using Python, use pytest-vcr
● Integration tests should use OpenAI, but allow local model usage as well. Ollama is a very good option for local model hosting, and Qwen 2.5 is a great model
● Assertions should be strict in unit tests and flexible in integration tests
● LLM responses are not entirely predictable and can sometimes miss. Be aware of this.
github.com/codefromthecrypt x.com/adrianfcole www.linkedin.com/in/adrianfcole