Chapter 4 – LLM-Powered Test Infrastructure for Software Testing (ISTQB® CT-GenAI v1.1). Reading Materials

ISTQB® CT-GenAI TRAINING COURSE
Chapter 4. LLM-Powered Test Infrastructure for Software Testing
Iuliia Emelianova, Dmitrii Degtiarenko
BUILD SOFTWARE TO TEST SOFTWARE
ISTQB® CT-GenAI COURSE 2026, V1.1
exactpro.com

Learning Activity Overview

Title: Chapter 4 – LLM-Powered Test Infrastructure for Software Testing (ISTQB® CT-GenAI v1.1)
Format: Reading Materials (self-study or guided reading)
Estimated Duration: 110 minutes
Target Audience: Software Testers, Test Automation Engineers, Test Analysts, Test Managers, Software Developers, professionals who need a solid understanding of Generative AI (GenAI) in testing (project managers, quality managers, software development managers, business analysts, IT directors and consultants), and professionals preparing for ISTQB® CT-GenAI certification
Programme Context: This learning activity forms part of the ISTQB® CT-GenAI training programme and aligns with syllabus version 1.1

Engagement: During this chapter, you will:
• Understand the architecture and core components of LLM-powered test infrastructure
• Explain how Retrieval-Augmented Generation (RAG) improves the quality and reliability of AI-generated testing outputs
• Describe the role of LLM-powered agents in automating and supporting software test activities
• Understand how fine-tuning adapts LLMs and SLMs for organisation-specific test tasks
• Explain the purpose of LLMOps in managing, deploying, and governing GenAI solutions in software test environments

ISTQB® CT-GenAI Training Course | Chapter 4. LLM-Powered Test Infrastructure for Software Testing Page 2 of 17

Learning Objectives

By the end of this learning activity, participants will be able to:
• Explain key architectural components and concepts of LLM-powered test infrastructure
• Summarise Retrieval-Augmented Generation
• Explain the role and application of LLM-powered agents in automating test processes
• Explain the fine-tuning of language models for specific test tasks
• Explain LLMOps and its role in deploying and managing LLMs for test tasks

Learning Structure

This reading activity follows a structured learning flow:
1. Explore the architectural approaches and core components of LLM-powered test infrastructure, including front-end, back-end, and LLM integration concepts (Section 4.1.1)
2. Learn how RAG enhances AI-generated testing outputs through semantic retrieval and context-aware generation (Section 4.1.2)
3. Understand the role of LLM-powered agents in automating software test activities and supporting multi-step workflows (Section 4.1.3)
4. Examine how fine-tuning adapts LLMs and SLMs to organisation-specific test tasks and domain requirements (Section 4.2.1)
5. Understand how LLMOps supports the deployment, governance, monitoring, and maintenance of GenAI solutions in software test environments (Section 4.2.2)

4.1 Architectural Approaches for LLM-Powered Test Infrastructure

AI chatbots and LLM-powered test tools are two forms of systems that use LLMs in software testing. While chatbots focus mainly on conversation, LLM-powered test tools are designed to support end-to-end test activities.

4.1.1 Key Architectural Components and Concepts of LLM-Powered Test Infrastructure (K2)

An LLM-powered test infrastructure is a system that embeds an LLM into the test process to improve automation, reasoning, and decision-making. Unlike a traditional AI chatbot, which focuses primarily on user conversation, an LLM-powered test tool is built to process test-related queries, analyse requirements, generate test artefacts, support defect analysis, and evaluate outputs in alignment with the test basis.

To achieve this, the typical architecture uses a multi-component design that orchestrates secure, structured communication between testers, back-end systems, and the LLM. It consists of the following core components:
• Front-end. The front-end is the user interface, which may be a web dashboard, command line, test management tool plugin, or chatbot-style interface. It allows testers to enter prompts, upload test artefacts, or request specific test actions.
• Back-end. The back-end is responsible for all coordination and logic behind the scenes. It handles access control and authentication, prompt preprocessing (context gathering, formatting, sanitising), retrieval of relevant test artefacts, and sending properly structured prompts to the LLM. It also manages integration with external systems such as test management tools, CI/CD pipelines, or version-control systems.

• The LLM. It can be a third-party cloud-hosted model (accessed via API), or a custom, locally deployed model hosted within the organisation. The model generates responses based on structured prompts and receives only the information provided by the back-end.

LLM-powered test infrastructure extends far beyond a classic client-server design. Instead of simply routing messages, the system includes several intelligent processing layers:
1. The LLM acts as a reasoning engine. It does not behave like a fixed rule-based server. Instead, it interprets the test context, applies semantic understanding, and generates test insights from inputs such as test cases, requirements, logs, or code.
2. Dynamic generation instead of scripted responses. Rule-based chatbots rely on predefined flows. In contrast, an LLM-powered test tool supports flexible reasoning and generates contextually relevant outputs, even for tasks it has not been explicitly programmed for.
3. Multi-source data integration. The back-end typically connects to several data sources, such as relational databases for structured test data (test cases, execution results, user accounts, metadata) and vector databases (databases optimised for storing and querying high-dimensional vector representations of data) for semantic search and retrieval of relevant artefacts using embeddings. This allows the system to retrieve, filter, and assemble the right context before sending it to the LLM.
4. Post-processing enhancements. Raw LLM output often requires refinement before being presented to the user. The back-end can apply post-processing steps such as verifying format and structure, ensuring alignment with the test basis, filtering hallucinations, applying organisation-specific rules, or converting output into test case templates or automation-ready scripts. These steps improve consistency and ensure that generated content fits the organisation’s testing standards.
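The post-processing layer described above can be illustrated with a minimal sketch. It assumes the LLM was asked to return one test case as a JSON object; the field names in REQUIRED_FIELDS, the function post_process, and the requirement identifiers are hypothetical and stand in for an organisation's own template and test basis.

```python
import json

# Hypothetical template: the fields every generated test case must contain.
REQUIRED_FIELDS = ("id", "title", "preconditions", "steps", "expected_result")


def post_process(raw_llm_output: str, known_requirement_ids: set) -> dict:
    """Validate raw LLM output and return a normalised test case.

    Raises ValueError when the output is not usable, so the back-end can
    retry the prompt or route the item to a human reviewer.
    """
    # 1. Verify format: the output must be a single JSON object.
    try:
        case = json.loads(raw_llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(case, dict):
        raise ValueError("output must be a JSON object")

    # 2. Verify structure: all template fields must be present.
    missing = [field for field in REQUIRED_FIELDS if field not in case]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(case["steps"], list) or not case["steps"]:
        raise ValueError("steps must be a non-empty list")

    # 3. Check alignment with the test basis: reject references to
    #    requirements that do not exist (a simple hallucination filter).
    unknown = [r for r in case.get("requirements", []) if r not in known_requirement_ids]
    if unknown:
        raise ValueError(f"unknown requirement ids: {unknown}")

    # 4. Apply an organisation-specific rule (assumed here: trimmed titles).
    case["title"] = str(case["title"]).strip()
    return case
```

Rejecting output with an exception, rather than silently repairing it, keeps the decision about retrying or escalating to a human in the back-end, where it can be logged.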

4.1.2 Retrieval-Augmented Generation (K2)

Retrieval-Augmented Generation (RAG) enhances the capabilities of LLMs by supplying them with additional, up-to-date, and task-specific information. Instead of relying only on the knowledge the model learned during pre-training, RAG allows the system to retrieve relevant information from external sources and incorporate it into the generation process. This approach significantly improves the accuracy, relevance, and trustworthiness of the output, especially when working with organisation-specific test artefacts. It also allows the LLM to work with information far beyond what is contained in its training data or what fits into its context window.

RAG combines two main components:
1. A retrieval system that searches for information relevant to the user’s query.
2. A language model that uses this retrieved information to generate a grounded, context-aware response.

Before retrieval can work effectively, project data must be prepared. Large documents (requirement documents, test plans, user manuals) are split into smaller chunks, typically around 256–512 tokens each. This ensures the model can process the content within its context window and retrieve only the most relevant pieces. Each chunk is cleaned (removing noise, formatting issues, and irrelevant text) and encoded into a high-dimensional vector called an embedding. These embeddings are stored in a vector database, which enables fast, efficient retrieval based on semantic similarity rather than keyword matching. At runtime, a user query is also converted into an embedding, allowing the system to find the semantically closest chunks in the database.

A relevant response is one in which the LLM grounds its answer in retrieved, reliable, real-world data, integrates this information with its own general knowledge, and produces output that is accurate, precise, and directly aligned with the context of the query. This grounding substantially reduces hallucinations and increases the quality of testing-related outputs.
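The preparation and retrieval steps described in this section can be sketched in a few lines. This is an illustration under loud assumptions, not a production design: embed is a toy word-count vector standing in for a learned embedding model (so it matches shared words, not meaning), a plain Python list stands in for the vector database, and the names chunk_text, embed, cosine, retrieve, and build_prompt, as well as the sample requirements, are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list:
    """Split a document into chunks of at most max_words words.

    Words stand in for tokens here; a real pipeline would count tokens
    with the tokenizer of the embedding model.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list, top_k: int = 2) -> list:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, chunks: list) -> str:
    """Add the retrieved chunks to the prompt as context for the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Use only the context below.\n\nContext:\n{context}\n\nTask: {query}"


# Preparation: chunk the project documents and store (chunk, embedding) pairs.
documents = [
    "REQ-12: The login form locks the account after five failed password attempts.",
    "REQ-30: Exported reports are generated in PDF format within ten seconds.",
]
index = [(c, embed(c)) for doc in documents for c in chunk_text(doc)]

# Runtime: embed the query, retrieve the closest chunk, assemble the prompt.
query = "Design test cases for failed password attempts on the login form"
prompt = build_prompt(query, retrieve(query, index, top_k=1))
print(prompt)
```

Swapping embed for a real embedding model and the list for a vector database changes the quality and scale of retrieval, but the flow of chunk, embed, store, retrieve, and assemble stays the same.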

During user prompt processing, a RAG system follows a clear two-step pipeline:
1. Retrieval. The user’s query is encoded and matched against the vector database. Relevant chunks, such as requirements, code snippets, test reports, or design documents, are retrieved based on semantic similarity.
2. Generation. The retrieved chunks are appended as context to the prompt and sent to the LLM. The model then generates a response that combines the retrieved enterprise-specific data with the model’s own general knowledge.

This approach produces more accurate, evidence-based, and contextually appropriate output, because RAG allows the system to access enterprise data sources such as requirement repositories, test management systems, codebases, release notes, defect databases, API documentation, and architectural diagrams. With real-time retrieval, GenAI can perform test tasks, such as test analysis, test design, or coverage evaluation, based on the latest project information. This helps ensure alignment with the most current specifications, avoids outdated assumptions, and significantly improves the reliability of AI-generated test artefacts.

4.1.3 The Role of LLM-Powered Agents in Automating Test Processes (K2)

LLM-powered agents are specialised GenAI systems designed to carry out semi-autonomous or autonomous tasks. Unlike simple AI chatbots, which focus mainly on conversational question-and-answer flows, LLM-powered agents can reason, retrieve context, follow multi-step instructions, and take actions by interacting with external tools or systems.

At their foundation, these agents combine LLM capabilities (language understanding, reasoning, and generation), context retrieval (from RAG, databases, or APIs), and function execution (tools the agent can call to perform tasks). This makes them significantly more powerful and flexible than traditional conversational interfaces.

Instead of only generating text, LLM-powered agents can “act” by invoking predefined tools such as test management system APIs, file readers or writers, code execution tools, CI/CD pipeline commands, test automation frameworks, and data retrieval functions. This lets the agent perform operations like creating test cases, updating reports, running automated tests, or analysing logs.

The autonomy of these agents varies depending on the use case and risk level:
• Autonomous agents. Operate independently with minimal human oversight. They can use predefined rules, reinforcement learning, feedback loops, or self-directed reasoning. These agents can perform continuous tasks such as monitoring test results, triggering test runs, or maintaining test suites.
• Semi-autonomous agents. Execute tasks but include human checkpoints for validation. This is often preferred for higher-risk tasks where hallucinations or reasoning errors could lead to incorrect test cases, faulty automation code, or incorrect defect analyses.

A multi-agent setup involves several agents, each dedicated to a specialised function. Examples include one agent focused on requirement analysis, one on test design, one on automation code generation, one on execution scheduling, and one on reporting and summarisation. These agents collaborate by exchanging information and handing tasks to each other. This coordinated workflow is known as orchestration and often leads to more efficient and reliable outcomes than a single agent working alone.

LLM-powered agents can automate or support many complex testing tasks by emulating aspects of human reasoning. They can analyse requirements and detect gaps, generate test conditions and test cases, create or update automation scripts, run regression tests, evaluate logs and error messages, and prepare structured reports. In modern testing tools, these capabilities are often demonstrated through integrated AI assistants embedded directly into the workflow.

As a result, they act as digital test assistants capable of executing multi-step workflows in a semi-autonomous way, reducing manual effort and shortening feedback cycles, while shifting test automation from script-based execution to goal-driven, agent-based test automation.
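The difference between tool invocation with and without a human checkpoint can be sketched as follows. This is a simplified illustration, assuming the LLM has already proposed a plan as a list of (tool name, argument) steps; the tools run_regression and create_test_case, the Tool class, and execute_plan are hypothetical stand-ins and do not call any real test management system or automation framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    needs_approval: bool  # True for higher-risk actions


def run_regression(suite: str) -> str:
    # Hypothetical stand-in for a call to a test automation framework.
    return f"executed suite '{suite}'"


def create_test_case(title: str) -> str:
    # Hypothetical stand-in for a test management system API call.
    return f"created test case '{title}'"


# The predefined tools the agent is allowed to call.
TOOLS = {
    "run_regression": Tool("run_regression", run_regression, needs_approval=False),
    "create_test_case": Tool("create_test_case", create_test_case, needs_approval=True),
}


def execute_plan(plan, approve):
    """Execute (tool_name, argument) steps proposed by the LLM.

    `approve` is the human checkpoint: a callable that receives the step
    and returns True to allow it. Unknown tools are rejected, never guessed.
    """
    log = []
    for tool_name, argument in plan:
        tool = TOOLS.get(tool_name)
        if tool is None:
            log.append(f"rejected unknown tool '{tool_name}'")
        elif tool.needs_approval and not approve(tool_name, argument):
            log.append(f"skipped '{tool_name}' (not approved)")
        else:
            log.append(tool.run(argument))
    return log


# In a real agent the plan would come from the LLM's tool-call output.
plan = [
    ("run_regression", "smoke"),
    ("create_test_case", "Lockout after 5 failures"),
    ("delete_repo", "main"),
]
for line in execute_plan(plan, approve=lambda name, arg: False):
    print(line)
```

Setting needs_approval per tool lets the same agent run low-risk steps autonomously while higher-risk steps wait for a person, which mirrors the autonomous and semi-autonomous modes described above.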

Despite their capabilities, LLM-powered agents still inherit all the limitations of LLMs described in Section 3.1, including hallucinations, reasoning errors, biases, and non-deterministic behaviour. These issues can produce incorrect or misleading outputs, such as flawed test cases or faulty automation scripts, which undermine the reliability of automated test processes. To manage these risks, organisations can implement automated verification steps to validate agent outputs (e.g., syntax checkers, test runners, consistency checks), and use semi-autonomous agents for high-stakes or safety-critical tasks where human oversight is necessary.

Key Takeaways – 4.1
• LLM-powered test infrastructure integrates front-end interfaces, back-end orchestration layers, and LLMs to support intelligent, end-to-end software test activities
• Unlike rule-based chatbots, LLM-powered test tools use semantic reasoning and dynamic generation to produce context-aware testing outputs
• Back-end systems play a critical role by managing prompt processing, context retrieval, integrations with testing tools, and post-processing of LLM outputs
• Retrieval-Augmented Generation (RAG) improves the accuracy, relevance, and trustworthiness of AI-generated outputs by grounding responses in retrieved enterprise-specific data
• Vector databases and embeddings enable semantic retrieval of relevant artefacts such as requirements, test cases, logs, and design documents
• LLM-powered agents can perform semi-autonomous or autonomous test tasks by combining reasoning, context retrieval, and function execution capabilities
• Although LLM-powered agents can significantly automate testing workflows, organisations must manage risks such as hallucinations, reasoning errors, and non-deterministic behaviour through validation and human oversight

Reflection – 4.1
1. How can Retrieval-Augmented Generation (RAG) improve the quality and reliability of AI-generated test artefacts in your organisation?
2. Using examples from your own experience, what risks could arise from relying on autonomous or semi-autonomous AI agents in software testing, and how could these risks be mitigated?

4.2 Fine-Tuning and LLMOps: Operationalising Generative AI for Software Testing

Building an LLM-powered test infrastructure is not just about selecting a model and connecting it to a tool. To use GenAI reliably in day-to-day testing operations, two key practices are required:
• Fine-tuning LLMs to adapt them to testing-specific tasks and organisational context;
• Managing the full operational lifecycle of GenAI through LLMOps.

Together, these practices help transform GenAI from an experimental tool into a stable, governed, and production-ready testing capability.

4.2.1 Fine-Tuning LLMs for Test Tasks (K2)

Fine-tuning is the process of adapting a pre-trained language model, such as an LLM or an SLM, to perform specific tasks or to operate within a specific domain. Instead of training a model from scratch, fine-tuning continues its training on a targeted, domain-specific dataset. This allows the model to learn domain-specific terminology, organisation-specific formats, typical reasoning patterns, and specialised testing practices. As a result, the model becomes more accurate, more context-aware, and more relevant for the intended software testing use case.

In practice, fine-tuning is especially useful when a generic LLM:
• does not understand organisation-specific vocabulary,
• produces outputs in the wrong format,
• lacks domain-specific reasoning patterns.

Fine-tuning allows the model to adopt testing terminology used by the organisation, follow internal test case templates, reflect company-specific quality criteria, and align with domain rules (e.g., finance, healthcare, automotive).

Fine-tuning can be applied to both LLMs and SLMs. LLMs offer broad general intelligence but require significant computational resources when fine-tuned. SLMs are smaller, faster, and less resource-intensive. When fine-tuned, SLMs can achieve very high performance on narrow, well-defined tasks at a fraction of the cost. This gives organisations a choice: use LLMs when broad reasoning and language coverage are needed, and use fine-tuned SLMs when speed, cost-efficiency, and domain focus are critical.

Example 4.1. In a testing organisation, fine-tuning can be used to teach a model how to convert user stories into test cases, apply the organisation’s specific test case structure, use company-specific terms and abbreviations, and follow internal testing standards. By training the model on pairs of real user stories and the corresponding approved test cases, the model learns how that organisation expects testing artefacts to look and behave.

While fine-tuning is powerful, it introduces several important challenges:
• Avoiding biased or inaccurate results. The quality of the fine-tuned model depends directly on the quality of the training data. This is the garbage-in, garbage-out principle: if the training dataset contains errors, outdated practices, or bias, the model will learn and reproduce those flaws.
• Mitigating overfitting. Overfitting happens when the model becomes too specialised to the training data and performs poorly on new, unseen scenarios. A fine-tuned testing model must still generalise well across different applications, different requirement styles, and evolving system behaviour.
• Addressing opacity in model reasoning. LLMs often behave like black boxes. After fine-tuning, it may be difficult to explain why a model generated a specific test case or decision. This lack of transparency complicates debugging incorrect outputs, validating results in regulated environments, and building trust with stakeholders.
• Managing computational resource demands (for LLMs). Fine-tuning large models requires high-performance GPUs or AI accelerators, large volumes of storage, careful experiment tracking, and significant financial investment. This is one reason why fine-tuned SLMs are often preferred for operational test environments.
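The data preparation behind Example 4.1 and the first two challenges can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the record keys "prompt" and "completion" are an assumed format (the actual schema depends on the fine-tuning framework used), and build_dataset, to_jsonl, and the approved flag are hypothetical names.

```python
import json
import random


def build_dataset(pairs, validation_fraction=0.2, seed=0):
    """Turn (user_story, test_case, approved) triples into train/validation sets."""
    # Garbage in, garbage out: keep only approved, non-empty, de-duplicated pairs.
    seen, records = set(), []
    for story, test_case, approved in pairs:
        story, test_case = story.strip(), test_case.strip()
        if not approved or not story or not test_case or story in seen:
            continue
        seen.add(story)
        records.append({"prompt": story, "completion": test_case})

    # Hold out a validation set that the model never sees during training.
    # A widening gap between training and validation quality signals overfitting.
    random.Random(seed).shuffle(records)
    n_val = max(1, int(len(records) * validation_fraction)) if len(records) > 1 else 0
    return records[n_val:], records[:n_val]


def to_jsonl(records):
    """Serialise records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Filtering before splitting matters: if unapproved or duplicated pairs reach the validation set, the check for overfitting is measured against flawed or leaked examples.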

4.2.2 LLMOps when Deploying and Managing LLMs for Software Testing (K2)

LLMOps (Large Language Model Operations) refers to the set of structured practices, tools, and processes used to manage the development, deployment, monitoring, and maintenance of LLMs in production environments. In software testing, LLMOps ensures that GenAI solutions are operationally stable, secure and compliant, cost-controlled, and continuously monitored for quality and risk. Without LLMOps, GenAI usage typically remains experimental and unreliable, making it unsuitable for integration into enterprise-level test processes.

Organisations can introduce generative AI into their test processes using several different implementation approaches. Each approach leads to different LLMOps decisions related to governance, risk, cost, and technical management. Three common approaches are described below.

1. Using an AI chatbot. When GenAI is used through a general-purpose AI chatbot, the main operational considerations are data privacy and security, and cost optimisation. Organisations may choose between LLM-as-a-Service platforms, where the model is hosted by a third-party provider, and in-house deployments of open-source LLMs, which provide greater control over data and infrastructure. In both cases, a rigorous assessment of vendor guarantees or internal security capabilities is essential to mitigate data privacy and security risks, ensure compliance with regulations, and control operational expenses. Chatbots are typically used for ad-hoc testing support, such as requirements analysis, test idea generation, or exploratory testing assistance.

2. Using a test tool with built-in generative AI capabilities. In this approach, GenAI is embedded directly into a commercial or open-source test tool. The core LLMOps considerations are similar to those for chatbots: data privacy, security, infrastructure reliability, and operational cost. In addition, organisations must evaluate the data protection guarantees provided by the vendor, the performance and availability of the AI features, and the integration quality with existing test processes and toolchains. Since these tools typically augment existing test workflows, organisations must also conduct a thorough cost-benefit analysis and risk assessment to ensure that GenAI adoption creates real operational value.

3. In-house development of a GenAI-based test tool. This approach provides the highest degree of control, but also the highest operational responsibility. Key LLMOps considerations include full control over data privacy and security; careful planning for AI resource utilisation, such as computational resources, data storage, model maintenance, and staff training; and the establishment of formal processes for validating, monitoring, and maintaining GenAI components. Developing in-house solutions also requires strong expertise in machine learning infrastructure, secure model deployment, prompt engineering and fine-tuning, and LLM-powered test infrastructure design. This approach is typically adopted by large organisations with high confidentiality requirements or specialised testing needs.

These three approaches are not mutually exclusive. An organisation may, for example, use an AI chatbot for exploratory analysis, rely on a commercial GenAI-enabled test tool for daily test execution, and develop custom in-house tools for sensitive or business-critical systems. Furthermore, all approaches may incorporate additional technologies, such as RAG and fine-tuning of LLMs or SLMs, to further enhance the accuracy, adaptability, and relevance of GenAI within test processes.

Key Takeaways – 4.2
• Fine-tuning adapts LLMs and SLMs to organisation-specific test tasks, terminology, formats, and domain requirements, improving the relevance of AI-generated outputs
• Fine-tuned SLMs can provide efficient and cost-effective solutions for specialised test activities, while LLMs offer broader reasoning capabilities for more complex tasks
• The effectiveness of fine-tuning depends on high-quality training data and careful management of risks such as bias, overfitting, and lack of transparency
• LLMOps provides the operational framework needed to deploy, monitor, maintain, and govern GenAI solutions in software test environments

• Organisations can adopt GenAI in testing through chatbots, AI-enabled test tools, in-house solutions, or a combination of approaches depending on their security, cost, and operational requirements

Reflection – 4.2
1. How could fine-tuning improve the quality, consistency, or relevance of AI-generated testing outputs in your own projects?
2. Considering your organisation’s processes and constraints, what LLMOps practices would be most important to ensure that GenAI solutions remain secure, reliable, and maintainable in production test environments?

Key Takeaways and Summary
• LLM-powered test infrastructure combines front-end interfaces, back-end orchestration, and LLMs to support intelligent, end-to-end software testing activities beyond traditional chatbot interactions
• Modern LLM-powered testing systems use semantic reasoning, multi-source data integration, and post-processing mechanisms to generate more context-aware and reliable testing outputs
• Retrieval-Augmented Generation (RAG) improves the accuracy and trustworthiness of AI-generated results by retrieving relevant enterprise data through embeddings and vector databases before generating responses
• LLM-powered agents extend GenAI capabilities by reasoning, retrieving context, and interacting with external tools to automate or support testing workflows through autonomous or semi-autonomous actions
• Fine-tuning enables LLMs and SLMs to adapt to organisation-specific test terminology, formats, and domain requirements, improving the relevance and consistency of generated outputs

• LLMOps provides the operational practices needed to securely deploy, govern, monitor, and maintain GenAI solutions in production test environments, whether using chatbots, AI-enabled testing tools, in-house systems, or hybrid approaches

Reflection and Knowledge Check

Answer these questions after completing the reading:
1. How does an LLM-powered test infrastructure differ from a traditional chatbot or rule-based test tool in terms of architecture, reasoning, and automation capabilities?
2. How does semantic retrieval using embeddings and vector databases improve the effectiveness of Retrieval-Augmented Generation (RAG)?
3. Which test activities could benefit most from LLM-powered agents or AI-assisted automation, and where would human oversight still be required?
4. How can post-processing and validation mechanisms help reduce hallucinations and improve the reliability of AI-generated testing outputs?
5. Why can fine-tuned SLMs sometimes be more practical than large LLMs for operational test environments?
6. What risks and challenges must organisations manage when fine-tuning models for software test activities?

References
• ISTQB® Certified Tester Specialist Level Testing with Generative AI (CT-GenAI) Syllabus Version 1.1, 2026, https://istqb.org/?sdm_process_download=1&download_id=6295 (accessed May 2026)

Feedback and Evaluation

Learner feedback is collected to support continuous improvement of delivery and materials. Understanding is evaluated through:
• Chapter quiz covering key concepts from this chapter
• Q&A session to clarify questions arising from the activities and quiz
