
Chapter 4 – LLM-Powered Test. Slides

Format: Reading Materials (self-study or guided reading)
Estimated Duration: 110 minutes

Target Audience: Software Testers, Test Automation Engineers, Test Analysts, Test Managers, Software Developers and professionals who need a solid understanding of Generative AI (GenAI) in testing – project managers, quality managers, software development managers, business analysts, IT directors and consultants, professionals preparing for ISTQBⓇ CT-GenAI certification

During this chapter, you will:
•Understand the architecture and core components of LLM-powered test infrastructure
•Explain how Retrieval-Augmented Generation (RAG) improves the quality and reliability of AI-generated testing outputs
•Describe the role of LLM-powered agents in automating and supporting software test activities
•Understand how fine-tuning adapts LLMs and SLMs for organisation-specific test tasks
•Explain the purpose of LLMOps in managing, deploying, and governing GenAI solutions in software test environments

Join Software Testing Hub via Linkedin: https://www.linkedin.com/groups/16889021/
Join Software Testing Hub via Facebook: https://www.facebook.com/groups/746590458484807

May 27, 2026

Transcript

  1. BUILD SOFTWARE TO TEST SOFTWARE exactpro.com ISTQBⓇ CT-GenAI Chapter 4.

    LLM-Powered Test Infrastructure for Software Testing Iuliia Emelianova, Dmitrii Degtiarenko TRAINING COURSE ISTQBⓇ CT-GenAI COURSE V1.1
  2. 2 CONTENTS

    Learning Activity Overview (p. 4)
    Learning Objectives (p. 5)
    4.1 Architectural Approaches for LLM-Powered Test Infrastructure (p. 6)
    4.1.1 Key Architectural Components and Concepts of LLM-Powered Test Infrastructure (p. 8)
    4.1.2 Retrieval-Augmented Generation (p. 19)
    4.1.3 The Role of LLM-Powered Agents in Automating Test Processes (p. 27)
    Key Takeaways – 4.1 (p. 38)
    Reflection – 4.1 (p. 39)
    4.2 Fine-Tuning and LLMOps: Operationalising Generative AI for Software Testing (p. 40)
    4.2.1 Fine-Tuning LLMs for Test Tasks (p. 43)
    4.2.2 LLMOps when Deploying and Managing LLMs for Software Testing (p. 54)
    Key Takeaways – 4.2 (p. 63)
    Reflection – 4.2 (p. 64)
  3. 3 CONTENTS

    Key Takeaways and Summary (p. 65)
    Reflection and Knowledge Check (p. 66)
    References (p. 67)
    Feedback and Evaluation (p. 68)
  4. 4 Chapter 4 – LLM-Powered Test Infrastructure for Software Testing

    (ISTQBⓇ CT-GenAI v1.1) Format Reading Materials (self-study or guided reading) Estimated Duration 110 minutes Target Audience Software Testers, Test Automation Engineers, Test Analysts, Test Managers, Software Developers and professionals who need a solid understanding of Generative AI (GenAI) in testing – project managers, quality managers, software development managers, business analysts, IT directors and consultants, professionals preparing for ISTQBⓇ CT-GenAI certification Programme Context This learning activity forms a part of the ISTQBⓇ CT-GenAI training programme and aligns with the syllabus version 1.1 Engagement During this chapter, you will: • Understand the architecture and core components of LLM-powered test infrastructure • Explain how Retrieval-Augmented Generation (RAG) improves the quality and reliability of AI-generated testing outputs • Describe the role of LLM-powered agents in automating and supporting software test activities • Understand how fine-tuning adapts LLMs and SLMs for organisation-specific test tasks • Explain the purpose of LLMOps in managing, deploying, and governing GenAI solutions in software test environments LEARNING ACTIVITY OVERVIEW
  5. 5 By the end of this learning activity, participants will

    be able to: • Explain key architectural components and concepts of LLM-powered test infrastructure • Summarise Retrieval-Augmented Generation • Explain the role and application of LLM-powered agents in automating test processes • Explain the fine-tuning of language models for specific test tasks • Explain LLMOps and its role in deploying and managing LLMs for test tasks LEARNING OBJECTIVES
  6. 7 Sec. 4.1 SYSTEMS USING LLMS IN SOFTWARE

    TESTING AI Chatbots ✓ Focus mainly on conversation LLM-Powered Test Tools ✓ Support end-to-end test activities AI chatbots and LLM-powered test tools are two forms of systems that use LLMs in software testing. While chatbots focus mainly on conversation, LLM-powered test tools are designed to support end-to-end test activities.
  7. 9 Sec. 4.1.1 Test infrastructure: The test environments, test tools,

    office environment and procedures needed to perform testing LLM-POWERED TEST INFRASTRUCTURE Test Planning, Test Monitoring and Control, Test Analysis, Test Design, Test Implementation, Test Execution, Test Completion REASONS TO EMBED LLM ✓ Improved automation ✓ Improved reasoning ✓ Improved decision-making An LLM-powered test infrastructure is a system that embeds a Large Language Model into the test process to improve automation, reasoning, and decision-making.
  8. 10 Sec. 4.1.1 TASKS FOR LLM-POWERED TEST TOOL ✓ Process

    test-related queries ✓ Analyse requirements REQUIREMENTS ✓ Generate test artefacts ✓ Support defect analysis ✓ Evaluate outputs in alignment with the test basis Unlike a traditional AI chatbot, which focuses primarily on user conversation, an LLM-powered test tool is built to process test-related queries, analyse requirements, generate test artefacts, support defect analysis, and evaluate outputs in alignment with the test basis.
  9. 11 Sec. 4.1.1 MULTI-COMPONENT DESIGN To achieve this, the typical

    architecture uses a multi-component design that orchestrates secure, structured communication between testers, back-end systems, and the LLM.
  10. 12 Sec. 4.1.1 KEY ARCHITECTURAL COMPONENTS Front-End • Is the

    user interface: ◦ a web dashboard ◦ command line ◦ test management tool plugin ◦ chatbot-style interface • Allows testers to: ◦ enter prompts ◦ upload test artefacts ◦ request specific test actions It consists of the following core components: • Front-end. The front-end is the user interface, which may be a web dashboard, command line, test management tool plugin, or chatbot-style interface. It allows testers to enter prompts, upload test artefacts, or request specific test actions.
  11. 13 Sec. 4.1.1 KEY ARCHITECTURAL COMPONENTS Front-End Back-End • Is

    responsible for all coordination and logic behind the scenes • Handles: ◦ access control ◦ authentication ◦ prompt preprocessing ◦ retrieval of relevant test artefacts ◦ sending properly structured prompts to the LLM • Manages integration with external systems: ◦ test management tools ◦ CI/CD pipelines ◦ version-control systems • Back-end. The back-end is responsible for all coordination and logic behind the scenes. It handles access control and authentication, prompt preprocessing (context gathering, formatting, sanitising), retrieval of relevant test artefacts, and sending properly structured prompts to the LLM. It also manages integration with external systems such as test management tools, CI/CD pipelines, or version-control systems.
  12. 14 Sec. 4.1.1 KEY ARCHITECTURAL COMPONENTS Front-End Back-End LLM •

    Generates responses based on structured prompts • Receives only the information provided by the back-end Custom, locally deployed model hosted within the organisation Third-party cloud-hosted model • The LLM. It can be a third-party cloud-hosted model (accessed via API), or a custom, locally deployed model hosted within the organisation. The model generates responses based on structured prompts and receives only the information provided by the back-end.
  13. 15 Sec. 4.1.1 LLM-POWERED TEST INFRASTRUCTURE VS A CLASSIC CLIENT-SERVER

    DESIGN 1. THE LLM ACTS AS A REASONING ENGINE • Interprets the test context • Applies semantic understanding • Generates test insights from inputs such as: ◦ test cases ◦ requirements ◦ logs ◦ code LLM-powered test infrastructure extends far beyond a classic client-server design. Instead of simply routing messages, the system includes several intelligent processing layers: 1. The LLM acts as a reasoning engine. It does not behave like a fixed rule-based server. Instead, it interprets the test context, applies semantic understanding, and generates test insights from inputs such as test cases, requirements, logs, or code.
  14. 16 Sec. 4.1.1 LLM-POWERED TEST INFRASTRUCTURE VS A CLASSIC CLIENT-SERVER

    DESIGN 2. DYNAMIC GENERATION INSTEAD OF SCRIPTED RESPONSES • Supports flexible reasoning • Generates contextually relevant outputs 2. Dynamic generation instead of scripted responses. Rule-based chatbots rely on predefined flows. In contrast, an LLM-powered test tool supports flexible reasoning and generates contextually relevant outputs, even for tasks it has not been explicitly programmed for.
  15. 17 Sec. 4.1.1 LLM-POWERED TEST INFRASTRUCTURE VS A CLASSIC CLIENT-SERVER

    DESIGN 3. MULTI-SOURCE DATA INTEGRATION • The back-end connects to several data sources: ◦ relational databases for structured test data ◦ vector databases for semantic search and retrieval of relevant artefacts using embeddings • The right context is retrieved, filtered, and assembled before sending to the LLM Vector database: A database optimised for storing and querying high-dimensional vector representations of data 3. Multi-source data integration. The back-end typically connects to several data sources, such as relational databases for structured test data (test cases, execution results, user accounts, metadata) and vector databases (databases optimised for storing and querying high-dimensional vector representations of data) for semantic search and retrieval of relevant artefacts using embeddings. This allows the system to retrieve, filter, and assemble the right context before sending it to the LLM.
  16. 18 Sec. 4.1.1 LLM-POWERED TEST INFRASTRUCTURE VS A CLASSIC CLIENT-SERVER

    DESIGN 4. POST-PROCESSING ENHANCEMENTS • The back-end applies post-processing steps: ◦ verifying format and structure ◦ ensuring alignment with the test basis ◦ filtering hallucinations ◦ applying organisation-specific rules ◦ converting output into test case templates or automation-ready scripts 4. Post-processing enhancements. Raw LLM output often requires refinement before being presented to the user. The back-end can apply post-processing steps such as verifying format and structure, ensuring alignment with the test basis, filtering hallucinations, applying organisation-specific rules, or converting output into test case templates or automation-ready scripts. These steps improve consistency and ensure that generated content fits the organisation’s testing standards.
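The format-and-structure check described above can be sketched as follows. This is a minimal illustration, assuming the LLM was asked to return a single test case as JSON; the field names in `REQUIRED_FIELDS` and the `validate_test_case` helper are hypothetical and not taken from the course material.

```python
import json

# Hypothetical required fields of an organisation's test case template.
REQUIRED_FIELDS = ("id", "title", "preconditions", "steps", "expected_result")


def validate_test_case(raw_output: str) -> tuple[bool, list[str]]:
    """Check that raw LLM output is valid JSON with the expected structure."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value must be a JSON object"]
    # Collect every structural problem so the caller can report them all at once.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if "steps" in data and not (isinstance(data["steps"], list) and data["steps"]):
        problems.append("steps must be a non-empty list")
    return not problems, problems
```

A back-end could reject or re-prompt whenever the first element of the result is `False`. Checks for alignment with the test basis or for hallucinated content would need additional, domain-specific logic that this sketch does not attempt.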
  17. 20 Sec. 4.1.2 Retrieval-augmented generation (RAG): A technique combining LLM

    capabilities with a retriever to fetch relevant data for generating accurate, contextually relevant responses RETRIEVAL-AUGMENTED GENERATION PROS ✓ Improved accuracy ✓ Improved relevance ✓ Improved trustworthiness of the output ✓ Working with information far beyond what fits into the training data or context window Retrieval-Augmented Generation (RAG) enhances the capabilities of LLMs by supplying them with additional, up-to-date, and task-specific information. Instead of relying only on the knowledge the model learned during pre-training, RAG allows the system to retrieve relevant information from external sources and incorporate it into the generation process. This approach significantly improves the accuracy, relevance, and trustworthiness of the output, especially when working with organisation-specific test artefacts, because it allows the LLM to work with information far beyond what fits into its training data or context window.
  18. 21 Sec. 4.1.2 A Retrieval System A Language Model •

    Uses the retrieved information to generate a grounded, context-aware response RAG COMPONENTS • Searches for information relevant to the user’s query RAG combines two main components: 1. A retrieval system that searches for information relevant to the user’s query. 2. A language model that uses this retrieved information to generate a grounded, context-aware response.
  19. 22 Sec. 4.1.2 REQUIREMENTS USER QUERY OUTPUT 1. Large documents

    are split into smaller chunks (around 256-512 tokens each). 2. Each chunk is cleaned. 3. Encoding into a high-dimensional vector (embedding). 4. A user query is converted into an embedding. 5. The semantically closest chunks are found in the database. 6. LLM integrates this information with its own general knowledge. 7. The output is produced. RAG ✓ Reduced hallucinations ✓ Increased quality of testing-related outputs PROS Before retrieval can work effectively, project data must be prepared. Large documents (requirement documents, test plans, user manuals) are split into smaller chunks, typically around 256-512 tokens each. This ensures the model can process the content within its context window and retrieve only the most relevant pieces. Each chunk is cleaned (removing noise, formatting issues, and irrelevant text) and encoded into a high-dimensional vector called an embedding. These embeddings are stored in a vector database, which enables fast, efficient retrieval based on semantic similarity rather than keyword matching. At runtime, a user query is also converted into an embedding, allowing the system to find the semantically closest chunks in the database. A relevant response is one in which the LLM grounds its answer in retrieved, reliable, real-world data, integrates this information with its own general knowledge, and produces output that is accurate, precise, and directly aligned with the context of the query. This grounding dramatically reduces hallucinations and increases the quality of testing-related outputs.
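The preparation and lookup steps above (chunk, embed, compare, rank) can be sketched in a few lines. This is only an illustration of the mechanics under loud assumptions: the `embed` function is a toy word-count vector, which is really keyword matching, whereas a production system would call a trained embedding model and store vectors in a vector database. Whitespace-separated words stand in for tokens, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a real system uses an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

Swapping `embed` for a real embedding model is what turns this from keyword overlap into the semantic similarity search the text describes; the ranking logic stays the same.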
  20. 23 Sec. 4.1.2 USER PROMPT PROCESSING STEPS Retrieval • The

    user’s query is encoded and matched against the vector database • Relevant chunks are retrieved based on semantic similarity: ◦ requirements ◦ code snippets ◦ test reports ◦ design documents During user prompt processing, a RAG system follows a clear two-step pipeline: 1. Retrieval. The user’s query is encoded and matched against the vector database. Relevant chunks, such as requirements, code snippets, test reports, or design documents, are retrieved based on semantic similarity.
  21. 24 Sec. 4.1.2 USER PROMPT PROCESSING STEPS Retrieval • The

    retrieved chunks are appended as context to the prompt and sent to the LLM • The model generates a response that combines: ◦ the retrieved enterprise-specific data ◦ the model’s own general knowledge Generation 2. Generation. The retrieved chunks are appended as context to the prompt and sent to the LLM. The model then generates a response that combines the retrieved enterprise-specific data and the model’s own general knowledge.
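The generation step can be illustrated by the prompt assembly alone. A minimal sketch, assuming a plain-text prompt; the wording of the instruction and the `build_rag_prompt` name are hypothetical, and the call to the LLM itself is left out because it depends on the chosen provider.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks (as numbered context) with the user's question."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1))
    return (
        "Answer using the context below. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks is a design choice that lets the model, or a later post-processing step, cite which piece of context supports each statement.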
  22. 25 Sec. 4.1.2 FEATURE BENEFIT • Access to enterprise data

    sources: ◦ requirement repositories ◦ test management systems ◦ codebases ◦ release notes ◦ defect databases ◦ API documentation ◦ architectural diagrams • Produced output is: ◦ more accurate ◦ more evidence-based ◦ more contextually appropriate • Real-time retrieval • Ability to use the latest project information to perform test tasks: ◦ test analysis ◦ test design ◦ coverage evaluation RAG This approach produces more accurate, evidence-based, and contextually appropriate output, because RAG allows the system to access enterprise data sources such as requirement repositories, test management systems, codebases, release notes, defect databases, API documentation, and architectural diagrams. With real-time retrieval, GenAI can perform test tasks, such as test analysis, test design, or coverage evaluation, based on the latest project information.
  23. 26 Sec. 4.1.2 BENEFITS OF USING RAG Ensuring alignment with

    the most current specifications Avoiding outdated assumptions Improving the reliability of AI-generated test artefacts This helps ensure alignment with the most current specifications, avoids outdated assumptions, and significantly improves the reliability of AI-generated test artefacts.
  24. 28 Sec. 4.1.3 AI CHATBOTS VS LLM-POWERED AGENTS

    LLM-powered agent: An application that integrates LLM reasoning, decision-making, and memory, using tools to perform tasks AI CHATBOT • Main focus on conversational question-and-answer flows LLM-POWERED AGENT • Ability to: ◦ reason ◦ retrieve context ◦ follow multi-step instructions ◦ take actions by interacting with external tools or systems LLM-powered agents are specialised GenAI systems designed to carry out semi-autonomous or autonomous tasks. Unlike simple AI chatbots, which focus mainly on conversational question-and-answer flows, LLM-powered agents can reason, retrieve context, follow multi-step instructions, and take actions by interacting with external tools or systems.
  25. 29 Sec. 4.1.3 FEATURES OF LLM-POWERED AGENTS LLM Capabilities Context

    Retrieval Function Execution • Language understanding • Reasoning • Generation From: • RAG • Databases • APIs • Tools the agent can call to perform tasks At their foundation, these agents combine LLM capabilities (language understanding, reasoning, and generation), context retrieval (from RAG, databases, or APIs), and function execution (tools the agent can call to perform tasks). This makes them significantly more powerful and flexible than traditional conversational interfaces.
  26. 30 Sec. 4.1.3 INVOCABLE TOOLS POSSIBLE OPERATIONS • Test management

    system APIs • File readers/writers • Code execution tools • CI/CD pipeline commands • Test automation frameworks • Data retrieval functions • Creating test cases • Updating reports • Running automated tests • Analysing logs LLM-POWERED AGENTS Instead of only generating text, LLM-powered agents can “act” by invoking predefined tools such as test management system APIs, file readers or writers, code execution tools, CI/CD pipeline commands, test automation frameworks, and data retrieval functions. This lets the agent perform operations like creating test cases, updating reports, running automated tests, or analysing logs.
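Tool invocation of the kind described above is often implemented as a registry of allowed functions plus a dispatcher. The sketch below assumes the LLM emits a structured action such as `{"tool": ..., "args": {...}}`; that action format, the registry, and both example tools are hypothetical stand-ins for real test management or log analysis integrations.

```python
from typing import Callable

# Hypothetical tool registry: names map to plain Python functions.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function as a tool the agent is allowed to call."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("create_test_case")
def create_test_case(title: str) -> str:
    # Stand-in for a call to a test management system API.
    return f"created test case: {title}"


@tool("analyse_log")
def analyse_log(text: str) -> str:
    # Stand-in for log analysis: count lines that contain "ERROR".
    errors = [line for line in text.splitlines() if "ERROR" in line]
    return f"{len(errors)} error line(s) found"


def dispatch(action: dict) -> str:
    """Execute a tool call of the form {"tool": name, "args": {...}}."""
    name = action.get("tool")
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**action.get("args", {}))
```

Restricting the agent to registered tools keeps its actions predefined, as the text describes: a request for anything outside the registry is refused rather than executed.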
  27. 31 Sec. 4.1.3 LLM-POWERED AGENTS

    AUTONOMOUS AGENT • Operate independently with minimal human oversight • Can use: ◦ predefined rules ◦ reinforcement learning ◦ feedback loops ◦ self-directed reasoning • Can perform continuous tasks: ◦ monitoring test results ◦ triggering test runs ◦ maintaining test suites SEMI-AUTONOMOUS AGENT • Execute tasks but include human checkpoints for validation • Preferred for higher-risk tasks where hallucinations or reasoning errors could lead to: ◦ incorrect test cases ◦ faulty automation code ◦ incorrect defect analyses The autonomy of these agents varies depending on the use case and risk level: • Autonomous agents. Operate independently with minimal human oversight. They can use predefined rules, reinforcement learning, feedback loops, or self-directed reasoning. These agents can perform continuous tasks such as monitoring test results, triggering test runs, or maintaining test suites. • Semi-autonomous agents. Execute tasks but include human checkpoints for validation. This is often preferred for higher-risk tasks where hallucinations or reasoning errors could lead to incorrect test cases, faulty automation code, or incorrect defect analyses.
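The difference between the two autonomy levels can be reduced to where a human checkpoint sits. A minimal sketch, assuming a fixed set of high-risk action names and an `approve` callback that represents the human reviewer; both are hypothetical.

```python
from typing import Callable

# Hypothetical set of actions considered high-risk in this organisation.
HIGH_RISK_ACTIONS = {"update_test_suite", "commit_automation_code"}


def run_action(action: str, execute: Callable[[], str], approve: Callable[[str], bool]) -> str:
    """Run low-risk actions directly; ask a human before high-risk ones."""
    if action in HIGH_RISK_ACTIONS and not approve(action):
        return f"{action}: rejected by reviewer"
    return execute()
```

A fully autonomous agent corresponds to an empty `HIGH_RISK_ACTIONS` set; adding actions to the set moves the agent toward semi-autonomous operation.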
  28. 32 Sec. 4.1.3 MULTI-AGENT SETUP

    [Figure: five agents, labelled REQUIREMENT ANALYSIS, TEST DESIGN, AUTOMATION CODE GENERATION, EXECUTION SCHEDULING, and REPORTING AND SUMMARISATION] A multi-agent setup involves several agents, each dedicated to a specialised function. Examples include one agent focused on requirement analysis, one on test design, one on automation code generation, one on execution scheduling, and one on reporting and summarisation.
  29. 33 Sec. 4.1.3 ORCHESTRATION

    [Figure: five agents, labelled REQUIREMENT ANALYSIS, TEST DESIGN, AUTOMATION CODE GENERATION, EXECUTION SCHEDULING, and REPORTING AND SUMMARISATION] These agents collaborate by exchanging information and handing tasks to each other. This coordinated workflow is known as orchestration and often leads to more efficient and reliable outcomes than a single agent working alone.
  30. 34 Sec. 4.1.3 TASKS FOR LLM-POWERED AGENTS Analysing requirements Detecting

    gaps Generating test conditions and test cases Creating/updating automation scripts Running regression tests Evaluating logs and error messages Preparing structured test reports LLM-powered agents can automate or support many complex testing tasks by emulating aspects of human reasoning. They can analyse requirements and detect gaps, generate test conditions and test cases, create or update automation scripts, run regression tests, evaluate logs and error messages, and prepare structured reports. In modern testing tools, these capabilities are often demonstrated through integrated AI assistants embedded directly into the workflow.
  31. 35 Sec. 4.1.3 LLM-POWERED AGENTS AS DIGITAL TEST ASSISTANTS •

    Executing multi-step workflows in a semi-autonomous way • Reducing manual effort • Shortening feedback cycles • Shifting test automation: ◦ from script-based execution ◦ to goal-driven, agent-based test automation SCRIPT-BASED EXECUTION GOAL-DRIVEN, AGENT-BASED TEST AUTOMATION As a result, they act as digital test assistants capable of executing multi-step workflows in a semi-autonomous way, reducing manual effort and shortening feedback cycles, while shifting test automation from script-based execution to goal-driven, agent-based test automation.
  32. 36 Sec. 4.1.3 LIMITATIONS Hallucinations Reasoning Errors Non-Deterministic Behaviour LLM-POWERED

    AGENTS Biases Lead to incorrect or misleading outputs: • flawed test cases • faulty automation scripts Despite their capabilities, LLM-powered agents still inherit all the limitations of LLMs described in Section 3.1, including hallucinations, reasoning errors, biases, and non-deterministic behaviour. These issues can produce incorrect or misleading outputs, such as flawed test cases or faulty automation scripts, which undermine the reliability of automated test processes.
  33. 37 HOW TO MANAGE RISKS Implement automated verification steps to

    validate agent outputs: • syntax checkers • test runners • consistency checks Use semi-autonomous agents for high-stakes or safety-critical tasks To manage these risks, organisations can implement automated verification steps to validate agent outputs (e.g., syntax checkers, test runners, consistency checks), and use semi-autonomous agents for high-stakes or safety-critical tasks where human oversight is necessary.
  34. 38 • LLM-powered test infrastructure integrates front-end interfaces, back-end orchestration

    layers, and LLMs to support intelligent, end-to-end software test activities • Unlike rule-based chatbots, LLM-powered test tools use semantic reasoning and dynamic generation to produce context-aware testing outputs • Back-end systems play a critical role by managing prompt processing, context retrieval, integrations with testing tools, and post-processing of LLM outputs • Retrieval-Augmented Generation (RAG) improves the accuracy, relevance, and trustworthiness of AI-generated outputs by grounding responses in retrieved enterprise-specific data • Vector databases and embeddings enable semantic retrieval of relevant artefacts such as requirements, test cases, logs, and design documents • LLM-powered agents can perform semi-autonomous or autonomous test tasks by combining reasoning, context retrieval, and function execution capabilities • Although LLM-powered agents can significantly automate testing workflows, organisations must manage risks such as hallucinations, reasoning errors, and non-deterministic behaviour through validation and human oversight KEY TAKEAWAYS – 4.1
  35. 39 1. How can Retrieval-Augmented Generation (RAG) improve the quality

    and reliability of AI-generated test artefacts in your organisation? 2. Using examples from your own experience, what risks could arise from relying on autonomous or semi-autonomous AI agents in software testing, and how could these risks be mitigated? REFLECTION – 4.1
  36. 41 Sec. 4.2 BUILDING AN LLM-POWERED TEST INFRASTRUCTURE KEY PRACTICES

    Fine-Tuning LLMs LLMOps • Adapting to testing-specific tasks and organisational context • Managing the full operational lifecycle of GenAI Building an LLM-powered test infrastructure is not just about selecting a model and connecting it to a tool. To use GenAI reliably in day-to-day testing operations, two key practices are required: • Fine-tuning LLMs to adapt them to testing-specific tasks and organisational context • Managing the full operational lifecycle of GenAI through LLMOps
  37. 42 Sec. 4.2 BUILDING AN LLM-POWERED TEST INFRASTRUCTURE KEY PRACTICES

    Fine-Tuning LLMs LLMOps Transform GenAI from an experimental tool into a stable, governed, and production-ready testing capability Together, these practices help transform GenAI from an experimental tool into a stable, governed, and production-ready testing capability.
  38. 44 Sec. 4.2.1 Fine-tuning: A supervised learning process using a

    dataset of labelled examples to update LLM weights and adapt them for specific tasks or domains FINE-TUNING LLMS • Training on a targeted, domain-specific dataset WHAT CAN MODEL LEARN ✓ Domain-specific terminology ✓ Organisation-specific formats ✓ Typical reasoning patterns ✓ Specialised testing practices MODEL BECOMES MORE… ✓ Accurate ✓ Context-aware ✓ Relevant Fine-tuning is the process of adapting a pre-trained language model, such as an LLM or an SLM, to perform specific tasks or to operate within a specific domain. Instead of training a model from scratch, fine-tuning continues its training on a targeted, domain-specific dataset. This allows the model to learn domain-specific terminology, organisation-specific formats, typical reasoning patterns, and specialised testing practices. As a result, the model becomes more accurate, more context-aware, and more relevant for the intended software testing use case.
  39. 45 Sec. 4.2.1 WHEN FINE-TUNING IS USEFUL LLM does not

    understand organisation-specific vocabulary ! LLM produces outputs in the wrong format ! LLM lacks domain-specific reasoning patterns ! In practice, fine-tuning is especially useful when a generic LLM: • does not understand organisation-specific vocabulary, • produces outputs in the wrong format, • or lacks domain-specific reasoning patterns.
  40. 46 Sec. 4.2.1 WHEN FINE-TUNING IS USEFUL LLM does not

    understand organisation-specific vocabulary ! LLM produces outputs in the wrong format ! LLM lacks domain-specific reasoning patterns ! LLM adopts testing terminology used by the organisation LLM follows internal test case templates LLM reflects company-specific quality criteria and aligns with domain rules Fine-tuning allows the model to adopt testing terminology used by the organisation, follow internal test case templates, reflect company-specific quality criteria, and align with domain rules (e.g., finance, healthcare, automotive).
  41. 47 Sec. 4.2.1 MODEL BEING FINE-TUNED

    SLM • Is smaller and faster • Is less resource-intensive • High performance on narrow, well-defined tasks at a fraction of the cost LLM • Offers broad general intelligence • Requires significant computational resources Fine-tuning can be applied to both LLMs and SLMs. LLMs offer broad general intelligence but require significant computational resources when fine-tuned. SLMs are smaller, faster, and less resource-intensive. When fine-tuned, SLMs can achieve very high performance on narrow, well-defined tasks at a fraction of the cost.
  42. 48 Sec. 4.2.1 MODEL BEING FINE-TUNED SLM LLM • Used

    when needed: ◦ broad reasoning ◦ language coverage • Used when needed: ◦ speed ◦ cost-efficiency ◦ domain focus This provides a certain degree of flexibility. We can use LLMs when broad reasoning and language coverage are needed, and use fine-tuned SLMs when speed, cost-efficiency, and domain focus are critical.
  43. 49 Sec. 4.2.1 TASKS WHERE FINE-TUNING CAN BE USED

    ✓ Convert user stories into test cases ✓ Apply the organisation’s specific test case structure: Test Case ID • Title • Environment Details • Priority/Severity • Description • Preconditions • Test Data • Test Steps • Expected Result • Actual Result • Status ✓ Use company-specific terms and abbreviations (for example: OTE, Yield Curve, Maturity, Settlement, PnL, Initial Margin, Forward, Option, Futures, Discount Factor, SPAN) ✓ Follow internal testing standards For example, in a testing organisation, fine-tuning can be used to teach a model how to convert user stories into test cases, apply the organisation’s specific test case structure, use company-specific terms and abbreviations, and follow internal testing standards. By training the model on pairs of real user stories and the corresponding approved test cases, the model learns how that organisation expects testing artefacts to look and behave.
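Training pairs like those described above are commonly stored one example per line (JSONL). A minimal sketch, assuming a `prompt`/`completion` record layout; the exact field names and file format depend on the fine-tuning tooling in use, so treat both as hypothetical.

```python
import json


def to_jsonl(pairs: list[tuple[str, dict]]) -> str:
    """Serialise (user story, approved test case) pairs as one JSON object per line."""
    lines = []
    for story, test_case in pairs:
        # The approved test case is stored as text the model should learn to produce.
        record = {"prompt": story, "completion": json.dumps(test_case, sort_keys=True)}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Only approved, reviewed test cases belong in such a dataset, since the model will reproduce whatever structure and wording the examples contain.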
  44. 50 Sec. 4.2.1 FINE-TUNING CHALLENGES 1. AVOIDING BIASED OR INACCURATE

    RESULTS • The quality of the model depends on the quality of the training data • The training dataset should be free of: ◦ errors ◦ outdated practices ◦ bias While fine-tuning is powerful, it introduces several important challenges: 1. Avoiding biased or inaccurate results. The quality of the fine-tuned model depends directly on the quality of the training data. It’s a garbage-in-garbage-out principle. If the training dataset contains errors, outdated practices, or bias, the model will learn and reproduce those flaws.
  45. 51 Sec. 4.2.1 FINE-TUNING CHALLENGES 2. MITIGATING OVERFITTING • Model

    must generalise well across: ◦ different applications ◦ different requirement styles ◦ evolving system behaviour Overfitting: The generation of an ML model that corresponds too closely to the training dataset, resulting in a model that finds it difficult to generalise to new data 2. Mitigating overfitting. Overfitting happens when the model becomes too specialised to the training data and performs poorly on new, unseen scenarios. A fine-tuned testing model must still generalise well across different applications, different requirement styles, and evolving system behaviour.
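A common way to detect overfitting is to hold back part of the dataset and compare scores on seen and unseen examples. The sketch below assumes some quality score between 0 and 1 is available for both sets; the 0.1 gap threshold and both helper names are hypothetical choices, not recommendations from the course.

```python
import random


def split_dataset(examples: list, holdout_fraction: float = 0.2, seed: int = 0) -> tuple[list, list]:
    """Shuffle deterministically and reserve a hold-out set never used for training."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def looks_overfitted(train_score: float, holdout_score: float, max_gap: float = 0.1) -> bool:
    """Flag a model whose training score exceeds its hold-out score by more than max_gap."""
    return (train_score - holdout_score) > max_gap
```

For a testing model, the hold-out set would ideally include user stories from applications and requirement styles that do not appear in the training portion at all.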
  46. 52 Sec. 4.2.1 FINE-TUNING CHALLENGES 3. ADDRESSING OPACITY IN MODEL

    REASONING • LLMs often behave like black boxes • Lack of transparency WHAT IS COMPLICATED ✓ Explaining why a model generated a specific test case or decision ✓ Debugging incorrect outputs ✓ Validating results in regulated environments ✓ Building trust with stakeholders 3. Addressing opacity in model reasoning. LLMs often behave like black boxes. After fine-tuning, it may be difficult to explain why a model generated a specific test case or decision. This lack of transparency complicates debugging incorrect outputs, validating results in regulated environments, and building trust with stakeholders.
  47. 53 Sec. 4.2.1 FINE-TUNING CHALLENGES 4. MANAGING COMPUTATIONAL RESOURCE DEMANDS

    (FOR LLMS) • Required for LLMs: ◦ high-performance GPUs ◦ AI accelerators ◦ large volumes of storage ◦ careful experiment tracking ◦ significant financial investment • SLMs are often preferred for operational test environments 4. Managing computational resource demands (for LLMs). Fine-tuning large models requires high-performance GPUs or AI accelerators, large volumes of storage, careful experiment tracking, and significant financial investment. This is one reason why fine-tuned SLMs are often preferred for operational test environments.
  48. 55 Sec. 4.2.2 Large Language Model Operations (LLMOps): Practices and

    tools focused on deploying, monitoring, and maintaining LLMs in production environments LLMOps • Ensuring that GenAI solutions are: ◦ operationally stable ◦ secure ◦ compliant ◦ cost-controlled ◦ continuously monitored for quality and risk LLMOps (Large Language Model Operations) refers to the set of structured practices, tools, and processes used to manage the development, deployment, monitoring, and maintenance of LLMs in production environments. In software testing, LLMOps ensures that GenAI solutions are operationally stable, secure and compliant, cost-controlled, and continuously monitored for quality and risk. Without LLMOps, GenAI usage typically remains experimental and unreliable, making it unsuitable for integration into enterprise-level test processes.
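To make "cost-controlled and continuously monitored for quality" more concrete, the sketch below aggregates a log of LLM calls into a token count, an estimated cost, and a reviewer acceptance rate. The `LlmCall` record, the `summarise` function, and the single flat price per 1,000 tokens are all simplifying assumptions for illustration; real pricing and quality metrics vary by provider and organisation.

```python
from dataclasses import dataclass


@dataclass
class LlmCall:
    """One logged LLM request (hypothetical record layout)."""
    prompt_tokens: int
    completion_tokens: int
    accepted_by_reviewer: bool  # did a tester accept the generated output?


def summarise(calls, price_per_1k_tokens):
    """Aggregate basic cost and quality indicators over logged calls."""
    total_tokens = sum(c.prompt_tokens + c.completion_tokens for c in calls)
    accepted = sum(1 for c in calls if c.accepted_by_reviewer)
    return {
        "total_tokens": total_tokens,
        "estimated_cost": total_tokens / 1000 * price_per_1k_tokens,
        "acceptance_rate": accepted / len(calls) if calls else 0.0,
    }
```

Tracking such figures over time lets a team spot cost growth or a falling acceptance rate before they affect the test process.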
  49. 56 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION 1. USING AN

    AI CHATBOT DEPLOYMENT TYPES LLM-as-a-Service Platforms • The model is hosted by a third-party provider • Vendor guarantees are essential In-House Deployments of Open-Source LLMs • Provide greater control over data and infrastructure • Internal security capabilities are essential THE MAIN CONSIDERATIONS ✓ Data privacy ✓ Security ✓ Operational cost Organisations can introduce generative AI into their test processes using several different implementation approaches. Each approach leads to different LLMOps decisions related to governance, risk, cost, and technical management. Three common approaches are described below. 1. Using an AI chatbot. When GenAI is used through a general-purpose AI chatbot, the main operational considerations are data privacy, security, and cost optimisation. Organisations may choose between LLM-as-a-Service platforms, where the model is hosted by a third-party provider, or in-house deployments of open-source LLMs, which provide greater control over data and infrastructure. In both cases, a rigorous assessment of vendor guarantees or internal security capabilities is essential to mitigate data privacy and security risks, ensure compliance with regulations, and control operational expenses.
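One simple data privacy safeguard when using a third-party chatbot is to mask obviously sensitive values before a prompt leaves the organisation. The sketch below is a minimal example using two assumed patterns (e-mail addresses and digit runs of eight or more characters); a real deployment would need patterns and policies agreed with the organisation's security team, and pattern-based masking alone does not guarantee compliance.

```python
import re

# Illustrative patterns only; they will not catch every sensitive value.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{8,}\b")  # e.g. account or card numbers


def redact(text: str) -> str:
    """Mask e-mail addresses and long digit sequences in a prompt."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)
```

For instance, `redact("Contact jane.doe@example.com about account 1234567890.")` yields `"Contact [EMAIL] about account [NUMBER]."`, while shorter numbers such as step counts are left untouched.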
  50. 57 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION 1. USING AN

    AI CHATBOT TYPICAL TASKS • Requirements Analysis • Test Idea Generation • Exploratory Testing Assistance Chatbots are typically used for ad-hoc testing support, such as requirements analysis, test idea generation, or exploratory testing assistance.
  51. 58 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION 2. USING A

    TEST TOOL WITH BUILT-IN GENERATIVE AI CAPABILITIES THE MAIN CONSIDERATIONS ✓ Data privacy ✓ Security ✓ Infrastructure reliability ✓ Operational cost TEST TOOL TYPES Open-Source Commercial What is essential: • Data protection guarantees provided by a vendor • Performance and availability of the AI features • Integration quality with existing test processes and toolchains • Conducting a thorough cost-benefit analysis and risk assessment 2. Using a test tool with built-in generative AI capabilities. In this approach, GenAI is embedded directly into a commercial or open-source test tool. The core LLMOps considerations are similar to those for chatbots: data privacy, security, infrastructure reliability, and operational cost. In addition, organisations must evaluate the data protection guarantees provided by the vendor, the performance and availability of the AI features, and the integration quality with existing test processes and toolchains. Since these tools typically augment existing test workflows, organisations must also conduct a thorough cost-benefit analysis and risk assessment to ensure that GenAI adoption creates real operational value.
  52. 59 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION 3. IN-HOUSE DEVELOPMENT

    OF A GENAI-BASED TEST TOOL • The highest degree of control • The highest operational responsibility 3. In-house development of a GenAI-based test tool. This approach provides the highest degree of control, but also the highest operational responsibility.
  53. 60 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION 3. IN-HOUSE DEVELOPMENT

    OF A GENAI-BASED TEST TOOL THE MAIN CONSIDERATIONS ✓ Data privacy and security: full control ✓ AI resource utilisation: careful planning • Computational resources • Data storage • Model maintenance • Staff training ✓ Formal processes for GenAI components: validating, monitoring, and maintaining ✓ Strong expertise is required in: • ML infrastructure • secure model deployment • prompt engineering • fine-tuning • LLM-powered test infrastructure design Key LLMOps considerations include full control over data privacy and security, careful planning for AI resource utilisation such as computational resources, data storage, model maintenance, and staff training, and the establishment of formal processes for validating, monitoring, and maintaining GenAI components. Developing in-house solutions also requires strong expertise in machine learning infrastructure, secure model deployment, prompt engineering and fine-tuning, and LLM-powered test infrastructure design.
  54. 61 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION 3. IN-HOUSE DEVELOPMENT

    OF A GENAI-BASED TEST TOOL WHO USES THIS APPROACH Large organisations with: • high confidentiality requirements • specialised testing needs THE MAIN CONSIDERATIONS ✓ Data privacy ✓ Security ✓ AI resource utilisation ✓ Formal processes for GenAI components ✓ Strong expertise This approach is typically adopted by large organisations with high confidentiality requirements or specialised testing needs. These three approaches are not mutually exclusive.
  55. 62 Sec. 4.2.2 APPROACHES TO LLMOPS IMPLEMENTATION EXPLORATORY

    ANALYSIS DAILY TEST EXECUTION SENSITIVE OR BUSINESS-CRITICAL SYSTEMS • RAG • Fine-tuning of LLMs or SLMs • Enhancing: ◦ accuracy ◦ adaptability ◦ relevance of GenAI within test processes An organisation may, for example, use an AI chatbot for exploratory analysis, rely on a commercial GenAI-enabled test tool for daily test execution, and develop custom in-house tools for sensitive or business-critical systems. Furthermore, all approaches may incorporate additional technologies, such as RAG and fine-tuning of LLMs or SLMs, to further enhance the accuracy, adaptability, and relevance of GenAI within test processes.
  56. 63 • Fine-tuning adapts LLMs and SLMs to organisation-specific test

    tasks, terminology, formats, and domain requirements, improving the relevance of AI-generated outputs • Fine-tuned SLMs can provide efficient and cost-effective solutions for specialised test activities, while LLMs offer broader reasoning capabilities for more complex tasks • The effectiveness of fine-tuning depends on high-quality training data and careful management of risks such as bias, overfitting, and lack of transparency • LLMOps provides the operational framework needed to deploy, monitor, maintain, and govern GenAI solutions in software test environments • Organisations can adopt GenAI in testing through chatbots, AI-enabled test tools, in-house solutions, or a combination of approaches depending on their security, cost, and operational requirements KEY TAKEAWAYS – 4.2
  57. 64 1. How could fine-tuning improve the quality, consistency, or

    relevance of AI-generated testing outputs in your own projects? 2. Considering your organisation’s processes and constraints, what LLMOps practices would be most important to ensure that GenAI solutions remain secure, reliable, and maintainable in production test environments? REFLECTION – 4.2
  58. 65 • LLM-powered test infrastructure combines front-end interfaces, back-end orchestration,

    and LLMs to support intelligent, end-to-end software testing activities beyond traditional chatbot interactions • Modern LLM-powered testing systems use semantic reasoning, multi-source data integration, and post-processing mechanisms to generate more context-aware and reliable testing outputs • Retrieval-Augmented Generation (RAG) improves the accuracy and trustworthiness of AI-generated results by retrieving relevant enterprise data through embeddings and vector databases before generating responses • LLM-powered agents extend GenAI capabilities by reasoning, retrieving context, and interacting with external tools to automate or support testing workflows through autonomous or semi-autonomous actions • Fine-tuning enables LLMs and SLMs to adapt to organisation-specific test terminology, formats, and domain requirements, improving the relevance and consistency of generated outputs • LLMOps provides the operational practices needed to securely deploy, govern, monitor, and maintain GenAI solutions in production test environments, whether using chatbots, AI-enabled testing tools, in-house systems, or hybrid approaches KEY TAKEAWAYS AND SUMMARY
  59. 66 Answer these questions after completing the reading: 1. How

    does an LLM-powered test infrastructure differ from a traditional chatbot or rule-based test tool in terms of architecture, reasoning, and automation capabilities? 2. How does semantic retrieval using embeddings and vector databases improve the effectiveness of Retrieval-Augmented Generation (RAG)? 3. Which test activities could benefit most from LLM-powered agents or AI-assisted automation, and where would human oversight still be required? 4. How can post-processing and validation mechanisms help reduce hallucinations and improve the reliability of AI-generated testing outputs? 5. Why can fine-tuned SLMs sometimes be more practical than large LLMs for operational test environments? 6. What risks and challenges must organisations manage when fine-tuning models for software test activities? (You should answer using examples from your own projects where possible.) REFLECTION AND KNOWLEDGE CHECK
  60. 67 • ISTQB® Certified Tester Specialist Level Testing with Generative

    AI (CT-GenAI) Syllabus Version 1.1, 2026 REFERENCES
  61. 68 Learner feedback is collected to support continuous improvement of

    delivery and materials. Understanding is evaluated through: • Chapter quiz covering key concepts from this chapter • Q&A session to clarify questions arising from the activities and quiz FEEDBACK AND EVALUATION