Chapter 2 – Prompt Engineering for Effective Software Testing (ISTQBⓇ CT-GenAI v1.1). Slides

BUILD SOFTWARE TO TEST SOFTWARE exactpro.com ISTQBⓇ CT-GenAI Chapter 2.
Prompt Engineering for Effective Software Testing Iuliia Emelianova, Dmitrii Degtiarenko TRAINING COURSE ISTQBⓇ CT-GenAI COURSE V1.1

2 Learning Activity Overview…………………………………………………….…………………………………………………………….…………...… 4 Learning Objectives………………………………………………………………………………………………..……………………….…….………………… 5 2.1 Effective
Prompt Development………………………………………………………………………………….….….……….…………. 6 2.1.1 Structure of Prompts for Generative AI in Software Testing……………………….…….…………. 15 2.1.2 Core Prompting Techniques for Software Testing.……………………………………………….….…………. 31 2.1.3 System Prompt and User Prompt………………………………………………………………………………......…………. 47 Key Takeaways – 2.1.………………………………………………………………………………………………………..…………………………… 60 Reﬂection – 2.1………………….…….…….…………………………………………………………………………………….………………………….. 61 2.2 Applying Prompt Engineering Techniques to Software Test Tasks.……………….…..…………. 62 2.2.1 Test Analysis with Generative AI………….…….…….………………………………………..……………...…..…………. 65 2.2.2 Test Design and Test Implementation with Generative AI………………..……..……………………. 79 2.2.3 Automated Regression Testing with Generative AI…………………………..……….……..…..…………. 93 2.2.4 Test Monitoring and Test Control with Generative AI…………………………..……..…………………… 112 2.2.5 Choosing Prompting Techniques for Software Testing……………………..…….………….……………. 124 CONTENTS

3 Key Takeaways – 2.2.………………………………………………………………………………………………………..…………………………… 128 Reflection – 2.2………………….…….…….…………………………………………………………………………………….………………………….. 129
2.3 Evaluate Generative AI Results and Refine Prompts for Software Test Tasks………… 130 2.3.1 Metrics for Evaluating the Results of Generative AI on Test Tasks...…………..……………. 133 2.3.2 Techniques for Evaluating and Iteratively Refining Prompts………………..….…………….……. 146 Key Takeaways – 2.3.………………………………………………………………………………………………………..…………………………… 155 Reflection – 2.3………………….…….…….…………………………………………………………………………………….………………………….. 156 Key Takeaways and Summary…………………………………………………….……………………………………………………………………… 157 Reflection and Knowledge Check…………………………………………………………………..……………………….………………………… 158 References…………………………………………………….………………………………………………………………………………………………….…………… 159 Feedback and Evaluation…………………………………………………………………………………..……………………….………………………… 160 CONTENTS

4 Chapter 2 – Prompt Engineering for Effective Software Testing
(ISTQBⓇ CT-GenAI v1.1) Format Reading Materials (self-study or guided reading) Estimated Duration 365 minutes Target Audience Software Testers, Test Automation Engineers, Test Analysts, Test Managers, Software Developers and professionals who need a solid understanding of Generative AI (GenAI) in testing – project managers, quality managers, software development managers, business analysts, IT directors and consultants, professionals preparing for ISTQBⓇ CT-GenAI certiﬁcation Programme Context This learning activity forms a part of the ISTQBⓇ CT-GenAI training programme and aligns with the syllabus version 1.1 Engagement During this chapter, you will: • Understand the fundamentals of prompt engineering for AI-assisted software testing • Learn how to structure effective prompts using role, context, instructions, constraints, and output formats • Explore prompt engineering techniques such as prompt chaining, few-shot prompting, and meta prompting • Apply GenAI to software testing tasks including test analysis, test design, regression testing, and test monitoring • Evaluate and reﬁne AI-generated outputs using quality metrics, feedback, and iterative improvement techniques LEARNING ACTIVITY OVERVIEW

5 By the end of this learning activity, participants will
be able to: • Give examples of the structure of prompts used in generative AI for software testing • Differentiate core prompting techniques for software testing • Distinguish between system prompts and user prompts • Apply generative AI to test analysis tasks • Apply generative AI to test design and test implementation tasks • Apply generative AI to automated regression testing • Apply generative AI to test monitoring and control task • Select and apply appropriate prompting techniques for a given context and test task • Understand the metrics for evaluating the results of generative AI on test tasks • Give examples of techniques for evaluating and iteratively reﬁning prompts LEARNING OBJECTIVES

6 2.1 EFFECTIVE PROMPT DEVELOPMENT

7 Prompt engineering: The process of designing and reﬁning input
prompts to guide LLMs toward producing desired outputs Sec. 2.1 The quality of a prompt determines how useful, accurate, and test-relevant the output will be PROMPT ENGINEERING “give me information about the apple company” “give me information about the apple” An apple is a sweet, edible fruit that grows on apple trees Apple Inc. is a global technology company famous for its consumer electronics, software, and digital services Generative AI systems depend entirely on how you talk to them. Every interaction, be it a request, question, or instruction, is a prompt, and the quality of that prompt determines how useful, accurate, and test-relevant the output will be. For testers, this skill is more than just “knowing what to type.” It’s about understanding how language models interpret context, what makes them focus or drift and how to guide them toward repeatable, reliable, testing-oriented results. Prompt engineering, therefore, becomes a core competence for AI-assisted testing, much like designing test cases is for conventional testing. Prompt engineering is the process of designing and refining input prompts to guide LLMs toward producing desired outputs. When you use an LLM, you’re not programming it in the traditional sense; you’re instructing it through language. The model doesn’t “understand” the way humans do but it predicts the most probable next words based on your prompt. That means small wording changes can lead to big differences in output.

8 Sec. 2.1 PROMPT ENGINEERING List test cases for login
functionality, this might give a short bullet list like “Valid login, Invalid password, Locked account”. • Valid login – correct username and password • Invalid password – valid username, wrong password • Invalid username – non-existing user • Locked account – user account is locked/disabled …………….… EXAMPLE Write ﬁve detailed functional test cases for login, including preconditions, steps, and expected results, and this might produce structured, actionable test cases with clear outcomes. Test Case 1: Valid Login Preconditions: • User account exists in the system • Account is active and not locked • User has a valid username and password Steps: • Navigate to the Login page • Enter a valid username • Enter the correct password • Click the Login button Expected Result: • User is successfully authenticated • User is redirected to the home/dashboard page • A valid session is created Prompt A Prompt B Prompt A: List test cases for login functionality, this might give a short bullet list like “Valid login, Invalid password, Locked account.” Prompt B: Write five detailed functional test cases for login, including preconditions, steps, and expected results, and this might produce structured, actionable test cases with clear outcomes. Both are correct, but only the second is usable in a test plan. This illustrates why prompt clarity is just as critical as requirement clarity, because the model reflects exactly what you ask, not what you meant to ask.

9 Sec. 2.1 PROMPT ENGINEERING AS A FORM OF TEST
DESIGN FOR THE AI Deﬁne the conditions (Input Text) Specify the expected behaviour (Type and Format of the Output) Does actual Output meet your intent? You can think of prompt engineering as a form of test design for the AI. You get to define the conditions (the input text), specify the expected behaviour (the type and format of output) and assess whether the actual output meets your intent.

10 Sec. 2.1 PROMPT ENGINEERING AS A FORM OF TEST
DESIGN FOR THE AI Prompts Positive Negative What should happen? What should not happen? Inputs are: • expected • valid • well-formed Inputs are: • invalid • incomplete • contradictory • inappropriate And just like with real test cases, prompts can be positive (what should happen) or negative (what should not happen).

11 Sec. 2.1 CHALLENGES WITH PROMPTS Vagueness Overload Missing context
The prompt is too general Format mismatch What are some of the typical challenges testers face with prompts? First is vagueness, it’s when the prompt is too general, which would cause AI to give shallow results.

12 Sec. 2.1 CHALLENGES WITH PROMPTS Vagueness Overload Missing context
The prompt mixes too many ideas Format mismatch Then comes overload, it’s when the prompt mixes too many ideas and AI loses focus.

13 Vagueness Overload The prompt does not provide important details
Format mismatch Sec. 2.1 CHALLENGES WITH PROMPTS Missing context Also there could be a missing context, it’s when the model doesn’t “see” prior project details unless you include them.

14 Vagueness Overload Missing context Sec. 2.1 CHALLENGES WITH PROMPTS
The output is useful but not in the right shape Format mismatch And lastly the format mismatch, it’s when the output is useful but not in the right shape (e.g., missing step numbers). These issues are common, but easily fixed once you understand prompt structure and the principles that govern LLM behaviour.

15 2.1.1 STRUCTURE OF PROMPTS FOR GENERATIVE AI IN SOFTWARE
TESTING (K2)

16 Sec. 2.1.1 Apple Inc. apple When testers interact with
LLM, the prompt becomes the bridge between what you need and what the model delivers. The same model can produce either brilliant or completely useless results depending on how you ask. That’s why understanding the principles of effective prompting is essential for anyone using GenAI in testing.

17 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI ROLE CONTEXT
INSTRUCTION INPUT DATA CONSTRAINTS OUTPUT FORMAT 3 6 4 2 5 1 A structured prompt for generative AI in software testing typically includes six components. Using this structure makes outputs predictable, testable, and easier to validate. Treat each component like a field in a test case: missing or vague fields cause ambiguity in the result.

18 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 1. ROLE
The persona or perspective the model should adopt: • experienced QA engineer • test automation developer • product owner . . . 1. Role. The persona or perspective the model should adopt (for example: “experienced QA engineer”, “test automation developer”, “product owner”). Telling the model its role is like asking a specific colleague for help. You’d ask different people different questions. Asking “a chef” vs “a nutritionist” about a meal gives different answers.

19 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 1. ROLE
Act as an experienced QA engineer EXAMPLE Example: Use the prompt to set the role: Act as an experienced QA engineer. This will cause the model to prioritise testable behaviours, edge cases, and clear steps rather than giving a high-level product marketing answer.

20 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 2. CONTEXT
Background information the model needs: • project details • user stories • business rules • relevant constraints . . . 2. Context. Background information the model needs, for instance project details, user stories, business rules, relevant constraints already known to the team. Basically, context is the “file folder” you hand to a substitute teacher: without it they lack the syllabus and grading instructions.

21 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 2. CONTEXT
EXAMPLE This is a mobile banking login feature that supports passwords, biometrics, and 2FA. The app must lock an account after 3 failed attempts for 15 minutes Example: Include a prompt that sets the context: This is a mobile banking login feature that supports passwords, biometrics, and 2FA. The app must lock an account after 3 failed attempts for 15 minutes. This will lead to the model anchoring its responses to project-specific behaviour and avoids generic assumptions.

22 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 3. INSTRUCTION
The explicit task you want the model to perform (the action or goal): • Generate… • List… • Convert… • Write… • Summarise… . . . Write test cases! 3. Instruction. The explicit task you want the model to perform (the action or goal). This should be clear and imperative (e.g., “Generate”, “List”, “Convert”). Instruction is the actual command you give a teammate, be it “Write test cases” or “Summarise tasks”, it defines the job to be done.

23 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 3. INSTRUCTION
EXAMPLE Generate ﬁve functional test cases for login, including preconditions, test steps, and expected results Test Case 1: Successful Login with Valid Credentials Preconditions …………………… Test Steps …………………… Expected Results …………………… Test Case 2: Login with Invalid Password Preconditions …………………… Test Steps …………………… Expected Results …………………… Test Case 3: Login with Non-Existent User Preconditions …………………… Test Steps …………………… Expected Results …………………… Test Case 4: Login with Empty Fields Preconditions …………………… Test Steps …………………… Expected Results …………………… Test Case 5: Password Field Masking Preconditions …………………… Test Steps …………………… Expected Results …………………… Example: Instruction: Generate five functional test cases for login, including preconditions, test steps, and expected results. This will make the model focus on producing concrete artefacts that match your requested activity.

24 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 4. INPUT
DATA The concrete data the model should use to perform the task: • user stories • acceptance criteria • screenshots • code snippets • logs • sample inputs . . . 4. Input data. The concrete data the model should use to perform the task: user stories, acceptance criteria, screenshots, code snippets, logs, or sample inputs. Think of it this way – input data is the raw material handed to a craftsman and without the wood/metal the task can’t start. The better structured the input, the better the output.

25 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 4. INPUT
DATA EXAMPLE login.log User story: “As a user I want to reset my password via email”; screenshot and example log are attached. Generate test cases. TC-01: Password reset option appears after failed login attempts …………………… TC-02: Password reset email is triggered successfully …………………… TC-03: Password reset email content validation …………………… TC-04: Reset link opens password reset page …………………… …………………… …………………… Example: Input data: User story: "As a user I want to reset my password via email"; screenshot: login_page.png; example log: [attach log excerpt]. This will allow the model to base test cases directly on the supplied story and assets, producing project-relevant results rather than vague suggestions.

26 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 5. CONSTRAINTS
Any rules, limits or special conditions the model must respect: • format restrictions • privacy rules • performance limits • domain-speciﬁc policies . . . 5. Constraints. Include any rules, limits or special conditions the model must respect, for instance, format restrictions, privacy rules, performance limits, or domain-specific policies. Constraints are like the rules of a board game: they limit allowed moves and shape valid play. They prevent the model from producing outputs that break your project rules.

27 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 5. CONSTRAINTS
EXAMPLE • Do not include any real customer data. • Use placeholders for PII. • Limit cases to web and Android only. • Provide no more than 5 test cases. Example: Constraints: Do not include any real customer data; use placeholders for PII. Limit cases to web and Android only. Provide no more than 5 test cases. Here the model avoids leaking sensitive info, restricts scope to relevant platforms, and keeps output concise.

28 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 6. OUTPUT
FORMAT The exact structure you want in reply: • plain bullets • numbered steps • a table • JSON • Gherkin • CSV . . . A B C ✓ X 6. Output format. The exact structure you want in reply, be it plain bullets, numbered steps, a table, JSON, Gherkin, CSV, etc. This makes the result machine- or human-consumable without heavy rework. Specifying format is like telling a colleague to fill a spreadsheet column-by-column rather than scribbling notes on paper which makes the result unusable.

29 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI 6. OUTPUT
FORMAT EXAMPLE Provide results as a markdown table with columns: ID | Title | Preconditions | Steps | Expected Result. | ID | Title | Preconditions | Steps | Expected Result | |-----|-------|---------------|-------|-----------------| | | | | | | Example: Output format: Provide results as a markdown table with columns: ID | Title | Preconditions | Steps | Expected Result. Here the model returns a formatted table you can paste into documentation or convert programmatically.

30 Sec. 2.1.1 STRUCTURE OF PROMPT FOR GENAI ROLE CONTEXT
INSTRUCTION INPUT DATA CONSTRAINTS OUTPUT FORMAT 3 1 6 4 2 5 Using the six-component structure turns prompts from loose questions into testable specifications for the AI. Each field reduces ambiguity and helps make the model’s output predictable and review-friendly. Keep the language specific, supply project-relevant input, and define the format you need then treat the AI output like a draft that you verify and refine.

31 2.1.2 CORE PROMPTING TECHNIQUES FOR SOFTWARE TESTING (K2) Once
you know the six key components of a good prompt, the next step is to understand how to combine and extend them to handle more complex testing tasks.

32 Sec. 2.1.2 Prompt patterns are reusable ways of structuring
prompts to improve the quality, reasoning depth, or consistency of AI responses CORE PROMPTING TECHNIQUES Prompt Chaining Few-Shot Prompting Meta Prompting • Controlling complexity • Improving accuracy • Giving the model self-guidance These combinations are called prompt patterns. They are reusable ways of structuring prompts to improve the quality, reasoning depth, or consistency of AI responses. In software testing, three common and highly practical patterns are prompt chaining, few-shot prompting, and meta prompting. Each supports a different need: controlling complexity, improving accuracy, or giving the model self-guidance.

33 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 1. PROMPT CHAINING Prompt
chaining: A prompting technique that involves using the output of one prompt as the input for another, creating a sequence of prompts PROMPT 1 PROMPT 2 PROMPT 3 OUTPUT 3 OUTPUT 1 OUTPUT 2 1. Prompt Chaining. Prompt chaining means breaking a complex task into a sequence of smaller, dependent prompts, where the output of one step becomes the input for the next. Each stage builds logically on the previous one, allowing the model to handle intricate reasoning gradually instead of all at once.

34 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 1. PROMPT CHAINING Think
of it as assembly-line testing: one tester analyses requirements, the next designs test ideas, another writes test cases. g: one tester analyses requirements, the next designs test ideas, another writes test cases. The chain of steps produces a better final result than one overloaded request.

35 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 1. PROMPT CHAINING EXAMPLE
Read this requirement and summarise its key functions. . . . From your summary, identify potential risks. . . . Based on those risks, generate ﬁve high-level test ideas. . . . Turn those test ideas into detailed test cases with expected results. . . . . . . . . . Prompt 1 Prompt 2 Prompt 3 Prompt 4 For example, a tester could structure a prompt chain like this: Prompt 1: Read this requirement and summarise its key functions. Prompt 2: From your summary, identify potential risks. Prompt 3: Based on those risks, generate five high-level test ideas. Prompt 4: Turn those test ideas into detailed test cases with expected results.

36 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 1. PROMPT CHAINING Encourages
logical reasoning and structure Reduces hallucinations caused by overloaded prompts Simpliﬁes review and corrections Is useful in test processes where tasks are complicated This approach encourages logical reasoning and structure, potentially reducing hallucinations caused by overloaded prompts, all while being easier to review and correct one step at a time. Prompt chaining is particularly useful in test processes where tasks are complicated and require decomposition into subtasks and systematic checking of intermediate LLM outputs.

37 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 2. FEW-SHOT PROMPTING Few-shot
prompting: A technique where a model is given a few examples within the prompt to guide it in generating appropriate responses One-shot prompting: A prompt writing technique where the prompt contains one example to guide the LLM's response I want a portrait in this style 2. Few-Shot Prompting. In few-shot prompting, you give the model a few clear examples (or one example if it’s one-shot prompting) of how you want the task performed before asking it to create new outputs. This helps the model learn the expected structure, tone, and level of detail for the given context.

38 ZERO-SHOT PROMPTING FEW-SHOT PROMPTING Sec. 2.1.2 CORE PROMPTING TECHNIQUES
2. FEW-SHOT PROMPTING • Test Case ID • Description • Prerequisites • Test Steps • Expected Results • Actual Results • Status Test Case Template Who writes test cases like this? Zero-shot prompting: A prompt writing technique where the prompt contains no examples, relying on the model's pre-existing knowledge to generate a response It’s like onboarding a new tester. You first show them a couple of well-written test cases, and then they can produce new ones that match your format. The idea is for the model to mimic your tone, structure, and level of detail. This is different from zero-shot prompting (when there is no example) that relies on the model's pre-existing knowledge to generate a response.

39 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 2. FEW-SHOT PROMPTING Is
effective for tasks where examples can illustrate the required behaviour Fewer formatting issues Reduced need for post-editing If examples contain mistakes, the model will replicate them ! Few-shot prompting is particularly effective for tasks where examples can illustrate the required behaviour, allowing the model to produce consistent and professional outputs, while minimising formatting issues and reducing need for post-editing. But at the same time you have to remember that if examples contain mistakes, the model will replicate them.

40 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 3. META PROMPTING Meta
prompting: The crafting of higher-level instructions that generate speciﬁc prompts for exploring or automating capabilities 3. Meta Prompting means writing prompts that teach the model how to write or improve its own prompts. It’s a higher-level technique where the AI reflects on or evaluates its own instructions. In other words, you ask the model to think about prompting itself.

41 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 3. META PROMPTING TC
1 – PASSED TC 2 – PASSED TC 3 – PASSED TC 4 – PASSED TC 1 – PASSED TC 2 – PASSED TC 3 – PASSED TC 4 – PASSED TC 5 – FAILED TC 6 – FAILED TC 7 – PASSED TC 8 – FAILED It’s like training a junior tester not just to execute tests but to design better test ideas next time. Meta prompting makes the model an active participant in improving future interactions.

42 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 3. META PROMPTING Task:
You are an AI test assistant. Step 1: Write a clear prompt that would generate five test cases for login functionality. Step 2: Explain why your prompt would produce good results. Step 3: Execute your own prompt and present the final test cases. Step 1: Prompt to generate five login test cases Prompt: “Generate five detailed test cases for a web application login functionality. Each test case should include: • Test case ID • Title • Preconditions • Test steps • Test data (username and password) • Expected result Cover both positive and negative scenarios, including validation and security-related cases.” For example: Task: You are an AI test assistant. Step 1: Write a clear prompt that would generate five test cases for login functionality.

You are an AI test assistant. Step 1: Write a clear prompt that would generate five test cases for login functionality. Step 2: Explain why your prompt would produce good results. Step 3: Execute your own prompt and present the final test cases. Step 2: Why this prompt produces good results This prompt works well because: 1. Clear scope – It explicitly targets login functionality for a web application. 2. Structured output – Listing required fields ensures consistent, readable test cases. 3. Balanced coverage – Asking for both positive and negative scenarios prevents shallow results. 4. Quality focus – Including validation and security cases encourages realistic, QA-grade tests. 5. No ambiguity – The AI knows exactly how many test cases to generate and what to include. Overall, the prompt guides the AI toward complete, professional, and usable test cases, not vague examples. Step 2: Explain why your prompt would produce good results.

You are an AI test assistant. Step 1: Write a clear prompt that would generate ﬁve test cases for login functionality. Step 2: Explain why your prompt would produce good results. Step 3: Execute your own prompt and present the ﬁnal test cases. Step 3: Executing the prompt — Final test cases Test Case 1 ID: TC_LOGIN_001 Title: Successful login with valid credentials Preconditions: • User account exists and is active Test Steps: 1. Open the login page 2. Enter a valid username 3. Enter a valid password 4. Click the “Login” button Test Data: • Username: valid_user • Password: ValidPass123 Expected Result: • User is successfully logged in and redirected to the dashboard Test Case 2 ID: TC_LOGIN_002 Title: Successful login with invalid password …………………….. Step 3: Execute your own prompt and present the final test cases.

45 Sec. 2.1.2 CORE PROMPTING TECHNIQUES 3. META PROMPTING The
model designs, explains, and applies its own instruction Improved prompt quality through self-reﬂection Is useful for building reusable prompt templates or training materials Helps testers understand why a certain prompt works Is more abstract ! Is more complex ! Requires careful review ! Can produce overly redundant explanations ! Basically, the model designs, explains, and applies its own instruction. It’s a prompt that builds prompts, all while improving prompt quality through self-reflection. It’s useful for building reusable prompt templates or training materials. It also helps testers understand why a certain prompt works. At the same time, it’s more abstract and complex and requires careful review. It can produce overly redundant explanations if not constrained.

46 Sec. 2.1.2 CORE PROMPTING TECHNIQUES Prompt Chaining Few-Shot Prompting
Meta Prompting These three techniques, if used together, can turn the AI from a passive tool into a collaborative test partner.

47 2.1.3 SYSTEM PROMPT AND USER PROMPT (K2) When interacting
with a Large Language Model, two types of prompts influence its behaviour: system prompt and user prompt.

48 Sec. 2.1.3 System Prompt User Prompt vs Defines: •
the boundaries • the “personality” of the AI Tool developers Test automation engineers Solution architects Testers Persons integrating an LLM into a testing workflow How behave in general System prompt: A predefined instruction set, typically hidden from the chatbot’s users, that consistently establishes the context, tone, and boundaries for an LLM's responses and guides its behaviour throughout interactions Defines: • the specific task They work together, but they serve very different purposes in shaping how the model responds. Understanding the distinction is essential for testers who use LLM-based tools, because the system prompt often defines the boundaries and the “personality” of the AI, while the user prompt drives the specific task. A system prompt is the hidden, predefined instruction that sets the overall behaviour of the LLM. In most chatbot interfaces, the end user never sees it and cannot modify it. It is usually written by: tool developers, test automation engineers, solution architects, testers or the person integrating an LLM into a testing workflow. It tells the model how it should behave in general, regardless of what the user asks next.

49 Sec. 2.1.3 SYSTEM PROMPT The role of the model
WHAT ESTABLISHES ✓ X The system prompt establishes: • the role of the model (e.g., “You are a software testing assistant”),

WHAT ESTABLISHES The tone ✓ X • the tone (formal, concise, neutral),

WHAT ESTABLISHES The tone The domain boundaries ✓ X X • the domain boundaries (e.g., “Use ISTQBⓇ terminology”),

WHAT ESTABLISHES The tone The domain boundaries The operational parameters ✓ X • the operational parameters (what to do and not do),

WHAT ESTABLISHES The tone The domain boundaries The operational parameters The global constraints for the entire session • the global constraints for the entire session.

54 Sec. 2.1.3 SYSTEM PROMPT One can think of the
system prompt as the rules of the workplace taped to the wall, like “Be professional,” or “Follow safety guidelines.” Your daily tasks may change, but the foundational rules stay the same.

55 Sec. 2.1.3 SYSTEM PROMPT EXAMPLE For example, a system
prompt used in an AI testing assistant might be: You are a professional software testing assistant. Always respond clearly and concisely. Follow ISTQBⓇ terminology and practices. Avoid speculation. When helpful, relate your answers to testing principles. This acts like the default behaviour profile that guides every answer.

56 Sec. 2.1.3 User prompt: An instruction or query entered
by a user into an LLM that directs the model's response to fulfil specific tasks or provide desired information . . . . . . . . . . . . Question Task Instruction System Prompt User Prompt vs Defines: • the specific task Defines: • the boundaries • the “personality” of the AI A user prompt, on the other hand, is the input written by the end user. It can come in the form of a question, instruction, or task sent during the conversation. User prompts can change from message to message and provide the immediate task.

57 Sec. 2.1.3 USER PROMPT WHAT INCLUDES Structured instructions Input
data: • requirements • logs • screenshots Speciﬁc output format They often include structured instructions, may include input data (requirements, logs, screenshots) and request a specific output format.

58 Sec. 2.1.3 ✓ Deﬁnes the global behaviour ✓ Is
the job description ✓ Deﬁnes the local task ✓ Is the task ticket System Prompt User Prompt vs So basically, while the system prompt defines the global behaviour, the user prompt defines the local task. If the system prompt is the job description, then the user prompt is the task ticket you’re working on right now.

59 Sec. 2.1.3 USER PROMPT EXAMPLE List the key differences
between black-box and white-box testing. Provide two examples for each. Use a short, clear paragraph format. For example, a user prompt built on top of the aforementioned system prompt might look like this: List the key differences between black-box and white-box testing. Provide two examples for each. Use a short, clear paragraph format. This tells the model what specific job to do within the boundaries set by the system prompt.

60 • Prompt engineering is the process of designing and
refining prompts to guide LLMs toward accurate, relevant, and testing-oriented outputs • Small changes in prompt wording can significantly affect the quality, structure, and usefulness of AI-generated results • Effective prompts typically include six components: role, context, instruction, input data, constraints, and output format • Common prompt engineering techniques include prompt chaining, few-shot prompting, and meta prompting, each supporting a different testing needs • Prompt chaining improves handling of complex tasks by breaking them into smaller sequential steps • Few-shot prompting increases consistency and formatting accuracy by providing example outputs • Meta prompting helps the model generate or improve prompts through self-reflection and guidance • System prompts define the overall behaviour and boundaries of the model, while user prompts define the specific task to perform • Clear, structured prompts reduce ambiguity and improve the predictability and reviewability of AI outputs KEY TAKEAWAYS – 2.1

61 1. How could unclear or incomplete prompts affect the
quality of AI-generated test artefacts in your testing work? 2. Which prompting technique (prompt chaining, few-shot prompting, or meta prompting) would be most useful for your typical testing tasks, and why? 3. In what situations would a structured output format (e.g., tables or Gherkin syntax) provide the greatest value? REFLECTION – 2.1

62 2.2 APPLYING PROMPT ENGINEERING TECHNIQUES TO SOFTWARE TEST TASKS
Prompt engineering becomes truly powerful when applied to real software testing activities. In practice, testers use techniques like prompt chaining, few-shot prompting, and meta prompting to guide GenAI through complex, multi-step testing workflows.

63 Sec. 2.2 Test analysis TEST TASKS SUPPORTED BY APPLYING
PROMPT ENGINEERING TECHNIQUES Test design Test automation Test case prioritisation Defect detection Coverage analysis Test monitoring Test control These techniques help GenAI support a wide range of test tasks, including: test analysis, test design, test automation, test case prioritisation, defect detection, coverage analysis, test monitoring and test control.

64 Sec. 2.2 PROMPT (INPUT) OUTPUT • Well-structured • Aligned
with the test objective • Precise • Relevant • Useful • Low-quality • Incomplete • Weak • Misleading • Hallucinated When prompts are well-structured and aligned with the test objective, GenAI produces outputs that are more precise, more relevant, and more useful. But the reverse is also true: low-quality or incomplete input will lead to weak, misleading, or hallucinated outputs.

65 2.2.1 TEST ANALYSIS WITH GENERATIVE AI (K3)

66 Sec. 2.2.1 TEST ANALYSIS WITH GENAI Test condition (test
situation, test requirement): A testable aspect of a component or system identiﬁed as a basis for testing Test Monitoring and Control Test Planning Test Design Test Implemen- tation Test Execution Test Completion Test Analysis • Generating and prioritising test conditions • Identifying defects in the test basis • Identifying risks • Evaluating coverage Generative AI can support many parts of test analysis by generating and prioritising test conditions (or testable aspects of a component or system identified as a basis for testing), identifying defects in the test basis, as well as identifying risks and evaluating coverage. In this context, the LLM behaves like a fast, consistent assistant that can scan large volumes of documentation and highlight important testing insights.

67 Sec. 2.2.1 TEST ANALYSIS WITH GENAI INPUT OUTPUT Test
analysis work products: • Prioritised test conditions (e.g. acceptance criteria) • Coverage analysis • Test techniques recommendations • . . . Test basis: • Requirements • User stories • Technical speciﬁcations • API descriptions • GUI wireframes • Process ﬂows • . . . Typical types of input include requirements, user stories, technical specifications, API descriptions, GUI wireframes, process flows, and any other artefacts that form the test basis, while the output consists of typical test analysis work products, such as prioritised test conditions (e.g. acceptance criteria), coverage analysis, and test techniques recommendations. In other words, the LLM helps produce the same deliverables a tester would, but faster, and sometimes even more comprehensively.

68 Sec. 2.2.1 TEST ANALYSIS TASKS 1. IDENTIFYING POTENTIAL DEFECTS
IN THE TEST BASIS • Analysis of the requirements and spotting problems: ◦ ambiguity ◦ contradictions ◦ missing conditions ◦ unclear acceptance criteria ◦ inconsistent terminology • Suggesting improvements GenAI can support the following test analysis tasks: • Identifying potential defects in the test basis. This is possible because GenAI can analyse requirements and spot problems such as ambiguity, contradictions, missing conditions, unclear acceptance criteria and inconsistent terminology. It does this by comparing the text to patterns in its training data. The LLM can also suggest improvements.

69 Sec. 2.2.1 TEST ANALYSIS TASKS 1. IDENTIFYING POTENTIAL DEFECTS
IN THE TEST BASIS EXAMPLE The system should allow unlimited login retries Early detection helps prevent defects from entering the design, coding, and testing phases. This contradicts common security practice. Consider deﬁning a lockout rule. SECURITY RED FLAG! Example: Text: The system should allow unlimited login retries (security red flag) LLM: This contradicts common security practice. Consider defining a lockout rule. This early detection helps prevent defects from entering the design, coding, and testing phases.

70 Sec. 2.2.1 TEST ANALYSIS TASKS 2. GENERATING TEST CONDITIONS
FROM THE TEST BASIS • Breaking down a requirement or user story into clear, testable statements • Generating test conditions from the test basis. When provided with a requirement or user story, an LLM can break it down into clear, testable statements.

71 Sec. 2.2.1 TEST ANALYSIS TASKS 2. GENERATING TEST CONDITIONS
FROM THE TEST BASIS EXAMPLE • Breaking down a requirement or user story into clear, testable statements If we give it a requirement example of “Users must reset their password via an emailed link,” the possible test conditions could be ‘Link is sent when requested’, ‘Link expires after set time’, ‘Link cannot be reused’, ‘New password must meet complexity policy’, and ‘Reset action is logged’. This is essentially automated requirement decomposition.

72 Sec. 2.2.1 TEST ANALYSIS TASKS 3. PRIORITISING TEST CONDITIONS
BASED ON RISK • Assigning priority • Considering factors: ◦ risk likelihood ◦ severity of failure impact ◦ regulatory or safety implications ◦ visibility to end users ◦ historical defect patterns • Prioritising test conditions based on risk. LLMs can also help assign priority by considering factors such as: risk likelihood, severity of failure impact, regulatory or safety implications, visibility to end users, historical defect patterns.

BASED ON RISK • Assigning priority • Considering factors: ◦ risk likelihood ◦ severity of failure impact ◦ regulatory or safety implications ◦ visibility to end users ◦ historical defect patterns EXAMPLE Payment processing User proﬁle picture upload For example, payment processing would be high risk, while user profile picture upload would be considered low risk.

BASED ON RISK • Assigning priority • Considering factors: ◦ risk likelihood ◦ severity of failure impact ◦ regulatory or safety implications ◦ visibility to end users ◦ historical defect patterns • Using explicit data (e.g. risk matrix) • Using implicit patterns (e.g. common defect areas) The LLM can use both explicit data (e.g., risk matrix) and implicit patterns (e.g., common defect areas).

75 Sec. 2.2.1 TEST ANALYSIS TASKS 4. SUPPORTING COVERAGE ANALYSIS
• Mapping requirements to: ◦ test conditions ◦ test cases • Identifying which areas are: ◦ covered ◦ missing • Is useful in complex projects • Supporting coverage analysis. An LLM can map requirements to test conditions to test cases, helping testers see which areas are covered and which are missing. This is especially useful in complex projects where coverage gaps are easy to overlook which can lead to missed defects.

76 Sec. 2.2.1 TEST ANALYSIS TASKS 5. SUGGESTING SUITABLE TEST
TECHNIQUES • Recommending appropriate test techniques: ◦ boundary value analysis ◦ equivalence partitioning ◦ state transition testing ◦ pairwise testing ◦ . . . Don’t worry! Use this test technique This is especially useful in complex projects where coverage gaps are easy to overlook which can lead to missed defects.

77 Sec. 2.2.1 TEST ANALYSIS TASKS 5. SUGGESTING SUITABLE TEST
TECHNIQUES EXAMPLE REQUIREMENTS 1–50 users Boundary value analysis is needed For example, if a requirement includes ranges (“1–50 users”), the LLM might suggest boundary value analysis.

78 Sec. 2.2.1 TEST ANALYSIS WITH GENAI INPUT OUTPUT REQUIREMENTS
(INPUT) ANALYSIS (OUTPUT) • Unclear • Outdated • Incomplete • Poor • Reﬂects weaknesses • Noisy • Hallucinated • Misleading • Well-prepared • Accurate • Actionable The effectiveness depends on the quality and relevance of the input The effectiveness of all these activities depends directly on the quality and relevance of the input the LLM receives. If the requirements are unclear, outdated, or incomplete, the output will reflect those weaknesses. Basically, GenAI will amplify whatever you give it. Well-prepared input leads to accurate, actionable analysis. Poor input leads to noise, hallucinations, or misleading conclusions.

79 2.2.2 TEST DESIGN AND TEST IMPLEMENTATION WITH GENERATIVE AI
(K3)

80 Sec. 2.2.2 Test Monitoring and Control Test Planning Test
Design Test Implemen- tation Test Execution Test Completion Test Analysis Accelerating the creation, refinement, and evaluation of test artefacts Transforming raw requirements into actionable, organised testware TEST DESIGN AND TEST IMPLEMENTATION WITH GENAI PROS OF USING GENAI Test design: The activity that derives and specifies test cases from test conditions Test implementation: The activity that prepares the testware needed for test execution Preparing: • test cases • test data • test scripts • test environments • test configuration According to the ISTQBⓇ, test design means taking test conditions and turning them into clear, structured test cases and other testware. Test implementation means preparing everything needed to execute those tests, such as test cases, test data, scripts, environments, and configuration. Generative AI can support both activities by accelerating the creation, refinement, and evaluation of a wide range of test artefacts. Whether the tests are manual or automated, GenAI can help testers transform raw requirements into actionable, organised testware much faster and with greater consistency.

81 Sec. 2.2.2 TEST DESIGN AND TEST IMPLEMENTATION WITH GENAI
REQUIREMENT TEST CONDITION TEST CASE TEST SCRIPT EXECUTION PLAN In practice, testers use techniques like prompt chaining or few-shot prompting to guide the LLM step-by-step. For example, from requirement to test condition to test case to test script to execution plan. Below are the typical tasks where GenAI provides substantial support.

82 Negative testing End-to-end scenarios Usability Error-handling coverage Sec. 2.2.2
TEST DESIGN AND TEST IMPLEMENTATION TASKS 1. TEST CASE GENERATION 1. Reading: ◦ requirements ◦ user stories ◦ acceptance criteria ◦ interface descriptions 2. Creating draft test cases: ◦ preconditions ◦ test inputs ◦ expected results ◦ coverage criteria OBJECTIVES ? ? ? Functional veriﬁcation • Test case generation. Using natural language processing, GenAI can read requirements, user stories, acceptance criteria, or interface descriptions and turn them into draft test cases. This includes preconditions, test inputs, expected results and simple coverage criteria. The model can produce test cases aimed at different objectives like functional verification, negative testing, end-to-end scenarios and usability or error-handling coverage. It’s like giving the LLM a requirement and asking it to break this into specific things we should test and write them clearly.

83 Sec. 2.2.2 TEST DESIGN AND TEST IMPLEMENTATION TASKS 1.
TEST CASE GENERATION EXAMPLE Users must verify their email before first login Test Case 1 Preconditions: • User account created but not verified Test Steps: 1. Request verification 2. Open link 3. Attempt login before verification 4. Attempt login after verification Expected results: • Login blocked before verification • Login allowed after verification completion Requirement Example: If we give the system a requirement of “Users must verify their email before first login.” LLM-generated test case elements could consist of: Set preconditions: User account created but not verified Taken steps: Request verification then open link then attempt login before verification then attempt after Expected results: Login blocked before verification; allowed after completion

TEST DATA SYNTHESIS • Generating representative synthetic test data • Producing: ◦ valid data ◦ invalid data ◦ edge cases ◦ boundary values ◦ combinations for equivalence partitions ◦ unusual or stress-test scenarios ◦ realistic but anonymised datasets • Test data synthesis. GenAI can generate representative synthetic test data that mimics real-world patterns while preserving privacy. It can help testers by producing valid and invalid data, edge cases and boundary values, combinations for equivalence partitions, unusual or stress-test scenarios and realistic but anonymised datasets.

TEST DATA SYNTHESIS EXAMPLE Generate 20 synthetic customer proﬁles, including 3 edge-case entries with missing or malformed data In other words, GenAI basically becomes a “data factory” that produces safe, relevant, and diverse test inputs on demand. So when given the prompt of Generate 20 synthetic customer profiles, including 3 edge-case entries with missing or malformed data.

TEST DATA SYNTHESIS # NAME EMAIL PAYMENT METHOD COUNTRY 1 Daniel Mercer [email protected] Visa **** 4821 USA 2 Priya Sharma [email protected] Mastercard **** 1934 India 3 Lucas Ferreira [email protected] PayPal ([email protected]) Brazil 4 MaximilianAlexand erTheThirdOfWest minster-Smythe-Jo hnson-Delacroix-Va nderbilt max.vanderbilt@extremelylong domainnameexamplecorporati oninternational.com Mastercard **** 1299 USA ---------------------------------------------------------------------------------------------------------------------------------- 18 Lara Kim [email protected] 1234-ABCD-INVALID-CARD South Korea 19 Sofia Andronikos sofi[email protected] Visa **** 7765 Greece 20 Olivia Benett NULL Visa **** 4431 Kenya EXAMPLE Customer profiles EXAMPLE LLM could produce a mix of realistic names, emails, payment methods, and extreme cases (NULL email, overly long names, invalid card formats).

AUTOMATED TEST SCRIPT GENERATION • Converting test cases into: ◦ manual test procedures ◦ automated test scripts: ✓ Selenium ✓ Cypress ✓ Playwright ✓ API ✓ . . . • Updating existing test scripts • Automated test script generation. GenAI can also convert structured test cases into manual test procedures, or automated test scripts (e.g., Selenium, Cypress, Playwright, API test scripts). The LLM can then translate natural-language test steps into executable code and can update existing scripts as requirements evolve. It’s as if it acts as a translator between “what to test” and “how to code the test.”

88 from selenium import webdriver from selenium.webdriver.common.by import By #
Test data valid_email = "[email protected]" valid_password = "ValidPassword123" # Initialize driver driver = webdriver.Chrome() driver.maximize_window() # Open login page driver.get("https://your-login-page-url.com") # Perform login steps driver.find_element(By.ID, "email").send_keys(valid_email) driver.find_element(By.ID, "password").send_keys(valid_password) driver.find_element(By.ID, "login-button").click() # Optional: wait or validation step could go here # Close browser driver.quit() Sec. 2.2.2 TEST DESIGN AND TEST IMPLEMENTATION TASKS 3. AUTOMATED TEST SCRIPT GENERATION EXAMPLE Generate a Selenium test script for the following test case: Enter a valid email and password, then click Login Example Test case step: Enter a valid email and password, then click Login. GenAI-generated Selenium snippet: driver.find_element(By.ID, "email").send_keys(valid_email) driver.find_element(By.ID, "password").send_keys(valid_password) driver.find_element(By.ID, "login-button").click()

TEST EXECUTION SCHEDULING AND PRIORITISATION # Test ID Priority Reason 1 TC-001 Critical Base environment validation 2 TC-003 High Database conﬁguration dependency 3 TC-009 Critical Depends on TC-003 4 TC-002 High Depends on TC-001 5 TC-007 High Depends on TC-002 6 TC-011 Critical Depends on TC-001 7 TC-005 Medium Depends on TC-002 8 TC-014 Medium Depends on TC-003 9 TC-019 Low Depends on TC-014 10 TC-021 Low Depends on TC-007 • Organising test cases into execution sequences • Analysing: ◦ priority ◦ risk ◦ dependencies between tests ◦ required environments ◦ data ◦ resource availability ◦ milestones ◦ deadlines • Is useful for efficient planning • Test execution scheduling and prioritisation. GenAI can organise test cases into execution sequences by analysing priority and risk, dependencies between tests, required environments or data, resource availability, milestones and deadlines. This helps with efficient planning, especially in large or fast-paced projects. It’s like having an AI test lead who arranges the test plan, placing the most important and most efficient tests first.

TEST EXECUTION SCHEDULING AND PRIORITISATION EXAMPLE Given these 25 test cases, schedule them for execution. Run high-risk login and payment cases ﬁrst, group API tests together, and avoid reuse of expired test data For example, we have a prompt of Given these 25 test cases, schedule them for execution. Run high-risk login and payment cases first, group API tests together, and avoid reuse of expired test data.

TEST EXECUTION SCHEDULING AND PRIORITISATION EXAMPLE 🗓 Execution Schedule Overview Day 1 • 09:00–10:30 → Login (Critical) • 10:45–13:00 → Payments (Critical) • 14:00–16:30 → API Batch Day 2 • 09:00–11:30 → Functional Medium Tests • 11:30–12:00 → Retest failed criticals (if any) 📦 Phase 3 – API Test Group (Run Together) Priority: High Execution Window: Day 1 – 14:00–16:30 Grouped for environment stability & shared setup. ID Test Case Endpoint Focus TC-15 Create user API POST /users TC-16 Get user details GET /users/{id} TC-17 Update user proﬁle PUT /users/{id} TC-18 Delete user DELETE /users/{id} TC-19 Auth token generation POST /auth/token TC-20 Invalid token access All protected endpoints TC-21 Payment API success POST /payments TC-22 Payment API invalid payload POST /payments TC-23 Rate limit validation Auth / Payment endpoints The LLM outputs a structured execution order with justification.

92 PROS OF USING GENAI Sec. 2.2.2 TEST DESIGN AND
TEST IMPLEMENTATION TASKS TEST CASE GENERATION SYNTHETIC DATA CREATION SCRIPT WRITING SCHEDULING ✓ Improving both the speed and the quality of test design and test implementation ✓ Quick movement from requirements to executable tests ✓ Supporting both manual and automated workﬂows Clear prompts and high-quality input lead to reliable and useful testware Through test case generation, synthetic data creation, script writing, and scheduling, GenAI can significantly improve both the speed and the quality of test design and test implementation. It helps testers move from requirements to executable tests more quickly, supporting both manual and automated workflows. You have to remember that the key is still the same. Clear prompts and high-quality input lead to reliable and useful testware.

93 2.2.3 AUTOMATED REGRESSION TESTING WITH GENERATIVE AI (K3)

94 Sec. 2.2.3 REGRESSION TESTING VERSION 1 TC 1 -
PASSED TC 2 - PASSED TC 3 - PASSED TC 4 - PASSED VERSION 2 TC 1 - PASSED TC 2 - PASSED . . . TC 9 - FAILED TC 10 - PASSED TC 1 - PASSED TC 2 - FAILED . . . TC 99 - PASSED TC 100 - FAILED . . . VERSION 10 . . . Regression suites become: • large • repetitive • time-consuming Automate regression testing! As software evolves through iterations, sprints, or releases, the set of regression tests grows. Regression suites often become large, repetitive, and time-consuming to execute which is why they are ideal for automation, especially inside CI/CD pipelines where changes are frequent and rapid.

95 Sec. 2.2.3 REGRESSION TESTING WITH GENAI Adapting to codebase
changes Performing impact analysis Highlighting most likely affected areas TASK EXAMPLES Automated tasks for test suites: • Creating • Maintaining • Optimising ✓ More targeted regression efforts ✓ Reduced unnecessary execution of unaffected tests PROS OF USING GENAI Generative AI can significantly streamline regression testing by helping create, maintain, and optimise automated test suites. One of its strengths is the ability to adapt to codebase changes, perform impact analysis, and highlight which areas are most likely affected by recent modifications. This makes regression efforts more targeted and reduces unnecessary execution of unaffected tests. Below are the typical regression testing and reporting activities that GenAI can support through well-engineered prompts.

96 Sec. 2.2.3 REGRESSION TESTING AND REPORTING ACTIVITIES 1. AUTOMATED
TEST SCRIPT IMPLEMENTATION WITH KEYWORD-DRIVEN AUTOMATION • Mapping keywords to: ◦ test steps ◦ locators ◦ functions ◦ automation modules ◦ test scripts • Generating scripts that follow existing framework conventions Keyword represents a common test action ClickButton EnterText VerifyMessage • Automated test script implementation with keyword-driven automation. Many teams use keyword-driven frameworks, where each keyword represents a common test action (e.g., EnterText, ClickButton, VerifyMessage). GenAI can map these keywords to specific test steps, corresponding locators or functions, reusable automation modules and structured test scripts. The LLM can assist test automation engineers by consistently generating scripts that follow existing framework conventions.

TEST SCRIPT IMPLEMENTATION WITH KEYWORD-DRIVEN AUTOMATION EXAMPLE Using our keyword framework, implement a test script for valid login. OpenBrowser NavigateToURL https://app.example.com/login EnterText emailField [email protected] EnterText passwordField ValidPassword123 ClickButton loginButton VerifyMessage welcomeBanner 'Welcome back, user!' Example: User prompt: “Using our keyword framework, implement a test script for valid login.” GenAI output: OpenBrowser NavigateToURL https://app.example.com/login EnterText emailField [email protected] EnterText passwordField ValidPassword123 ClickButton loginButton VerifyMessage welcomeBanner 'Welcome back, user!' This accelerates script creation and ensures consistent structure.

98 Sec. 2.2.3 REGRESSION TESTING AND REPORTING ACTIVITIES 2. IMPACT
ANALYSIS AND TEST OPTIMISATION • Reviewing: ◦ commit descriptions ◦ code diffs ◦ API change logs ◦ requirement updates • Identifying components and functionalities most likely to be affected • Regression testing recommendations: ◦ priority execution ◦ skipped execution ◦ necessary additional tests • Impact analysis and test optimisation. GenAI can review commit descriptions, code diffs, API change logs, requirement updates to identify components and functionalities most likely to be affected. It can then recommend which regression tests should run first, which can be skipped and where additional tests are needed. Basically it helps testers answer the question: “What should we retest now, based on what changed yesterday?”

99 Sec. 2.2.3 REGRESSION TESTING AND REPORTING ACTIVITIES 2. IMPACT
ANALYSIS AND TEST OPTIMISATION EXAMPLE ✓ Login tests ✓ Password reset ✓ Session management ✓ Multi-Factor Authentication (MFA) X Product browsing AUTHENTICATION SERVICE For example, if a change affects the authentication service, GenAI prioritises login tests, password reset, session management and Multi-Factor Authentication (MFA) flows while deprioritising unrelated areas like product browsing.

100 Sec. 2.2.3 REGRESSION TESTING AND REPORTING ACTIVITIES 3. SELF-HEALING
AND ADAPTIVE TESTS Reasons for failed UI and API tests: • Minor changes • Slightly renamed buttons • Moved UI elements • New HTML structures • Modiﬁed endpoint URLs • Updated payload formats FAILED “Log in” button does not match the test Wait! The requirement was changed. I can adjust test script! • Self-healing and adaptive tests. UI and API tests often fail due to minor changes, slightly renamed buttons, moved UI elements, new HTML structures, modified endpoint URLs and updated payload formats. GenAI can automatically adjust test scripts to keep them stable.

AND ADAPTIVE TESTS Reasons for failed UI and API tests: • Minor changes • Slightly renamed buttons • Moved UI elements • New HTML structures • Modiﬁed endpoint URLs • Updated payload formats GENAI TASKS ✓ Updating XPath or CSS selectors ✓ Renaming outdated variables ✓ Adjusting request payloads ✓ Response validation ✓ Regenerating mocks for updated API schemas It can come in the form of updating XPath or CSS selectors, renaming outdated variables, adjusting request payloads or response validation or regenerating mocks for updated API schemas.

AND ADAPTIVE TESTS EXAMPLE Version A <button id="submit">LOG IN</button> Version B <button id="confirmLogin">LOG IN</button> For example, a login form button renamed from #submit to #confirmLogin

AND ADAPTIVE TESTS EXAMPLE Version A <button id="submit">LOG IN</button> Version B <button id="confirmLogin">LOG IN</button> Change into “conﬁrmLogin” causes GenAI to update the locator everywhere in the script. This reduces maintenance overhead and keeps regression suites healthy.

TEST REPORTING AND INSIGHTS Test report: Documentation summarising testing and results • Generation of test reports with: ◦ passed/failed test cases ◦ failure clusters ◦ stability trends ◦ environment or conﬁguration issues ◦ performance variations ◦ predicted failure hot spots • Generation of dashboards / summaries • Automated test reporting and insights. GenAI can turn raw test execution outputs into readable, meaningful test reports (documentation summarising testing and results). It can aggregate passed/failed cases, failure clusters, stability trends, environment or configuration issues, performance variations or predicted failure hot spots. It can also generate stakeholder-friendly dashboards or summaries.

105 Sec. 2.2.3 REGRESSION TESTING AND REPORTING ACTIVITIES 5. ENHANCED
DEFECT REPORTING AND ROOT CAUSE ANALYSIS • Creating defect report based on: ◦ failure logs ◦ screenshots ◦ environment metadata • Proposing likely causes by comparing the failure patterns to defect categories • Enhanced defect reporting and root cause analysis. GenAI can automatically compile failure logs, screenshots and environment metadata into a clean and complete defect report. It can also propose likely causes by comparing the failure patterns to known defect categories.

106 Sec. 2.2.3 REGRESSION TESTING AND REPORTING ACTIVITIES 5. ENHANCED
DEFECT REPORTING AND ROOT CAUSE ANALYSIS EXAMPLE • The failing tests share identical timeout exceptions after build #218. • The API response time increased from 120 ms to 900 ms. Possible root cause Recent changes to database indexing or API throttling. For example, the failing tests share identical timeout exceptions after build #218. The API response time increased from 120ms to 900ms. Possible root cause: recent changes to database indexing or API throttling.

107 Sec. 2.2.3 FORMS OF REGRESSION TESTING ✓ Functional ✓
Non-functional: ◦ performance ◦ security ◦ usability ✓ GUI-based tests ✓ API-based tests These capabilities apply to many forms of regression testing: functional, non-functional (performance, security, usability), GUI-based tests, API-based tests.

108 Sec. 2.2.3 2 + 3 * 5 = 25
17 2 + 3 * 5 = 25 GENAI CAN MAKE MISTAKES However, the testers must keep in mind that GenAI can make mistakes. The output must therefore be thoroughly checked, based on the associated risk.

109 Sec. 2.2.3 UI TESTS CHALLENGES • Layout changes •
Renamed widgets • Dynamic locators • Modified navigation flows HOW GENAI CAN HELP ✓ Adapting scripts automatically ✓ Reducing flaky failures ✓ Increasing stability UI tests often fail due to layout changes, renamed widgets, dynamic locators or modified navigation flows. GenAI can adapt scripts automatically, reducing flaky failures and increasing stability.

110 Sec. 2.2.3 API TESTS • New endpoints • Updated
request/response schemas • Different authentication requirements HOW GENAI CAN HELP ✓ Regenerating client wrappers ✓ Updating test assertions ✓ Synthesising appropriate payloads ✓ Creating variations for negative testing CHALLENGES APIs evolve frequently and face challenges such as new endpoints, updated request/response schemas and different authentication requirements. To help with this, GenAI can regenerate client wrappers, update test assertions, synthesise appropriate payloads and create variations for negative testing. This ensures coverage remains intact even when API designs change.

111 Sec. 2.2.3 GenAI accelerates regression work, but testers ensure
its correctness. ? ? ? LOW RISK HIGH RISK While GenAI can offer powerful assistance, it is not infallible. Its outputs in the form of scripts, updated locators, impact analysis, and reports must still be reviewed by testers, with the level of scrutiny depending on the associated risk (as discussed in Chapter 3). GenAI accelerates regression work, but testers ensure its correctness.

112 2.2.4 TEST MONITORING AND TEST CONTROL WITH GENERATIVE AI
(K3)

113 Sec. 2.2.4 TEST MONITORING AND TEST CONTROL WITH GENAI
Test Monitoring and Control Test Planning Test Design Test Implemen- tation Test Execution Test Completion Test Analysis • Processing large volumes of data from: ◦ test management systems ◦ CI/CD pipelines ◦ defect trackers ◦ logs ◦ dashboards ◦ communication tools • Retrieving information • Summarising information • Interpreting information Test monitoring and test control are essential activities in managing a testing effort. They require processing large volumes of data (sometimes structured, sometimes unstructured) from test management systems, CI/CD pipelines, defect trackers, logs, dashboards, and communication tools. GenAI can assist by retrieving, summarising, and interpreting this information,

✓ Staying aware of: ◦ progress ◦ risks ◦ bottlenecks ◦ deviations from the test plan ✓ Faster and more informed decision-making PROS OF USING GENAI helping teams stay aware of progress, risks, bottlenecks, and deviations from the test plan. Instead of manually analysing hundreds of data points, testers can use prompting to help GenAI synthesise insights, enabling faster and more informed decision-making. Below are typical test monitoring and test control tasks that GenAI can facilitate:

115 Sec. 2.2.4 TEST MONITORING AND TEST CONTROL TASKS 1.
TEST MONITORING AND METRICS ANALYSIS • Automatic processing of: ◦ test execution logs ◦ pass/fail ratios ◦ defect rates ◦ blocking issues ◦ cycle-by-cycle trends ◦ requirement coverage ◦ . . . • Highlighting patterns • Predicting potential risks • Alerting when trends deviate from the expected plan • Test monitoring and metrics analysis. GenAI can automatically process test execution logs, pass/fail ratios, defect rates, blocking issues, cycle-by-cycle trends, requirement coverage, etc. It can highlight patterns, predict potential risks, and alert the team when trends deviate from the expected plan. It acts like a real-time “quality radar,” scanning all available test data and pointing out what needs attention.

TEST MONITORING AND METRICS ANALYSIS EXAMPLE Build 1 Build 2 Build 3 Build 4 Build 5 Analyse these past ﬁve builds and summarise the main testing trends. Identify any emerging risks. Analysis results: • Increased failures in the payment module. • Flaky UI tests due to locator changes. • Slower API responses in staging. Predicted risk: • Performance regression in next release. For example, prompt: Analyse these past five builds and summarise the main testing trends. Identify any emerging risks. GenAI outputs: Increased failures in the payment module, Flaky UI tests due to locator changes, Slower API responses in staging, Predicted risk: performance regression in next release. This allows teams to react early rather than after quality has already degraded.

117 Sec. 2.2.4 TEST MONITORING AND TEST CONTROL TASKS •
Recommending which tests should be reprioritised • Suggesting changes to the test schedule • Highlighting areas where more resources are needed • Identifying activities that fall behind • Suggesting workarounds or mitigation strategies Given how things look right now, what should we change to stay on track? 2. TEST CONTROL • Test control. Test control involves adjusting the testing approach based on actual progress. GenAI can help by recommending which tests should be reprioritised, suggesting changes to the test schedule, highlighting areas where more resources are needed, identifying activities that fall behind and suggesting workarounds or mitigation strategies. In simpler terms GenAI helps answer the question “Given how things look right now, what should we change to stay on track?”

TEST CONTROL EXAMPLE • Allocate more time to regression testing in authentication. • Shift two testers to API veriﬁcation. • Postpone GUI exploratory testing until critical defects are resolved. Based on this week’s testing progress, tell me which areas we need to focus on ﬁrst So if you give it this prompt Based on this week’s testing progress, tell me which areas we need to focus on first, GenAI could yield this summary: Allocate more time to regression testing in authentication, Shift two testers to API verification and Postpone GUI exploratory testing until critical defects are resolved. This helps maintain focus on the highest-risk areas.

TEST COMPLETION INSIGHTS AND CONTINUOUS LEARNING • Generating test completion reports that summarise: ◦ what was achieved ◦ what failed ◦ major blockers ◦ discovered risks ◦ lessons learned ◦ recommendations for future cycles • Supporting continuous improvement and structured learning across releases • Test completion insights and continuous learning. GenAI can generate test completion reports that summarise what was achieved, what failed, major blockers, discovered risks, lessons learned and recommendations for future cycles. This supports continuous improvement and structured learning across releases.

TEST COMPLETION INSIGHTS AND CONTINUOUS LEARNING EXAMPLE Create a test completion summary for Sprint 14 based on these logs and defect reports. The prompt could look something like this: Create a test completion summary for Sprint 14 based on these logs and defect reports.

ENHANCED TEST METRICS VISUALISATION AND REPORTING • Creating: ◦ interactive dashboards ◦ visualisations ◦ data summaries ◦ natural-language explanations ◦ role-speciﬁc scorecards (e.g. management-focused summaries) ✓ Visibility into testing progress ✓ Concise insights without manually digging through reports PROS OF USING GENAI • Enhanced test metrics visualisation and reporting. GenAI can assist in creating interactive dashboards, rich visualisations, data summaries, natural-language explanations and role-specific scorecards (e.g., management-focused summaries). This ensures all stakeholders have relevant, understandable, and timely visibility into testing progress. Basically, it converts raw metrics into something that is easy to read and easy to act on. Stakeholders receive concise insights without manually digging through reports.

Turning data into actionable insights from daily execution trends to high-level dashboards. ✓ Staying aligned with the quality goals ✓ Adjusting plans when necessary ✓ Capturing meaningful lessons for future test cycles PROS OF USING GENAI GenAI supports test monitoring and control by turning data into actionable insights from daily execution trends to high-level dashboards. It helps teams stay aligned with their quality goals, adjust plans when necessary, and capture meaningful lessons for future test cycles.

GenAI enhances decision-making, but human oversight ensures reliability. Still, testers must interpret the results and validate critical insights, especially when risk is high. GenAI enhances decision-making, but human oversight ensures reliability.

124 2.2.5 CHOOSING PROMPTING TECHNIQUES FOR SOFTWARE TESTING (K3)

125 Sec. 2.2.5 PROMPTING TECHNIQUE USE CASE RECOMMENDATION KEY FEATURES
AND RECOMMENDATIONS Prompt chaining Complex tasks requiring precision with human verification at each step Breaks tasks into smaller steps, useful for test analysis, test design and test automation, where each test step is checked for accuracy. Few-shot prompting Repetitive or specific/constrained output format tasks Provides examples to GenAI for repetitive generation with a specific pattern, for example in Gherkin style test case (e.g scenario-based), keyword-driven testing or test reporting with a specific output format. Meta prompting Flexible, dynamic tasks, useful for crafting prompts for new tasks General description of the objective and the task to be performed, which guides the LLM in the creation of the prompt. Useful for all kinds of complex tasks such as test report analysis and anomaly detection. The following table shows the suitability of the three prompting techniques mentioned in section 2.1.2 according to the characteristics of the test task. The first prompting technique is prompt chaining. Use case recommendation: Complex tasks requiring precision with human verification at each step. Key features and recommendations: Breaks tasks into smaller steps, useful for test analysis, test design and test automation, where each test step is checked for accuracy. The second prompting technique is few-shot prompting. Use case recommendation: Repetitive or specific/constrained output format tasks. Key features and recommendations: Provides examples to GenAI for repetitive generation with a specific pattern, for example in Gherkin style test case (e.g scenario-based), keyword-driven testing or test reporting with a specific output format. The third prompting technique is meta prompting. Use case recommendation: Flexible, dynamic tasks, useful for crafting prompts for new tasks. Key features and recommendations: General description of the objective and the task to be performed, which guides the LLM in the creation of the prompt. Useful for all kinds of complex tasks such as test report analysis and anomaly detection. It is also possible and, sometimes, beneficial to combine several prompt engineering techniques within the same use case.

126 Sec. 2.2.5 MULTI-LAYERED PROMPT STRATEGIES EXAMPLE META PROMPTING FEW-SHOT
PROMPTING • Generate an initial, well-structured prompt template • Adapt or expand example inputs or outputs PROMPT CHAINING • Break the overall activity into smaller, manageable subtasks • Verify intermediate results In practice, testers frequently create multi-layered prompt strategies. For example, a tester may start with meta prompting to generate an initial, well-structured prompt template. That generated prompt may include example inputs or outputs that require adaptation or expansion. This is where few-shot prompting becomes useful. Finally, to ensure that the task can be validated step by step, the tester can break the overall activity into smaller, manageable subtasks, applying prompt chaining to verify intermediate results before moving forward.

127 Sec. 2.2.5 MULTI-LAYERED PROMPT STRATEGIES META PROMPTING FEW-SHOT PROMPTING
PROMPT CHAINING • Prompt engineering techniques are not isolated tools. • They can be combined to create: ◦ more reliable ◦ more controlled ◦ more effective interactions with GenAI during test activities In other words, prompt engineering techniques are not isolated tools but they can be combined to create more reliable, more controlled, and more effective interactions with GenAI during test activities.

128 • Prompt engineering techniques help GenAI support a wide
range of testing activities, including test analysis, design, implementation, regression testing, monitoring, and control • GenAI can analyse requirements, identify defects and risks, generate test conditions, recommend test techniques, and support coverage analysis • LLMs accelerate test design and implementation by generating test cases, synthetic test data, automated scripts, and execution schedules • In regression testing, GenAI supports impact analysis, keyword-driven automation, self-healing tests, defect reporting, and test optimisation • GenAI enhances test monitoring and control by analysing metrics, identifying trends and risks, supporting prioritisation, and generating reports and dashboards • Different prompting techniques are suited for different tasks: prompt chaining for complex multi-step activities, few-shot prompting for structured repetitive outputs, and meta prompting for ﬂexible and evolving tasks KEY TAKEAWAYS – 2.2

129 1. How would you validate AI-generated test scripts before
integrating them into a CI/CD pipeline? 2. Which test monitoring or reporting activities in your organisation could be improved through GenAI-generated insights and summaries? REFLECTION – 2.2

130 2.3 EVALUATE GENERATIVE AI RESULTS AND REFINE PROMPTS FOR
SOFTWARE TEST TASKS

131 Sec. 2.3 EVALUATING GENAI RESULTS • A clear set
of metrics is required • Assessing: ◦ the quality ◦ relevance ◦ effectiveness METRICS General Speciﬁc • Applicable across many tasks • Speciﬁc to a particular testing activity To use Generative AI effectively in software testing, testers must be able to judge how good the AI’s outputs are. This requires a clear set of metrics or criteria that help assess the quality, relevance, and effectiveness of the generated artefacts. These metrics can be general (applicable across many tasks) or specific to a particular testing activity.

132 Sec. 2.3 EVALUATING GENAI RESULTS GOOD METRICS BETTER PROMPTING
BETTER TESTING OUTCOMES ✓ Identifying weaknesses ✓ Reﬁning prompts ✓ Selecting better techniques ✓ Ensuring that the results are ﬁt for purpose PROS FOR TESTERS By evaluating AI output systematically, testers can identify weaknesses, refine prompts, select better techniques, and ensure that the results are fit for purpose. In other words, good metrics lead to better prompting, and better prompting leads to better testing outcomes.

133 2.3.1 METRICS FOR EVALUATING THE RESULTS OF GENERATIVE AI
ON TEST TASKS (K2)

134 Sec. 2.3.1 The quality of GenAI output can vary
widely When GenAI is used to support test analysis, test design, test automation, or reporting, the quality of its output can vary widely. To ensure reliability, testers can assess the AI’s performance using several key metrics. Each metric focuses on a different aspect of what “good output” looks like in a testing context.

135 Sec. 2.3.1 METRICS FOR EVALUATING THE RESULTS OF GENAI
1. ACCURACY How correct the output is when compared to an authoritative reference: • expert-written test cases • requirement documents • organisational standards EXAMPLE ✓ Did the AI get the facts right? 1. Accuracy. Accuracy measures how correct the AI’s output is when compared to an authoritative reference. For example, expert-written test cases, requirement documents, or organisational standards. Did the AI get the facts right? Example. The degree to which generated test cases correctly cover all required behaviours specified in the test basis.

2. PRECISION How well the generated output satisﬁes a speciﬁc objective without adding irrelevant or incorrect information EXAMPLE How exact and on-target are the results? 2. Precision. Precision evaluates how well the generated output satisfies a specific objective without adding irrelevant or incorrect information. How exact and on-target are the results? Example. The degree to which the generated test cases correctly identify specific anomalies or boundary conditions without drifting into unrelated areas.

3. RECALL Measures the model’s ability to detect all relevant items or cases within the dataset or requirement set EXAMPLE ✓ X Did the AI miss anything important? 3. Recall. Recall measures the model’s ability to detect all relevant items or cases within the dataset or requirement set. Did the AI miss anything important? Example. How well the generated test cases cover all valid and invalid equivalence partitions for a given data class.

4. RELEVANCE AND CONTEXTUAL FIT Checks whether the output is appropriate for the given situation and aligns with the relevant context: • the test basis • domain rules • standards • constraints Is the output meaningful, applicable, and consistent with the context? EXAMPLE 4. Relevance and Contextual Fit. This metric checks whether the output is appropriate for the given situation and aligns with the relevant context, such as the test basis, domain rules, standards, or constraints. Is the output meaningful, applicable, and consistent with the context? Example. Whether generated test cases align with the domain-specific requirements, follow the structure of the test basis, and avoid introducing behaviour that does not exist in the real system.

5. DIVERSITY Assesses whether the AI produces a wide range of test ideas, inputs, or scenarios, rather than repetitive or overly similar cases Do we get variety, creativity, and edge cases? EXAMPLE 5. Diversity. Diversity assesses whether the AI produces a wide range of test ideas, inputs, or scenarios, rather than repetitive or overly similar cases. Do we get variety, creativity, and edge cases? Example. The degree to which the generated test cases include different user behaviours, multiple input categories, and edge-case explorations.

6. EXECUTION SUCCESS RATE Measures the percentage of generated test artefacts that can actually be executed successfully in the testing environment Do these tests run without breaking? EXAMPLE 6. Execution Success Rate. This measures the percentage of generated test artefacts that can actually be executed successfully in the testing environment. Do these tests run without breaking? Example. Determining how many AI-generated automation scripts run without syntax errors, broken locators, or formatting issues in a CI/CD environment.

7. TIME EFFICIENCY Measures how much time is saved when using GenAI compared to completing the task manually Does using GenAI actually save us time? EXAMPLE TIME ELAPSED 01:00:00 7. Time Efficiency. Time efficiency measures how much time is saved when using GenAI compared to completing the task manually. Does using GenAI actually save us time? Example. Comparing the time it takes GenAI to produce test cases versus the time a human tester would need to create the same set manually.

142 Sec. 2.3.1 Accuracy Precision Recall Relevance and Contextual Fit
Diversity Execution Success Rate Time Efficiency Task-Speciﬁc Metrics How well the GenAI supports speciﬁc test activities? ASSESSING THE GENAI PERFORMANCE IN TESTING ACTIVITIES ✓ Improving prompting strategies ✓ Identifying quality issues at an early stage ✓ Ensuring that AI-generated artefacts contribute meaningfully to the testing process PROS FOR TESTERS These metrics provide a solid framework for assessing the GenAI performance in testing activities. By measuring GenAI outputs systematically rather than relying on intuition, testers can improve their prompting strategies, identify quality issues early, and ensure that AI-generated artefacts contribute meaningfully to the testing process. In addition to these general metrics, task-specific metrics can be tailored to evaluate how well the GenAI supports specific test activities.

143 Sec. 2.3.1 I can validate the AI-generated testware! How
do we know even this output is correct? TEST ORACLE IN GENAI-SUPPORTED TESTING It is also worth noting that when Generative AI produces test cases, test data, scripts, or analysis results, testers must still answer a fundamental question: “How do we know even this output is correct?” This is where test oracles become essential. As we mentioned before, test oracle is any source of information that helps us decide whether a test result is correct. In AI-supported testing, oracles are used not only for the system under test, but also for validating the AI-generated testware itself.

144 Sec. 2.3.1 Requirements and acceptance criteria or checking whether
generated test cases truly reflect what is specified. COMMON TYPES OF TEST ORACLES USED WITH GENAI Business rules and domain knowledge or validating that AI outputs make sense in the real-world context. Existing verified test cases or comparing AI-generated tests to trusted, manually reviewed tests. Human expert judgment or using experienced testers as the final authority in high-risk situations. Cross-model comparison or generating the same output with two different LLMs and comparing results. Common types of test oracles used with GenAI include: • Requirements and acceptance criteria or checking whether generated test cases truly reflect what is specified. • Business rules and domain knowledge or validating that AI outputs make sense in the real-world context. • Existing verified test cases or comparing AI-generated tests to trusted, manually reviewed tests. • Human expert judgment or using experienced testers as the final authority in high-risk situations. • Cross-model comparison or generating the same output with two different LLMs and comparing results.

145 Invalid test case based on the requirement speciﬁcation Sec.
2.3.1 TEST ORACLE IN GENAI-SUPPORTED TESTING EXAMPLE Generate 12 test cases for a login feature Test Case 1: Login with valid credentials Test Case 2: Login with invalid password Test Case 3: Login with invalid username Test Case 4: Login with empty username Test Case 5: Login with empty password Test Case 6: Login with expired password Test Case 7: Password case sensitivity check Test Case 8: Username case sensitivity check Test Case 9: SQL injection attempt during login Test Case 10: Account lock after multiple failed login attempts Test Case 11: Login with leading or trailing spaces in username Test Case 12: Remember Me functionality during login Example. A tester asks an LLM to generate test cases for a login feature. The AI creates 12 test cases, including one for “login with expired password.” The test oracle is the requirement specification. When the tester checks it, they discover that password expiration is not implemented yet. The oracle reveals that this test case is invalid, even though it looks realistic. Without a clear test oracle, AI-generated output can appear correct while actually being wrong or irrelevant. Test oracles are therefore a critical safety mechanism when working with GenAI. They ensure that AI does not become the sole judge of correctness and that human-controlled validation remains in place, especially for high-risk test tasks.

146 2.3.2 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS (K2)
Once metrics are defined (accuracy, precision, recall, etc.), testers can actively improve GenAI performance by applying structured techniques for evaluating and refining prompts.

147 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS Sec. 2.3.2
Help teams transform an initial, imperfect prompt into one that produces reliable, consistent, and high-quality outputs These techniques help teams transform an initial, imperfect prompt into one that produces reliable, consistent, and high-quality outputs. Below are the key techniques that support prompt refinement and continuous improvement.

148 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS 1. ITERATIVE
PROMPT MODIFICATION • Reﬁnement includes: ◦ adding context ◦ clarifying terminology ◦ narrowing the scope ◦ specifying the output format Sec. 2.3.2 • Iterative prompt modification. Start with a basic prompt, review the AI’s output, and then adjust the prompt based on what was missing or unclear. This cycle repeats until the results meet the required quality. This refinement might include adding context, clarifying terminology, narrowing the scope or specifying the output format

149 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS 2. A/B
TESTING OF PROMPTS • Create two or more versions of a prompt • Compare their outputs using predeﬁned metrics • Find the best version Sec. 2.3.2 • A/B testing of prompts. Create two or more versions of a prompt and compare their outputs using predefined metrics (e.g., accuracy, precision, or diversity). Basically, run two prompt versions head-to-head and keep the winner.

150 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS 3. OUTPUT
ANALYSIS • Review the output • Check for: ◦ inaccuracies ◦ contradictions ◦ missing test conditions ◦ misinterpretations of the requirements ◦ overly generic or repetitive cases • Understand why the prompt failed • Reﬁne the prompt Sec. 2.3.2 • Output analysis. Carefully review the AI’s output and check for inaccuracies, contradictions, missing test conditions, misinterpretations of the requirements and overly generic or repetitive cases. This analysis helps testers understand why the prompt failed and how to refine it.

151 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS 3. OUTPUT
ANALYSIS EXAMPLE You’re the 50th user! Sec. 2.3.2 For example, if the AI repeatedly misinterprets a requirement about “maximum 50 users,” it may need more explicit constraints or examples in the prompt.

152 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS 4. INTEGRATE
USER FEEDBACK • Collect input from testers or stakeholders with the highlighted: ◦ unclear steps ◦ missing details ◦ formatting issues ◦ unnecessary complexity ◦ inconsistencies with team standards • Reﬁne the prompt Sec. 2.3.2 • Integrate user feedback. Collect input from testers or stakeholders who use the AI-generated artefacts. They can highlight unclear steps, missing details, formatting issues, unnecessary complexity and inconsistencies with team standards. This human feedback helps refine prompts so that the output better supports real-world testing tasks. If testers say the generated test cases are too long or don’t include negative scenarios, then prompts can be updated accordingly.

153 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS 5. ADJUST
PROMPT LENGTH AND SPECIFICITY • Experiment with making prompts to see which version produces better results: ◦ longer/more detailed ◦ shorter/more abstract This is today’s winner! Sec. 2.3.2 • Adjust prompt length and specificity. Experiment with making prompts longer/more detailed or shorter/more abstract to see which version produces better results. Sometimes more detail helps; sometimes it overwhelms the model. A longer prompt may help the AI follow strict rules or use domain-specific terminology. But at the same time, a shorter prompt may help avoid overfitting and encourage broader thinking or more diverse test ideas.

154 TECHNIQUES FOR EVALUATING AND ITERATIVELY REFINING PROMPTS GenAI becomes
not just a tool used by individuals, but a collaborative asset that grows more effective as the team learns, shares, and iterates together. Sec. 2.3.2 ✓ More consistent test artefacts ✓ Fewer repeated errors ✓ Clearer team-wide prompting practices ✓ Continuous improvement of test methodologies ✓ Creation of shared prompt libraries PROS FOR TEST TEAMS By applying these techniques, test teams can improve the quality of GenAI prompts over time. This enables more consistent test artefacts, fewer repeated errors, clearer team-wide prompting practices, continuous improvement of test methodologies and the creation of shared prompt libraries to support the whole organisation. GenAI becomes not just a tool used by individuals, but a collaborative asset that grows more effective as the team learns, shares, and iterates together.

155 • GenAI outputs should be evaluated using metrics such
as accuracy, precision, recall, relevance, diversity, execution success rate, and time efficiency • Test oracles such as requirements, business rules, verified test cases, and expert judgment help validate AI-generated outputs and reduce risks • Prompt refinement techniques such as iterative modification, A/B testing, output analysis, and user feedback improve the quality and consistency of AI responses KEY TAKEAWAYS – 2.3

156 1. How could teams build and maintain shared prompt
libraries to improve consistency across projects? 2. Which evaluation metrics would be most important when assessing AI-generated test artefacts in your projects, and why? 3. How can continuous feedback and reﬁnement improve the long-term effectiveness of GenAI in testing workﬂows? REFLECTION – 2.3

157 • Prompt engineering is essential for effective AI-assisted testing
because prompt quality directly influences the accuracy, relevance, and structure of LLM outputs • Well-structured prompts typically include role, context, instruction, input data, constraints, and output format to reduce ambiguity and improve consistency • Prompting techniques such as prompt chaining, few-shot prompting, and meta prompting help testers manage complex tasks, improve output quality, and create reusable prompt strategies • GenAI can support many testing activities, including test analysis, test design, test implementation, regression testing, monitoring, and reporting, while still requiring human oversight • AI-generated outputs should be evaluated using metrics such as accuracy, precision, recall, relevance, diversity, execution success rate, and time efficiency • Continuous prompt refinement, test oracles, human feedback, and validation against requirements are critical for ensuring reliable and trustworthy AI-supported testing workflows KEY TAKEAWAYS AND SUMMARY

158 Answer these questions after completing the reading: 1. Which
prompting techniques would be most suitable for complex multi-step testing tasks, and why? 2. What is the difference between a system prompt and a user prompt, and how do they inﬂuence LLM behaviour? 3. In what ways can GenAI improve test design, test implementation, and automated regression testing? 4. How can GenAI support self-healing automated tests when UI elements or APIs change? (You should answer using examples from your own projects where possible.) REFLECTION AND KNOWLEDGE CHECK

159 • ISTQB® Certiﬁed Tester Specialist Level Testing with Generative
AI (CT-GenAI) Syllabus Version 1.1, 2026 REFERENCES

160 Learner feedback is collected to support continuous improvement of
delivery and materials. Understanding is evaluated through: • Chapter quiz covering key concepts from this chapter • Q&A session to clarify questions arising from the activities and quiz FEEDBACK AND EVALUATION

161 Thank You!

Chapter 2 – Prompt Engineering for Effective So...

Chapter 2 – Prompt Engineering for Effective Software Testing (ISTQBⓇ CT-GenAI v1.1). Slides

More Decks by Exactpro

Other Decks in Technology

Featured

Transcript