AI & Software Testing - Using NLP to Detect Requirements Defects

Slide 1

Slide 1 text

Build Software to Test Software
exactpro.com
Research Seminar on AI in Test: Using NLP to Detect Requirements Defects
10.11.2020
Murad Mamedov, AI Researcher

Slide 2

Slide 2 text

Using NLP to Detect Requirements Defects
Topic: Using NLP to Detect Requirements Defects: an Industrial Experience in the Railway Domain
Purpose: The authors want to verify requirements for completeness, clarity, preciseness, unequivocality, verifiability, testability, maintainability, and feasibility.
Details:
● The research consists of two parts: a preliminary analysis and a large-set analysis
● The authors compared manual verification with NLP analysis
Link: https://www.researchgate.net/publication/313867290_Using_NLP_to_Detect_Requirements_Defects_An_Industrial_Experience_in_the_Railway_Domain

Slide 3

Slide 3 text

Workflow of the Research
Preliminary Research
● Defect Classes Detection
  Role: VE1. Subtask: To assess relevance of the defects
  Role: NLP-E. Subtask: To assess feasibility of the defects
● Subtask: To define patterns for each defect class using GATE
Full Set Research
● Annotate Raw Data
  Role: VE3. Subtask: Data Markup
  Role: VE1, VE2. Subtask: Markup Review
● Execution
● Metrics Assessment
Two iterations

Slide 4

Slide 4 text

GATE Tool Overview
● Tokenization: Splits a document into separate tokens, e.g. words, numbers, spaces, punctuation
● POS Tagging: Assigns a part of speech to each token, e.g. noun (NN), verb (VB), adjective (JJ)
● Shallow Parsing: Identifies noun and verb phrases, e.g. in the sentence “Messages are received by the system”, {messages, the system} are NPs and {are received} is a VP
● JAPE Rules: Lets you define matching rules similar to regular expressions
● Gazetteer: Searches the text for terms from a predetermined list
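The tokenization and gazetteer steps above can be illustrated with a minimal plain-Python sketch. This is not GATE code (GATE is a Java toolkit with its own API); the token kinds, the `tokenize` and `gazetteer_lookup` helpers, and the three-word vague-term list are all assumptions made for illustration only.

```python
import re

# Hypothetical gazetteer list; the actual term lists used in the study
# are not given in the slides.
VAGUE_TERMS = {"adequate", "appropriate", "sufficient"}

# A token is a run of word characters, a run of whitespace,
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split text into (string, kind) pairs: word, number, space, punctuation."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        s = match.group()
        if s.isspace():
            kind = "space"
        elif s.isdigit():
            kind = "number"
        elif s[0].isalnum() or s[0] == "_":
            kind = "word"
        else:
            kind = "punctuation"
        tokens.append((s, kind))
    return tokens


def gazetteer_lookup(tokens, terms):
    """Return the word tokens that appear in the given term list."""
    return [s for s, kind in tokens if kind == "word" and s.lower() in terms]


if __name__ == "__main__":
    toks = tokenize("The delay shall be adequate.")
    print(toks)
    print(gazetteer_lookup(toks, VAGUE_TERMS))
```

POS tagging and shallow parsing need a trained tagger, so they are left out of this sketch.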

Slide 5

Slide 5 text

Patterns for Defects Prediction
1. Anaphoric ambiguity: A pronoun refers back to earlier text where more than one referent is possible
2. Coordination ambiguity: Conjunctions lead to multiple interpretations
3. Vague terms: A term has no precise semantics
4. Modal adverb: Adverbs that have the suffix -ly
5. Passive voice: Passive constructions that are not followed by the subject performing the action
6. Excessive length: Sentences longer than 60 tokens
7. Missing condition: Each "if" should have a corresponding "else" or "otherwise"
8. Missing a unit of measurement: Each number is required to have an associated unit of measurement, unless the number represents a reference
9. Missing reference: A reference appears in the text of the requirements but not in the list of references
10. Undefined term: This company used camelCase notation for terms, and the researchers checked whether every such term is present in the glossary of the requirements document

Slide 6

Slide 6 text

JAPE Rules for Patterns
1. Anaphoric ambiguity
   P_ANA = (NP)(NP)+ (Split)[0,1] (Token.POS == PP | Token.POS =~ PR*)
2. Coordination ambiguity
   P_CO1 = ((Token)+ (Token.string == AND | OR))[2]
   P_CO2 = (Token.POS == JJ) (Token.POS == NN | NNS) (Token.string == AND | OR) (Token.POS == NN | NNS)
3. Vague terms
   P_VAG = (Token.string ∈ Vague)
4. Modal adverb
   P_ADV = (Token.POS == RB | RBR), (Token.string =~ "[.]*ly$")
5. Passive voice
   P_PV = (AUXVERB)(NOT)?(Token.POS == RB | RBR)? (Token.POS == VBN)
6. Excessive length
   P_LEN = Sentence.len > 60
7. Missing condition
   P_MC = (IF)(Token, !Token.kind == punctuation)* (Token.kind == punctuation)(!(ELSE | OTHERWISE))
8. Missing a unit of measurement
   P_MU1 = (NUMBER)((Token)[0,1](NUMBER))?(!MEASUREMENT)
   P_MU2 = (NUMBER)((Token)[0,1](NUMBER))?(!PERCENT)
9. Missing reference
   P_MR = (Token.string == "Ref")(Token.string == ".") (SpaceToken)?(NUMBER)
10. Undefined term
   P_UT = (Token.kind == word, Token.orth == mixedCaps)
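To make a few of these rules concrete, here is a rough plain-Python approximation of the patterns that do not need a POS tagger (modal adverb suffix, excessive length, missing condition, undefined term). It is a sketch under stated assumptions, not a translation of the JAPE grammar: the function names are hypothetical, the missing-condition check works on a whole sentence instead of the token sequence the rule describes, and "mixedCaps" is approximated as a lowercase letter followed directly by an uppercase one.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words/numbers and punctuation, no spaces


def ly_candidates(sentence):
    """Suffix half of the modal adverb rule: words ending in -ly (no POS check)."""
    return [w for w in WORD_RE.findall(sentence) if w.lower().endswith("ly")]


def is_excessive_length(sentence, limit=60):
    """Excessive length rule: more than `limit` tokens in the sentence."""
    return len(TOKEN_RE.findall(sentence)) > limit


def mixed_caps_terms(sentence):
    """Approximate mixedCaps: a lowercase letter directly followed by an uppercase one."""
    return [w for w in WORD_RE.findall(sentence) if re.search(r"[a-z][A-Z]", w)]


def undefined_terms(sentence, glossary):
    """Undefined term rule: mixedCaps terms that are missing from the glossary."""
    return [w for w in mixed_caps_terms(sentence) if w not in glossary]


def has_missing_condition(sentence):
    """Rough missing-condition check: an 'if' with no 'else' or 'otherwise'."""
    words = {w.lower() for w in WORD_RE.findall(sentence)}
    return "if" in words and not ({"else", "otherwise"} & words)


if __name__ == "__main__":
    s = ("If the trainSpeed exceeds the limit, "
         "the system shall immediately apply the brake.")
    print(ly_candidates(s))
    print(undefined_terms(s, glossary=set()))
    print(has_missing_condition(s))
    print(is_excessive_length(s))
```

On the example sentence the suffix check flags both "immediately" and the verb "apply", which shows why the original rule also requires an adverb POS tag (RB | RBR) before the -ly test.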

Slide 7

Slide 7 text

Dataset Structure
Area: Railway signalling software, consisting of 4 components
Volume: The raw dataset has 1866 requirements; each requirement may have zero, one, or more than one defect
Data Markup Approach: Manual (in order to compare with GATE’s output)
Markup Stages:
a. Whether a requirement was accepted or rejected
b. If it was rejected, then why: completeness, clarity, preciseness, unequivocality, verifiability, testability, maintainability, feasibility
c. If it was rejected due to completeness, clarity, or unequivocality, then what exactly was lacking from the Patterns perspective
Markup Output:
● 1733 accepted reqs
● 93 rejected reqs
● Majority of the defects are due to passive voice

Slide 8

Slide 8 text

Dataset Structure

Dataset A
ReqID | Accepted | ... | Completeness | Clarity | Unequivocality
req_1 | 1 | ... | 0 | 0 | 0
req_2 | 0 | ... | 0 | 0 | 0
...
req_n | 0 | ... | 1 | 0 | 1

req_1 = has no bugs and was accepted
req_2 = has bugs, but not related to this research
req_n = has 2 bugs, one on Completeness and one on Unequivocality

Dataset B
ReqID | FragID | PatternID
req_1 | frag_1 | pattern_4
req_1 | frag_2 | pattern_2
...
req_m | frag_n | pattern_7

frag_1 = modal adverb defect
frag_2 = coordination ambiguity
frag_n = missing condition
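One way to picture the two datasets in code is as two record types, one per table. The sketch below is only an illustration: the class and field names are assumptions, the elided "..." columns are left out, and the rows simply mirror the examples on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequirementRow:
    """One row of Dataset A (field names are illustrative)."""
    req_id: str
    accepted: bool
    completeness: bool = False
    clarity: bool = False
    unequivocality: bool = False

    def in_scope_defects(self):
        """Names of the three in-scope criteria flagged for this requirement."""
        flags = {
            "completeness": self.completeness,
            "clarity": self.clarity,
            "unequivocality": self.unequivocality,
        }
        return [name for name, flagged in flags.items() if flagged]


@dataclass
class FragmentRow:
    """One row of Dataset B: a defective fragment and the pattern it matches."""
    req_id: str
    frag_id: str
    pattern_id: int


def patterns_by_requirement(fragments):
    """Group pattern ids by requirement id, keeping row order."""
    grouped = defaultdict(list)
    for frag in fragments:
        grouped[frag.req_id].append(frag.pattern_id)
    return dict(grouped)


# Rows mirroring the slide's examples.
dataset_a = [
    RequirementRow("req_1", accepted=True),
    RequirementRow("req_2", accepted=False),
    RequirementRow("req_n", accepted=False, completeness=True, unequivocality=True),
]
dataset_b = [
    FragmentRow("req_1", "frag_1", 4),  # modal adverb
    FragmentRow("req_1", "frag_2", 2),  # coordination ambiguity
    FragmentRow("req_m", "frag_n", 7),  # missing condition
]

if __name__ == "__main__":
    for row in dataset_a:
        print(row.req_id, row.accepted, row.in_scope_defects())
    print(patterns_by_requirement(dataset_b))
```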

Slide 9

Slide 9 text

Measurement Approach
Evaluation Measures by Defect: Precision and recall are calculated over “defects”, where one defect is a fragment of a requirement that contains a flaw matching one of the Patterns.
Evaluation Measures by Requirement: Focus on the requirements themselves, in order to track the efficiency of the patterns applied together.
● tpD: number of requirement fragments labeled as defective and correctly identified by the pattern
● fpD: number of requirement fragments wrongly identified as defective by the pattern
● fnD: number of requirement fragments labeled as defective that are not discovered by the pattern
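For reference, the standard definitions of precision and recall in terms of these counts are given below. The slide does not spell the formulas out, so the symbols p_D and r_D are used here only for illustration.

```latex
p_D = \frac{tp_D}{tp_D + fp_D},
\qquad
r_D = \frac{tp_D}{tp_D + fn_D}
```

Precision answers "how many of the flagged fragments were real defects", while recall answers "how many of the real defects were flagged".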

Slide 10

Slide 10 text

Patterns Tuning after Preliminary Research
1. Anaphoric ambiguity: no changes
2. Coordination ambiguity: no changes
3. Vague terms: Added new terms and added stop-words (domain-specific words)
4. Modal adverb: no changes
5. Passive voice: no changes
6. Excessive length: Lists were also recognized as excessive-length sentences (won’t fix)
7. Missing condition: An if-else construction can be replaced with an if-if construction in some cases (won’t fix)
8. Missing a unit of measurement: Measurements are not applicable to ranges
9. Missing reference: no changes
10. Undefined term: no changes

Slide 11

Slide 11 text

Evaluation of the Results
False Negative: FN errors appeared because some requirements have defects that are not covered by the patterns, e.g. testability and feasibility defects.
False Positive: FP errors appeared due to a lack of standardization:
● VE1 and VE3 had different understandings of how to annotate the data
● VE3 tolerated some of the linguistic defects (e.g. Vague Terms) and marked up only “tech” defects

Slide 12

Slide 12 text

Evaluation of the Results

Defect Class | D | R | tpD | fpD | pD
Anaphoric ambiguity | 387 | 327 | 258 | 129 | 66.6%
Coordination ambiguity | 263 | 213 | 190 | 73 | 72.24%
Vague terms | 496 | 306 | 290 | 206 | 58.46%
Modal adverbs | 476 | 373 | 331 | 145 | 69.53%
Passive voice | 1265 | 615 | 1242 | 23 | 98.1%
Excessive length | 16 | 16 | 16 | 0 | 100%
Missing condition | 188 | 148 | 129 | 59 | 68.61%
Missing unit of measurement | 0 | 0 | 0 | 0 | -
Missing reference | 4 | 2 | 4 | 0 | 100%
Undefined term | 54 | 49 | 43 | 11 | 79.62%
Average pD: 79.24%

Requirements
tpR | fpR | pR
1042 | 175 | 85.6%
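The precision column can be recomputed from the counts with a few lines of Python. The sketch below assumes that D is the total number of fragments flagged by a pattern, so that pD = tpD / D; that reading is an assumption, not something the slide states. Differences in the last digit against the table are expected if the slide truncated its values instead of rounding them.

```python
# (D, tpD) per defect class, copied from the results table.
# "Missing unit of measurement" has D = 0, so its precision is undefined
# and it is left out.
RESULTS = {
    "Anaphoric ambiguity": (387, 258),
    "Coordination ambiguity": (263, 190),
    "Vague terms": (496, 290),
    "Modal adverbs": (476, 331),
    "Passive voice": (1265, 1242),
    "Excessive length": (16, 16),
    "Missing condition": (188, 129),
    "Missing reference": (4, 4),
    "Undefined term": (54, 43),
}

TP_R, FP_R = 1042, 175  # requirement-level counts from the same slide


def precision_pct(flagged, true_positives):
    """Precision as a percentage: true positives over everything flagged."""
    return 100.0 * true_positives / flagged


per_class = {name: precision_pct(d, tp) for name, (d, tp) in RESULTS.items()}
average = sum(per_class.values()) / len(per_class)
req_precision = precision_pct(TP_R + FP_R, TP_R)

if __name__ == "__main__":
    for name, value in per_class.items():
        print(f"{name}: {value:.2f}%")
    print(f"Average: {average:.2f}%")
    print(f"By requirement: {req_precision:.2f}%")
```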

Slide 13

Slide 13 text

Conclusion
● In-House NLP: The tool has to be used and customized properly
● Requirements Language Counts: Such tools are recommended for requirements editors (instead of VEs), because the issue is writing style (e.g. the major defect pattern, “passive voice”)
● Validation Criteria Counts: Make the criteria clear before annotation
● NLP is Only a Part of the Answer: The described techniques were not able to detect testability/feasibility defects
● Statistical NLP vs Lexical Techniques: If you use a lexical approach instead of a statistical one, the patterns need more careful revision