TMPA-2021: An approach to modules similarity definition based on the system log analysis

1 25-27 November. Software Testing, Machine Learning and Complex Process Analysis
An approach to modules similarity definition based on the system log analysis
Ilya Samonenko, HSE
Tamara Voznesenskaya, HSE
Rostislav Yavorskiy, TPU

2 Distributed modular system
Modules (Module 1, Module 2, Module 3, ..., Module n): $M = \{m_1, \dots, m_n\}$
Tokens (Token 1, Token 2, ..., Token r): $T = \{t_1, \dots, t_r\}$
Outcomes (Outcome 1, Outcome 2, ..., Outcome k): $O = \{o_1, \dots, o_k\}$
Digital mark: $\sigma: M \times T \to O \cup \{\mathrm{undef}\}$ assigns an outcome to a pair (Module i, Token j).
If the token t is missing in the module m, then $\sigma(m, t) = \mathrm{undef}$.
Digital trace of module m: $\tau(m) = (\sigma(m, t_1), \sigma(m, t_2), \dots, \sigma(m, t_r))$
Log of distributed modular system S: $\Lambda(S) = (\tau(m_1), \dots, \tau(m_n))$
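The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the storage format (a dictionary keyed by module and token), the function names and the sample marks are assumptions, not part of the talk. `None` stands for undef.

```python
# Assumption: the raw log is a mapping (module, token) -> outcome.
# A missing key plays the role of "undef".

def sigma(marks, m, t):
    """Digital mark: outcome of token t in module m, or None (undef)."""
    return marks.get((m, t))

def trace(marks, m, tokens):
    """Digital trace of module m: its marks over all tokens, in a fixed order."""
    return tuple(sigma(marks, m, t) for t in tokens)

def system_log(marks, modules, tokens):
    """Log of the system: the trace of every module."""
    return {m: trace(marks, m, tokens) for m in modules}

# Made-up sample data for illustration only.
marks = {("m1", "t1"): 1, ("m1", "t2"): 3, ("m2", "t1"): 1, ("m2", "t3"): 3}
tokens = ["t1", "t2", "t3"]
log = system_log(marks, ["m1", "m2"], tokens)
print(log)
```

Keeping the token order fixed is what makes traces of different modules comparable position by position.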

3 Comparable modules
Let s be some fixed natural number. Two modules m1 and m2 are comparable if there exist at least s tokens t for which both $\sigma(m_1, t)$ and $\sigma(m_2, t)$ are defined.
Example: let $O = \{0, 1, 2, 3, \dots, k\}$ and s = 3.
$\tau(m_1) = (1, 3, \mathrm{undef}, \mathrm{undef}, 1, \mathrm{undef}, 2, \mathrm{undef}, 2)$
$\tau(m_2) = (1, \mathrm{undef}, 3, \mathrm{undef}, 3, \mathrm{undef}, 2, \mathrm{undef}, \mathrm{undef})$
m1 and m2 are comparable: both marks are defined at three positions (the first, fifth and seventh).
From here on we consider only comparable pairs of modules.
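The comparability test can be checked on the example traces with a short sketch. The function names are illustrative assumptions, and `None` again stands for undef.

```python
def shared_tokens(trace_a, trace_b):
    """Positions (0-based) where both traces have a defined mark."""
    return [i for i, (a, b) in enumerate(zip(trace_a, trace_b))
            if a is not None and b is not None]

def comparable(trace_a, trace_b, s):
    """Two modules are comparable if they share at least s defined tokens."""
    return len(shared_tokens(trace_a, trace_b)) >= s

# Traces from the slide.
tau_m1 = (1, 3, None, None, 1, None, 2, None, 2)
tau_m2 = (1, None, 3, None, 3, None, 2, None, None)

print(shared_tokens(tau_m1, tau_m2))   # positions shared by m1 and m2
print(comparable(tau_m1, tau_m2, s=3))
```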

4 Modules similarity
Suppose we know that some comparable pairs (m1, m2) are similar (in some sense). Let $\pi \subset M \times M$ be the set of all comparable pairs (m1, m2) that are known to be similar.
Remarks:
1. If $(m_1, m_2) \notin \pi$, then we know nothing about m1 and m2: they may be similar or not.
2. $|\pi|$ may be much smaller than $|M \times M|$.
Our goal is to define a distance function d on M that respects similarity: for all $m_1, m_2, m_3, m_4 \in M$, if $(m_1, m_2) \in \pi$ and $(m_3, m_4) \notin \pi$, then $d(m_1, m_2) \le d(m_3, m_4)$.
If such a function does not exist, then our goal is to find the best approximation.
Why? To detect abnormal behavior. We can cluster the modules using the distance function d. Suppose we know that:
- the red modules are similar, therefore they are close to each other;
- the blue modules are similar, therefore they are close to each other.
If one blue module suddenly turns out to be far away from its cluster, that is a reason to check what is happening to it.
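The "respects similarity" condition can be checked mechanically for a candidate d. The sketch below is an assumption-laden illustration: d is given as a dictionary over pairs, and the violation rate is only one possible way to measure how far d is from the ideal. The slides do not say how the "best approximation" is scored.

```python
def respects_similarity(d, pairs, known_similar):
    """True if every pair in pi is at least as close as every pair outside pi."""
    inside = [d[p] for p in pairs if p in known_similar]
    outside = [d[p] for p in pairs if p not in known_similar]
    if not inside or not outside:
        return True
    return max(inside) <= min(outside)

def violation_rate(d, pairs, known_similar):
    """Share of (similar, other) combinations where the similar pair is farther."""
    inside = [d[p] for p in pairs if p in known_similar]
    outside = [d[p] for p in pairs if p not in known_similar]
    total = len(inside) * len(outside)
    if total == 0:
        return 0.0
    return sum(1 for a in inside for b in outside if a > b) / total

# Made-up distances for illustration only.
pairs = [("m1", "m3"), ("m1", "m10"), ("m1", "m2"), ("m1", "m5")]
pi = {("m1", "m3"), ("m1", "m10")}
d_good = {pairs[0]: 0.1, pairs[1]: 0.1, pairs[2]: 0.9, pairs[3]: 0.2}
d_bad = {pairs[0]: 0.1, pairs[1]: 0.1, pairs[2]: 0.9, pairs[3]: 0.05}

print(respects_similarity(d_good, pairs, pi), violation_rate(d_good, pairs, pi))
print(respects_similarity(d_bad, pairs, pi), violation_rate(d_bad, pairs, pi))
```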

5 Standard distance and correlation functions of digital traces …
We will build a new distance function based on standard distance and correlation functions, computed on the digital marks of the tokens that both modules share. The functions should be normalized, and their maximum and minimum values should not depend on the number of shared tokens.
Example 1:
$\tau(m_1) = (1, 3, \mathrm{undef}, \mathrm{undef}, 1, \mathrm{undef}, 2, \mathrm{undef}, 2)$
$\tau(m_2) = (1, \mathrm{undef}, 3, \mathrm{undef}, 3, \mathrm{undef}, 2, \mathrm{undef}, \mathrm{undef})$
$L_1(m_1, m_2) = \frac{1}{3}\left(|1 - 1| + |1 - 3| + |2 - 2|\right) = \frac{2}{3}$
Example 2:
$\tau(m_3) = (1, 3, 5, \mathrm{undef}, 1, \mathrm{undef}, 2, \mathrm{undef}, 2)$
$\tau(m_2) = (1, \mathrm{undef}, 3, \mathrm{undef}, 3, \mathrm{undef}, 2, \mathrm{undef}, \mathrm{undef})$
$L_1(m_2, m_3) = \frac{1}{4}\left(|1 - 1| + |3 - 5| + |3 - 1| + |2 - 2|\right) = 1$
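The two worked values of $L_1$ can be reproduced with a small sketch. The function name is an assumption. The normalization shown (dividing by the number of shared tokens) is the one used in the slide's examples.

```python
def l1_shared(trace_a, trace_b):
    """Mean absolute difference of marks over tokens defined in both traces."""
    shared = [(a, b) for a, b in zip(trace_a, trace_b)
              if a is not None and b is not None]
    if not shared:
        raise ValueError("traces share no tokens")
    return sum(abs(a - b) for a, b in shared) / len(shared)

# Traces from the slide; None stands for undef.
tau_m1 = (1, 3, None, None, 1, None, 2, None, 2)
tau_m2 = (1, None, 3, None, 3, None, 2, None, None)
tau_m3 = (1, 3, 5, None, 1, None, 2, None, 2)

print(l1_shared(tau_m1, tau_m2))  # 2/3 on the slide
print(l1_shared(tau_m2, tau_m3))  # 1 on the slide
```

Dividing by the number of shared tokens keeps the range of the function independent of how many tokens two modules share, which is the requirement stated on the slide.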

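6 Approach

| M | M | L1 | … | LK | similar | prob | d |
|---|---|---|---|---|---|---|---|
| m1 | m2 | 4 | … | 3 | ? | | |
| m1 | m3 | 2 | … | 3 | True | | |
| m1 | m5 | 1 | … | 2 | ? | | |
| m1 | m6 | 3 | … | 4 | ? | | |
| m1 | m10 | 3 | … | 3 | True | | |
| m2 | m3 | 5 | … | 5 | ? | | |
| m2 | m5 | 5 | … | 5 | ? | | |
| … | … | … | … | … | … | | |
| m7 | m10 | 2 | … | 2 | ? | | |
| m9 | m10 | 4 | … | 2 | True | | |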

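7 Approach
Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: «similar» or «not similar».
Train a positive-undefined (PU) classifier for two classes: «similar» and «not similar».

| M | M | L1 | … | LK | similar | prob | d |
|---|---|---|---|---|---|---|---|
| m1 | m2 | 4 | … | 3 | ? | | |
| m1 | m3 | 2 | … | 3 | True | | |
| m1 | m5 | 1 | … | 2 | ? | | |
| m1 | m6 | 3 | … | 4 | ? | | |
| m1 | m10 | 3 | … | 3 | True | | |
| m2 | m3 | 5 | … | 5 | ? | | |
| m2 | m5 | 5 | … | 5 | ? | | |
| … | … | … | … | … | … | | |
| m7 | m10 | 2 | … | 2 | ? | | |
| m9 | m10 | 4 | … | 2 | True | | |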

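8 Approach

| M | M | L1 | … | LK | similar | prob | d |
|---|---|---|---|---|---|---|---|
| m1 | m2 | 4 | … | 3 | False | 0.1 | |
| m1 | m3 | 2 | … | 3 | True | 0.9 | |
| m1 | m5 | 1 | … | 2 | True | 0.8 | |
| m1 | m6 | 3 | … | 4 | False | 0.4 | |
| m1 | m10 | 3 | … | 3 | True | 0.9 | |
| m2 | m3 | 5 | … | 1 | False | 0.3 | |
| m2 | m5 | 5 | … | 5 | True | 0.5 | |
| … | … | … | … | … | … | … | |
| m7 | m10 | 2 | … | 2 | True | 0.7 | |
| m9 | m10 | 4 | … | 2 | True | 0.8 | |

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: «similar» or «not similar».
Train a positive-undefined (PU) classifier for two classes: «similar» and «not similar».
Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in class «similar».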
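The slides do not say which PU classifier was trained. As a minimal sketch, assuming the simplest baseline (treat every unlabeled "?" pair as a negative and fit a logistic regression), the probability column could be produced as follows. The toy data, the single feature and the hyperparameters are all made up for illustration.

```python
import math

def predict_proba(w, b, x):
    """Logistic model: estimated probability that feature vector x is «similar»."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=5000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errors, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errors) / n
    return w, b

# Made-up toy data: one feature (a normalized distance) per pair of modules.
X = [[0.1], [0.2], [0.3], [0.2], [0.8], [0.9], [1.0]]
# 1 = known similar (in pi), 0 = unlabeled ("?"), treated as negative here.
s = [1, 1, 1, 0, 0, 0, 0]

w, b = train_logistic(X, s)
p_close = predict_proba(w, b, [0.1])  # pair with a small feature distance
p_far = predict_proba(w, b, [0.9])    # pair with a large feature distance
print(p_close, p_far)
```

Because some unlabeled pairs are in fact similar, this baseline tends to underestimate p. Dedicated PU methods correct for that, and the choice between them is outside what the slides specify.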

9 Approach

| M | M | L1 | … | LK | similar | prob | d |
|---|---|---|---|---|---|---|---|
| m1 | m2 | 4 | … | 3 | False | 0.1 | 0.9 |
| m1 | m3 | 2 | … | 3 | True | 0.9 | 0.1 |
| m1 | m5 | 1 | … | 2 | True | 0.8 | 0.2 |
| m1 | m6 | 3 | … | 4 | False | 0.4 | 0.6 |
| m1 | m10 | 3 | … | 3 | True | 0.9 | 0.1 |
| m2 | m3 | 5 | … | 1 | False | 0.3 | 0.7 |
| m2 | m5 | 5 | … | 5 | True | 0.5 | 0.5 |
| … | … | … | … | … | … | … | … |
| m7 | m10 | 2 | … | 2 | True | 0.7 | 0.3 |
| m9 | m10 | 4 | … | 2 | True | 0.8 | 0.2 |

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: «similar» or «not similar».
Train a positive-undefined (PU) classifier for two classes: «similar» and «not similar».
Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in class «similar».
Define a function $d(m_i, m_j) = 1 - p(m_i, m_j)$.

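10 Approach

| M | M | L1 | … | LK | similar | prob | d |
|---|---|---|---|---|---|---|---|
| m1 | m2 | 4 | … | 3 | False | 0.1 | 0.9 |
| m1 | m3 | 2 | … | 3 | True | 0.9 | 0.1 |
| m1 | m5 | 1 | … | 2 | True | 0.8 | 0.2 |
| m1 | m6 | 3 | … | 4 | False | 0.4 | 0.6 |
| m1 | m10 | 3 | … | 3 | True | 0.9 | 0.1 |
| m2 | m3 | 5 | … | 1 | False | 0.3 | 0.7 |
| m2 | m5 | 5 | … | 5 | True | 0.5 | 0.5 |
| … | … | … | … | … | … | … | … |
| m7 | m10 | 2 | … | 2 | True | 0.7 | 0.3 |
| m9 | m10 | 4 | … | 2 | True | 0.8 | 0.2 |

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: «similar» or «not similar».
Train a positive-undefined (PU) classifier for two classes: «similar» and «not similar».
Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in class «similar».
Define a function $d(m_i, m_j) = 1 - p(m_i, m_j)$.
Testing whether d is a distance function, as the share of pairwise comparable triples that satisfy the triangle inequality:
$$\chi(d) = \frac{\left|\{(m_i, m_j, m_k) : d(m_i, m_j) + d(m_j, m_k) \ge d(m_i, m_k)\}\right|}{\left|\{(m_i, m_j, m_k) : m_i, m_j, m_k \text{ pairwise comparable}\}\right|}$$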
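The last two steps, $d = 1 - p$ and the triangle-inequality check $\chi(d)$, can be sketched as below. Two points are assumptions rather than statements from the slides: triples are counted as ordered triples of distinct modules, and d is treated as symmetric and stored only for comparable pairs. The probabilities are made up.

```python
from itertools import permutations

def chi(d, modules):
    """Share of pairwise comparable ordered triples satisfying d(i,j) + d(j,k) >= d(i,k)."""
    def dist(a, b):
        # d is stored once per unordered pair; None means "not comparable".
        return d.get(frozenset((a, b)))

    total = ok = 0
    for mi, mj, mk in permutations(modules, 3):
        dij, djk, dik = dist(mi, mj), dist(mj, mk), dist(mi, mk)
        if dij is None or djk is None or dik is None:
            continue  # skip triples that are not pairwise comparable
        total += 1
        if dij + djk >= dik:
            ok += 1
    return ok / total if total else None

# Made-up classifier probabilities for three comparable pairs.
p = {frozenset(("a", "b")): 0.9,
     frozenset(("b", "c")): 0.9,
     frozenset(("a", "c")): 0.1}
d = {pair: 1.0 - prob for pair, prob in p.items()}

# Module "e" has no comparable partner, so it contributes no triples.
print(chi(d, ["a", "b", "c", "e"]))
```

In this toy case a is close to b and b is close to c while a is far from c, so the two triples that pass through b violate the inequality and the other four satisfy it.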

11 Application
Modules M: the set of all disciplines of all educational programs of the HSE in the period from 2017 to 2020. |M| = 54652.
Tokens T: the set of all students.
Outcomes O: the set of grades, O = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
Digital mark: $\sigma(m, t)$ is the grade of student t for discipline m. Over 2.5 million grades.
Digital trace: $\tau(m)$ is the set of all grades for discipline m.
Disciplines m1 and m2 are comparable if at least 10 students passed exams in both disciplines. Over 950,000 pairs of comparable disciplines.

12 Application
Disciplines m1 and m2 are similar if:
1. they had the same name;
2. they were implemented in the same program, in the same academic year and in the same calendar year.
Example:
Bachelor's Program: Applied Mathematics and Information Science
Disciplines:
m1 = Calculus, fall semester, first-year students, 2020
m2 = Calculus, spring semester, first-year students, 2020
Result:
Positive-undefined (PU) classification: ROC-AUC = 0.81
$\chi(d) = 0.98$

13 Application

| № | Discipline | Module |
|---|---|---|
| 1 | Algebra | 4 |
| 2 | Algorithms and data structures | 2 |
| 3 | Algorithms and data structures | 4 |
| 4 | English | 2 |
| 5 | Safe Living Basics | 1 |
| 6 | English Language Integrative Exam | 4 |
| 7 | Discrete Mathematics | 2 |
| 8 | Discrete Mathematics | 3 |
| 9 | History | 4 |
| 10 | Linear Algebra and Geometry | 2 |
| 11 | Linear Algebra and Geometry | 4 |
| 12 | Calculus | 2 |
| 13 | Calculus | 4 |
| 14 | Introduction to Programming | 1 |
| 15 | Introduction to Programming | 3 |
| 16 | Economics | 2 |

Figure: 2D discipline representation

14 Thank You! Follow TMPA on Facebook TMPA-2021 Conference
