TMPA-2021: An approach to modules similarity definition based on the system log analysis

Slide 1

Slide 1 text

25-27 November
Software Testing, Machine Learning and Complex Process Analysis

An approach to modules similarity definition based on the system log analysis

Ilya Samonenko, HSE
Tamara Voznesenskaya, HSE
Rostislav Yavorskiy, TPU

Slide 2

Slide 2 text

Distributed modular system

[Diagram: Module 1, Module 2, Module 3, …, Module n; Token 1, Token 2, …, Token r; a pair (Module i, Token j) is mapped to one of Outcome 1, Outcome 2, …, Outcome k]

Modules: $M = \{m_1, \dots, m_n\}$
Tokens: $T = \{t_1, \dots, t_r\}$
Outcomes: $O = \{o_1, \dots, o_k\}$

Digital mark: $\sigma : M \times T \to O \cup \{\mathit{undef}\}$

If the token $t$ is missing in the module $m$, then $\sigma(m, t) = \mathit{undef}$.

Digital trace of module $m$: $\tau(m) = (\sigma(m, t_1), \sigma(m, t_2), \dots, \sigma(m, t_r))$

Log of distributed modular system $S$: $\Lambda(S) = (\tau(m_1), \dots, \tau(m_n))$
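The definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: `None` stands in for undef, the marks are stored in a dictionary keyed by (module, token), and the names `digital_trace` and `system_log` are hypothetical.

```python
UNDEF = None  # assumption: None plays the role of the slide's "undef"

def digital_trace(marks, module, tokens):
    """tau(m): the mark sigma(m, t) for every token t, UNDEF where t is missing in m."""
    return tuple(marks.get((module, t), UNDEF) for t in tokens)

def system_log(marks, modules, tokens):
    """Lambda(S): the digital trace of every module, keyed by module name."""
    return {m: digital_trace(marks, m, tokens) for m in modules}

# Toy data, invented for illustration only.
tokens = ["t1", "t2", "t3"]
modules = ["m1", "m2"]
marks = {("m1", "t1"): 1, ("m1", "t3"): 2, ("m2", "t2"): 3}

log = system_log(marks, modules, tokens)
```

Storing only the defined marks keeps the representation sparse, which matters when most (module, token) pairs are undefined.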

Slide 3

Slide 3 text

Comparable modules

Let $s$ be a fixed natural number. Two modules $m_1$ and $m_2$ are comparable if there exist at least $s$ tokens $t$ for which both $\sigma(m_1, t)$ and $\sigma(m_2, t)$ are defined.

Example: let $s = 3$ and $O = \{0, 1, 2, 3, \dots, k\}$.

$\tau(m_1) = (1, 3, \mathit{undef}, \mathit{undef}, 1, \mathit{undef}, 2, \mathit{undef}, 2)$
$\tau(m_2) = (1, \mathit{undef}, 3, \mathit{undef}, 3, \mathit{undef}, 2, \mathit{undef}, \mathit{undef})$

Both marks are defined at positions 1, 5 and 7, so $m_1$ and $m_2$ are comparable.

From here on we consider only comparable pairs of modules.
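A minimal sketch of the comparability check on the example traces, assuming `None` encodes undef; the helper names `common_positions` and `comparable` are hypothetical.

```python
UNDEF = None  # assumption: None encodes "undef"

def common_positions(trace_a, trace_b):
    """Zero-based indices of tokens for which both marks are defined."""
    return [i for i, (a, b) in enumerate(zip(trace_a, trace_b))
            if a is not UNDEF and b is not UNDEF]

def comparable(trace_a, trace_b, s):
    """True if the two modules share at least s tokens with defined marks."""
    return len(common_positions(trace_a, trace_b)) >= s

# Traces from the slide's example.
tau_m1 = (1, 3, UNDEF, UNDEF, 1, UNDEF, 2, UNDEF, 2)
tau_m2 = (1, UNDEF, 3, UNDEF, 3, UNDEF, 2, UNDEF, UNDEF)

shared = common_positions(tau_m1, tau_m2)
```

With s = 3 the pair is comparable; raising the threshold to s = 4 would exclude it.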

Slide 4

Slide 4 text

Modules similarity

Suppose we know that some comparable pairs $(m_1, m_2)$ are similar (in some sense). Let $\pi \subset M \times M$ be the set of all comparable pairs $(m_1, m_2)$ that are known to be similar.

Remarks:
1. If $(m_1, m_2) \notin \pi$, then we know nothing about $m_1$ and $m_2$: they may be similar or not.
2. $|\pi|$ may be much smaller than $|M \times M|$.

Our goal is to define a distance function $d$ on $M$ that respects similarity: for all $m_1, m_2, m_3, m_4 \in M$, if $(m_1, m_2) \in \pi$ and $(m_3, m_4) \notin \pi$, then $d(m_1, m_2) \le d(m_3, m_4)$.

If no such function exists, the goal is to find the best approximation.

Why? To detect abnormal behavior. We can cluster the modules using the distance function $d$. Suppose we know that:
- the red modules are similar, so they are close to each other;
- the blue modules are similar, so they are close to each other.

If one blue module suddenly turns up far away from its cluster, that is a reason to check what is happening to it.
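The "respects similarity" condition is equivalent to requiring that the largest distance inside π does not exceed the smallest distance outside π. The sketch below checks this for a candidate distance; the function name, the dictionary layout and the toy numbers are assumptions made for illustration.

```python
def respects_similarity(d, similar_pairs):
    """Check d(p) <= d(q) for every pair p in pi and every pair q not in pi.

    d: dict mapping a pair of comparable modules to its distance.
    similar_pairs: the set pi of pairs known to be similar.
    """
    inside = [dist for pair, dist in d.items() if pair in similar_pairs]
    outside = [dist for pair, dist in d.items() if pair not in similar_pairs]
    if not inside or not outside:
        return True  # the condition holds vacuously
    return max(inside) <= min(outside)

# Toy distances, invented for illustration.
pi = {("m1", "m3")}
d_good = {("m1", "m3"): 0.1, ("m1", "m2"): 0.9, ("m1", "m6"): 0.6}
d_bad = {("m1", "m3"): 0.1, ("m1", "m2"): 0.9, ("m1", "m6"): 0.05}
```

Because pairs outside π may in fact be similar, a strict check like this will often fail on real data, which is why the slide falls back to a best approximation.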

Slide 5

Slide 5 text

Standard distance and correlation functions of digital traces

$\tau(m_1) = (1, 3, \mathit{undef}, \mathit{undef}, 1, \mathit{undef}, 2, \mathit{undef}, 2)$
$\tau(m_2) = (1, \mathit{undef}, 3, \mathit{undef}, 3, \mathit{undef}, 2, \mathit{undef}, \mathit{undef})$

$L_1(m_1, m_2) = \frac{1}{3}\left(|1 - 1| + |1 - 3| + |2 - 2|\right) = \frac{2}{3}$

$\tau(m_3) = (1, 3, 5, \mathit{undef}, 1, \mathit{undef}, 2, \mathit{undef}, 2)$
$\tau(m_2) = (1, \mathit{undef}, 3, \mathit{undef}, 3, \mathit{undef}, 2, \mathit{undef}, \mathit{undef})$

$L_1(m_2, m_3) = \frac{1}{4}\left(|1 - 1| + |3 - 5| + |3 - 1| + |2 - 2|\right) = 1$

We will build a new distance function from standard distance and correlation functions, computed on the digital marks of the tokens that the two modules share. These functions should be normalized, and their maximum and minimum values should not depend on the number of shared tokens.
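The two worked values can be reproduced with a short function that averages absolute differences over the shared tokens only. This is a sketch, assuming `None` encodes undef; the name `normalized_l1` is hypothetical.

```python
UNDEF = None  # assumption: None encodes "undef"

def normalized_l1(trace_a, trace_b):
    """Mean absolute difference over tokens where both marks are defined."""
    shared = [(a, b) for a, b in zip(trace_a, trace_b)
              if a is not UNDEF and b is not UNDEF]
    if not shared:
        raise ValueError("the modules share no tokens")
    return sum(abs(a - b) for a, b in shared) / len(shared)

# Traces from the slide.
tau_m1 = (1, 3, UNDEF, UNDEF, 1, UNDEF, 2, UNDEF, 2)
tau_m2 = (1, UNDEF, 3, UNDEF, 3, UNDEF, 2, UNDEF, UNDEF)
tau_m3 = (1, 3, 5, UNDEF, 1, UNDEF, 2, UNDEF, 2)

l1_12 = normalized_l1(tau_m1, tau_m2)  # three shared tokens
l1_23 = normalized_l1(tau_m2, tau_m3)  # four shared tokens
```

Dividing by the number of shared tokens is what keeps the value from growing with the overlap size.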

Slide 6

Slide 6 text

Approach

M     M     L1    …     LK    similar   prob   d
m1    m2    4     …     3     ?
m1    m3    2     …     3     True
m1    m5    1     …     2     ?
m1    m6    3     …     4     ?
m1    m10   3     …     3     True
m2    m3    5     …     5     ?
m2    m5    5     …     5     ?
…     …     …     …     …     …
m7    m10   2     …     2     ?
m9    m10   4     …     2     True

Slide 7

Slide 7 text

Approach

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: "similar" or "not similar".

Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".

Slide 8

Slide 8 text

Approach

M     M     L1    …     LK    similar   prob   d
m1    m2    4     …     3     False     0.1
m1    m3    2     …     3     True      0.9
m1    m5    1     …     2     True      0.8
m1    m6    3     …     4     False     0.4
m1    m10   3     …     3     True      0.9
m2    m3    5     …     1     False     0.3
m2    m5    5     …     5     True      0.5
…     …     …     …     …     …         …
m7    m10   2     …     2     True      0.7
m9    m10   4     …     2     True      0.8

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: "similar" or "not similar".

Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".
Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in class "similar".

Slide 9

Slide 9 text

Approach

M     M     L1    …     LK    similar   prob   d
m1    m2    4     …     3     False     0.1    0.9
m1    m3    2     …     3     True      0.9    0.1
m1    m5    1     …     2     True      0.8    0.2
m1    m6    3     …     4     False     0.4    0.6
m1    m10   3     …     3     True      0.9    0.1
m2    m3    5     …     1     False     0.3    0.7
m2    m5    5     …     5     True      0.5    0.5
…     …     …     …     …     …         …      …
m7    m10   2     …     2     True      0.7    0.3
m9    m10   4     …     2     True      0.8    0.2

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: "similar" or "not similar".

Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".
Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in class "similar".
Define the function $d(m_i, m_j) = 1 - p(m_i, m_j)$.
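The slides do not say which PU method is used, so the following is only one possible sketch of the pipeline: a logistic regression trained to separate known-similar pairs from unknown ones, rescaled in the style of Elkan and Noto (which assumes the known-similar pairs are a random sample of all similar pairs), followed by d = 1 - p. The feature rows are the illustrative (L1, LK) values from the first version of the table; the optimizer settings and all function names are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, s, lr=0.05, epochs=2000):
    """Fit weights (bias first) for P(labeled | x) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, s):
            row = [1.0] + list(x)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row))) - label
            for j, xj in enumerate(row):
                grad[j] += err * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    """P(labeled | x) under the fitted weights."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))

# (L1, LK) per pair; 1 = known similar ("True"), 0 = unknown ("?").
X = [(4, 3), (2, 3), (1, 2), (3, 4), (3, 3), (5, 5), (5, 5), (2, 2), (4, 2)]
s = [0, 1, 0, 0, 1, 0, 0, 0, 1]

w = fit_logistic(X, s)
g = [predict(w, x) for x in X]                      # P(labeled | x)
c = sum(gi for gi, si in zip(g, s) if si) / sum(s)  # estimate of P(labeled | similar)
p = [min(1.0, gi / c) for gi in g]                  # P(similar | x), clipped to 1
d = [1.0 - pi for pi in p]                          # distance d = 1 - p
```

In practice the constant c would be estimated on held-out labeled pairs rather than on the training set, and a library classifier would replace the hand-written gradient descent.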

Slide 10

Slide 10 text

Approach

Objects: pairs of comparable modules.
Features: values of standard distance and correlation functions.
Target variable: "similar" or "not similar".

Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".
Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in class "similar".
Define the function $d(m_i, m_j) = 1 - p(m_i, m_j)$.

Testing whether $d(m_i, m_j)$ is a distance function:

$\chi(d) = \dfrac{\left|\{(m_i, m_j, m_k) : d(m_i, m_j) + d(m_j, m_k) \ge d(m_i, m_k)\}\right|}{\left|\{(m_i, m_j, m_k) : m_i, m_j, m_k \text{ pairwise comparable}\}\right|}$
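A minimal sketch of how χ(d) could be computed: the share of pairwise comparable triples for which the triangle inequality holds. It assumes that d is symmetric, that triples are ordered and consist of three distinct modules, and that a pair missing from the dictionary is not comparable; none of these details is stated on the slide.

```python
from itertools import permutations

def chi(d, modules):
    """Fraction of pairwise comparable ordered triples (i, j, k)
    with d(i, j) + d(j, k) >= d(i, k); None if there are no such triples."""
    def dist(a, b):
        # Symmetric lookup: the pair may be stored in either order.
        return d.get((a, b), d.get((b, a)))

    total = satisfied = 0
    for i, j, k in permutations(modules, 3):
        dij, djk, dik = dist(i, j), dist(j, k), dist(i, k)
        if dij is None or djk is None or dik is None:
            continue  # the triple is not pairwise comparable
        total += 1
        if dij + djk >= dik:
            satisfied += 1
    return satisfied / total if total else None

# Toy distances, invented for illustration.
modules = ["a", "b", "c"]
d_violating = {("a", "b"): 0.1, ("b", "c"): 0.1, ("a", "c"): 0.9}
d_metric = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
```

In the violating example only the two orderings that route through b fail, so four of the six ordered triples satisfy the inequality.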

Slide 11

Slide 11 text

Application

Modules $M$: the set of all disciplines of all educational programs of the HSE in the period from 2017 to 2020. $|M| = 54652$.
Tokens $T$: the set of all students.
Outcomes $O$: the set of grades, $\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$.
Digital mark: $\sigma(m, t)$, the grade of student $t$ for discipline $m$. Over 2.5 million grades.
Digital trace: $\tau(m)$, all grades for discipline $m$.

Disciplines $m_1$ and $m_2$ are comparable if at least 10 of the same students passed exams in both disciplines. Over 950,000 pairs of comparable disciplines.
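With tens of thousands of disciplines, comparing every pair of traces directly is wasteful, since most pairs share no students. One way to find the comparable pairs is to group grades by student and count co-occurrences, sketched below. The record layout (student, discipline, grade) and the function name are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations

def comparable_pairs(records, s=10):
    """Pairs of disciplines graded for at least s common students.

    records: iterable of (student, discipline, grade) tuples.
    """
    by_student = defaultdict(set)
    for student, discipline, _grade in records:
        by_student[student].add(discipline)

    shared = Counter()
    for disciplines in by_student.values():
        # Each student contributes one shared token to every pair they took.
        for pair in combinations(sorted(disciplines), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= s}

# Toy records, invented for illustration.
records = [
    ("s1", "Algebra", 7), ("s1", "Calculus", 8),
    ("s2", "Algebra", 5), ("s2", "Calculus", 6),
    ("s3", "Algebra", 9), ("s3", "Calculus", 9), ("s3", "History", 10),
]
pairs = comparable_pairs(records, s=3)
```

The slide's threshold of 10 is the default; the toy call lowers it to 3 so that the small example produces a result.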

Slide 12

Slide 12 text

Application

Disciplines $m_1$ and $m_2$ are similar if:
1. they had the same name;
2. they were implemented in the same program, in the same academic year and in the same calendar year.

Example:
Bachelor's Program: Applied Mathematics and Information Science
Disciplines:
$m_1$ = Calculus, fall semester, first-year students, 2020
$m_2$ = Calculus, spring semester, first-year students, 2020

Result:
Positive-Undefined (PU) classification: ROC-AUC = 0.81
$\chi(d) = 0.98$
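The labeling rule for the known-similar set π can be written as a simple predicate. This sketch assumes that both listed conditions must hold together and that "academic year" means the students' year of study, as the example suggests; the `Discipline` record and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Discipline:
    name: str
    program: str
    academic_year: int  # assumption: year of study, e.g. 1 for first-year students
    calendar_year: int
    semester: str       # not part of the rule; it distinguishes the two modules

def known_similar(a, b):
    """True if the pair satisfies the slide's rule for membership in pi."""
    return (a.name == b.name
            and a.program == b.program
            and a.academic_year == b.academic_year
            and a.calendar_year == b.calendar_year)

program = "Applied Mathematics and Information Science"
m1 = Discipline("Calculus", program, 1, 2020, "fall")
m2 = Discipline("Calculus", program, 1, 2020, "spring")
m3 = Discipline("Algebra", program, 1, 2020, "fall")
```

Pairs that fail the predicate are not labeled "not similar"; they stay undefined, which is what makes this a PU problem.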

Slide 13

Slide 13 text

Application

№    Discipline                          Module
1    Algebra                             4
2    Algorithms and data structures      2
3    Algorithms and data structures      4
4    English                             2
5    Safe Living Basics                  1
6    English Language Integrative Exam   4
7    Discrete Mathematics                2
8    Discrete Mathematics                3
9    History                             4
10   Linear Algebra and Geometry         2
11   Linear Algebra and Geometry         4
12   Calculus                            2
13   Calculus                            4
14   Introduction to Programming         1
15   Introduction to Programming         3
16   Economics                           2

[Figure: 2D discipline representation]

Slide 14

Slide 14 text

Thank You!

TMPA-2021 Conference