Exactpro
November 26, 2021

TMPA-2021: An approach to modules similarity definition based on the system log analysis

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro

Transcript

  1. 25-27 November. Software Testing, Machine Learning and Complex Process Analysis

    An approach to modules similarity definition based on the system log analysis
    Ilya Samonenko, HSE; Tamara Voznesenskaya, HSE; Rostislav Yavorskiy, TPU
  2. Distributed modular system

    Diagram labels: Module 1, Module 2, Module 3, …, Module n; Token 1, Token 2, …, Token r; Module i, Token j; Outcome 1, Outcome 2, …, Outcome k.
    Tokens: $T = \{t_1, \dots, t_r\}$
    Modules: $M = \{m_1, \dots, m_n\}$
    Outcomes: $O = \{o_1, \dots, o_k\}$
    Digital mark: $\sigma : M \times T \to O \cup \{\mathit{undef}\}$
    If the token $t$ is missing in the module $m$, then $\sigma(m, t) = \mathit{undef}$.
    Digital trace of module $m$: $\tau(m) = (\sigma(m, t_1), \sigma(m, t_2), \dots, \sigma(m, t_r))$
    Log of distributed modular system $S$: $\Lambda(S) = (\tau(m_1), \dots, \tau(m_n))$
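The definitions on this slide map onto a small data structure. What follows is a minimal sketch, assuming that σ is stored as a dictionary keyed by (module, token) pairs and that Python's `None` stands in for undef. The names `digital_trace` and `system_log` and the toy data are illustrative and do not come from the talk.

```python
UNDEF = None  # stands in for the slide's "undef" (assumption)


def digital_trace(sigma, module, tokens):
    """tau(m): the marks of one module over all tokens, in a fixed token order."""
    return tuple(sigma.get((module, t), UNDEF) for t in tokens)


def system_log(sigma, modules, tokens):
    """Lambda(S): the traces of all modules, in a fixed module order."""
    return tuple(digital_trace(sigma, m, tokens) for m in modules)


# Hypothetical toy system: two modules, three tokens.
tokens = ["t1", "t2", "t3"]
modules = ["m1", "m2"]
sigma = {("m1", "t1"): 1, ("m1", "t3"): 2, ("m2", "t2"): 3}

log = system_log(sigma, modules, tokens)
print(log)
```

Storing only the defined marks keeps the mapping sparse, which matters when most (module, token) pairs are undefined.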
  3. Comparable modules

    Let $s$ be a fixed natural number. Two modules $m_1$ and $m_2$ are comparable if there exist at least $s$ tokens $t$ for which both $\sigma(m_1, t)$ and $\sigma(m_2, t)$ are defined.
    Example: let $s = 3$ and $O = \{0, 1, 2, 3, \dots, k\}$.
    $\tau(m_1) = (1, 3, \mathit{undef}, \mathit{undef}, 1, \mathit{undef}, 2, \mathit{undef}, 2)$
    $\tau(m_2) = (1, \mathit{undef}, 3, \mathit{undef}, 3, \mathit{undef}, 2, \mathit{undef}, \mathit{undef})$
    Both traces are defined at positions 1, 5 and 7, so $m_1$ and $m_2$ are comparable.
    From here on, we consider only comparable pairs of modules.
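The comparability test can be checked mechanically on the example traces. This is a small sketch, assuming traces are tuples and `None` stands in for undef; the function names are illustrative.

```python
UNDEF = None  # stands in for the slide's "undef" (assumption)


def shared_positions(trace_a, trace_b):
    """Zero-based positions at which both traces have a defined mark."""
    return [
        i
        for i, (a, b) in enumerate(zip(trace_a, trace_b))
        if a is not UNDEF and b is not UNDEF
    ]


def comparable(trace_a, trace_b, s):
    """True if at least s tokens are defined in both traces."""
    return len(shared_positions(trace_a, trace_b)) >= s


# Traces from the slide's example.
tau_m1 = (1, 3, UNDEF, UNDEF, 1, UNDEF, 2, UNDEF, 2)
tau_m2 = (1, UNDEF, 3, UNDEF, 3, UNDEF, 2, UNDEF, UNDEF)

print(shared_positions(tau_m1, tau_m2))
print(comparable(tau_m1, tau_m2, s=3))
```

With s = 3 the pair passes; raising the threshold to 4 would exclude it, because only three tokens are shared.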
  4. Modules similarity

    Suppose we know that some comparable pairs $(m_1, m_2)$ are similar (in some sense). Let $\pi \subset M \times M$ be the set of all comparable pairs $(m_1, m_2)$ that are known to be similar.
    Remarks:
    1. If $(m_1, m_2) \notin \pi$, we know nothing about $m_1$ and $m_2$: they may or may not be similar.
    2. $|\pi|$ may be much smaller than $|M \times M|$.
    Our goal is to define a distance function $d$ on $M$ that respects similarity: for all $m_1, m_2, m_3, m_4 \in M$, if $(m_1, m_2) \in \pi$ and $(m_3, m_4) \notin \pi$, then $d(m_1, m_2) \le d(m_3, m_4)$.
    If no such function exists, the goal is to find the best approximation.
    Why? To detect abnormal behavior. We can cluster modules using the distance function $d$. Suppose we know that:
    - the red modules are similar, so they are close to each other;
    - the blue modules are similar, so they are close to each other.
    If one blue module suddenly turns out to be far from its cluster, that is a reason to check what is happening to it.
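The "respects similarity" condition can be tested directly once distances for comparable pairs are available. The sketch below assumes `d` is a dictionary from comparable pairs to distances and `pi` is a set of pairs. The `ordering_score` function is only one possible way to quantify an approximation (the share of correctly ordered pair combinations); the talk does not say how the best approximation is measured. All values are hypothetical.

```python
from itertools import product


def _split(d, pi):
    """Distances of known-similar pairs and of all other comparable pairs."""
    known = [v for pair, v in d.items() if pair in pi]
    other = [v for pair, v in d.items() if pair not in pi]
    return known, other


def respects_similarity(d, pi):
    """True if every pair in pi is at most as far apart as every pair outside pi."""
    known, other = _split(d, pi)
    if not known or not other:
        return True
    return max(known) <= min(other)


def ordering_score(d, pi):
    """Share of (known, other) combinations ordered as the condition requires."""
    known, other = _split(d, pi)
    combos = list(product(known, other))
    if not combos:
        return 1.0
    return sum(a <= b for a, b in combos) / len(combos)


# Hypothetical distances over four comparable pairs.
pi = {("m1", "m3"), ("m1", "m10")}
d_good = {("m1", "m3"): 0.1, ("m1", "m10"): 0.1, ("m1", "m2"): 0.9, ("m1", "m6"): 0.6}
d_bad = {("m1", "m3"): 0.7, ("m1", "m10"): 0.1, ("m1", "m2"): 0.9, ("m1", "m6"): 0.6}

print(respects_similarity(d_good, pi), respects_similarity(d_bad, pi))
print(ordering_score(d_bad, pi))
```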
  5. Standard distance and correlation functions of digital traces

    $\tau(m_1) = (1, 3, \mathit{undef}, \mathit{undef}, 1, \mathit{undef}, 2, \mathit{undef}, 2)$
    $\tau(m_2) = (1, \mathit{undef}, 3, \mathit{undef}, 3, \mathit{undef}, 2, \mathit{undef}, \mathit{undef})$
    $L_1(m_1, m_2) = \frac{1}{3}\,(|1 - 1| + |1 - 3| + |2 - 2|) = \frac{2}{3}$
    We will build a new distance function based on standard distance and correlation functions, using the digital marks of the tokens that are defined in both modules. The functions should be normalized, and their maximum and minimum values should not depend on the number of shared tokens.
    $\tau(m_3) = (1, 3, 5, \mathit{undef}, 1, \mathit{undef}, 2, \mathit{undef}, 2)$
    $\tau(m_2) = (1, \mathit{undef}, 3, \mathit{undef}, 3, \mathit{undef}, 2, \mathit{undef}, \mathit{undef})$
    $L_1(m_2, m_3) = \frac{1}{4}\,(|1 - 1| + |3 - 5| + |1 - 3| + |2 - 2|) = 1$
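The two worked values on this slide can be reproduced with a few lines of code. This sketch assumes traces are tuples with `None` for undef and uses exact fractions so the results match the slide's 2/3 and 1; the function name is illustrative.

```python
from fractions import Fraction

UNDEF = None  # stands in for the slide's "undef" (assumption)


def normalized_l1(trace_a, trace_b):
    """Mean absolute difference over the tokens defined in both traces."""
    shared = [
        (a, b)
        for a, b in zip(trace_a, trace_b)
        if a is not UNDEF and b is not UNDEF
    ]
    if not shared:
        raise ValueError("traces share no defined tokens")
    return Fraction(sum(abs(a - b) for a, b in shared), len(shared))


# Traces from the slide.
tau_m1 = (1, 3, UNDEF, UNDEF, 1, UNDEF, 2, UNDEF, 2)
tau_m2 = (1, UNDEF, 3, UNDEF, 3, UNDEF, 2, UNDEF, UNDEF)
tau_m3 = (1, 3, 5, UNDEF, 1, UNDEF, 2, UNDEF, 2)

print(normalized_l1(tau_m1, tau_m2))  # 2/3
print(normalized_l1(tau_m2, tau_m3))  # 1
```

Dividing by the number of shared tokens is what keeps the range of the function independent of how many tokens two modules have in common.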
  6. Approach

    M    M     L1   …   LK   similar   prob   d
    m1   m2    4    …   3    ?
    m1   m3    2    …   3    True
    m1   m5    1    …   2    ?
    m1   m6    3    …   4    ?
    m1   m10   3    …   3    True
    m2   m3    5    …   5    ?
    m2   m5    5    …   5    ?
    …    …     …    …   …    …
    m7   m10   2    …   2    ?
    m9   m10   4    …   2    True
  7. Approach

    Objects: pairs of comparable modules.
    Features: values of standard distance and correlation functions.
    Target variable: "similar" or "not similar".
    Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".

    M    M     L1   …   LK   similar   prob   d
    m1   m2    4    …   3    ?
    m1   m3    2    …   3    True
    m1   m5    1    …   2    ?
    m1   m6    3    …   4    ?
    m1   m10   3    …   3    True
    m2   m3    5    …   5    ?
    m2   m5    5    …   5    ?
    …    …     …    …   …    …
    m7   m10   2    …   2    ?
    m9   m10   4    …   2    True
  8. Approach

    Objects: pairs of comparable modules.
    Features: values of standard distance and correlation functions.
    Target variable: "similar" or "not similar".
    Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".
    Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in the class "similar".

    M    M     L1   …   LK   similar   prob   d
    m1   m2    4    …   3    False     0.1
    m1   m3    2    …   3    True      0.9
    m1   m5    1    …   2    True      0.8
    m1   m6    3    …   4    False     0.4
    m1   m10   3    …   3    True      0.9
    m2   m3    5    …   1    False     0.3
    m2   m5    5    …   5    True      0.5
    …    …     …    …   …    …         …
    m7   m10   2    …   2    True      0.7
    m9   m10   4    …   2    True      0.8
  9. Approach

    Objects: pairs of comparable modules.
    Features: values of standard distance and correlation functions.
    Target variable: "similar" or "not similar".
    Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".
    Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in the class "similar".
    Define the function $d(m_i, m_j) = 1 - p(m_i, m_j)$.

    M    M     L1   …   LK   similar   prob   d
    m1   m2    4    …   3    False     0.1    0.9
    m1   m3    2    …   3    True      0.9    0.1
    m1   m5    1    …   2    True      0.8    0.2
    m1   m6    3    …   4    False     0.4    0.6
    m1   m10   3    …   3    True      0.9    0.1
    m2   m3    5    …   1    False     0.3    0.7
    m2   m5    5    …   5    True      0.5    0.5
    …    …     …    …   …    …         …      …
    m7   m10   2    …   2    True      0.7    0.3
    m9   m10   4    …   2    True      0.8    0.2
  10. Approach

    Objects: pairs of comparable modules.
    Features: values of standard distance and correlation functions.
    Target variable: "similar" or "not similar".
    Train a positive-undefined (PU) classifier for two classes: "similar" and "not similar".
    Calculate the probability $p(m_i, m_j)$ that $(m_i, m_j)$ is in the class "similar".
    Define the function $d(m_i, m_j) = 1 - p(m_i, m_j)$.
    Test how well $d$ behaves as a distance function, using the share of pairwise comparable triples that satisfy the triangle inequality:
    $\chi(d) = \dfrac{|\{(m_i, m_j, m_k) : d(m_i, m_j) + d(m_j, m_k) \ge d(m_i, m_k)\}|}{|\{(m_i, m_j, m_k) : m_i, m_j, m_k \text{ pairwise comparable}\}|}$

    M    M     L1   …   LK   similar   prob   d
    m1   m2    4    …   3    False     0.1    0.9
    m1   m3    2    …   3    True      0.9    0.1
    m1   m5    1    …   2    True      0.8    0.2
    m1   m6    3    …   4    False     0.4    0.6
    m1   m10   3    …   3    True      0.9    0.1
    m2   m3    5    …   1    False     0.3    0.7
    m2   m5    5    …   5    True      0.5    0.5
    …    …     …    …   …    …         …      …
    m7   m10   2    …   2    True      0.7    0.3
    m9   m10   4    …   2    True      0.8    0.2
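The last two steps, turning probabilities into distances and computing χ(d), can be sketched as follows. The code assumes that d is symmetric, that it is stored as a dictionary over comparable pairs, and that a pair absent from the dictionary is not comparable. It counts ordered triples, as the formula does. The module names and distance values are hypothetical.

```python
from itertools import permutations


def distance_from_probability(p):
    """d(mi, mj) = 1 - p(mi, mj) for every scored pair."""
    return {pair: 1.0 - prob for pair, prob in p.items()}


def chi(d, modules):
    """Share of ordered, pairwise comparable triples satisfying
    d(mi, mj) + d(mj, mk) >= d(mi, mk)."""

    def dist(a, b):
        # Symmetric lookup; None means the pair is not comparable.
        return d.get((a, b), d.get((b, a)))

    total = 0
    satisfied = 0
    for mi, mj, mk in permutations(modules, 3):
        d_ij, d_jk, d_ik = dist(mi, mj), dist(mj, mk), dist(mi, mk)
        if d_ij is None or d_jk is None or d_ik is None:
            continue  # skip triples that are not pairwise comparable
        total += 1
        if d_ij + d_jk >= d_ik:
            satisfied += 1
    return satisfied / total if total else None


# Hypothetical distances: "x" is comparable only with "a", so triples with "x" are skipped.
modules = ["a", "b", "c", "x"]
d = {("a", "b"): 0.1, ("b", "c"): 0.1, ("a", "c"): 0.9, ("a", "x"): 0.5}

print(chi(d, modules))
```

In this toy case the triangle inequality fails only when the path goes through "b" between "a" and "c", which is two of the six ordered triples.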
  11. Application

    Modules M: the set of all disciplines of all educational programs of the HSE in the period from 2017 to 2020. $|M| = 54652$.
    Tokens T: the set of all students.
    Outcomes O: the set of grades, $\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$.
    Digital mark: $\sigma(m, t)$ is the grade of student $t$ for discipline $m$. Over 2.5 million grades.
    Digital trace: $\tau(m)$ is the set of all grades for discipline $m$.
    Disciplines $m_1$ and $m_2$ are comparable if at least 10 of the same students passed exams in both disciplines. Over 950,000 pairs of comparable disciplines.
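Finding comparable disciplines amounts to counting shared students per pair of disciplines. The sketch below assumes the grades are available as (student, discipline, grade) records, which is an assumption about the data layout rather than something stated in the talk. The records are made up, and the threshold is lowered from 10 to 2 to keep the example small.

```python
from collections import Counter, defaultdict
from itertools import combinations


def comparable_pairs(records, min_shared=10):
    """Pairs of disciplines that share at least min_shared graded students."""
    by_student = defaultdict(set)
    for student, discipline, _grade in records:
        by_student[student].add(discipline)

    shared = Counter()
    for disciplines in by_student.values():
        # Sorting gives each unordered pair one canonical form.
        for pair in combinations(sorted(disciplines), 2):
            shared[pair] += 1

    return {pair for pair, count in shared.items() if count >= min_shared}


# Hypothetical grade records.
records = [
    ("s1", "Calculus", 8), ("s1", "Algebra", 7),
    ("s2", "Calculus", 6), ("s2", "Algebra", 9), ("s2", "History", 5),
    ("s3", "History", 10), ("s3", "Calculus", 4),
]

print(sorted(comparable_pairs(records, min_shared=2)))
```

Iterating per student rather than per pair of disciplines avoids touching pairs that share no students at all.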
  12. Application

    Disciplines $m_1$ and $m_2$ are similar if:
    1. they have the same name;
    2. they were taught in the same program, in the same academic year and in the same calendar year.
    Example:
    Bachelor's Program: Applied Mathematics and Information Science
    Disciplines:
    m1 = Calculus, fall semester, first-year students, 2020
    m2 = Calculus, spring semester, first-year students, 2020
    Result:
    Positive-Undefined (PU) classification: ROC-AUC = 0.81
    $\chi(d) = 0.98$
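The labeling rule for known-similar pairs can be written as a predicate over discipline records. This is a sketch under two assumptions: both conditions must hold together, and "academic year" means the students' year of study, as the example suggests. The `Discipline` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discipline:
    # Hypothetical record layout; field names are not from the talk.
    name: str
    program: str
    study_year: int      # "academic year", read here as year of study
    calendar_year: int
    semester: str


def known_similar(a, b):
    """True if two distinct disciplines match on name, program and both years."""
    if a == b:
        return False
    key_a = (a.name, a.program, a.study_year, a.calendar_year)
    key_b = (b.name, b.program, b.study_year, b.calendar_year)
    return key_a == key_b


program = "Applied Mathematics and Information Science"
m1 = Discipline("Calculus", program, 1, 2020, "fall")
m2 = Discipline("Calculus", program, 1, 2020, "spring")
m3 = Discipline("Algebra", program, 1, 2020, "fall")

print(known_similar(m1, m2), known_similar(m1, m3))
```

The semester is deliberately left out of the comparison key, since the example pairs a fall course with a spring course.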
  13. Application

    №    Discipline                            Module
    1    Algebra                               4
    2    Algorithms and data structures        2
    3    Algorithms and data structures        4
    4    English                               2
    5    Safe Living Basics                    1
    6    English Language Integrative Exam     4
    7    Discrete Mathematics                  2
    8    Discrete Mathematics                  3
    9    History                               4
    10   Linear Algebra and Geometry           2
    11   Linear Algebra and Geometry           4
    12   Calculus                              2
    13   Calculus                              4
    14   Introduction to Programming           1
    15   Introduction to Programming           3
    16   Economics                             2

    Figure: 2D discipline representation