Slide 1

Slide 1 text

AI Testing Talks | Mini-Course on Anomaly Detection for AI Testing
Lecture 3. Machine Learning Approach
6 June | 10.00 GET | 11.30 SLST
Rostislav Yavorski, Head of Research, Exactpro
BUILD SOFTWARE TO TEST SOFTWARE | exactpro.com

Slide 2

Slide 2 text

Terms
An outlier is a data point that differs significantly from other observations.
Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.

Slide 3

Slide 3 text

See Lecture 1 on Applications: https://youtu.be/Mi13lqDVET0
● fraud detection
● health monitoring
● medical diagnosis
● system fault detection
● cyber-security intrusion detection
● improving the performance of machine learning algorithms

Slide 4

Slide 4 text

Challenges in Anomaly Detection
● Defining normal behaviour is extremely challenging
● Noisy data points are not anomalies
● The definition of an anomaly is domain-specific
● Anomalies evolve over time
● Obtaining a set of labeled anomalous instances is difficult

Slide 5

Slide 5 text

What is normal?
● Average characteristics
● Similar to many other observations
● Quite frequent occurrence
● Predictable from the past
● Labeled as normal

Slide 6

Slide 6 text

See Lecture 2 on Statistical Methods: https://youtu.be/7Rz84cp1xQA
● Graphical Methods
● Interquartile Range
● Tukey's Fences
● Seasonal and Trend Decomposition
● Statistical Hypothesis Test
● p-value and t-statistic

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text

Unsupervised learning

Slide 10

Slide 10 text

Unsupervised anomaly detection
The objective is to detect rare objects or events without any prior knowledge:
Step 1. Modelling the normal data distribution
Step 2. Defining a measurement to classify samples as anomalous or normal

Slide 11

Slide 11 text

Unsupervised anomaly detection
The objective is to detect rare objects or events without any prior knowledge:
Step 1. Modelling the normal data distribution
Step 2. Defining a measurement to classify samples as anomalous or normal

Slide 12

Slide 12 text

Unsupervised anomaly detection
The objective is to detect rare objects or events without any prior knowledge:
Step 1. Modelling the normal data distribution
● Clustering (grouping)
● Dimensionality (number of parameters) reduction
Step 2. Defining a measurement to classify samples as anomalous or normal
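As a minimal illustration of these two steps (not part of the slides), the sketch below models the normal data as a multivariate Gaussian and uses the Mahalanobis distance as the anomaly measurement; the dataset is synthetic.

```python
# A minimal sketch of the two steps: fit a "normal" model, then score and threshold.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 3))      # mostly normal points
X = np.vstack([X, [[8.0, 8.0, 8.0]]])                   # one obvious outlier

# Step 1: model the normal data distribution (mean and covariance).
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

# Step 2: define a measurement (Mahalanobis distance) and a threshold.
diff = X - mu
scores = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
threshold = np.percentile(scores, 99)                    # flag the top 1% as anomalous
print("anomalous row indices:", np.where(scores > threshold)[0])
```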

Slide 13

Slide 13 text

Clustering

Slide 14

Slide 14 text

Unsupervised. Clustering
Clustering is the task of dividing the population of data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.

Slide 15

Slide 15 text

Clustering algorithms
● Connectivity models are based on distance connectivity
● Centroid models represent each cluster by a single mean vector
● Density models define clusters as connected dense regions in the data space
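One hedged sketch of a density model in practice: scikit-learn's DBSCAN marks points that fall outside every dense region with the label -1, which can be read as an anomaly flag. The data and parameter values below are illustrative only, not tuned recommendations.

```python
# Density-based clustering (DBSCAN) used for anomaly detection:
# points that do not belong to any dense region get the label -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(100, 2))
outliers = np.array([[2.5, 9.0], [9.0, -3.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("points flagged as noise/anomalies:", np.where(labels == -1)[0])
```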

Slide 16

Slide 16 text

Agglomerative "bottom-up" algorithm
(illustration: chick, duckling, hen, rooster, goose, cat, rabbit, dog)

Slide 17

Slide 17 text

Agglomerative "bottom-up" algorithm
(illustration, next merge step: chick, duckling, hen, rooster, goose, cat, rabbit, dog)

Slide 18

Slide 18 text

(illustration: chick, duckling, hen, rooster, goose, cat, rabbit, dog, fish)

Slide 19

Slide 19 text

(illustration: chick, duckling, hen, rooster, goose, cat, rabbit, dog, fish; the fish is flagged as the anomaly)
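The animal example can be reproduced as a toy bottom-up run with SciPy's hierarchical clustering. The numeric features below (has_feathers, has_legs, lives_in_water, size) are invented purely for illustration; with them the fish typically ends up in a cluster of its own.

```python
# A toy agglomerative ("bottom-up") run on the animal example.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

animals = ["chick", "duckling", "hen", "rooster", "goose",
           "cat", "rabbit", "dog", "fish"]
# Made-up features: [has_feathers, has_legs, lives_in_water, size]
X = np.array([
    [1, 1, 0, 0.1], [1, 1, 0, 0.1], [1, 1, 0, 0.3], [1, 1, 0, 0.3], [1, 1, 0, 0.5],
    [0, 1, 0, 0.5], [0, 1, 0, 0.3], [0, 1, 0, 0.8],
    [0, 0, 1, 0.2],                       # the fish: only legless, aquatic animal
])

Z = linkage(X, method="average")          # successive bottom-up merges
labels = fcluster(Z, t=3, criterion="maxclust")
for name, lab in zip(animals, labels):
    print(f"{name:8s} -> cluster {lab}")
```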

Slide 20

Slide 20 text

Dimensionality Reduction

Slide 21

Slide 21 text

What is dimension?
The dimension of a dataset corresponds to the number of attributes in the dataset. A dataset with a large number of attributes (a hundred or more) is referred to as high-dimensional data.
(illustration: a two-dimensional array)
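A tiny example of the terminology: the number of columns of the data array is the dimension of the dataset.

```python
# 4 observations (rows) x 3 attributes (columns): this toy dataset has dimension 3.
import numpy as np

X = np.array([[1.2, 0.5, 3.1],
              [0.9, 0.7, 2.8],
              [1.1, 0.4, 3.0],
              [1.0, 0.6, 2.9]])
print(X.shape)          # (4, 3) -> 4 samples, 3 attributes
```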

Slide 22

Slide 22 text

Dimensionality Reduction
Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.
(illustration: a three-dimensional coordinate system)
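A minimal sketch of such a transformation on synthetic data: scikit-learn's PCA (covered later in this lecture) projects a 3-dimensional dataset onto 2 dimensions and reports how much variance the low-dimensional representation retains.

```python
# Reduce a 3-dimensional dataset to 2 dimensions while keeping most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * X[:, 2]    # make one attribute nearly redundant

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                           # (200, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```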

Slide 23

Slide 23 text

Example: Software Defect Prediction
Number of instances: 10 885 modules. Creators: NASA, http://mdp.ivv.nasa.gov
Hypotheses:
● code with complicated pathways is more error-prone
● code that is hard to read is more likely to be fault-prone
● static measures are useful to guide software quality predictions
● static measures can never be a certain indicator of the presence of a fault
Number of attributes (dimensionality): 22
● 5 different lines-of-code measures
● 3 McCabe metrics (cyclomatic, essential, design complexity)
● 4 base Halstead measures (volume, length, difficulty, intelligence)
● 8 derived Halstead measures and a branch count
● 1 goal field (module has / has not one or more reported defects)
https://www.kaggle.com/datasets/semustafacevik/software-defect-prediction
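A hedged sketch of loading this dataset with pandas and separating the static code measures from the goal field. The file name (jm1.csv) and the label column name (defects) are assumptions about the Kaggle download, not facts stated on the slide.

```python
import pandas as pd

df = pd.read_csv("jm1.csv")                # hypothetical local path to the Kaggle CSV
print(df.shape)                             # expected: (10885, 22)

y = df["defects"]                           # assumed goal field: module has reported defects
X = df.drop(columns=["defects"])            # 21 static measures (LOC, McCabe, Halstead, ...)
print(X.shape, y.value_counts(normalize=True))
```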

Slide 24

Slide 24 text

Example: Operational Data from an Enterprise Application
https://www.kaggle.com/datasets/anomalydetectionml/rawdata
Goal: effectively detect run-time anomalies using machine learning on operational metrics.
The dataset consists of metrics measured from the operating system and from WebLogic Server monitoring beans.

Slide 25

Slide 25 text

Dimensionality Reduction methods
Only keep the most important features:
● Backward elimination
● Forward selection
● Random forests
Find a combination of new features — linear methods:
● PCA
● Factor analysis
● LDA
● Truncated SVD
Find a combination of new features — non-linear methods:
● Kernel PCA
● t-SNE
● MDS
● Isomap
https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b
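As one example from the "only keep the most important features" group, the sketch below uses random-forest feature importances to drop weak attributes. The data is synthetic, and the selection threshold is scikit-learn's default.

```python
# Feature selection with random-forest importances: keep only informative columns.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=22, n_informative=6,
                           random_state=0)

selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)       # fewer columns after selection
```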

Slide 26

Slide 26 text

Principal Component Analysis

Slide 27

Slide 27 text

Principal Component Analysis (PCA)
Principal components are new variables that are linear combinations of the initial variables.
These combinations are constructed in such a way that the new variables are uncorrelated and most of the information within the initial variables is compressed into the first components.
https://builtin.com/data-science/step-step-explanation-principal-component-analysis

Slide 28

Slide 28 text

Principal Component Analysis (PCA)
Prerequisites:
● Linear algebra
○ matrix multiplication, transposition, inverses
○ matrix decomposition
○ eigenvectors and eigenvalues
● Statistics
○ standardization, variance, covariance
○ independence
● Machine learning
○ linear regression
○ feature selection
https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

Slide 29

Slide 29 text

Principal Component Analysis (PCA)
1. Normalize the initial variables
2. Compute the covariance matrix: the correlations between all the possible pairs of variables
3. Compute eigenvectors and rank the eigenvalues in descending order (get the principal components in order of significance)
4. Compute the feature vector, which is a matrix that has as columns the eigenvectors of the components that we decide to keep
5. Reorient the data from the original axes to the ones represented by the principal components
https://builtin.com/data-science/step-step-explanation-principal-component-analysis
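A NumPy sketch that follows these five steps literally on synthetic data (scikit-learn's PCA performs the same computation in a single call):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))

# 1. Normalize (standardize) the initial variables
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized variables
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors/eigenvalues, ranked by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: keep the first k eigenvectors as columns
k = 2
feature_vector = eigvecs[:, :k]

# 5. Reorient the data onto the principal-component axes
X_pca = X_std @ feature_vector
print(X_pca.shape)                          # (100, 2)
```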

Slide 30

Slide 30 text

Factor Analysis

Slide 31

Slide 31 text

Factor Analysis (FA)
It is a method for modeling observed variables and their relationships in terms of unobserved variables, i.e. factors.
It is used to reduce a large number of variables to a smaller number of factors.
Factor analysis is a kind of extension of PCA.
FA is a complex mathematical procedure, so it is performed with software.
(diagram: Var1–Var8 mapped to two latent factors, Factor A and Factor B)
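A hedged sketch mirroring the diagram: eight observed variables generated from two latent factors, then recovered with scikit-learn's FactorAnalysis. The data is synthetic and the loadings are only illustrative.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n = 300
factor_a = rng.normal(size=n)
factor_b = rng.normal(size=n)
# Var1..Var4 load on factor A, Var5..Var8 on factor B, plus noise
X = np.column_stack(
    [factor_a + 0.3 * rng.normal(size=n) for _ in range(4)] +
    [factor_b + 0.3 * rng.normal(size=n) for _ in range(4)]
)

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)
print(fa.components_.round(2))              # loadings of the 8 variables on 2 factors
```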

Slide 32

Slide 32 text

Factor Analysis (FA)
There are two types of FA, exploratory (EFA) and confirmatory (CFA).
EFA is used to find the underlying structure of a large set of variables and reduce data to a smaller set of summary variables, but EFA can generate a large number of possible models for your data.
If you do have an idea about what the models look like, and you want to test your hypotheses about the data structure, CFA is a better approach.
https://www.statisticshowto.com/factor-analysis/
(illustration: geometric interpretation of Factor Analysis)

Slide 33

Slide 33 text

Supervised learning

Slide 34

Slide 34 text

Supervised anomaly detection
Items in the training dataset are labeled into two categories: normal and abnormal.
The model uses these examples to recognize abnormal patterns in previously unseen data.
It is rarely used due to the unavailability of labelled data for known anomalies.
https://www.enjoyalgorithms.com/blogs/supervised-unsupervised-and-semisupervised-learning
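A minimal supervised sketch on synthetic data: a classifier is trained on examples labeled normal/abnormal and evaluated on held-out data; class weighting compensates for the rarity of labeled anomalies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Class 1 ("abnormal") is rare: about 3% of the samples.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.97, 0.03],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), digits=3))
```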

Slide 35

Slide 35 text

Conclusion

Slide 36

Slide 36 text

anomalib: https://github.com/openvinotoolkit/anomalib
A library for benchmarking, developing and deploying deep learning anomaly detection algorithms.

Slide 37

Slide 37 text

Recommended reading: Hands-On Unsupervised Learning (O'Reilly)
https://www.oreilly.com/library/view/hands-on-unsupervised-learning/9781492035633/

Slide 38

Slide 38 text

AI Testing Talks
Thank you! Questions?