Anomaly Detection. Part 3 – Machine Learning Approach

Exactpro
June 06, 2022

Rostislav Yavorski
Head of Research, Exactpro

“In Lecture 3, we are going to discuss the unsupervised, supervised, and semi-supervised learning methods, placing special emphasis on clustering and dimensionality reduction.”

AI Testing Talks – Anomaly Detection. 6 June 2022

https://exactpro.com/events/external/ai-testing-talks-anomaly-detection?utm_source=speakerdeck&utm_medium=Refferer&utm_campaign=machine-learning

---

Transcript

  1. BUILD SOFTWARE TO TEST SOFTWARE | AI Testing Talks | exactpro.com

    Lecture 3. Machine Learning Approach
    MINI-COURSE ON ANOMALY DETECTION FOR AI TESTING
    6 June | 10.00 GET | 11.30 SLST
    Rostislav Yavorski, Head of Research, Exactpro
  2. Terms

    An outlier is a data point that differs significantly from other observations.
    Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.
  3. See Lecture 1 on Applications

    • fraud detection
    • health monitoring
    • medical diagnosis
    • system fault detection
    • cyber-security intrusion detection
    • improving the performance of machine learning algorithms
    https://youtu.be/Mi13lqDVET0
  4. Challenges in Anomaly Detection

    • Defining normal behaviour is extremely challenging
    • Noise in the data is not an anomaly
    • The definition of an anomaly is domain-specific
    • Anomalies evolve over time
    • Getting a set of labeled anomalous instances is difficult
  5. What is normal

    • Average characteristics
    • Similar to many
    • Quite frequent occurrence
    • Predictable from the past
    • Labeled as normal
  6. See Lecture 2 on Statistical Methods

    • Graphical Methods
    • Interquartile Range
    • Tukey's Fences
    • Seasonal and Trend Decomposition
    • Statistical Hypothesis Test
    • p-value and t-statistic
    https://youtu.be/7Rz84cp1xQA
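
As a short refresher on one of the Lecture 2 methods, Tukey's fences can be sketched in a few lines of Python. The sample values and the conventional multiplier `k = 1.5` are assumptions chosen for illustration and do not come from the lecture.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical response times in milliseconds; 250 is planted as the outlier.
response_times = [101, 98, 103, 97, 100, 102, 99, 250]
outliers = tukey_outliers(response_times)
print(outliers)
```

A larger `k` widens the fences and flags fewer points; the right value depends on the domain.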
  9. Unsupervised learning
  10. Unsupervised anomaly detection

    The objective is to detect rare objects or events without any prior knowledge:
    Step 1. Modelling the normal data distribution
    Step 2. Defining a measurement to classify samples as anomalous or normal
  12. Unsupervised anomaly detection

    The objective is to detect rare objects or events without any prior knowledge:
    Step 1. Modelling the normal data distribution
      • Clustering (grouping)
      • Dimensionality (number of parameters) reduction
    Step 2. Defining a measurement to classify samples as anomalous or normal
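
The two steps can be sketched as follows. This is a minimal illustration, not the lecturer's method: Step 1 is reduced to a per-feature mean and standard deviation (a deliberately simple stand-in for clustering or dimensionality reduction), and Step 2 measures each sample's distance from the mean in standard deviations. The observations, the feature names and the threshold of 3.0 are all assumptions.

```python
import math
import statistics

def fit_normal_model(samples):
    """Step 1: summarise the data by per-feature mean and standard deviation."""
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return means, stdevs

def anomaly_score(sample, model):
    """Step 2: distance from the mean, measured in standard deviations."""
    means, stdevs = model
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(sample, means, stdevs)))

# Hypothetical (cpu_load, latency_ms) observations; the last one is planted as unusual.
observations = [(0.30, 110), (0.32, 115), (0.28, 108), (0.31, 112),
                (0.29, 109), (0.33, 114), (0.30, 111), (0.95, 480)]
model = fit_normal_model(observations)
THRESHOLD = 3.0  # assumed cut-off, would need tuning on real data
flagged = [obs for obs in observations if anomaly_score(obs, model) > THRESHOLD]
print(flagged)
```

No labels are used anywhere: the model of "normal" comes from the data itself, which is also why a large share of anomalies in the input would distort it.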
  13. Clustering

  14. Unsupervised. Clustering

    Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups.
  15. Clustering algorithms

    Connectivity models are based on distance connectivity.
    Centroid models represent each cluster by a single mean vector.
    Density models define clusters as connected dense regions in the data space.
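
To make the centroid model concrete, here is a minimal k-means sketch in pure Python. The points and the choice of initial centroids are hypothetical; real projects would normally rely on a library implementation with a more careful initialisation.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Minimal k-means: alternate between assigning points and updating mean vectors."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Each cluster is represented by the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = kmeans(points, initial_centroids=[points[0], points[3]])
print(centroids)
```

For anomaly detection, the distance from a point to its nearest centroid can then serve as the anomaly measurement.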
  16. Agglomerative "bottom-up" algorithm

    Figure labels: chick, duckling, cat, rabbit, hen, dog, rooster, goose
  18. Figure labels: rabbit, chick, duckling, cat, hen, dog, goose, rooster, fish
  19. Figure labels: rabbit, chick, duckling, cat, hen, dog, goose, rooster, fish

    Annotation: Anomaly
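
The bottom-up idea can be sketched in pure Python: every item starts as its own cluster, and the two closest clusters are merged until the requested number remains. An item that is still alone at the end is a candidate anomaly. The feature vectors below are invented for illustration (the slides show only the animal names), single linkage is an assumed choice of cluster distance, and the sketch assumes that the fish is the item the slide marks as the anomaly.

```python
import math
from itertools import combinations

def single_linkage(a, b):
    """Distance between two clusters = distance between their closest members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(items, target_clusters):
    """Bottom-up clustering: start with singletons and merge the closest pair each round."""
    clusters = [[name] for name in items]
    while len(clusters) > target_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(
                [items[n] for n in clusters[pair[0]]],
                [items[n] for n in clusters[pair[1]]],
            ),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid after the pop
        clusters[i].extend(merged)
    return clusters

# Hypothetical features: (number of legs, has feathers, lives in water)
animals = {
    "chick": (2, 1, 0), "duckling": (2, 1, 0), "hen": (2, 1, 0),
    "rooster": (2, 1, 0), "goose": (2, 1, 0),
    "cat": (4, 0, 0), "dog": (4, 0, 0), "rabbit": (4, 0, 0),
    "fish": (0, 0, 1),
}
clusters = agglomerate(animals, target_clusters=3)
singletons = [c[0] for c in clusters if len(c) == 1]
print(singletons)
```

With these made-up features the birds and the mammals merge into two groups, while the fish is left in a cluster of its own.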
  20. Dimensionality Reduction
  21. What is dimension

    The dimension of a dataset corresponds to the number of attributes that exist in a dataset.
    A dataset with a large number of attributes (a hundred or more) is referred to as high-dimensional data.
    Figure: Two dimensional array
  22. Dimensionality Reduction

    Dimension reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.
    Figure: Three dimensional coordinate system
  23. Example: Software Defect Prediction

    Number of instances: 10 885 modules.
    Creators: NASA, http://mdp.ivv.nasa.gov
    Hypotheses:
      • code with complicated pathways is more error-prone
      • code that is hard to read is more likely to be fault-prone
      • static measures are useful to guide software quality predictions
      • static measures can never be a certain indicator of the presence of a fault
    Number of attributes (dimensionality): 22
      • 5 different lines of code measures
      • 3 McCabe metrics (cyclomatic, essential, design complexity)
      • 4 base Halstead measures (volume, length, difficulty, intelligence)
      • 8 derived Halstead measures, a branch-count
      • 1 goal field (module has/has not one or more reported defects)
    https://www.kaggle.com/datasets/semustafacevik/software-defect-prediction
  24. Example. Operational Data from Enterprise Application

    https://www.kaggle.com/datasets/anomalydetectionml/rawdata
    Goal: effectively detect run-time anomalies using machine learning on operation metrics.
    The dataset consists of metrics measured from the operating system and from WebLogic Server monitoring beans.
  25. Dimensionality Reduction methods

    Only keep the most important features:
      • Backward elimination
      • Forward selection
      • Random forests
    Find a combination of new features:
      Linear methods:
        • PCA
        • Factor analysis
        • LDA
        • Truncated SVD
      Non-linear methods:
        • Kernel PCA
        • t-SNE
        • MDS
        • Isomap
    https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b
  26. Principal Component Analysis
  27. Principal Component Analysis (PCA)

    Principal components are new variables that are linear combinations of the initial variables.
    These combinations are constructed so that the new variables are uncorrelated and most of the information within the initial variables is compressed into the first components.
    https://builtin.com/data-science/step-step-explanation-principal-component-analysis
  28. Principal Component Analysis (PCA)

    Prerequisites:
      • Linear algebra
        ◦ matrix multiplication, transposition, inverses
        ◦ matrix decomposition
        ◦ eigenvectors and eigenvalues
      • Statistics
        ◦ standardization, variance, covariance
        ◦ independence
      • Machine learning
        ◦ linear regression
        ◦ feature selection
    https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
  29. Principal Component Analysis (PCA)

    1. Normalize the initial variables
    2. Compute the covariance matrix: the correlations between all the possible pairs of variables
    3. Compute eigenvectors and rank the eigenvalues in descending order (get the principal components in order of significance)
    4. Compute the feature vector, which is a matrix that has as columns the eigenvectors of the components that we decide to keep
    5. Reorient the data from the original axes to the ones represented by the principal components
    https://builtin.com/data-science/step-step-explanation-principal-component-analysis
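
The five steps can be traced in a small pure-Python sketch. To keep it dependency-free it is restricted to two variables, where the eigenvalues of the symmetric 2x2 covariance matrix have a closed form; for real datasets a numerical library would normally do this work. The function names and the sample data (lines of code and cyclomatic complexity for six modules) are hypothetical.

```python
import math
import statistics

def standardize(column):
    """Step 1: rescale a variable to zero mean and unit variance."""
    mean, sd = statistics.fmean(column), statistics.pstdev(column)
    return [(x - mean) / sd for x in column]

def pca_2d(xs, ys):
    """PCA for two variables, keeping only the first principal component."""
    zx, zy = standardize(xs), standardize(ys)
    n = len(zx)
    # Step 2: covariance matrix [[a, b], [b, d]] of the standardized variables.
    a = sum(v * v for v in zx) / n
    d = sum(v * v for v in zy) / n
    b = sum(p * q for p, q in zip(zx, zy)) / n
    # Step 3: eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (a + d) / 2
    spread = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    # Step 4: feature vector = unit eigenvector of the largest eigenvalue.
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    else:
        vx, vy = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    component = (vx / norm, vy / norm)
    # Step 5: project the standardized data onto the kept component.
    projected = [p * component[0] + q * component[1] for p, q in zip(zx, zy)]
    explained = eigenvalues[0] / sum(eigenvalues)  # share of variance retained
    return component, projected, explained

# Hypothetical module metrics: lines of code and cyclomatic complexity.
loc = [10, 20, 30, 40, 50, 60]
complexity = [2, 3, 5, 6, 8, 9]
component, projected, explained = pca_2d(loc, complexity)
print(component, round(explained, 3))
```

Because the two metrics are strongly correlated, a single component retains almost all of the variance, which is the situation in which dropping a dimension loses little information.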
  30. Factor Analysis
  31. Factor Analysis (FA)

    It is a method for modeling observed variables and their relationships in terms of unobserved variables, i.e. factors.
    It is used to reduce a large number of variables to a smaller number of factors.
    Factor analysis is a kind of extension of PCA.
    FA is a complex mathematical procedure, so it is performed with software.
    Figure labels: Var1, Var2, Var3, Var4, Var5, Var6, Var7, Var8, Factor A, Factor B
  32. Factor Analysis (FA)

    There are two types of FA: exploratory (EFA) and confirmatory (CFA).
    EFA is used to find the underlying structure of a large set of variables and reduce the data to a smaller set of summary variables, but it can generate a large number of possible models for your data.
    If you do have an idea about what the models look like, and you want to test your hypotheses about the data structure, CFA is a better approach.
    https://www.statisticshowto.com/factor-analysis/
    Figure: Geometric interpretation of Factor Analysis
  33. Supervised learning
  34. Supervised anomaly detection

    Items in the training dataset are labeled into two categories: normal and abnormal.
    The model uses these examples to recognize abnormal patterns in previously unseen data.
    It is rarely used because labelled data for known anomalies is seldom available.
    https://www.enjoyalgorithms.com/blogs/supervised-unsupervised-and-semisupervised-learning
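
A minimal sketch of the supervised setting, using a nearest-centroid classifier as an assumed (and deliberately simple) choice of model. The feature names, training samples and labels are hypothetical.

```python
import math
from collections import defaultdict

def train_nearest_centroid(samples, labels):
    """Learn one mean vector per label from the labelled training set."""
    grouped = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(sample)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in grouped.items()
    }

def predict(model, sample):
    """Assign the label whose centroid is closest to the unseen sample."""
    return min(model, key=lambda label: math.dist(sample, model[label]))

# Hypothetical labelled data: (cpu_load, error_rate) per time window.
train_x = [(0.30, 0.01), (0.35, 0.02), (0.28, 0.01), (0.90, 0.20), (0.95, 0.25)]
train_y = ["normal", "normal", "normal", "abnormal", "abnormal"]
model = train_nearest_centroid(train_x, train_y)
print(predict(model, (0.32, 0.015)), predict(model, (0.92, 0.22)))
```

The sketch only works because abnormal examples are present in the training set, which is exactly the requirement that makes the supervised approach hard to apply in practice.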
  35. Conclusion

  36. https://github.com/openvinotoolkit/anomalib

    A library for benchmarking, developing and deploying deep learning anomaly detection algorithms
  37. https://www.oreilly.com/library/view/hands-on-unsupervised-learning/9781492035633/

  38. AI Testing Talks

    Thank you! Questions?