Kirill Rudakov. Developing a Massive Log Classification System

Developing a Massive Log Classification System
Rudakov Kirill
Tomsk, 2021

Agenda
• Introduction
• Target data
• Hypotheses and analysis
• Model development
• Results

Analysis of the target process
• Obtaining logs
• Preprocessing
• Detecting signatures
• Grouping and clustering
• Monitoring and alerting
• Launching tests

Log structure

Tools
• Python: https://www.python.org
• PyCharm: https://www.jetbrains.com/pycharm/
• Project Jupyter: https://jupyter.org

Regular expressions
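As a minimal sketch of how regular expressions might collapse near-duplicate log messages, the snippet below masks timestamps and IPv4 addresses with placeholder tokens. The two patterns, the placeholder names, and the helper `normalize` are assumptions for illustration, modeled on the timestamp and IP formats in the synthetic data example; the deck does not show the actual expressions.

```python
import re

# Hypothetical patterns; the real system's expressions are not shown in the deck.
TIMESTAMP = re.compile(r"\b\d{2}/\d{2}/\d{4}t\d{2}:\d{2}:\d{2}\b")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def normalize(message: str) -> str:
    """Lowercase a log message and replace variable fields with placeholders."""
    message = message.lower()
    message = TIMESTAMP.sub("<ts>", message)
    message = IPV4.sub("<ip>", message)
    # Collapse repeated whitespace left over after substitution.
    return re.sub(r"\s+", " ", message).strip()

a = normalize("age an 07/16/2019t18:55:38 issue 85.9.232.124 of mind")
b = normalize("age an 03/28/2020T01:09:01 issue 10.0.0.1 of mind")
```

After masking, the two messages become identical, so only one copy needs to enter the later similarity computations.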

Techniques Bag-of-words VS TF-IDF VS ?

Bag-of-words & TF-IDF
https://miro.medium.com/max/880/1*hLvya7MXjsSc3NS2SoLMEg.png
https://avatars.mds.yandex.net/get-zen_doc/3986249/pub_5f589066197cd55cd9ab2254_5f5890f7197cd55cd9ac1822/scale_1200
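To make the difference between the two representations concrete, here is a small standard-library sketch. The example messages and the plain `log(N / df)` weighting are assumptions chosen for illustration; library implementations often use smoothed variants, and the deck does not say which one was tried.

```python
import math
from collections import Counter

# Hypothetical log messages used only for this illustration.
docs = [
    "connection timeout on node a",
    "connection timeout on node b",
    "disk full on node a",
]
tokenized = [d.split() for d in docs]

# Bag-of-words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tfidf(counts: Counter) -> dict:
    """Weight each count by log(N / df); terms present in every document get 0."""
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = [tfidf(c) for c in bow]
```

Bag-of-words treats "on" and "disk" the same in the third message (both count 1), while TF-IDF pushes the ubiquitous "on" to zero and keeps the rarer "disk" prominent.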

Word2Vec (FastText)
https://miro.medium.com/max/1400/1*hELlVp9hmZbDZVFstS61pg.png

Techniques Bag-of-words VS TF-IDF VS ? W2V

Techniques Bag-of-words VS TF-IDF VS ? Preprocess log instances W2V

N-gram Approach
https://devopedia.org/images/article/219/7356.1569499094.png

Techniques Bag-of-words VS TF-IDF VS ? Preprocess log instances W2V Set of n-grams
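A set of n-grams for one message could be built as in the sketch below. Word-level n-grams with n = 2 and the helper name `ngram_set` are assumptions; the deck does not state the n-gram unit or size that was used.

```python
def ngram_set(text: str, n: int = 2) -> set:
    """Return the set of word n-grams in `text` (as tuples of tokens)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Too short for an n-gram: fall back to the whole message as one item.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

grams = ngram_set("age is an issue of mind over matter")
```

Because the result is a set, repeated n-grams in a message count once, which is what a set-based metric such as Jaccard similarity expects.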

Techniques Bag-of-words VS TF-IDF VS ? Preprocess log instances W2V Set of n-grams Metrics

Jaccard & Cosine similarity
https://dev-to-uploads.s3.amazonaws.com/i/zbj2nxs9dh9mwohapjng.jpg
https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png
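Both metrics are short enough to write out directly. The sketch below uses only the standard library; the handling of empty sets and zero vectors is a convention chosen here, not something the deck specifies.

```python
import math

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

def cosine(u, v) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```

Jaccard compares sets (for example, sets of n-grams), while cosine compares numeric vectors (for example, averaged word embeddings), so the two metrics suit the two representations used in the deck.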

Techniques Bag-of-words VS TF-IDF VS ? Preprocess log instances W2V Set of n-grams Metrics Jaccard + Cosine similarity

Techniques Bag-of-words VS TF-IDF VS ? Dimensionality reduction (PCA) Preprocess log instances W2V Set of n-grams Metrics Jaccard + Cosine similarity

Techniques Bag-of-words VS TF-IDF VS ? Dimensionality reduction (PCA) Preprocess log instances W2V Set of n-grams Metrics Jaccard + Cosine similarity Unfortunately no. We need a clear explanation of the features

Techniques Bag-of-words VS TF-IDF VS ? Dimensionality reduction (PCA) Preprocess log instances K-Means VS K-Medoids VS ? W2V Set of n-grams Metrics Jaccard + Cosine similarity Unfortunately no. We need a clear explanation of the features

Clustering
https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_digits_0011.png
https://3.bp.blogspot.com/-TQYHVkgesMg/WbTcMIOuquI/AAAAAAAAD3Y/dY4YpxJ3OhU5VGppwcrS6j-ewvlddxSjwCLcBGAs/s1600/hcust.PNG

Techniques Bag-of-words VS TF-IDF VS ? Dimensionality reduction (PCA) Preprocess log instances K-Means VS K-Medoids VS ? W2V Set of n-grams Metrics Jaccard + Cosine similarity Unfortunately no. We need a clear explanation of the features Greedy algorithm + Agglomerative clustering

Architecture of Model
• Reduce document duplication with regular expressions
• Use sets of n-grams with Jaccard similarity
• Use W2V with Cosine similarity
• Store similarities in a sparse matrix and discard distant messages (use a threshold)
• Use a greedy algorithm to form clusters and obtain their number
• Refine the clustering model with agglomerative clustering
• We can retrain daily
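The thresholding and greedy steps in this architecture can be sketched as follows. This is one possible reading, not the deck's implementation: the seed-and-absorb rule in `greedy_clusters`, the dictionary used as a sparse store, the threshold value, and the example messages are all assumptions.

```python
from itertools import combinations

def bigrams(text):
    tokens = text.split()
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def sparse_similarities(messages, threshold):
    """Keep only pairs whose similarity reaches the threshold (sparse store)."""
    sets = [bigrams(m) for m in messages]
    sims = {}
    for i, j in combinations(range(len(messages)), 2):
        s = jaccard(sets[i], sets[j])
        if s >= threshold:
            sims[(i, j)] = s
    return sims

def greedy_clusters(n, sims):
    """Each unassigned message seeds a cluster and absorbs its unassigned neighbours."""
    neighbours = {i: set() for i in range(n)}
    for i, j in sims:
        neighbours[i].add(j)
        neighbours[j].add(i)
    labels = [-1] * n
    next_label = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue  # already absorbed by an earlier seed
        labels[seed] = next_label
        for other in neighbours[seed]:
            if labels[other] == -1:
                labels[other] = next_label
        next_label += 1
    return labels, next_label

# Hypothetical messages for illustration.
messages = [
    "age is an issue of mind over matter",
    "age is an issue of mind over matter today",
    "disk full on node a",
    "disk full on node b",
]
sims = sparse_similarities(messages, threshold=0.5)
labels, n_clusters = greedy_clusters(len(messages), sims)
```

The cluster count from the greedy pass could then seed an agglomerative step (for example, as `n_clusters` in scikit-learn's `AgglomerativeClustering`); how the deck's system connects the two stages is not shown.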

How to assess the quality of the model empirically?
• Labeled data
• Synthetic data
• Have a user look at the results
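With labeled data, one simple way to score a clustering is purity: the share of messages that fall in the majority class of their cluster. The deck does not name the metric it used, so purity and the helper below are assumptions chosen only to illustrate the idea.

```python
from collections import Counter, defaultdict

def purity(true_labels, cluster_labels):
    """Fraction of items that belong to the majority true class of their cluster."""
    by_cluster = defaultdict(list)
    for truth, cluster in zip(true_labels, cluster_labels):
        by_cluster[cluster].append(truth)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(true_labels)

# Hypothetical labels: one "b" message was placed in the "a" cluster.
score = purity(["a", "a", "b", "b", "b"], [0, 0, 0, 1, 1])
```

Purity rewards many small clusters, so in practice it would be read alongside the number of clusters produced.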

App Prototype

App UI

Synthetic data example
• age is an issue of mind over matter. if you don't mind, it doesn't matter
• age is an issue that's mind over matter. if you don't mind, it doesn't matter
• age is an 03/11/2019t06:49:31 issue of mind over matter. if you don't mind, whole doesn't matter. comeswomennothingthroughpeopletheres ejemplo siendo
• age is an 03/28/2020t01:09:01 issue of mind over matter. if you don't mind, it doesn't matter
• age an 07/16/2019t18:55:38 issue 85.9.232.124 of mind over matter. if you don't mind, it doesn't prettyyourething matter. pueblo
• age is an 12/16/2020t12:52:31 issue different mind over matter. if you don't reyruaazlulxobwuzozhysjeqnmnyk mind, it doesn't matter. partido
• age awdq_ruzrholckf_omxhw_kkknxzxi is an issue of over matter. if you don't mind, it doesn't matter. algún problemas
• is an issue of mind over matter. if you don't mind, it doesn't matter.
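Variants like these could be produced by randomly dropping words and inserting noise tokens into a base sentence. The sketch below is a guess at such a generator, not the one used for the deck; the function `perturb`, the probabilities, and the noise vocabulary are hypothetical.

```python
import random

def perturb(message, rng, noise_tokens, p_drop=0.1, p_insert=0.1):
    """Randomly drop words and insert noise tokens to mimic variable log fields."""
    out = []
    for word in message.split():
        if rng.random() < p_insert:
            out.append(rng.choice(noise_tokens))  # inject a timestamp, IP, or junk word
        if rng.random() >= p_drop:
            out.append(word)  # keep the original word most of the time
    return " ".join(out)

rng = random.Random(0)  # fixed seed so the variants are reproducible
base = "age is an issue of mind over matter. if you don't mind, it doesn't matter"
noise = ["03/11/2019t06:49:31", "85.9.232.124", "pueblo"]
variants = [perturb(base, rng, noise) for _ in range(5)]
```

Since every variant derives from one known base sentence, the expected cluster assignment is known in advance, which makes the synthetic set usable as ground truth.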

Results (1)

Results (2)

Results
• Data analysis
• Hypothesis testing
• Creating a clustering model
• Support for various projects
• Prototyping a custom application to validate the model in production
• Assessing the model on synthetic and labeled data


