Kirill Rudakov. Developing a Massive Log Classification System

We invite you to join the International Summer School on Data Science in Software Engineering. The summer school, organized by the Laboratory of Software Testing at Tomsk Polytechnic University, will be held online on July 12-16, 2021.
Participation is free. The official language of the school is English.

Students, young researchers and practitioners interested in applying modern data science methods to the development and testing of complex software systems are invited to join. Follow the link to see the full program and register: https://itr-tpu.timepad.ru/event/1629835/

Watch video here: https://youtu.be/jee9bCRvp84
____
To learn more about Exactpro, visit our website https://exactpro.com/

Exactpro

July 15, 2021
Transcript

  1. Analysis of the target process
    • Obtaining logs
    • Preprocessing
    • Detecting signatures
    • Grouping and clustering
    • Monitoring and alerting
    • Launching tests
  2. Techniques
    • Preprocess log instances
    • Bag-of-words vs. TF-IDF vs. ? → W2V, set of n-grams
    • Metrics: Jaccard + cosine similarity
  3. Techniques
    • Preprocess log instances
    • Bag-of-words vs. TF-IDF vs. ? → W2V, set of n-grams
    • Metrics: Jaccard + cosine similarity
    • Dimensionality reduction (PCA)
  4. Techniques
    • Preprocess log instances
    • Bag-of-words vs. TF-IDF vs. ? → W2V, set of n-grams
    • Metrics: Jaccard + cosine similarity
    • Dimensionality reduction (PCA): unfortunately no. We need a clear explanation of the features
  5. Techniques
    • Preprocess log instances
    • Bag-of-words vs. TF-IDF vs. ? → W2V, set of n-grams
    • Metrics: Jaccard + cosine similarity
    • Dimensionality reduction (PCA): unfortunately no. We need a clear explanation of the features
    • K-Means vs. K-Medoids vs. ?
  6. Techniques
    • Preprocess log instances
    • Bag-of-words vs. TF-IDF vs. ? → W2V, set of n-grams
    • Metrics: Jaccard + cosine similarity
    • Dimensionality reduction (PCA): unfortunately no. We need a clear explanation of the features
    • K-Means vs. K-Medoids vs. ? → Greedy algorithm + agglomerative clustering
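
The two metrics named on the Techniques slides can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function names are hypothetical, the choice of word-level n-grams (rather than character n-grams) is an assumption, and the vectors passed to `cosine` stand in for W2V embeddings that a real system would obtain from a trained model.

```python
import math


def word_ngrams(message, n=2):
    """Return the set of word n-grams (as tuples) of a lower-cased message."""
    words = message.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Jaccard on n-gram sets compares the surface form of two messages, while cosine similarity on embeddings compares their vector representations; the slides list both.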
  7. Architecture of the model
    • Reduce document duplication with regular expressions
    • Use sets of n-grams with Jaccard similarity
    • Use W2V with cosine similarity
    • Store similarities in a sparse matrix and discard distant messages (using a threshold)
    • Use a greedy algorithm to form clusters and obtain their number
    • Refine the clustering model with agglomerative clustering
    • The model can be retrained daily
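
The first steps of this architecture can be sketched as follows. Everything here is an assumption made for illustration: the masking patterns in `MASKS`, the function names, the default threshold of 0.6, and the particular greedy rule (each unassigned message seeds a cluster and absorbs its unassigned neighbours). The slides do not specify these details. For brevity the sketch uses Jaccard similarity on word sets only and omits the W2V and agglomerative steps.

```python
import re

# Hypothetical masking rules; a real project would tailor these to its own logs.
MASKS = [
    (re.compile(r"\d{2}/\d{2}/\d{4}t\d{2}:\d{2}:\d{2}"), "<ts>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def normalize(message):
    """Lower-case and mask volatile fields so near-identical messages collapse."""
    text = message.lower()
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return text


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sparse_similarities(token_sets, threshold):
    """Keep only pairs at or above the threshold, as a dict of neighbour dicts."""
    sims = {i: {} for i in range(len(token_sets))}
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            s = jaccard(token_sets[i], token_sets[j])
            if s >= threshold:
                sims[i][j] = s
                sims[j][i] = s
    return sims


def greedy_clusters(sims):
    """Each unassigned item seeds a cluster and absorbs its unassigned neighbours."""
    labels = {}
    next_label = 0
    for seed in sorted(sims):
        if seed in labels:
            continue
        labels[seed] = next_label
        for neighbour in sims[seed]:
            labels.setdefault(neighbour, next_label)
        next_label += 1
    return [labels[i] for i in sorted(sims)], next_label


def cluster_messages(messages, threshold=0.6):
    """Return ({normalized message: cluster id}, number of clusters)."""
    unique = sorted({normalize(m) for m in messages})  # drop exact duplicates
    token_sets = [set(u.split()) for u in unique]
    labels, n_clusters = greedy_clusters(sparse_similarities(token_sets, threshold))
    return dict(zip(unique, labels)), n_clusters
```

The greedy pass yields the number of clusters as a by-product, which is what a subsequent agglomerative step would need as input.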
  8. How to assess the quality of the model empirically?
    • Labeled data
    • Synthetic data
    • Let a user review the results
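
When labeled data is available, one simple way to score a clustering is purity: the fraction of messages that share the majority label of their cluster. The slides do not say which metric was used, so the metric and the function name below are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def purity(predicted, expected):
    """Fraction of items whose label is the majority label of their cluster."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return 0.0
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(predicted, expected):
        by_cluster[cluster_id][label] += 1
    # For every cluster, count the items carrying its most frequent label.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(predicted)
```

Purity rewards many small clusters, so it is best read alongside the number of clusters produced.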
  9. Synthetic data example
    • age is an issue of mind over matter. if you don't mind, it doesn't matter
    • age is an issue that's mind over matter. if you don't mind, it doesn't matter
    • age is an 03/11/2019t06:49:31 issue of mind over matter. if you don't mind, whole doesn't matter. comeswomennothingthroughpeopletheres ejemplo siendo
    • age is an 03/28/2020t01:09:01 issue of mind over matter. if you don't mind, it doesn't matter
    • age an 07/16/2019t18:55:38 issue 85.9.232.124 of mind over matter. if you don't mind, it doesn't prettyyourething matter. pueblo
    • age is an 12/16/2020t12:52:31 issue different mind over matter. if you don't reyruaazlulxobwuzozhysjeqnmnyk mind, it doesn't matter. partido
    • age awdq_ruzrholckf_omxhw_kkknxzxi is an issue of over matter. if you don't mind, it doesn't matter. algún problemas
    • is an issue of mind over matter. if you don't mind, it doesn't matter.
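
The examples above look like one template sentence perturbed with dropped words, inserted timestamps, IP addresses and random tokens. A generator for this kind of data might look like the sketch below. The function names, the probabilities and the exact noise types are assumptions; the slides show only the resulting samples, not how they were produced.

```python
import random
import string


def random_timestamp(rng):
    """Timestamp in the MM/DD/YYYYtHH:MM:SS shape seen in the slide's examples."""
    return "{:02d}/{:02d}/{}t{:02d}:{:02d}:{:02d}".format(
        rng.randint(1, 12), rng.randint(1, 28), rng.randint(2019, 2020),
        rng.randint(0, 23), rng.randint(0, 59), rng.randint(0, 59))


def random_ip(rng):
    return ".".join(str(rng.randint(1, 254)) for _ in range(4))


def random_token(rng, length=12):
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_variant(template, rng, drop_prob=0.1, insert_prob=0.15):
    """Randomly drop words and insert noise tokens after the words that remain."""
    words = []
    for word in template.split():
        if rng.random() < drop_prob:
            continue  # simulate a missing word
        words.append(word)
        if rng.random() < insert_prob:
            maker = rng.choice([random_timestamp, random_ip, random_token])
            words.append(maker(rng))
    return " ".join(words)


def make_dataset(templates, variants_per_template, seed=0):
    """Return (message, template index) pairs; the index serves as the true label."""
    rng = random.Random(seed)
    data = []
    for label, template in enumerate(templates):
        for _ in range(variants_per_template):
            data.append((make_variant(template, rng), label))
    return data
```

Because every variant carries the index of its template, the output doubles as labeled data for checking whether a clustering model groups the variants back together.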
  10. Results
    • Data analysis
    • Hypothesis testing
    • Creating a clustering model
    • Support for various projects
    • Prototyping a custom application to validate the model in production
    • Assessing the model on synthetic and labeled data