Kirill Rudakov. Developing a Massive Log Classification System

We invite you to join the International Summer School on Data Science in Software Engineering. The summer school will be held online on 12-16 July, 2021, organized by the Laboratory of Software Testing, Tomsk Polytechnic University.
Participation is free. The official language of the school is English.

Students, young researchers and practitioners interested in applications of modern data science methods to the development and testing of complex software systems are invited to join. Follow the link to learn the full program and register your participation: https://itr-tpu.timepad.ru/event/1629835/

Watch video here: https://youtu.be/jee9bCRvp84
____
To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro
Facebook https://www.facebook.com/exactpro/
Instagram https://www.instagram.com/exactpro/
Vkontakte https://vk.com/exactpro_llc

Subscribe to Exactpro YouTube channel https://www.youtube.com/c/exactprosystems

Exactpro
July 15, 2021
Transcript

  1. Developing a Massive Log Classification System • Rudakov Kirill • Tomsk, 2021
  2. Agenda • Introduction • Target data • Hypotheses and analysis • Model development • Results
  3. Analysis of the target process • Obtaining logs • Preprocessing • Detecting signatures • Grouping and clustering • Monitoring and alerting • Launching tests
  4. Log structure

  5. Tools • Python: https://www.python.org • PyCharm: https://www.jetbrains.com/pycharm/ • Project Jupyter: https://jupyter.org

  6. Regular expressions
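
Slide 6 carries no text beyond its title. As a minimal sketch of the kind of regex preprocessing it points to (slide 22 describes it as reducing document duplication), the snippet below masks variable fields so that messages differing only in those fields collapse into one template. The patterns, the placeholder tokens and the `normalize` helper are assumptions made for illustration, not the speaker's actual rules.

```python
import re

# Hypothetical patterns; real rules depend on the log format at hand.
# Order matters: timestamps and IPs are masked before bare numbers.
PATTERNS = [
    (re.compile(r"\d{2}/\d{2}/\d{4}t\d{2}:\d{2}:\d{2}"), "<TS>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def normalize(message: str) -> str:
    """Lowercase a message and replace variable fields with placeholders."""
    text = message.lower()
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    # Collapse runs of whitespace left over from the original message.
    return " ".join(text.split())

logs = [
    "Connection from 85.9.232.124 failed at 07/16/2019T18:55:38",
    "Connection from 10.0.0.7 failed at 03/28/2020T01:09:01",
]
# Both messages reduce to the same template, so only one document remains.
templates = {normalize(m) for m in logs}
```

Deduplicating on the normalized text before any similarity computation keeps the number of pairwise comparisons down.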

  7. Techniques • Bag-of-words VS TF-IDF VS ?

  8. Bag-of-words & TF-IDF https://miro.medium.com/max/880/1*hLvya7MXjsSc3NS2SoLMEg.png https://avatars.mds.yandex.net/get-zen_doc/3986249/pub_5f589066197cd55cd9ab2254_5f5890f7197cd55cd9ac1822/scale_1200
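
The slide shows only two illustrations. As a rough sketch of the two representations it names, the code below builds bag-of-words counts and one common TF-IDF variant (term frequency times `log(N / df)`) using the standard library only. The whitespace tokenizer, the sample messages and this particular weighting formula are assumptions; the talk does not say which variant was evaluated.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Count how often each whitespace-separated token occurs."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return weights

docs = [
    "error opening file",
    "error closing socket",
    "timeout opening socket",
]
weights = tf_idf(docs)
```

In the first message, `file` (unique to that message) receives a higher weight than `error` (shared with another message), which is the down-weighting of common terms that TF-IDF adds over plain counts.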

  9. Word2Vec (FastText) https://miro.medium.com/max/1400/1*hELlVp9hmZbDZVFstS61pg.png

  10. Techniques • Bag-of-words VS TF-IDF VS ? • W2V

  11. Techniques • Bag-of-words VS TF-IDF VS ? • Preprocess log instances • W2V

  12. N-gram Approach https://devopedia.org/images/article/219/7356.1569499094.png
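
The slide shows only an illustration of n-grams. A minimal sketch of turning a log message into a set of word n-grams follows; whether the talk used word-level or character-level n-grams, and which `n`, is not stated, so word bigrams and the `word_ngrams` helper are assumptions.

```python
def word_ngrams(message, n=2):
    """Return the set of word n-grams (as tuples) in a message."""
    tokens = message.lower().split()
    # A message shorter than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

grams = word_ngrams("order rejected by gateway")
```

Using a set rather than a list discards repetition counts, which is what makes the result directly usable with a set-based measure such as Jaccard similarity.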

  13. Techniques • Bag-of-words VS TF-IDF VS ? • Preprocess log instances • W2V • Set of n-grams
  14. Techniques • Bag-of-words VS TF-IDF VS ? • Preprocess log instances • W2V • Set of n-grams • Metrics
  15. Jaccard & Cosine similarity https://dev-to-uploads.s3.amazonaws.com/i/zbj2nxs9dh9mwohapjng.jpg https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png
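
For reference, the two measures named on this slide can be written in a few lines. This is a plain-Python sketch of the textbook definitions: Jaccard similarity on sets (for example n-gram sets) and cosine similarity on dense vectors (for example averaged word embeddings). The handling of empty sets and zero vectors is an assumption chosen to avoid division by zero.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)
```

For instance, `jaccard({1, 2, 3}, {2, 3, 4})` shares two of four distinct elements, and `cosine([1, 2], [2, 4])` compares two parallel vectors.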

  16. Techniques • Bag-of-words VS TF-IDF VS ? • Preprocess log instances • W2V • Set of n-grams • Metrics • Jaccard + Cosine similarity
  17. Techniques • Bag-of-words VS TF-IDF VS ? • Dimensionality reduction (PCA) • Preprocess log instances • W2V • Set of n-grams • Metrics • Jaccard + Cosine similarity
  18. Techniques • Bag-of-words VS TF-IDF VS ? • Dimensionality reduction (PCA) • Preprocess log instances • W2V • Set of n-grams • Metrics • Jaccard + Cosine similarity • Unfortunately no. We need a clear explanation of the features
  19. Techniques • Bag-of-words VS TF-IDF VS ? • Dimensionality reduction (PCA) • Preprocess log instances • K-Means VS K-Medoids VS ? • W2V • Set of n-grams • Metrics • Jaccard + Cosine similarity • Unfortunately no. We need a clear explanation of the features
  20. Clustering https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_digits_0011.png https://3.bp.blogspot.com/-TQYHVkgesMg/WbTcMIOuquI/AAAAAAAAD3Y/dY4YpxJ3OhU5VGppwcrS6j-ewvlddxSjwCLcBGAs/s1600/hcust.PNG

  21. Techniques • Bag-of-words VS TF-IDF VS ? • Dimensionality reduction (PCA) • Preprocess log instances • K-Means VS K-Medoids VS ? • W2V • Set of n-grams • Metrics • Jaccard + Cosine similarity • Unfortunately no. We need a clear explanation of the features • Greedy algorithm + Agglomerative clustering
  22. Architecture of Model • Reduce document duplication with RegExps • Use sets of n-grams with Jaccard similarity • Use W2V with Cosine similarity • Store similarities in a sparse matrix: discard distant messages (use a threshold) • Use a Greedy algorithm to form clusters and obtain their number • Refine the clustering model with Agglomerative clustering • We can retrain daily
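
The middle steps of this architecture can be sketched as follows: pairwise Jaccard similarities on bigram sets are stored sparsely (pairs below a threshold are dropped), and a greedy pass then forms clusters and yields their count. The slides do not spell out the greedy rule, so the one used here (seed each cluster with the unassigned message that has the most unassigned neighbours) is an assumption, as are the helper names, the sample messages and the 0.5 threshold. The W2V/cosine branch and the agglomerative refinement are left out.

```python
from itertools import combinations

def bigrams(message):
    tokens = message.lower().split()
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def sparse_similarities(messages, threshold):
    """Keep only pairs at or above the threshold: {(i, j): similarity}, i < j."""
    grams = [bigrams(m) for m in messages]
    sims = {}
    for i, j in combinations(range(len(messages)), 2):
        s = jaccard(grams[i], grams[j])
        if s >= threshold:
            sims[(i, j)] = s
    return sims

def greedy_clusters(n_messages, sims):
    """Greedily group messages that are linked in the sparse similarity map."""
    neighbours = {i: set() for i in range(n_messages)}
    for i, j in sims:
        neighbours[i].add(j)
        neighbours[j].add(i)
    unassigned = set(range(n_messages))
    clusters = []
    while unassigned:
        # Seed: most unassigned neighbours; ties resolve to the lowest index.
        seed = max(sorted(unassigned),
                   key=lambda i: len(neighbours[i] & unassigned))
        members = {seed} | (neighbours[seed] & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

messages = [
    "user login failed for account alpha",
    "user login failed for account beta",
    "user login failed for account gamma",
    "disk quota exceeded on volume one",
    "disk quota exceeded on volume two",
]
sims = sparse_similarities(messages, threshold=0.5)
clusters = greedy_clusters(len(messages), sims)
n_clusters = len(clusters)
```

The all-pairs loop is quadratic in the number of messages, so a sketch like this only scales after the regex deduplication step has shrunk the input. The cluster count it produces is the kind of value one could pass on to an agglomerative model that needs the number of clusters up front.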
  23. How to assess the quality of the model empirically? • Labeled data • Synthetic data • Show the results to a user
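
The slides do not name a metric for the labeled-data check. As one possible illustration, cluster purity measures the share of messages that carry the majority label of their cluster. The metric choice, the `purity` helper and the toy labels below are assumptions, not something reported in the talk.

```python
from collections import Counter

def purity(cluster_ids, labels):
    """Fraction of items whose label is the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, []).append(label)
    # For each cluster, count the items that agree with its most common label.
    majority = sum(Counter(ls).most_common(1)[0][1] for ls in by_cluster.values())
    return majority / len(labels)

cluster_ids = [0, 0, 0, 1, 1]
labels = ["auth", "auth", "db", "db", "db"]
score = purity(cluster_ids, labels)
```

Purity rewards many small clusters, so it is usually read together with the number of clusters rather than on its own.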
  24. App Prototype

  25. App UI

  26. Synthetic data example • age is an issue of mind over matter. if you don't mind, it doesn't matter • age is an issue that's mind over matter. if you don't mind, it doesn't matter • age is an 03/11/2019t06:49:31 issue of mind over matter. if you don't mind, whole doesn't matter. comeswomennothingthroughpeopletheres ejemplo siendo • age is an 03/28/2020t01:09:01 issue of mind over matter. if you don't mind, it doesn't matter • age an 07/16/2019t18:55:38 issue 85.9.232.124 of mind over matter. if you don't mind, it doesn't prettyyourething matter. pueblo • age is an 12/16/2020t12:52:31 issue different mind over matter. if you don't reyruaazlulxobwuzozhysjeqnmnyk mind, it doesn't matter. partido • age awdq_ruzrholckf_omxhw_kkknxzxi is an issue of over matter. if you don't mind, it doesn't matter. algún problemas • is an issue of mind over matter. if you don't mind, it doesn't matter.
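
Variants like the ones on this slide could be produced by perturbing a base sentence. The sketch below is one guess at such a generator: it randomly drops a token, inserts a timestamp-like token and inserts a junk token. The perturbation types, their probabilities and the `noisy_variant` helper are assumptions inferred from the examples, not the speaker's generator.

```python
import random
import string

def noisy_variant(sentence, rng):
    """Return a perturbed copy of the sentence using the given random.Random."""
    tokens = sentence.split()
    # Randomly drop one token.
    if rng.random() < 0.5 and len(tokens) > 1:
        del tokens[rng.randrange(len(tokens))]
    # Randomly insert a timestamp-like token at a random position.
    if rng.random() < 0.5:
        stamp = "{:02d}/{:02d}/{}t{:02d}:{:02d}:{:02d}".format(
            rng.randint(1, 12), rng.randint(1, 28), rng.randint(2019, 2020),
            rng.randint(0, 23), rng.randint(0, 59), rng.randint(0, 59))
        tokens.insert(rng.randrange(len(tokens) + 1), stamp)
    # Randomly insert a junk token made of lowercase letters.
    if rng.random() < 0.5:
        junk = "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(rng.randint(5, 30)))
        tokens.insert(rng.randrange(len(tokens) + 1), junk)
    return " ".join(tokens)

base = "age is an issue of mind over matter. if you don't mind, it doesn't matter."
rng = random.Random(0)  # fixed seed so the generated set is reproducible
variants = [noisy_variant(base, rng) for _ in range(5)]
```

Because every variant descends from a known base sentence, the expected cluster of each generated message is known in advance, which is what makes synthetic data usable as a check on the clustering.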
  27. Results (1)

  28. Results (2)

  29. Results • Data analysis • Hypothesis testing • Creating a clustering model • Support for various projects • Prototyping a custom application to validate the model in production • Assessing the model on synthetic and labeled data
  30. Developing a Massive Log Classification System • Rudakov Kirill • Tomsk, 2021