Kirill Rudakov. Developing a Massive Log Classification System

We invite you to join the International Summer School on Data Science in Software Engineering. The summer school will be held online on 12-16 July, 2021, organized by the Laboratory of Software Testing, Tomsk Polytechnic University.
Participation is free. The official language of the school is English.

Students, young researchers and practitioners interested in applications of modern data science methods to the development and testing of complex software systems are invited to join. Follow the link to learn the full program and register your participation: https://itr-tpu.timepad.ru/event/1629835/

Watch video here: https://youtu.be/jee9bCRvp84
____
To learn more about Exactpro, visit our website https://exactpro.com/


Exactpro

July 15, 2021

Transcript

  1. Developing a Massive Log Classification System
    Rudakov Kirill

    Tomsk, 2021


  2. Agenda
    • Introduction

    • Target data

    • Hypotheses and analysis

    • Model development

    • Results


  3. Analysis of the target process
    Obtaining logs
    Preprocessing
    Detecting signatures
    Grouping and clustering
    Monitoring and alerting
    Launching tests


  4. Log structure


  5. Tools
    Python: https://www.python.org
    PyCharm: https://www.jetbrains.com/pycharm/
    Project Jupyter: https://jupyter.org


  6. Regular expressions


  7. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?


  8. Bag-of-words & TF-IDF
    https://miro.medium.com/max/880/1*hLvya7MXjsSc3NS2SoLMEg.png
    https://avatars.mds.yandex.net/get-zen_doc/3986249/pub_5f589066197cd55cd9ab2254_5f5890f7197cd55cd9ac1822/scale_1200


  9. Word2Vec (FastText)
    https://miro.medium.com/max/1400/1*hELlVp9hmZbDZVFstS61pg.png


  10. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    W2V


  11. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Preprocess
    log instances
    W2V


  12. N-gram Approach
    https://devopedia.org/images/article/219/7356.1569499094.png


  13. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Preprocess
    log instances
    W2V Set of n-grams


  14. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Preprocess
    log instances
    W2V Set of n-grams
    Metrics


  15. Jaccard & Cosine similarity
    https://dev-to-uploads.s3.amazonaws.com/i/zbj2nxs9dh9mwohapjng.jpg https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png
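
The two measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the helper names are hypothetical, and splitting messages into word bigrams is an assumption (the deck does not say whether word or character n-grams were used).

```python
import math


def word_ngrams(text, n=2):
    """Return the set of word n-grams of a message (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


a = word_ngrams("connection to host lost")
b = word_ngrams("connection to server lost")
print(jaccard(a, b))  # one shared bigram out of five distinct ones -> 0.2

# Parallel vectors have a cosine similarity of (approximately) 1.
print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```

Jaccard works directly on sets, so it pairs naturally with n-gram sets, while cosine needs numeric vectors such as averaged word embeddings.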


  16. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Preprocess
    log instances
    W2V Set of n-grams
    Metrics
    Jaccard + Cosine
    similarity


  17. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Dimensionality
    reduction
    (PCA)
    Preprocess
    log instances
    W2V Set of n-grams
    Metrics
    Jaccard + Cosine
    similarity


  18. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Dimensionality
    reduction
    (PCA)
    Preprocess
    log instances
    W2V Set of n-grams
    Metrics
    Jaccard + Cosine
    similarity
    Unfortunately no.
    We need a clear
    explanation of the
    features


  19. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Dimensionality
    reduction
    (PCA)
    Preprocess
    log instances
    K-Means
    VS
    K-Medoids
    VS
    ?
    W2V Set of n-grams
    Metrics
    Jaccard + Cosine
    similarity
    Unfortunately no.
    We need a clear
    explanation of the
    features


  20. Clustering
    https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_digits_0011.png https://3.bp.blogspot.com/-TQYHVkgesMg/WbTcMIOuquI/AAAAAAAAD3Y/dY4YpxJ3OhU5VGppwcrS6j-ewvlddxSjwCLcBGAs/s1600/hcust.PNG


  21. Techniques
    Bag-of-words
    VS
    TF-IDF
    VS
    ?
    Dimensionality
    reduction
    (PCA)
    Preprocess
    log instances
    K-Means
    VS
    K-Medoids
    VS
    ?
    W2V Set of n-grams
    Metrics
    Jaccard + Cosine
    similarity
    Unfortunately no.
    We need a clear
    explanation of the
    features
    Greedy algorithm +
    Agglomerative clustering


  22. Architecture of Model
    • Reduce document duplication with regular expressions

    • Use sets of n-grams with Jaccard similarity

    • Use W2V with cosine similarity

    • Store similarities in a sparse matrix and discard
    distant messages (using a threshold)

    • Use a greedy algorithm to form clusters and obtain
    their number

    • Refine the clustering model with agglomerative clustering

    • The model can be retrained daily
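
The steps listed on this slide might fit together roughly as follows. This is a sketch under stated assumptions, not the system presented in the talk: the masking patterns, the 0.5 threshold, the word-bigram tokenization, and the particular greedy rule (seed each cluster with the message that has the most unassigned neighbours) are all hypothetical choices. The W2V cosine similarity and the agglomerative refinement step are omitted to keep the example short.

```python
import re
from collections import defaultdict

# Hypothetical masking rules; real patterns depend on the log format.
MASKS = [
    (re.compile(r"\d{2}/\d{2}/\d{4}t\d{2}:\d{2}:\d{2}"), "<ts>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def normalize(message):
    """Lowercase a message and replace variable fields with placeholders."""
    text = message.lower()
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return " ".join(text.split())


def ngrams(text, n=2):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sparse_similarities(docs, threshold):
    """Store only pairs whose similarity reaches the threshold (dict of dicts)."""
    grams = [ngrams(d) for d in docs]
    sims = defaultdict(dict)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = jaccard(grams[i], grams[j])
            if s >= threshold:
                sims[i][j] = s
                sims[j][i] = s
    return sims


def greedy_clusters(n_docs, sims):
    """Repeatedly seed a cluster with the message that has the most
    unassigned neighbours, then attach those neighbours to it."""
    unassigned = set(range(n_docs))
    clusters = []
    while unassigned:
        seed = max(sorted(unassigned),
                   key=lambda i: len(set(sims.get(i, {})) & unassigned))
        members = {seed} | (set(sims.get(seed, {})) & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters


logs = [
    "03/11/2019T06:49:31 connection to 85.9.232.124 lost after 30 retries",
    "03/28/2020T01:09:01 connection to 10.0.0.7 lost after 5 retries",
    "user admin logged in from console",
    "user admin logged in from terminal",
    "disk usage above 90 percent",
]

# Masking collapses the first two messages into one document.
unique = list(dict.fromkeys(normalize(m) for m in logs))
sims = sparse_similarities(unique, threshold=0.5)
clusters = greedy_clusters(len(unique), sims)
print(len(unique), clusters)
```

The greedy pass yields both an initial partition and a cluster count, which a later agglomerative step could take as its starting point.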


  23. How to assess the quality
    of the model empirically?
    • Labeled data

    • Synthetic data

    • Let a user look at the results


  24. App Prototype


  25. App UI


  26. Synthetic data example
    • age is an issue of mind over matter. if you don't mind, it doesn't matter


    • age is an issue that's mind over matter. if you don't mind, it doesn't matter


    • age is an 03/11/2019t06:49:31 issue of mind over matter. if you don't mind, whole doesn't matter.
    comeswomennothingthroughpeopletheres ejemplo siendo


    • age is an 03/28/2020t01:09:01 issue of mind over matter. if you don't mind, it doesn't matter


    • age an 07/16/2019t18:55:38 issue 85.9.232.124 of mind over matter. if you don't mind, it doesn't prettyyourething matter. pueblo


    • age is an 12/16/2020t12:52:31 issue different mind over matter. if you don't reyruaazlulxobwuzozhysjeqnmnyk mind, it doesn't matter.
    partido


    • age awdq_ruzrholckf_omxhw_kkknxzxi is an issue of over matter. if you don't mind, it doesn't matter. algún problemas


    • is an issue of mind over matter. if you don't mind, it doesn't matter.
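
The slide does not say how these variants were produced. Judging only from the examples, the noise appears to consist of dropped words and inserted timestamps or random tokens; the sketch below imitates that under this assumption. The function names and probabilities are hypothetical.

```python
import random
import string


def random_timestamp(rng):
    """Build a timestamp shaped like the ones in the examples (MM/DD/YYYYtHH:MM:SS)."""
    return "{:02d}/{:02d}/{}t{:02d}:{:02d}:{:02d}".format(
        rng.randint(1, 12), rng.randint(1, 28), rng.randint(2019, 2020),
        rng.randint(0, 23), rng.randint(0, 59), rng.randint(0, 59))


def random_token(rng, length=12):
    """Build a random lowercase string to act as a noise word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_variant(template, rng, drop_prob=0.1, insert_prob=0.1):
    """Return a noisy copy of the template: each word may be dropped,
    and a timestamp or random token may be inserted after a kept word."""
    out = []
    for word in template.split():
        if rng.random() < drop_prob:
            continue  # drop this word
        out.append(word)
        if rng.random() < insert_prob:
            out.append(rng.choice([random_timestamp, random_token])(rng))
    return " ".join(out)


template = "age is an issue of mind over matter. if you don't mind, it doesn't matter"
rng = random.Random(2021)  # fixed seed so the run is repeatable
for _ in range(3):
    print(make_variant(template, rng))
```

Because every variant comes from a known template, the expected cluster of each message is known in advance, which is what makes such data usable for assessing a clustering model.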


  27. Results (1)


  28. Results (2)


  29. Results
    • Data analysis

    • Hypothesis testing

    • Creating a clustering model

    • Support for various projects

    • Prototyping a custom application to validate the model in
    production

    • Assessing the model on synthetic and labeled data


  30. Developing a Massive Log Classification System
    Rudakov Kirill

    Tomsk, 2021
