PyCon TW 2020: CTI ANT

Chia-En Tsai
September 06, 2020

PyCon Taiwan 2020
Natural Language Processing, Intermediate talk
Speaker: Chia-En Tsai

Transcript

  1. My High School Intern Project: Constructing an AI Helper for Cyber Threat Intelligence Analysis (PYCON TW 2020)
  2. SELF-INTRODUCTION: Chia-En Tsai (Jacklyn)
    • Student at Taipei First Girls High School
    • President of the Artificial Intelligence Club
    • Intern at CyCraft (奧義智慧科技) for 1 year
  3. 2020 HIGH SCHOOL INTERN PERSPECTIVE
    • Actively reaching out
    • Knowing your goals
    • Problem-finding & problem-solving
    • Use & implement
  4. PROJECT GOALS
    • Classifying articles from the largest Chinese security website
    • Setting up a recommendation system of prevalent attack methods
    • Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Intended benefits: helping security analysts quickly grasp an article's theme; helping the security team quickly identify articles related to their daily missions
  5. FREEBUF INTERNET SECURITY PLATFORM (https://www.freebuf.com/#)
    • Most prestigious cybersecurity website in the Asia region
    • Rich and up-to-date Simplified Chinese cybersecurity articles and information
    • Vulnerabilities vs. Enterprise Security
  6. 1ST GOAL: determining whether an article belongs to vulnerabilities or enterprise security
    Workflow: START → Data Crawling → Data Processing → Data Pipeline → Evaluation → Results Visualization
  7. CRAWLING ARTICLE CONTENTS
    • STEP 1: collecting article links, for example:
      <div class="news-img"><a target="_blank" href="https://www.freebuf.com/vuls/227971.html"><title="挖洞经验 | 利用Jira的邮件服务器连通测试功能发现其CSRF漏洞"/></a></div>
      (title in English: "Bug-hunting experience | Finding a CSRF vulnerability through Jira's mail server connectivity test feature")
    • STEP 2: extracting each news article through its article link, saved to news_output1.txt as (article title) (article content...)
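The link-collection step can be sketched with the standard library alone. This is a minimal illustration, not the speaker's crawler: the `NewsLinkParser` class and `collect_article_links` function are hypothetical names, the sample HTML is a simplified stand-in for a listing page, and only the `news-img` class name and the article URL come from the slide.

```python
from html.parser import HTMLParser


class NewsLinkParser(HTMLParser):
    """Collects href values of <a> tags nested inside <div class="news-img">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0  # <div> nesting depth inside a news-img block; 0 means outside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth > 0:
                self._depth += 1
            elif "news-img" in (attrs.get("class") or "").split():
                self._depth = 1
        elif tag == "a" and self._depth > 0:
            href = (attrs.get("href") or "").strip()
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1


def collect_article_links(html):
    """Step 1: return every article link found in news-img blocks."""
    parser = NewsLinkParser()
    parser.feed(html)
    return parser.links


# Simplified, hypothetical listing-page fragment
sample = (
    '<div class="news-img"><a target="_blank" '
    'href="https://www.freebuf.com/vuls/227971.html">...</a></div>'
    '<div class="other"><a href="https://example.com/ad">ad</a></div>'
)
links = collect_article_links(sample)
print(links)
```

Step 2 would then fetch each collected link and write the title and body to a text file; that part is omitted here because the page layout of the article body is not shown on the slide.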
  8. Classifying articles from the largest Chinese security website
    Workflow: START → Data Crawling → Data Processing → Data Pipeline → Evaluation → Results Visualization
  9. DATA PROCESSING
    • REMOVING STOP WORDS: removing common but meaningless words that interfere with classification results, for example 可是 ("but"), 因为 ("because")...
    • TOKENIZING: cutting long articles into meaningful word segments
    Purpose: determining article categories from specific keywords
  10. DATA PROCESSING
    • REMOVING STOP WORDS
      Tools: Simplified Chinese Stop Words List (https://github.com/goto456/stopwords/blob/master/cn_stopwords.txt), for example 一些 ("some"), 不但 ("not only"), 而且 ("moreover")...
      Also removed: common technical terms in cybersecurity articles, such as "代码" ("code"), "項目" ("project"), "信息" ("information")...
    • TOKENIZING
      Tools: Jieba Chinese Tokenizing Library
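The stop-word step amounts to filtering a token list against a set. The sketch below assumes the article has already been tokenized (in the talk's setup that would come from Jieba, for instance `jieba.lcut(text)`); the `remove_stop_words` helper and the sample sentence are hypothetical, and the stop-word set holds only the words quoted on the slides rather than the full cn_stopwords.txt list.

```python
# Words quoted on the slides: general stop words plus common cybersecurity
# terms that the speaker treats as uninformative for classification.
STOP_WORDS = {"一些", "不但", "而且", "可是", "因为", "代码", "項目", "信息"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens and any token found in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Hypothetical, already-tokenized sentence
tokens = ["因为", "攻击者", "利用", "一些", "漏洞", "执行", "代码"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Using a set keeps each membership check constant-time, which matters when the stop-word list has thousands of entries.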
  11. Classifying articles from the largest Chinese security website
    Workflow: START → Data Crawling → Data Processing → Data Pipeline → Evaluation → Results Visualization
  12. COUNTVECTORIZER
    Function: converts text documents to a matrix of word counts
    Example:
      Document 1: "This is a Pycon talk on NLP cyber threat analysis."
      Document 2: "I used NLP tools to identify cyber threat techniques in articles."
      Document 3: "In the talk, I will introduce common python NLP tools. python."

      Feature Name   Pycon  talk  NLP  cyber  threat  tools  Python  ...
      Document 1     1      1     1    1      1       0      0       ...
      Document 2     0      0     1    1      1       1      0       ...
      Document 3     0      1     1    0      0       1      2       ...
      (Trained CountVectorizer converted to an array)
    Purpose: the results are used as input data for TF-IDF (introduced next)
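The count matrix on this slide can be reproduced with a few lines of standard-library Python. This is only a sketch of what a count vectorizer computes, not scikit-learn's implementation: the `count_matrix` helper is hypothetical, the vocabulary is fixed to the seven features shown on the slide, and tokenization is a simple lowercase letter-run split.

```python
import re
from collections import Counter

docs = [
    "This is a Pycon talk on NLP cyber threat analysis.",
    "I used NLP tools to identify cyber threat techniques in articles.",
    "In the talk, I will introduce common python NLP tools. python.",
]
# The seven feature names from the slide, lowercased
features = ["pycon", "talk", "nlp", "cyber", "threat", "tools", "python"]


def count_matrix(documents, vocabulary):
    """Return one row per document with the count of each vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(re.findall(r"[a-z]+", doc.lower()))
        rows.append([counts[word] for word in vocabulary])
    return rows


matrix = count_matrix(docs, features)
for row in matrix:
    print(row)
```

A real vectorizer would learn the vocabulary from the corpus instead of taking it as an argument; it is passed in here so the rows line up with the slide's columns.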
  13. DATA PIPELINE ➝ TfidfTransformer()
    • Function: evaluates the importance of a word to a file within a file set
    • Feature: scales down the impact of general, common tokens in a file set (empirically less informative)
    • Principle: a word's importance increases the more often it appears in a file, and decreases the more files it appears in
    • Purpose: lets the classifier identify important word tokens and use them as the basis for classification
  14. DATA PIPELINE ➝ TfidfTransformer()
    D1: The sky is blue.
    D2: The sky is not blue.

      Word   TF(D1)  TF(D2)  IDF       TF-IDF(D1)      TF-IDF(D2)
      The    1       1       log(2/2)  1*log(2/2) = 0  1*log(2/2) = 0
      sky    1       1       log(2/2)  0               0
      is     1       1       log(2/2)  0               0
      blue   1       1       log(2/2)  0               0
      not    0       1       log(2/1)  0               1*log(2/1) = log(2) ≈ 0.301
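The table's arithmetic can be checked with a short script. It follows the slide's textbook formula, raw term count times log10(N / document frequency); the `tf_idf` helper is a hypothetical name. Note that scikit-learn's `TfidfTransformer` by default uses a smoothed, natural-log IDF and L2-normalizes each row, so its numbers would differ from this table.

```python
import math

docs = {
    "D1": "the sky is blue".split(),
    "D2": "the sky is not blue".split(),
}


def tf_idf(documents):
    """Score each word as count_in_document * log10(N / documents_containing_word)."""
    n_docs = len(documents)
    vocab = sorted({word for words in documents.values() for word in words})
    doc_freq = {w: sum(w in words for words in documents.values()) for w in vocab}
    idf = {w: math.log10(n_docs / doc_freq[w]) for w in vocab}
    return {
        name: {w: words.count(w) * idf[w] for w in vocab}
        for name, words in documents.items()
    }


scores = tf_idf(docs)
for name, row in scores.items():
    print(name, {w: round(v, 3) for w, v in row.items()})
```

Only "not" receives a non-zero score, and only in D2, because it is the one word that does not occur in every document.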
  15. TfidfTransformer()
    Advantages:
    • Simple to implement, with an easy-to-understand algorithm
    • Can filter out some common, irrelevant words while retaining the important words of the article
    Drawbacks:
    • Cannot reflect a word's position. When extracting keywords, words in prominent positions (such as the title, or the beginning or end of an article) should be given a higher weight
  16. DATA PIPELINE ➝ Stochastic Gradient Descent
    [Figure: what Stochastic Gradient Descent multi-class classification looks like; example from sklearn]
    • Function: linearly divides data of many different types into different categories
    • Feature: picks only one sample at each step when determining the classification boundary → efficient
    • Purpose: classification between vulnerabilities and enterprise security
  17. SGDClassifier()
    Advantages:
    • Efficiency: only a single training sample is processed by the model at each step
    • Each step is computationally fast, since only one sample is processed at a time
    Drawbacks:
    • Frequent updates are computationally expensive, because all resources are used to process one training sample at a time
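The "one sample per step" idea can be shown with a tiny linear classifier trained by stochastic gradient descent. This is a sketch under stated assumptions, not the talk's model: it uses a logistic loss (scikit-learn's `SGDClassifier` defaults to hinge loss), the `train_sgd` and `predict` helpers are hypothetical, and the two-feature toy data stands in for word-count vectors of the two article categories.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(samples, labels, epochs=200, lr=0.1, seed=0):
    """Fit a linear boundary, updating the weights after every single sample."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = y - p  # gradient signal from this one sample only
            weights = [w + lr * error * v for w, v in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


# Toy count vectors: label 1 = "vulnerabilities", label 0 = "enterprise security"
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, 0, 0]
weights, bias = train_sgd(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

Each update touches one sample, which keeps a step cheap but means many small updates are needed; that is the trade-off the advantages and drawbacks above describe.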
  18. Classifying articles from the largest Chinese security website
    Workflow: START → Data Crawling → Data Processing → Data Pipeline → Evaluation → Results Visualization
  19. PROJECT GOALS
    • Classifying articles from the largest Chinese security website
    • Setting up a recommendation system of prevalent cyber topics
    • Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Intended benefits: picking the top 10 cybersecurity topics for vulnerabilities and for enterprise security articles respectively; helping the security team quickly identify articles related to their daily missions
  20. Singular Value Decomposition (SVD)
    • Function: decomposes a complex CountVectorizer matrix into several component matrices to expose many properties of the original matrix
    • Example: Japanese research on animal clustering using SVD (https://www.frontiersin.org/articles/10.3389/fpsyt.2018.00087/full)
  21. Singular Value Decomposition (SVD)
    Article 2 most likely belongs to the subtopic ['dog', 'cat'] (https://www.frontiersin.org/articles/10.3389/fpsyt.2018.00087/full)
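A small numeric example shows how SVD can surface subtopics from a count matrix. This is an illustrative sketch assuming NumPy is available: the `svd_topics` helper, the four-word vocabulary, and the three-article matrix are hypothetical and chosen so that the second article lands in a dog/cat subtopic, echoing the slide.

```python
import numpy as np

terms = ["dog", "cat", "car", "bus"]
# Rows are articles, columns are word counts (hypothetical data)
counts = np.array(
    [
        [0, 0, 3, 2],  # article 1: vehicles
        [2, 3, 0, 0],  # article 2: animals
        [1, 1, 0, 0],  # article 3: animals
    ],
    dtype=float,
)


def svd_topics(matrix, vocabulary, n_topics=2, n_terms=2):
    """Return the top terms of each component and each article's strongest component."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    topic_terms = []
    for row in vt[:n_topics]:
        top = np.argsort(-np.abs(row))[:n_terms]  # largest loadings, sign ignored
        topic_terms.append([vocabulary[i] for i in top])
    doc_weights = np.abs(u[:, :n_topics] * s[:n_topics])
    doc_topic = doc_weights.argmax(axis=1).tolist()
    return topic_terms, doc_topic


topic_terms, doc_topic = svd_topics(counts, terms)
print(topic_terms)
print(doc_topic)
```

Absolute values are used because the sign of each singular vector is arbitrary; only the magnitude of a loading says how strongly a word or article is tied to a component.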
  22. PROJECT GOALS
    • Classifying articles from the largest Chinese security website
    • Setting up a recommendation system of prevalent attack methods
    • Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Intended benefits: helping security analysts quickly grasp an article's theme; helping the security team quickly identify articles related to their daily missions
  23. MITRE ATT&CK
    • MITRE is a not-for-profit US organization that operates research and development centers
    • ATT&CK is a framework of observed and known adversarial tactics, techniques, and procedures (TTP) from cybercriminals
    • ATT&CK maps and indexes virtually everything regarding an intrusion from both the attack and defense sides
    Source: https://medium.com/cycraft/cycraft-classroom-mitre-att-ck-vs-cyber-kill-chain-vs-diamond-model-1cc8fa49a20f
  24. VULHUB: Chinese Security Vulnerability Portal
    • MITRE ATT&CK techniques and tools in Simplified Chinese
    • 26 frequent and important ATT&CK methods selected for identification
    Includes:
    ➢ Initial Access
    ➢ Execution
    ➢ Persistence
    ➢ Privilege Escalation
    ➢ Defense Evasion
    ➢ Credential Access
    ➢ Lateral Movement
  25. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Workflow: START → Data Crawling & Processing → Evaluation → Testing with Freebuf Articles → Visualizing Results
    Classifiers: SGD, Naive Bayes, Decision Tree
  26. DATA CRAWLING/PROCESSING
    • Crawling / tokenizing / removing stop words
    • Labeling data
      data: sentences from MITRE ATT&CK descriptions
      labels: sequential indexes as MITRE ATT&CK labels
      example: 攻击者 可能 会 滥用 伪 隐藏 密钥 隐藏 用于 建立 持久性 payload / 命令 (roughly: "attackers may abuse pseudo-hidden keys to hide payloads/commands used to establish persistence"), 3510, key: T1112
      non-attack descriptions are assigned a specific tag
    • Splitting training & testing: sklearn train_test_split function
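The labeling and splitting steps can be sketched as follows. This is one plausible reading of "sequential indexes as labels", not the speaker's code: the sample sentences are placeholders, the `NONE` tag for non-attack text is an assumed name, and `split_train_test` is a standard-library stand-in for scikit-learn's `train_test_split`, which the slide says was actually used.

```python
import random

# Hypothetical (tokenized sentence, technique ID) pairs; "NONE" is an assumed
# tag for non-attack descriptions.
samples = [
    ("placeholder sentence 1", "T1112"),
    ("placeholder sentence 2", "T1059"),
    ("placeholder sentence 3", "NONE"),
    ("placeholder sentence 4", "T1112"),
    ("placeholder sentence 5", "T1059"),
    ("placeholder sentence 6", "NONE"),
    ("placeholder sentence 7", "T1112"),
    ("placeholder sentence 8", "T1059"),
]


def build_label_index(pairs):
    """Map each technique ID to a sequential integer, in order of first appearance."""
    index = {}
    for _, technique in pairs:
        index.setdefault(technique, len(index))
    return index


def split_train_test(pairs, test_ratio=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


label_index = build_label_index(samples)
labeled = [(sentence, label_index[tech]) for sentence, tech in samples]
train, test = split_train_test(labeled)
print(label_index, len(train), len(test))
```

Fixing the seed makes the split reproducible, so the three classifiers compared later are evaluated on the same held-out sentences.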
  27. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Workflow: START → Data Crawling & Processing → Evaluation → Testing with Freebuf Articles → Visualizing Results
    Classifiers: SGD, Naive Bayes, Decision Tree
  28. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Workflow: START → Data Crawling & Processing → Evaluation → Testing with Freebuf Articles → Visualizing Results
    Classifiers: SGD, Naive Bayes, Decision Tree
  29. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Workflow: START → Data Crawling & Processing → Evaluation → Testing with Freebuf Articles → Visualizing Results
    Classifiers: SGD, Naive Bayes, Decision Tree
  30. DecisionTreeClassifier()
    Advantages:
    • Easy to understand: presents all of the decision alternatives visually, in a format that is easy to follow
    • Versatile: a multitude of business problems can be analyzed and solved with a decision tree
    Drawbacks:
    • Instability: a small change in the data can cause a large change in the structure of the decision tree
    • Training a decision tree often takes more time
  31. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Workflow: START → Data Crawling & Processing → Evaluation → Testing with Freebuf Articles → Visualizing Results
    Classifiers: SGD, Naive Bayes, Decision Tree
  32. Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
    Workflow: START → Data Crawling & Processing → Evaluation → Testing with Freebuf Articles → Visualizing Results
    Classifiers: SGD, Naive Bayes, Decision Tree
  33. CONCLUSION
    • Classifying articles from the largest Chinese security website: jieba, Countvec, TF-IDF, SGD
    • Setting up a recommendation system of prevalent attack methods: Countvec, SVD
    • Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques: Countvec, TF-IDF, SGD, Naive Bayes, Decision Tree