PyCon TW 2020: CTI ANT

My High School Intern Project: Constructing an AI Helper for Cyber Threat Intelligence Analysis
PyCon TW 2020

SELF-INTRODUCTION
Chia-En Tsai (Jacklyn)
• Taipei First Girls High School student
• Artificial Intelligence Club president
• Intern at CyCraft (奧義智慧科技) for 1 year

2020 High School Intern Perspective
• ACTIVELY REACHING OUT
• KNOWING YOUR GOALS
• PROBLEM-FINDING & PROBLEM-SOLVING
• USE & IMPLEMENT

iThome Cybersecurity Exhibition

PROJECT GOALS
• Classifying articles from the largest Chinese security website
• Setting up a recommendation system of prevalent attack methods
• Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
Purpose: helping security analysts quickly grasp each article's theme, and helping the security team quickly identify articles related to their daily missions

FREEBUF INTERNET SECURITY PLATFORM https://www.freebuf.com/#
• Most prestigious cybersecurity website in Asia
• Rich and up-to-date Simplified Chinese cybersecurity articles and information
• Vulnerabilities vs. Enterprise Security

1ST GOAL: determining whether an article belongs to Vulnerabilities or Enterprise Security
START ➝ DATA CRAWLING ➝ DATA PROCESSING ➝ DATA PIPELINE ➝ EVALUATION ➝ RESULTS VISUALIZATION

Crawling Article Contents
STEP 1: collecting article links
<div class="news-img"><a target="_blank" href="https://www.freebuf.com/vuls/227971.html "><title="挖洞经验 | 利用 Jira的邮件服务器连通测试功能发现其CSRF漏洞"/></a></div>
(title in English: "Bug-hunting experience | Finding a CSRF vulnerability through Jira's mail server connectivity test feature")
STEP 2: extracting each news article through its article link
news_output1.txt: (article title) (article content...)
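Step 1 above could be sketched with Python's standard-library HTML parser. The `news-img` class name comes from the snippet on the slide. The helper names `NewsLinkParser` and `collect_article_links`, and the choice of `html.parser`, are assumptions made for illustration and not necessarily what the project used.

```python
from html.parser import HTMLParser


class NewsLinkParser(HTMLParser):
    """Collects href values of <a> tags nested inside <div class="news-img">."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # greater than 0 while inside a news-img div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Track nesting so that inner divs do not end the block early.
            if self._depth or "news-img" in classes:
                self._depth += 1
        elif tag == "a" and self._depth and attrs.get("href"):
            self.links.append(attrs["href"].strip())

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


def collect_article_links(listing_html):
    """Return every article link found in a listing page's HTML (hypothetical helper)."""
    parser = NewsLinkParser()
    parser.feed(listing_html)
    return parser.links


# Small offline sample modeled on the slide's snippet.
sample = (
    '<div class="news-img"><a target="_blank" '
    'href="https://www.freebuf.com/vuls/227971.html "></a></div>'
    '<div class="other"><a href="https://example.com/ignored"></a></div>'
)
links = collect_article_links(sample)
print(links)
```

Step 2 would then download each collected link and write the article title and content to a text file such as news_output1.txt.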

Classifying articles from the largest Chinese security website
START ➝ DATA CRAWLING ➝ DATA PROCESSING ➝ DATA PIPELINE ➝ EVALUATION ➝ RESULTS VISUALIZATION

DATA PROCESSING
REMOVING STOP WORDS: removing common but meaningless words that interfere with classification results, for example 可是 (but), 因为 (because)...
TOKENIZING: cutting long articles into meaningful word segments
Purpose: determining article categories from specific keywords

DATA PROCESSING
REMOVING STOP WORDS
Tools: Simplified Chinese Stop Words List https://github.com/goto456/stopwords/blob/master/cn_stopwords.txt
一些 (some), 不但 (not only), 而且 (moreover)...
Common technical terms in cybersecurity articles: "代码" (code), "項目" (project), "信息" (information)...
TOKENIZING
Tools: Jieba Chinese Tokenizing Library
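The stop-word step can be illustrated with a short sketch. The stop words and technical terms below are the ones quoted on the slides. The function names are hypothetical, and the token list is segmented by hand so that the example needs no third-party package; in the project the tokens would come from Jieba.

```python
def load_stop_words(lines, extra_terms=()):
    """Build a stop-word set from list-file lines plus domain-specific terms."""
    words = {line.strip() for line in lines if line.strip()}
    words.update(extra_terms)
    return words


def remove_stop_words(tokens, stop_words):
    """Keep tokens that are non-blank and not in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Stand-in for the lines of cn_stopwords.txt (only the words quoted on the slides).
stop_list_lines = ["可是", "因为", "一些", "不但", "而且"]
# Frequent but uninformative technical terms named on the slide.
domain_terms = ["代码", "項目", "信息"]
stop_words = load_stop_words(stop_list_lines, domain_terms)

# In the project the tokens would come from Jieba, e.g. jieba.lcut(article_text);
# this hand-segmented sentence is an illustrative assumption.
tokens = ["因为", "代码", "存在", "漏洞", "而且", "可以", "远程", "执行"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Adding the domain terms to the general list reflects the slide's point: words such as "代码" appear in nearly every security article, so they carry little information about the category.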

DATA PIPELINE: COUNT VECTORIZER ➝ TF-IDF ➝ SGD CLASSIFIER

COUNTVECTORIZER
Function: converts text documents to a matrix of word counts
Example:
Document 1: "This is a Pycon talk on NLP cyber threat analysis."
Document 2: "I used NLP tools to identify cyber threat techniques in articles."
Document 3: "In the talk, I will introduce common python NLP tools. python."

Feature Name  Pycon  talk  NLP  cyber  threat  tools  Python  ...
Document 1    1      1     1    1      1       0      0       ...
Document 2    0      0     1    1      1       1      0       ...
Document 3    0      1     1    0      0       1      2       ...
(trained CountVectorizer output converted to an array)

Purpose: the results are used as the input data of TF-IDF (introduced next)
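The count table above can be reproduced with a few lines of standard-library Python. This is a simplified stand-in for CountVectorizer, not the library class itself: it lowercases the text and keeps tokens of two or more word characters, and it only counts the seven feature names shown on the slide. The function name `count_matrix` is hypothetical.

```python
import re
from collections import Counter


def count_matrix(documents, vocabulary):
    """Return one row of word counts per document, one column per vocabulary word."""
    rows = []
    for doc in documents:
        # Lowercase, then keep tokens made of two or more word characters.
        counts = Counter(re.findall(r"\b\w\w+\b", doc.lower()))
        rows.append([counts[word.lower()] for word in vocabulary])
    return rows


documents = [
    "This is a Pycon talk on NLP cyber threat analysis.",
    "I used NLP tools to identify cyber threat techniques in articles.",
    "In the talk, I will introduce common python NLP tools. python.",
]
vocabulary = ["Pycon", "talk", "NLP", "cyber", "threat", "tools", "Python"]

matrix = count_matrix(documents, vocabulary)
for name, row in zip(["Document 1", "Document 2", "Document 3"], matrix):
    print(name, row)
```

Each row is the bag-of-words vector of one document; word order is discarded and only the counts remain.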

DATA PIPELINE ➝ TfidfTransformer()
Function: evaluates how important a word is to one document within a set of documents
Feature: scales down the impact of general, common tokens that appear across the document set (empirically less informative)
Principle: a word's importance increases the more often it appears in a document and decreases the more documents it appears in
Purpose: lets the classifier identify important word tokens and use them as the basis for classification

DATA PIPELINE ➝ TfidfTransformer()
D1: The sky is blue.
D2: The sky is not blue.

Term  TF (D1)  TF (D2)  IDF       TF-IDF (D1)     TF-IDF (D2)
The   1        1        log(2/2)  1*log(2/2) = 0  1*log(2/2) = 0
sky   1        1        log(2/2)  0               0
is    1        1        log(2/2)  0               0
blue  1        1        log(2/2)  0               0
not   0        1        log(2/1)  0               1*log(2/1) = log(2) ≈ 0.301
(log is the base-10 logarithm)
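The worked example can be checked with a minimal sketch that uses raw term counts and idf = log10(N / df), the textbook formula behind the table. The function name `tf_idf` is hypothetical. scikit-learn's TfidfTransformer, by default, uses a smoothed natural-log idf and L2-normalizes each row, so its numbers differ from these.

```python
import math


def tf_idf(documents):
    """Return {term: [score per document]} with tf = raw count, idf = log10(N / df)."""
    tokenized = [doc.lower().rstrip(".").split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    scores = {}
    for term in vocabulary:
        # Document frequency: number of documents that contain the term.
        df = sum(term in tokens for tokens in tokenized)
        idf = math.log10(n_docs / df)
        scores[term] = [tokens.count(term) * idf for tokens in tokenized]
    return scores


scores = tf_idf(["The sky is blue.", "The sky is not blue."])
for term, values in scores.items():
    print(term, [round(v, 3) for v in values])
```

Only "not" receives a non-zero score, and only in D2: it is the single word that distinguishes the two sentences.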

TfidfTransformer()
Advantages:
• Simple to implement, with an easy-to-understand algorithm
• Can filter out common, irrelevant words while retaining the important words of the article
Drawbacks:
• Word position is not taken into account. When extracting keywords, words in prominent positions (the title, or the beginning or end of an article) should arguably be given a higher weight

DATA PIPELINE ➝ Stochastic Gradient Descent
Function: linearly divides different types of data into different categories
Feature: picks only one sample per step when determining the classification boundary → efficient
Purpose: classification between vulnerabilities and enterprise security
(Figure: what Stochastic Gradient Descent multi-class classification looks like; example from sklearn)

SGDClassifier() https://www.youtube.com/watch?v=vMh0zPT0tLI&t=298s
Example: finding the line of regression between weight and height

SGDClassifier()
Advantages:
• Efficiency: only a single training sample is processed at each step
• Each update is computationally cheap because only one sample is processed at a time
Drawbacks:
• Updating after every single sample means very frequent updates, which adds overhead because all resources are spent processing one training sample at a time
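The one-sample-per-step idea can be sketched on the regression example mentioned above. The data points are made up (they lie exactly on y = 2x + 1), and the function name, learning rate, and epoch count are assumptions chosen for illustration. SGDClassifier applies the same kind of per-sample update to a classification loss rather than to squared error.

```python
import random


def sgd_line_fit(xs, ys, lr=0.05, epochs=300, seed=0):
    """Fit y = slope * x + intercept by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            # The gradient is computed from ONE sample, not from the whole data set.
            error = (slope * xs[i] + intercept) - ys[i]
            slope -= lr * 2 * error * xs[i]
            intercept -= lr * 2 * error
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2 * x + 1 for x in xs]

slope, intercept = sgd_line_fit(xs, ys)
print(round(slope, 3), round(intercept, 3))
```

Full gradient descent would sum the error over every sample before each update; here each update touches a single sample, which is what makes individual steps cheap on large data sets.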

EVALUATION
                     Precision  Recall  F1-score
Vulnerabilities      92%        96%     94%
Enterprise Security  95%        91%     93%
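The F1-score is the harmonic mean of precision and recall, so the reported figures can be cross-checked with a short calculation. The function name is hypothetical; the precision and recall values are the ones reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported on the evaluation slide.
reported = {
    "Vulnerabilities": (0.92, 0.96),
    "Enterprise Security": (0.95, 0.91),
}
f1 = {label: round(f1_score(p, r), 2) for label, (p, r) in reported.items()}
print(f1)
```

Both computed values agree with the F1 column of the table once rounded to two decimals.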

RESULTS VISUALIZATION
A correct classification!

RESULTS VISUALIZATION
remote code execution vulnerability

PROJECT GOALS
• Classifying articles from the largest Chinese security website
• Setting up a recommendation system of prevalent cyber topics
• Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
Purpose: picking the top 10 cybersecurity topics for vulnerabilities articles and for enterprise security articles respectively, and helping the security team quickly identify articles related to their daily missions

Singular Value Decomposition (SVD)
Function: decomposes a complex CountVectorizer matrix into several component matrices that expose many properties of the original matrix
Example: Japanese research on animal clustering using SVD https://www.frontiersin.org/articles/10.3389/fpsyt.2018.00087/full

Singular Value Decomposition (SVD)
Article 2 most likely belongs to the subtopic ['dog', 'cat'] https://www.frontiersin.org/articles/10.3389/fpsyt.2018.00087/full
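How SVD exposes subtopics can be sketched on a tiny count matrix. Everything below is an illustrative assumption: the counts are made up, the helper names are hypothetical, and the leading singular vectors are approximated with power iteration plus deflation so that the example needs only the standard library. A real project would more likely call a library SVD routine.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_components(X, k, iters=200):
    """Approximate the k largest singular triplets (sigma, u, v) of X."""
    X = [list(map(float, row)) for row in X]
    components = []
    for _ in range(k):
        Xt = [list(col) for col in zip(*X)]
        v = [1.0 / (j + 1) for j in range(len(Xt))]  # arbitrary non-zero start
        for _ in range(iters):
            w = matvec(Xt, matvec(X, v))  # one power-iteration step on X^T X
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        Xv = matvec(X, v)
        sigma = math.sqrt(sum(x * x for x in Xv))
        u = [x / sigma for x in Xv]
        components.append((sigma, u, v))
        # Deflation: remove the component just found, then look for the next one.
        X = [[X[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return components


terms = ["dog", "cat", "fish", "bird"]
# Rows are articles 1 to 4, columns are word counts for `terms` (made-up data).
counts = [
    [2, 1, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
components = top_components(counts, k=2)

# The two heaviest terms of each right singular vector describe a subtopic.
topics = [
    sorted(terms, key=lambda t: -abs(v[terms.index(t)]))[:2]
    for _, _, v in components
]

# Assign article 2 (row index 1) to the component it projects onto most strongly.
article = 1
article_topic = max(
    range(len(components)),
    key=lambda c: abs(components[c][0] * components[c][1][article]),
)
print(topics, article_topic)
```

Each right singular vector groups words that tend to occur together, and each article is assigned to the component on which it has the largest projection, which is how article 2 ends up under the dog/cat subtopic in this toy data.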

Recommendation System of attack methods
Results: clusters of "vulnerabilities" articles

Recommendation System of attack methods
Results: clusters of "enterprise security" articles

PROJECT GOALS
• Classifying articles from the largest Chinese security website
• Setting up a recommendation system of prevalent attack methods
• Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
Purpose: helping security analysts quickly grasp each article's theme, and helping the security team quickly identify articles related to their daily missions

• MITRE is a not-for-profit US organization that operates research and development centers
• ATT&CK is a framework of observed and known adversarial tactics, techniques, and procedures (TTP) from cybercriminals
• ATT&CK maps and indexes virtually everything regarding an intrusion from both the attack and defense sides
https://medium.com/cycraft/cycraft-classroom-mitre-att-ck-vs-cyber-kill-chain-vs-diamond-model-1cc8fa49a20f

VULHUB Chinese Security Vulnerability Portal
• MITRE ATT&CK techniques and tools in Simplified Chinese
• Selected 26 frequent and important ATT&CK techniques for identification
Includes:
➢ Initial Access
➢ Execution
➢ Persistence
➢ Privilege Escalation
➢ Defense Evasion
➢ Credential Access
➢ Lateral Movement

Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques
START ➝ Data Crawling & Processing ➝ Evaluation ➝ Testing with FreeBuf Articles ➝ Visualizing Results
Classifiers: SGD, Naive Bayes, Decision Tree

DATA CRAWLING/PROCESSING
• Crawling / Tokenizing / Removing stop words
• Labeling Data
  data: sentences from MITRE ATT&CK descriptions
  labels: sequential indexes as MITRE ATT&CK labels
  example: 攻击者可能会滥用伪隐藏密钥隐藏用于建立持久性 payload / 命令 (attackers may abuse pseudo-hidden keys to hide payloads/commands used to establish persistence) 3510 key: T1112
  non-attack descriptions are assigned a specific tag
• Splitting Training & Testing: sklearn train_test_split function
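The labeling and splitting steps might look like the following sketch. The sentences are placeholders, the tag name `non-attack` and the function names are hypothetical, and apart from T1112 (taken from the slide) the technique IDs are only illustrative. The split is written by hand to keep the example dependency-free; the project itself used sklearn's train_test_split.

```python
import random

NON_ATTACK = "non-attack"  # hypothetical name for the special non-attack tag


def build_label_index(technique_ids):
    """Map each distinct label to a sequential integer index."""
    return {label: i for i, label in enumerate(sorted(set(technique_ids)))}


def split_train_test(samples, test_ratio=0.25, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# (sentence, label) pairs; the sentences are placeholders, not real training data.
raw = [
    ("sentence A", "T1112"), ("sentence B", "T1112"),
    ("sentence C", "T1059"), ("sentence D", "T1059"),
    ("sentence E", "T1003"), ("sentence F", "T1003"),
    ("sentence G", NON_ATTACK), ("sentence H", NON_ATTACK),
]

label_index = build_label_index([label for _, label in raw])
samples = [(text, label_index[label]) for text, label in raw]
train, test = split_train_test(samples)
print(label_index, len(train), len(test))
```

Giving non-attack sentences their own index lets the classifier answer "no technique found" instead of forcing every sentence into an ATT&CK class.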

2nd CLASSIFIER: COUNT VECTORIZER ➝ Naive Bayes CLASSIFIER

MultinomialNB() Example

MultinomialNB() Drawbacks

3rd CLASSIFIER: COUNT VECTORIZER ➝ TF-IDF ➝ Decision Tree CLASSIFIER

DecisionTreeClassifier() https://www.youtube.com/watch?v=7VeUPuFGJHk

DecisionTreeClassifier()
Advantages:
• Easy to understand: presents all of the decision alternatives visually, in a format that is easy to follow
• Versatile: a multitude of business problems can be analyzed and solved with a decision tree
Drawbacks:
• Instability: a small change in the data can cause a large change in the structure of the decision tree
• Decision trees often take longer to train

Evaluation: Accuracy 43.6% 21.6% Multinomial Naive Bayes 37.3% Decision Tree
Stochastic Gradient Descent 3% Random Guess 51

Visualized Demonstration: MITRE ATT&CK recognition (Flask)

CONCLUSION
• Classifying articles from the largest Chinese security website: jieba, Countvec, TF-IDF, SGD
• Setting up a recommendation system of prevalent attack methods: Countvec, SVD
• Recognizing attack techniques in articles and labeling them with MITRE ATT&CK techniques: Countvec, TF-IDF, SGD, Naive Bayes, Decision Tree

THANK YOU! Q&A

Gradient Descent https://www.youtube.com/watch?v=vMh0zPT0tLI&t=298s

SGDClassifier() https://www.youtube.com/watch?v=vMh0zPT0tLI&t=298s
