Slide 2

About me › Manager of the Security R&D team at LINE › Stanford MS in CS (network and system security) › Seoul Nat'l Univ BS in CSE w/ Economics and Psychology › I love interdisciplinary stuff › 10+ years of pen-testing in the public and private sectors › CISSP

Slide 3

Agenda › Small data challenge › Preliminaries › System architecture › Experiment › Conclusion

Slide 4

Finding bugs › One of the best strategies for protecting IT systems from malicious hackers › Code auditing is a highly sophisticated task; we need a team of ethical hackers › Scalability issue › Can we build an autonomous system for this?

Slide 5

Deep learning? › We need an intelligent system › First of all, the system must be able to understand code › Natural Language Processing (NLP) has progressed a lot w/ deep learning › Programming languages and natural languages share a lot of properties

Slide 6

Small dataset challenge › Deep learning requires a humongous amount of data for training › More parameters, more data to train them › GPT-3 used 499B tokens › It is almost impossible to get that many legitimate security bug samples › This is the most common pain point in the security area

Slide 7

Our ingredients (diagram) › Transfer learning → reduce training params › Foreign-domain data → meta-training › Meta-learning → few-shot learner

Slide 8

Transfer learning (diagram) › Pre-train on source data (Dog, Wolf, …, Rat) › Weight transfer › Fine-tune on target data (Bike, Bicycle, …, Car)
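
Below is a minimal sketch of this weight-transfer-then-fine-tune recipe, using PyTorch and the Hugging Face transformers library; the model name and head size are illustrative, not the ones used in this work:

    import torch
    from transformers import AutoModel

    # Weight transfer: start from an encoder pre-trained on a source domain.
    backbone = AutoModel.from_pretrained("albert-base-v2")

    # A fresh task head is the only randomly initialized part.
    head = torch.nn.Linear(backbone.config.hidden_size, 2)

    # Fine-tuning: update the weights on the (small) target dataset, typically
    # with a small learning rate to stay near the pre-trained solution.
    optimizer = torch.optim.Adam(
        list(backbone.parameters()) + list(head.parameters()), lr=1e-5)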

Slide 9

Transformer and BERT (diagram) › Scaled dot-product attention: MatMul → Scale → SoftMax → MatMul over Q, K, V › Encoder block: Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm › Input Embedding + Positional Encoding › A stack of encoders produces the output embeddings
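
The scaled dot-product attention from the diagram, written out as a small PyTorch function (a textbook sketch, not code from the paper):

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V):
        # MatMul + Scale: query/key similarity, scaled by sqrt(d_k)
        scores = Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5
        # SoftMax: normalize scores into attention weights
        weights = F.softmax(scores, dim=-1)
        # MatMul: weighted sum of the values
        return weights @ V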

Slide 10

Few-shot learning (diagram) › 5-way, 1-shot image classification › 2-way, 1-shot image classification

Slide 11

Meta-learning › Learning a learner › Given experience on previous tasks, learn a new task quickly › Diagram: meta-training tasks and meta-testing tasks

Slide 12

Model-Agnostic Meta-Learning › MAML, Finn et al., 2017 › Optimization-based meta-learning › $\min_\theta \sum_{\text{task } i} \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, D^{tr}_i),\; D^{ts}_i\big)$ › Reptile, Nichol et al., 2018 › First-order approximation › $\theta \leftarrow \theta + \alpha \frac{1}{n} \sum_{i=1}^{n} \big(U^k_{\tau_i}(\theta) - \theta\big)$
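
A minimal sketch of the Reptile outer update above, in PyTorch; the task.loss interface is hypothetical and the inner/outer learning rates are illustrative:

    import torch

    def reptile_step(model, tasks, inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
        meta_weights = {n: p.detach().clone() for n, p in model.named_parameters()}
        deltas = {n: torch.zeros_like(w) for n, w in meta_weights.items()}
        for task in tasks:
            # Reset to the meta-parameters, then take k inner SGD steps on task i
            with torch.no_grad():
                for n, p in model.named_parameters():
                    p.copy_(meta_weights[n])
            opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
            for _ in range(inner_steps):
                loss = task.loss(model)   # hypothetical: the task supplies its loss
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                for n, p in model.named_parameters():
                    deltas[n] += p - meta_weights[n]   # U^k_tau_i(theta) - theta
        # theta <- theta + alpha * (1/n) * sum_i (U^k_tau_i(theta) - theta)
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.copy_(meta_weights[n] + meta_lr * deltas[n] / len(tasks))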

Slide 13

Architecture (diagram) › Source codes → preprocessing (code slicing, JSON query) → BPE tokenizer → ALBERT encoder stack → Bi-LSTM → FCNN + Softmax for the start position, FCNN + Softmax for the end position, and an FCNN for "Is vuln?"
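
A hypothetical reconstruction of this architecture as a PyTorch module; the layer sizes and the pooling choice for the "is vuln?" head are assumptions read off the diagram, not the authors' code:

    import torch
    from transformers import AlbertModel

    class VulnFinder(torch.nn.Module):
        def __init__(self, lstm_hidden=256):
            super().__init__()
            self.encoder = AlbertModel.from_pretrained("albert-base-v2")
            hidden = self.encoder.config.hidden_size
            self.bilstm = torch.nn.LSTM(hidden, lstm_hidden,
                                        batch_first=True, bidirectional=True)
            self.span_head = torch.nn.Linear(2 * lstm_hidden, 2)  # start/end logits
            self.vuln_head = torch.nn.Linear(hidden, 2)           # "is vuln?" logits

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids, attention_mask=attention_mask)
            seq, _ = self.bilstm(out.last_hidden_state)
            start_logits, end_logits = self.span_head(seq).split(1, dim=-1)
            is_vuln = self.vuln_head(out.last_hidden_state[:, 0])  # first token
            return start_logits.squeeze(-1), end_logits.squeeze(-1), is_vuln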

Slide 17

Prediction process (diagram) › BPE tokens (e.g., ▁< script > ▁docu ment . onload = alert ( …) flow through the embedding layer, the transformer layers, an LSTM layer, and a softmax layer
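
Turning the two softmax outputs into a predicted span is a small search; a minimal sketch (the span-length cap is an assumption):

    import torch

    def decode_span(start_logits, end_logits, tokens, max_len=30):
        # Pick the (start, end) pair with the highest combined score,
        # requiring start <= end and a bounded span length.
        start_lp = torch.log_softmax(start_logits, dim=-1)
        end_lp = torch.log_softmax(end_logits, dim=-1)
        best, span = float("-inf"), (0, 0)
        for s in range(len(tokens)):
            for e in range(s, min(s + max_len, len(tokens))):
                if start_lp[s] + end_lp[e] > best:
                    best, span = start_lp[s] + end_lp[e], (s, e)
        return tokens[span[0]:span[1] + 1]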

Slide 18

Training strategy (diagram) › Phase 1 (pre-training): train the BPE tokenizer and ALBERT encoder stack, starting from English ALBERT › Phase 2 (meta-training): train the full model (encoders + Bi-LSTM + FCNN/Softmax heads for start, end, and vuln), starting from the Phase 1 ALBERT › Phase 3 (fine-tuning): train the same full model on the target task
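
Phase 3 is ordinary supervised training; a minimal sketch of that loop (the batch keys and the unweighted sum of the three losses are assumptions):

    import torch

    def finetune(model, loader, epochs=3, lr=2e-5):
        # Fine-tune the meta-trained model on the small set of real bug samples.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        ce = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for batch in loader:
                start_logits, end_logits, vuln_logits = model(
                    batch["input_ids"], batch["attention_mask"])
                loss = (ce(start_logits, batch["start"]) +
                        ce(end_logits, batch["end"]) +
                        ce(vuln_logits, batch["is_vuln"]))
                opt.zero_grad(); loss.backward(); opt.step()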

Slide 21

Experiment target › DOM-based XSS › XSS that happens in the DOM instead of the HTML › The HTML code is intact → runtime investigation is necessary › Source and sink example: document.write("You are visiting: " + document.baseURI); triggered via http://www.example.com/vuln.html#alert('xss')
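
One plausible way to cast such a sample into the QA-style "JSON query" input from the architecture slides; the field names and the question wording are hypothetical:

    # Hypothetical record: the code slice is the context, and the answer
    # span marks where untrusted data reaches the sink.
    context = 'document.write("You are visiting: " + document.baseURI);'
    sample = {
        "question": "Where does untrusted input reach a sink?",
        "context": context,
        "answers": [{"text": "document.baseURI",
                     "answer_start": context.index("document.baseURI")}],
        "is_vulnerable": True,
    }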

Slide 22

Experiment datasets: DOM-based XSS bug finding › Pre-training data › HTML corpus from the web (367M) › Meta-learning data (foreign domain) › The Stanford Question Answering Dataset (SQuAD 2.0) › Generated mini-batch tasks (24 samples per task) › Fine-tuning data (XSS bug samples) › Patch history from public and private Git repos › 29 bug samples (23 for training, 6 for validation)
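
A sketch of slicing SQuAD 2.0 into fixed-size meta-training tasks of 24 samples each (a plausible reading of the slide, not necessarily the authors' exact sampling procedure):

    import random

    def make_tasks(squad_samples, task_size=24, seed=0):
        # Shuffle the pool and cut it into fixed-size tasks; each task acts
        # as one episode for the meta-training updates.
        rng = random.Random(seed)
        pool = list(squad_samples)
        rng.shuffle(pool)
        return [pool[i:i + task_size]
                for i in range(0, len(pool) - task_size + 1, task_size)]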

Slide 25

Experiment › Experiment setup › Baseline: random init for the RNN and FCNN › Meta-learned: meta-trained parameters › Charts: meta-learning curve; fine-tuning curve comparison (EM/F1 score)

Slide 26

Is it promising? › Parameter size matters › In GPT-3's few-shot learning experiments, it achieved SQuAD 2.0 F1 scores of 32.1 (125M params), 55.9 (2.7B), and 69.8 (175B) › Our model has 18M parameters › The F1 score of human performance on SQuAD 2.0 is 89.452 › Even though the task is different, ours got 40.1 › The point of the experiment › Our ingredients actually led to better performance
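
For reference, the token-overlap F1 used in SQuAD-style scoring, sketched from its standard definition (the official evaluation script adds answer-normalization steps omitted here):

    from collections import Counter

    def span_f1(prediction, truth):
        # Token-level F1: harmonic mean of precision and recall over the
        # tokens shared by the predicted and gold answer spans.
        pred, gold = prediction.split(), truth.split()
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(gold)
        return 2 * precision * recall / (precision + recall)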

Slide 27

Where are we? (chart: number of parameters, in billions) › ELMo: 0.094B (2018.4) › GPT: 0.11B (2018.7) › BERT-Large: 0.34B (2018.10) › GPT-2: 1.5B (2019.2) › T-NLG: 17B (2020.1) › GPT-3: 175B (2020.6) › We are here: 0.018B

Slide 28

Conclusion › Meta-learning algorithms can be helpful for small-dataset problems › This is a huge point in the security area › A foreign domain can be used for meta-training › Structural similarity is required › The Transformer model is useful, but it requires lots of data

Slide 29

Future work › Longer-term dependency › Handling nested structures › Problem extension › Polyglot code, different kinds of bugs › Ensemble models › Better performance w/o increasing the number of parameters › Training is very expensive › Leveraging programming languages' grammar and structure

Slide 30

Thank you › Paper: "Cross-domain meta-learning for bug finding in the source codes with a small dataset" › Presented at EICC 2020, France