
Predicting Line-Level Defects by Capturing Code Contexts with Hierarchical Transformers

Parvez Mahbub and M. Masudur Rahman. Predicting Line-Level Defects by Capturing Code Contexts with Hierarchical Transformers. In Proceedings of the 31st IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2024), 12 pages, Rovaniemi, Finland, March 2024.

Masud Rahman

May 03, 2024

Transcript

  1. Predicting Line-Level Defects by Capturing Code Contexts with Hierarchical Transformers

    Parvez Mahbub and Masud Rahman, Faculty of Computer Science, Dalhousie University, Canada
  2. Tales of Software Defects

     • Boeing 737 MAX Crash – defects in the Boeing 737 MAX 8 system; 157 people died.
     • Equifax Data Breach – a vulnerability in the Apache Struts web framework exposed the financial data of ≈143M people.
     • NASA's Mars Orbiter – a unit conversion error in the software; the cost of the mission was $327.6M.
  3. Existing Work on Defect Prediction

     • File/Change-Level Defect Prediction
       – Kamei et al., TSE 2012: Logistic Regression
       – Jiang et al., ASE 2013: AST
       – Wang et al., ICSE 2016: Deep Belief Network
       – Li et al., QRS 2017: CNN
       – Dam et al., TSE 2018: LSTM
     • Line-Level Defect Prediction
       – Wattanakriengkrai et al., TSE 2020: ML & LIME
       – DeepLineDP, TSE 2022: Hierarchical Attention & Bidirectional GRU
  4. Bugsplorer: Proposed Method

     [Architecture diagram, stages A–E] File → Tokenizer → Tokens → Embedding Layer (token embeddings) → Line Encoder (line embeddings) → Line Classifier → Buggy Lines; the whole pipeline is trained with cross-entropy loss and back-propagation, producing the trained model and the test results.
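
    A minimal sketch of the hierarchical two-transformer idea described on this slide, written in PyTorch. The class and parameter names (LineEncoder, LineClassifier, d_model, etc.) are illustrative assumptions, not Bugsplorer's actual implementation.

        import torch
        import torch.nn as nn

        class LineEncoder(nn.Module):
            """Encodes the tokens of one line into a single line embedding."""
            def __init__(self, vocab_size=50265, d_model=256, n_heads=4, n_layers=2):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, d_model)
                layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, n_layers)

            def forward(self, token_ids):                 # (lines, tokens_per_line)
                x = self.encoder(self.embed(token_ids))
                return x.mean(dim=1)                      # pool tokens -> (lines, d_model)

        class LineClassifier(nn.Module):
            """Encodes the sequence of line embeddings and labels each line."""
            def __init__(self, d_model=256, n_heads=4, n_layers=2, n_classes=2):
                super().__init__()
                layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, n_layers)
                self.head = nn.Linear(d_model, n_classes)

            def forward(self, line_embeddings):           # (1, lines, d_model)
                return self.head(self.encoder(line_embeddings))   # per-line logits

        # One training step: cross-entropy over line labels, back-propagated end to end.
        encoder, classifier = LineEncoder(), LineClassifier()
        params = list(encoder.parameters()) + list(classifier.parameters())
        optimizer = torch.optim.AdamW(params, lr=1e-4)
        token_ids = torch.randint(0, 50265, (40, 16))     # dummy file: 40 lines x 16 tokens
        labels = torch.zeros(40, dtype=torch.long)        # 0 = defect-free, 1 = defective
        logits = classifier(encoder(token_ids).unsqueeze(0)).squeeze(0)
        loss = nn.functional.cross_entropy(logits, labels)
        loss.backward()
        optimizer.step()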
  5. Line Encoder and Line Classifier

     [Architecture diagram] Line Encoder: positional embeddings are added to the token embeddings, which pass through N encoder layers (multi-head attention and feed-forward sublayers, each followed by add & normalize) and a pooling layer that produces one line embedding per line. Line Classifier: the line embeddings pass through an encoder stack, a feed-forward layer with dropout, and a softmax that flags the buggy lines.
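
    As a rough illustration of the components named above, here is one encoder layer (multi-head attention and feed-forward, each followed by add & normalize), plus positional embeddings and the pooling step that turns token embeddings into line embeddings. The hyperparameters are arbitrary assumptions; Bugsplorer's actual encoder is a 122M-parameter RoBERTa model (see the next slide).

        import torch
        import torch.nn as nn

        class EncoderLayer(nn.Module):
            def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
                super().__init__()
                self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
                self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

            def forward(self, x):
                attn_out, _ = self.attn(x, x, x)         # multi-head self-attention
                x = self.norm1(x + attn_out)             # add & normalize
                x = self.norm2(x + self.ff(x))           # feed forward, add & normalize
                return x

        tokens = torch.randn(40, 16, 256)                # 40 lines x 16 tokens x d_model
        pos = torch.randn(1, 16, 256)                    # add positional embedding
        x = tokens + pos
        for layer in [EncoderLayer() for _ in range(2)]: # Encoder Layer x N
            x = layer(x)
        line_embeddings = x.mean(dim=1)                  # pooling layer: one embedding per line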
  6. Bugsplorer: Experiment

     Performance Metrics
     • Classification
       – AuROC: ability to differentiate between defective and defect-free code lines.
       – Balanced Accuracy: accuracy adjusted for the imbalanced class distribution.
       – Recall@Top20%LOC: fraction of defective lines found in the top 20% most suspicious lines.
     • Effort (lower is better)
       – Effort@Top20%Recall: effort required to find 20% of the defective lines.
       – Initial False Alarm: fraction of false positives before the first true positive.
     Datasets
     • Python – Defectors: 213K files, 4% defective lines
     • Java – LineDP: 73K files, 0.34% defective lines
     Other Settings
     • Byte Pair Encoding (BPE) tokenizer
     • RoBERTa encoder with 122M parameters
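
    A hedged sketch of how the effort-aware metrics above can be computed from per-line defect scores; the exact definitions in the paper may differ in detail (for instance, Initial False Alarm is computed here as a count of clean lines ranked before the first defective one).

        import numpy as np

        def recall_at_top20_loc(scores, labels):
            """Fraction of all defective lines that fall in the top 20% highest-scored lines."""
            order = np.argsort(-scores)
            top = order[: int(np.ceil(0.2 * len(scores)))]
            return labels[top].sum() / max(labels.sum(), 1)

        def effort_at_top20_recall(scores, labels):
            """Fraction of lines that must be inspected, in score order, to find 20% of the defects."""
            order = np.argsort(-scores)
            found = np.cumsum(labels[order])
            return (np.argmax(found >= 0.2 * labels.sum()) + 1) / len(scores)

        def initial_false_alarm(scores, labels):
            """Clean lines inspected before the first defective line is reached."""
            ranked = labels[np.argsort(-scores)]
            return int(np.argmax(ranked == 1)) if ranked.any() else len(ranked)

        scores = np.array([0.9, 0.2, 0.7, 0.1, 0.4, 0.8])   # toy per-line suspiciousness
        labels = np.array([1,   0,   0,   0,   1,   0  ])   # toy ground truth
        print(recall_at_top20_loc(scores, labels),
              effort_at_top20_recall(scores, labels),
              initial_false_alarm(scores, labels))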
  7. Take-Home Messages

     • Only ~3% of code lines are defective – hence line-level defect prediction.
     • Capturing line-level code context is challenging!
     • Bugsplorer captures code context using hierarchical transformers.
     • It outperforms the baseline and shows promising results.
  8. Take-Home Messages

     • Code Structure
       – Embedding structural information
       – Training with examples from the official API documentation
     • Model Architecture
       – Experimenting with variable-length architectures such as Transformer-XL
       – Using decoders (e.g., GPT) instead of encoders