Slide 1

Slide 1 text

AIBugHunter 2.0: Automated Defect Prediction, Explanation, Localization, and Repair
Dr. Chakkrit (Kla) Tantithamthavorn
ARC DECRA Fellow, Senior Lecturer, SE Group Lead, Director of Eng&Imp (2022-onwards)
Monash University, Melbourne, Australia
[email protected] | @klainfo | http://chakkrit.com

Slide 2

Slide 2 text

Software bugs cost $2.84 trillion globally. A failure to eliminate defects in safety-critical systems could result in serious injury, threats to life, death, and disasters.
$59.5 billion annually for the US; $29 billion annually for Australia.
https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2018-report/The-Cost-of-Poor-Quality-Software-in-the-US-2018-Report.pdf
https://news.microsoft.com/en-au/features/direct-costs-associated-with-cybersecurity-incidents-costs-australian-businesses-29-billion-per-annum/

Slide 3

Slide 3 text

Software evolves extremely fast.
50% of Google's code base changes every month. Windows 8 involved 100K+ code changes.
Software is written in multiple languages, by many people, over a long period of time, in order to fix bugs, add new features, and improve code quality every day.
And software is released faster, at massive scale (e.g., every 6 weeks, every 6 months).

Slide 4

Slide 4 text

How to find bugs?
• Use unit testing to test functional correctness, but manually testing all files is time-consuming
• Use static analysis tools to check code quality
• Use code review to find bugs and check code quality
• Use CI/CD to automatically build, test, and merge with confidence
• Others: UI testing, fuzzing, load/performance testing, etc.

Slide 5

Slide 5 text

QA activities take too much time (~50% of a project)
Google's scale:
• Large and complex code base: 1 billion lines of code
• 10K+ developers in 40+ office locations
• 5K+ projects under active development
• 17K code reviews per day
• 100 million test cases run per day
Google's rules: all changes must be reviewed.*
Within 6 months, 1K developers performed 80K+ code reviews (~77 reviews per person) for 30K+ code changes in one release.
Given limited time, how can we efficiently and effectively perform QA activities on the most risky program elements?
https://www.codegrip.tech/productivity/what-is-googles-internal-code-review-process/
https://eclipsecon.org/2013/sites/eclipsecon.org.2013/files/2013-03-24%20Continuous%20Integration%20at%20Google%20Scale.pdf

Slide 6

Slide 6 text

AIBugHunter 2.0: a developer submits Pull Requests (PRs), and the framework provides:
• Automated PR Prioritization: Which PRs are the most risky?
• Automated Defect Prediction: Which files are the most risky?
• Automated Defect Repairs: How should a file be fixed?
• Automated Defect Localization: Where should a file be fixed?
• Automated Explanation Generation: Why is this file predicted as risky?
• Automated QA Planning: How should we improve in the future?

Slide 7

Slide 7 text

Defect Prediction Models: An Overview
An AI/ML model to predict if a file will be defective in the future. Along the project timeline, release V1 provides the training data and release V2 the testing data for the defect models.
• Predictions: help developers find defects faster
• Important factors: help managers develop quality improvement plans
• Model-agnostic techniques (e.g., LIME): help developers better understand why a file is predicted as defective
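
A minimal sketch of this setup, assuming file-level metrics have already been collected for two releases. The CSV layout, column names, and the random-forest classifier are illustrative assumptions, not the exact models used in the studies cited here:

```python
# Minimal sketch: train a file-level defect prediction model on release V1
# and predict which files in release V2 are most likely to be defective.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("release_v1_metrics.csv")   # metrics + "defective" label for V1 (assumed file)
test = pd.read_csv("release_v2_metrics.csv")    # metrics for V2, the newer release (assumed file)

features = [c for c in train.columns if c not in ("file", "defective")]

model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(train[features], train["defective"])

# Risk score per file: predicted probability of being defective in the future.
test["risk"] = model.predict_proba(test[features])[:, 1]
print(test.sort_values("risk", ascending=False)[["file", "risk"]].head(10))
```

A model-agnostic explainer such as LIME can then be applied to the fitted model to explain individual predictions (see the JITLine sketch later in the deck).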

Slide 8

Slide 8 text

MINING SOFTWARE DEFECTS
STEP 1: EXTRACT DATA
Raw data comes from two sources:
• Issue Tracking System (ITS): issue reports
• Version Control System (VCS): commit log, code changes, code snapshot
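
One way to automate the VCS side of Step 1 is sketched below with PyDriller 2.x; the repository path and the fields kept per change are assumptions, and issue reports would be pulled separately from the ITS (e.g., via its REST API):

```python
# Sketch of STEP 1: walk the version control history and dump the commit log
# and per-commit code changes. Uses PyDriller 2.x; the repo path is an assumption.
from pydriller import Repository

rows = []
for commit in Repository("lucene-solr").traverse_commits():
    for mod in commit.modified_files:
        rows.append({
            "hash": commit.hash,
            "author": commit.author.name,
            "date": commit.author_date,
            "msg": commit.msg,
            "file": mod.new_path or mod.old_path,
            "added": mod.added_lines,
            "deleted": mod.deleted_lines,
        })
print(f"Extracted {len(rows)} file-level change records")
```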

Slide 9

Slide 9 text

MINING SOFTWARE DEFECTS
STEP 2: COLLECT METRICS
From the code snapshot in the VCS: what are the source files in this release?
Reference: https://github.com/apache/lucene-solr/tree/662f8dd3423b3d56e9e1a197fe816393a33155e2
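
A small sketch of answering that question with plain git, listing the Java files at the release snapshot referenced above (the local repository directory name is an assumption):

```python
# Sketch: list Java source files at a given release snapshot (commit or tag).
import subprocess

release_commit = "662f8dd3423b3d56e9e1a197fe816393a33155e2"  # snapshot referenced above
out = subprocess.run(
    ["git", "-C", "lucene-solr", "ls-tree", "-r", "--name-only", release_commit],
    capture_output=True, text=True, check=True,
).stdout
java_files = [p for p in out.splitlines() if p.endswith(".java")]
print(len(java_files), "source files in this release")
```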

Slide 10

Slide 10 text

MINING SOFTWARE DEFECTS
STEP 2: COLLECT METRICS
From the code changes and commit log in the VCS: how many lines are added or deleted? Who edited this file?
Reference: https://github.com/apache/lucene-solr/commit/662f8dd3423b3d56e9e1a197fe816393a33155e2
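
Continuing the Step 1 sketch, a rough way to aggregate both questions per file (the `rows` list and its field names come from that earlier sketch and are assumptions):

```python
# Sketch: per-file churn (lines added/deleted) and the set of developers
# who edited each file, aggregated from the change records extracted in STEP 1.
from collections import defaultdict

churn = defaultdict(lambda: {"added": 0, "deleted": 0, "authors": set()})
for r in rows:                      # `rows` from the STEP 1 sketch
    stats = churn[r["file"]]
    stats["added"] += r["added"]
    stats["deleted"] += r["deleted"]
    stats["authors"].add(r["author"])

for path, s in list(churn.items())[:5]:
    print(path, s["added"], s["deleted"], len(s["authors"]), "developers")
```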

Slide 11

Slide 11 text

MINING SOFTWARE DEFECTS
STEP 2: COLLECT METRICS
• CODE METRICS: size, code complexity, cognitive complexity, OO design (e.g., coupling, cohesion)
• PROCESS METRICS: development practices (e.g., #commits, #dev, churn, #pre-release defects, change complexity)
• HUMAN FACTORS: code ownership, #MajorDevelopers, #MinorDevelopers, author ownership, developer experience
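
As a concrete example of the human factors, a sketch of code-ownership metrics: each developer's share of a file's commits, with a 5% cut-off separating major from minor contributors. The cut-off, the input format (the `rows` records from the Step 1 sketch), and the exact definitions are assumptions, not the precise metrics used in any particular study:

```python
# Sketch: code ownership metrics per file. A developer contributing more than
# 5% of a file's commits is counted as a "major" contributor, otherwise "minor";
# the 5% cut-off is a commonly used heuristic, not a fixed standard.
from collections import Counter, defaultdict

commits_per_file = defaultdict(Counter)
for r in rows:                                   # change records from the STEP 1 sketch
    commits_per_file[r["file"]][r["author"]] += 1

ownership = {}
for path, counts in commits_per_file.items():
    total = sum(counts.values())
    shares = {dev: n / total for dev, n in counts.items()}
    ownership[path] = {
        "author_ownership": max(shares.values()),            # top contributor's share
        "major_devs": sum(s > 0.05 for s in shares.values()),
        "minor_devs": sum(s <= 0.05 for s in shares.values()),
    }
```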

Slide 12

Slide 12 text

MINING SOFTWARE DEFECTS
STEP 3: IDENTIFY DEFECTS
From the issue reports in the ITS: Is this report a bug or a new feature? Which releases are affected? Which commits belong to this issue report (via the issue reference ID)? Was this report created after the release of interest?
Reference: https://issues.apache.org/jira/browse/LUCENE-4128
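
A sketch of the linking logic: find issue keys (e.g., LUCENE-4128) in commit messages, then keep only bug reports that affect the studied release and were created after it. The regex, field names, and the shape of the ITS export are assumptions:

```python
# Sketch of STEP 3: link commits to issue reports via issue IDs mentioned in
# commit messages, keeping only bug reports that affect the studied release
# and were created after it (i.e., post-release defects).
import re

ISSUE_ID = re.compile(r"LUCENE-\d+")             # project-specific issue-key pattern

def defect_fixing_commits(commits, issues, release, release_date):
    """commits: change records from STEP 1; issues: dict keyed by issue ID with
    'type', 'affected_versions', 'created' fields (assumed ITS export format)."""
    fixing = []
    for c in commits:
        for issue_id in ISSUE_ID.findall(c["msg"]):
            issue = issues.get(issue_id)
            if (issue and issue["type"] == "Bug"
                    and release in issue["affected_versions"]
                    and issue["created"] > release_date):
                fixing.append((c["hash"], issue_id, c["file"]))
    return fixing
```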

Slide 13

Slide 13 text

MINING SOFTWARE DEFECTS
STEP 3: IDENTIFY DEFECTS
Link issue reports (ITS) to commits (VCS): which files were changed to fix the defect? Files changed by these defect-fixing commits are labelled as defective in the resulting defect dataset.
Check the "Mining Software Defects" paper [Yatish et al., ICSE 2019].
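
Putting the steps together, a simplified sketch of the labelling (a rough approximation of the realistic, affected-release labelling described in Yatish et al., ICSE 2019); all input structures are assumptions carried over from the earlier sketches:

```python
# Sketch: build the defect dataset by marking files that were changed to fix
# post-release defects as defective; all other files in the release are clean.
def build_defect_dataset(java_files, fixing_links, metrics):
    """java_files: files in the release snapshot; fixing_links: (hash, issue, file)
    tuples from STEP 3; metrics: per-file metric dict from STEP 2 (all assumed)."""
    defective_files = {f for _, _, f in fixing_links}
    dataset = []
    for path in java_files:
        row = dict(metrics.get(path, {}))
        row["file"] = path
        row["defective"] = path in defective_files
        dataset.append(row)
    return dataset
```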

Slide 14

Slide 14 text

Mockus et al., BLTJ’00 Ostrand et al., TSE’05 Kim et al., FSE’15 Caglayan et al., ICSE’15 Tan et al., ICSE’15 Shimagaki et al., ICSE’16 Lewis et al., ICSE’13

Slide 15

Slide 15 text

AIBugHunter 2.0: a developer submits Pull Requests (PRs), and the framework provides:
• Automated PR Prioritization: Which PRs are the most risky?
• Automated Defect Prediction: Which files are the most risky?
• Automated Defect Localization: Where should a file be fixed?
• Automated Defect Repairs: How should a file be fixed?
• Automated Explanation Generation: Why is this file predicted as risky?
• Automated QA Planning: How should we improve in the future?
Associated work: SQAPlanner (TSE'20), PyExplainer (ASE'21), Actionable Analytics (IEEE Software'21), JITBot (ASE'20), LineDP (TSE'20), JITLine (MSR'21), DeepLineDP (under review), AutoTransform (under review), LIME-HPO (TSE'20), Practitioners' Perceptions (MSR'21).

Slide 16

Slide 16 text

Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models
Jirayus Jiarpakdee, "Kla" Chakkrit Tantithamthavorn, John Grundy
Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, John Grundy, Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, International Conference on Mining Software Repositories (MSR'21).
@klainfo http://chakkrit.com

Slide 17

Slide 17 text

40 Years of Defect Prediction Studies
Researchers' assumptions:
• Help developers effectively prioritize the limited SQA resources
• Help managers develop the most effective improvement plans
• Help researchers develop empirically grounded theories of software defects
• etc.
Researchers' assumptions vs. practitioners' perceptions?

Slide 18

Slide 18 text

Identifying the Goals of Defect Prediction Studies
Venues: TSE, ICSE, EMSE, FSE, MSR, 2015-2020 (as of 11 Jan 2021); keywords: defect, fault, bug, predict, quality. This yields 2,890 studies.
Remove duplicate, short, irrelevant, journal-first, abstract-only, and secondary studies, leaving 96 primary studies.
Manually read the primary studies to investigate the common goals and the techniques used for defect prediction models.

Slide 19

Slide 19 text

Identifying the Practitioners' Perceptions
Survey design, participant recruitment (MTurk), data verification, and statistical analysis.
Part 1: Demographics: role, experience, country of residence, programming language, team size, use of static analysis (yes/no).
Part 2: Perceptions of the goals of defect prediction models: perceived usefulness of each goal, willingness to adopt each goal, and an open question: why?
Part 3: Perceptions of the visual explanations: PSSUQ usability framework (information usefulness, quality, insightfulness, overall preference) and an open question: why?

Slide 20

Slide 20 text

Researchers' focuses (SLR) vs. practitioners' needs (survey)
Goals of developing defect prediction models in the 96 primary studies: 91% focused on predictions (Goal 1), 41% on global explanations (Goal 2), and as few as 4% on local explanations (Goal 3). Techniques range from ANOVA, variable importance, and partial dependence plots (PDP) to decision rules/trees (e.g., If {LOC > 100} then {BUG}). In short, most studies focus on improving accuracy, not explainability.
Practitioners' perceived usefulness: (Goal 1) prioritizing the limited SQA resources on the most risky files, (Goal 2) understanding the characteristics that are associated with software defects in the past, and (Goal 3) understanding the most important characteristics that contributed to a prediction of a file were rated useful or extremely useful by 84%, 82%, and 82% of respondents, respectively.
The three goals are similarly useful to practitioners. But is the SE community moving in the right direction?
Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, John Grundy, Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, International Conference on Mining Software Repositories (MSR'21).

Slide 21

Slide 21 text

Open Questions
All three goals are well perceived as useful by practitioners. Sadly, researchers focus mostly on improving accuracy, not explainability!
We therefore call for a new research topic, Explainable AI for Software Engineering (XAI4SE), and ask: "How can we make software analytics more explainable and actionable?"
See: LIME-HPO (TSE'20), SQAPlanner (TSE'20), PyExplainer (ASE'21), Actionable Analytics (SW'21)

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction
Chanathip Pornprasit, "Kla" Chakkrit Tantithamthavorn
Chanathip Pornprasit, Chakkrit Tantithamthavorn, JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction, International Conference on Mining Software Repositories (MSR'21).
@klainfo http://chakkrit.com

Slide 25

Slide 25 text

Critical Challenges in Modern SQA Practices
Large-scale software projects receive a large number of newly arriving commits every day. Those commits need to be reviewed prior to being integrated into the release branch (a Google SE practice). With limited time and resources, it is infeasible for practitioners to exhaustively review all commits.

Slide 26

Slide 26 text

Just-In-Time Defect Prediction Models
A classifier to predict if a commit will introduce software defects in the future [Kamei et al. 2013].
Commit history (defect-inducing and defect-fixing commits) forms the training data; each new commit is the testing data along the timeline. JIT models produce:
• A risk score, to prioritize SQA resources on the most risky commits
• Important factors, to provide immediate feedback for SQA planning
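
A minimal sketch of such a commit-level classifier in the spirit of Kamei et al. (2013), using a small subset of their change metrics; the CSV layout, metric subset, and logistic-regression choice are illustrative assumptions:

```python
# Sketch: just-in-time (commit-level) defect prediction. Each row is one commit
# with change metrics (lines added/deleted, #files, #developers, author
# experience) and a label: did the commit introduce a defect?
import pandas as pd
from sklearn.linear_model import LogisticRegression

history = pd.read_csv("commit_history.csv")      # labelled past commits (assumed file)
features = ["la", "ld", "nf", "ndev", "exp"]     # subset of Kamei et al.'s change metrics

jit_model = LogisticRegression(max_iter=1000)
jit_model.fit(history[features], history["defect_inducing"])

new_commits = pd.read_csv("new_commits.csv")     # commits awaiting review (assumed file)
new_commits["risk"] = jit_model.predict_proba(new_commits[features])[:, 1]
print(new_commits.sort_values("risk", ascending=False)[["hash", "risk"]].head())
```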

Slide 27

Slide 27 text

Each commit is large and complex: commit size varies from 100 to 1,000 lines of code, and on average 50% of the changed lines in a commit are risky. Consequences:
• Predictions are not understandable by practitioners
• Lack of adoption of JIT models in practice
• Poor SQA resource prioritization
Practitioners still do not know: "Which lines are most risky?"

Slide 28

Slide 28 text

JITLine: Line-Level JIT Defect Prediction using XAI
How can we accurately identify which lines are most risky for a given commit?
• Step 1: Extract bag-of-tokens features from the training data
• Step 2: Handle class imbalance with DE+SMOTE
• Step 3: Build a commit-level JIT model
• Step 4: For each commit in the testing data, explain the prediction to find defective tokens/lines
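
A condensed sketch of those four steps, assuming the changed code of each commit is available as one string per commit. Hyperparameters are fixed here rather than tuned with differential evolution as in the paper, and the final mapping from token scores back to changed lines is omitted:

```python
# Sketch of a JITLine-style pipeline: bag-of-tokens features, SMOTE for class
# imbalance, a commit-level random forest, and a LIME explanation that ranks
# the tokens contributing to a "defect-introducing" prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from lime.lime_tabular import LimeTabularExplainer

# train_changes / test_changes: one string of changed code per commit;
# train_labels: 1 if the commit introduced a defect (assumed inputs).
vec = CountVectorizer(token_pattern=r"\S+")                     # Step 1: bag of tokens
X_train = vec.fit_transform(train_changes).toarray()
X_test = vec.transform(test_changes).toarray()

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, train_labels)   # Step 2

clf = RandomForestClassifier(n_estimators=300, random_state=0)             # Step 3
clf.fit(X_bal, y_bal)

explainer = LimeTabularExplainer(                                          # Step 4
    X_bal,
    feature_names=list(vec.get_feature_names_out()),
    class_names=["clean", "defect"],
)
exp = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=20)
print("Token contributions toward 'defect' for the first test commit:")
for feature, weight in exp.as_list():
    print(f"  {feature:30s} {weight:+.3f}")
```

JITLine then aggregates the positive token weights over each changed line to rank the most risky lines of the commit; that line-mapping step is left out of this sketch.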

Slide 29

Slide 29 text

Experimental Setup
Datasets (McIntosh & Kamei [TSE'17]):
• OpenStack: 12K commits, 32K tokens, 13% defective commits, average commit size 73 LOC, average 53% defective lines
• Qt: 25K commits, 81K tokens, 8% defective commits, average commit size 140 LOC, average 51% defective lines
Baseline comparison:
• Effort-Aware JIT Defect Models (Kamei et al., TSE'13)
• DeepJIT: CNN-based models (Hoang et al., MSR'19)
• CC2Vec: HAN-based models (Hoang et al., ICSE'20)

Slide 30

Slide 30 text

Four Research Questions & Results
• (RQ1) Does our JITLine outperform the state-of-the-art JIT defect prediction approaches? (Measures: AUC, F1, FAR, D2H) JITLine is 26%-38% more accurate (F-measure) with a 94%-97% lower False Alarm Rate (FAR).
• (RQ2) Is our JITLine more cost-effective than the state-of-the-art JIT defect prediction approaches? (Measures: Effort@20%Recall, PCI@20%LOC, Popt) JITLine saves 89%-96% of the effort to find the same number of defect-introducing commits.
• (RQ3) Is our JITLine faster than the state-of-the-art JIT defect prediction approaches? (Measure: model training time) JITLine takes 1-3 minutes on a CPU, which is 70-100 times faster than CC2Vec.
• (RQ4) How effective is our JITLine for prioritizing defective lines of a given defect-introducing commit? (Measures: Top-10, Recall@20%Effort, Effort@20%Recall) JITLine is 133%-150% more accurate in identifying defective lines than the n-gram approach by Yan et al.
Take-away: "JITLine may help practitioners to better prioritize defect-introducing commits and better identify defective lines."
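
For reference, a sketch of one effort-aware measure used in RQ2/RQ4, Recall@20%Effort, here simplified to counting lines: the fraction of truly defective lines found after inspecting only the top-ranked 20% of lines (the input format is an assumption):

```python
# Sketch: Recall@20%Effort. Given lines ranked by predicted risk, how many of
# the truly defective lines are found after inspecting 20% of all lines?
def recall_at_effort(ranked_lines, effort=0.20):
    """ranked_lines: list of booleans (True = defective), ordered from the
    highest-risk line to the lowest-risk line (assumed input format)."""
    budget = int(len(ranked_lines) * effort)
    found = sum(ranked_lines[:budget])
    total_defective = sum(ranked_lines)
    return found / total_defective if total_defective else 0.0

# Example: 10 lines, 2 defective, one of them caught within the 20% budget -> 0.5
print(recall_at_effort([True, False, False, False, False, False, True, False, False, False]))
```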

Slide 31

Slide 31 text

AIBugHunter 2.0 → AIBugHunter 4.0
• Automated PR Prioritization: Which PRs are the most risky?
• Automated Defect Prediction: Which files are the most risky?
• Automated Defect Localization: Where should a file be fixed?
• Automated Defect Repairs: How should a file be fixed?
• Automated Explanation Generation: Why is this file predicted as risky?
• Automated QA Planning: How should we improve in the future?