
AIBugHunter 2.0: Automated Defect Prediction, Explanation, Localization, and Repair

Our society is now driven by software. However, software defects and technology glitches are extremely costly and can result in serious injuries and even deaths (e.g., the massive radiotherapy overdoses delivered by the Therac-25 and the explosion of the Ariane 5 rocket). Yet, current software quality assurance (SQA) practices (e.g., modern code review) are still time-consuming and expensive. Imagine you are a developer working on a software project with millions of lines of code: reviewing every single line of code to ensure the software is of high quality is infeasible given the limited SQA resources. Funded by an Australian Research Council DECRA award (2020-2023), I am leading the AIBugHunter project to develop next-generation AI technologies that help developers (1) predict whether a file will be defective in the future; (2) explain why it is predicted as defective; (3) locate which lines of code are problematic and where to fix them; and (4) suggest possible repairs. In this talk, I will briefly present the problem motivations, the technologies, and the potential benefits: enabling developers to find software defects faster and enabling managers to develop better software quality improvement plans to prevent defects in the future.

Bio: Dr. Chakkrit (Kla) Tantithamthavorn is the Monash Software Engineering Group Lead and a Senior Lecturer in Software Engineering in the Faculty of Information Technology, Monash University, Australia. He is also affiliated with the Monash Data Futures Institute and Digital Health initiatives. His research focuses on developing AI-enabled software development techniques (e.g., AI for software defects, AI for code review, and AI for Agile) and tools (e.g., AIBugHunter, JITBot) that help developers find defects faster, improve developers' productivity, make better data-informed decisions, and improve the quality of software systems. His work has been recognized by prestigious awards, e.g., the Australian Research Council (ARC) DECRA Award (2020-2023) and the Japan Society for the Promotion of Science fellowship (JSPS-DC2). Recently, he pioneered a new research direction of Explainable AI for Software Engineering, i.e., making software analytics more practical, explainable, and actionable. His research has been published at flagship software engineering venues such as TSE, ICSE, EMSE, MSR, ICSME, and IST.

Transcript

  1. AIBugHunter 2.0: Automated Defect Prediction, 
 Explanation, Localization, and Repair

    Dr. Chakkrit (Kla) Tantithamthavorn ARC DECRA Fellow, Senior Lecturer, SE Group Lead, Director of Eng&Imp (2022-onwards) Monash University, Melbourne, Australia. [email protected] @klainfo http://chakkrit.com
  2. Software bugs cost $2.84 trillion globally. A failure to eliminate defects in safety-critical systems could result in serious injury to people, threats to life, death, and disasters. Estimated costs: $59.5 billion annually for the US; $29 billion annually for Australia.
     https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2018-report/The-Cost-of-Poor-Quality-Software-in-the-US-2018-Report.pdf
     https://news.microsoft.com/en-au/features/direct-costs-associated-with-cybersecurity-incidents-costs-australian-businesses-29-billion-per-annum/
  3. Software evolves extremely fast: 50% of Google's code base changes every month, and Windows 8 involved 100K+ code changes. Software is written in multiple languages, by many people, over a long period of time, in order to fix bugs, add new features, and improve code quality every day. And software is released faster and at a massive scale (e.g., every 6 weeks or every 6 months).
  4. How to find bugs? Use unit testing to test functional correctness, but manually testing all files is time-consuming. Use static analysis tools to check code quality. Use code review to find bugs and check code quality. Use CI/CD to automatically build, test, and merge with confidence. Others: UI testing, fuzzing, load/performance testing, etc.
  5. QA activities take too much time (~50% of a project).
     • Large and complex code base: 1 billion lines of code
     • >10K developers in 40+ office locations
     • 5K+ projects under active development
     • 17K code reviews per day
     • 100 million test cases run per day
     Given limited time, how can we efficiently and effectively perform QA activities on the most risky program elements?
     Google's rule: all changes must be reviewed.*
     https://www.codegrip.tech/productivity/what-is-googles-internal-code-review-process/
     https://eclipsecon.org/2013/sites/eclipsecon.org.2013/files/2013-03-24%20Continuous%20Integration%20at%20Google%20Scale.pdf
     Within 6 months, 1K developers perform 80K+ code reviews (~77 reviews per person) for 30K+ code changes in one release.
  6. AIBugHunter 2.0. A developer submits pull requests (PRs), and AIBugHunter 2.0 supports:
     • Automated PR Prioritization: Which PRs are the most risky?
     • Automated Defect Prediction: Which files are the most risky?
     • Automated Defect Repairs: How should a file be fixed?
     • Automated Defect Localization: Where should a file be fixed?
     • Automated Explanation Generation: Why is this file predicted as risky?
     • Automated QA Planning: How should we improve in the future?
  7. Defect Prediction Models: An Overview. An AI/ML model to predict whether a file will be defective in the future. Along the timeline, metrics from an earlier release (V1) form the training data for the defect models, and a later release (V2) forms the testing data. The predictions help developers find defects faster; the important factors help managers develop quality improvement plans; and model-agnostic techniques (e.g., LIME) help developers better understand why a file is predicted as defective.
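To make the overview concrete, here is a minimal sketch (not the exact AIBugHunter pipeline) of training a file-level defect model on metrics mined from release V1, predicting on release V2, and producing a LIME explanation for one file; the CSV file names and column names are hypothetical placeholders.

```python
# Minimal sketch of a file-level defect prediction model with a LIME explanation.
# Assumes two hypothetical CSVs with one row per file: software metrics plus a
# binary "defective" label mined from the ITS/VCS (see the mining steps below).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

train = pd.read_csv("release_v1_metrics.csv")   # training data: release V1
test = pd.read_csv("release_v2_metrics.csv")    # testing data:  release V2
features = [c for c in train.columns if c not in ("file", "defective")]

# Train a classifier to predict whether a file will be defective.
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(train[features], train["defective"])

# Global view: which factors matter most overall (for quality improvement plans).
importance = sorted(zip(features, model.feature_importances_), key=lambda x: -x[1])
print("Most important factors:", importance[:5])

# Local view: why one particular file is predicted as defective (LIME).
explainer = LimeTabularExplainer(
    train[features].values, feature_names=features,
    class_names=["clean", "defective"], mode="classification")
explanation = explainer.explain_instance(
    test[features].iloc[0].values, model.predict_proba, num_features=5)
print(explanation.as_list())   # e.g., [("LOC > 500", 0.21), ...]
```

The published work in this deck (e.g., LIME-HPO, PyExplainer) refines this local-explanation step; the snippet above only illustrates the overall shape of the pipeline.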
  8. MINING SOFTWARE DEFECTS — STEP 1: EXTRACT DATA. Raw data is extracted from two sources: the Issue Tracking System (ITS), which provides issue reports, and the Version Control System (VCS), which provides code changes, code snapshots, and the commit log.
  9. MINING SOFTWARE DEFECTS — STEP 2: COLLECT METRICS. From the code snapshot in the VCS: what are the source files in this release?
     Reference: https://github.com/apache/lucene-solr/tree/662f8dd3423b3d56e9e1a197fe816393a33155e2
  10. MINING SOFTWARE DEFECTS — STEP 2: COLLECT METRICS (continued). From the code changes and commit log in the VCS: how many lines are added or deleted? Who edited this file?
     Reference: https://github.com/apache/lucene-solr/commit/662f8dd3423b3d56e9e1a197fe816393a33155e2
  11. MINING SOFTWARE DEFECTS — STEP 2: COLLECT METRICS. Three families of metrics are collected for each file:
     • CODE METRICS: size, code complexity, cognitive complexity, OO design (e.g., coupling, cohesion)
     • PROCESS METRICS: development practices (e.g., #commits, #developers, churn, #pre-release defects, change complexity)
     • HUMAN FACTORS: code ownership, #major developers, #minor developers, author ownership, developer experience
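As a hedged illustration of Step 2, the sketch below derives a few of the process metrics listed above (#commits, #developers, churn per file) directly from a repository's commit log; the repository path and release range are hypothetical, and real mining pipelines add handling for renames, merges, and noise.

```python
# Sketch: compute simple process metrics (#commits, #developers, churn) per file
# from a git commit log. The repository path and release range are hypothetical;
# production mining would also handle renames, merge commits, large files, etc.
import subprocess
from collections import defaultdict

REPO = "/path/to/repo"                 # hypothetical local checkout
RANGE = "release-1.0..release-2.0"     # hypothetical release range

log = subprocess.run(
    ["git", "-C", REPO, "log", RANGE, "--numstat", "--format=@%H %ae"],
    capture_output=True, text=True, check=True).stdout

metrics = defaultdict(lambda: {"commits": 0, "devs": set(), "churn": 0})
author = None
for line in log.splitlines():
    if line.startswith("@"):           # commit header: "@<hash> <author email>"
        author = line.split()[1]
    elif line.strip():                 # numstat line: "<added>\t<deleted>\t<path>"
        added, deleted, path = line.split("\t")
        entry = metrics[path]
        entry["commits"] += 1
        entry["devs"].add(author)
        if added != "-":               # "-" marks binary files in numstat output
            entry["churn"] += int(added) + int(deleted)

for path, entry in sorted(metrics.items())[:5]:
    print(path, entry["commits"], len(entry["devs"]), entry["churn"])
```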
  12. MINING SOFTWARE DEFECTS — STEP 3: IDENTIFY DEFECTS. From each issue report in the ITS: the issue reference ID, whether it is a bug or a new feature, which releases are affected, which commits belong to this issue report, and whether the report was created after the release of interest.
     Reference: https://issues.apache.org/jira/browse/LUCENE-4128
  13. MINING SOFTWARE DEFECTS — STEP 3: IDENTIFY DEFECTS (continued). Issue reports are linked to the commits that belong to them to determine which files were changed to fix each defect; those files are labelled as defective in the defect dataset. Check the "Mining Software Defects" paper [Yatish et al., ICSE 2019].
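Step 3 hinges on linking issue reports to the commits that fix them. Below is a minimal sketch of that linking, assuming JIRA-style issue keys (e.g., LUCENE-4128) appear in commit messages; the inputs are hypothetical toy values, and real pipelines such as Yatish et al. (ICSE 2019) add many more checks (e.g., affected-version and post-release validation).

```python
# Sketch: label files as defective by linking bug reports to defect-fixing
# commits via issue keys (e.g., "LUCENE-4128") mentioned in commit messages.
# The bug-report keys and the commit list below are hypothetical toy inputs.
import re

ISSUE_KEY = re.compile(r"\bLUCENE-\d+\b")          # JIRA-style issue key pattern

fixed_bug_keys = {"LUCENE-4128", "LUCENE-4201"}    # bug reports affecting the release
commits = [                                        # (message, changed files) pairs
    ("LUCENE-4128: fix offset overflow in tokenizer", ["analysis/Tokenizer.java"]),
    ("LUCENE-4500: add a new query feature",          ["search/Query.java"]),
]

defective_files = set()
for message, files in commits:
    if set(ISSUE_KEY.findall(message)) & fixed_bug_keys:   # commit fixes a known bug
        defective_files.update(files)                      # files changed to fix it
print(defective_files)   # {'analysis/Tokenizer.java'}
```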
  14. Mockus et al. (BLTJ'00), Ostrand et al. (TSE'05), Kim et al. (FSE'15), Caglayan et al. (ICSE'15), Tan et al. (ICSE'15), Shimagaki et al. (ICSE'16), Lewis et al. (ICSE'13).
  15. AIBugHunter 2.0. A developer submits pull requests (PRs), and AIBugHunter 2.0 supports:
     • Automated PR Prioritization: Which PRs are the most risky?
     • Automated Defect Prediction: Which files are the most risky?
     • Automated Defect Localization: Where should a file be fixed?
     • Automated Defect Repairs: How should a file be fixed?
     • Automated Explanation Generation: Why is this file predicted as risky?
     • Automated QA Planning: How should we improve in the future?
     Related publications: SQAPlanner (TSE'20), PyExplainer (ASE'21), Actionable Analytics (IEEE Software'21), JITBot (ASE'20), LineDP (TSE'20), JITLine (MSR'21), DeepLineDP (under review), AutoTransform (under review), LIME-HPO (TSE'20), Practitioners' Perceptions (MSR'21).
  16. Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models. Jirayus Jiarpakdee, Chakkrit (Kla) Tantithamthavorn, John Grundy. International Conference on Mining Software Repositories (MSR'21). @klainfo http://chakkrit.com
  17. 40 Years of Defect Prediction Studies: researchers' assumptions vs. practitioners' perceptions. Researchers assume that defect prediction models help developers effectively prioritize the limited SQA resources, help managers develop the most effective improvement plans, help researchers develop empirically grounded theories of software defects, etc. How do practitioners actually perceive these goals?
  18. Identifying the Goals of Defect Prediction Studies. We searched TSE, ICSE, EMSE, FSE, and MSR (2015-2020, as of 11 Jan 2021) with the keywords defect, fault, bug, predict, and quality, yielding 2,890 studies. After removing duplicate, short, irrelevant, journal-first, abstract-only, and secondary studies, we manually read the remaining 96 primary studies to investigate the common goals of, and techniques used for, defect prediction models.
  19. Identifying the Practitioners' Perceptions. Survey design, participant recruitment (MTurk), data verification, and statistical analysis.
     Part 1 — Demographics: role, experience, country of residence, programming language, team size, use of static analysis (yes/no).
     Part 2 — Perceptions of the goals of defect prediction models: perceived usefulness of each goal, willingness to adopt each goal, and an open question: why?
     Part 3 — Perceptions of the visual explanations: PSSUQ usability framework (information usefulness, quality, insightfulness, overall preference) and an open question: why?
  20. Researchers' focuses (SLR) vs. practitioners' needs (survey). Of the 96 primary studies, 91% focused on Goal 1 (prediction), 41% on Goal 2 (global explanation), and as few as 4% on Goal 3 (local explanation), i.e., on increasing explainability. Yet practitioners perceive all three goals as similarly useful (82%-84% rated each as useful or extremely useful): Goal 1, prioritizing the limited SQA resources on the most risky files; Goal 2, understanding the characteristics that are associated with software defects in the past; and Goal 3, understanding the most important characteristics that contributed to the prediction of a file. Is the SE community moving in the right direction?
     Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, John Grundy, Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, International Conference on Mining Software Repositories (MSR'21).
  21. Open Questions. All three goals are well perceived as useful by practitioners; sadly, researchers mostly focus on improving accuracy, not explainability. We therefore call for a new research topic of Explainable AI for Software Engineering (XAI4SE) and ask: "How can we make software analytics more explainable and actionable?" See: LIME-HPO (TSE'20), SQAPlanner (TSE'20), PyExplainer (ASE'21), Actionable Analytics (IEEE Software'21).
  22. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction. Chanathip Pornprasit, Chakkrit (Kla) Tantithamthavorn. International Conference on Mining Software Repositories (MSR'21). @klainfo http://chakkrit.com
  23. Critical Challenges in Modern SQA Practices. Large-scale software projects receive a large number of newly arrived commits every day, and those commits need to be reviewed before being integrated into the release branch (a Google SE practice). With limited time and resources, it is infeasible for practitioners to exhaustively review all commits.
  24. Just-In-Time (JIT) Defect Prediction Models: a classifier to predict whether a commit will introduce software defects in the future [Kamei et al. 2013]. Along the timeline, defect-inducing and defect-fixing commits from the commit history form the training data, and each new commit is the testing data. JIT models provide immediate feedback for SQA planning: a risk score to prioritize SQA resources on the most risky commits, plus the important factors behind it.
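For flavor, here is a minimal sketch of such a commit-level model in the spirit of Kamei et al. (2013), using a handful of change metrics (lines added/deleted, number of files, number of developers, author experience); the CSV, its column names, and the choice of logistic regression are illustrative assumptions rather than the exact published setup.

```python
# Sketch of a just-in-time (commit-level) defect model: one row per commit,
# described by change metrics in the spirit of Kamei et al. (2013).
# The CSV and its column names below are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression

commits = pd.read_csv("commit_metrics.csv")
features = ["la", "ld", "nf", "ndev", "exp"]   # lines added/deleted, #files, #devs, experience
X, y = commits[features], commits["defect_inducing"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Risk score per commit: prioritize SQA resources on the most risky commits.
# (In practice, score newly arrived commits rather than the training set.)
commits["risk"] = model.predict_proba(X)[:, 1]
print(commits.sort_values("risk", ascending=False)[["commit_id", "risk"]].head(10))
```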
  25. Each commit is large and complex: commit sizes vary from 100 to 1,000 lines of code, and on average only 50% of the changed lines of a commit are risky. Consequently, predictions are not understandable by practitioners, JIT models see little adoption in practice, and SQA resources are poorly prioritized. Practitioners still do not know: "Which lines are most risky?"
  26. JITLine: Line-Level JIT Defect Prediction using XAI. How can we accurately identify which lines are most risky for a given commit? (Step 1) Extract bag-of-tokens features from the commits in the training data. (Step 2) Handle class imbalance with DE+SMOTE. (Step 3) Build a commit-level JIT model. (Step 4) For each commit in the testing data, explain the model's prediction to find the defective tokens/lines. A sketch of these steps is shown below.
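A hedged sketch of those four steps: CountVectorizer for the bag-of-tokens features, plain SMOTE for class imbalance (JITLine tunes SMOTE with differential evolution), a random forest as the commit-level model, and a LIME text explainer to surface the riskiest tokens. The commit CSV, its columns, and the train/test split are hypothetical placeholders.

```python
# Sketch of a JITLine-style pipeline: bag-of-tokens features, SMOTE for class
# imbalance (the paper tunes SMOTE with differential evolution; plain SMOTE here),
# a commit-level random forest, and LIME to surface the riskiest tokens.
# The "changed_code" / "defect_inducing" columns and the CSV are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from lime.lime_text import LimeTextExplainer

data = pd.read_csv("commits.csv")                  # one row per commit
train, test = data.iloc[:-100], data.iloc[-100:]   # illustrative time-ordered split

# Step 1: bag-of-tokens features from the changed code of each commit.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X_train = vectorizer.fit_transform(train["changed_code"])

# Step 2: handle class imbalance (defect-inducing commits are rare).
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, train["defect_inducing"])

# Step 3: build the commit-level JIT model.
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)

def predict_proba(texts):                          # raw commit text -> probabilities
    return model.predict_proba(vectorizer.transform(texts))

# Step 4: explain one prediction; tokens with large positive weights point to the
# likely defective tokens (and hence lines) within the commit.
explainer = LimeTextExplainer(class_names=["clean", "defect-inducing"])
explanation = explainer.explain_instance(
    test["changed_code"].iloc[0], predict_proba, num_features=10)
print(explanation.as_list())
```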
  27. Experimental Setup. Datasets (McIntosh & Kamei [TSE'17]):
     • OpenStack: 12K commits, 32K tokens, 13% defective commits, average commit size of 73 LOC, average of 53% defective lines per commit
     • Qt: 25K commits, 81K tokens, 8% defective commits, average commit size of 140 LOC, average of 51% defective lines per commit
     Baseline comparison: effort-aware JIT defect models (Kamei et al., TSE'13), DeepJIT CNN-based models (Hoang et al., MSR'19), and CC2Vec HAN-based models (Hoang et al., ICSE'20).
  28. Four Research Questions & Results.
     • (RQ1) Does JITLine outperform the state-of-the-art JIT defect prediction approaches? (Measures: AUC, F1, FAR, D2H.) JITLine is 26%-38% more accurate (F-measure) and has a 94%-97% lower False Alarm Rate (FAR).
     • (RQ2) Is JITLine more cost-effective than the state-of-the-art JIT defect prediction approaches? (Measures: Effort@20%Recall, PCI@20%LOC, Popt.) JITLine reduces the effort needed to find the same number of defect-introducing commits by 89%-96%.
     • (RQ3) Is JITLine faster than the state-of-the-art JIT defect prediction approaches? (Measure: model training time.) JITLine takes 1-3 minutes on a CPU, which is 70-100 times faster than CC2Vec.
     • (RQ4) How effective is JITLine at prioritizing defective lines within a given defect-introducing commit? (Measures: Top-10 accuracy, Recall@20%Effort, Effort@20%Recall.) JITLine is 133%-150% more accurate at identifying defective lines than the n-gram approach of Yan et al.
     Take-away: "JITLine may help practitioners to better prioritize defect-introducing commits and better identify defective lines."
  29. AIBugHunter 2.0 / AIBugHunter 4.0. A developer submits pull requests (PRs), and AIBugHunter supports:
     • Automated PR Prioritization: Which PRs are the most risky?
     • Automated Defect Prediction: Which files are the most risky?
     • Automated Defect Localization: Where should a file be fixed?
     • Automated Defect Repairs: How should a file be fixed?
     • Automated Explanation Generation: Why is this file predicted as risky?
     • Automated QA Planning: How should we improve in the future?