Explainable AI for Software Engineering (https://xai4se.github.io/)

The success of software engineering projects largely depends on complex decision-making: which tasks should a developer do first, who should perform a given task, is the software of high quality, is a software system reliable and resilient enough to deploy, and so on. Erroneous decisions on such complex questions are costly in terms of money and reputation. Thus, Artificial Intelligence/Machine Learning (AI/ML) techniques have been widely used in software engineering to develop software analytics tools and techniques that improve decision-making, developer productivity, and software quality. However, the predictions of such AI/ML models for software engineering are still not practical (i.e., not fine-grained), not explainable, and not actionable. These concerns often hinder the adoption of AI/ML models in software engineering practice. In addition, many recent studies still focus on improving accuracy, while only a few focus on improving explainability. Are we moving in the right direction? How can we better support the SE community (both research and education)? In this book, we first provide a concise yet essential introduction to the most important aspects of Explainable AI and a hands-on tutorial of Explainable AI tools and techniques. Then, we introduce the fundamentals of defect prediction (an example application of AI for Software Engineering). Finally, we demonstrate three successful case studies on how Explainable AI techniques can be used to address the aforementioned challenges by making the predictions of software defect prediction models more practical, explainable, and actionable.

Transcript

  1. 1

  2. Acknowledgement. Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021. SQAPlanner: Generating Data-Informed Software Quality Improvement Plans, TSE 2021. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021. An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models, TSE 2020. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction, MSR 2021. Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021. Many thanks to my colleagues and collaborators who have supported my research over the past few years.
  4. • Follow-up after the tutorial: [email protected] • Twitter: @klainfo • Materials: http://xai4se.github.io • Ask questions in the Zoom chat • Click 'Binder' or 'Colab' to interactively access the notebooks
  11. Disclaimers • This tutorial is not a comprehensive introduction to Explainable AI theories or algorithms. • This tutorial is not an exhaustive survey of XAI, since there is a massive body of Explainable AI research (1,000+ papers) across many disciplines (AI, ML, HCI, Social Science, and Software Engineering). • We are SE researchers who aim to make AI4SE research more explainable and actionable for practitioners, leading to more adoption in practice and greater worldwide impact. • This tutorial aims to: motivate the importance of XAI for SE; provide a concise yet essential introduction to the most important aspects of XAI; demonstrate some potential applications of XAI for SE; and convince everyone to tackle the many open research questions of XAI4SE.
  12. Agenda • Part 1: Explainable AI for Software Engineering (XAI4SE)

    + A Live Demo • Part 2: Other Potential Usage Scenarios of XAI4SE • Part 3: Lessons Learned and Open Questions 5
  13. Software development involves complex and critical decision-making. Managers, designers, developers, testers, and QA engineers constantly face questions such as: When should we release? How effective is our test suite? Is the software design good enough? How should I fix this bug? Which modules should I test first? Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021.
  14. AI/ML IS ADOPTED IN SOFTWARE ENGINEERING IMPROVE SOFTWARE QUALITY Predict

    defects, vulnerabilities, malware Generate test cases 7
  15. AI/ML IS ADOPTED IN SOFTWARE ENGINEERING. IMPROVE SOFTWARE QUALITY: predict defects, vulnerabilities, malware. IMPROVE PRODUCTIVITY: predict developer/team productivity; recommend developers/reviewers; identify developer turnover.
  21. Defect Prediction: An Overview. An AI/ML model predicts whether a file will be defective in the future. (1) A developer submits pull requests / commits / files; (2) a reviewer reviews them, but there are too many PRs and they are too large, so a defect prediction model flags each change as Risky or Clean. Developers then ask: Why is a commit predicted as defective? What should they do to improve it? Lack of explainability = lack of trust = lack of adoption in practice. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
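For a concrete sense of the "global" defect model that the rest of the tutorial explains, here is a minimal sketch assuming a hypothetical metrics dataset (the column names churn, reviewers, loc and the defect label are invented for illustration) and scikit-learn as the modelling stack; later sketches in this transcript reuse these variables.

```python
# Minimal sketch of a file/commit-level defect model (hypothetical data).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("defect_dataset.csv")        # hypothetical historical dataset
features = ["churn", "reviewers", "loc"]        # hypothetical commit/file metrics
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["defect"], test_size=0.3, random_state=0)

# The "global" black-box model that flags each change as Risky/Clean
global_model = RandomForestClassifier(n_estimators=100, random_state=0)
global_model.fit(X_train, y_train)
risk_scores = global_model.predict_proba(X_test)[:, 1]
```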
  24. Explainable AI: Objective & Definitions. Explainable AI (XAI) aims to create a suite of AI/ML techniques that (David Gunning, 2016): produce more explainable models while maintaining a high level of prediction accuracy; and enable human users to understand and build appropriate trust in the predictions. Definitions: Interpretable ML uses a white-box model; its advantages are mainly for high-stakes decisions. Explainable AI uses a black-box model and explains it afterwards (diagram: Data → AI Algorithm → Model vs. Data → AI Algorithm → XAI → Explanations). Source: https://www.darpa.mil/program/explainable-artificial-intelligence
  29. PyExplainer: To Explain Defect Predictions. A local, rule-based, model-agnostic technique that generates explanations and actionable guidance for defect predictions (Risky / Clean). Given the model and an instance, PyExplainer produces a rule-based explanation such as {Churn > 100 & Reviewers < 2} => DEFECT. Why is a commit predicted as defective? "This commit is predicted as defective since Churn > 100 and Reviewers < 2." This helps developers understand the most important aspects that are associated with defects, and the risk thresholds involved (e.g., how small should Churn be?). PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
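A hedged sketch of invoking PyExplainer on such a model, reusing the variables from the earlier defect-model sketch. The import path, constructor arguments, and result keys are assumptions based on the PyExplainer documentation (https://github.com/awsm-research/pyexplainer) and may differ between versions; consult the docs or the live demo below.

```python
# Hedged sketch: explain one risky prediction with PyExplainer (API assumed).
from pyexplainer.pyexplainer_pyexplainer import PyExplainer

py_explainer = PyExplainer(X_train=X_train,              # training features
                           y_train=y_train,              # training labels
                           indep=X_train.columns,        # independent variable names
                           dep="defect",                 # dependent variable name
                           blackbox_model=global_model)  # the global model to explain

rule = py_explainer.explain(X_explain=X_test.iloc[[0]],
                            y_explain=y_test.iloc[[0]],
                            search_function="crossoverinterpolation")

# Rules of the form {Churn > 100 & Reviewers < 2} => DEFECT; key name assumed.
print(rule["top_k_positive_rules"])
```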
  37. PyExplainer: To Generate Actionable Guidance. What if we change this; would it reverse the prediction of the defect model? The rule {Churn > 100 & Reviewers < 2} => DEFECT is explainable, but on its own it is not easily understandable and not actionable. Thus, we designed a proof-of-concept interactive visual explanation for PyExplainer: it shows the risk score together with the explanation and guidance, and offers a what-if control; when the user changes a feature value, the risk score is updated. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
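The underlying what-if idea can be sketched without the interactive widget: perturb a feature of a risky instance and re-query the global model to see whether the prediction flips. A minimal sketch reusing the earlier variables; the feature names and substituted values are illustrative assumptions.

```python
# What-if sketch: would the prediction flip if we changed these feature values?
instance = X_test.iloc[[0]].copy()
print("Original risk:", global_model.predict_proba(instance)[0, 1])

what_if = instance.copy()
what_if["reviewers"] = 2     # what if two reviewers inspected this change?
what_if["churn"] = 100       # what if the change were split into smaller commits?
print("What-if risk:", global_model.predict_proba(what_if)[0, 1])
```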
  38. A Live Demo of PyExplainer: https://xai4se.github.io/tutorials/pyexplainer-live-demo.html. (Step 1) Click 'Binder' to launch the environment; (Step 2) interactively run the notebook.
  46. PyExplainer Key Intuition: "Build a local model to approximate the behaviour of the global model." Input = [an instance, a global model]. (Step 1) Generate synthetic instances around the instance to be explained, yielding synthetic neighbours within its neighbourhood. (Step 2) Obtain predictions (Y') for those synthetic neighbours from the global black-box model. (Step 3) Build a local interpretable model using RuleFit to locally approximate the predictions of the underlying global model. (Step 4) Generate a rule-based local explanation by identifying the most important rules for the individual prediction. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
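A generic local-surrogate sketch of these four steps, reusing the earlier variables. PyExplainer itself generates neighbours by crossover and mutation and fits a RuleFit local model; here Gaussian perturbation and a shallow decision tree stand in purely to illustrate the intuition, not to reproduce the actual technique.

```python
# Generic local-surrogate sketch (stand-ins for crossover/mutation and RuleFit).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
x0 = X_test.iloc[0].to_numpy(dtype=float)

# Step 1: synthetic neighbours around the instance to be explained
noise = rng.normal(scale=0.1 * X_train.std().to_numpy(), size=(1000, len(x0)))
neighbours = pd.DataFrame(x0 + noise, columns=X_train.columns)

# Step 2: label the neighbours with the global black-box model (Y')
y_prime = global_model.predict(neighbours)

# Steps 3-4: fit an interpretable local model and read rule-like explanations off it
local_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(neighbours, y_prime)
print(export_text(local_model, feature_names=list(X_train.columns)))
```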
  60. PyExplainer vs LIME (state of the art). A general XAI concept: generate synthetic neighbours, build a local model, generate explanations; quality of explanation ~ neighbourhood + local model. LIME generates neighbours by random perturbation (no heuristics, so the neighbourhood is too large) and builds a K-Lasso local model (inaccurate models = poor approximation; incorrect explanations = incorrect insights). PyExplainer instead generates neighbours by crossover + mutation and builds a RuleFit local model before generating explanations. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
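For reference, the LIME side of this comparison (random perturbation around the instance plus a weighted linear, K-Lasso-style local model behind explain_instance) looks roughly as follows, again reusing the earlier variables; the class names are an assumption.

```python
# LIME's neighbourhood + local linear model, via the lime package.
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(training_data=X_train.to_numpy(),
                                      feature_names=list(X_train.columns),
                                      class_names=["clean", "defect"],
                                      mode="classification",
                                      random_state=0)

lime_exp = lime_explainer.explain_instance(X_test.iloc[0].to_numpy(),
                                           global_model.predict_proba,
                                           num_features=5)
print(lime_exp.as_list())   # e.g., [("churn > 100.00", 0.21), ("reviewers <= 2.00", 0.14)]
```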
  61. PyExplainer vs LIME (state of the art): experimental results. PyExplainer produces more similar synthetic neighbours and a more accurate local model, leading to more unique and consistent explanations. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
  62. Agenda • Part 1: Explainable AI for Software Engineering (XAI4SE)

    + A Live Demo • Part 2: Potential Usage Scenarios of XAI4SE • Part 3: Lessons Learned and Open Questions 18
  69. Potential Usage Scenarios of XAI4SE. A developer (1) submits pull requests / commits / files, and a reviewer (2) reviews them with the help of defect prediction. Line-level DP: "Which lines should I look at?" Explainable DP: "Why is it predicted as risky?" Actionable DP: "What should I do to improve the code quality?" Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021.
  74. Example 1: Explainable Defect Prediction. Researchers raised concerns that a lack of explainability could lead to a lack of trust when adopting defect prediction in practice (Dam et al., ICSE-NIER 2018), and practitioners perceived that explanations are as useful as predictions (Jiarpakdee et al., MSR 2021). Challenge: many ML-based defect models are globally explainable, but not locally explainable. A global (model-level) explanation is derived from historical data and may not be applicable to unseen data (a testing instance), so practitioners still do not know why a particular file is predicted as defective. An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models, TSE 2020.
  80. LIME-XDP: LIME for Explainable Defect Prediction. How can LIME be used to answer why-questions, i.e., to generate contrastive explanations? Plain-fact: Why is this file predicted as defective? LIME support explanation: because #ClassCoupled > 5 => DEFECT; implication: to mitigate the risk, reduce #ClassCoupled. Property-contrast: Why is this file predicted as defective rather than clean? LIME contradict explanation: because #DEV <= 2 => CLEAN; implication: maintain #DEV. Time-contrast: Why is file A predicted as defective in Release 1.0, but as clean in Release 2.0? Object-contrast: Why is file A predicted as defective while file B is predicted as clean? An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models, TSE 2020.
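The support/contradict split used for the plain-fact and property-contrast explanations above can be read off the sign of the LIME weights. A small sketch reusing the lime_exp object from the earlier LIME example; this is a generic illustration, not the paper's exact implementation.

```python
# Split a LIME explanation into supporting (=> DEFECT) and contradicting
# (=> CLEAN) features based on the sign of each weight.
weights = lime_exp.as_list(label=1)                 # weights towards the DEFECT class
support = [(f, w) for f, w in weights if w > 0]     # e.g., ClassCoupled > 5
contradict = [(f, w) for f, w in weights if w < 0]  # e.g., DEV <= 2

print("Why defective (plain-fact):", support)
print("Why not clean (property-contrast):", contradict)
```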
  81. Research Questions & Results 22 An Empirical Study of Model-Agnostics

    Techniques for Defect Prediction Models, TSE 2020.
  82. Research Questions & Results Do different predictions have different explanations?

    Motivation: Global explanation is too general, but 
 what about local explanations? 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  83. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  84. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  85. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  86. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  87. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. • Random seeds need to be defined to increase the stability of the model-agnostic techniques. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  88. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. • Random seeds need to be defined to increase the stability of the model-agnostic techniques. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020. How do practitioners perceive the generated explanation? Motivation: LIME can be used to generate local explanation, 
 but little is known how do practitioners perceive.
  89. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. • Random seeds need to be defined to increase the stability of the model-agnostic techniques. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020. • 65% of the participants agree that Time- contrast explanations are most useful. How do practitioners perceive the generated explanation? Motivation: LIME can be used to generate local explanation, 
 but little is known how do practitioners perceive.
  95. Example 2: Line-Level Defect Prediction. Researchers have raised concerns that fine-grained defect prediction is needed (e.g., Pascarella et al., JSS 2019; Wan et al., TSE 2020), and developers still spend lots of effort locating where a defect is within a file. Challenge: ML-based line-level defect prediction often performs poorly; the ratio of defective lines is extremely low (i.e., 1%-3%); the dimensionality of the defect dataset is very large (i.e., 230,898 code tokens and 259,617 lines of code); and deep neural networks that predict defective lines have not yet been seen. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021.
  96. LINE-DP: Line-Level Defect Prediction. Training: from the defect dataset (e.g., File_A.java, File_B.java, File_C.java), extract Bag-of-Tokens feature vectors (Token_1 ... Token_N per file) and build a file-level defect model. Prediction: for a file of interest (a testing file), apply the file-level defect prediction model, then use LIME to score its tokens (e.g., oldCurrent 0.8, current 0.1, node -0.3, closure -0.7), where tokens with a positive LIME score are defect-prone and tokens with a negative score are clean. Map the risky tokens back to the lines that contain them (e.g., within `if (closure != null) { Object oldCurrent = current; setClosure(closure, node); closure.call(); current = oldCurrent; }`), and rank the identified defect-prone lines from most to least risky. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021.
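A hedged sketch of this token-based pipeline using the lime package's text explainer, with hypothetical file contents and labels; LINE-DP's actual feature extraction, models, and ranking follow the TSE 2021 paper and differ in detail.

```python
# Bag-of-tokens file-level model + LIME token scores mapped back to lines (sketch).
from lime.lime_text import LimeTextExplainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

train_files = ["if ( closure != null ) { closure . call ( ) ; }",   # hypothetical
               "int total = 0 ; return total ;"]                    # file contents
train_labels = [1, 0]                                               # file-level defect labels

file_model = make_pipeline(CountVectorizer(token_pattern=r"\S+"),
                           RandomForestClassifier(random_state=0))
file_model.fit(train_files, train_labels)

# LIME scores each token; positive scores indicate defect-prone tokens.
text_explainer = LimeTextExplainer(class_names=["clean", "defective"])
test_file = "if ( closure != null ) {\nObject oldCurrent = current ;\nsetClosure ( closure , node ) ;\n}"
exp = text_explainer.explain_instance(test_file, file_model.predict_proba, num_features=20)
risky_tokens = {tok for tok, score in exp.as_list() if score > 0}

# Rank lines by how many risky tokens they contain (most risky first).
ranked_lines = sorted(test_file.split("\n"),
                      key=lambda line: -sum(tok in line.split() for tok in risky_tokens))
```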
  102. LINE-DP: Experimental Results. "LINE-DP achieves an average recall of 0.61 and a Recall@Top20%LOC of 0.27, which outperforms the other baselines in both within-release and cross-release settings." Benchmark dataset: https://github.com/awsm-research/line-level-defect-prediction. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021.
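Recall@Top20%LOC is an effort-aware measure: the proportion of truly defective lines found when developers inspect only the top 20% of lines, ranked by predicted risk. A small sketch with hypothetical data:

```python
# Recall@Top20%LOC sketch: how many defective lines are caught in the top 20%?
def recall_at_top20_loc(ranked_is_defective):
    """ranked_is_defective: booleans per line, ordered from most to least risky."""
    n_inspect = int(0.20 * len(ranked_is_defective))
    found = sum(ranked_is_defective[:n_inspect])
    total = sum(ranked_is_defective)
    return found / total if total else 0.0

# Hypothetical ranking of 10 lines, 2 of which are actually defective.
print(recall_at_top20_loc([True, False, False, False, False,
                           False, False, False, True, False]))   # -> 0.5
```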
  103. Inventing the next generation of defect prediction technologies (ARC DECRA 2020-2023: Practical and Explainable Analytics to Prevent Future Software Defects). Aim: research translation, evaluating a proof-of-concept of AIBugHunter 2.0 with practitioners. Industry problems and objectives: Practical (developers still spend lots of effort locating where a defect is within a file), Explainable (developers still do not trust the predictions), and Actionable (developers still do not know what they should do to improve quality). Technical research: Practical: 1. LineDP: Predicting Defective Lines Using a Model-Agnostic Technique (TSE'21); 2. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction (MSR'21); 3. DeepLineDP: Towards a Deep Learning Approach for Line-Level Defect Prediction (under review). Explainable: 1. Survey: Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models (MSR'21); 2. LIME-HPO: An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models (TSE'20); 3. JITBot: An Explainable Just-In-Time Defect Prediction (ASE'20). Actionable: 1. SQAPlanner: Generating Data-Informed Software Quality Improvement Plans (TSE'21); 2. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models (ASE'21); 3. Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do (Software'21). Expected benefits: to improve the efficiency and effectiveness of the SQA process (e.g., code review & testing).
  104. Agenda • Part 1: Explainable AI for Software Engineering (XAI4SE)

    + A Live Demo • Part 2: Other Potential Usage Scenarios of XAI4SE • Part 3: Lessons Learned and Open Questions 32
  115. XAI4SE has been raised for a decade, but we are not there yet (timeline 2010-2020): concerns were raised at the Goldfish Bowl Panel on Software Development Analytics, ICSE 2012; the community then focused on improving the ML modelling process (predict + explain); Explainable Software Analytics appeared at ICSE-NIER 2018; and now XAI4SE…
  119. Early days of explainability in SE (2000-2015): transparent machine learning and variable importance, using techniques such as linear, non-linear, and logistic regression analysis, ANOVA, Naive Bayes, association rules, and decision trees; globally explainable, but not accurate and not locally explainable. The AI4SE booming age (2015-2020): neural network predictions, using techniques such as DBN, MLP, RNN, CNN, LSTM, and Transformers; very accurate, but not explainable.
  120. 40 Years of Defect Prediction Studies: researchers' assumptions vs. practitioners' perceptions. Researchers assume defect prediction models help developers effectively prioritize the limited SQA resources, help managers develop the most effective improvement plans, help researchers generate empirically grounded theories of software defects, etc. Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021.
  123. [Figure: goals of developing defect prediction models in the literature: Goal 1 (Predict) 91% of studies, Goal 2 (Understand) 41%, Goal 3 (Explain) 4%.] Although practitioners perceive the three goals as similarly useful, 91% of studies focused on improving accuracy, while as few as 4% focused on increasing explainability. (Practitioners Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021.)
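As a concrete illustration of a Goal-3 local explanation, the minimal sketch below applies LIME to one prediction of a defect model: it fits a local surrogate around that prediction and returns weighted, rule-like conditions in the spirit of "If {LOC>100} then {BUG}". The dataset, metric names, and model are illustrative assumptions, not the MSR 2021 study's data or code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

# Synthetic defect data with assumed metric names.
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["LOC", "CodeChurn", "NumDevelopers",
                             "CyclomaticComplexity"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# LIME builds a local surrogate around one file's prediction and reports
# discretized, rule-like conditions together with their weights.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X.columns),
    class_names=["clean", "defective"],
    mode="classification",
    discretize_continuous=True,
)
exp = explainer.explain_instance(X_test.values[0], rf.predict_proba, num_features=4)
print(exp.as_list())  # e.g. [("LOC > 0.71", 0.12), ("CodeChurn <= -0.35", -0.05), ...]
```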
  124-131. On the Needs of Actionable + Explainable Analytics. (Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021.)
    Practitioners' Needs:
    • Lewis et al., Does Bug Prediction Support Human Developers? Findings From a Google Case Study, ICSE'13: "Defect models should be more actionable to help software engineers debug their programs"
    • Tan et al., Online Defect Prediction for Imbalanced Data, ICSE-SEIP 2015: "Developers need to be convinced and prediction results need to be actionable"
    • Jiarpakdee et al., Practitioners Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021: "Improving explainability is perceived as equally useful as improving accuracy"
    Researchers' Concerns:
    • Dam et al., Explainable Software Analytics, ICSE-NIER'18: "Explainability should therefore be a key measure for evaluating software analytics"
    • Tantithamthavorn et al., Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021: "Explainable and actionable software analytics is urgently and critically needed"
    • Choetkiertikul et al., A deep learning model for estimating story points, TSE'19: "Explainability of a model is important for full adoption of machine learning techniques"
  132. Explainable AI is much needed in software engineering (including other SE tasks), but remains largely unexplored.
  133-138. Explainability and Explanation
    • Explainability is the degree to which a human can understand the reasons behind a prediction.
    • Explanation serves as an interface between humans and machines. [Diagram: Complex AI Models → XAI (Explainable AI Methods) → Explanation → Users.]
    • An explanation is an answer to a why-question (Tim Miller, 2018).
    • Challenge 1: The effectiveness of an explanation depends on the questions asked.
    • Challenge 2: One explanation can be presented in various scopes and forms to serve a goal.
  139. Different Techniques = Different Explanations (Global vs. Local). Different techniques produce different forms of explanations with different information. (Practitioners Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021.)
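To illustrate how scope alone changes the explanation, the sketch below (again on synthetic data with assumed metric names) derives a global explanation for a defect model via permutation importance: a single ranking of metrics over the whole test set, which may differ from the per-file conditions that a local technique such as LIME (sketched earlier) returns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same kind of synthetic defect data and model as in the earlier sketch.
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["LOC", "CodeChurn", "NumDevelopers",
                             "CyclomaticComplexity"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Global explanation: one ranking of metrics for the model as a whole.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(sorted(zip(X.columns, result.importances_mean),
             key=lambda kv: kv[1], reverse=True))

# A local technique (e.g., LIME) would instead return per-file conditions and
# weights, which may order the metrics differently for an individual prediction.
```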
  140-146. Different stakeholders have different questions; however, most XAI toolkits are algorithmic-driven, not yet human-centred. [Diagram: Complex AI Models → Predictions → "Why am I getting this prediction? How can I get a better decision?"]
    • Software Engineers: Why is a file/commit predicted as defective? What should be done to make it better?
    • AI Experts: Which AI models should I select? How do I monitor and debug these models?
    • Domain Experts: Are the AI models learned correctly? Is the AI model's logic reasonable?
    • Business Leaders / Policy Makers: Are the predictions/recommendations/text generations of AI models aligned with company values? Do AI models recommend tasks to developers fairly? Are AI models ready to be deployed?
    • Regulators: Do AI models (e.g., developer productivity predictions) conform with laws/regulations like GDPR? Is software development productivity estimated from gender, age, race, or marital status?
  147-150. A Human-Centric XAI Recipe (Do-Re-Mi). To make AI more explainable, we need to:
    • (Step 1) Domain analysis to understand the AI problem, social contexts, and stakeholders. Stakeholders: Who do we want to explain to? e.g., developers.
    • (Step 2) Requirement elicitation to understand practitioners' needs. Goals: What is their purpose? e.g., gaining deeper insights. Questions: What do we want to explain? e.g., why is a file/commit predicted as defective?
    • (Step 3) Multimodal explanation design (see the sketch after this list). Scopes: global level, local level (the instance to be explained). AI Models: What kind of AI model are we trying to explain? e.g., classification, regression, NLP, etc. Forms: variable importance, rules, integrated gradients, example-based, attention, heatmaps. XAI Techniques: LIME, LORE, SHAP, Anchors, PDP, DiCE, surrogate models.
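One way to make Step 3 concrete is sketched below: a small lookup that maps an elicited stakeholder question (from Step 2) to a scope, a form, and candidate techniques. Both the question strings and the mapping are illustrative assumptions made for this sketch, not a fixed taxonomy from the tutorial.

```python
# Illustrative mapping from stakeholder questions (Step 2) to explanation
# designs (Step 3); the questions and choices below are assumptions, not a
# prescribed catalogue.
EXPLANATION_DESIGNS = {
    "Why is this file/commit predicted as defective?": {
        "scope": "local",
        "form": "variable importance / rule",
        "techniques": ["LIME", "SHAP", "Anchors"],
    },
    "What should be changed to lower the defect risk?": {
        "scope": "local",
        "form": "counterfactual / guidance rule",
        "techniques": ["DiCE", "SQAPlanner"],
    },
    "Which characteristics are associated with defects overall?": {
        "scope": "global",
        "form": "variable importance / partial dependence",
        "techniques": ["ANOVA", "permutation importance", "PDP"],
    },
}


def select_design(question: str) -> dict:
    """Return the explanation design for a stakeholder question, if known."""
    return EXPLANATION_DESIGNS.get(
        question,
        {"scope": "unknown", "form": "unknown", "techniques": []},
    )


print(select_design("Why is this file/commit predicted as defective?"))
```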