Explainable AI for Software Engineering (https://xai4se.github.io/)

The success of software engineering projects largely depends on complex decision-making: which tasks should a developer do first, who should perform a given task, is the software of high quality, is a software system reliable and resilient enough to deploy, and so on. Erroneous decisions on such complex questions are costly in terms of money and reputation. Thus, Artificial Intelligence/Machine Learning (AI/ML) techniques have been widely used in software engineering to develop software analytics tools and techniques that improve decision-making, developer productivity, and software quality. However, the predictions of such AI/ML models for software engineering are still not practical (i.e., not fine-grained), not explainable, and not actionable. These concerns often hinder the adoption of AI/ML models in software engineering practice. In addition, many recent studies still focus on improving accuracy, while only a few focus on improving explainability. Are we moving in the right direction? How can we better support the SE community (both research and education)? In this book, we first provide a concise yet essential introduction to the most important aspects of Explainable AI and a hands-on tutorial of Explainable AI tools and techniques. Then, we introduce the fundamentals of defect prediction (an example application of AI for Software Engineering). Finally, we demonstrate three successful case studies on how Explainable AI techniques can be used to address the aforementioned challenges by making the predictions of software defect prediction models more practical, explainable, and actionable.

Transcript

  1. 1

  2. Acknowledgement. Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021. SQAPlanner: Generating Data-Informed Software Quality Improvement Plans, TSE 2021. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021. An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models, TSE 2020. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction, MSR 2021. Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021. Many thanks to my colleagues and collaborators who have supported my research over the past few years.
  4. • Follow-up after the tutorial: [email protected] • Twitter: @klainfo • Materials: http://xai4se.github.io • Ask questions in the Zoom chat • Click 'Binder' or 'Colab' to interactively access the notebooks
  11. Disclaimers • This tutorial is not a comprehensive introduction to Explainable AI theories or algorithms. • This tutorial is not an exhaustive survey of XAI, since there is a massive body of Explainable AI research (1,000+ papers) across many disciplines (AI, ML, HCI, Social Science, and Software Engineering). • We are SE researchers who aim to make AI4SE research more explainable and actionable for practitioners, leading to more adoption in practice and greater worldwide impact. • This tutorial aims to: motivate the importance of XAI for SE; provide a concise yet essential introduction to the most important aspects of XAI; demonstrate some potential applications of XAI for SE; and convince everyone to tackle the many open research questions of XAI4SE.
  12. Agenda • Part 1: Explainable AI for Software Engineering (XAI4SE)

    + A Live Demo • Part 2: Other Potential Usage Scenarios of XAI4SE • Part 3: Lessons Learned and Open Questions 5
  13. Software development involves complex and critical decision-making. Managers, designers, developers, testers, and QA engineers constantly face questions such as: When should we release? How effective is our test suite? Is the software design good enough? How should I fix this bug? Which modules should I test first? Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021.
  14. AI/ML IS ADOPTED IN SOFTWARE ENGINEERING IMPROVE SOFTWARE QUALITY Predict

    defects, vulnerabilities, malware Generate test cases 7
  15. AI/ML IS ADOPTED IN SOFTWARE ENGINEERING. IMPROVE SOFTWARE QUALITY: predict defects, vulnerabilities, malware. IMPROVE PRODUCTIVITY: predict developer/team productivity; recommend developers/reviewers; identify developer turnover.
  21. Defect Prediction: An Overview. An AI/ML model predicts whether a file will be defective in the future. (1) A developer submits pull requests / commits / files; (2) a reviewer reviews them, but there are too many PRs and they are too large, so a defect prediction model flags each change as Risky or Clean. Developers then ask: Why is a commit predicted as defective? What should they do to improve it? Lack of explainability = lack of trust = lack of adoption in practice. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
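For a concrete sense of the "global" defect model that the rest of the tutorial explains, here is a minimal sketch assuming a hypothetical metrics dataset (the column names churn, reviewers, loc and the defect label are invented for illustration) and scikit-learn as the modelling stack; later sketches in this transcript reuse these variables.

```python
# Minimal sketch of a file/commit-level defect model (hypothetical data).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("defect_dataset.csv")        # hypothetical historical dataset
features = ["churn", "reviewers", "loc"]        # hypothetical commit/file metrics
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["defect"], test_size=0.3, random_state=0)

# The "global" black-box model that flags each change as Risky/Clean
global_model = RandomForestClassifier(n_estimators=100, random_state=0)
global_model.fit(X_train, y_train)
risk_scores = global_model.predict_proba(X_test)[:, 1]
```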
  24. Explainable AI: Objective & Definitions. Explainable AI (XAI) aims to create a suite of AI/ML techniques that (David Gunning, 2016): produce more explainable models while maintaining a high level of prediction accuracy; and enable human users to understand and build appropriate trust in the predictions. Definitions: Interpretable ML uses a white-box model; its advantages are mainly for high-stakes decisions. Explainable AI uses a black-box model and explains it afterwards (diagram: Data → AI Algorithm → Model vs. Data → AI Algorithm → XAI → Explanations). Source: https://www.darpa.mil/program/explainable-artificial-intelligence
  29. PyExplainer: To Explain Defect Predictions. A local, rule-based, model-agnostic technique that generates explanations and actionable guidance for defect predictions (Risky / Clean). Given the model and an instance, PyExplainer produces a rule-based explanation such as {Churn > 100 & Reviewers < 2} => DEFECT. Why is a commit predicted as defective? "This commit is predicted as defective since Churn > 100 and Reviewers < 2." This helps developers understand the most important aspects that are associated with defects, and the risk thresholds involved (e.g., how small should Churn be?). PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
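A hedged sketch of invoking PyExplainer on such a model, reusing the variables from the earlier defect-model sketch. The import path, constructor arguments, and result keys are assumptions based on the PyExplainer documentation (https://github.com/awsm-research/pyexplainer) and may differ between versions; consult the docs or the live demo below.

```python
# Hedged sketch: explain one risky prediction with PyExplainer (API assumed).
from pyexplainer.pyexplainer_pyexplainer import PyExplainer

py_explainer = PyExplainer(X_train=X_train,              # training features
                           y_train=y_train,              # training labels
                           indep=X_train.columns,        # independent variable names
                           dep="defect",                 # dependent variable name
                           blackbox_model=global_model)  # the global model to explain

rule = py_explainer.explain(X_explain=X_test.iloc[[0]],
                            y_explain=y_test.iloc[[0]],
                            search_function="crossoverinterpolation")

# Rules of the form {Churn > 100 & Reviewers < 2} => DEFECT; key name assumed.
print(rule["top_k_positive_rules"])
```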
  37. PyExplainer: To Generate Actionable Guidance. What if we change this; would it reverse the prediction of the defect model? The rule {Churn > 100 & Reviewers < 2} => DEFECT is explainable, but on its own it is not easily understandable and not actionable. Thus, we designed a proof-of-concept interactive visual explanation for PyExplainer: it shows the risk score together with the explanation and guidance, and offers a what-if control; when the user changes a feature value, the risk score is updated. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
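The underlying what-if idea can be sketched without the interactive widget: perturb a feature of a risky instance and re-query the global model to see whether the prediction flips. A minimal sketch reusing the earlier variables; the feature names and substituted values are illustrative assumptions.

```python
# What-if sketch: would the prediction flip if we changed these feature values?
instance = X_test.iloc[[0]].copy()
print("Original risk:", global_model.predict_proba(instance)[0, 1])

what_if = instance.copy()
what_if["reviewers"] = 2     # what if two reviewers inspected this change?
what_if["churn"] = 100       # what if the change were split into smaller commits?
print("What-if risk:", global_model.predict_proba(what_if)[0, 1])
```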
  38. A Live Demo of PyExplainer: https://xai4se.github.io/tutorials/pyexplainer-live-demo.html. (Step 1) Click 'Binder' to launch the environment; (Step 2) interactively run the notebook.
  46. PyExplainer Key Intuition: "Build a local model to approximate the behaviour of the global model." Input = [an instance, a global model]. (Step 1) Generate synthetic instances around the instance to be explained, yielding synthetic neighbours within its neighbourhood. (Step 2) Obtain predictions (Y') for those synthetic neighbours from the global black-box model. (Step 3) Build a local interpretable model using RuleFit to locally approximate the predictions of the underlying global model. (Step 4) Generate a rule-based local explanation by identifying the most important rules for the individual prediction. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
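A generic local-surrogate sketch of these four steps, reusing the earlier variables. PyExplainer itself generates neighbours by crossover and mutation and fits a RuleFit local model; here Gaussian perturbation and a shallow decision tree stand in purely to illustrate the intuition, not to reproduce the actual technique.

```python
# Generic local-surrogate sketch (stand-ins for crossover/mutation and RuleFit).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
x0 = X_test.iloc[0].to_numpy(dtype=float)

# Step 1: synthetic neighbours around the instance to be explained
noise = rng.normal(scale=0.1 * X_train.std().to_numpy(), size=(1000, len(x0)))
neighbours = pd.DataFrame(x0 + noise, columns=X_train.columns)

# Step 2: label the neighbours with the global black-box model (Y')
y_prime = global_model.predict(neighbours)

# Steps 3-4: fit an interpretable local model and read rule-like explanations off it
local_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(neighbours, y_prime)
print(export_text(local_model, feature_names=list(X_train.columns)))
```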
  60. PyExplainer vs LIME (state of the art). A general XAI concept: generate synthetic neighbours, build a local model, generate explanations; quality of explanation ~ neighbourhood + local model. LIME generates neighbours by random perturbation (no heuristics, so the neighbourhood is too large) and builds a K-Lasso local model (inaccurate models = poor approximation; incorrect explanations = incorrect insights). PyExplainer instead generates neighbours by crossover + mutation and builds a RuleFit local model before generating explanations. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
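For reference, the LIME side of this comparison (random perturbation around the instance plus a weighted linear, K-Lasso-style local model behind explain_instance) looks roughly as follows, again reusing the earlier variables; the class names are an assumption.

```python
# LIME's neighbourhood + local linear model, via the lime package.
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(training_data=X_train.to_numpy(),
                                      feature_names=list(X_train.columns),
                                      class_names=["clean", "defect"],
                                      mode="classification",
                                      random_state=0)

lime_exp = lime_explainer.explain_instance(X_test.iloc[0].to_numpy(),
                                           global_model.predict_proba,
                                           num_features=5)
print(lime_exp.as_list())   # e.g., [("churn > 100.00", 0.21), ("reviewers <= 2.00", 0.14)]
```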
  61. PyExplainer vs LIME (state of the art): experimental results. PyExplainer produces more similar synthetic neighbours and a more accurate local model, leading to more unique and consistent explanations. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models, ASE 2021.
  62. Agenda • Part 1: Explainable AI for Software Engineering (XAI4SE)

    + A Live Demo • Part 2: Potential Usage Scenarios of XAI4SE • Part 3: Lessons Learned and Open Questions 18
  69. Potential Usage Scenarios of XAI4SE. A developer (1) submits pull requests / commits / files, and a reviewer (2) reviews them with the help of defect prediction. Line-level DP: "Which lines should I look at?" Explainable DP: "Why is it predicted as risky?" Actionable DP: "What should I do to improve the code quality?" Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021.
  74. Example 1: Explainable Defect Prediction. Researchers raised concerns that a lack of explainability could lead to a lack of trust when adopting defect prediction in practice (Dam et al., ICSE-NIER 2018), and practitioners perceived that explanations are as useful as predictions (Jiarpakdee et al., MSR 2021). Challenge: many ML-based defect models are globally explainable, but not locally explainable. A global (model-level) explanation is derived from historical data and may not be applicable to unseen data (a testing instance), so practitioners still do not know why a particular file is predicted as defective. An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models, TSE 2020.
  80. LIME-XDP: LIME for Explainable Defect Prediction. How can LIME be used to answer why-questions, i.e., to generate contrastive explanations? Plain-fact: Why is this file predicted as defective? LIME support explanation: because #ClassCoupled > 5 => DEFECT; implication: to mitigate the risk, reduce #ClassCoupled. Property-contrast: Why is this file predicted as defective rather than clean? LIME contradict explanation: because #DEV <= 2 => CLEAN; implication: maintain #DEV. Time-contrast: Why is file A predicted as defective in Release 1.0, but as clean in Release 2.0? Object-contrast: Why is file A predicted as defective while file B is predicted as clean? An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models, TSE 2020.
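The support/contradict split used for the plain-fact and property-contrast explanations above can be read off the sign of the LIME weights. A small sketch reusing the lime_exp object from the earlier LIME example; this is a generic illustration, not the paper's exact implementation.

```python
# Split a LIME explanation into supporting (=> DEFECT) and contradicting
# (=> CLEAN) features based on the sign of each weight.
weights = lime_exp.as_list(label=1)                 # weights towards the DEFECT class
support = [(f, w) for f, w in weights if w > 0]     # e.g., ClassCoupled > 5
contradict = [(f, w) for f, w in weights if w < 0]  # e.g., DEV <= 2

print("Why defective (plain-fact):", support)
print("Why not clean (property-contrast):", contradict)
```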
  81. Research Questions & Results 22 An Empirical Study of Model-Agnostics

    Techniques for Defect Prediction Models, TSE 2020.
  82. Research Questions & Results Do different predictions have different explanations?

    Motivation: Global explanation is too general, but 
 what about local explanations? 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  83. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  84. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  85. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  86. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  87. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. • Random seeds need to be defined to increase the stability of the model-agnostic techniques. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020.
  88. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. • Random seeds need to be defined to increase the stability of the model-agnostic techniques. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020. How do practitioners perceive the generated explanation? Motivation: LIME can be used to generate local explanation, 
 but little is known how do practitioners perceive.
  89. Research Questions & Results • Given the same defect models,

    different predictions have different local explanations, 
 highlighting the need of XAI tools for SE. • Top-10 features of local explanations are mostly overlapping but not the same as the global explanation. • Random seeds need to be defined to increase the stability of the model-agnostic techniques. Are model-agnostic techniques rely on random seeds? Motivation: Model-agnostic techniques involve 
 randomization, which could impact local explanations. Do different predictions have different explanations? Motivation: Global explanation is too general, but 
 what about local explanations? Are local explanation overlap with global explanation? Motivation: Ideally, local models should accurately mimic the predictions of global models. 22 An Empirical Study of Model-Agnostics Techniques for Defect Prediction Models, TSE 2020. • 65% of the participants agree that Time- contrast explanations are most useful. How do practitioners perceive the generated explanation? Motivation: LIME can be used to generate local explanation, 
 but little is known how do practitioners perceive.
  95. Example 2: Line-Level Defect Prediction. Researchers have raised concerns that fine-grained defect prediction is needed (e.g., Pascarella et al., JSS 2019; Wan et al., TSE 2020), and developers still spend lots of effort locating where a defect is within a file. Challenge: ML-based line-level defect prediction often performs poorly; the ratio of defective lines is extremely low (i.e., 1%-3%); the dimensionality of the defect dataset is very large (i.e., 230,898 code tokens and 259,617 lines of code); and deep neural networks that predict defective lines have not yet been seen. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021.
  96. LINE-DP: Line-Level Defect Prediction. Training: from the defect dataset (e.g., File_A.java, File_B.java, File_C.java), extract Bag-of-Tokens feature vectors (Token_1 ... Token_N per file) and build a file-level defect model. Prediction: for a file of interest (a testing file), apply the file-level defect prediction model, then use LIME to score its tokens (e.g., oldCurrent 0.8, current 0.1, node -0.3, closure -0.7), where tokens with a positive LIME score are defect-prone and tokens with a negative score are clean. Map the risky tokens back to the lines that contain them (e.g., within `if (closure != null) { Object oldCurrent = current; setClosure(closure, node); closure.call(); current = oldCurrent; }`), and rank the identified defect-prone lines from most to least risky. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021.
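A hedged sketch of this token-based pipeline using the lime package's text explainer, with hypothetical file contents and labels; LINE-DP's actual feature extraction, models, and ranking follow the TSE 2021 paper and differ in detail.

```python
# Bag-of-tokens file-level model + LIME token scores mapped back to lines (sketch).
from lime.lime_text import LimeTextExplainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

train_files = ["if ( closure != null ) { closure . call ( ) ; }",   # hypothetical
               "int total = 0 ; return total ;"]                    # file contents
train_labels = [1, 0]                                               # file-level defect labels

file_model = make_pipeline(CountVectorizer(token_pattern=r"\S+"),
                           RandomForestClassifier(random_state=0))
file_model.fit(train_files, train_labels)

# LIME scores each token; positive scores indicate defect-prone tokens.
text_explainer = LimeTextExplainer(class_names=["clean", "defective"])
test_file = "if ( closure != null ) {\nObject oldCurrent = current ;\nsetClosure ( closure , node ) ;\n}"
exp = text_explainer.explain_instance(test_file, file_model.predict_proba, num_features=20)
risky_tokens = {tok for tok, score in exp.as_list() if score > 0}

# Rank lines by how many risky tokens they contain (most risky first).
ranked_lines = sorted(test_file.split("\n"),
                      key=lambda line: -sum(tok in line.split() for tok in risky_tokens))
```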
  102. LINE-DP: Experimental Results. "LINE-DP achieves an average recall of 0.61 and a Recall@Top20%LOC of 0.27, which outperforms the other baselines in both within-release and cross-release settings." Benchmark dataset: https://github.com/awsm-research/line-level-defect-prediction. Predicting Defective Lines Using a Model-Agnostic Technique, TSE 2021.
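Recall@Top20%LOC is an effort-aware measure: the proportion of truly defective lines found when developers inspect only the top 20% of lines, ranked by predicted risk. A small sketch with hypothetical data:

```python
# Recall@Top20%LOC sketch: how many defective lines are caught in the top 20%?
def recall_at_top20_loc(ranked_is_defective):
    """ranked_is_defective: booleans per line, ordered from most to least risky."""
    n_inspect = int(0.20 * len(ranked_is_defective))
    found = sum(ranked_is_defective[:n_inspect])
    total = sum(ranked_is_defective)
    return found / total if total else 0.0

# Hypothetical ranking of 10 lines, 2 of which are actually defective.
print(recall_at_top20_loc([True, False, False, False, False,
                           False, False, False, True, False]))   # -> 0.5
```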
  103. Inventing the next generation of defect prediction technologies (ARC DECRA 2020-2023: Practical and Explainable Analytics to Prevent Future Software Defects). Aim: research translation, evaluating a proof-of-concept of AIBugHunter 2.0 with practitioners. Industry problems and objectives: Practical (developers still spend lots of effort locating where a defect is within a file), Explainable (developers still do not trust the predictions), and Actionable (developers still do not know what they should do to improve quality). Technical research: Practical: 1. LineDP: Predicting Defective Lines Using a Model-Agnostic Technique (TSE'21); 2. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction (MSR'21); 3. DeepLineDP: Towards a Deep Learning Approach for Line-Level Defect Prediction (under review). Explainable: 1. Survey: Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models (MSR'21); 2. LIME-HPO: An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models (TSE'20); 3. JITBot: An Explainable Just-In-Time Defect Prediction (ASE'20). Actionable: 1. SQAPlanner: Generating Data-Informed Software Quality Improvement Plans (TSE'21); 2. PyExplainer: Explaining the Predictions of Just-In-Time Defect Models (ASE'21); 3. Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do (Software'21). Expected benefits: to improve the efficiency and effectiveness of the SQA process (e.g., code review & testing).
  104. Agenda • Part 1: Explainable AI for Software Engineering (XAI4SE)

    + A Live Demo • Part 2: Other Potential Usage Scenarios of XAI4SE • Part 3: Lessons Learned and Open Questions 32
  115. XAI4SE has been raised for a decade, but we are not there yet (timeline 2010-2020): concerns were raised at the Goldfish Bowl Panel on Software Development Analytics, ICSE 2012; the community then focused on improving the ML modelling process (predict + explain); Explainable Software Analytics appeared at ICSE-NIER 2018; and now XAI4SE…
  119. Early days of explainability in SE (2000-2015): transparent machine learning and variable importance, using techniques such as linear, non-linear, and logistic regression analysis, ANOVA, Naive Bayes, association rules, and decision trees; globally explainable, but not accurate and not locally explainable. The AI4SE booming age (2015-2020): neural network predictions, using techniques such as DBN, MLP, RNN, CNN, LSTM, and Transformers; very accurate, but not explainable.
  120. 40 Years of Defect Prediction Studies: researchers' assumptions vs. practitioners' perceptions. Researchers assume defect prediction models help developers effectively prioritize the limited SQA resources, help managers develop the most effective improvement plans, help researchers generate empirically grounded theories of software defects, etc. Practitioners' Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021.
  123. [Figure: goals of developing defect prediction models in the literature: Goal 1 (Predict) 91% of studies, Goal 2 (Understand) 41%, Goal 3 (Explain) 4%.] Although practitioners perceive the three goals as similarly useful, 91% of studies focused on improving accuracy, while as few as 4% focused on increasing explainability. (Practitioners Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021.)
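As a concrete illustration of a Goal-3 local explanation, the minimal sketch below applies LIME to one prediction of a defect model: it fits a local surrogate around that prediction and returns weighted, rule-like conditions in the spirit of "If {LOC>100} then {BUG}". The dataset, metric names, and model are illustrative assumptions, not the MSR 2021 study's data or code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

# Synthetic defect data with assumed metric names.
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["LOC", "CodeChurn", "NumDevelopers",
                             "CyclomaticComplexity"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# LIME builds a local surrogate around one file's prediction and reports
# discretized, rule-like conditions together with their weights.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X.columns),
    class_names=["clean", "defective"],
    mode="classification",
    discretize_continuous=True,
)
exp = explainer.explain_instance(X_test.values[0], rf.predict_proba, num_features=4)
print(exp.as_list())  # e.g. [("LOC > 0.71", 0.12), ("CodeChurn <= -0.35", -0.05), ...]
```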
  124-131. On the Needs of Actionable + Explainable Analytics. (Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021.)
    Practitioners' Needs:
    • Lewis et al., Does Bug Prediction Support Human Developers? Findings From a Google Case Study, ICSE'13: "Defect models should be more actionable to help software engineers debug their programs"
    • Tan et al., Online Defect Prediction for Imbalanced Data, ICSE-SEIP 2015: "Developers need to be convinced and prediction results need to be actionable"
    • Jiarpakdee et al., Practitioners Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021: "Improving explainability is perceived as equally useful as improving accuracy"
    Researchers' Concerns:
    • Dam et al., Explainable Software Analytics, ICSE-NIER'18: "Explainability should therefore be a key measure for evaluating software analytics"
    • Tantithamthavorn et al., Actionable Analytics: Stop Telling Me What It Is; Please Tell Me What To Do, IEEE Software 2021: "Explainable and actionable software analytics is urgently and critically needed"
    • Choetkiertikul et al., A deep learning model for estimating story points, TSE'19: "Explainability of a model is important for full adoption of machine learning techniques"
  132. Explainable AI is much needed in software engineering (including other SE tasks), but remains largely unexplored.
  133-138. Explainability and Explanation
    • Explainability is the degree to which a human can understand the reasons behind a prediction.
    • Explanation serves as an interface between humans and machines. [Diagram: Complex AI Models → XAI (Explainable AI Methods) → Explanation → Users.]
    • An explanation is an answer to a why-question (Tim Miller, 2018).
    • Challenge 1: The effectiveness of an explanation depends on the questions asked.
    • Challenge 2: One explanation can be presented in various scopes and forms to serve a goal.
  139. Different Techniques = Different Explanations (Global vs. Local). Different techniques produce different forms of explanations with different information. (Practitioners Perceptions of the Goals and Visual Explanations of Defect Prediction Models, MSR 2021.)
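To illustrate how scope alone changes the explanation, the sketch below (again on synthetic data with assumed metric names) derives a global explanation for a defect model via permutation importance: a single ranking of metrics over the whole test set, which may differ from the per-file conditions that a local technique such as LIME (sketched earlier) returns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same kind of synthetic defect data and model as in the earlier sketch.
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["LOC", "CodeChurn", "NumDevelopers",
                             "CyclomaticComplexity"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Global explanation: one ranking of metrics for the model as a whole.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(sorted(zip(X.columns, result.importances_mean),
             key=lambda kv: kv[1], reverse=True))

# A local technique (e.g., LIME) would instead return per-file conditions and
# weights, which may order the metrics differently for an individual prediction.
```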
  140-146. Different stakeholders have different questions; however, most XAI toolkits are algorithmic-driven, not yet human-centred. [Diagram: Complex AI Models → Predictions → "Why am I getting this prediction? How can I get a better decision?"]
    • Software Engineers: Why is a file/commit predicted as defective? What should be done to make it better?
    • AI Experts: Which AI models should I select? How do I monitor and debug these models?
    • Domain Experts: Are the AI models learned correctly? Is the AI model's logic reasonable?
    • Business Leaders / Policy Makers: Are the predictions/recommendations/text generations of AI models aligned with company values? Do AI models recommend tasks to developers fairly? Are AI models ready to be deployed?
    • Regulators: Do AI models (e.g., developer productivity predictions) conform with laws/regulations like GDPR? Is software development productivity estimated from gender, age, race, or marital status?
  147-150. A Human-Centric XAI Recipe (Do-Re-Mi). To make AI more explainable, we need to:
    • (Step 1) Domain analysis to understand the AI problem, social contexts, and stakeholders. Stakeholders: Who do we want to explain to? e.g., developers.
    • (Step 2) Requirement elicitation to understand practitioners' needs. Goals: What is their purpose? e.g., gaining deeper insights. Questions: What do we want to explain? e.g., why is a file/commit predicted as defective?
    • (Step 3) Multimodal explanation design (see the sketch after this list). Scopes: global level, local level (the instance to be explained). AI Models: What kind of AI model are we trying to explain? e.g., classification, regression, NLP, etc. Forms: variable importance, rules, integrated gradients, example-based, attention, heatmaps. XAI Techniques: LIME, LORE, SHAP, Anchors, PDP, DiCE, surrogate models.
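One way to make Step 3 concrete is sketched below: a small lookup that maps an elicited stakeholder question (from Step 2) to a scope, a form, and candidate techniques. Both the question strings and the mapping are illustrative assumptions made for this sketch, not a fixed taxonomy from the tutorial.

```python
# Illustrative mapping from stakeholder questions (Step 2) to explanation
# designs (Step 3); the questions and choices below are assumptions, not a
# prescribed catalogue.
EXPLANATION_DESIGNS = {
    "Why is this file/commit predicted as defective?": {
        "scope": "local",
        "form": "variable importance / rule",
        "techniques": ["LIME", "SHAP", "Anchors"],
    },
    "What should be changed to lower the defect risk?": {
        "scope": "local",
        "form": "counterfactual / guidance rule",
        "techniques": ["DiCE", "SQAPlanner"],
    },
    "Which characteristics are associated with defects overall?": {
        "scope": "global",
        "form": "variable importance / partial dependence",
        "techniques": ["ANOVA", "permutation importance", "PDP"],
    },
}


def select_design(question: str) -> dict:
    """Return the explanation design for a stakeholder question, if known."""
    return EXPLANATION_DESIGNS.get(
        question,
        {"scope": "unknown", "form": "unknown", "techniques": []},
    )


print(select_design("Why is this file/commit predicted as defective?"))
```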