Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study

Mehil B Shah [email protected]
Masud Rahman [email protected]
Foutse Khomh [email protected]
RAISE Lab (Intelligent Automation in Software EngineeRing)

Deep Learning Bugs in Real Life
Mehil Shah, Dalhousie University, 30 June 2025

Steps of Debugging
• Bug Localization
• Bug Reproduction
• Bug Fixing
Only 3% of deep learning bugs are reproducible.

Motivating Example
Question not resolved as of today (~5 years)

Related Work
• Mondal et al., MSR’19: Investigates the reproducibility of software bugs through edit actions.
• Rahman et al., ICSME’20: Examines the challenges behind the non-reproducibility of software bugs.
• Chen et al., ICSE’22: Provides guidelines for the reproducibility of DL models.
• Morovati et al., EMSE’23: Introduces the first benchmark dataset for deep learning bugs.
Existing work does not focus on understanding the challenges of reproducing deep learning bugs or on providing guidelines for improving their reproducibility.

Schematic Diagram
Diagram stages: Defects4ML, Stack Overflow Posts, Filtration and Sampling, Dataset Construction, Critical Information, Edit Actions, Edit Actions and Type of Bugs, Actionable Insights, Developer Study
• Filtration uses several criteria: keywords, timeline, presence of an accepted answer, removal of queries with keywords, and presence of a code snippet.
• A taxonomy from existing literature determines the tags for each type of bug, e.g., model and layer tags for model bugs; loss-function and training tags for training bugs.
• Final dataset: 568 bugs from Stack Overflow + 100 bugs from Defects4ML.
• Bugs selected for reproduction: 165; bugs reproduced: 148.
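The filtration step can be pictured as a chain of simple predicates over Stack Overflow posts. The sketch below is only an illustration: the `Post` fields, the excluded keywords, the time window, and the HTML `<code>` check are all assumptions made for the example, and only the four tag-to-bug-type pairs come from the slide.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Tag examples taken from the slide; a full taxonomy would be larger.
TAG_TO_BUG_TYPE = {
    "model": "Model",
    "layer": "Model",
    "loss-function": "Training",
    "training": "Training",
}

# Hypothetical values: the slide does not list the keywords or the timeline.
EXCLUDED_KEYWORDS = ("how to", "what is")
WINDOW_START, WINDOW_END = date(2015, 1, 1), date(2023, 12, 31)


@dataclass
class Post:
    """Hypothetical, simplified view of a Stack Overflow question."""
    title: str
    body: str
    tags: tuple
    created: date
    has_accepted_answer: bool


def bug_type(post: Post) -> Optional[str]:
    """Return the bug type implied by the first matching tag, if any."""
    for tag in post.tags:
        if tag in TAG_TO_BUG_TYPE:
            return TAG_TO_BUG_TYPE[tag]
    return None


def keep(post: Post) -> bool:
    """Apply every filtration criterion; a post must pass all of them."""
    title = post.title.lower()
    return (
        bug_type(post) is not None                       # tag / keyword match
        and WINDOW_START <= post.created <= WINDOW_END   # timeline
        and post.has_accepted_answer                     # accepted answer
        and not any(k in title for k in EXCLUDED_KEYWORDS)  # drop how-to queries
        and "<code>" in post.body                        # assumes an HTML body
    )


posts = [
    Post("Loss becomes NaN after first epoch", "<code>model.fit(x, y)</code>",
         ("keras", "loss-function"), date(2020, 5, 1), True),
    Post("How to add a layer?", "<code>Dense(10)</code>",
         ("layer",), date(2020, 5, 1), True),
    Post("Shape mismatch in Dense layer", "no snippet here",
         ("layer",), date(2021, 3, 9), True),
]
selected = [p for p in posts if keep(p)]
```

Here only the first post survives: the second is dropped as a how-to query and the third has no code snippet.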

Research Questions
• RQ1: What are the edit actions that can be used to reproduce deep learning bugs?
• RQ2: What types of component information and edit actions are useful for reproducing specific types of deep learning bugs?
• RQ3: How do the suggested edit actions and information affect the reproducibility of deep learning bugs?

RQ1: Edit Actions
• Input Data Generation: Create simulated training data
• Neural Network Definition: Recreate the neural network
• Hyperparameter Initialization: Initialize the training hyperparameters
• Import Addition and Dependency Resolution: Add required imports
• Logging: Capture relevant information
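To make these five edit actions concrete, here is a minimal sketch of what a reproduction script might look like when a bug report gives only a partial training snippet. It is not taken from the study: the single-unit "network", the hyperparameter values, and the synthetic data are all assumptions, and the example uses only the Python standard library so that it runs without any deep learning framework.

```python
# Edit action: import addition. Everything used below is imported here.
import logging
import math
import random

# Edit action: logging. Record what is needed to confirm the reported symptom.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("repro")

# Edit action: hyperparameter initialization (assumed values, since the
# hypothetical report does not state them).
LEARNING_RATE = 0.5
EPOCHS = 50
N_SAMPLES = 200
SEED = 0


def generate_data(n, seed):
    """Edit action: input data generation (simulated, linearly separable data)."""
    rng = random.Random(seed)
    xs = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]
    ys = [1.0 if x1 + x2 > 0 else 0.0 for x1, x2 in xs]
    return xs, ys


def forward(params, x):
    """Edit action: neural network definition (one sigmoid unit as a stand-in)."""
    w1, w2, b = params
    return 1.0 / (1.0 + math.exp(-(w1 * x[0] + w2 * x[1] + b)))


def train(xs, ys, lr, epochs):
    """Full-batch gradient descent on binary cross-entropy."""
    params = [0.0, 0.0, 0.0]
    losses = []
    n = len(xs)
    for epoch in range(epochs):
        grads = [0.0, 0.0, 0.0]
        loss = 0.0
        for x, y in zip(xs, ys):
            p = forward(params, x)
            loss -= y * math.log(p + 1e-12) + (1.0 - y) * math.log(1.0 - p + 1e-12)
            grads[0] += (p - y) * x[0]
            grads[1] += (p - y) * x[1]
            grads[2] += p - y
        params = [w - lr * g / n for w, g in zip(params, grads)]
        losses.append(loss / n)
        if epoch % 10 == 0:
            log.info("epoch=%d loss=%.4f", epoch, losses[-1])
    return params, losses


xs, ys = generate_data(N_SAMPLES, SEED)
params, losses = train(xs, ys, LEARNING_RATE, EPOCHS)
```

In a real reproduction the stand-in unit would be replaced by the architecture described in the bug report, and the logged values would be compared against the symptom the reporter observed.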

RQ1: Edit Actions
• Obsolete Parameter Removal: Remove the outdated parameters
• Version Migration: Update code to latest version
• Dataset Procurement: Acquire the necessary datasets
• Downloading Models & Tokenizers: Fetch the required assets
• Compiler Error Resolution: Resolve compiler errors

RQ2: Critical Information
• Data – 77.4% (e.g., data shape, data distribution)
• Neural Network – 58.1% (e.g., neural network architecture)
• Hyperparameters – 47.9% (e.g., batch size, loss, epochs)
• Training Code Snippet – 82.1% (e.g., model training, evaluation code)
• Logs – 87.6% (e.g., compiler logs, training logs)

RQ2: Apriori Analysis
Rule form: Bug type → Component information present in the corresponding bug report, or edit action used to reproduce the bug.
Sample transactions:
• M → LNC (Model bug → the original bug report has information about the logs, the neural network, and the code snippet.)
• T → IC (Training bug → the bug was reproduced using the edit actions of input data generation and compiler error resolution.)
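As a rough illustration of how such rules are scored, the sketch below computes support and confidence for "bug type → item" rules over a handful of toy transactions. The transactions are made up for the example and are not data from the study; the item codes follow the slide's style (L = logs, N = neural network, C = code snippet), while D (data) and H (hyperparameters) are codes assumed here. A full Apriori run would also enumerate larger itemsets, which this sketch omits.

```python
# Toy transactions: (bug type, set of component-information codes).
# Made up for illustration only.
transactions = [
    ("M", {"L", "N", "C"}),
    ("M", {"L", "C"}),
    ("M", {"N", "D"}),
    ("T", {"D", "C"}),
    ("T", {"D", "L", "C"}),
    ("T", {"D", "H"}),
    ("T", {"L"}),
]


def support(transactions, bug, item):
    """Fraction of all transactions that contain both the bug type and the item."""
    both = sum(1 for b, items in transactions if b == bug and item in items)
    return both / len(transactions)


def confidence(transactions, bug, item):
    """Among transactions of this bug type, the fraction that contain the item."""
    of_type = [items for b, items in transactions if b == bug]
    return sum(1 for items in of_type if item in items) / len(of_type)


def rules(transactions, min_confidence):
    """List (bug, item, confidence) rules at or above the threshold, best first."""
    bugs = sorted({b for b, _ in transactions})
    all_items = sorted(set().union(*(items for _, items in transactions)))
    found = [
        (bug, item, confidence(transactions, bug, item))
        for bug in bugs
        for item in all_items
    ]
    kept = [r for r in found if r[2] >= min_confidence]
    return sorted(kept, key=lambda r: (-r[2], r[0], r[1]))


for bug, item, conf in rules(transactions, min_confidence=0.6):
    print(f"{bug} -> {item}: confidence={conf:.2f}")
```

With these toy transactions, three of the four training bugs mention data, so the rule T → D has a confidence of 0.75.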

RQ2: Component Information and Deep Learning Bugs
• Training: Code Snippet (0.86), Data (0.82), Logs (0.76)
• Model: Logs (0.79), Code Snippet (0.71), Model (0.64)
• Tensor: Data (0.96), Logs (0.93), Code Snippet (0.72)
• API: Logs (0.85), Code Snippet (0.75), Neural Network (0.70)

RQ2: Association between Bugs and Edit Actions
• Training: Input Data Generation (0.5625), Import Addition (0.4583), Compiler Error Resolution (0.3750), Dataset Procurement (0.3542), Hyperparameter Initialization (0.3333)
• Model: Hyperparameter Initialization (0.5142), Dataset Procurement (0.4390), Compiler Error Resolution (0.4146), Import Addition (0.4146), Neural Network Construction (0.3659)

RQ3: User Study
• User Study with 22 Participants: 10 from academia, 12 from industry.
• Questionnaire Preparation: 4 sets of 2 bugs each, based on difficulty (e.g., Set 1 – easy training bug + difficult API bug, Set 2 – easy model bug + difficult training bug).
• Creating the Control and Experimental Groups: Divide the study participants into control and experimental groups, based on their experience.

RQ3: Workflow of User Study
• S1: Provide demographic information
• S2: Describe the challenges faced while reproducing deep learning bugs
• S3: Perform the bug reproduction with hints (experimental group) or without hints (control group)
• S4: Report the edit actions and component information used
• S5: Provide the rationale behind the edit action(s) and component information used
• S6: Provide information about any other edit actions used that were not covered by our findings

RQ3: Improvement in Reproducibility Rate
Average Improvement in Reproducibility: 22.92%

RQ3: Reduction in Time Spent
Average Decrease in Time to Reproduce: 24.35%

RQ3: Statistical Significance using GLM
Variable        Estimate   Std. Error   z value   Pr > |z|   Effect Size (OR)
Intercept       -3.4229    1.898        -1.803    0.071      -
DLBugFixExp_0    0.2251    0.856         0.263    0.793
DLBugFixExp_1   -0.2449    0.564        -0.435    0.664      0.782786
DLBugFixExp_2   -3.4032    2.283        -1.491    0.136      0.033268
Hints            3.0899    0.991         3.118    0.002      21.974308
DLExp            2.5448    1.725         1.475    0.140      12.740844
Field            0.1915    1.145         0.167    0.867      1.211112
Hints have a statistically significant impact on the reproduction of deep learning bugs, whereas the other factors do not play a significant role.
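The columns of this table are related by simple arithmetic, which the sketch below reproduces for three rows. It assumes the GLM is a logistic (logit-link) model, so that the odds ratio is the exponential of the estimate, and that the reported p-values are two-sided Wald tests; neither assumption is stated on the slide, although both are consistent with the reported numbers. The helper names are hypothetical.

```python
import math


def odds_ratio(estimate):
    """Odds ratio for a logit coefficient: OR = exp(estimate)."""
    return math.exp(estimate)


def z_value(estimate, std_error):
    """Wald statistic: the estimate divided by its standard error."""
    return estimate / std_error


def two_sided_p(z):
    """Two-sided p-value under a standard normal: P(|Z| > |z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# (estimate, standard error) pairs copied from the table.
rows = {
    "Hints": (3.0899, 0.991),
    "DLExp": (2.5448, 1.725),
    "Field": (0.1915, 1.145),
}

for name, (est, se) in rows.items():
    z = z_value(est, se)
    print(f"{name}: z={z:.3f} p={two_sided_p(z):.3f} OR={odds_ratio(est):.3f}")
```

Under these assumptions, an odds ratio of about 22 for Hints means that the odds of a successful reproduction were roughly 22 times higher with hints than without, holding the other variables fixed. Small differences from the table in the last digits are expected because the estimates shown are rounded.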

Summary
• 10 key edit actions for bug reproduction
• 5 types of component information for bug reproduction
• Better understanding of reproducibility for different types of DL bugs
• Developers reproduced 22.92% more bugs and spent 24.35% less time

Take Home Messages
• Reproducibility is a challenge for DL bugs, but it is an essential step in debugging.
• Our proposed edit actions and component information can help improve the reproducibility of deep learning bugs.
• Automated bug reproduction is the next step towards reliable DL-based systems.

Thank You! Questions?
Feel free to contact me at [email protected]
Replication Package, Preprint
