
Searching for Cost Effective Test Inputs for DNN Testing


These are the slides for my keynote talk at DeepTest 2020 (https://deeptestconf.github.io).

Shin Yoo

July 01, 2020


Transcript

  1. Shin Yoo | COINSE, KAIST Searching for Cost Effective Test

    Inputs for DNN Testing Keynote | DeepTest 2020 1
  2. This is a bleeding-edge field that is also moving very

    fast: if there is something I completely missed, kindly enlighten me later! 3
  14. Rewinding back to ICSE 2019 What did I say (SBST

    2019 Keynote)? Traditional Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? 5
  20. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (MODE@FSE’18, Apricot@ASE’19) 6
  21. • Neither a grand unifying theory, a one-stop solution, nor

    a visionary perspective… but what I know best, which is: • Advances made by COINSE and collaborators recently :) • Acknowledgements: my students, Prof. Robert Feldt, R&D Division at Hyundai Motors Company Today’s Topic James Laughlin (1914-1997) 7
  26. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (MODE@FSE’18, Apricot@ASE’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair 8
  30. Cost Effective Test Input for DNNs Will reveal unexpected behaviours

    better Will help reduce the development cost of DNN based systems Should be possible to identify for various DNN models Better if possible to sample or search freely Will be also helpful for correcting the unexpected behaviour 9
  32. Our guide to cost effectiveness • Intuitively speaking, SA is

    a distance metric that measures out-of-distribution-ness. • An input is more OOD if: • a similar neural activation has been rarely seen • its neural activation pattern is far from the observed mean • its neural activation is closer to inputs in other class labels than its own • More OOD = More likely to misbehave Surprise Adequacy1 1. J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. Will reveal unexpected behaviours better 10
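The intuition above can be made concrete. Below is a minimal, pure-Python sketch of Distance-based Surprise Adequacy (DSA) from the paper, run on toy 2-D "activation traces"; the real implementation operates on layer activations of a trained DNN, and the function name and data here are my own illustrations.

```python
import math

def dsa(activation, train_acts, train_labels, predicted_label):
    """Sketch of Distance-based Surprise Adequacy (Kim et al., ICSE 2019).

    dist_a: distance to the nearest training activation with the same label.
    dist_b: distance from that neighbour to the nearest activation of any
    other label. DSA = dist_a / dist_b; larger means more surprising (OOD).
    """
    def euclid(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    same = [a for a, l in zip(train_acts, train_labels) if l == predicted_label]
    other = [a for a, l in zip(train_acts, train_labels) if l != predicted_label]
    nearest_same = min(same, key=lambda a: euclid(activation, a))
    dist_a = euclid(activation, nearest_same)
    dist_b = min(euclid(nearest_same, a) for a in other)
    return dist_a / dist_b
```

An input sitting inside a cluster of same-label training activations gets a small DSA; one drifting towards another class gets a large DSA.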
  33. Identifying cost effective inputs for various DNN models Does our

    guide (SA) work for DNN models that are not image classifiers? 11
  34. I suspect images form a very continuous and smooth input

    space. The vast majority of DNN testing research is done on image classifiers. Are we overfitting? 12
  35. (The space of images is inherently continuous) • A small

    perturbation to the original image should not result in a vastly different execution of a DNN. • A small perturbation to the original image should not result in a vastly different execution of a DNN. • A small perturbation to the original image should result in a vastly different execution of a DNN. • See what I did there? :) The Continuity Assumption 5 5 five fave 13
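The continuity assumption is exactly what metamorphic testing of DNNs relies on: a small enough perturbation should leave the prediction unchanged. A minimal runnable sketch, with a stand-in classifier (a brightness threshold instead of a real DNN) and a stand-in perturbation; all names here are illustrative, not from any paper:

```python
import random

def metamorphic_violations(classify, image, perturb, trials=100, seed=0):
    """Count how often a small perturbation flips the prediction.
    Under the continuity assumption, this should stay near zero."""
    rng = random.Random(seed)
    baseline = classify(image)
    return sum(classify(perturb(image, rng)) != baseline
               for _ in range(trials))

# Stand-ins for a real DNN and a real metamorphic perturbation operator.
def toy_classify(image):
    return 1 if sum(image) / len(image) > 0.5 else 0

def tiny_noise(image, rng):
    return [min(1.0, max(0.0, p + rng.uniform(-0.01, 0.01))) for p in image]
```

With a real model, a non-zero violation count on tiny perturbations is precisely the "five → fave" failure the slide jokes about.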
  38. Evaluating Surprise Adequacy on Question Answering Seah Kim and Shin

    Yoo DeepTest 2020 We apply SA analysis to the Question Answering task.
 Stay tuned for the talk, which is right after this keynote :) Should be possible to identify for various DNN models 14
  39. [Fig. 2 from the paper: accuracy of test inputs in

    MNIST and CIFAR-10, selected from the input with the lowest SA, increasingly including inputs with higher SA, and vice versa (i.e., from the input with the highest SA to inputs with lower SA); panels (c) and (d) show inputs selected by LSA and DSA in CIFAR-10.] If we feed more surprising inputs first, classification accuracy suffers, and vice versa.
 In other words, we can identify the inputs that induce unexpected behaviours pretty reliably. What can we do with this? J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. 16
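The analysis behind that figure can be sketched in a few lines: order the test inputs by their SA and track cumulative accuracy as more inputs are included. The function and toy data below are my own illustration, not the paper's code:

```python
def accuracy_curve(sa_scores, correct, descending=True):
    """Order test inputs by SA and report cumulative accuracy.
    Feeding the most surprising inputs first should make the curve
    start low; feeding the least surprising first, start high."""
    order = sorted(range(len(sa_scores)), key=lambda i: sa_scores[i],
                   reverse=descending)
    curve, hits = [], 0
    for n, i in enumerate(order, 1):
        hits += correct[i]          # correct[i] is 1 if input i was classified correctly
        curve.append(hits / n)
    return curve
```

Both orderings necessarily end at the same overall accuracy; it is the shape of the curve that shows whether SA separates failing inputs from passing ones.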
  43. Then we started an industry collaboration. On semantic segmentation for

    autonomous driving. Fisheye Camera View Lanes Road Vehicle Semantic Segmentation 17
  44. What drives the development cost of DNN based systems? “The

    Surprising Truth About What it Takes to Build a Machine Learning Product” Josh Logan (Tech Lead, Cloud AI group at Google)
 https://medium.com/thelaunchpad/the-ml-surprise-f54706361a6c 18
  47. Reducing DNN Labelling Cost using Surprise Adequacy: An Industrial Case

    Study for Autonomous Driving Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo https://arxiv.org/abs/2006.00894 Surprise Adequacy Semantic Segmentation ? Surprise Adequacy ? We collaborated with R&D Division at Hyundai Motors Group to investigate two research questions. 19
  50. IoU is the standard evaluation metric. [Diagram: DNN segmentation

    vs. actual object, with regions A, B, and C] IoU = B / (A + B + C) Intersection over Union 20
  51. SA works with semantic segmentation (With some architectural adjustments -

    see paper for details) SA and IoU (Intersection over Union) are negatively correlated: each dot shows average SA of class pixels as well as the class IoU. 21
  52. Human Label Inferred Segmentation Raw Input Frame SA > threshold

    SA Heatmap Result Diff 22 https://www.youtube.com/watch?v=N7wKFx8pcsU
  53. “So what about the cost?” Extremely low SA ≃ model

    highly likely to be correct ≃ no need to label…? 23
  56. IoU Inaccuracy Tradeoff 1) If we do not label x%

    of images with lowest SA, and accept the inferred segmentation as IoU = 1.0, 2) the actual IoU for a class will be y off from 1.0. 3) 30% saving at the cost of IoU inaccuracy of 0.03! 24
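The tradeoff can be sketched as a tiny simulation: skip labelling the lowest-SA fraction of images, accept the model's output as ground truth (IoU = 1.0), and measure the IoU inaccuracy actually incurred on the skipped images. Everything below is an illustrative stand-in, not the study's code:

```python
def labelling_tradeoff(sa_scores, ious, skip_fraction):
    """Skip labelling the skip_fraction of images with the lowest SA,
    treat their inferred segmentation as IoU = 1.0, and return the
    mean IoU inaccuracy incurred on the skipped images."""
    ranked = sorted(zip(sa_scores, ious))      # lowest SA first
    k = int(len(ranked) * skip_fraction)
    skipped = ranked[:k]
    return sum(1.0 - iou for _, iou in skipped) / k if k else 0.0
```

Because low SA correlates with high IoU, the skipped images are exactly those where the model was most likely right, which is why the study could trade ~30% of labelling cost for only ~0.03 of IoU inaccuracy.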
  58. Classification Tradeoff If only images with IoU < 0.5 are

    “bugs”… Not labelling 55% of images, and we still lose 5% of images whose Road Marker IoUs are less than 0.5. Will help reduce the development cost of DNN based systems 25
  59. SA vs. AL • SA-based partial labelling is conducted

    post-training to reduce evaluation cost. • Active Learning is conducted during training to reduce training cost. A Mirror Image 26
  60. Navigating the space of possible inputs to be more cost

    effective Can we be free from the metamorphic perturbations? 27
  67. A Live Poll Please visit https://sli.do and use event number

    58054 (no longer active). 28 These are all inputs that cause a trained MNIST classifier to misbehave. 
 
 Which of the following MNIST inputs do you think have NOT been generated by a machine? Mark all that you think qualify. 1 2 3 5 4 Sorry, I’ve tricked all of you - they are all generated by a machine. But I claim that some are more realistic than others.
  70. A Live Poll Please visit https://sli.do and use event number

    58054 (no longer active). 28 These are all inputs that cause a trained MNIST classifier to misbehave. 
 
 Which of the following MNIST inputs do you think have NOT been generated by a machine? Mark all that you think qualify. 1 2 3 5 4 C&W Pixel Level Optimisation SINVAD SINVAD FGSM
  81. Unlike primitive data, DNN input space is hard to navigate.

    How do we make it more explorable? Occlusion Darken Different Weather Condition Seed Boundary of correct functional behaviour How can we more freely navigate this space? Parameterised Model (Riccio & Tonella, 2020) Variational Autoencoder (Kang et al., 2020) Add Noise 29
  83. SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test

    Input Generation Sungmin Kang, Robert Feldt, and Shin Yoo https://arxiv.org/abs/2005.09296 (SBST 2020) [Figure: VAE encoder/decoder and the trajectory of AT through latent-space interpolation (green: 4, red: 9)] Please watch the SBST presentation (https://www.youtube.com/watch?v=_psDl3wUh-4) and participate in the online interactive session - 13:00 UTC, 2nd July 2020. Better if possible to sample or search freely 30
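The core of the idea can be sketched as a search loop over latent vectors: mutate z rather than pixels, so every candidate decodes to something on the learned manifold. The decoder and fitness below are trivial stand-ins (identity and a distance score) just to make the loop runnable; in SINVAD proper they would be the trained VAE decoder and, e.g., a misclassification-inducing fitness:

```python
import random

def latent_search(decode, fitness, dim, steps=200, sigma=0.3, seed=0):
    """SINVAD-style sketch: hill-climb in the VAE's latent space.
    decode maps a latent vector to an image; fitness scores the image
    (higher = closer to the behaviour we want to trigger)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(dim)]       # sample a starting point
    best = fitness(decode(z))
    for _ in range(steps):
        cand = [zi + rng.gauss(0, sigma) for zi in z]  # mutate in latent space
        f = fitness(decode(cand))
        if f > best:
            z, best = cand, f
    return z, best
```

The payoff over pixel-level mutation is that the decoder constrains every candidate to look like training data, which is why the SINVAD inputs in the poll were hard to spot as machine-generated.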
  84. Helping to improve the behaviour of a DNN model What

    corresponds to patching the code? 31
  85. Retraining has been expensive and indirect. • MODE (Ma et

    al., FSE 2018) uses a GAN to augment training data with inputs similar to those that induce misbehaviour. • Apricot (Zhang and Chan, ASE 2019) trains multiple sub-DNNs, and adjusts the weights of the original DNN w.r.t. the sub-DNNs. Data = $ Training = 33
  86. Is it feasible to do direct repair à la APR?

    Code → AST → GP DNN → Neural Weights → ? 34
  87. Imagine each starling as a DNN (in a high-dimensional

    space) and the whole murmuration as a search process for the best set of neural weights. 35
  92. Search Based Repair of Deep Neural Networks Jeongju Sohn, Sungmin

    Kang, and Shin Yoo https://arxiv.org/abs/1912.12463 Misbehaviour Localise Understand Patch Test Again Choose weights of neurons that 1) are most responsible for misbehaviour, and 2) are most influential to the outcome Use both positive and negative test cases, as in GenProg Arachne 36
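The loop can be sketched on a deliberately tiny "model": localise one suspicious weight, then search for a value that makes the negative tests pass without breaking the positive ones. Plain random search stands in for Arachne's actual search algorithm, and the model, names, and data below are illustrative only:

```python
import random

def repair_weight(model_fn, weights, idx, pos_tests, neg_tests,
                  steps=300, seed=0):
    """Arachne-style sketch: perturb only the localised weight (idx),
    guided by both positive and negative test cases, as in GenProg."""
    def fitness(ws):
        # Number of tests whose expected label the model reproduces.
        return sum(model_fn(ws, x) == y for x, y in pos_tests + neg_tests)

    rng = random.Random(seed)
    best_ws, best_f = list(weights), fitness(weights)
    for _ in range(steps):
        cand = list(best_ws)
        cand[idx] += rng.gauss(0, 0.5)   # mutate the suspicious weight only
        f = fitness(cand)
        if f > best_f:
            best_ws, best_f = cand, f
    return best_ws, best_f
```

Restricting the patch to the localised weights is what keeps the repair precise: the rest of the network, and hence behaviour on other labels, is untouched.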
  93. • Arachne can re-adjust the class boundary between a specific

    pair (top plot), or multiple pairs (bottom plot). • The readjustment generalises to test data! • When repairing a single “bug” (misclassification label pair), there is relatively minor impact on other labels. Direct Repair seems to work. Repairing the most frequent type of misbehaviour of CIFAR-10 Repairing multiple types of misbehaviour of CIFAR-10 37
  95. • Retraining is more disruptive • However, it is also

    more widely impactful and repairs more instances • I still believe there may be a nice use case for a direct repair like this… Direct repair can be more precise. Sampled from the most frequent type of misbehaviour in CIFAR-10 (Ineg) Avg of patched and broken inputs per label after the repair by Arachne Avg of patched and broken inputs per label after retraining Will be also helpful for correcting the unexpected behaviour 38
  100. Will help reduce the development cost of DNN based systems

    Should be possible to identify for various DNN models Better if possible to sample or search freely Will be also helpful for correcting the unexpected behaviour What next? 39
  101. Road Ahead Random thoughts on what we need… We need

    more systematic guidelines for retraining. We should start thinking about benchmarks now. Diversify subjects beyond image classifiers. 40
  102. Recommendations for Going Forward …on a higher, strategic level Contextualise

    firmly in SE practice. Pay close attention to ML literature. Take replicability and transparency seriously. 41
  103. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (Apricot - ASE ’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair [email protected]
 https://coinse.io 42
  104. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (Apricot - ASE ’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair Our guide to cost effectiveness • Intuitively speaking, SA is a distance metric that measures out-of-distribution- ness. • An input is more OOD if: • A similar neural activation has been rarely seen • its neural activation pattern is far from the observed mean • its neural activation is closer to inputs in other class labels • More OOD = More likely to misbehave Surprise Adequacy1 1. J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41th International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. Will reveal unexpected behaviours better [email protected]
 https://coinse.io 42
  105. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (Apricot - ASE ’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair Our guide to cost effectiveness • Intuitively speaking, SA is a distance metric that measures out-of-distribution- ness. • An input is more OOD if: • A similar neural activation has been rarely seen • its neural activation pattern is far from the observed mean • its neural activation is closer to inputs in other class labels • More OOD = More likely to misbehave Surprise Adequacy1 1. J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41th International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. Will reveal unexpected behaviours better Evaluating Surprise Adequacy on Question Answering Seah Kim and Shin Yoo DeepTest 2020 We apply SA analysis to Question Answering task.
 Stay tuned for the talk, which is right after this keynote :) Should be possible to generate for various DNN models [email protected]
 https://coinse.io 42
  106. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Classification Tradeoff
    If only images with IoU < 0.5 are "bugs"… Not labelling 55% of images, and we still lose 5% of images whose Road Marker IoUs are less than 0.5.
    Will help reduce the development cost of DNN-based systems
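The IoU threshold used as the "bug" criterion above can be computed directly from the predicted and ground-truth segmentation masks. A minimal sketch (function names are mine; the slide does not prescribe an implementation):

```python
import numpy as np

def mask_iou(pred, truth):
    """Intersection over Union of two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, truth).sum() / union

def is_bug(pred, truth, threshold=0.5):
    """The slide's criterion: an output counts as a bug when IoU < 0.5."""
    return bool(mask_iou(pred, truth) < threshold)
```

For example, a prediction that covers the whole ground truth plus an equal amount of spurious area scores exactly IoU = 0.5 and would sit right on the bug boundary.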
  107. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation
    Sungmin Kang, Robert Feldt, and Shin Yoo
    https://arxiv.org/abs/2005.09296 (SBST 2020)
    [Figure: VAE encoder/decoder; trajectory of AT through interpolation (green: 4, red: 9), VAE vs. raw pixel space]
    Please watch the SBST presentation (https://www.youtube.com/watch?v=_psDl3wUh-4) and participate in the online interactive session - 13:00 UTC, 2nd July 2020.
    Better if possible to sample or search freely
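SINVAD's core idea is to search the VAE latent space instead of raw pixel space, so that every candidate decodes to a plausible image. A heavily simplified sketch of that loop, using random sampling around the seed's latent code rather than the paper's genetic search; `encode`, `decode`, and `classify` are stand-ins for a trained VAE and the classifier under test, and the fitness (latent distance to the seed, once the label flips) is my simplification of the paper's objective:

```python
import numpy as np

def latent_search(seed_img, encode, decode, classify,
                  trials=200, sigma=0.5, rng=None):
    """Look for a decoded input that the classifier labels differently
    from the seed, preferring candidates close to the seed's latent code."""
    if rng is None:
        rng = np.random.default_rng(0)
    z0 = encode(seed_img)
    seed_label = classify(decode(z0))
    best_z, best_dist = None, np.inf
    for _ in range(trials):
        # Perturb in latent space, so the decoded candidate stays plausible.
        cand = z0 + rng.normal(0.0, sigma, size=z0.shape)
        if classify(decode(cand)) != seed_label:
            dist = np.linalg.norm(cand - z0)
            if dist < best_dist:  # keep the flip nearest to the seed
                best_z, best_dist = cand, dist
    return None if best_z is None else decode(best_z)
```

Because the search never leaves the latent space, the resulting boundary-crossing inputs look like the interpolation trajectory on the slide rather than adversarial pixel noise.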
  108. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    • Retraining is more disruptive
    • However, it is also more widely impactful and repairs more instances
    • I still believe there may be a nice use case for a direct repair like this…
    Direct repair can be more precise.
    (I_neg sampled from the most frequent type of misbehaviour in CIFAR)
    Avg. of patched and broken inputs per label after the repair by Arachne
    Avg. of patched and broken inputs per label after retraining
    Will be also helpful for correcting the unexpected behaviour
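The trade-off above can be made concrete with a toy sketch of Arachne-style direct repair: score candidate values for one suspicious weight by how many negative (misclassified) inputs they patch minus how many positive (correct) inputs they break. This assumes the fault-localisation step has already picked the weight, and uses exhaustive 1-D search instead of Arachne's differential evolution; `predict` and all names are my own stand-ins, not Arachne's API:

```python
import numpy as np

def repair_weight(predict, weights, idx, pos_x, pos_y, neg_x, neg_y,
                  candidates=np.linspace(-2.0, 2.0, 81)):
    """Search values for weights[idx] that fix negatives without
    breaking positives; predict(weights, x) is the model under repair."""
    def score(w):
        patched = sum(predict(w, x) == y for x, y in zip(neg_x, neg_y))
        kept = sum(predict(w, x) == y for x, y in zip(pos_x, pos_y))
        return patched + kept
    best, best_s = weights.copy(), score(weights)
    for v in candidates:
        cand = weights.copy()
        cand[idx] = v
        s = score(cand)
        if s > best_s:  # only accept strictly better repairs
            best, best_s = cand, s
    return best
```

Unlike retraining, nothing outside the chosen weight moves, which is why a direct repair can be more precise but less widely impactful.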
  109. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Road Ahead
    Random thoughts on what we need…
    We need more systematic guidelines for retraining.
    We should start thinking about benchmarks now.
    Diversify subjects beyond image classifiers.
  110. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Recommendations for Going Forward
    …on a higher, strategic level
    Contextualise firmly in SE practice.
    Pay close attention to ML literature.
    Take replicability and transparency seriously.