
Searching for Cost Effective Test Inputs for DNN Testing


These are the slides for my keynote talk at DeepTest 2020 (https://deeptestconf.github.io).

Shin Yoo

July 01, 2020


Transcript

  1. Shin Yoo | COINSE, KAIST Searching for Cost Effective Test

    Inputs for DNN Testing Keynote | DeepTest 2020 1
  2. This is a bleeding-edge field that is also moving very

    fast: if there is something I completely missed, kindly enlighten me later! 3
  14. Rewinding back to ICSE 2019 What did I say (SBST

    2019 Keynote)? Traditional Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? 5
  20. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (MODE@FSE’18, Apricot@ASE’19) 6
  21. • Neither a grand unifying theory, a one-stop solution, nor

    a visionary perspective… but what I know best, which is: • Advances made by COINSE and collaborators recently :) • Acknowledgements: my students, Prof. Robert Feldt, R&D Division at Hyundai Motors Company Today’s Topic James Laughlin (1914-1997) 7
  26. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (MODE@FSE’18, Apricot@ASE’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair 8
  30. Cost Effective Test Input for DNNs Will reveal unexpected behaviours

    better Will help reduce the development cost of DNN based systems Should be possible to identify for various DNN models Better if possible to sample or search freely Will be also helpful for correcting the unexpected behaviour 9
  32. Our guide to cost effectiveness • Intuitively speaking, SA is

    a distance metric that measures out-of-distribution-ness. • An input is more OOD if: • a similar neural activation has been rarely seen • its neural activation pattern is far from the observed mean • its neural activation is closer to inputs in other class labels than its own • More OOD = More likely to misbehave Surprise Adequacy1 1. J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. Will reveal unexpected behaviours better 10
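The intuition above can be made concrete. Below is a minimal, pure-Python sketch of Distance-based Surprise Adequacy (DSA) from the paper, run on toy 2-D "activation traces"; the real implementation operates on layer activations of a trained DNN, and the function name and data here are my own illustrations.

```python
import math

def dsa(activation, train_acts, train_labels, predicted_label):
    """Sketch of Distance-based Surprise Adequacy (Kim et al., ICSE 2019).

    dist_a: distance to the nearest training activation with the same label.
    dist_b: distance from that neighbour to the nearest activation of any
    other label. DSA = dist_a / dist_b; larger means more surprising (OOD).
    """
    def euclid(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    same = [a for a, l in zip(train_acts, train_labels) if l == predicted_label]
    other = [a for a, l in zip(train_acts, train_labels) if l != predicted_label]
    nearest_same = min(same, key=lambda a: euclid(activation, a))
    dist_a = euclid(activation, nearest_same)
    dist_b = min(euclid(nearest_same, a) for a in other)
    return dist_a / dist_b
```

An input sitting inside a cluster of same-label training activations gets a small DSA; one drifting towards another class gets a large DSA.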
  33. Identifying cost effective inputs for various DNN models Does our

    guide (SA) work for DNN models that are not image classifiers? 11
  34. I suspect images form a very continuous and smooth input

    space. The vast majority of DNN testing research is done on image classifiers. Are we overfitting? 12
  35. (The space of images is inherently continuous) • A small

    perturbation to the original image should not result in a vastly different execution of a DNN. • A small perturbation to the original image should not result in a vastly different execution of a DNN. • A small perturbation to the original image should result in a vastly different execution of a DNN. • See what I did there? :) The Continuity Assumption 5 5 five fave 13
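The continuity assumption is exactly what metamorphic testing of DNNs relies on: a small enough perturbation should leave the prediction unchanged. A minimal runnable sketch, with a stand-in classifier (a brightness threshold instead of a real DNN) and a stand-in perturbation; all names here are illustrative, not from any paper:

```python
import random

def metamorphic_violations(classify, image, perturb, trials=100, seed=0):
    """Count how often a small perturbation flips the prediction.
    Under the continuity assumption, this should stay near zero."""
    rng = random.Random(seed)
    baseline = classify(image)
    return sum(classify(perturb(image, rng)) != baseline
               for _ in range(trials))

# Stand-ins for a real DNN and a real metamorphic perturbation operator.
def toy_classify(image):
    return 1 if sum(image) / len(image) > 0.5 else 0

def tiny_noise(image, rng):
    return [min(1.0, max(0.0, p + rng.uniform(-0.01, 0.01))) for p in image]
```

With a real model, a non-zero violation count on tiny perturbations is precisely the "five → fave" failure the slide jokes about.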
  38. Evaluating Surprise Adequacy on Question Answering Seah Kim and Shin

    Yoo DeepTest 2020 We apply SA analysis to the Question Answering task.
 Stay tuned for the talk, which is right after this keynote :) Should be possible to identify for various DNN models 14
  39. [Fig. 2 from the paper: accuracy of test inputs in

    MNIST and CIFAR-10, selected from the input with the lowest SA, increasingly including inputs with higher SA, and vice versa (i.e., from the input with the highest SA to inputs with lower SA); panels (c) and (d) show inputs selected by LSA and DSA in CIFAR-10.] If we feed more surprising inputs first, classification accuracy suffers, and vice versa.
 In other words, we can identify the inputs that induce unexpected behaviours pretty reliably. What can we do with this? J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. 16
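The analysis behind that figure can be sketched in a few lines: order the test inputs by their SA and track cumulative accuracy as more inputs are included. The function and toy data below are my own illustration, not the paper's code:

```python
def accuracy_curve(sa_scores, correct, descending=True):
    """Order test inputs by SA and report cumulative accuracy.
    Feeding the most surprising inputs first should make the curve
    start low; feeding the least surprising first, start high."""
    order = sorted(range(len(sa_scores)), key=lambda i: sa_scores[i],
                   reverse=descending)
    curve, hits = [], 0
    for n, i in enumerate(order, 1):
        hits += correct[i]          # correct[i] is 1 if input i was classified correctly
        curve.append(hits / n)
    return curve
```

Both orderings necessarily end at the same overall accuracy; it is the shape of the curve that shows whether SA separates failing inputs from passing ones.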
  43. Then we started an industry collaboration. On semantic segmentation for

    autonomous driving. Fisheye Camera View Lanes Road Vehicle Semantic Segmentation 17
  44. What drives the development cost of DNN based systems? “The

    Surprising Truth About What it Takes to Build a Machine Learning Product” Josh Logan (Tech Lead, Cloud AI group at Google)
 https://medium.com/thelaunchpad/the-ml-surprise-f54706361a6c 18
  47. Reducing DNN Labelling Cost using Surprise Adequacy: An Industrial Case

    Study for Autonomous Driving Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo https://arxiv.org/abs/2006.00894 Surprise Adequacy Semantic Segmentation ? Surprise Adequacy ? We collaborated with R&D Division at Hyundai Motors Group to investigate two research questions. 19
  50. IoU is the standard evaluation metric. [Diagram: DNN segmentation

    vs. actual object, with regions A, B, and C] IoU = B / (A + B + C) Intersection over Union 20
  51. SA works with semantic segmentation (With some architectural adjustments -

    see paper for details) SA and IoU (Intersection over Union) are negatively correlated: each dot shows average SA of class pixels as well as the class IoU. 21
  52. Human Label Inferred Segmentation Raw Input Frame SA > threshold

    SA Heatmap Result Diff 22 https://www.youtube.com/watch?v=N7wKFx8pcsU
  53. “So what about the cost?” Extremely low SA ≃ model

    highly likely to be correct ≃ no need to label…? 23
  56. IoU Inaccuracy Tradeoff 1) If we do not label x%

    of images with lowest SA, and accept the inferred segmentation as IoU = 1.0, 2) the actual IoU for a class will be y off from 1.0. 3) 30% saving at the cost of IoU inaccuracy of 0.03! 24
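The tradeoff can be sketched as a tiny simulation: skip labelling the lowest-SA fraction of images, accept the model's output as ground truth (IoU = 1.0), and measure the IoU inaccuracy actually incurred on the skipped images. Everything below is an illustrative stand-in, not the study's code:

```python
def labelling_tradeoff(sa_scores, ious, skip_fraction):
    """Skip labelling the skip_fraction of images with the lowest SA,
    treat their inferred segmentation as IoU = 1.0, and return the
    mean IoU inaccuracy incurred on the skipped images."""
    ranked = sorted(zip(sa_scores, ious))      # lowest SA first
    k = int(len(ranked) * skip_fraction)
    skipped = ranked[:k]
    return sum(1.0 - iou for _, iou in skipped) / k if k else 0.0
```

Because low SA correlates with high IoU, the skipped images are exactly those where the model was most likely right, which is why the study could trade ~30% of labelling cost for only ~0.03 of IoU inaccuracy.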
  58. Classification Tradeoff If only images with IoU < 0.5 are

    “bugs”… Not labelling 55% of images, and we still lose 5% of images whose Road Marker IoUs are less than 0.5. Will help reduce the development cost of DNN based systems 25
  59. SA vs. AL • SA-based partial labelling is conducted

    post-training to reduce evaluation cost. • Active Learning is conducted during training to reduce training cost. A Mirror Image 26
  60. Navigating the space of possible inputs to be more cost

    effective Can we be free from the metamorphic perturbations? 27
  67. A Live Poll Please visit https://sli.do and use event number

    58054 (no longer active). 28 These are all inputs that cause a trained MNIST classifier to misbehave. 
 
 Which of the following MNIST inputs do you think have NOT been generated by a machine? Mark all that you think qualify. 1 2 3 5 4 Sorry, I’ve tricked all of you - they are all generated by a machine. But I claim that some are more realistic than others.
  70. A Live Poll Please visit https://sli.do and use event number

    58054 (no longer active). 28 These are all inputs that cause a trained MNIST classifier to misbehave. 
 
 Which of the following MNIST inputs do you think have NOT been generated by a machine? Mark all that you think qualify. 1 2 3 5 4 C&W Pixel Level Optimisation SINVAD SINVAD FGSM
  81. Unlike primitive data, DNN input space is hard to navigate.

    How do we make it more explorable? Occlusion Darken Different Weather Condition Seed Boundary of correct functional behaviour How can we more freely navigate this space? Parameterised Model (Riccio & Tonella, 2020) Variational Autoencoder (Kang et al., 2020) Add Noise 29
  83. SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test

    Input Generation Sungmin Kang, Robert Feldt, and Shin Yoo https://arxiv.org/abs/2005.09296 (SBST 2020) [Figure: VAE encoder/decoder and the trajectory of AT through latent-space interpolation (green: 4, red: 9)] Please watch the SBST presentation (https://www.youtube.com/watch?v=_psDl3wUh-4) and participate in the online interactive session - 13:00 UTC, 2nd July 2020. Better if possible to sample or search freely 30
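The core of the idea can be sketched as a search loop over latent vectors: mutate z rather than pixels, so every candidate decodes to something on the learned manifold. The decoder and fitness below are trivial stand-ins (identity and a distance score) just to make the loop runnable; in SINVAD proper they would be the trained VAE decoder and, e.g., a misclassification-inducing fitness:

```python
import random

def latent_search(decode, fitness, dim, steps=200, sigma=0.3, seed=0):
    """SINVAD-style sketch: hill-climb in the VAE's latent space.
    decode maps a latent vector to an image; fitness scores the image
    (higher = closer to the behaviour we want to trigger)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(dim)]       # sample a starting point
    best = fitness(decode(z))
    for _ in range(steps):
        cand = [zi + rng.gauss(0, sigma) for zi in z]  # mutate in latent space
        f = fitness(decode(cand))
        if f > best:
            z, best = cand, f
    return z, best
```

The payoff over pixel-level mutation is that the decoder constrains every candidate to look like training data, which is why the SINVAD inputs in the poll were hard to spot as machine-generated.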
  84. Helping to improve the behaviour of a DNN model What

    corresponds to patching the code? 31
  85. Retraining has been expensive and indirect. • MODE (Ma et

    al., FSE 2018) uses a GAN to augment training data with inputs similar to those that induce misbehaviour. • Apricot (Zhang and Chan, ASE 2019) trains multiple sub-DNNs, and adjusts the weights of the original DNN w.r.t. the sub-DNNs. Data = $ Training = 33
  86. Is it feasible to do direct repair à la APR?

    Code → AST → GP DNN → Neural Weights → ? 34
  87. Imagine each starling as a DNN (in a high-dimensional

    space) and the whole murmuration as a search process for the best set of neural weights. 35
  92. Search Based Repair of Deep Neural Networks Jeongju Sohn, Sungmin

    Kang, and Shin Yoo https://arxiv.org/abs/1912.12463 Misbehaviour Localise Understand Patch Test Again Choose weights of neurons that 1) are most responsible for misbehaviour, and 2) are most influential to the outcome Use both positive and negative test cases, as in GenProg Arachne 36
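The loop can be sketched on a deliberately tiny "model": localise one suspicious weight, then search for a value that makes the negative tests pass without breaking the positive ones. Plain random search stands in for Arachne's actual search algorithm, and the model, names, and data below are illustrative only:

```python
import random

def repair_weight(model_fn, weights, idx, pos_tests, neg_tests,
                  steps=300, seed=0):
    """Arachne-style sketch: perturb only the localised weight (idx),
    guided by both positive and negative test cases, as in GenProg."""
    def fitness(ws):
        # Number of tests whose expected label the model reproduces.
        return sum(model_fn(ws, x) == y for x, y in pos_tests + neg_tests)

    rng = random.Random(seed)
    best_ws, best_f = list(weights), fitness(weights)
    for _ in range(steps):
        cand = list(best_ws)
        cand[idx] += rng.gauss(0, 0.5)   # mutate the suspicious weight only
        f = fitness(cand)
        if f > best_f:
            best_ws, best_f = cand, f
    return best_ws, best_f
```

Restricting the patch to the localised weights is what keeps the repair precise: the rest of the network, and hence behaviour on other labels, is untouched.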
  93. • Arachne can re-adjust the class boundary between a specific

    pair (top plot), or multiple pairs (bottom plot). • The readjustment generalises to test data! • When repairing a single “bug” (misclassification label pair), there is relatively minor impact on other labels. Direct Repair seems to work. Repairing the most frequent type of misbehaviour of CIFAR-10 Repairing multiple types of misbehaviour of CIFAR-10 37
  95. • Retraining is more disruptive • However, it is also

    more widely impactful and repairs more instances • I still believe there may be a nice use case for a direct repair like this… Direct repair can be more precise. Sampled from the most frequent type of misbehaviour in CIFAR-10 (Ineg) Avg of patched and broken inputs per label after the repair by Arachne Avg of patched and broken inputs per label after retraining Will be also helpful for correcting the unexpected behaviour 38
  100. Will help reduce the development cost of DNN based systems

    Should be possible to identify for various DNN models Better if possible to sample or search freely Will be also helpful for correcting the unexpected behaviour What next? 39
  101. Road Ahead Random thoughts on what we need… We need

    more systematic guidelines for retraining. We should start thinking about benchmarks now. Diversify subjects beyond image classifiers. 40
  102. Recommendations for Going Forward …on a higher, strategic level Contextualise

    firmly in SE practice. Pay close attention to ML literature. Take replicability and transparency seriously. 41
  103. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (Apricot - ASE ’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair [email protected]
 https://coinse.io 42
  104. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (Apricot - ASE ’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair Our guide to cost effectiveness • Intuitively speaking, SA is a distance metric that measures out-of-distribution- ness. • An input is more OOD if: • A similar neural activation has been rarely seen • its neural activation pattern is far from the observed mean • its neural activation is closer to inputs in other class labels • More OOD = More likely to misbehave Surprise Adequacy1 1. J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41th International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. Will reveal unexpected behaviours better [email protected]
 https://coinse.io 42
  105. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Code DL System Specification Training Data Logic as Control Flow Logic as Data Flow Written Trained Tested Tested For Faults Faults? Patched Retrained? Verification (abstract interpretation, interval analysis, etc…) DNN models are getting bigger and more complicated. Test Adequacy (DeepXplore, DeepGauge, SA…) Input Generation (GAN, AEs, simulation, model-based) Taxonomy of faults in deep learning systems Metamorphic Oracles Systematic Retraining (Apricot - ASE ’19) SA & Cost Effectiveness VAE based SBST Non Image Classifiers Direct Repair Our guide to cost effectiveness • Intuitively speaking, SA is a distance metric that measures out-of-distribution- ness. • An input is more OOD if: • A similar neural activation has been rarely seen • its neural activation pattern is far from the observed mean • its neural activation is closer to inputs in other class labels • More OOD = More likely to misbehave Surprise Adequacy1 1. J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41th International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019. Will reveal unexpected behaviours better Evaluating Surprise Adequacy on Question Answering Seah Kim and Shin Yoo DeepTest 2020 We apply SA analysis to Question Answering task.
 Stay tuned for the talk, which is right after this keynote :) Should be possible to generate for various DNN models [email protected]
 https://coinse.io 42
  106. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Classification Tradeoff
    If only images with IoU < 0.5 are "bugs"… Not labelling 55% of images, and we still lose 5% of images whose Road Marker IoUs are less than 0.5.
    Will help reduce the development cost of DNN-based systems
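The IoU threshold used as the "bug" criterion above can be computed directly from the predicted and ground-truth segmentation masks. A minimal sketch (function names are mine; the slide does not prescribe an implementation):

```python
import numpy as np

def mask_iou(pred, truth):
    """Intersection over Union of two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, truth).sum() / union

def is_bug(pred, truth, threshold=0.5):
    """The slide's criterion: an output counts as a bug when IoU < 0.5."""
    return bool(mask_iou(pred, truth) < threshold)
```

For example, a prediction that covers the whole ground truth plus an equal amount of spurious area scores exactly IoU = 0.5 and would sit right on the bug boundary.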
  107. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation
    Sungmin Kang, Robert Feldt, and Shin Yoo
    https://arxiv.org/abs/2005.09296 (SBST 2020)
    [Figure: VAE encoder/decoder; trajectory of AT through interpolation (green: 4, red: 9), VAE vs. raw pixel space]
    Please watch the SBST presentation (https://www.youtube.com/watch?v=_psDl3wUh-4) and participate in the online interactive session - 13:00 UTC, 2nd July 2020.
    Better if possible to sample or search freely
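SINVAD's core idea is to search the VAE latent space instead of raw pixel space, so that every candidate decodes to a plausible image. A heavily simplified sketch of that loop, using random sampling around the seed's latent code rather than the paper's genetic search; `encode`, `decode`, and `classify` are stand-ins for a trained VAE and the classifier under test, and the fitness (latent distance to the seed, once the label flips) is my simplification of the paper's objective:

```python
import numpy as np

def latent_search(seed_img, encode, decode, classify,
                  trials=200, sigma=0.5, rng=None):
    """Look for a decoded input that the classifier labels differently
    from the seed, preferring candidates close to the seed's latent code."""
    if rng is None:
        rng = np.random.default_rng(0)
    z0 = encode(seed_img)
    seed_label = classify(decode(z0))
    best_z, best_dist = None, np.inf
    for _ in range(trials):
        # Perturb in latent space, so the decoded candidate stays plausible.
        cand = z0 + rng.normal(0.0, sigma, size=z0.shape)
        if classify(decode(cand)) != seed_label:
            dist = np.linalg.norm(cand - z0)
            if dist < best_dist:  # keep the flip nearest to the seed
                best_z, best_dist = cand, dist
    return None if best_z is None else decode(best_z)
```

Because the search never leaves the latent space, the resulting boundary-crossing inputs look like the interpolation trajectory on the slide rather than adversarial pixel noise.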
  108. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    • Retraining is more disruptive
    • However, it is also more widely impactful and repairs more instances
    • I still believe there may be a nice use case for a direct repair like this…
    Direct repair can be more precise.
    (I_neg sampled from the most frequent type of misbehaviour in CIFAR)
    Avg. of patched and broken inputs per label after the repair by Arachne
    Avg. of patched and broken inputs per label after retraining
    Will be also helpful for correcting the unexpected behaviour
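The trade-off above can be made concrete with a toy sketch of Arachne-style direct repair: score candidate values for one suspicious weight by how many negative (misclassified) inputs they patch minus how many positive (correct) inputs they break. This assumes the fault-localisation step has already picked the weight, and uses exhaustive 1-D search instead of Arachne's differential evolution; `predict` and all names are my own stand-ins, not Arachne's API:

```python
import numpy as np

def repair_weight(predict, weights, idx, pos_x, pos_y, neg_x, neg_y,
                  candidates=np.linspace(-2.0, 2.0, 81)):
    """Search values for weights[idx] that fix negatives without
    breaking positives; predict(weights, x) is the model under repair."""
    def score(w):
        patched = sum(predict(w, x) == y for x, y in zip(neg_x, neg_y))
        kept = sum(predict(w, x) == y for x, y in zip(pos_x, pos_y))
        return patched + kept
    best, best_s = weights.copy(), score(weights)
    for v in candidates:
        cand = weights.copy()
        cand[idx] = v
        s = score(cand)
        if s > best_s:  # only accept strictly better repairs
            best, best_s = cand, s
    return best
```

Unlike retraining, nothing outside the chosen weight moves, which is why a direct repair can be more precise but less widely impactful.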
  109. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Road Ahead
    Random thoughts on what we need…
    We need more systematic guidelines for retraining.
    We should start thinking about benchmarks now.
    Diversify subjects beyond image classifiers.
  110. Checking the State-of-the-Art (bounded by my knowledge, that is) Traditional

    Recommendations for Going Forward
    …on a higher, strategic level
    Contextualise firmly in SE practice.
    Pay close attention to ML literature.
    Take replicability and transparency seriously.