Code                      DL System
Specification             Training Data
Logic as Control Flow     Logic as Data Flow
Written                   Trained
Tested                    Tested
For Faults                Faults?
Patched                   Retrained?

Verification (abstract interpretation, interval analysis, etc.)
DNN models are getting bigger and more complicated.
Test Adequacy (DeepXplore, DeepGauge, SA…)
Input Generation (GAN, AEs, simulation, model-based)
Taxonomy of faults in deep learning systems
Metamorphic Oracles
Systematic Retraining (Apricot - ASE ’19)

SA & Cost Effectiveness
VAE based SBST
Non Image Classifiers
Direct Repair

Our guide to cost effectiveness

Surprise Adequacy [1]
• Intuitively speaking, SA is a distance metric that measures out-of-distribution-ness.
• An input is more OOD if:
  • a similar neural activation has rarely been seen
  • its neural activation pattern is far from the observed mean
  • its neural activation is closer to inputs in other class labels
• More OOD = more likely to misbehave

[1] J. Kim, R. Feldt, and S. Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, pages 1039–1049. IEEE Press, 2019.

Will reveal unexpected behaviours better

Evaluating Surprise Adequacy on Question Answering
Seah Kim and Shin Yoo
DeepTest 2020
We apply SA analysis to a Question Answering task.
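To make the third intuition on the Surprise Adequacy slide concrete (an activation that sits closer to inputs of other class labels is more surprising), here is a minimal sketch of a distance-based SA score, following the distance-based variant described in [1] as we read it. The function name `distance_sa` and the toy 2-D activation traces are hypothetical and only for illustration; a real activation trace would come from a chosen layer of the DNN under test. It assumes Python 3.8+ for `math.dist`.

```python
import math


def distance_sa(trace, predicted_label, train_traces, train_labels):
    """Distance-based surprise of one activation trace (higher = more surprising).

    Illustrative sketch only: `trace` is the activation vector of the new input,
    `train_traces` / `train_labels` are activation vectors and labels of the
    training inputs.
    """
    same = [t for t, y in zip(train_traces, train_labels) if y == predicted_label]
    other = [t for t, y in zip(train_traces, train_labels) if y != predicted_label]
    if not same or not other:
        raise ValueError("need training traces inside and outside the predicted class")

    # Reference point: nearest training trace that shares the predicted class.
    nearest_same = min(same, key=lambda t: math.dist(trace, t))
    dist_a = math.dist(trace, nearest_same)

    # Distance from that reference point to the nearest trace of any other class.
    # (This sketch does not guard against dist_b == 0, i.e. identical traces
    # appearing under different labels.)
    dist_b = min(math.dist(nearest_same, t) for t in other)

    return dist_a / dist_b


# Toy data: class 0 clusters around x = 0, class 1 around x = 4.
train_traces = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
train_labels = [0, 0, 1, 1]

typical = distance_sa((0.0, 0.5), 0, train_traces, train_labels)
surprising = distance_sa((2.0, 0.0), 0, train_traces, train_labels)
print(typical, surprising)
```

The input halfway between the two clusters gets the higher score, which is the "more OOD = more likely to misbehave" reading: such inputs are the ones worth labelling or inspecting first.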
Stay tuned for the talk, which is right after this keynote :)

Should be possible to generate for various DNN models

Classification Tradeoff
If only images with IoU < 0.5 are “bugs”…
Not labelling 55% of images, and we still lose 5% of images whose Road Marker IoUs are less than 0.5.

Will help reduce the development cost of DNN based systems

SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation
Sungmin Kang, Robert Feldt, and Shin Yoo
https://arxiv.org/abs/2005.09296 (SBST 2020)

[Figure: panels (a) and (b), Encoder / Decoder; VAE vs. Raw interpolation for t = 0 to 1.0; "Trajectory of AT through interpolation" (green: 4, red: 9)]

Please watch the SBST presentation (https://www.youtube.com/watch?v=_psDl3wUh-4) and participate in the online interactive session at 13:00 UTC, 2nd July 2020.

Better if possible to sample or search freely

• Retraining is more disruptive
• However, it is also more widely impactful and repairs more instances
• I still believe there may be a nice use case for a direct repair like this…

Direct repair can be more precise.

[Figure: I neg, sampled from the most frequent type of misbehaviour in CIFAR. Panels: "Avg of patched and broken inputs per label after the repair by Arachne"; "Avg of patched and broken inputs per label after retraining"]

Will also be helpful for correcting unexpected behaviour

Road Ahead
Random thoughts on what we need…
• We need more systematic guidelines for retraining.
• We should start thinking about benchmarks now.
• Diversify subjects beyond image classifiers.

Recommendations for Going Forward
…on a higher, strategic level
• Contextualise firmly in SE practice.
• Pay close attention to ML literature.
• Take replicability and transparency seriously.
https://coinse.io