Machine learning quality for production

Machine Learning Quality for Production System
March 15th, 2021
shibui yusuke

$ whoami
shibui yusuke
• MLOps engineer in TierIV
• Kubernetes, MLOps, AR, AI
• Favorite language: Kotlin, Dart, Golang
• Github: @shibuiwilliam
• Qiita: @cvusk
• FB: yusuke.shibui
Figure: object detection examples (cat: 0.55, dog: 0.45; human: 0.70, gorilla: 0.30)

CatOps in work from home
Cat troubleshooting in WFH
Figure: cat on chair; Automated `playing` (teasers shake as I move chair); Data-driven evaluation (still wanna chair!)

Agenda
• Why do we test software for machine learning?
• Case study: e-commerce
• Case study: self-driving

Why do we test software for machine learning?

Machine learning is mostly deterministic software engineering
Figure: an input image is resized to 224x224x3, separated to RGB (pixel values such as 60, 90, 240, 0, ...), and normalized to [0, 1] (values such as 0.2, 0.3, 0.9, 0.0, ...); the output is a list of probabilities [0.01, 0.002, 0.02, 0.011, … 0.001]

    def predict(array):
        # Find the index of the highest probability (argmax).
        max_value = -1
        index = -1
        for i, v in enumerate(array):
            if v > max_value:
                max_value = v
                index = i
        # label_dict is assumed to be defined elsewhere (class index -> label).
        prediction = label_dict[index]
        print(f"you are {prediction}")

DEEP LEARNING!

Software unit test and machine learning test
• Software algorithms are tested with deterministic unit tests.
• Machine learning predictions are tested against ground truth and evaluated as probabilities.
Figure: unit test (input, method, assert against expected output, YES or NO; coverage: 95/100) versus machine learning test (input, model, inference, evaluate against truth, 0 ~ 1; Accuracy: 99%, Precision: 0.95, Recall: 0.60)
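
The contrast between the two kinds of test can be sketched in a few lines. This is a minimal illustration rather than code from the deck: the `add` and `evaluate` functions and the toy labels are hypothetical, and a real evaluation would use a much larger labeled dataset.

```python
def add(a, b):
    """Deterministic function: the same input always gives the same output."""
    return a + b


def evaluate(truth, predicted, positive=1):
    """Compare predictions with ground truth and return aggregate metrics."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Unit test: a single YES/NO assertion on an exact expected output.
assert add(2, 3) == 5

# Machine learning test: scores between 0 and 1, computed over a labeled set.
truth = [1, 1, 1, 0, 0, 0, 0, 0]      # hypothetical ground truth
predicted = [1, 1, 0, 0, 0, 0, 0, 1]  # hypothetical model output
metrics = evaluate(truth, predicted)
```

The unit test either passes or fails, while the metrics only become a pass/fail signal once a threshold is chosen for the use case.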

Difficulty in machine learning quality assurance
• The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
• https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf

Case study: e-commerce

Machine learning in E-commerce
• Web, interactive and dynamic
Figure: listing screen (Photo, Title: cat toy, Description, Sell now) labeled with computer vision, NLP, violation detection, item violation detection assistance, image search, pricing estimation, pricing recommendation

Machine learning system
Figure: Client, GW, LB, predict; response: cat!

Machine learning system quality in web system
3 categories of end-to-end machine learning system quality
1. Inference
   a. Model performance
   b. Feedback loop
   c. Robustness for data change
2. System
   a. Latency, stability and scalability
   b. Cost effectiveness
   c. Exception handling
3. Operation
   a. Security
   b. Update and rollback
   c. Sustainable team
Figure: failure examples around Client, GW, LB: int vs float; test data accuracy: 99.99% (what's this for?); 1sec/req; deleted training env; update dockerimg:latest; 0.1% error; change job, me too, resign; Sell, Buy, Search...; And then there were none

Anti-patterns
• Too slow for user-facing app
• Not up-to-date model
Figure: test data accuracy: 99.99% at 5sec/req (good bye in 2sec); test accuracy in 2015: 99.99% (2015: cat!, 2020: dog!)

Latency analysis of machine learning inference
• A bottleneck of a machine learning prediction service is the prediction process.
• A large portion of the process latency comes from the deep learning calculation.
• Measure prediction latency before training, so that you do not spend time training a model that is too slow to serve; such models also tend to be slow to train.
Figure: latency breakdown across Client, Web and ML: network, input, preprocess, inference, postprocess, output, network
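
One way to see where the latency goes is to time each stage separately. The following is a rough sketch, not the author's tooling: the `preprocess`, `inference` and `postprocess` functions are hypothetical placeholders for a real pipeline, and only the standard library `time.perf_counter` is used.

```python
import time


def timed(stage, func, value, timings):
    """Run one pipeline stage and record its wall-clock latency in seconds."""
    start = time.perf_counter()
    result = func(value)
    timings[stage] = time.perf_counter() - start
    return result


def preprocess(pixels):
    # Normalize 0-255 pixel values to [0, 1].
    return [p / 255 for p in pixels]


def inference(features):
    # Placeholder for the model call; a real model would be invoked here.
    total = sum(features) or 1.0
    return [f / total for f in features]


def postprocess(scores):
    # Return the index of the highest score.
    return max(range(len(scores)), key=scores.__getitem__)


def predict_with_timings(pixels):
    """Run the three stages in order and collect per-stage latencies."""
    timings = {}
    x = timed("preprocess", preprocess, pixels, timings)
    x = timed("inference", inference, x, timings)
    label = timed("postprocess", postprocess, x, timings)
    return label, timings


label, timings = predict_with_timings([60, 90, 240, 0])
slowest = max(timings, key=timings.get)  # stage to optimize first
```

Running the same measurement on an untrained model with the target architecture gives a latency estimate before any training budget is spent.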

Stability for web service
• Run load tests on the prediction system.
• Inference tends to be CPU-bound; CPU usage should go up and down as the load rate changes.
• For image and NLP, varying the size or shape of the input data is recommended.
Figure: load tester and LB; plots of req/sec, RAM and CPU over time (load rate and resource usage). CPU usage should rise with the load rate. Something is wrong if CPU usage is low: insufficient server, efficient model.
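
A load test of this kind can be sketched with the standard library alone. This is an illustrative assumption, not the deck's setup: `fake_predict` stands in for a request to the prediction service, and CPU and RAM usage would be observed separately with a monitoring tool while the test runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(payload):
    """Hypothetical stand-in for one request to the prediction service."""
    start = time.perf_counter()
    sum(i * i for i in range(len(payload) * 100))  # work grows with input size
    return time.perf_counter() - start


def run_load(payloads, concurrency):
    """Send all payloads at the given concurrency; return throughput and latency."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(fake_predict, payloads))
    elapsed = time.perf_counter() - started
    return {
        "requests": len(latencies),
        "req_per_sec": len(latencies) / elapsed,
        "p50": latencies[len(latencies) // 2],
        "max": latencies[-1],
    }


# Vary the input size, since image and text inputs differ in size and shape.
payloads = ["x" * size for size in (10, 100, 1000)] * 10
report = {c: run_load(payloads, c) for c in (1, 4)}
```

Comparing the reports at different concurrency levels against the resource plots shows whether CPU usage follows the load rate.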

Robustness for data and model change
• Data in real life changes quite frequently, so model retraining is necessary.
• Retrain the model, divide clients into an A group and a B group, and evaluate the effectiveness of the new model against the current model.
Figure: real data changes over time (data drift); predictions change with data drift (concept drift); trained dataset and expectation versus drifted data and prediction; LB splits clients into A group and B group across the new model deployment and the current model deployment
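
The A/B split can be sketched as a deterministic assignment by client ID. This is a minimal sketch under assumptions not stated in the deck: group B is assumed to receive the new model, the 10% fraction is arbitrary, and `assign_group` and `compare` are hypothetical helper names.

```python
import hashlib


def assign_group(client_id, b_fraction=0.1):
    """Deterministically assign a client to group A or group B."""
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    # Assumption: B is routed to the new model, A stays on the current one.
    return "B" if bucket < b_fraction else "A"


def compare(outcomes):
    """Return the success rate per group from (group, success) records."""
    totals, hits = {}, {}
    for group, success in outcomes:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(success)
    return {g: hits[g] / totals[g] for g in totals}


groups = [assign_group(f"client-{i}") for i in range(1000)]
rates = compare([("A", True), ("A", False), ("B", True), ("B", True)])
```

Hashing the client ID keeps each client in the same group across requests, so the two models are compared on stable populations.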

Case study: self-driving

Machine learning in self-driving
• Realtime, parallel, and multi-step processes
Figure: traffic light recognition, pedestrian recognition, road width segmentation; 20m

Traffic light recognition
• Combination of map, object detection, image classification and color recognition
Figure: traffic light recognition at 20m: map, preprocess, object detection, image classification, color recognition; result: red!

Machine learning system quality in self-driving
3 categories of end-to-end machine learning system quality
1. Inference
   a. Model performance
   b. Feedback loop
   c. Robustness for data change
2. System
   a. Latency, stability and scalability
   b. Cost effectiveness
   c. Exception handling
3. Operation
   a. Security
   b. Update and rollback
   c. Sustainable team
Figure: pipeline (map, preprocess, object detection, image classification, color recognition) annotated with: image cropping; test data accuracy: 99.99%; camera changed; 1pred/sec; new route; confusion matrix; distance to object

Prediction analysis
• Evaluated over accuracy and distance.
• Some targets that are predicted wrongly at a distance can be predicted accurately up close.
• Exception handling is required if the prediction is wrong even at close distance.
Figure: accuracy versus distance; examples: accurate from 20m apart; inaccurate even within a few meters
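
Evaluating accuracy against distance can be sketched by grouping predictions into distance bins. The records, the 5 m bin width and the 5 m "close" threshold below are hypothetical values chosen for illustration, not figures from the deck.

```python
def accuracy_by_distance(records, bin_width=5):
    """Group (distance_m, is_correct) records into distance bins and score each."""
    totals, hits = {}, {}
    for distance_m, is_correct in records:
        low = int(distance_m // bin_width) * bin_width  # lower edge of the bin
        totals[low] = totals.get(low, 0) + 1
        hits[low] = hits.get(low, 0) + int(is_correct)
    return {low: hits[low] / totals[low] for low in sorted(totals)}


def close_range_failures(records, close_m=5):
    """Return wrong predictions made within close_m meters of the target."""
    return [r for r in records if r[0] < close_m and not r[1]]


# Hypothetical evaluation records: (distance in meters, prediction correct?)
records = [(2, True), (3, False), (8, True), (12, True), (18, False), (19, False)]
by_bin = accuracy_by_distance(records)
failures = close_range_failures(records)
```

Low accuracy in the far bins may be acceptable if the near bins recover, while any entry in `failures` is a candidate for exception handling.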

Data management and exception handling
• Wrong predictions from a high-performance model are often due to unexpected data that forms edge cases in the dataset.
• Evaluate the machine learning prediction system on a large-scale dataset `and` on edge cases.
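
Evaluating on the full dataset and on edge cases together can be sketched as a release gate with two thresholds. The sample data, the `edge` tag and both threshold values are hypothetical assumptions made for this illustration.

```python
def accuracy(samples):
    """Fraction of (truth, predicted, tags) samples predicted correctly."""
    return sum(1 for t, p, _ in samples if t == p) / len(samples)


def evaluate_release(samples, edge_tag="edge", overall_min=0.9, edge_min=0.8):
    """Require both overall accuracy and edge-case accuracy to pass."""
    edge = [s for s in samples if edge_tag in s[2]]
    overall_acc = accuracy(samples)
    edge_acc = accuracy(edge) if edge else None
    passed = overall_acc >= overall_min and (edge_acc is None or edge_acc >= edge_min)
    return {"overall": overall_acc, "edge": edge_acc, "passed": passed}


# Hypothetical samples: 18 ordinary cases plus 2 tagged edge cases.
samples = [("red", "red", set())] * 18 + [
    ("red", "green", {"edge"}),
    ("green", "green", {"edge"}),
]
result = evaluate_release(samples)
```

Here the overall accuracy looks healthy while the edge-case slice fails its threshold, which is the situation a single aggregate metric would hide.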

Take aways!

Take aways
• Machine learning in a product requires use-case based quality assurance along with machine learning evaluation.
• Use quantitative evaluation on a large dataset, plus evaluation on drifted data and edge cases, to ensure the model and system are robust against expected exceptions.
