gets tested over deterministic unit tests.
• Machine learning predictions get tested against ground truth and probability.
[Figure: a unit test passes an input to a method and asserts YES or NO against the expected output (coverage: 95/100); a machine learning test runs inference with a model and evaluates it against the truth on a 0 ~ 1 scale (Accuracy: 99%, Precision: 0.95, Recall: 0.60).]
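The contrast between the two kinds of test can be sketched in a few lines: a unit test asserts one exact answer, while a model is scored against ground truth by thresholding its probabilities. This is only a minimal sketch; the function name `evaluate_binary`, the 0.5 threshold and the toy data are assumptions for illustration and do not come from the slides.

```python
def evaluate_binary(truth, probs, threshold=0.5):
    """Score predicted probabilities (0 ~ 1) against 0/1 ground truth labels."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(truth, preds))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Deterministic unit test: the answer is YES or NO.
assert abs(-3) == 3

# Machine learning test: the answer is a set of scores (hypothetical toy data).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
metrics = evaluate_binary(truth, probs)
```

Which metric matters most depends on the use case; a model with high accuracy can still have low recall, as in the figure's example numbers.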
Score: A Rubric for ML Production Readiness and Technical Debt Reduction
• https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
end-to-end machine learning system quality
1. Inference
   a. Model performance
   b. Feedback loop
   c. Robustness for data change
2. System
   a. Latency, stability and scalability
   b. Cost effectiveness
   c. Exception handling
3. Operation
   a. Security
   b. Update and rollback
   c. Sustainable team
[Figure: example problems around a Client, GW, LB setup. Labels: int vs. float; test data accuracy: 99.99%, "what's this for?"; 1sec/req; deleted training env; update dockerimg:latest; 0.1% error; "change job", "me too", "resign", "And then there were none"; Sell, Buy, Search...]
machine learning prediction service is prediction process.
• A large portion of the process latency is due to the deep learning calculation.
• Measure prediction latency before training, so that no time is spent training a model that is too slow to serve; such a model also tends to be slow to train.
[Figure: the prediction process runs Input, Preprocess, Inference, Postprocess, Output. Latency across Client, Web and ML breaks down into network, input, prep, inference, postp, output, network.]
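The per-stage latency breakdown can be measured with a small timing wrapper. This is a minimal sketch under assumptions: `timed_predict` and the three toy stage functions are hypothetical stand-ins, and network latency between client, web and ML servers is not covered here.

```python
import time


def timed_predict(raw_input, preprocess, infer, postprocess):
    """Run the three stages and return (output, seconds spent per stage)."""
    latency = {}
    value = raw_input
    stages = (("preprocess", preprocess),
              ("inference", infer),
              ("postprocess", postprocess))
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        latency[name] = time.perf_counter() - start
    latency["total"] = sum(latency.values())
    return value, latency


# Hypothetical stand-ins for the real stages.
def preprocess(pixels):
    return [v / 255.0 for v in pixels]      # scale to 0 ~ 1


def infer(features):
    return sum(features) / len(features)    # placeholder for a model call


def postprocess(score):
    return "positive" if score >= 0.5 else "negative"


output, latency = timed_predict([255, 128, 0], preprocess, infer, postprocess)
```

Running the same wrapper around an untrained model of the candidate architecture gives an early estimate of whether the inference stage fits the latency budget.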
• Inference tends to be CPU-bound; CPU usage should rise and fall as the load rate changes.
• For image and NLP models, varying the size or shape of the input data during load testing is recommended.
[Figure: load testing and resource usage. A load tester sends requests through the LB; plots of req/sec, CPU and RAM over time. CPU usage should rise with load rate. Something is wrong if CPU usage is low: insufficient server, efficient model.]
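One way to check that CPU usage follows the load rate is to correlate the two series recorded during a load test. This is a minimal sketch, not a prescribed method: the helper names, the 0.8 correlation threshold and the sample numbers are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cpu_tracks_load(req_per_sec, cpu_percent, min_corr=0.8):
    """True if CPU usage rises and falls together with the request rate."""
    return pearson(req_per_sec, cpu_percent) >= min_corr


# Hypothetical samples taken at the same timestamps during a load test.
rates = [10, 20, 40, 80, 40, 20]
healthy_cpu = [12, 22, 41, 78, 43, 21]   # follows the load
flat_cpu = [9, 10, 9, 10, 9, 10]         # stays low regardless of load
```

A flat CPU curve under rising load is a signal to investigate, for example whether requests actually reach the inference servers.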
life changes quite frequently, so model retraining is necessary.
• Retrain the model, divide clients into an A group and a B group, and evaluate the effectiveness of the new model against the current model.
[Figure: real data changes over time (data drift, concept drift), and predictions change with data drift: trained dataset and expectation vs. drifted data and prediction. The LB splits clients into A group and B group between the new model deployment and the current model deployment.]
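The A/B split can be sketched as a deterministic hash of the client identifier, so that a client always lands in the same group. This is a minimal sketch under assumptions: `assign_group`, `compare_groups`, the 50/50 split and the success flag are hypothetical, and which group is served by the new model is a deployment decision not fixed here.

```python
import hashlib


def assign_group(client_id, b_fraction=0.5):
    """Map a client to "A" or "B"; the same client always gets the same group."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0 .. 9999
    return "B" if bucket < b_fraction * 10_000 else "A"


def compare_groups(outcomes):
    """outcomes: iterable of (group, success) pairs -> success rate per group."""
    totals = {"A": [0, 0], "B": [0, 0]}   # [successes, requests]
    for group, success in outcomes:
        totals[group][0] += 1 if success else 0
        totals[group][1] += 1
    return {g: (s / n if n else 0.0) for g, (s, n) in totals.items()}


# Hypothetical logged outcomes, e.g. whether the prediction led to a click.
rates = compare_groups([("A", True), ("A", False), ("B", True), ("B", True)])
```

With real traffic, the difference between the two rates should be checked for statistical significance before switching models.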
[Figure: 3 categories of end-to-end machine learning system quality (1. Inference, 2. System, 3. Operation), with example labels: accuracy: 99.99%, camera changed, 1pred/sec, new route, confusion matrix; object detection, image classification, color recognition, preprocess, map, distance to object.]
even in a few meters.
• Evaluated over accuracy and distance.
• Some target images that are predicted wrongly at a distance can be predicted accurately up close.
• Exception handling is required if the prediction is wrong even at close distance.
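Evaluating over accuracy and distance can be sketched by grouping predictions into distance bins. This is a minimal sketch under assumptions: the function names, the bin edges in meters, the 0.95 threshold and the toy records are hypothetical.

```python
def accuracy_by_distance(records, bin_edges):
    """records: (distance_m, is_correct) pairs; bin_edges: ascending upper bounds.

    Returns accuracy per bin, or None for a bin without samples.
    Records farther than the last edge are ignored.
    """
    stats = {edge: [0, 0] for edge in bin_edges}   # [correct, total]
    for distance, is_correct in records:
        for edge in bin_edges:
            if distance <= edge:
                stats[edge][0] += int(is_correct)
                stats[edge][1] += 1
                break
    return {edge: (c / n if n else None) for edge, (c, n) in stats.items()}


def needs_exception_handling(accuracy_per_bin, closest_edge, min_accuracy=0.95):
    """Flag the case where predictions are wrong even at close distance."""
    acc = accuracy_per_bin.get(closest_edge)
    return acc is not None and acc < min_accuracy


# Hypothetical evaluation records: (distance to object in meters, correct?).
records = [(3, True), (4, True), (25, True), (28, False),
           (80, False), (90, False), (95, True)]
per_bin = accuracy_by_distance(records, bin_edges=[5, 30, 100])
```

In this toy data the far bins are weak but the closest bin is accurate, so no exception handling would be flagged.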
model are often due to unexpected data that become edge cases in the dataset.
• Machine learning prediction system performance needs both evaluation on a large-scale dataset and evaluation on edge cases.
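Evaluating on both sets can be sketched as a release gate that reports the two scores separately, so a high overall score cannot hide edge-case failures. This is a minimal sketch under assumptions: `release_gate`, the thresholds, the toy model and the toy datasets are hypothetical.

```python
def accuracy(model, samples):
    """Fraction of (features, label) samples the model predicts correctly."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(1 for features, label in samples if model(features) == label)
    return correct / len(samples)


def release_gate(model, large_dataset, edge_cases,
                 min_overall=0.95, min_edge=0.80):
    """Pass only if both the large dataset and the edge cases meet their bar."""
    report = {
        "overall": accuracy(model, large_dataset),
        "edge_cases": accuracy(model, edge_cases),
    }
    report["passed"] = (report["overall"] >= min_overall
                        and report["edge_cases"] >= min_edge)
    return report


def toy_model(x):
    """Hypothetical classifier meant to answer "is x non-negative?" (buggy at 0)."""
    return x > 0


large_dataset = [(v, v >= 0) for v in range(-50, 50)]
edge_cases = [(0, True), (-0.0, True), (float("nan"), False),
              (1e-12, True), (-1e-12, False)]
report = release_gate(toy_model, large_dataset, edge_cases)
```

Here the toy model scores 0.99 on the large dataset yet fails the gate, because it mishandles zero in the edge-case set.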
based quality assurance along with machine learning evaluation.
• Quantitative evaluation on a large dataset, plus evaluation on drifted data and edge cases, ensures the model and system are robust against expected exceptions.