
Machine learning quality for production


machine learning system testing and quality assurance practice for production service.

shibuiwilliam

March 15, 2022

Transcript

  1. $ whoami — shibui yusuke
    • MLOps engineer at TierIV
    • Kubernetes, MLOps, AR, AI
    • Favorite languages: Kotlin, Dart, Golang
    • GitHub: @shibuiwilliam / Qiita: @cvusk / FB: yusuke.shibui
    (Figure: object detection examples — cat: 0.55, dog: 0.45; human: 0.70, gorilla: 0.30)
  2. CatOps in work from home — cat troubleshooting in WFH
    (Figure: a humorous data-driven evaluation of automated cat `playing` —
    teasers shake as I move my chair, the cat still wants the chair.)
  3. Agenda
    • Why do we test software for machine learning?
    • Case study: e-commerce
    • Case study: self-driving
  4. Machine learning is mostly deterministic software engineering
    (Figure: an input image is resized to 224x224x3, separated into RGB
    channels, and normalized to [0, 1]; the "DEEP LEARNING!" step outputs a
    probability vector such as [0.01, 0.002, 0.02, 0.011, ..., 0.001], and
    deterministic code picks the label with the highest probability.)

    def predict(array):
        max_value = -1
        index = -1
        for i, v in enumerate(array):
            if v > max_value:
                max_value = v
                index = i
        prediction = label_dict[index]
        print(f"you are {prediction}")
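The preprocessing steps on the slide (RGB split and [0, 1] normalization) can be sketched with NumPy; the random input image is a stand-in for a real photo, and a real service would also do the 224x224 resize (e.g. with Pillow or OpenCV) before this point:

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize a uint8 HxWx3 image to float32 in [0, 1], channel-first RGB."""
    # This sketch assumes the input already has the 224x224 target size.
    arr = image.astype(np.float32) / 255.0  # normalize to [0, 1]
    return np.transpose(arr, (2, 0, 1))     # separate into R, G, B planes

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
x = preprocess(image)
print(x.shape)  # (3, 224, 224)
```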
  5. Software unit test and machine learning test
    • A software algorithm gets tested with deterministic unit tests:
      given an input, assert the expected output — YES or NO — and count
      method coverage (e.g. 95/100).
    • Machine learning predictions get evaluated against ground truth on
      a 0 to 1 scale: e.g. Accuracy: 99%, Precision: 0.95, Recall: 0.60.
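The metrics on the slide can be computed from predictions and ground-truth labels; a minimal sketch in plain Python, where the label lists are made-up examples:

```python
def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
acc, prec, rec = precision_recall_accuracy(y_true, y_pred)
print(acc, prec, rec)  # 0.625 0.75 0.6
```

Note how precision and recall diverge here even though they come from the same predictions — which is exactly why a single accuracy number is not enough for an ML test.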
  6. (image-only slide)

  7. Difficulty in machine learning quality assurance
    • The ML Test Score: A Rubric for ML Production Readiness and
      Technical Debt Reduction
    • https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
  8. Machine learning in e-commerce
    • Web, interactive and dynamic
    (Figure: a listing flow — photo, title, description, "Sell now" — backed
    by computer vision and NLP: item violation detection, violation detection
    assistance, image search ("cat toy"), pricing estimation, and pricing
    recommendation.)
  9. Machine learning system quality in web system
    3 categories of end-to-end machine learning system quality:
    1. Inference
      a. Model performance
      b. Feedback loop
      c. Robustness for data change
    2. System
      a. Latency, stability and scalability
      b. Cost effectiveness
      c. Exception handling
    3. Operation
      a. Security
      b. Update and rollback
      c. Sustainable team
    (Figure: a Client -> GW -> LB web system annotated with failure examples
    for each category — int/float mismatch, "test data accuracy: 99.99%,
    what's this for?", 1 sec/req, deleted training env, dockerimg:latest
    updates, 0.1% error rate, team churn — "Sell, Buy, Search... And then
    there were none".)
  10. Anti-patterns
    • Too slow for a user-facing app: "test data accuracy: 99.99%" but
      5 sec/req — the user says good-bye in 2 sec.
    • Model not up to date: test accuracy of 99.99% measured in 2015; by
      2020 it mistakes a dog for a cat.
  11. Latency analysis of machine learning inference
    • The bottleneck of a machine learning prediction service is the
      prediction process itself.
    • A large portion of the process latency comes from the deep learning
      computation.
    • Measure prediction latency before investing in full training, to weed
      out models that are too slow to serve (they also tend to be slow to
      train).
    (Figure: Client -> Web -> ML latency breakdown — network, input,
    preprocess, inference, postprocess, output.)
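A per-stage latency breakdown like the one on the slide can be sketched with `time.perf_counter`; the three stages here are stubs standing in for real preprocess/inference/postprocess code:

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stub stages standing in for real work.
def preprocess(x): return [v / 255 for v in x]
def inference(x): return [sum(x) / len(x)]
def postprocess(x): return {"score": x[0]}

payload = list(range(256))
latency = {}
out, latency["preprocess"] = timed(preprocess, payload)
out, latency["inference"] = timed(inference, out)
out, latency["postprocess"] = timed(postprocess, out)
latency["total"] = sum(latency.values())
print({k: f"{v * 1000:.3f} ms" for k, v in latency.items()})
```

In a real service the inference stage dominates this breakdown, which is why it is the stage to budget for first.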
  12. Stability for web service
    • Run load testing on the prediction system.
    • Inference tends to be CPU-bound, so CPU usage should rise and fall
      with the loading rate. If CPU usage stays low under load, something
      is wrong — insufficient servers, or an unexpectedly efficient model.
    • For image and NLP workloads, vary the size and shape of the input
      data during load testing.
    (Figure: a load tester behind an LB; charts of req/sec and CPU/RAM
    usage over time.)
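A minimal load-test sketch using a thread pool against an in-process predict function — a stand-in for HTTP calls to a real prediction endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def predict(x):
    """Stand-in for a CPU-bound inference call."""
    return sum(i * i for i in range(1000)) + x

def load_test(requests: int, concurrency: int):
    latencies = []
    def call(i):
        start = time.perf_counter()
        predict(i)
        latencies.append(time.perf_counter() - start)  # list.append is thread-safe in CPython
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(call, range(requests)))
    latencies.sort()
    return {
        "count": len(latencies),
        "p50": latencies[len(latencies) // 2],
        "p95": latencies[int(len(latencies) * 0.95)],
    }

stats = load_test(requests=200, concurrency=8)
print(stats)
```

Reporting percentiles rather than a mean matches the slide's point: stability problems show up in the tail, not the average.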
  13. Robustness for data and model change
    • Data in real life changes quite frequently (data drift), and
      predictions change with the data (concept drift), so model retraining
      is necessary.
    • Retrain the model, divide clients into an A group and a B group, and
      evaluate the effectiveness of the new model against the current model.
    (Figure: real data drifting away from the trained dataset over time; an
    LB routing the A group to the new model deployment and the B group to
    the current model deployment.)
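One simple way to flag data drift is to compare summary statistics of a live feature window against the training distribution; the threshold and the data below are illustrative, not a recommended production rule:

```python
import statistics

def drifted(train_values, live_values, threshold=0.5):
    """Flag drift when the live mean moves more than `threshold`
    training standard deviations away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(live_values) - mu) / sigma
    return shift > threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
stable = [10.1, 9.9, 10.3, 10.0]
shifted = [13.0, 13.5, 12.8, 13.2]
print(drifted(train, stable))   # False
print(drifted(train, shifted))  # True
```

A drift alarm like this is what should trigger the retrain-and-A/B-evaluate loop described on the slide.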
  14. Machine learning in self-driving
    • Realtime, parallel, and multi-step processes
    (Figure: a driving scene with traffic light recognition, pedestrian
    recognition, and road width segmentation running at 20 m.)
  15. Traffic light recognition
    • A combination of map, object detection, image classification, and
      color recognition.
    (Figure: map -> preprocess -> object detection -> image classification
    -> color recognition -> "red!", recognizing a traffic light 20 m ahead.)
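A multi-step pipeline like this can be sketched as a chain of stage functions; every stage below is a hypothetical stub for illustration, not the actual implementation:

```python
def locate_lights_from_map(frame):
    """Use map data to propose regions that may contain traffic lights."""
    return [frame["crop"]]

def detect(regions):
    """Object detection: keep regions that actually contain a light."""
    return [r for r in regions if r.get("has_light")]

def classify(detections):
    """Image classification: confirm each detection is a traffic light."""
    return [d for d in detections if d.get("kind") == "traffic_light"]

def recognize_color(lights):
    """Color recognition on the confirmed lights."""
    return [light["color"] for light in lights]

def traffic_light_pipeline(frame):
    regions = locate_lights_from_map(frame)
    return recognize_color(classify(detect(regions)))

frame = {"crop": {"has_light": True, "kind": "traffic_light", "color": "red"}}
print(traffic_light_pipeline(frame))  # ['red']
```

Structuring the stages as separate functions also makes each one individually testable, which matters when an end-to-end error could come from any of the four models.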
  16. Machine learning system quality in self-driving
    The same 3 categories of end-to-end machine learning system quality
    apply: 1. Inference (model performance, feedback loop, robustness for
    data change), 2. System (latency, stability and scalability, cost
    effectiveness, exception handling), 3. Operation (security, update and
    rollback, sustainable team).
    (Figure: the map/preprocess/object detection/image classification/color
    recognition pipeline annotated with failure examples — image cropping,
    "test data accuracy: 99.99%", camera changed, 1 pred/sec, new route,
    confusion matrix, distance to object.)
  17. Prediction analysis
    • Evaluate over both accuracy and distance.
    • Some targets predicted wrongly at a distance can be predicted
      accurately up close.
    • Exception handling is required when the prediction is wrong even at
      close distance.
    (Figure: accuracy vs. distance — accurate from 20 m apart, yet
    inaccurate even at a few meters for some targets.)
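Accuracy can be bucketed by distance to surface this pattern; the records below are made-up examples of (distance in meters, prediction correct?):

```python
from collections import defaultdict

def accuracy_by_distance(records, bucket_size=10):
    """Group (distance_m, correct) records into distance buckets and
    compute per-bucket accuracy."""
    buckets = defaultdict(lambda: [0, 0])  # bucket start -> [correct, total]
    for distance_m, correct in records:
        bucket = int(distance_m // bucket_size) * bucket_size
        buckets[bucket][0] += int(correct)
        buckets[bucket][1] += 1
    return {b: c / t for b, (c, t) in sorted(buckets.items())}

records = [
    (3, True), (5, False), (8, True),      # close range
    (12, True), (15, True), (18, True),    # mid range
    (22, True), (25, False), (28, False),  # far range
]
print(accuracy_by_distance(records))
```

A table like this makes the slide's anomaly visible: a bucket near 0 m with low accuracy is the case that needs exception handling, not more training data.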
  18. Data management and exception handling
    • Wrong predictions from a high-performance model are often caused by
      unexpected data that becomes an edge case in the dataset.
    • A machine learning prediction system needs performance evaluation on
      a large-scale dataset `and` evaluation on edge cases.
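Evaluating the main dataset and a curated edge-case set separately can be sketched like this; the datasets and the 0.5-threshold model are purely illustrative:

```python
def evaluate(model, dataset):
    """Return accuracy of model over (input, label) pairs."""
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)

# Illustrative model: predicts positive above a fixed threshold.
model = lambda x: int(x > 0.5)

large_scale = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0), (0.7, 1), (0.3, 0)]
edge_cases = [(0.51, 1), (0.49, 0), (0.5, 1)]  # near the decision boundary

report = {
    "large_scale_accuracy": evaluate(model, large_scale),
    "edge_case_accuracy": evaluate(model, edge_cases),
}
print(report)
```

Reporting the two numbers side by side keeps a perfect large-scale score from hiding the edge-case failures the slide warns about.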
  19. Takeaways
    • Machine learning in a product requires use-case-based quality
      assurance along with machine learning evaluation.
    • Combine quantitative evaluation on a large dataset with evaluation on
      drifted data and edge cases, to ensure the model and system are
      robust against expected exceptions.