AlerTiger: Deep Learning for AI Model Health Monitoring at LinkedIn

• During a UI migration, AlerTiger detected a feature distribution
shift across nine model aspects, leading to missing features and performance drop. However, post retraining with the new schema, we saw improved business metrics. • When launching a new model, AlerTiger detected features with uniform values across all instances, causing performance inconsistency. On resolution, the successfully deployed model yielded expected business gains. AlerTiger: Deep Learning for AI Model Health Monitoring at LinkedIn Zhentao Xu, Ruoying Wang, Girish Balaji, Manas Bundele, Xiaofei Liu, Leo Liu, Tie Wang {zhexu, ruowang, gbalaji, mbundele, xiaoliu, leoliu, tiewang}@linkedin.com The health of AI models is crucial for the business success of data-driven companies. Unique challenges for AI model health monitoring and anomaly detection include: • lack of unified health definition • anomaly label sparsity • difficulty in generalization • lack of explainability • Constructed an end-to-end data-driven solution for monitoring AI models. • Combined quantile loss with RMSE to estimate a non-parametric distribution • Demonstrated broad generalizability and implemented this solution at scale, delivering high performance. • STEP-1 Health Statistics Generation: Serving pipeline emits data via Kafka; Offline spark job calculates statistics for input feature, output score, and auxiliary metadata. • STEP-4 Alerting and Visualization: Model health data is compiled into a comprehensive report and emailed to AI model owners. MLOps Model Health Dataset METHODOLOGY CONTRIBUTION INTRODUCTION PRODUCTION CASE STUDY EVALUATION LINK Algorithm Precision Recall F1 Score Time AlerTiger 0.47 1.00 0.64 0 h 35 mins ARIMA 0.45 0.94 0.61 1 h 28 mins DeepAR 0.44 0.88 0.59 8 h 49 mins Prophet 0.44 1.00 0.61 14 h 0 mins SARIMAX 0.43 0.94 0.59 1 h 30 mins AR 0.43 0.94 0.59 1 h 30 mins Effect of Time Series and Anomaly on Performance We proposed AlerTiger, a deep-learning- based MLOps model monitoring system: • defines three categories of health statistics and alerts on deviation. • employs a two-stage training process for sparse labels. • trains one model with various patterns, reducing onboarding costs. • provides a holistic report with cross- dimension interaction. AlerTiger has been used on most of LinkedIn's AI models for over a year. It has identified and helped resolve issues, which led to improved business metrics. SOLUTION Irregularity score for better boundary prediction • STEP-2 Two-stage Anomaly Detection (Forecasting + Classification) • STEP-3 Post Processing (Filtering + Grouping) Mixed quantile loss + MSE loss for forecasting Cross-entropy loss for classification

AlerTiger: Deep Learning for AI Model Health Mo...

AlerTiger: Deep Learning for AI Model Health Monitoring at LinkedIn

Zhentao Xu

More Decks by Zhentao Xu

Featured

Transcript

• During a UI migration, AlerTiger detected a feature distribution