
Lupus - A Monitoring System for Accelerating MLOps


LINE DEVDAY 2021

November 10, 2021



Transcript

  1. Target audience › People who are managing ML products. › People who belong to an ML team that is expected to grow much more. › Anyone who is interested in MLOps.
  2. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  3. Self introduction Junki Ishikawa, Machine Learning Development Team. Joined in April 2021 as a new graduate. In charge of › Recommendation › Internal library development › Internal application development Personal › Living with Java sparrows
  4. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  5. What’s MLOps? ML + DevOps [Diagram: the DEV / OPS / ML loop with the stages DESIGN, ANALYZE, EVALUATE, CODE, PLAN, BUILD, RELEASE, OPERATE, MONITOR]
  11. What’s MLOps monitoring? What to monitor DevOps › Resource usage

    (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  12. What’s MLOps monitoring? What to monitor DevOps MLOps › Data

    statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness + › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
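    A minimal sketch of the kind of input-drift check implied by "data drift" above, assuming two daily snapshots of one feature are available as pandas Series; the two-sample Kolmogorov-Smirnov test from SciPy is one common choice, not necessarily the method the ML Dept. uses:

    # Hedged illustration: compare a reference day's and the current day's distribution of one feature.
    # The sample values and the 0.05 significance level are arbitrary assumptions.
    import pandas as pd
    from scipy.stats import ks_2samp

    def detect_feature_drift(reference: pd.Series, current: pd.Series, alpha: float = 0.05) -> bool:
        """Return True if the current distribution differs significantly from the reference."""
        result = ks_2samp(reference.dropna(), current.dropna())
        return result.pvalue < alpha

    yesterday = pd.Series([23, 42, 64, 27, 38])
    today = pd.Series([71, 82, 75, 90, 68])
    if detect_feature_drift(yesterday, today):
        print("possible data drift in the feature")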
  14. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics

    › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  15. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness [Automation] › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  16. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness [Automation] [Interval] › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  17. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness [Automation] [Interval] [Alert logic] › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  18. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  19. MLOps at ML Dept. Scale [Chart: No. of logics to select contents on SmartCH, plotted monthly from 2019/05 to 2021/09 and growing to roughly 160; series: Logic 1, Logic 2, …, Logic n]
  20. MLOps at ML Dept. Facilities › Kubernetes › IU (LINE’s

    Hadoop cluster) Infrastructure › Jutopia (LINE’s Jupyter server) Prototyping environment
  21. MLOps at ML Dept. Facilities › Kubernetes › IU (LINE’s

    Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › Jutopia (LINE’s Jupyter server) Prototyping environment
  22. MLOps at ML Dept. Facilities › Kubernetes › IU (LINE’s

    Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  23. MLOps at ML Dept. Facilities › User sparse/dense features ›

    Item metadata features Shared feature vectors › Kubernetes › IU (LINE’s Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  24. MLOps at ML Dept. Facilities › Distributed training & inference

    › Model collections › Recommendation automation › I/O manager › etc… Internal libraries › User sparse/dense features › Item metadata features Shared feature vectors › Kubernetes › IU (LINE’s Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  25. MLOps at ML Dept. Facilities › Distributed training & inference

    › Model collections › Recommendation automation › I/O manager › etc… Internal libraries › User sparse/dense features › Item metadata features Shared feature vectors › A/B test manager › A/B test monitoring system › Recommendation demo generator Internal experiment manager › Kubernetes › IU (LINE’s Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  26. MLOps at ML Dept. Common pipeline [Diagram: the DEV / OPS / ML loop (DESIGN, ANALYZE, EVALUATE, CODE, PLAN, BUILD, RELEASE, OPERATE, MONITOR) annotated with the facilities above: Prototyping tools, Internal experiment manager, Workflow Engines, CI/CD tools, Internal Libraries, Shared feature vectors]
  27. MLOps at ML Dept. Common pipeline [Same diagram, with a "?" marking the part of the loop not yet covered by these tools]
  28. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  29. Our challenges in MLOps monitoring Monitoring issues › As the

    number of ML products increases, the cost of monitoring has steadily grown. Increasing monitoring costs
  30. Our challenges in MLOps monitoring Monitoring issues Disjointed, project-dependent monitoring

    operations Increasing monitoring costs › Each project has different monitoring methods and alerts. › Sometimes cheap, sometimes poor. › As the number of ML products increases, the cost of monitoring has steadily grown.
  31. Our challenges in MLOps monitoring Monitoring issues Disjointed, project-dependent monitoring

    operations Outages due to lack of monitoring Increasing monitoring costs › Each project has different monitoring methods and alerts. › Sometimes cheap, sometimes poor. › As the number of ML products increases, the cost of monitoring has steadily grown. › There are many causes of outages (e.g. missing data, the changes of model outputs, etc.). › It is nearly impossible to manually monitor every product.
  32. Our challenges in MLOps monitoring Actual outages we experienced before › Manual Monitoring (Cause: handcrafted monitoring code on Jupyter notebooks. Impact: cheap metrics, poor alerting, unreviewed code.) › Data Missing (Cause: cluster outage, delay. Impact: low-quality predictions, empty predictions.) › Model Update (Cause: model architecture update, smoothing. Impact: significant drift in the prediction distribution, found out 2 weeks later.)
  35. Our challenges in MLOps monitoring What we need Detection Collection

    Metrics aggregation tools Reliable metrics store
  36. Our challenges in MLOps monitoring What we need Detection Collection

    Metrics aggregation tools Reliable metrics store Flexible anomaly detector Alerting system
  37. Our challenges in MLOps monitoring What we need Detection Visualization

    Collection Metrics aggregation tools Reliable metrics store Flexible anomaly detector Alerting system
  38. Our challenges in MLOps monitoring What we need Detection Visualization

    Collection User-friendly GUI app Metrics aggregation tools Reliable metrics store Flexible anomaly detector Alerting system
  39. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  40. Lupus Concept › For engineers: easy to collect › For operators: easy to detect › For project members: easy to visualize
  41. Lupus Components › Lupus server: metric management and anomaly detection APIs › Lupus SPA: web app for visualizing metrics and anomalies › Lupus library: metrics aggregation tools and API client
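    These three components imply a simple flow: the library aggregates metrics in batch jobs, the server stores them and runs detection, and the SPA visualizes the results. A purely illustrative sketch of that flow, assuming nothing about the real Lupus API; the endpoint path, payload schema, and auth header below are invented placeholders:

    # Illustrative only: pushing aggregated metrics to a metrics-management API.
    # The URL, JSON schema, and token handling are hypothetical, not the actual Lupus server interface.
    import datetime
    import requests

    metrics = [
        {"project": "example-reco", "metric": "prediction_avg", "value": 3.8,
         "date": datetime.date.today().isoformat()},
    ]

    response = requests.post(
        "https://lupus.example.internal/api/v1/metrics",   # placeholder endpoint
        json={"metrics": metrics},
        headers={"Authorization": "Bearer <token>"},       # placeholder auth
        timeout=10,
    )
    response.raise_for_status()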
  42. Case: Metrics collection Which kind of metrics should we monitor?

    Effective metrics depend on the task, data, model and so on… Data drift / Concept drift › Statistics of input data › Statistics of target variables Model degradation / replacement › Statistics of predictions › Ground-truth evaluation › Training / Validation metrics Lupus library helps to aggregate these metrics
  43. Case: Metrics collection Library support Example input and supported aggregations (Avg@k, Sum, Min, 95th percentile, count per entity, unique entity count@k per Region):

    Region  Age  Device   Rating      Interests
    JP      23   iOS      [5.0, 4.0]  [a, b, c]
    JP      42   Mac      [2.0]       [e, g]
    JP      64   Android  [4.5, 3.5]  [x, y, a]
    US      27   iOS      [4.0, 4.0]  [v, t, v]
    US      38   Android  [3.0]       [y]
    …       …    …        …           …
  44. Case: Metrics collection Library support Hand-written PySpark code to compute the statistics above:

    import pyspark.sql.functions as F
    from pyspark.sql import Row

    stats = []

    # age
    age_stats = df.groupby("region").agg(
        F.avg("age").alias("avg"), F.max("age").alias("max"), F.min("age").alias("min"))
    for row in age_stats.toLocalIterator():
        stats.append({"col": "age", "region": row.region, "metric": "avg", "value": row["avg"]})
        stats.append({"col": "age", "region": row.region, "metric": "max", "value": row["max"]})
        stats.append({"col": "age", "region": row.region, "metric": "min", "value": row["min"]})

    # device
    device_counts = df.groupby("region", "device").agg(F.count("device").alias("count"))
    device_unique = device_counts.groupby("region").agg(F.count("count").alias("unique"))
    for row in device_counts.toLocalIterator():
        stats.append({"col": "device", "region": row.region, "metric": "count", "value": row["count"], "device": row.device})
    for row in device_unique.toLocalIterator():
        stats.append({"col": "device", "region": row.region, "metric": "count", "unique": row["unique"]})

    # ratings
    def truncate(df, col, k):
        def _(row):
            dic = row.asDict()
            dic[col] = dic[col][:k]
            return Row(**dic)
        return df.rdd.map(_).toDF()

    ratings_stats_all = (
        df.select("region", F.explode("ratings").alias("ratings"))
        .groupby("region").agg(F.avg("ratings").alias("avg"), F.max("ratings").alias("max"), F.min("ratings").alias("min"))
    )
    for row in ratings_stats_all.toLocalIterator():
        stats.append({"col": "ratings", "region": row.region, "metric": "avg", "value": row["avg"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "max", "value": row["max"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "min", "value": row["min"]})

    ratings_stats_top5 = (
        truncate(df, "ratings", 5)
        .select("region", F.explode("ratings").alias("ratings"))
        .groupby("region").agg(F.avg("ratings").alias("avg"), F.max("ratings").alias("max"), F.min("ratings").alias("min"))
    )
    for row in ratings_stats_top5.toLocalIterator():
        stats.append({"col": "ratings", "region": row.region, "metric": "avg@5", "value": row["avg"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "max@5", "value": row["max"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "min@5", "value": row["min"]})

    # interests
    interests_count_all = (
        df.select("region", F.explode("interests").alias("interests"))
        .groupby("region", "interests").agg(F.count("interests").alias("count"))
    )
    interests_unique_all = interests_count_all.groupby("region").agg(F.count("count").alias("unique"))
    for row in interests_count_all.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "value": row["count"], "interests": row.interests})
    for row in interests_unique_all.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "unique": row["unique"]})

    interests_count_top5 = (
        truncate(df, "interests", 5)
        .select("region", F.explode("interests").alias("interests"))
        .groupby("region", "interests").agg(F.count("interests").alias("count"))
    )
    interests_unique_top5 = interests_count_top5.groupby("region").agg(F.count("count").alias("unique"))
    for row in interests_count_top5.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "value": row["count"], "interests": row.interests})
    for row in interests_unique_top5.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "unique": row["unique"]})
  45. Case: Metrics collection Library support The same statistics computed with the Lupus library, replacing the hand-written code on the previous slide:

    from lupus.processor.spark import DistributionProcessor

    processor = DistributionProcessor(
        df,
        group_columns=["region"],
        column_metrics={
            "age": ["avg", "p25", "p50", "p75"],
            "device": ["count", "unique"],
            "ratings": ["avg", "avg@5", "min", "max"],
            "interests": ["count", "unique", "unique@3"],
        },
    )
    metrics = processor.get_metrics()
  46. Case: Metrics collection Library support Classification-style predictions (pred / gt pairs) and supported metrics (Label count, F1-score, Recall, Accuracy):

    pred  gt
    A     A
    B     C
    C     C
    A     B
    B     b
    …     …
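    As a rough illustration of the metrics named above (not the Lupus library's actual interface), they can be computed from such pred/gt columns with scikit-learn:

    # Illustration of the listed classification metrics; not the Lupus library API.
    from collections import Counter
    from sklearn.metrics import accuracy_score, f1_score, recall_score

    pred = ["A", "B", "C", "A", "B"]
    gt = ["A", "C", "C", "B", "B"]

    label_count = Counter(pred)  # label count per predicted class
    accuracy = accuracy_score(gt, pred)
    recall = recall_score(gt, pred, average="macro", zero_division=0)
    f1 = f1_score(gt, pred, average="macro", zero_division=0)
    print(label_count, accuracy, recall, f1)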
  47. Case: Metrics collection Library support Ranked-list predictions (pred / gt pairs) and supported metrics (Unique, nDCG@k, Recall, Entropy@k):

    pred       gt
    [A, C, B]  [A, B]
    [A, C, B]  [A]
    [C, D, E]  [C, D]
    [A, D, C]  [C]
    [D, E, B]  [A, D]
    …          …
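    For the ranked-list case, the listed metrics can be sketched as follows; nDCG@k uses binary relevance here, and Entropy@k measures how evenly items appear across recommendations (a diversity proxy). Again an illustration, not the library's actual implementation:

    # Illustration of ranking metrics over (pred, gt) list pairs; not the Lupus library API.
    import math
    from collections import Counter

    def recall_at_k(pred, gt, k):
        return len(set(pred[:k]) & set(gt)) / len(gt) if gt else 0.0

    def ndcg_at_k(pred, gt, k):
        dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(pred[:k]) if item in gt)
        ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gt), k)))
        return dcg / ideal if ideal > 0 else 0.0

    def entropy_at_k(all_preds, k):
        counts = Counter(item for pred in all_preds for item in pred[:k])
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    pairs = [(["A", "C", "B"], ["A", "B"]), (["C", "D", "E"], ["C", "D"]), (["A", "D", "C"], ["C"])]
    k = 3
    print([recall_at_k(p, g, k) for p, g in pairs])
    print([ndcg_at_k(p, g, k) for p, g in pairs])
    print(entropy_at_k([p for p, _ in pairs], k))
    print(len({item for p, _ in pairs for item in p[:k]}))  # unique recommended items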
  48. Case: Anomaly detection Which kind of alert do we need? Anomalies in the context of MLOps have more complex conditions than in DevOps. Basic rules (sketched below) › If a metric exceeds a threshold › If a metric deviates significantly from the average of recent days Complex rules › If a metric deviates significantly from its periodic pattern › If the trend of a metric changes
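    A minimal sketch of the two basic rules, assuming a daily metric history as a pandas Series; the threshold value and the 7-day / 3-sigma window are illustrative choices, not Lupus defaults:

    # Sketch of the basic rules: a fixed threshold and a deviation-from-recent-average check.
    import pandas as pd

    def exceeds_threshold(value: float, threshold: float) -> bool:
        return value > threshold

    def deviates_from_recent(history: pd.Series, value: float, window: int = 7, sigmas: float = 3.0) -> bool:
        recent = history.tail(window)
        if recent.std(ddof=0) == 0:
            return value != recent.mean()
        return abs(value - recent.mean()) > sigmas * recent.std(ddof=0)

    history = pd.Series([100, 103, 98, 101, 99, 102, 100])
    print(exceeds_threshold(250, threshold=200))
    print(deviates_from_recent(history, 140))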
  49. Case: Anomaly detection Available anomaly detection methods › Thresholding › Window-based rules › Time-series prediction by Prophet › Twitter’s AnomalyDetection package
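    For the periodic and trend-related rules, time-series prediction is one of the listed options. A minimal sketch using Prophet, assuming a daily metric stored with Prophet's expected ds/y columns; the interval width and the toy data are illustrative:

    # Sketch: flag points that fall outside Prophet's prediction interval.
    # The package import name may be `fbprophet` on older installs.
    import pandas as pd
    from prophet import Prophet

    metric_df = pd.DataFrame({
        "ds": pd.date_range("2021-08-01", periods=60, freq="D"),
        "y": [100 + (i % 7) * 3 for i in range(60)],   # toy weekly-seasonal metric
    })

    model = Prophet(interval_width=0.99)
    model.fit(metric_df)
    forecast = model.predict(metric_df[["ds"]])

    merged = metric_df.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
    anomalies = merged[(merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])]
    print(anomalies[["ds", "y", "yhat"]])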
  50. Case: Visualization Features and motivation Web UI for metrics visualization › Metrics charts with anomaly information. › An explorer to easily discover a desired chart. › User-customizable dashboards for daily observations. Why self-made? › We have simple but specific use cases; major OSS tools do not fit our needs despite their complexity. › Lupus has niche requirements such as showing anomalies and narrowing down by metric groups. › LINE takes user privacy seriously, and Lupus has strict and complicated authentication requirements.
  51. Impacts Easy monitoring › It became much easier to collect daily metrics than before. Avoiding outages › Lupus helps us find outages by detecting problems we hadn’t noticed before. Reliable monitoring code › We could move from self-made notebooks to a reliable, reviewed codebase. Fast access, shareable UI › We can access collected metrics very quickly with the Lupus Web UI. › We can also easily share them with project members. Discover insights › We found changes in the accuracy of our products that we hadn’t been aware of. › This motivated us to improve the products.
  57. Summary Monitoring on MLOps › MLOps requires additional monitoring metrics related to data and ML models. Our challenges in MLOps monitoring › Thanks to earlier efforts, the ML Dept. can now release ML products with short development times. › Along with this, the cost of monitoring has kept growing. Our solution › We have developed our own monitoring system for MLOps, called Lupus. › Lupus provides three components that help us collect, alert on, and visualize metrics efficiently.
  58. Reference › Introducing MLOps (O'Reilly Media, Inc.): Mark Treveil, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki and Lynn Heidmann › Practical MLOps (O'Reilly Media, Inc.): Noah Gift and Alfredo Deza › MLOps: Continuous delivery and automation pipelines in machine learning (Google Cloud): https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning › Evidently AI blog: machine learning monitoring series (Evidently AI): https://evidentlyai.com/blog#!/tfeeds/393523502011/c/machine%20learning%20monitoring%20series