
Lupus - A Monitoring System for Accelerating MLOps


LINE DEVDAY 2021

November 10, 2021



Transcript

  1. Target audience › People who are managing ML products. › People who belong to an ML team that is expected to grow much more. › Anyone who is interested in MLOps.
  2. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  3. Self introduction Junki Ishikawa, Machine Learning Development Team. Joined in April 2021 as a new graduate. In charge of › Recommendation › Internal library development › Internal application development Personal › Living with Java sparrows
  4. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  5. What’s MLOps? ML + DevOps [Diagram: the DEV / OPS / ML loop with the stages DESIGN, ANALYZE, EVALUATE, CODE, PLAN, BUILD, RELEASE, OPERATE, MONITOR]
  11. What’s MLOps monitoring? What to monitor DevOps › Resource usage

    (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  12. What’s MLOps monitoring? What to monitor DevOps MLOps › Data

    statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness + › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
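    A minimal sketch of the kind of input-drift check implied by "data drift" above, assuming two daily snapshots of one feature are available as pandas Series; the two-sample Kolmogorov-Smirnov test from SciPy is one common choice, not necessarily the method the ML Dept. uses:

    # Hedged illustration: compare a reference day's and the current day's distribution of one feature.
    # The sample values and the 0.05 significance level are arbitrary assumptions.
    import pandas as pd
    from scipy.stats import ks_2samp

    def detect_feature_drift(reference: pd.Series, current: pd.Series, alpha: float = 0.05) -> bool:
        """Return True if the current distribution differs significantly from the reference."""
        result = ks_2samp(reference.dropna(), current.dropna())
        return result.pvalue < alpha

    yesterday = pd.Series([23, 42, 64, 27, 38])
    today = pd.Series([71, 82, 75, 90, 68])
    if detect_feature_drift(yesterday, today):
        print("possible data drift in the feature")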
  14. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics

    › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  15. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness [Automation] › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  16. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness [Automation] [Interval] › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  17. What’s MLOps monitoring? Other differences DevOps MLOps › Data statistics › Input data changes (data drift) › Input - target pattern changes (concept drift) › Model performance › Prediction accuracy › Diversity of recommendations › Fairness [Automation] [Interval] [Alert logic] › Resource usage (CPU, Memory, Storage, …) › Disk I/O › Network Traffic › Heartbeats › Business KPIs › DevOps KPIs (MTTR, …) › etc…
  18. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  19. MLOps at ML Dept. Scale [Chart: No. of logics to select contents on SmartCH, plotted monthly from 2019/05 to 2021/09 and growing to roughly 160; series: Logic 1, Logic 2, …, Logic n]
  20. MLOps at ML Dept. Facilities › Kubernetes › IU (LINE’s

    Hadoop cluster) Infrastructure › Jutopia (LINE’s Jupyter server) Prototyping environment
  21. MLOps at ML Dept. Facilities › Kubernetes › IU (LINE’s

    Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › Jutopia (LINE’s Jupyter server) Prototyping environment
  22. MLOps at ML Dept. Facilities › Kubernetes › IU (LINE’s

    Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  23. MLOps at ML Dept. Facilities › User sparse/dense features ›

    Item metadata features Shared feature vectors › Kubernetes › IU (LINE’s Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  24. MLOps at ML Dept. Facilities › Distributed training & inference

    › Model collections › Recommendation automation › I/O manager › etc… Internal libraries › User sparse/dense features › Item metadata features Shared feature vectors › Kubernetes › IU (LINE’s Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  25. MLOps at ML Dept. Facilities › Distributed training & inference

    › Model collections › Recommendation automation › I/O manager › etc… Internal libraries › User sparse/dense features › Item metadata features Shared feature vectors › A/B test manager › A/B test monitoring system › Recommendation demo generator Internal experiment manager › Kubernetes › IU (LINE’s Hadoop cluster) Infrastructure › Argo Workflows › Azkaban › Airflow Workflow engines › ArgoCD › Drone CI CI / CD tools › Jutopia (LINE’s Jupyter server) Prototyping environment
  26. MLOps at ML Dept. Common pipeline [Diagram: the DEV / OPS / ML loop (DESIGN, ANALYZE, EVALUATE, CODE, PLAN, BUILD, RELEASE, OPERATE, MONITOR) annotated with the facilities above: Prototyping tools, Internal experiment manager, Workflow Engines, CI/CD tools, Internal Libraries, Shared feature vectors]
  27. MLOps at ML Dept. Common pipeline [Same diagram, with a "?" marking the part of the loop not yet covered by these tools]
  28. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  29. Our challenges in MLOps monitoring Monitoring issues › As the

    number of ML products increases, the cost of monitoring has steadily grown. Increasing monitoring costs
  30. Our challenges in MLOps monitoring Monitoring issues Disjointed, project-dependent monitoring

    operations Increasing monitoring costs › Each project has different monitoring methods and alerts. › Sometimes cheap, sometimes poor. › As the number of ML products increases, the cost of monitoring has steadily grown.
  31. Our challenges in MLOps monitoring Monitoring issues Disjointed, project-dependent monitoring

    operations Outages due to lack of monitoring Increasing monitoring costs › Each project has different monitoring methods and alerts. › Sometimes cheap, sometimes poor. › As the number of ML products increases, the cost of monitoring has steadily grown. › There are many causes of outages (e.g. missing data, the changes of model outputs, etc.). › It is nearly impossible to manually monitor every product.
  32. Our challenges in MLOps monitoring Actual outages we experienced before › Manual Monitoring (Cause: handcrafted monitoring code on Jupyter notebooks. Impact: cheap metrics, poor alerting, unreviewed code.) › Data Missing (Cause: cluster outage, delay. Impact: low-quality predictions, empty predictions.) › Model Update (Cause: model architecture update, smoothing. Impact: significant drift in the prediction distribution, found out 2 weeks later.)
  35. Our challenges in MLOps monitoring What we need Detection Collection

    Metrics aggregation tools Reliable metrics store
  36. Our challenges in MLOps monitoring What we need Detection Collection

    Metrics aggregation tools Reliable metrics store Flexible anomaly detector Alerting system
  37. Our challenges in MLOps monitoring What we need Detection Visualization

    Collection Metrics aggregation tools Reliable metrics store Flexible anomaly detector Alerting system
  38. Our challenges in MLOps monitoring What we need Detection Visualization

    Collection User-friendly GUI app Metrics aggregation tools Reliable metrics store Flexible anomaly detector Alerting system
  39. Agenda › What’s MLOps monitoring? › MLOps at ML Dept.

    › Our challenges in MLOps monitoring › Lupus: our monitoring infrastructure
  40. Lupus Concept › For engineers: easy to collect › For operators: easy to detect › For project members: easy to visualize
  41. Lupus Components › Lupus server: metric management and anomaly detection APIs › Lupus SPA: web app for visualizing metrics and anomalies › Lupus library: metrics aggregation tools and API client
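    These three components imply a simple flow: the library aggregates metrics in batch jobs, the server stores them and runs detection, and the SPA visualizes the results. A purely illustrative sketch of that flow, assuming nothing about the real Lupus API; the endpoint path, payload schema, and auth header below are invented placeholders:

    # Illustrative only: pushing aggregated metrics to a metrics-management API.
    # The URL, JSON schema, and token handling are hypothetical, not the actual Lupus server interface.
    import datetime
    import requests

    metrics = [
        {"project": "example-reco", "metric": "prediction_avg", "value": 3.8,
         "date": datetime.date.today().isoformat()},
    ]

    response = requests.post(
        "https://lupus.example.internal/api/v1/metrics",   # placeholder endpoint
        json={"metrics": metrics},
        headers={"Authorization": "Bearer <token>"},       # placeholder auth
        timeout=10,
    )
    response.raise_for_status()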
  42. Case: Metrics collection Which kind of metrics should we monitor?

    Effective metrics depend on the task, data, model and so on… Data drift / Concept drift › Statistics of input data › Statistics of target variables Model degradation / replacement › Statistics of predictions › Ground-truth evaluation › Training / Validation metrics Lupus library helps to aggregate these metrics
  43. Case: Metrics collection Library support Example input and supported aggregations (Avg@k, Sum, Min, 95th percentile, count per entity, unique entity count@k per Region):

    Region  Age  Device   Rating      Interests
    JP      23   iOS      [5.0, 4.0]  [a, b, c]
    JP      42   Mac      [2.0]       [e, g]
    JP      64   Android  [4.5, 3.5]  [x, y, a]
    US      27   iOS      [4.0, 4.0]  [v, t, v]
    US      38   Android  [3.0]       [y]
    …       …    …        …           …
  44. Case: Metrics collection Library support Hand-written PySpark code to compute the statistics above:

    import pyspark.sql.functions as F
    from pyspark.sql import Row

    stats = []

    # age
    age_stats = df.groupby("region").agg(
        F.avg("age").alias("avg"), F.max("age").alias("max"), F.min("age").alias("min"))
    for row in age_stats.toLocalIterator():
        stats.append({"col": "age", "region": row.region, "metric": "avg", "value": row["avg"]})
        stats.append({"col": "age", "region": row.region, "metric": "max", "value": row["max"]})
        stats.append({"col": "age", "region": row.region, "metric": "min", "value": row["min"]})

    # device
    device_counts = df.groupby("region", "device").agg(F.count("device").alias("count"))
    device_unique = device_counts.groupby("region").agg(F.count("count").alias("unique"))
    for row in device_counts.toLocalIterator():
        stats.append({"col": "device", "region": row.region, "metric": "count", "value": row["count"], "device": row.device})
    for row in device_unique.toLocalIterator():
        stats.append({"col": "device", "region": row.region, "metric": "count", "unique": row["unique"]})

    # ratings
    def truncate(df, col, k):
        def _(row):
            dic = row.asDict()
            dic[col] = dic[col][:k]
            return Row(**dic)
        return df.rdd.map(_).toDF()

    ratings_stats_all = (
        df.select("region", F.explode("ratings").alias("ratings"))
        .groupby("region").agg(F.avg("ratings").alias("avg"), F.max("ratings").alias("max"), F.min("ratings").alias("min"))
    )
    for row in ratings_stats_all.toLocalIterator():
        stats.append({"col": "ratings", "region": row.region, "metric": "avg", "value": row["avg"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "max", "value": row["max"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "min", "value": row["min"]})

    ratings_stats_top5 = (
        truncate(df, "ratings", 5)
        .select("region", F.explode("ratings").alias("ratings"))
        .groupby("region").agg(F.avg("ratings").alias("avg"), F.max("ratings").alias("max"), F.min("ratings").alias("min"))
    )
    for row in ratings_stats_top5.toLocalIterator():
        stats.append({"col": "ratings", "region": row.region, "metric": "avg@5", "value": row["avg"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "max@5", "value": row["max"]})
        stats.append({"col": "ratings", "region": row.region, "metric": "min@5", "value": row["min"]})

    # interests
    interests_count_all = (
        df.select("region", F.explode("interests").alias("interests"))
        .groupby("region", "interests").agg(F.count("interests").alias("count"))
    )
    interests_unique_all = interests_count_all.groupby("region").agg(F.count("count").alias("unique"))
    for row in interests_count_all.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "value": row["count"], "interests": row.interests})
    for row in interests_unique_all.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "unique": row["unique"]})

    interests_count_top5 = (
        truncate(df, "interests", 5)
        .select("region", F.explode("interests").alias("interests"))
        .groupby("region", "interests").agg(F.count("interests").alias("count"))
    )
    interests_unique_top5 = interests_count_top5.groupby("region").agg(F.count("count").alias("unique"))
    for row in interests_count_top5.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "value": row["count"], "interests": row.interests})
    for row in interests_unique_top5.toLocalIterator():
        stats.append({"col": "interests", "region": row.region, "metric": "count", "unique": row["unique"]})
  45. Case: Metrics collection Library support The same statistics computed with the Lupus library, replacing the hand-written code on the previous slide:

    from lupus.processor.spark import DistributionProcessor

    processor = DistributionProcessor(
        df,
        group_columns=["region"],
        column_metrics={
            "age": ["avg", "p25", "p50", "p75"],
            "device": ["count", "unique"],
            "ratings": ["avg", "avg@5", "min", "max"],
            "interests": ["count", "unique", "unique@3"],
        },
    )
    metrics = processor.get_metrics()
  46. Case: Metrics collection Library support Classification-style predictions (pred / gt pairs) and supported metrics (Label count, F1-score, Recall, Accuracy):

    pred  gt
    A     A
    B     C
    C     C
    A     B
    B     b
    …     …
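    As a rough illustration of the metrics named above (not the Lupus library's actual interface), they can be computed from such pred/gt columns with scikit-learn:

    # Illustration of the listed classification metrics; not the Lupus library API.
    from collections import Counter
    from sklearn.metrics import accuracy_score, f1_score, recall_score

    pred = ["A", "B", "C", "A", "B"]
    gt = ["A", "C", "C", "B", "B"]

    label_count = Counter(pred)  # label count per predicted class
    accuracy = accuracy_score(gt, pred)
    recall = recall_score(gt, pred, average="macro", zero_division=0)
    f1 = f1_score(gt, pred, average="macro", zero_division=0)
    print(label_count, accuracy, recall, f1)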
  47. Case: Metrics collection Library support Ranked-list predictions (pred / gt pairs) and supported metrics (Unique, nDCG@k, Recall, Entropy@k):

    pred       gt
    [A, C, B]  [A, B]
    [A, C, B]  [A]
    [C, D, E]  [C, D]
    [A, D, C]  [C]
    [D, E, B]  [A, D]
    …          …
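    For the ranked-list case, the listed metrics can be sketched as follows; nDCG@k uses binary relevance here, and Entropy@k measures how evenly items appear across recommendations (a diversity proxy). Again an illustration, not the library's actual implementation:

    # Illustration of ranking metrics over (pred, gt) list pairs; not the Lupus library API.
    import math
    from collections import Counter

    def recall_at_k(pred, gt, k):
        return len(set(pred[:k]) & set(gt)) / len(gt) if gt else 0.0

    def ndcg_at_k(pred, gt, k):
        dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(pred[:k]) if item in gt)
        ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gt), k)))
        return dcg / ideal if ideal > 0 else 0.0

    def entropy_at_k(all_preds, k):
        counts = Counter(item for pred in all_preds for item in pred[:k])
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    pairs = [(["A", "C", "B"], ["A", "B"]), (["C", "D", "E"], ["C", "D"]), (["A", "D", "C"], ["C"])]
    k = 3
    print([recall_at_k(p, g, k) for p, g in pairs])
    print([ndcg_at_k(p, g, k) for p, g in pairs])
    print(entropy_at_k([p for p, _ in pairs], k))
    print(len({item for p, _ in pairs for item in p[:k]}))  # unique recommended items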
  48. Case: Anomaly detection Which kind of alert do we need? Anomalies in the context of MLOps have more complex conditions than in DevOps. Basic rules (sketched below) › If a metric exceeds a threshold › If a metric deviates significantly from the average of recent days Complex rules › If a metric deviates significantly from its periodic pattern › If the trend of a metric changes
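    A minimal sketch of the two basic rules, assuming a daily metric history as a pandas Series; the threshold value and the 7-day / 3-sigma window are illustrative choices, not Lupus defaults:

    # Sketch of the basic rules: a fixed threshold and a deviation-from-recent-average check.
    import pandas as pd

    def exceeds_threshold(value: float, threshold: float) -> bool:
        return value > threshold

    def deviates_from_recent(history: pd.Series, value: float, window: int = 7, sigmas: float = 3.0) -> bool:
        recent = history.tail(window)
        if recent.std(ddof=0) == 0:
            return value != recent.mean()
        return abs(value - recent.mean()) > sigmas * recent.std(ddof=0)

    history = pd.Series([100, 103, 98, 101, 99, 102, 100])
    print(exceeds_threshold(250, threshold=200))
    print(deviates_from_recent(history, 140))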
  49. Case: Anomaly detection Available anomaly detection methods › Thresholding › Window-based rules › Time-series prediction by Prophet › Twitter’s AnomalyDetection package
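    For the periodic and trend-related rules, time-series prediction is one of the listed options. A minimal sketch using Prophet, assuming a daily metric stored with Prophet's expected ds/y columns; the interval width and the toy data are illustrative:

    # Sketch: flag points that fall outside Prophet's prediction interval.
    # The package import name may be `fbprophet` on older installs.
    import pandas as pd
    from prophet import Prophet

    metric_df = pd.DataFrame({
        "ds": pd.date_range("2021-08-01", periods=60, freq="D"),
        "y": [100 + (i % 7) * 3 for i in range(60)],   # toy weekly-seasonal metric
    })

    model = Prophet(interval_width=0.99)
    model.fit(metric_df)
    forecast = model.predict(metric_df[["ds"]])

    merged = metric_df.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
    anomalies = merged[(merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])]
    print(anomalies[["ds", "y", "yhat"]])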
  50. Case: Visualization Features and motivation Web UI for metrics visualization › Metrics charts with anomaly information. › An explorer to easily discover a desired chart. › User-customizable dashboards for daily observations. Why self-made? › We have simple but specific use cases; major OSS tools do not fit our needs despite their complexity. › Lupus has niche requirements such as showing anomalies and narrowing down by metric groups. › LINE takes user privacy seriously, and Lupus has strict and complicated authentication requirements.
  51. Impacts Easy monitoring › It became much easier to collect daily metrics than before. Avoiding outages › Lupus helps us find outages by detecting problems we hadn’t noticed before. Reliable monitoring code › We could move from self-made notebooks to a reliable, reviewed codebase. Fast access, shareable UI › We can access collected metrics very quickly with the Lupus Web UI. › We can also easily share them with project members. Discover insights › We found changes in the accuracy of our products that we hadn’t been aware of. › This motivated us to improve the products.
  57. Summary Monitoring on MLOps › MLOps requires additional monitoring metrics related to data and ML models. Our challenges in MLOps monitoring › Thanks to earlier efforts, the ML Dept. can now release ML products with short development times. › Along with this, the cost of monitoring has kept growing. Our solution › We have developed our own monitoring system for MLOps, called Lupus. › Lupus provides three components that help us collect, alert on, and visualize metrics efficiently.
  58. Reference › Introducing MLOps (O'Reilly Media, Inc.): Mark Treveil, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki and Lynn Heidmann › Practical MLOps (O'Reilly Media, Inc.): Noah Gift and Alfredo Deza › MLOps: Continuous delivery and automation pipelines in machine learning (Google Cloud): https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning › Evidently AI blog: machine learning monitoring series (Evidently AI): https://evidentlyai.com/blog#!/tfeeds/393523502011/c/machine%20learning%20monitoring%20series