Building LINE Pay Monitoring System and Anomaly Log Detection System Using ML

Sun Uk Terry Kim
LINE Biz Plus / Fraud Detection System / Server Engineer
Giseung Lee
LINE Biz Plus / Pay DevOps / Devops Engineer

https://linedevday.linecorp.com/2021/ja/sessions/94
https://linedevday.linecorp.com/2021/en/sessions/94
https://linedevday.linecorp.com/2021/ko/sessions/94

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Agenda

    Part 1: Building LINE Pay Monitoring System
    - Problem of Existing Monitoring System
    - Architecture of New Monitoring System
    - Next Steps
    Part 2: Abnormal Log Detection with ML
    - Finding Needs
    - Hypothesis & Model Selection
    - Model Architecture and Challenges
    - Result Discussion
  2. Various Monitoring Tools

    LINE Pay members use several tools for LINE Pay: Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics. Having various monitoring tools causes several problems.
  3. Hard to grasp system

    LINE Pay members have to visit many systems (Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics) to check CPU usage, storage usage, log messages or counts, and business stats.
  4. Hard to grasp system

    Monitoring tools which were used before: System Information, System Resource - Korea, System Resource - Japan, Application Resource, Log Monitoring, Business Statistics.
  5. Hard to control alarm

    Each system has its own alert system. Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics each send alerts to LINE Pay members in Team01, Team02, and Team03. Outcomes shown: Correct, Not Sent, Wrong Target.
  6. Hard to control alarm

    LINE Pay members get alarms from services they were in charge of before (service in charge of in 2019, in 2020, in 2021). They are still getting alarms from services they are not in charge of now :(
  7. Hard to control alarm

    Different grouping methods. LINE Pay members belong to Team01 and Team02, but Monitoring System 01 says “I don’t have group” while Monitoring System 02 says “I have my own group”.
  8. Hard to handle requirement of LINE Pay

    Common infra systems cannot be changed just for LINE Pay. Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics are common tools in LINE, so specific requirements of LINE Pay are hard to handle quickly. For example, judging from past incidents, we want alerts: - when API call count rapidly decreases - when a specific result type of an API increases - when CPU usage rapidly increases
  9. Hard to handle requirement of LINE Pay

    Need for detecting a weird server in one server group. [Chart: one line per server (server01 to server08), 18:00 to 18:05]
  10. Hard to handle requirement of LINE Pay

    Need for detecting rapid change of API count. [Chart: RPS over 30 time steps]
  11. Hard to handle requirement of LINE Pay

    There are metrics we want, but there is no way to handle them: the common tools in LINE cannot handle metrics the way we want!
  12. Structure of New Monitoring System

    Architecture diagram. Components: Prometheus, Grafana, Alertmanager, Alarm Tool (SMS, Email, Phone Call) to LINE Pay Members, Log Collector, Flink, Pushgateway, System Exporter, System, Application Exporter, Log Appender.
  13. Structure of New Monitoring System

    Architecture diagram (components as in slide 12). Stage: Collecting Metrics.
  14. Structure of New Monitoring System

    Architecture diagram (components as in slide 12). Stage: Reprocessing Metrics and Evaluating Alert.
  15. Structure of New Monitoring System

    Architecture diagram (components as in slide 12). Stage: Managing Alarm.
  16. Structure of New Monitoring System

    Architecture diagram (components as in slide 12). Stage: Visualization.
  17. Collecting Metrics

    Application Resource: - GC count, time - JVM heap usage - Thread status - Business metrics
    System Resource: - Server information - CPU usage - Memory usage - Storage usage
  18. Collecting Metrics

    Log: - Message - API duration - Log level - Result type
  19. Reprocessing Metrics and Making Alert

    Reprocess Raw Metric: - Stream processing for various purposes - Make more meaningful information
  20. Reprocessing Metrics and Making Alert

    Flink turns raw metrics into more meaningful information. Job1: Calculate Statistics. Job2: Detect Abnormal.
  21. Reprocessing Metrics and Making Alert

    Temporal Repository: - For components to which pull-based collection is hard to apply
    Alert Evaluation: - Create flexible alert rules using PromQL
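For components that cannot be scraped directly, a job can push its latest values to a Pushgateway, which Prometheus then scrapes. The following is a minimal sketch of what such a push might look like, assuming the Pushgateway is the "temporal repository" on this slide. The address, job name, and metric name are hypothetical, and the request is only built, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_push_request(base_url, job, instance, metrics):
    """Build (but do not send) a Pushgateway push request.

    metrics: dict of metric name -> numeric value, rendered as untyped
    samples in the Prometheus text exposition format.
    """
    # One "name value" sample per line; the body must end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))
    # The grouping key lives in the URL path: /metrics/job/<job>/instance/<instance>
    url = (
        f"{base_url}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    # PUT replaces all metrics previously pushed under the same grouping key.
    return Request(url, data=body.encode("utf-8"), method="PUT")


req = build_push_request(
    "http://pushgateway.example:9091",  # hypothetical address
    job="flink_api_stats",              # hypothetical job name
    instance="job1",                    # hypothetical instance label
    metrics={"api_request_rps": 42.5},  # hypothetical metric
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"), end="")
```

Passing the returned object to `urllib.request.urlopen` would perform the actual push.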
  22. Managing Alarm

    Managing Alert: - Group alerts by label - Determine where to send each alert
    Sending Alarms: - Easy to control alarms
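The two alert-management steps on this slide (group by label, then decide the destination) can be illustrated with a small sketch. This is not Alertmanager's implementation, only the idea in plain Python. The label names, the routing table, and the fallback receiver are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical routing table: value of the "service" label -> receiving team.
ROUTES = {"service01": "team01", "service02": "team02"}
DEFAULT_RECEIVER = "oncall"  # assumed fallback when no route matches


def plan_notifications(alerts, group_by=("service", "alertname")):
    """Group alerts by label values, then pick one receiver per group."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)

    plan = defaultdict(list)
    for key, members in groups.items():
        group_labels = dict(zip(group_by, key))
        receiver = ROUTES.get(group_labels.get("service"), DEFAULT_RECEIVER)
        # One notification per group instead of one per firing alert.
        plan[receiver].append({"group": group_labels, "count": len(members)})
    return dict(plan)


alerts = [
    {"labels": {"alertname": "HighCpu", "service": "service01", "instance": "server01"}},
    {"labels": {"alertname": "HighCpu", "service": "service01", "instance": "server02"}},
    {"labels": {"alertname": "ApiCallDrop", "service": "service03", "instance": "server09"}},
]
plan = plan_notifications(alerts)
print(plan)
```

Here the two `HighCpu` alerts collapse into a single notification for `team01`, and the unrouted `service03` alert falls back to `oncall`.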
  23. Managing Alarm

    Diagram columns: Metric sources (Servers and Applications), Alert Evaluation (Monitoring System01, LINE Pay Monitoring System, Monitoring System02), Alertmanager, Alarm Tool, Service (Service01, Service02, Service03), Team (custom groups Team01, Team02, Team03), Members.
  24. Visualization

    Visualizing Metric: - Application overview - API overview - Business statistics
  25. Easy to grasp system

    The existing system, which is hard to grasp: LINE Pay members check CPU usage, storage usage, log messages or counts, and business stats across Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics.
  26. Easy to grasp system

    Integrating existing systems. Diagram: Logging, LINE Pay Monitoring System, LINE Pay Member. Checks available: log message, CPU usage, JVM usage, API count, business stats.
  27. Easy to control alarm

    The existing system, which is hard to control: Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics each have their own alert system sending to LINE Pay members in Team01, Team02, and Team03. Outcomes shown: Correct, Not Sent, Wrong Target.
  28. Easy to control alarm

    Control alarms in one system. Components: Prometheus, External System with Connector, Alertmanager, Alarm Tool (the place to control alarms), Service01 to Service03, Team01 to Team03, Oncall, Oncall Lead, LINE Pay Members. Includes oncall escalation.
  29. Easy to handle requirement of LINE Pay

    Most of the existing monitoring tools (Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics) are common infrastructure of LINE, so specific requirements of LINE Pay were hard to handle quickly. For example, judging from past incidents, we want alerts: - when API call count rapidly decreases - when a specific result type of an API increases - when CPU usage rapidly increases
  30. Easy to handle requirement of LINE Pay

    Now LINE Pay can handle metrics itself. The LINE Pay Monitoring System holds server resource metrics, application resource metrics, API statistic metrics, and business statistic metrics, so specific requirements of LINE Pay are easy to handle quickly. For example, judging from past incidents, we want alerts: - when API call count rapidly decreases - when a specific result type of an API increases - when CPU usage rapidly increases
  31. Easy to handle requirement of LINE Pay

    Need for detecting a weird server in the same server group. [Chart: one line per server (server01 to server08), 18:00 to 18:05]
  32. Easy to handle requirement of LINE Pay

    The way to detect a weird server in the same server group. Alert rule (compare metrics): compare the CPU utilization of Server01 with the average CPU utilization of the servers except Server01 (Server02, Server03, Server04), and alert on the difference.
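The comparison on this slide can be sketched in a few lines of Python. The threshold form `one server > n * average of the others` is taken from the later "Different Scale of Server Usage" slide. The sample CPU values and the function name are made up for illustration.

```python
def find_outlier_servers(cpu_by_server, n=2.0):
    """Return servers whose CPU exceeds n times the average of the others."""
    if len(cpu_by_server) < 2:
        return []  # nothing to compare against
    total = sum(cpu_by_server.values())
    others = len(cpu_by_server) - 1
    outliers = []
    for server, cpu in cpu_by_server.items():
        # Average over every server in the group except this one.
        avg_others = (total - cpu) / others
        if cpu > n * avg_others:
            outliers.append(server)
    return outliers


# Hypothetical CPU utilization (%) for one server group.
group = {"server01": 80.0, "server02": 20.0, "server03": 25.0, "server04": 15.0}
print(find_outlier_servers(group))
```

With these values, the other three servers average 20%, so `server01` at 80% exceeds the 40% threshold and is the only one reported.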
  33. Easy to handle requirement of LINE Pay

    Need for detecting a rapid decrease of API calls. [Chart: RPS over 30 time steps]
  34. Easy to handle requirement of LINE Pay

    PromQL provides many useful functions.
    - record: request:rps:1m
      expr: {PromQL for request per seconds}
    - record: request:rps:1m:avg:10m
      expr: avg_over_time(request:rps:1m[10m])
    - record: request:rps:1m:stddev:10m
      expr: stddev_over_time(request:rps:1m[10m])
    Acceptable range: request:rps:1m:avg:10m ± 3 * request:rps:1m:stddev:10m
  35. Easy to handle requirement of LINE Pay

    Permissible range using standard deviation. [Chart: Current RPS plotted between an upper bound of Average + 3*Standard Deviation and a lower bound of Average - 3*Standard Deviation]
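The acceptable range `average ± 3 * standard deviation` over a 10-minute window can be reproduced outside Prometheus as a quick check. This sketch uses the population standard deviation, which is what `stddev_over_time` computes. The sample RPS values are invented.

```python
from statistics import fmean, pstdev


def acceptable_range(window, k=3.0):
    """Return (low, high) = average -/+ k * population standard deviation."""
    avg = fmean(window)
    std = pstdev(window)
    return avg - k * std, avg + k * std


def is_abnormal(current, window, k=3.0):
    """True when the current value falls outside the acceptable range."""
    low, high = acceptable_range(window, k)
    return current < low or current > high


# Hypothetical 1-minute RPS samples from the last 10 minutes.
window = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
low, high = acceptable_range(window)
print(f"acceptable range: {low:.2f} .. {high:.2f}")
print(is_abnormal(60, window), is_abnormal(101, window))
```

For this window the average is 100 and the variance is 2.8, so the range is roughly 95 to 105. A drop to 60 RPS is flagged, while 101 RPS is not.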
  36. Easy to handle requirement of LINE Pay

    Many connection points where external systems can be integrated: each node can integrate with an external system (e.g. logging, Slack, Jira). Components: Grafana, Alertmanager, Pagerduty (SMS, Email, Phone Call) to LINE Pay Members, Prometheus, Node Exporter on Server, Application (App metric, Log Appender), Log Collector, Flink, PushGateway, External System.
  37. Easy to handle requirement of LINE Pay

    Integration of alerts from the log monitoring system.
    Log Monitoring System (Evaluate Alert): - Detect log pattern, then call the Callback API with: - The number of matched logs - Log body - Log time - Application name
    Connector (Trigger Alert) to Alarm Tool: - Alert title - Severity - Summary
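A connector of this kind mostly maps the callback fields onto the alert fields. The sketch below shows one way that mapping might look. The payload keys, the severity cutoff, and the title format are all hypothetical, since the slides only name the fields.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    title: str
    severity: str
    summary: str


def callback_to_alert(payload, critical_count=100):
    """Map a log-monitoring callback payload onto an alert.

    Payload keys (matched_count, log_body, log_time, application) and the
    critical_count cutoff are assumptions made for this example.
    """
    count = payload["matched_count"]
    severity = "critical" if count >= critical_count else "warning"
    title = f"[{payload['application']}] log pattern matched"
    body = payload["log_body"][:200]  # keep the summary short
    summary = f"{count} matching logs at {payload['log_time']}: {body}"
    return Alert(title, severity, summary)


alert = callback_to_alert(
    {
        "matched_count": 250,
        "log_body": "Connection failed",
        "log_time": "2021-11-10T18:03:00+09:00",
        "application": "pay-api",  # hypothetical application name
    }
)
print(alert)
```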
  38. What’s next?

    Next things to do for improvement: - Different Server Usage Pattern - Different Pattern of API Request - Different Scale of Server Usage
  39. What’s next?

    Next things to do for improvement: - Different Server Usage Pattern - Different Pattern of API Request - Different Scale of Server Usage
  40. Different Server Usage Pattern

    The same alert rules apply to general servers and to admin and batch servers. Servers need to be categorized further and alert rules need to be more concrete: categorize servers and make the alert rules concrete.
  41. Different Scale of Server Usage

    Next things to do for improvement: - Different Server Usage Pattern - Different Pattern of API Request - Different Scale of Server Usage
  42. Different Scale of Server Usage

    Compare the CPU utilization of Server01 with the average CPU utilization of the servers except Server01 (Server02, Server03, Server04). Alert when {One server} > n * {The average of other servers}
  43. Different Scale of Server Usage

    With n = 2: when the average CPU utilization is 25%, the threshold is 50%, which catches a slow API. When the average CPU utilization is 5%, the threshold is only 10%, which produces false positive alerts.
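The arithmetic behind this slide is short enough to check directly. In the sketch below, the averages and n = 2 are the values from the slide. The per-server readings of 60% and 11% are invented to show how a purely relative threshold behaves at different scales.

```python
def threshold(avg_others, n=2.0):
    """Relative threshold from the rule: one server > n * average of others."""
    return n * avg_others


def fires(cpu, avg_others, n=2.0):
    return cpu > threshold(avg_others, n)


# Busy group: average 25% -> threshold 50%. A server at 60% is 35 points
# above its peers, which is a meaningful difference.
print(threshold(25.0), fires(60.0, 25.0))

# Idle group: average 5% -> threshold 10%. A server at 11% is only 6 points
# above its peers but still fires, which is the false positive case.
print(threshold(5.0), fires(11.0, 5.0))
```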
  44. Different Pattern of API Request

    Next things to do for improvement: - Different Server Usage Pattern - Different Pattern of API Request - Different Scale of Server Usage
  45. Different Pattern of API Request

    Frequently requested APIs versus rarely or regularly requested APIs: for the latter, the average and standard deviation approach does not work.
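One way to see why the average and standard deviation band breaks down for rarely called APIs: when the recent window is all zeros, the band collapses to a single point, so any call at all looks abnormal. The sketch below illustrates this with invented samples. It is an interpretation of the slide, not the speakers' analysis.

```python
from statistics import fmean, pstdev


def band(window, k=3.0):
    """Acceptable range: average -/+ k * population standard deviation."""
    avg, std = fmean(window), pstdev(window)
    return avg - k * std, avg + k * std


def outside(value, window):
    low, high = band(window)
    return value < low or value > high


# Hypothetical rarely requested API: no calls in the last 10 minutes.
rare_window = [0] * 10
print(band(rare_window))        # the band has zero width
print(outside(1, rare_window))  # a single legitimate call is flagged
```

An API called on a fixed schedule (for example by a batch) behaves the same way: every scheduled burst lands outside a band that was computed from the quiet period before it.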
  46. What’s next?

    Next things to do for improvement: - Different Server Usage Pattern - Different Pattern of API Request - Different Scale of Server Usage
  47. What’s next?

    Next things to do for improvement: reducing false positive alerts and making the monitoring system reliable.
  48. Where Does This Need Come From?

    - Service size ↑ - False positives ↑ - Engineer fatigue ↑ - True positive awareness ↓ - Outages ↑
    LINE Pay Log Summary: API ~400, Return Code ~200, # of Errors ~10M
  49. Where Does This Need Come From?

    LINE Pay Log Summary: API ~400, Return Code ~200, # of Errors ~10M
    - False positives ↓ - Engineer fatigue ↓ - True positive awareness ↑ - Outages ↓ - Goal: Log Forecast
  50. Hypothesis Formulation

    Hypothesis 1 (Request Log): Sudden increase of particular requests
    Hypothesis 2 (Log Message): Tangled sequence of logs
    Outage Report Category | Rate
    Request Log            | 45%
    Log Message            | 35%
    System Stats           | 5%
    Others                 | 15%
  51. Model Selection

    Criteria: - Unsupervised learning - Robust to retraining - Fast implementation - Easy-to-find root cause - Simple approach
    Hypothesis 1: Gaussian Mixture Model
    Hypothesis 2: LSTM with Workflow
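To make the Gaussian Mixture Model choice concrete, here is a self-contained sketch of unsupervised anomaly scoring with a one-dimensional mixture fitted by plain EM. The feature (per-minute request counts), the number of components, and the score threshold are assumptions for illustration. The talk does not describe the actual features or implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    ordered = sorted(data)
    # Deterministic initialisation: spread the means across the sorted data.
    means = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    overall = sum(data) / len(data)
    spread = max(sum((x - overall) ** 2 for x in data) / len(data), min_var)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


def log_likelihood(x, model):
    """Log density of x under the mixture; low values suggest an anomaly."""
    weights, means, variances = model
    density = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return math.log(max(density, 1e-300))


# Hypothetical per-minute request counts with a quiet and a busy regime.
history = [18, 20, 22, 19, 21, 20, 23, 17, 98, 100, 102, 99, 101, 100, 103, 97]
model = fit_gmm_1d(history)
THRESHOLD = -10.0  # assumed cutoff on the log-likelihood
for value in (100, 400):
    score = log_likelihood(value, model)
    print(value, "anomaly" if score < THRESHOLD else "normal")
```

A sudden jump to 400 requests per minute falls far from both learned regimes and scores below the cutoff, while 100 is treated as normal. No labels are needed, and refitting on fresh data is a single function call.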
  52. Solution #2: Fill in the blank

    [Two charts over time steps t1 to t16] Result: 90% elimination of batch anomalies.
  53. Challenge #3 Minorities (Gaussian Mixture Model)

    Expected behavior: Light 1 → 100 (100%), Heavy 100 → 10,000 (100%)
    Actual behavior: Light 1 → 100 (100%), Heavy 1 → 10,000 (10,000%)
  54. Solution #3 Minorities (Gaussian Mixture Model)

    3-1. Apply Penalty [Penalty formula: a per-item factor that is multiplied into the value]
    3-2. Adding Features: ~5 min, ~10 min, and ~30 min windows, plus weekly and monthly features
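The "Adding Features" step lists 5-, 10-, and 30-minute windows. A minimal sketch of such multi-window features is shown below, assuming the input is a list of per-minute counts with the most recent minute last. The feature names are hypothetical, the aggregation (a plain sum) is a guess, and the weekly and monthly features are left out because the slides do not say how they are derived.

```python
def window_features(per_minute_counts, windows=(5, 10, 30)):
    """Sum the most recent counts over several trailing windows (in minutes)."""
    features = {}
    for minutes in windows:
        recent = per_minute_counts[-minutes:]
        features[f"count_last_{minutes}m"] = sum(recent)
    return features


# Hypothetical series: 25 quiet minutes followed by a 5-minute burst.
counts = [2] * 25 + [10] * 5
print(window_features(counts))
```

Short windows react quickly to a burst, while longer windows put it in context: here the last 5 minutes account for half of the 30-minute total.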
  55. GMM Result Summary (Gaussian Mixture Model)

    - 10M errors → 500 ⇩ - ~30 types of APIs - Post-processing: 500 → 150 ⇩ - 15% of outages ⊆ 150
  56. Result Discussion

    - Case I: Token Error - Case II: Service Maintenance - Case III: Macro Retries from Particular User - Case IV: Account Lock - Case V: Memory Errors - Case VI: Connection Failed - Case VII: New Message