Building LINE Pay Monitoring System and Anomaly Log Detection System Using ML

Sun Uk Terry Kim
LINE Biz Plus / Fraud Detection System / Server Engineer
Giseung Lee
LINE Biz Plus / Pay DevOps / Devops Engineer

https://linedevday.linecorp.com/2021/ja/sessions/94
https://linedevday.linecorp.com/2021/en/sessions/94
https://linedevday.linecorp.com/2021/ko/sessions/94

LINE DEVDAY 2021

November 10, 2021

Transcript

  2. Agenda

    - Part 1: Building LINE Pay Monitoring System
      - Problem of Existing Monitoring System
      - Architecture of New Monitoring System
      - Next Steps
    - Part 2: Abnormal Log Detection with ML
      - Finding Needs
      - Hypothesis & Model Selection
      - Model Architecture and Challenges
      - Result Discussion
  3. Build LINE Pay Monitoring System Gi Seung Lee Pay DevOps

  4. Existing Monitoring System

  5. Various Monitoring Tools

    LINE Pay members use several separate tools: Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics. Having this many monitoring tools causes several problems.
  6. Hard to grasp system: LINE Pay members have to visit many systems

    To check CPU usage, storage usage, log messages or counts, and business stats, a LINE Pay member has to visit each tool separately: Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics.
  7. Hard to grasp system: monitoring tools which were used before

    - System Information
    - System Resource - Korea
    - System Resource - Japan
    - Application Resource
    - Log Monitoring
    - Business Statistics
  8. Hard to control alarm: each system has its own alert system

    Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics each send alarms through a separate alert system. For Team01, Team02, and Team03, an alarm may reach the correct target, not be sent, or go to the wrong target.
  9. Hard to control alarm: LINE Pay members get alarms from services they were in charge of before

    A member who was in charge of different services in 2019, 2020, and 2021 still gets alarms from services they are no longer in charge of :(
  10. Hard to control alarm: different grouping methods

    Monitoring System 01 and Monitoring System 02 group LINE Pay members (Team01, Team02) differently: one says “I don’t have group”, the other says “I have my own group”.
  11. Hard to handle requirement of LINE Pay: common infra systems cannot be changed just for LINE Pay

    Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics are common tools in LINE, so requirements specific to LINE Pay are hard to handle quickly. For example, judging from past incidents, we want alerts:
    - when the API call count rapidly decreases
    - when a specific result type of an API increases
    - when CPU usage rapidly increases
  12. Hard to handle requirement of LINE Pay: need for detecting a weird server in one server group

    [Chart: per-server values for server01 to server08, 18:00 to 18:05]
  13. Hard to handle requirement of LINE Pay: need for detecting rapid change of API count

    [Chart: RPS over time]
  14. Hard to handle requirement of LINE Pay: there are metrics we want, but there is no way to handle them

    The common tools in LINE cannot handle these metrics the way we want.
  15. Structure of New Monitoring System

  16. Structure of New Monitoring System

    Components: Prometheus, Grafana, Alertmanager, Alarm Tool (SMS, Email, Phone Call), LINE Pay Members, Log Collector, Flink, Pushgateway, System (System Exporter), Application (Exporter, Log Appender)
  17. Structure of New Monitoring System: Collecting Metrics

  18. Structure of New Monitoring System: Reprocessing Metrics and Evaluating Alert

  19. Structure of New Monitoring System: Managing Alarm

  20. Structure of New Monitoring System: Visualization

  21. Collecting Metrics

    - Application Resource: GC count and time, JVM heap usage, thread status, business metrics
    - System Resource: server information, CPU usage, memory usage, storage usage
  22. Collecting Metrics

    - Log: message, API duration, log level, result type
  23. Reprocessing Metrics and Making Alert

    Reprocess raw metrics:
    - Stream processing for various purposes
    - Make more meaningful information
  24. Reprocessing Metrics and Making Alert

    Flink turns raw metrics into more meaningful information: Job1 calculates statistics, Job2 detects abnormal values.
  25. Reprocessing Metrics and Making Alert

    - Temporal Repository: for components where pull-based collection is hard to apply
    - Alert Evaluation: create flexible alert rules using PromQL
  26. Managing Alarm

    - Managing alerts: group alerts by label, and determine where to send each alert
    - Sending alarms: easy to control alarms
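The two responsibilities named on this slide, grouping alerts by label and deciding where to send them, can be sketched in plain Python. This is only an illustrative model, not Alertmanager's implementation; the label names (`service`, `alertname`), the `ROUTES` table, and the `Oncall` fallback are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical routing table: which team receives alerts for which service.
ROUTES = {"Service01": "Team01", "Service02": "Team02", "Service03": "Team03"}
DEFAULT_RECEIVER = "Oncall"  # assumed fallback when no route matches


def group_alerts(alerts, group_by=("service", "alertname")):
    """Group alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def route(group_key, group_by=("service", "alertname")):
    """Pick a receiver from the group's service label."""
    labels = dict(zip(group_by, group_key))
    return ROUTES.get(labels.get("service"), DEFAULT_RECEIVER)


alerts = [
    {"labels": {"service": "Service01", "alertname": "HighCpu", "instance": "server01"}},
    {"labels": {"service": "Service01", "alertname": "HighCpu", "instance": "server02"}},
    {"labels": {"service": "Service02", "alertname": "RpsDrop", "instance": "server05"}},
    {"labels": {"service": "Unknown", "alertname": "HighCpu", "instance": "server09"}},
]

grouped = group_alerts(alerts)
# One notification per group: (receiver, number of alerts folded into it).
notifications = {key: (route(key), len(members)) for key, members in grouped.items()}
```

Grouping first means two servers firing the same alert for the same service produce a single notification, which is one way to reduce alarm noise.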
  27. Managing Alarm

    [Diagram: alert flow across Metric sources, Alert Evaluation, Service, Team, Members. Metric sources: Servers and Applications, Monitoring System01, Monitoring System02. Alert evaluation: LINE Pay Monitoring System. Services: Service01, Service02, Service03. Custom groups: Team01, Team02, Team03. Delivery: Alertmanager, Alarm Tool.]
  28. Visualization

    Visualizing metrics:
    - Application overview
    - API overview
    - Business statistics
  29. Visualization

    Application Overview, API Statistics, Business Statistics, plus any dashboard LINE Pay wants
  30. What changed?

  31. Easy to grasp system: the existing system was hard to grasp

    A LINE Pay member had to visit Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics separately to check CPU usage, storage usage, log messages or counts, and business stats.
  32. Easy to grasp system: integrating the existing systems

    In the LINE Pay Monitoring System, a LINE Pay member can check log messages, CPU usage, JVM usage, API count, and business stats in one place.
  33. Easy to control alarm: the existing system was hard to control

    Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics each had a separate alert system, so for Team01, Team02, and Team03 an alarm could reach the correct target, not be sent, or go to the wrong target.
  34. Easy to control alarm: control alarms in one system

    Prometheus, Alertmanager, and the External System Connector feed the Alarm Tool, which is the single place to control alarms. Alarms for Service01, Service02, and Service03 go to Team01, Team02, and Team03, with on-call escalation (Oncall, Oncall Lead) to LINE Pay members.
  35. Easy to handle requirement of LINE Pay: most of the existing monitoring tools are common infrastructure of LINE

    With common tools (Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics), requirements specific to LINE Pay were hard to handle quickly. For example, judging from past incidents, we want alerts:
    - when the API call count rapidly decreases
    - when a specific result type of an API increases
    - when CPU usage rapidly increases
  36. Easy to handle requirement of LINE Pay: now LINE Pay can handle the metrics itself

    The LINE Pay Monitoring System holds server resource metrics, application resource metrics, API statistic metrics, and business statistic metrics, so requirements specific to LINE Pay are easy to handle quickly. For example, judging from past incidents, we want alerts:
    - when the API call count rapidly decreases
    - when a specific result type of an API increases
    - when CPU usage rapidly increases
  37. Easy to handle requirement of LINE Pay: need for detecting a weird server in the same server group

    [Chart: per-server values for server01 to server08, 18:00 to 18:05]
  38. Easy to handle requirement of LINE Pay: the way to detect a weird server in the same server group

    Alert rule (compare metrics): within a group of Server01 to Server04, compare the CPU utilization of Server01 with the average CPU utilization of the servers except Server01, and raise an alert based on that comparison.
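The comparison described on this slide can be sketched in Python as follows. The deck expresses the real rule in PromQL, which is not shown in the transcript; the multiplier `n=2.0` is taken from a later slide, and the function name and sample CPU values are assumptions for illustration.

```python
from statistics import mean


def find_outlier_servers(cpu_by_server, n=2.0):
    """Return servers whose CPU utilization exceeds n times the average of the others."""
    outliers = []
    for name, cpu in cpu_by_server.items():
        others = [v for other, v in cpu_by_server.items() if other != name]
        if not others:
            continue  # a single server has no peers to compare against
        if cpu > n * mean(others):
            outliers.append(name)
    return outliers


# Hypothetical CPU utilization (%) for one server group.
cpu = {"server01": 72.0, "server02": 24.0, "server03": 26.0, "server04": 25.0}
weird = find_outlier_servers(cpu)
```

Excluding the server under test from the average keeps a single hot server from raising its own baseline.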
  39. Easy to handle requirement of LINE Pay LINE Pay can

    detect situation like this
  40. Easy to handle requirement of LINE Pay: need for detecting rapid decrease of API calls

    [Chart: RPS over time]
  41. Easy to handle requirement of LINE Pay: PromQL provides many useful functions

    - record: request:rps:1m
      expr: {PromQL for request per seconds}
    - record: request:rps:1m:avg:10m
      expr: avg_over_time(request:rps:1m[10m])
    - record: request:rps:1m:stddev:10m
      expr: stddev_over_time(request:rps:1m[10m])

    Acceptable range: request:rps:1m:avg:10m ± 3 * request:rps:1m:stddev:10m
  42. Easy to handle requirement of LINE Pay: permissible range using standard deviation

    [Chart: Current RPS plotted against Average + 3*Standard Deviation and Average - 3*Standard Deviation]
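The permissible range (average ± 3 × standard deviation over a 10-minute window) can be illustrated with a short Python sketch. It mirrors the recording rules from the slides rather than replacing them. The sample values and function names are assumptions, and `pstdev` is used because Prometheus's `stddev_over_time` is a population standard deviation.

```python
from statistics import mean, pstdev


def acceptable_range(rps_window, k=3.0):
    """Return (lower, upper) = average -/+ k * standard deviation of the window."""
    avg = mean(rps_window)
    std = pstdev(rps_window)  # population standard deviation
    return avg - k * std, avg + k * std


def is_abnormal(current_rps, rps_window, k=3.0):
    """True when the current RPS falls outside the acceptable range."""
    lower, upper = acceptable_range(rps_window, k)
    return current_rps < lower or current_rps > upper


# Hypothetical one-minute RPS samples for the last 10 minutes.
window = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
lower, upper = acceptable_range(window)

sudden_drop = is_abnormal(60, window)     # well below the lower bound
normal_value = is_abnormal(101, window)   # inside the band
```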
  43. Easy to handle requirement of LINE Pay: many connection points where external systems can be integrated

    Components: Grafana, Alertmanager, Pagerduty (SMS, Email, Phone Call), LINE Pay Members, Prometheus, Server (Node Exporter), Application (App metric, Log Appender), Log Collector, Flink, PushGateway. Each node can integrate an external system, for example logging, Slack, or Jira.
  44. Easy to handle requirement of LINE Pay: integration of alerts from the log monitoring system

    - Log Monitoring System (evaluates the alert): detects a log pattern and provides the number of matched logs, log body, log time, and application name
    - Callback API: triggers the alert
    - Alarm Tool Connector: receives the alert title, severity, and summary
  45. What’s next?

  46. What’s next? Next things to do for improvement

    - Different Server Usage Pattern
    - Different Pattern of API Request
    - Different Scale of Server Usage
  47. What’s next? Next things to do for improvement

    - Different Server Usage Pattern
    - Different Pattern of API Request
    - Different Scale of Server Usage
  48. Different Server Usage Pattern: general server which is used by users

  49. Different Server Usage Pattern Batch server which is triggered regularly

  50. Different Server Usage Pattern: admin server which is triggered irregularly

    Irregular heavy jobs
  51. Different Server Usage Pattern: categorize servers and make alert rules concrete

    The same alert rules currently cover general servers and admin or batch servers. Servers need to be categorized further, and alert rules need to be more concrete.
  52. Different Scale of Server Usage

    Next things to do for improvement:
    - Different Server Usage Pattern
    - Different Pattern of API Request
    - Different Scale of Server Usage
  53. Different Scale of Server Usage

    Within a group of Server01 to Server04, compare the CPU utilization of Server01 with the average CPU utilization of the servers except Server01. Alert when {One server} > n * {The average of other servers}.
  54. Different Scale of Server Usage (n = 2)

    - Average CPU utilization 25%, threshold 50%: catches a slow API
    - Average CPU utilization 5%, threshold 10%: produces false positive alerts
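The arithmetic behind these two cases can be checked with a small Python sketch. The rule and `n = 2` come from the slides; the individual server readings (12% and 32%) are hypothetical values chosen to show how a purely relative threshold behaves at different scales.

```python
def relative_threshold(avg_cpu_of_others, n=2.0):
    """Alert threshold used by the rule: n times the average of the other servers."""
    return n * avg_cpu_of_others


def fires(cpu, avg_cpu_of_others, n=2.0):
    """True when one server's CPU exceeds the relative threshold."""
    return cpu > relative_threshold(avg_cpu_of_others, n)


busy_threshold = relative_threshold(25.0)  # threshold for a group averaging 25%
idle_threshold = relative_threshold(5.0)   # threshold for a group averaging 5%

# Hypothetical readings: 12% CPU is harmless in absolute terms,
# yet it exceeds the threshold of a mostly idle group.
false_positive = fires(12.0, 5.0)
# The same absolute gap of 7 points above the average is ignored in a busy group.
ignored_in_busy_group = not fires(32.0, 25.0)
```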
  55. Different Pattern of API Request

    Next things to do for improvement:
    - Different Server Usage Pattern
    - Different Pattern of API Request
    - Different Scale of Server Usage
  56. Different Pattern of API Request

    Average and standard deviation work for frequently requested APIs, but they do not work for rarely or regularly requested APIs.
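One way to see why average and standard deviation break down for a rarely requested API is to compute the same ± 3σ band for a dense and a sparse series. The sketch below uses made-up per-minute counts. For the sparse series the lower bound falls below zero, so a complete stop in traffic can never leave the band.

```python
from statistics import mean, pstdev


def band(window, k=3.0):
    """Return (lower, upper) = average -/+ k * population standard deviation."""
    avg, std = mean(window), pstdev(window)
    return avg - k * std, avg + k * std


# Hypothetical per-minute request counts.
frequent = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
rare = [0, 0, 0, 5, 0, 0, 0, 0, 4, 0]

frequent_lower, frequent_upper = band(frequent)
rare_lower, rare_upper = band(rare)

# A complete stop (0 requests) is flagged only if the lower bound is above zero.
stop_detected_frequent = 0 < frequent_lower
stop_detected_rare = 0 < rare_lower
```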
  57. Different Pattern of API Request Ideal API that we expected

  58. Different Pattern of API Request Rarely requested API

  59. Different Pattern of API Request Regularly requested API

  60. What’s next? Next things to do for improvement

    - Different Server Usage Pattern
    - Different Pattern of API Request
    - Different Scale of Server Usage
  61. What’s next? Next things to do for improvement

    Reducing false positive alerts and making a reliable monitoring system
  62. Next, there are more interesting things!

  63. LINE Pay Anomaly Log Detection

    Sun Uk Kim, FDS (Fraud Detection System)
  64. Where Does This Need Come From?

    - Service Size ↑
    - False Positive ↑
    - Fatigue of Engineer ↑
    - True Positive Awareness ↓
    - Outage ↑

    LINE Pay Log Summary: API ~400, Return Code ~200, # of Error ~10M
  65. Where Does This Need Come From?

    LINE Pay Log Summary: API ~400, Return Code ~200, # of Error ~10M

    - False Positive ↓
    - Fatigue of Engineer ↓
    - True Positive Awareness ↑
    - Outage ↓
    - Goal: Log Forecast
  66. Hypothesis Formulation

    Outage report by category:
    - Request Log: 45%
    - Log Message: 35%
    - System Stats: 5%
    - Others: 15%

    - Hypothesis 1 (Request Log): Sudden Increase of Particular Requests
    - Hypothesis 2 (Log Message): Tangled Sequence of Logs
  67. Model Selection

    Criteria:
    - Unsupervised Learning
    - Robust to Retraining
    - Fast Implementation
    - Easy-to-Find Root Cause
    - Simple Approach

    Selected models:
    - Hypothesis 1: Gaussian Mixture Model
    - Hypothesis 2: LSTM with Workflow
  68. Gaussian Mixture Model (GMM) Abnormal Reference: https://www.askpython.com/python/normal-distribution

  69. Gaussian Mixture Model (GMM) Reference: https://github.com/rickiepark/handson-ml2
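As a rough illustration of the idea behind using a Gaussian Mixture Model for anomaly detection, the sketch below fits a two-component 1-D mixture to "normal" request counts with EM and flags values whose likelihood is lower than anything seen in training. The transcript does not show the talk's features, library, or thresholding, so the pure-Python EM, the sample counts, and the threshold rule are all assumptions.

```python
import math
from statistics import pvariance


def gaussian_pdf(x, mu, var):
    """Density of a 1-D normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, iterations=50, min_var=1e-3):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(init_means)
    means = list(init_means)
    variances = [pvariance(data)] * k  # start broad: variance of all data
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # avoid collapsing to zero variance
    return weights, means, variances


def log_likelihood(x, model):
    """Log of the mixture density at x, floored to avoid log(0)."""
    weights, means, variances = model
    density = sum(w * gaussian_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances))
    return math.log(max(density, 1e-300))


# Hypothetical per-minute request counts with a quiet mode and a busy mode.
counts = [18, 20, 22, 19, 21, 20, 98, 100, 102, 99, 101, 100]
model = fit_gmm_1d(counts, init_means=[min(counts), max(counts)])

# Assumed rule: anything less likely than the least likely training point is abnormal.
threshold = min(log_likelihood(x, model) for x in counts)


def is_abnormal(x):
    return log_likelihood(x, model) < threshold
```

With this data, a count of 60 sits between the two modes and is flagged, while 21 is close to the quiet mode and is not.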

  70. GMM Feature #1

  71. GMM Feature #1

  72. Challenge #1 (Gaussian Mixture Model)

    - Expensive Processing
    - Similar But Different
    - Unbalanced Data
    - Vulnerable to Retraining
  73. Solution #1: Change Features

  74. Solution #1 Result Gaussian Mixture Model

  75. Challenge #2: Batch

  76. Solution #2: Fill in the blank

    [Two charts: values from 0 to 140 over time slots t1 to t16] 90% Elimination of Batch Anomaly
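The transcript does not spell out what "fill in the blank" means. One plausible reading is that time slots with no log records are filled with a default value so that a batch job's periodic bursts form a regular series. Under that assumption, a minimal Python sketch looks like this; the function name, the fill value of 0, and the sample counts are all hypothetical.

```python
def fill_missing_slots(counts_by_slot, slots, fill_value=0):
    """Return one value per slot, inserting fill_value where no data was recorded."""
    return [counts_by_slot.get(slot, fill_value) for slot in slots]


slots = [f"t{i}" for i in range(1, 17)]  # t1 .. t16, as on the slide's axis

# Hypothetical batch API that only produces logs when the job runs.
observed = {"t4": 120, "t8": 118, "t12": 121, "t16": 119}
series = fill_missing_slots(observed, slots)
```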
  77. Challenge #3: Minorities (Gaussian Mixture Model)

    - Expected behavior: Light 1 → 100 (100%); Heavy 100 → 10,000 (100%)
    - Actual behavior: Light 1 → 100 (100%); Heavy 1 → 10,000 (10,000%)
  78. Solution #3: Minorities (Gaussian Mixture Model)

    - 3-1. Apply Penalty: [formula not recoverable from the extracted text; the visible structure is a per-item penalty factor that multiplies each value]
    - 3-2. Adding Features: windows of ~5 min, ~10 min, and ~30 min, plus weekly and monthly
  79. GMM Result Summary (Gaussian Mixture Model)

    - 10M Err → 500 ⇩
    - ~30 types of APIs
    - Post Processing: 500 → 150 ⇩
    - 15% of Outage ⊆ 150
  80. LSTM (Long Short-Term Memory)

  81. LSTM (Long Short-Term Memory)

  82. Architecture Gaussian Mixture Model Training

  83. Architecture Gaussian Mixture Model Training Detection

  84. Challenge Concurrency

  85. Solution: Concurrency vs. Multi-Task (labels: Concurrency, Multi-Task, Workflow Library)

  86. Solution: Concurrency vs. Multi-Task

  87. Result Discussion

    - Case I: Token Error
    - Case II: Service Maintenance
    - Case III: Macro Retries from Particular User
    - Case IV: Account Lock
    - Case V: Memory Errors
    - Case VI: Connection Failed
    - Case VII: New Message
  88. Thank you