Building LINE Pay Monitoring System and Anomaly Log Detection System Using ML

Sun Uk Terry Kim
LINE Biz Plus / Fraud Detection System / Server Engineer
Giseung Lee
LINE Biz Plus / Pay DevOps / Devops Engineer

https://linedevday.linecorp.com/2021/ja/sessions/94
https://linedevday.linecorp.com/2021/en/sessions/94
https://linedevday.linecorp.com/2021/ko/sessions/94

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Agenda
    - Part 1: Building LINE Pay Monitoring System
      - Problem of Existing Monitoring System
      - Architecture of New Monitoring System
      - Next Steps
    - Part 2: Abnormal Log Detection with ML
      - Finding Needs
      - Hypothesis & Model Selection
      - Model Architecture and Challenges
      - Result Discussion

  2. Build LINE Pay
    Monitoring System
    Gi Seung Lee
    Pay DevOps

  3. Existing Monitoring System

  4. Various Monitoring Tools
    Logging
    Infra Resource - KR/TH
    Infra Resource - Japan
    Business Statistics
    LINE Pay
    LINE Pay Member
    Having many separate monitoring tools causes problems

  5. Hard to grasp system
    LINE Pay members have to visit many systems
    LINE Pay Member
    Check CPU usage..
    Check log message or count..
    Check business stats..
    Check storage usage..
    Logging
    Infra Resource - KR/TH
    Infra Resource - Japan
    Business Statistics
    LINE Pay

  6. Hard to grasp system
    Monitoring tools that were used before
    - System Information
    - System Resource - Korea
    - System Resource - Japan
    - Application Resource
    - Log Monitoring
    - Business Statistics

  7. Hard to control alarm
    Each system has its own alert system
    Alert system Alert system Alert system Alert system
    LINE Pay Member
    Team01 Team02 Team03
    Logging
    Infra Resource - KR/TH
    Infra Resource - Japan
    Business Statistics
    LINE Pay
    Correct
    Not Sent
    Wrong Target

  8. Hard to control alarm
    LINE Pay members get alarms from services they were previously in charge of
    Service in charge of 2019 Service in charge of 2020 Service in charge of 2021
    Still getting alarms from services they are no longer in charge of :(

  9. Hard to control alarm
    Different grouping methods
    LINE Pay Member
    Team01 Team02
    LINE Pay Member
    Team01 Team02
    Monitoring System 01 Monitoring System 02
    “I don’t have group” “I have my own group”

  10. Hard to handle requirement of LINE Pay
    Common infra systems cannot be changed just for LINE Pay
    Business Statistics
    LINE Pay
    Logging
    Infra Resource - KR/TH
    Infra Resource - Japan
    Common Tools in LINE
    Specific requirements of LINE Pay
    For example, based on past incidents, we want the alerts below.
    - when the API call count rapidly decreases
    - when a specific API result type increases
    - when CPU usage rapidly increases
    Hard to handle requirements quickly

  11. Hard to handle requirement of LINE Pay
    Need to detect an abnormal server within one server group
    [Chart: values for server01 to server08 from 18:00 to 18:05, y-axis 0 to 10.5]

  12. Hard to handle requirement of LINE Pay
    Need to detect rapid changes in API count
    [Chart: RPS over time, x-axis 1 to 30, y-axis 0 to 105]

  13. Hard to handle requirement of LINE Pay
    There are metrics we want, but there is no way to handle them
    Common Tools in LINE
    Metrics
    Cannot handle them the way we want!

  14. Structure of
    New Monitoring System

  15. Structure of New Monitoring System
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender

  16. Structure of New Monitoring System
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Collecting Metrics

  17. Structure of New Monitoring System
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Reprocessing Metrics and Evaluating Alert

  18. Structure of New Monitoring System
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Managing Alarm

  19. Structure of New Monitoring System
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Visualization

  20. Collecting Metrics
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Application Resource
    - GC count, time
    - JVM Heap usage
    - Thread status
    - Business metrics
    System Resource
    - Server information
    - CPU usage
    - Memory usage
    - Storage usage

  21. Collecting Metrics
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Log
    - Message
    - API duration
    - Log level
    - Result Type
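
The log fields listed on this slide can be turned into countable metrics. The following is a minimal Python sketch, not the talk's implementation. It assumes a hypothetical tab-separated log line layout (`level`, `api`, `result_type`, `duration_ms`, `message`); the field order and the sample values are illustrative only.

```python
from collections import defaultdict

def parse_log_line(line):
    """Split one hypothetical tab-separated log line into named fields."""
    level, api, result_type, duration_ms, message = line.rstrip("\n").split("\t", 4)
    return {
        "level": level,
        "api": api,
        "result_type": result_type,
        "duration_ms": float(duration_ms),
        "message": message,
    }

def aggregate(lines):
    """Count logs per (api, result_type, level) and sum API durations per api."""
    counts = defaultdict(int)
    duration_sum = defaultdict(float)
    for line in lines:
        record = parse_log_line(line)
        counts[(record["api"], record["result_type"], record["level"])] += 1
        duration_sum[record["api"]] += record["duration_ms"]
    return dict(counts), dict(duration_sum)

# Hypothetical sample lines; real LINE Pay logs are not shown in the talk.
sample = [
    "INFO\t/payments\tSUCCESS\t12.5\tpayment accepted",
    "ERROR\t/payments\tTIMEOUT\t3000\tupstream timed out",
    "INFO\t/payments\tSUCCESS\t10.5\tpayment accepted",
]
counts, duration_sum = aggregate(sample)
print(counts[("/payments", "SUCCESS", "INFO")])  # 2
print(duration_sum["/payments"])                 # 3023.0
```

Counters keyed by API, result type, and log level are the kind of series that rules such as "a specific result type of an API increases" can later be evaluated against.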

  22. Reprocessing Metrics and Making Alert
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    - Stream processing for various purposes
    - Make more meaningful information
    Reprocess Raw Metric

  23. Reprocessing Metrics and Making Alert
    More meaningful information
    Raw metrics
    Flink
    Job1: Calculate Statistics
    Job2: Detect Abnormal

  24. Reprocessing Metrics and Making Alert
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Temporary Repository
    - For components where pull-based collection is hard to apply
    Alert Evaluation
    - Create flexible alert rules using PromQL
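
For components that cannot be scraped, metrics are pushed to the Pushgateway and Prometheus pulls them from there. The sketch below uses only the Python standard library to show what such a push looks like on the wire: a sample in the Prometheus text exposition format, sent with HTTP PUT to `/metrics/job/<job>`. The gateway host, job name, and metric name are hypothetical, and the request is built but not sent.

```python
import urllib.request

def format_gauge(name, value, labels=None):
    """Render one gauge sample in the Prometheus text exposition format.

    Label values are not escaped here, so this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    label_text = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_text = "{" + pairs + "}"
    return f"# TYPE {name} gauge\n{name}{label_text} {value}\n"

def build_push_request(gateway_url, job, body):
    """Build (but do not send) an HTTP PUT that replaces the metrics of one job."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")

# Hypothetical metric, job, and gateway address.
body = format_gauge(
    "batch_last_success_timestamp_seconds", 1636502400, {"batch": "settlement"}
)
request = build_push_request("http://pushgateway.example:9091", "pay_batch", body)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would perform the push against a real gateway.
```

In practice a Prometheus client library would normally handle the formatting and the push; the hand-rolled version is shown only to make the mechanism visible.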

  25. Managing Alarm
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Managing Alert
    - Group alerts by label
    - Determine where to send alerts
    Sending Alarms
    - Easy to control alarms
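
Grouping alerts by label and choosing a destination can be illustrated with a small Python sketch. This is not Alertmanager's implementation. It only mimics the idea, and the routing table, label names, and receiver names are hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: value of the "service" label -> receiving team.
ROUTES = {"service01": "team01", "service02": "team02"}
DEFAULT_RECEIVER = "oncall"  # hypothetical fallback receiver

def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the given label names."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)

def pick_receiver(alert):
    """Choose where to send an alert based on its service label."""
    return ROUTES.get(alert["labels"].get("service"), DEFAULT_RECEIVER)

alerts = [
    {"labels": {"alertname": "HighCpu", "service": "service01", "instance": "server01"}},
    {"labels": {"alertname": "HighCpu", "service": "service01", "instance": "server02"}},
    {"labels": {"alertname": "ApiDrop", "service": "service03", "instance": "server09"}},
]
grouped = group_alerts(alerts, ["alertname", "service"])
for key, members in grouped.items():
    # One notification per group instead of one per firing alert.
    print(key, "->", pick_receiver(members[0]), f"({len(members)} alerts)")
```

Because the grouping key and the route both come from labels, changing who receives an alarm becomes a change in one place rather than in every monitoring tool.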

  26. Managing Alarm
    Metric sources
    Alert Evaluation
    Service
    Team
    Members
    Servers and Applications Servers and Applications
    Monitoring System01 LINE Pay Monitoring System
    Service01 Service02 Service03
    Custom Groups Team01 Team02 Team03
    Monitoring System02
    Alertmanager
    Alarm Tool
    Alarm Tool

  27. Visualization
    Prometheus
    Grafana
    Alertmanager Alarm Tool
    SMS Email Phone Call
    LINE Pay Members
    Log Collector Flink
    Pushgateway
    System Exporter
    System
    Application
    Exporter Log Appender
    Visualizing Metric
    - Application Overview
    - API Overview
    - Business statistics

  28. Visualization
    Application Overview API Statistics Business Statistics
    + Any dashboard LINE Pay wants

  29. What changed?

  30. Easy to grasp system
    The existing setup, in which the overall system was hard to grasp
    LINE Pay Member
    Check CPU usage..
    Check business stats..
    Check log message or count..
    Check storage usage..
    Business Statistics
    Infra Resource - KR/TH
    Infra Resource - Japan
    Logging
    LINE Pay

  31. Easy to grasp system
    Integrating the existing systems
    LINE Pay Monitoring System
    Logging
    LINE Pay
    Check log message
    Check CPU usage
    Check JVM usage
    Check API count
    Check business stats
    LINE Pay Member

  32. Easy to control alarm
    The existing setup, in which alarms were hard to control
    Alert system Alert system Alert system Alert system
    LINE Pay Member
    Team01 Team02 Team03
    Logging
    Infra Resource - KR/TH
    Infra Resource - Japan
    Business Statistics
    LINE Pay
    Correct
    Not Sent
    Wrong Target

  33. Easy to control alarm
    Control alarms in one system
    Team01 Team02 Team03
    Oncall Oncall Lead
    Service01 Service03
    Service02
    Alarm Tool
    Alertmanager
    Prometheus External System
    Connector
    LINE Pay Members
    The place to control alarm
    Oncall
    Escalation

  34. Easy to handle requirement of LINE Pay
    Most of the existing monitoring tools are common infrastructure of LINE
    Business Statistics
    LINE Pay
    Logging
    Infra Resource - KR/TH
    Infra Resource - Japan
    Common Tools in LINE
    Specific requirements of LINE Pay
    For example, based on past incidents, we want the alerts below.
    - when the API call count rapidly decreases
    - when a specific API result type increases
    - when CPU usage rapidly increases
    Hard to handle requirements quickly

  35. Easy to handle requirement of LINE Pay
    Now, LINE Pay can handle the metrics directly
    LINE Pay
    LINE Pay Monitoring System
    Server resource metrics
    Application resource metrics
    API statistic metrics
    Business statistic metrics
    Specific requirements of LINE Pay
    For example, based on past incidents, we want the alerts below.
    - when the API call count rapidly decreases
    - when a specific API result type increases
    - when CPU usage rapidly increases
    Easy to handle requirements quickly

  36. Easy to handle requirement of LINE Pay
    Need to detect an abnormal server within the same server group
    [Chart: values for server01 to server08 from 18:00 to 18:05, y-axis 0 to 10.5]

  37. Easy to handle requirement of LINE Pay
    How to detect an abnormal server within the same server group
    Server01 Server02 Server03 Server04
    CPU utilization of Server01 vs. the average CPU utilization of the servers other than Server01
    Alert
    Alert Rule - compare metrics
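
The comparison rule on this slide can be sketched in Python as follows. In the talk the rule is expressed as a Prometheus alert rule; this sketch only restates the arithmetic. The multiplier `n` follows the form `{One server} > n * {The average of other servers}` shown later in the deck, and the CPU values are made up for illustration.

```python
def find_outlier_servers(cpu_by_server, n=2.0):
    """Return servers whose CPU utilization exceeds n times the average of the others."""
    outliers = []
    for name, value in cpu_by_server.items():
        others = [v for other, v in cpu_by_server.items() if other != name]
        if not others:
            continue  # a group with a single server has nothing to compare against
        if value > n * (sum(others) / len(others)):
            outliers.append(name)
    return outliers

# Hypothetical CPU utilization (%) for one server group.
cpu = {"server01": 80.0, "server02": 21.0, "server03": 19.0, "server04": 20.0}
print(find_outlier_servers(cpu))  # ['server01']
```

Excluding the server under test from the average keeps a single busy server from raising its own threshold.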

  38. Easy to handle requirement of LINE Pay
    LINE Pay can detect situations like this

  39. Easy to handle requirement of LINE Pay
    Need to detect a rapid decrease in API requests
    [Chart: RPS over time, x-axis 1 to 30, y-axis 0 to 105]

  40. Easy to handle requirement of LINE Pay
    PromQL provides many useful functions
    - record: request:rps:1m
      expr: {PromQL for requests per second}
    - record: request:rps:1m:avg:10m
      expr: avg_over_time(request:rps:1m[10m])
    - record: request:rps:1m:stddev:10m
      expr: stddev_over_time(request:rps:1m[10m])
    Acceptable range:
    request:rps:1m:avg:10m ± 3 * request:rps:1m:stddev:10m
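
The acceptable range above is the 10-minute average plus or minus three standard deviations. A minimal Python sketch of the same arithmetic follows, assuming one RPS sample per minute. The sample values are hypothetical, and the population standard deviation is used to mirror `stddev_over_time`.

```python
import statistics

def acceptable_range(rps_window, k=3.0):
    """Return (low, high) = average -/+ k * population standard deviation."""
    avg = statistics.mean(rps_window)
    std = statistics.pstdev(rps_window)
    return avg - k * std, avg + k * std

def is_abnormal(current_rps, rps_window, k=3.0):
    """True when the current RPS falls outside the acceptable range."""
    low, high = acceptable_range(rps_window, k)
    return current_rps < low or current_rps > high

# Hypothetical RPS samples for the last 10 minutes.
window = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
low, high = acceptable_range(window)
print(round(low, 2), round(high, 2))
print(is_abnormal(40, window))   # a sudden drop in requests
print(is_abnormal(101, window))  # ordinary fluctuation
```

A band derived from recent history adapts to each API's own traffic level, which a fixed RPS threshold cannot do.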

  41. Easy to handle requirement of LINE Pay
    Permissible range using standard deviation
    Average + 3*Standard Deviation
    Average - 3*Standard Deviation
    Current RPS

  42. Easy to handle requirement of LINE Pay
    Grafana
    Alertmanager Pagerduty
    SMS Email Phone Call
    LINE Pay Members
    Prometheus
    Node Exporter
    Server
    Application
    App metric Log Appender
    Log Collector Flink
    PushGateway
    Each node can be integrated with external systems
    External System
    ex) logging, Slack, Jira..
    Many connection points where external systems can be integrated

  43. Easy to handle requirement of LINE Pay
    Integrating alerts from the log monitoring system
    Log Monitoring System
    - Evaluate Alert
    Alarm Tool
    Connector
    - Detect log pattern
    - The number of matched logs
    - Log body
    - Log time
    - Application name
    :
    - Alert title
    - Severity
    - Summary
    :
    Callback API Trigger Alert
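
The Connector receives a callback from the log monitoring system and turns it into an alert for the alarm tool. A minimal Python sketch of that mapping follows. The payload keys, the severity cut-off, and the output shape are all hypothetical, since the talk does not show the actual callback or alarm tool schemas.

```python
def callback_to_alert(payload):
    """Map a hypothetical log-monitoring callback payload to an alert dict."""
    matched = payload["matched_count"]
    # Hypothetical cut-off: many matching logs are treated as more severe.
    severity = "critical" if matched >= 100 else "warning"
    return {
        "title": f"[{payload['application']}] log pattern '{payload['pattern']}' matched",
        "severity": severity,
        "summary": (
            f"{matched} matching logs, latest at {payload['log_time']}: "
            f"{payload['log_body'][:80]}"
        ),
    }

# Hypothetical callback body.
payload = {
    "application": "pay-api",
    "pattern": "OutOfMemoryError",
    "matched_count": 12,
    "log_time": "2021-11-10T09:00:00+09:00",
    "log_body": "java.lang.OutOfMemoryError: Java heap space",
}
alert = callback_to_alert(payload)
print(alert["title"])
print(alert["severity"])
```

Keeping the translation in one small connector lets log-based alerts reach the alarm tool in the same form as alerts coming from metrics.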

  44. What’s next?

  45. What’s next?
    Next things to do for improvement
    Different Server
    Usage Pattern
    Different Pattern of
    API Request
    Different Scale of
    Server Usage

  46. What’s next?
    Next things to do for improvement
    Different Server
    Usage Pattern
    Different Pattern of
    API Request
    Different Scale of
    Server Usage

  47. Different Server Usage Pattern
    General server which is used by users

  48. Different Server Usage Pattern
    Batch server which is triggered regularly

  49. Different Server Usage Pattern
    Admin server which is triggered irregularly
    Irregular Heavy job

  50. Different Server Usage Pattern
    Alert Rules
    General Servers Admin, Batch Server
    Servers need finer categorization
    Alert rules need to be more specific
    Categorize servers and make alert rules more specific

  51. Different Scale of Server Usage
    Next things to do for improvement
    Different Server
    Usage Pattern
    Different Pattern of
    API Request
    Different Scale of
    Server Usage

  52. Different Scale of Server Usage
    Server01 Server02 Server03 Server04
    CPU utilization of Server01 vs. the average CPU utilization of the servers other than Server01
    Alert
    {One server} > n * {The average of other servers}

  53. Different Scale of Server Usage
    n = 2
    Average of CPU Utilization: 25% / Threshold: 50% / Catches slow API
    Average of CPU Utilization: 5% / Threshold: 10% / Makes false positive alerts
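
The two cases on this slide follow directly from the relative rule with n = 2. A short Python sketch of the arithmetic is below; the 12% reading is a hypothetical value chosen to show why a mostly idle group produces false positives.

```python
def relative_threshold(avg_of_others, n=2.0):
    """Threshold used by the rule: one server > n * average of the other servers."""
    return n * avg_of_others

for avg in (25.0, 5.0):
    threshold = relative_threshold(avg)
    # The absolute headroom shrinks together with the group's average load.
    print(f"average {avg}% -> threshold {threshold}% (headroom {threshold - avg} points)")

# Hypothetical reading: a server at 12% CPU is harmless in absolute terms,
# yet it crosses the 10% threshold of a group that averages 5%.
reading = 12.0
print(reading > relative_threshold(5.0))   # True: alert fires on the idle group
print(reading > relative_threshold(25.0))  # False: no alert on the busier group
```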

  54. Different Pattern of API Request
    Next things to do for improvement
    Different Server
    Usage Pattern
    Different Pattern of
    API Request
    Different Scale of
    Server Usage

  55. Different Pattern of API Request
    Frequently Requested API: Average, Standard Deviation
    Rarely or Regularly Requested API: Does not work..

  56. Different Pattern of API Request
    Ideal API that we expected

  57. Different Pattern of API Request
    Rarely requested API

  58. Different Pattern of API Request
    Regularly requested API

  59. What’s next?
    Next things to do for improvement
    Different Server
    Usage Pattern
    Different Pattern of
    API Request
    Different Scale of
    Server Usage

  60. What’s next?
    Next things to do for improvement
    Reducing False Positive Alert &
    Making Reliable Monitoring System

  61. Next, there are more interesting things!

  62. LINE Pay Anomaly Log Detection
    Sun Uk Kim
    FDS (Fraud Detection System)

  63. Where Did This Need Come From?
    - Service Size ↑
    - False Positive ↑
    - Fatigue of Engineer ↑
    - True Positive Awareness ↓
    - Outage ↑
    LINE Pay Log Summary
    API ~400
    Return Code ~200
    # of Error ~10M

  64. Where Did This Need Come From?
    LINE Pay Log Summary
    API ~400
    Return Code ~200
    # of Error ~10M
    - False Positive ↓
    - Fatigue of Engineer ↓
    - True Positive Awareness ↑
    - Outage ↓
    - Goal: Log Forecast

  65. Hypothesis Formulation
    Hypothesis 1: Sudden Increase of Particular Requests
    Hypothesis 2: Tangled Sequence of Logs
    Outage Report
    Category Rate
    Request Log 45%
    Log Message 35%
    System Stats 5%
    Others 15%
    Request Log
    Log Message

  66. Model Selection
    - Unsupervised Learning
    - Robust to Retraining
    - Fast Implementation
    - Easy-to-Find Root Cause
    - Simple Approach
    Hypothesis 1: Gaussian Mixture Model
    Hypothesis 2: LSTM with Workflow

  67. Gaussian Mixture Model (GMM)
    Abnormal
    Reference: https://www.askpython.com/python/normal-distribution

  68. Gaussian Mixture Model (GMM)
    Reference: https://github.com/rickiepark/handson-ml2
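
To make the GMM idea concrete, here is a minimal pure-Python sketch of anomaly detection with a one-dimensional Gaussian mixture fitted by EM. It is not the model from the talk: the features, the number of components, the training data, and the threshold margin are all assumptions chosen for illustration. A point is flagged when its log-likelihood under the fitted mixture falls below a threshold taken from the training data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(data, k=2, iterations=30, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    ordered = sorted(data)
    chunk = len(ordered) // k
    weights, means, variances = [], [], []
    # Initialise each component from one contiguous slice of the sorted data.
    for j in range(k):
        part = ordered[j * chunk:] if j == k - 1 else ordered[j * chunk:(j + 1) * chunk]
        mean = sum(part) / len(part)
        weights.append(len(part) / len(ordered))
        means.append(mean)
        variances.append(max(sum((x - mean) ** 2 for x in part) / len(part), min_var))
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total if total > 0 else 1.0 / k for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, min_var)
    return weights, means, variances

def log_likelihood(x, model):
    """Log density of x under the mixture, floored to avoid log(0)."""
    weights, means, variances = model
    density = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return math.log(max(density, 1e-300))

# Hypothetical per-minute request counts: a quiet regime near 10, a busy one near 100.
train = [8, 9, 10, 11, 12, 9, 10, 11, 95, 100, 105, 98, 102, 99, 101, 100]
model = fit_gmm_1d(train, k=2)
# Hypothetical threshold: slightly below the least likely training point.
threshold = min(log_likelihood(x, model) for x in train) - 1.0

def is_anomaly(x):
    return log_likelihood(x, model) < threshold

print([round(m, 1) for m in model[1]])
print(is_anomaly(10), is_anomaly(100), is_anomaly(55), is_anomaly(500))
```

Values near either learned regime score as normal, while values between or far beyond the regimes receive a very low likelihood, which is the "Abnormal" region in the distribution picture.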

  69. GMM Feature #1

  70. GMM Feature #1

  71. Challenge #1
    Gaussian Mixture Model
    - Expensive Processing
    - Similar But Different
    - Unbalanced Data
    - Vulnerable to Retraining

  72. Solution #1: Change Features

  73. Solution #1 Result
    Gaussian Mixture Model

  74. Challenge #2: Batch

  75. Solution #2: Fill in the blank
    [Charts: two time series over t1 to t16, y-axis 0 to 140]
    90% Elimination of Batch Anomaly
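
The slide does not state how the blanks are filled. As one possible reading, the Python sketch below forward-fills empty intervals with the most recent non-empty count, so that a periodic batch no longer looks like a series of sudden spikes. The fill strategy and the sample counts are assumptions, not the talk's confirmed method.

```python
def fill_blanks(counts):
    """Replace empty intervals (None or 0) with the most recent non-empty count."""
    filled = []
    last_seen = 0
    for value in counts:
        if value:
            last_seen = value
        filled.append(last_seen)
    return filled

# Hypothetical log counts for a batch that runs every fourth interval.
batch_counts = [0, 0, 120, 0, 0, 0, 118, 0, 0, 0, 121, 0]
print(fill_blanks(batch_counts))
# [0, 0, 120, 120, 120, 120, 118, 118, 118, 118, 121, 121]
```

After filling, consecutive values differ by only a few counts, so a detector that looks for sudden increases sees a nearly flat series for the batch.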

  76. Challenge #3 Minorities
    Gaussian Mixture Model
    Expected Behavior
    Light 1 → 100 100%
    Heavy 100 → 10,000 100%
    Actual Behavior
    Light 1 → 100 100%
    Heavy 1 → 10,000 10,000%

  77. Solution #3 Minorities
    Gaussian Mixture Model
    3-1. Apply Penalty
    [Formula garbled in extraction. Recoverable structure: a per-item penalty factor is computed, and each value is multiplied by its penalty factor.]
    3-2. Adding Features
    Windows: ~5 min, ~10 min, ~30 min
    + weekly, monthly
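
The added features cover several time windows. A minimal Python sketch of building such features is shown below, assuming the input is a list of per-minute counts and that each feature is a trailing sum; both are assumptions, and the weekly and monthly features are left out.

```python
def window_features(per_minute_counts, windows=(5, 10, 30)):
    """Sum the most recent N minutes of counts for each window length N."""
    return {f"last_{w}m": sum(per_minute_counts[-w:]) for w in windows}

# Hypothetical 30 minutes of counts with a burst in the last 5 minutes.
counts = [2] * 25 + [10] * 5
print(window_features(counts))
# {'last_5m': 50, 'last_10m': 60, 'last_30m': 100}
```

Comparing the short and long windows separates a brief burst from a sustained change in volume.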

  78. GMM Result Summary
    Gaussian Mixture Model
    - 10M Err → 500 ⇩
    - ~30 types of APIs
    - Post Processing: 500 → 150 ⇩
    - 15% of Outage ⊆ 150

  79. LSTM (Long Short-Term Memory)

  80. LSTM (Long Short-Term Memory)

  81. Architecture
    Gaussian Mixture Model
    Training

  82. Architecture
    Gaussian Mixture Model
    Training
    Detection

  83. Challenge
    Concurrency

  84. Solution: Concurrency vs. Multi-Task
    Concurrency Multi-Task
    Workflow Library

  85. Solution: Concurrency vs. Multi-Task

  86. Result Discussion
    - Case I: Token Error
    - Case II: Service Maintenance
    - Case III: Macro Retries from Particular User
    - Case IV: Account Lock
    - Case V: Memory Errors
    - Case VI: Connection Failed
    - Case VII: New Message
