Building LINE Pay Monitoring System and Anomaly Log Detection System Using ML

Agenda
- Part 1: Building LINE Pay Monitoring System
  - Problem of Existing Monitoring System
  - Architecture of New Monitoring System
  - Next Steps
- Part 2: Abnormal Log Detection with ML
  - Finding Needs
  - Hypothesis & Model Selection
  - Model Architecture and Challenges
  - Result Discussion

Build LINE Pay Monitoring System
Gi Seung Lee, Pay DevOps

Existing Monitoring System

Various Monitoring Tools
Tools used by LINE Pay members: Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics
Having various monitoring tools causes some problems

Hard to grasp system
LINE Pay members have to visit many systems (Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics):
- Check CPU usage..
- Check storage usage..
- Check log message or count..
- Check business stats..

Hard to grasp system
Monitoring tools which were used before
- System Information
- System Resource - Korea
- System Resource - Japan
- Application Resource
- Log Monitoring
- Business Statistics

Hard to control alarm
Each system has its own alert system
- Sources: Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics
- Recipients: LINE Pay members in Team01, Team02, Team03
- Outcomes: some alarms are correct, some are not sent, some go to the wrong target

Hard to control alarm
LINE Pay members get alarms from services they were previously in charge of
- Service in charge of in 2019
- Service in charge of in 2020
- Service in charge of in 2021
Still getting alarms from services they are no longer in charge of :(

Hard to control alarm
Different grouping methods for LINE Pay members (Team01, Team02)
- Monitoring System 01: “I don’t have groups”
- Monitoring System 02: “I have my own groups”

Hard to handle requirement of LINE Pay
Common infra systems cannot be changed just for LINE Pay
- Common tools in LINE: Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics
- Specific requirements of LINE Pay. For example, judging from past incidents, we want alerts:
  - when API call count rapidly decreases
  - when a specific result type of an API increases
  - when CPU usage rapidly increases
Hard to handle requirements quickly

Hard to handle requirement of LINE Pay
Need for detecting weird server in one server group
[Chart: per-server values for server01 to server08, 18:00 to 18:05]

Hard to handle requirement of LINE Pay
Need for detecting rapid change of API count
[Chart: RPS over time]

Hard to handle requirement of LINE Pay
There are metrics we want, but there is no way to handle them
Common tools in LINE cannot handle metrics the way we want!

Structure of New Monitoring System

Structure of New Monitoring System
Components: System, Application, Exporter, Log Appender, Log Collector, Flink, Pushgateway, Prometheus, Grafana, Alertmanager, Alarm Tool (SMS, Email, Phone Call), LINE Pay Members
Stages:
- Collecting Metrics
- Reprocessing Metrics and Evaluating Alert
- Managing Alarm
- Visualization

Collecting Metrics
Application Resource
- GC count, time
- JVM Heap usage
- Thread status
- Business metrics
System Resource
- Server information
- CPU usage
- Memory usage
- Storage usage

Collecting Metrics
Log
- Message
- API duration
- Log level
- Result Type

Reprocessing Metrics and Making Alert
Reprocess Raw Metric
- Stream processing for various purposes
- Make more meaningful information

Reprocessing Metrics and Making Alert
Raw metrics → Flink → More meaningful information
- Job1: Calculate Statistics
- Job2: Detect Abnormal

Reprocessing Metrics and Making Alert
Temporal Repository (Pushgateway)
- For components where pull-based collection is hard to apply
Alert Evaluation (Prometheus)
- Create flexible alert rules using PromQL
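A minimal sketch of what a push to the Pushgateway could look like for a component that cannot be scraped. It only builds the request URL and the text-format body and does not send anything. The host name, job name, label, and metric name are hypothetical; the `/metrics/job/<job>/<label>/<value>` path and the text exposition format follow the public Pushgateway and Prometheus documentation. Label values containing `/` need special encoding that is not handled here.

```python
from urllib.parse import quote


def pushgateway_url(base_url: str, job: str, grouping: dict) -> str:
    """Build the push path /metrics/job/<job>/<label>/<value>/..."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in sorted(grouping.items()):
        parts += [name, quote(value, safe="")]
    return "/".join(parts)


def exposition_body(name: str, value: float, help_text: str) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


# Hypothetical names, for illustration only.
url = pushgateway_url(
    "http://pushgateway.example:9091", "flink_api_stats", {"instance": "job1"}
)
body = exposition_body(
    "api_abnormal_score", 0.5, "Hypothetical score produced by a Flink job."
)
print(url)
print(body, end="")
```

The body would be sent with an HTTP POST or PUT to the URL; Prometheus then scrapes the Pushgateway like any other target.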

Managing Alarm
Managing Alert (Alertmanager)
- Group alerts by label
- Determine where to send alerts
Sending Alarms (Alarm Tool)
- Easy to control alarms
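The idea of grouping alerts by label and deciding where to send them can be sketched in a few lines. This is an illustration only, not Alertmanager's implementation: the `service` label, the owner table, and the receiver names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical routing table: service label -> team currently in charge.
SERVICE_OWNERS = {"service01": "team01", "service02": "team02"}
DEFAULT_RECEIVER = "oncall"  # assumed fallback when no owner is known


def route_alerts(alerts: list) -> dict:
    """Group alerts by receiver, chosen from each alert's `service` label."""
    grouped = defaultdict(list)
    for alert in alerts:
        service = alert.get("labels", {}).get("service")
        receiver = SERVICE_OWNERS.get(service, DEFAULT_RECEIVER)
        grouped[receiver].append(alert)
    return dict(grouped)


alerts = [
    {"labels": {"alertname": "HighCpu", "service": "service01"}},
    {"labels": {"alertname": "ApiDrop", "service": "service02"}},
    {"labels": {"alertname": "DiskFull", "service": "service99"}},
]
routed = route_alerts(alerts)
for receiver, items in sorted(routed.items()):
    print(receiver, [a["labels"]["alertname"] for a in items])
```

Because the target is derived from labels and a single owner table, changing who is in charge of a service means editing one entry rather than every monitoring tool.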

Managing Alarm Metric sources Alert Evaluation Service Team Members Servers
and Applications Servers and Applications Monitoring System01 LINE Pay Monitoring System Service01 Service02 Service03 Custom Groups Team01 Team02 Team03 Monitoring System02 Alertmanager Alarm Tool Alarm Tool

Visualization
Visualizing Metric (Grafana)
- Application Overview
- API Overview
- Business statistics

Visualization
Dashboards: Application Overview, API Statistics, Business Statistics, plus any dashboard LINE Pay wants

What changed?

Easy to grasp system
The existing system, which was hard to grasp: LINE Pay members visit Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics separately to check CPU usage, storage usage, log messages or counts, and business stats

Easy to grasp system
Integrating existing systems (including Logging) into the LINE Pay Monitoring System, where LINE Pay members can:
- Check log messages
- Check CPU usage
- Check JVM usage
- Check API count
- Check business stats

Easy to control alarm
The existing system, which was hard to control alarms in: Logging, Infra Resource - KR/TH, Infra Resource - Japan, and Business Statistics each had their own alert system, so alarms to LINE Pay members in Team01, Team02, and Team03 were sometimes correct, sometimes not sent, and sometimes sent to the wrong target

Easy to control alarm
Control alarms in one system
- Components: Prometheus, External System Connector, Alertmanager, Alarm Tool, LINE Pay Members
- Alarms are organized by service (Service01, Service02, Service03) and team (Team01, Team02, Team03)
- One place to control alarms
- Oncall escalation: Oncall, Oncall Lead

Easy to handle requirement of LINE Pay
Most of the existing monitoring tools are common infrastructure of LINE
- Common tools in LINE: Logging, Infra Resource - KR/TH, Infra Resource - Japan, Business Statistics
- Specific requirements of LINE Pay. For example, judging from past incidents, we want alerts:
  - when API call count rapidly decreases
  - when a specific result type of an API increases
  - when CPU usage rapidly increases
Hard to handle requirements quickly

Easy to handle requirement of LINE Pay
Now, LINE Pay can handle metrics in the LINE Pay Monitoring System
- Server resource metrics
- Application resource metrics
- API statistic metrics
- Business statistic metrics
Specific requirements of LINE Pay. For example, judging from past incidents, we want alerts:
- when API call count rapidly decreases
- when a specific result type of an API increases
- when CPU usage rapidly increases
Easy to handle requirements quickly

Easy to handle requirement of LINE Pay
Need for detecting weird server in same server group
[Chart: per-server values for server01 to server08, 18:00 to 18:05]

Easy to handle requirement of LINE Pay
The way to detect weird server in same server group
- Alert rule: compare metrics
- Compare the CPU utilization of Server01 with the average CPU utilization of the other servers (Server02, Server03, Server04)
- Raise an alert based on that comparison
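The comparison rule can be sketched as follows: each server is compared with the average of the other servers in its group and flagged when it exceeds that average by a factor `n`. The sample values and the default `n = 2` are illustrative; in the monitoring system itself the rule would be written in PromQL rather than Python.

```python
from statistics import mean


def weird_servers(cpu_by_server: dict, n: float = 2.0) -> list:
    """Return servers whose CPU exceeds n times the average of the others."""
    flagged = []
    for name, value in cpu_by_server.items():
        others = [v for other, v in cpu_by_server.items() if other != name]
        if others and value > n * mean(others):
            flagged.append(name)
    return flagged


# Illustrative CPU utilization values (percent) for one server group.
cpu = {"server01": 9.0, "server02": 3.0, "server03": 3.5, "server04": 2.5}
flagged = weird_servers(cpu)
print(flagged)
```

Excluding the server under test from the average keeps a single misbehaving server from raising its own threshold.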

Easy to handle requirement of LINE Pay LINE Pay can
detect situation like this

Easy to handle requirement of LINE Pay
Need for detecting rapid decreasing of API
[Chart: RPS over time]

Easy to handle requirement of LINE Pay
PromQL provides many useful functions
- record: request:rps:1m
  expr: {PromQL for request per seconds}
- record: request:rps:1m:avg:10m
  expr: avg_over_time(request:rps:1m[10m])
- record: request:rps:1m:stddev:10m
  expr: stddev_over_time(request:rps:1m[10m])
Acceptable range: request:rps:1m:avg:10m ± 3 * request:rps:1m:stddev:10m

Easy to handle requirement of LINE Pay
Permissible range using standard deviation
[Chart: Current RPS plotted against Average + 3*Standard Deviation and Average - 3*Standard Deviation]
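The permissible range can be reproduced offline to sanity-check a rule before deploying it. The sketch below assumes a list of recent RPS samples (made-up values) and uses the population standard deviation, which to my understanding is what `stddev_over_time` computes.

```python
from statistics import mean, pstdev


def acceptable_range(rps_history: list, k: float = 3.0) -> tuple:
    """Return (average - k*stddev, average + k*stddev) for the window."""
    avg = mean(rps_history)
    std = pstdev(rps_history)
    return avg - k * std, avg + k * std


def is_abnormal(current_rps: float, rps_history: list, k: float = 3.0) -> bool:
    """True when the current RPS falls outside the acceptable range."""
    low, high = acceptable_range(rps_history, k)
    return current_rps < low or current_rps > high


# Made-up RPS samples for a 10 minute window.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0]
low, high = acceptable_range(history)
print(round(low, 2), round(high, 2))
print(is_abnormal(60.0, history), is_abnormal(101.0, history))
```

A sudden drop to 60 RPS falls below the lower bound and is flagged, while ordinary fluctuation around 100 RPS stays inside the band.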

Easy to handle requirement of LINE Pay
Many connection points where external systems can be integrated
- Components: Server (Node Exporter), Application (App metric, Log Appender), Log Collector, Flink, PushGateway, Prometheus, Grafana, Alertmanager, Pagerduty (SMS, Email, Phone Call), LINE Pay Members
- Each node can integrate an external system, e.g. logging, Slack, Jira

Easy to handle requirement of LINE Pay
Integration of alerts from the log monitoring system
- Log Monitoring System: detects log pattern, evaluates alert
- Callback API (Log Monitoring System to Connector): the number of matched logs, log body, log time, application name
- Trigger Alert (Connector to Alarm Tool): alert title, severity, summary
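A connector of this kind could be sketched as a function that maps the callback fields (matched log count, log body, log time, application name) to the alert fields (title, severity, summary). The payload keys, the threshold, and the severity levels below are assumptions made for illustration; the actual callback schema is not given in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    title: str
    severity: str
    summary: str


def alert_from_callback(payload: dict, threshold: int = 100) -> Optional[Alert]:
    """Turn a log-monitoring callback into an alert, or None below threshold."""
    count = payload["matched_count"]
    if count < threshold:
        return None
    # Assumed rule: ten times the threshold escalates to "critical".
    severity = "critical" if count >= 10 * threshold else "warning"
    return Alert(
        title=f"[{payload['application']}] log pattern matched",
        severity=severity,
        summary=f"{count} logs matched at {payload['log_time']}: {payload['log_body']}",
    )


# Hypothetical callback payload.
payload = {
    "matched_count": 250,
    "log_body": "connection failed",
    "log_time": "2021-11-10T18:03:00+09:00",
    "application": "pay-api",
}
alert = alert_from_callback(payload)
print(alert)
```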

What’s next?

What’s next?
Next things to do for improvement
- Different Server Usage Pattern
- Different Pattern of API Request
- Different Scale of Server Usage

Different Server Usage Pattern General server which is used by
users

Different Server Usage Pattern Batch server which is triggered regularly

Different Server Usage Pattern Admin server which is triggered irregularly
Irregular Heavy job

Different Server Usage Pattern
Categorize servers and make alert rules concrete
- Servers need to be categorized further (General Servers; Admin, Batch Servers)
- Alert rules need to be more concrete

Different Scale of Server Usage Next things to do for
improvement Different Server Usage Pattern Different Pattern of API Request Different Scale of Server Usage

Different Scale of Server Usage
Compare the CPU utilization of Server01 with the average CPU utilization of the other servers (Server02, Server03, Server04)
Alert rule: {One server} > n * {The average of other servers}

Different Scale of Server Usage
n = 2
- Average of CPU Utilization: 25%, Threshold: 50% → catches slow API
- Average of CPU Utilization: 5%, Threshold: 10% → makes false positive alerts
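The scale problem can be reproduced with the same numbers. The sketch below applies the relative rule with n = 2 to a busy group and an idle group. The optional absolute `floor` is an assumption added here as one possible mitigation; it is not something the slides propose.

```python
def exceeds_peer_average(cpu: float, peer_avg: float, n: float = 2.0,
                         floor: float = 0.0) -> bool:
    """Relative rule `cpu > n * peer_avg`, optionally gated by an absolute floor."""
    return cpu > n * peer_avg and cpu >= floor


# Busy group: average 25%, threshold 50%. A server at 60% is a real outlier.
busy = exceeds_peer_average(60.0, 25.0)
# Idle group: average 5%, threshold 10%. A server at 11% trips the rule
# although 11% CPU is harmless: a false positive.
idle = exceeds_peer_average(11.0, 5.0)
# With an assumed floor of 30%, the idle-group case is suppressed.
idle_with_floor = exceeds_peer_average(11.0, 5.0, floor=30.0)
print(busy, idle, idle_with_floor)
```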

Different Pattern of API Request Next things to do for
improvement Different Server Usage Pattern Different Pattern of API Request Different Scale of Server Usage

Different Pattern of API Request
- Frequently Requested API: Average, Standard Deviation
- Rarely or Regularly Requested API: Average, Standard Deviation do not work..

Different Pattern of API Request Ideal API that we expected

Different Pattern of API Request Rarely requested API

Different Pattern of API Request Regularly requested API
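Why the average ± 3 standard deviations band breaks down for these request patterns can be seen with two made-up windows: a rarely requested API whose history is all zeros, and a regularly requested API with one spike per window. The sample values are illustrative.

```python
from statistics import mean, pstdev


def band(history: list, k: float = 3.0) -> tuple:
    """Return (average - k*stddev, average + k*stddev) for the window."""
    avg, std = mean(history), pstdev(history)
    return avg - k * std, avg + k * std


# Rarely requested API: no requests in the window, so the band collapses
# to (0, 0) and a single legitimate request looks abnormal.
rare_low, rare_high = band([0.0] * 10)

# Regularly requested API: one spike per window inflates the deviation,
# so the lower bound goes negative and a missing spike is never flagged.
regular_low, regular_high = band([0.0] * 9 + [50.0])

print(rare_low, rare_high)
print(regular_low, regular_high)
```

In the first case every request is a false positive; in the second the rule cannot detect a decrease at all, because RPS cannot fall below a negative bound.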

What’s next?
Next things to do for improvement: Reducing False Positive Alert & Making Reliable Monitoring System

Next, there are more interesting things!

LINE Pay Anomaly Log Detection
Sun Uk Kim, FDS (Fraud Detection System)

Where Did This Need Come From?
- Service Size ↑
- False Positive ↑
- Fatigue of Engineer ↑
- True Positive Awareness ↓
- Outage ↑
LINE Pay Log Summary
- API: ~400
- Return Code: ~200
- # of Error: ~10M

Where Did This Need Come From?
LINE Pay Log Summary
- API: ~400
- Return Code: ~200
- # of Error: ~10M
Goal: Log Forecast
- False Positive ↓
- Fatigue of Engineer ↓
- True Positive Awareness ↑
- Outage ↓

Hypothesis Formulation
Outage Report
Category     | Rate
Request Log  | 45%
Log Message  | 35%
System Stats | 5%
Others       | 15%
- Hypothesis 1 (Request Log): Sudden Increase of Particular Requests
- Hypothesis 2 (Log Message): Tangled Sequence of Logs

Model Selection
Criteria
- Unsupervised Learning
- Robust to Retraining
- Fast Implementation
- Easy-to-Find Root Cause
- Simple Approach
Selected models
- Hypothesis 1: Gaussian Mixture Model
- Hypothesis 2: LSTM with Workflow

Gaussian Mixture Model (GMM) Abnormal Reference: https://www.askpython.com/python/normal-distribution

Gaussian Mixture Model (GMM) Reference: https://github.com/rickiepark/handson-ml2
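One common way to use a Gaussian mixture for anomaly detection is to score each sample by its density under the fitted mixture and flag the lowest-density samples. The sketch below assumes a one-dimensional mixture that has already been fitted (for example by EM); the component parameters, the samples, and the 10% quantile are made up for illustration and are not the features or thresholds used in the talk.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_log_density(x: float, components: list) -> float:
    """Log density under a mixture given as (weight, mean, std) tuples."""
    density = sum(w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in components)
    return math.log(density) if density > 0.0 else float("-inf")


def find_anomalies(samples: list, components: list, quantile: float = 0.1) -> list:
    """Flag samples whose log density is below the given quantile of scores."""
    scores = [mixture_log_density(x, components) for x in samples]
    cutoff = sorted(scores)[int(quantile * len(scores))]
    return [x for x, s in zip(samples, scores) if s < cutoff]


# Assumed, already-fitted mixture: 70% around 10, 30% around 50.
components = [(0.7, 10.0, 2.0), (0.3, 50.0, 5.0)]
samples = [9.0, 10.5, 11.0, 12.0, 48.0, 50.0, 52.0, 55.0, 30.0, 8.0]
anomalies = find_anomalies(samples, components)
print(anomalies)
```

The value 30.0 lies between the two clusters, where the mixture density is lowest, so it is the one sample flagged.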

GMM Feature #1

Challenge #1 (Gaussian Mixture Model)
- Expensive Processing
- Similar But Different
- Unbalanced Data
- Vulnerable to Retraining

Solution #1: Change Features

Solution #1 Result Gaussian Mixture Model

Challenge #2: Batch

Solution #2: Fill in the blank
[Figure: two charts over t1 to t16, y-axis 0 to 140]
90% Elimination of Batch Anomaly

Challenge #3: Minorities (Gaussian Mixture Model)
Expected Behavior
- Light: 1 → 100 (100%)
- Heavy: 100 → 10,000 (100%)
Actual Behavior
- Light: 1 → 100 (100%)
- Heavy: 1 → 10,000 (10,000%)

Solution #3: Minorities (Gaussian Mixture Model)
3-1. Apply Penalty
- p_i = 1 − (# of …_i) / max(…_i)
- x_i = x_i × p_i
3-2. Adding Features
- Windows: ~5 min, ~10 min, ~30 min
- + weekly, monthly

GMM Result Summary (Gaussian Mixture Model)
- 10M Err → 500⇩
- ~30 types of APIs
- Post Processing: 500 → 150⇩
- 15% of Outage ⊆ 150

LSTM (Long Short-Term Memory)

Architecture Gaussian Mixture Model Training

Architecture Gaussian Mixture Model Training Detection

Challenge Concurrency

Solution : Concurrency vs. Multi-Task Concurrency Multi-Task Workflow Library

Solution: Concurrency vs. Multi-Task

Result Discussion
- Case I: Token Error
- Case II: Service Maintenance
- Case III: Macro Retries from Particular User
- Case IV: Account Lock
- Case V: Memory Errors
- Case VI: Connection Failed
- Case VII: New Message

Thank you


More Decks by LINE DEVDAY 2021
