Slide 1

Slide 1 text

From Chaos to Control: Using Observability to Demystify the Invoice Manager (發票管家)

Slide 2

Slide 2 text

Tristan
Education
• B.B.A. in Finance @ NTU
Experience
• 2023 - 2024 | TECH FRESH @ LINE Taiwan
• 2022 - 2023 | Software Engineer Intern @ Junyiacademy
• 2022 | Backend Trainee @ AppWorks School

Slide 3

Slide 3 text

LINE INVOICE 發票管家

Slide 4

Slide 4 text

LINE OAP

Slide 5

Slide 5 text

CONTENT
01 Introduction to Observability
02 Three Pillars of Observability
03 Case Study: LINE INVOICE
04 Conclusion

Slide 6

Slide 6 text

SECTION 01 Introduction to Observability

Slide 7

Slide 7 text

What actions should be taken when a system error occurs?

Slide 8

Slide 8 text

Purpose of Observability

Slide 9

Slide 9 text

SECTION 02 Three Pillars of Observability

Slide 10

Slide 10 text

Three Pillars of Observability

Slide 11

Slide 11 text

Logs
1. Immutable, timestamped record of discrete events
2. Record the necessary information for each request
Source: https://grafana.com/products/cloud/logs/

Slide 12

Slide 12 text

Logs Format
● Unstructured - plain text
● Structured - JSON format
● Binary
  ○ MySQL binlogs
  ○ systemd journal logs

Slide 14

Slide 14 text

Logs Format
● Unstructured - plain text
● Structured - JSON format
● Binary
  ○ MySQL binlogs
  ○ systemd journal logs
Source: https://www.percona.com/blog/binlog-encryption-percona-server-mysql/
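To make the difference between the unstructured and structured formats concrete, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `request_id` field, and the sample message are illustrative assumptions, not the setup used by the speakers.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (a structured log line)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is a hypothetical field used only for this example.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


# Build one record by hand so both formatters render the same event.
record = logging.LogRecord(
    name="demo", level=logging.ERROR, pathname="demo.py", lineno=0,
    msg="invoice fetch failed", args=(), exc_info=None,
)
record.request_id = "req-123"

# Unstructured: free-form plain text, easy to read but hard to query.
plain = logging.Formatter("%(asctime)s %(levelname)s %(message)s").format(record)
# Structured: every field is addressable by key.
structured = JsonFormatter().format(record)

print(plain)
print(structured)
```

The structured line can be parsed back with `json.loads`, which is what makes field-level search (for example, by `request_id`) straightforward in a log backend.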

Slide 15

Slide 15 text

Logs Collection Flow

Slide 16

Slide 16 text

Metrics
Quantitative insight into system performance and resource utilization
Source: https://grafana.com/products/cloud/metrics/

Slide 17

Slide 17 text

Metrics Supported Data Types Counter Gauge Histogram Summary

Slide 18

Slide 18 text

Metrics Supported Data Types: Counter, Gauge, Histogram, Summary
Counter
• Only increases, never decreases
• Application: total number of HTTP requests

Slide 19

Slide 19 text

Metrics Supported Data Types: Counter, Gauge, Histogram, Summary
Gauge
• Can increase or decrease at any time
• Application: number of concurrent requests
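The counter and gauge semantics can be sketched in a few lines of plain Python. This is a toy model for illustration only; the class names, metric names, and `handle_request` function are assumptions and do not come from any particular metrics library.

```python
class Counter:
    """A value that can only go up, e.g. total requests served."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A value that can go up or down, e.g. requests currently in flight."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


http_requests_total = Counter()   # hypothetical metric name
in_flight_requests = Gauge()      # hypothetical metric name


def handle_request() -> None:
    """Stand-in for a request handler that updates both metrics."""
    http_requests_total.inc()
    in_flight_requests.inc()
    try:
        pass  # the real work would happen here
    finally:
        in_flight_requests.dec()


for _ in range(3):
    handle_request()

print(http_requests_total.value, in_flight_requests.value)
```

After the three calls the counter keeps its accumulated total, while the gauge has returned to zero because no request is in flight any more.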

Slide 20

Slide 20 text

Metrics Supported Data Types: Counter, Gauge, Histogram, Summary
Histogram
• Counts observations in configurable buckets; the bucket counts only increase, never decrease
• Application: request durations
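A histogram for request durations can be sketched as a set of buckets whose counts only ever grow. The version below is a simplified stand-in, assuming hypothetical bucket bounds of 0.1 s, 0.5 s, and 1.0 s plus an implicit overflow bucket; real metrics libraries add labels, thread safety, and exposition formats on top.

```python
import bisect


class Histogram:
    """Count observations into buckets defined by their upper bounds."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket at the end catches everything above the last bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value

    def cumulative_counts(self) -> list[int]:
        """Return counts of observations <= each bound (last entry is the total)."""
        running = 0
        result = []
        for bucket_count in self.bucket_counts:
            running += bucket_count
            result.append(running)
        return result


histogram = Histogram([0.1, 0.5, 1.0])        # hypothetical bounds, in seconds
for duration in [0.05, 0.3, 0.3, 0.8, 2.0]:   # made-up request durations
    histogram.observe(duration)

print(histogram.bucket_counts)
print(histogram.cumulative_counts())
```

Because only bucket counts are stored, memory use stays constant no matter how many requests are observed, at the cost of knowing each duration only to bucket precision.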

Slide 21

Slide 21 text

Metrics Supported Data Types: Counter, Gauge, Histogram, Summary
Summary
• Calculates quantiles (for example, the median or the 95th percentile) from the sampled observations
• Application: request durations
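As a rough illustration of what a summary reports, the sketch below keeps a bounded window of recent observations and answers quantile queries with the nearest-rank method. This is a deliberate simplification: the `Summary` class, the window size, and the sample durations are assumptions, and production libraries generally use streaming estimators over a sliding time window rather than sorting raw samples.

```python
import math
from collections import deque


class Summary:
    """Keep recent observations and answer quantile queries over them."""

    def __init__(self, max_samples: int = 1000) -> None:
        # Oldest samples are dropped once the window is full.
        self.samples: deque[float] = deque(maxlen=max_samples)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        self.samples.append(value)
        self.count += 1
        self.sum += value

    def quantile(self, q: float) -> float:
        """Nearest-rank quantile: smallest sample with at least q of the data at or below it."""
        if not self.samples:
            raise ValueError("no observations yet")
        if not 0.0 < q <= 1.0:
            raise ValueError("q must be in (0, 1]")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


summary = Summary()
for duration_ms in [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]:  # made-up data
    summary.observe(duration_ms)

print("median:", summary.quantile(0.5))
print("p90:", summary.quantile(0.9))
```

Compared with a histogram, this trades bucket-level approximation for quantiles computed directly from the samples, which is why the two types are often both offered for request durations.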

Slide 22

Slide 22 text

Metrics Collection Flow

Slide 23

Slide 23 text

Traces
● Record and visualize the complete path of a request through the system
● Identify specific points
Source:
https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/traces/
https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/ch04.html

Slide 24

Slide 24 text

Traces Span

Slide 25

Slide 25 text

Traces Span Structure
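The slide's diagram is not preserved in the text, so the following is only a sketch of the fields a span commonly carries in distributed tracing systems: a shared trace ID, its own span ID, an optional parent span ID, a name, and start and end times. The field names and the sample spans are assumptions chosen for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed operation within a trace (field names are illustrative)."""

    trace_id: str                      # shared by every span in the same request
    name: str                          # operation name, e.g. "db_query"
    start_time: float                  # seconds since the trace started
    end_time: float
    parent_span_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


trace_id = uuid.uuid4().hex

# The root span covers the whole request; the child covers one step inside it.
root = Span(trace_id=trace_id, name="handle_request", start_time=0.0, end_time=1.0)
child = Span(
    trace_id=trace_id,
    name="db_query",
    start_time=0.2,
    end_time=0.7,
    parent_span_id=root.span_id,
    attributes={"db.system": "mysql"},  # hypothetical attribute
)

print(child.name, "is a child of", root.name, "and took", round(child.duration, 3), "s")
```

The parent link is what lets a tracing backend rebuild the tree of operations for a request and draw it as a timeline.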

Slide 26

Slide 26 text

Traces Collection Flow

Slide 27

Slide 27 text

SECTION 03 Case Study LINE INVOICE

Slide 28

Slide 28 text

Gary Hu Education • M.S. in Computer Science @ NTU • B.B.A in Information Management @ NTU Experience • 2023 - 2024 | TECH FRESH @ LINE Taiwan • 2022 - 2023 | Software Engineer Intern @ KKCompany • 2022 | Research Assistant @ Academia Sinica

Slide 29

Slide 29 text

LINE INVOICE 發票管家

Slide 30

Slide 30 text

LINE Sticker: Sticker Sending Mission (貼圖傳送任務)

Slide 31

Slide 31 text

SECTION 03 Case Study LINE INVOICE

Slide 32

Slide 32 text

Case 1: Mystery Behind the Blank Screen
Scenario: Thousands of users simultaneously accessing our system
Problem: Users are met with blank screens and error messages.
Challenges: We need to investigate the error and identify its cause.

Slide 33

Slide 33 text

Case 1: Mystery Behind the Blank Screen
Steps
1. Centralized Log Collection
2. Log Search
3. Identify Log Locations
4. Impact Tracking
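The four steps above can be sketched against a handful of structured log lines. Everything in this example is hypothetical: the log entries, the field names (`level`, `service`, `location`, `user_id`), and the values are invented to show the shape of the workflow, not the actual LINE INVOICE logs or tooling.

```python
import json
from collections import Counter

# Step 1: logs from several services, already gathered in one place.
raw_lines = [
    '{"level": "INFO",  "service": "web", "location": "home.render",   "user_id": "u1"}',
    '{"level": "ERROR", "service": "api", "location": "invoice.fetch", "user_id": "u2"}',
    '{"level": "ERROR", "service": "api", "location": "invoice.fetch", "user_id": "u3"}',
    '{"level": "ERROR", "service": "web", "location": "home.render",   "user_id": "u2"}',
    '{"level": "INFO",  "service": "api", "location": "invoice.fetch", "user_id": "u4"}',
]
logs = [json.loads(line) for line in raw_lines]

# Step 2: search for the error entries.
errors = [entry for entry in logs if entry["level"] == "ERROR"]

# Step 3: identify where in the code the errors were logged.
error_locations = Counter(entry["location"] for entry in errors)

# Step 4: track impact by counting the distinct users who hit an error.
affected_users = {entry["user_id"] for entry in errors}

print(error_locations.most_common())
print(len(affected_users), "users affected")
```

With structured fields, each step is a simple filter or aggregation, which is the same kind of query a centralized log backend would run at scale.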

Slide 34

Slide 34 text

Case 2: Peak Traffic Monitoring
Scenario: Thousands of users simultaneously accessing our system
Problem:
1. Server cannot handle all requests
2. Timeouts and poor user experience
Challenges: Identify bottlenecks and optimize server performance

Slide 35

Slide 35 text

Case 2: Peak Traffic Monitoring
• 3.7s: Users start getting frustrated
• 75%: Speed affects user experience
• 53%: Users abandon after three seconds
Source:
https://pixotech.com/blog/what-a-performance-how-site-speed-affects-ux/
https://smallbusinessweb.co/impact-of-website-loading-speed/

Slide 36

Slide 36 text

Case 2: Peak Traffic Monitoring
Steps
1. Collect metrics from all services
2. Visualize metrics to understand system behavior
3. Monitor traffic volume and response time

Slide 37

Slide 37 text

Case 2: Peak Traffic Monitoring
Steps
4. Collect CPU and memory usage
5. Identify issues
6. Address inappropriate configurations

Slide 38

Slide 38 text

About Our System

Slide 39

Slide 39 text

Case 3: Mystery of the 5-Minute Workflow
Scenario: Many workflows are executed daily to fetch user invoices
Problem: We discovered that numerous workflows are taking over 5 minutes to complete.
Challenges: Identify the cause of the delays and optimize the workflow performance.

Slide 40

Slide 40 text

Case 3: Mystery of the 5-Minute Workflow
Steps
1. Collect traces
2. Visualize traces
3. Analyze each span

Slide 41

Slide 41 text

Case 3: Mystery of the 5-Minute Workflow
Findings
● Fetching invoices from the government takes 27 seconds.
● Storing the invoices in the database, however, takes 54 seconds.
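The comparison behind these findings is simple arithmetic over span durations. The sketch below uses the two durations reported on the slide; the span names are hypothetical labels, and the slide does not say how these two spans relate to the full 5-minute workflow, so the shares are computed only against their combined time.

```python
# Durations in seconds, taken from the findings; the names are illustrative.
span_durations = {
    "fetch_invoices_from_government": 27.0,
    "store_invoices_in_database": 54.0,
}

combined = sum(span_durations.values())
slowest_name = max(span_durations, key=span_durations.get)
slowest_share = span_durations[slowest_name] / combined

for name, seconds in span_durations.items():
    print(f"{name}: {seconds:.0f}s ({seconds / combined:.0%} of the two spans)")

print("slowest span:", slowest_name)
```

Ranking spans by duration this way points optimization effort at the database write, which takes twice as long as the external fetch.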

Slide 42

Slide 42 text

Conclusion Precision Detection Optimization

Slide 43

Slide 43 text

Thank You Tristan Wu Gary Hu