Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Introduction: Hot chat on LINE OpenChat
› Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“hot chat”)
› x100 traffic spikes in API requests / 1 min on 1 hot chat

Slide 3

Slide 3 text

Agenda
- Introduction to LINE OpenChat
- How LINE OpenChat server handles traffic spikes on hot chat
  - Fetch spikes on hot chat
  - Join spikes on hot chat
- Future plans

Slide 4

Slide 4 text

LINE OpenChat Discovery Messaging Entrance

Slide 5

Slide 5 text

Overview
› API requests: 10B requests / 1 day, 10M requests / 1 minute
› Members in 1 OpenChat: 5K+ members
› Requests on 1 OpenChat: 200K requests / 1 minute

Slide 6

Slide 6 text

Hot chat on LINE OpenChat
› Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“hot chat”)
› Send message API requests: 100 messages / 1 sec
› 200K API requests / 1 min on 1 hot chat
› x100 traffic spikes

Slide 7

Slide 7 text

Event delivery architecture
1. Store (Storage)
2. Server push (5K members)
3. Fetch events
› 1 event x 5,000 members = 5K fetch events
› Events on 1 chat: Send Message, Message Reaction, Message Mark As Read, Chat Status Update, Member Status Update, Note & Post Update

Slide 8

Slide 8 text

How LINE OpenChat server handles extreme traffic spikes on hot chat

Slide 9

Slide 9 text

Case 1) fetchEvent spikes on hot chat
1. Send message
2. Store on storages
3. Publish event
4. Server push
5. Fetch events
[Diagram: LINE clients, OpenChat server, Publish server, Storages]

Slide 10

Slide 10 text

Hot chat metrics
› Storage (MySQL, Redis) traffic spikes on 1 shard
› Response timeouts on 1 storage shard
[Charts: MySQL chat related query / 1 min and slow query / 1 min; Redis chat related request / 1 min; OpenChat server response timeout / 1 min. Spikes annotated +200 ~ 300%.]

Slide 11

Slide 11 text

Hot chat metrics
› Kafka offset lag spikes on 1 partition
› GC, CPU usage spikes on some server groups
[Charts: Kafka offset lag / 1 min; garbage collection time / 1 min; response timeout / 1 min]

Slide 12

Slide 12 text

Hot key problem by hot chat
› Sharding by key: chatId
› On hot chat, all events go to the same shard and the same key
[Diagram: API requests routed to Shard 1, Shard 2, Shard 3; chart: API requests / 1 min on 1 hot chat]
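A minimal sketch of why sharding by chatId produces a hot key, assuming a simple hash-modulo router. The class name ChatShardRouter and the hashing scheme are illustrative assumptions; the slides do not describe how LINE OpenChat actually maps chatId to a shard.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical router: picks a shard from chatId alone (hash modulo shard count). */
class ChatShardRouter {
    private final int shardCount;

    ChatShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** The result depends only on chatId, so one chat always maps to one shard. */
    int shardFor(String chatId) {
        return Math.floorMod(chatId.hashCode(), shardCount);
    }

    /** Counts how many requests each shard receives for the given request stream. */
    Map<Integer, Integer> loadPerShard(Iterable<String> requestChatIds) {
        Map<Integer, Integer> load = new HashMap<>();
        for (String chatId : requestChatIds) {
            load.merge(shardFor(chatId), 1, Integer::sum);
        }
        return load;
    }
}
```

Under this assumption, every request for one hot chat lands on a single shard no matter how large shardCount is, so adding shards spreads other chats but not the hot one.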

Slide 13

Slide 13 text

Hot key problem by hot chat
› Sharding by key: chatId
› More shards: Shard 1, Shard 2, Shard 3, Shard 4, …
› More replications: Replication 1, Replication 2, Replication 3, …

Slide 14

Slide 14 text

Need to focus on hot chat
› Hot chat: 0.1%
› Normal chat: 99.9%

Slide 15

Slide 15 text

Solution) Hot chat detection & throttling
› Throttle too many fetch events on hot chat
[Diagram: LINE clients, OpenChat server, Publish server, Storages]

Slide 16

Slide 16 text

Hot chat detection
› Compare fetch events per chat (Chat A, Chat B) against the hot chat threshold
› Set chat B as hot chat
[Diagram: LINE clients, OpenChat server, Publish server, Storages]

Slide 17

Slide 17 text

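Hot chat detection

    Cache bucket;

    public Completable process(kafka topic) {
        // Take one token from this chat's bucket; failure means the threshold was exceeded
        boolean consume = bucket.get(chatId).tryConsume();
        if (!consume) {
            // Mark the chat as hot for N seconds
            hotChatStorage.set(chatId, N seconds);
        }
    }

[Diagram: fetch events for Chat A and Chat B against the hot chat threshold]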
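A self-contained sketch of threshold-based hot chat detection, assuming a fixed-window counter per chat in place of the token bucket on the slide. HotChatDetector, its parameters, and the in-memory maps are hypothetical stand-ins; the slide's bucket cache and hotChatStorage are not specified further.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical detector: marks a chat hot when fetch events in one window exceed a threshold. */
class HotChatDetector {
    private static final class Window {
        long startMillis;
        long count;
    }

    private final long threshold;
    private final long windowMillis;
    private final long hotMillis;
    private final LongSupplier clock; // injected so the sketch can be tested without sleeping
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> hotUntil = new ConcurrentHashMap<>();

    HotChatDetector(long threshold, long windowMillis, long hotMillis, LongSupplier clock) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.hotMillis = hotMillis;
        this.clock = clock;
    }

    /** Records one fetch event and flags the chat once the window count exceeds the threshold. */
    void onFetchEvent(String chatId) {
        long now = clock.getAsLong();
        Window w = windows.computeIfAbsent(chatId, id -> new Window());
        synchronized (w) {
            if (now - w.startMillis >= windowMillis) {
                w.startMillis = now; // start a fresh counting window
                w.count = 0;
            }
            w.count++;
            if (w.count > threshold) {
                hotUntil.put(chatId, now + hotMillis); // hot flag expires on its own
            }
        }
    }

    /** True while the hot flag for this chat has not expired. */
    boolean isHot(String chatId) {
        Long until = hotUntil.get(chatId);
        return until != null && clock.getAsLong() < until;
    }
}
```

With System::currentTimeMillis as the clock, a 10-second window and a 30-second hot period would mirror the durationSeconds and expireSeconds values shown on the dynamic configuration slide; the threshold itself is left as X in the slides.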

Slide 18

Slide 18 text

Hot chat throttling
› Check hot chat, then throttle server push by X%
› Throttle too many fetch events on hot chat
[Diagram: LINE clients, OpenChat server, Publish server, Storages]

Slide 19

Slide 19 text

Hot chat throttling
› Before: each new event on hot chat A triggers a server push to 5K members, so Shard 1 receives 5K fetch events per event
› After: server push on hot chat A is throttled by X% within each 1-second window
[Diagram: new events per 1 second, server push to 5K members, fetch events on Shard 1]
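One way to picture throttling server push by X% is to skip a fixed share of per-member pushes once a chat is flagged hot. The sketch below assumes a deterministic credit counter; ServerPushThrottler and selectRecipients are hypothetical names, and the slides do not say which pushes are dropped or how throttled clients catch up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical throttler: lets through (100 - throttlePercent)% of server pushes. */
class ServerPushThrottler {
    private final int throttlePercent;
    private int credit = 0;

    ServerPushThrottler(int throttlePercent) {
        if (throttlePercent < 0 || throttlePercent > 100) {
            throw new IllegalArgumentException("throttlePercent must be in [0, 100]");
        }
        this.throttlePercent = throttlePercent;
    }

    /** Returns true if this push should be sent, false if it is throttled. */
    synchronized boolean allowPush() {
        credit += 100 - throttlePercent; // each call earns a fraction of one push
        if (credit >= 100) {
            credit -= 100;
            return true;
        }
        return false;
    }

    /** Normal chats push to every member; hot chats push only to the allowed share. */
    List<String> selectRecipients(List<String> memberIds, boolean hotChat) {
        if (!hotChat) {
            return memberIds;
        }
        List<String> selected = new ArrayList<>();
        for (String memberId : memberIds) {
            if (allowPush()) {
                selected.add(memberId);
            }
        }
        return selected;
    }
}
```

Since each push normally leads to one fetch, cutting pushes by X% on a 5K-member chat would cut the resulting fetch events on the chat's shard by about the same share.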

Slide 20

Slide 20 text

Dynamic configuration
› LINE Central Dogma enables dynamic configuration without restarting the server

    "hotChatConfigs": {
        "enabled": true,
        "dryRun": false,
        // hot chat detection
        "durationSeconds": 10,
        "threshold": X,
        // hot chat throttling
        "expireSeconds": 30,
        "throttleRate": Y%,
        // target chat
        "chatId": { .. }
    }

Slide 21

Slide 21 text

Result
› Reduce storage (MySQL, Redis) traffic spikes on 1 shard
› Remove response timeout by hot chat
[Charts: MySQL chat related query / 1 min, spike reduced from +200 ~ 300% to +50 ~ 100%; MySQL slow query / 1 min, no slow query]

Slide 22

Slide 22 text

Result
› Automated hot chat detection and throttling
› Effectively control fetch event spikes on hot chat
› No more service impact by hot chat
[Screenshot: hot chat dashboard]

Slide 23

Slide 23 text

Case 2) Join spikes on hot chat

Slide 24

Slide 24 text

Join LINE OpenChat
› Join from Discovery
› Join from QR code, link

Slide 25

Slide 25 text

Join spikes on hot chat
› Max join traffic on 1 chat has recently increased significantly: from 100 join / 10 sec to 2K join / 1 sec
[Chart: max join on 1 chat, 2018 to 2022]

Slide 26

Slide 26 text

Join spikes on hot chat
1. Join OpenChat
2. Store chat member on MySQL
› Join spikes on 1 chat trigger heavy load on MySQL
[Diagram: LINE client, OpenChat server, Storages]

Slide 27

Slide 27 text

Join spikes metrics
› Chat member insert query and CPU usage spike on 1 MySQL shard
[Charts: MySQL insert chat member query / 1 sec (500+ join / 1 sec); MySQL CPU usage (80%) and load average; MySQL slow query / 1 min]

Slide 28

Slide 28 text

Join spikes metrics
› Response timeout, thread pool queued on OpenChat server
[Charts: OpenChat server response timeout / 1 min; OpenChat server queued thread / 1 min]

Slide 29

Slide 29 text

Improve bottlenecks on MySQL

Slide 30

Slide 30 text

“Insert chat member” query
› Insert chat member query on MySQL: an “INSERT … SELECT” query is treated as “bulk inserts” and takes the AUTO-INC table-level lock
› MySQL AUTO_INCREMENT handling in InnoDB: https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html
[Diagram: chat member rows (JOINED, …) inserted under the AUTO-INC table-level lock]

Slide 31

Slide 31 text

“Insert chat member” query
› Insert chat member query on MySQL takes the AUTO-INC table-level lock
› MySQL AUTO_INCREMENT handling in InnoDB: https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html
[Charts: MySQL QPS; AUTO-INC table lock; CPU usage 100%]

Slide 32

Slide 32 text

Change innodb_autoinc_lock_mode
› Set innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)

Slide 33

Slide 33 text

Change innodb_autoinc_lock_mode
› Set innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)
[Charts: MySQL QPS at 2K join / 1 sec; CPU usage 10 ~ 20%]

Slide 34

Slide 34 text

“Get chat member count” query
› Get chat member count query and MySQL query cache
[Diagram: chat member table for chat id 100 with rows State = LEAVED, State = JOINED, State = JOINED; query cache holds member count: 2]

Slide 35

Slide 35 text

“Get chat member count” query
› MySQL query cache uses table-level lock
› 3 joins cause 3 updates of the cached member count: 2, then 3, 4, 5
[Diagram: chat member table for chat id 100 with rows State = LEAVED and State = JOINED; query cache entry for member count]

Slide 36

Slide 36 text

“Get chat member count” query
› MySQL query cache uses table-level lock
› 3 joins cause 3 updates of the cached member count: 2, then 3, 4, 5, . . .
[Charts: MySQL QPS; MySQL query cache table lock]

Slide 37

Slide 37 text

“Get chat member count” query
› Remove MySQL query cache
› Add chat member count table (chat id 100: member count 2)
› Time complexity of get chat member count query: O(N) -> O(log N)
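The effect of a separate member count table can be illustrated in memory, assuming the count is adjusted on every join and leave. ChatMemberStore and its methods are hypothetical; the list stands in for the chat member table and the TreeMap for the count table, whose O(log N) lookup loosely mirrors an indexed primary-key read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the chat member table plus a member count table. */
class ChatMemberStore {
    enum State { JOINED, LEAVED }

    private static final class Member {
        final String chatId;
        final String userId;
        State state;

        Member(String chatId, String userId, State state) {
            this.chatId = chatId;
            this.userId = userId;
            this.state = state;
        }
    }

    private final List<Member> members = new ArrayList<>();           // chat member rows
    private final Map<String, Integer> memberCount = new TreeMap<>(); // count table keyed by chat id

    void join(String chatId, String userId) {
        members.add(new Member(chatId, userId, State.JOINED));
        memberCount.merge(chatId, 1, Integer::sum); // keep the counter in step with the rows
    }

    void leave(String chatId, String userId) {
        for (Member m : members) {
            if (m.chatId.equals(chatId) && m.userId.equals(userId) && m.state == State.JOINED) {
                m.state = State.LEAVED;
                memberCount.merge(chatId, -1, Integer::sum);
                return;
            }
        }
    }

    /** O(N): scans every member row, like counting JOINED rows on each request. */
    int countByScan(String chatId) {
        int count = 0;
        for (Member m : members) {
            if (m.chatId.equals(chatId) && m.state == State.JOINED) {
                count++;
            }
        }
        return count;
    }

    /** O(log N): a single keyed lookup in the count table. */
    int countFromTable(String chatId) {
        return memberCount.getOrDefault(chatId, 0);
    }
}
```

In a real database the row change and the counter update would need to happen in one transaction to stay consistent; that part is outside what the slides describe.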

Slide 38

Slide 38 text

Apply join throttling to protect MySQL

Slide 39

Slide 39 text

Join throttling
› When joins on one chat (Chat A, Chat B) exceed the MySQL processing limit, throttle join requests by X% for Y seconds
[Diagram: LINE clients, OpenChat server, Publish server, Storages]

Slide 40

Slide 40 text

› Join throttling can be delayed
[Chart: Kafka offset lag, delayed 1 second]

Slide 41

Slide 41 text

› Delayed join throttling triggers response timeouts on a single MySQL shard
[Charts: response timeout and thread pool queued at 2K join / 1 sec, delayed 1 second]

Slide 42

Slide 42 text

Local cache for join throttling
› Limit the upper bound of join spikes with a local cache, which has no external dependency and no delay
› Local cache: limit X join / Y second on single chat
› Join throttling (can be delayed)
[Diagram: LINE clients, OpenChat server, Publish server, Storages]
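A minimal sketch of an in-process join limiter, assuming a fixed window of X joins per Y seconds per chat held in local memory. LocalJoinLimiter and its parameters are hypothetical; the slides do not name the cache library or the actual limits.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-server limiter: at most maxJoins joins per window for each chat. */
class LocalJoinLimiter {
    private static final class Window {
        long startMillis;
        int joins;
    }

    private final int maxJoins;
    private final long windowMillis;
    private final LongSupplier clock; // injected so the sketch can be tested without sleeping
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    LocalJoinLimiter(int maxJoins, long windowMillis, LongSupplier clock) {
        this.maxJoins = maxJoins;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true if the join may proceed to MySQL, false if it should be rejected. */
    boolean tryJoin(String chatId) {
        long now = clock.getAsLong();
        Window w = windows.computeIfAbsent(chatId, id -> new Window());
        synchronized (w) {
            if (now - w.startMillis >= windowMillis) {
                w.startMillis = now; // start a fresh window
                w.joins = 0;
            }
            if (w.joins >= maxJoins) {
                return false; // decided locally, with no network call
            }
            w.joins++;
            return true;
        }
    }
}
```

Because each server counts only its own traffic, the cluster-wide upper bound in this sketch is roughly the per-server limit multiplied by the number of servers. A production version would also need to evict idle entries from the map.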

Slide 43

Slide 43 text

Circuit breaker and bulkhead
› Circuit breaker, bulkhead for service impact isolation
[Diagram: OpenChat server thread pool with one circuit breaker + bulkhead per shard (Shard 1, Shard 2, Shard 3)]

Slide 44

Slide 44 text

Circuit breaker and bulkhead
› Circuit breaker, bulkhead for service impact isolation
[Diagram: OpenChat server thread pool with one circuit breaker + bulkhead per shard (Shard 1, Shard 2, Shard 3); failed or slow calls between OpenChat server and Shard 1]
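The per-shard isolation can be sketched with one guard object per shard, assuming a semaphore as the bulkhead and a consecutive-failure counter as the circuit breaker. ShardGuard and its thresholds are hypothetical, and the sketch counts only failed calls, not slow ones; the slides do not name the library used in production.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.function.LongSupplier;

/** Hypothetical guard for one storage shard: a bulkhead (semaphore) plus a simple circuit breaker. */
class ShardGuard {
    private final Semaphore bulkhead;
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openUntil = 0;

    ShardGuard(int maxConcurrentCalls, int failureThreshold, long openMillis, LongSupplier clock) {
        this.bulkhead = new Semaphore(maxConcurrentCalls);
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    <T> T call(Callable<T> shardCall) throws Exception {
        synchronized (this) {
            if (clock.getAsLong() < openUntil) {
                throw new IllegalStateException("circuit open"); // fail fast, shard is not touched
            }
        }
        if (!bulkhead.tryAcquire()) {
            throw new IllegalStateException("bulkhead full"); // cap threads tied up by this shard
        }
        try {
            T result = shardCall.call();
            synchronized (this) {
                consecutiveFailures = 0;
            }
            return result;
        } catch (Exception e) {
            synchronized (this) {
                consecutiveFailures++;
                if (consecutiveFailures >= failureThreshold) {
                    openUntil = clock.getAsLong() + openMillis; // open the circuit for a while
                    consecutiveFailures = 0;
                }
            }
            throw e;
        } finally {
            bulkhead.release();
        }
    }
}
```

With one ShardGuard per shard, failures on Shard 1 open only Shard 1's circuit and occupy at most its own permits, so calls to Shard 2 and Shard 3 continue to run.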

Slide 45

Slide 45 text

Result
› No more service impact by join spikes
[Chart: CPU usage 10 ~ 20%]

Slide 46

Slide 46 text

What we’ve learned
Understand Hot Chat
- Hot chat triggers heavy load on storages in various patterns
- It’s hard to predict hot chat patterns and bottlenecks
Detection & Handling
- Monitor requests by API, country, app type, chat
- Prepare throttling beforehand and improve bottlenecks later
- Local cache and dynamic configuration are effective ways to handle hot chat
Isolation
- Sharding is not enough; circuit breaker and bulkhead are needed on each storage’s shards
- Hot chat is only 0.1%; choose a solution that fits the 0.1% hot chat well

Slide 47

Slide 47 text

Summary
Goal: Isolate and minimize service impact by hot chat
› Understand Hot Chat: fetch spikes on hot chat, join spikes on hot chat
› Detection & Handling: hot chat detection & throttling; improve MySQL bottlenecks & apply join throttling
› Isolation: circuit breaker, bulkhead; choose a solution that fits well on 0.1% hot chat

Slide 48

Slide 48 text

Future plans
› Reliability
› Hot chat adaptive, multiple threshold
› Integrate hot chat throttling with other methods that reduce server-side load
› Hot key problem
› More traffic, more features, more hot chat patterns
› Hot chat storage dynamic isolation

Slide 49

Slide 49 text

Thank you