How LINE OpenChat Server Handles Extreme Traffic Spikes

Tech-Verse2022

November 18, 2022

Transcript

  1. None
  2. Introduction: Hot chat on LINE OpenChat. Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“hot chat”). API requests / 1 min on 1 hot chat: x100 traffic spikes.
  3. Agenda
      - Introduction to LINE OpenChat
      - How LINE OpenChat server handles traffic spikes on hot chat
        - Fetch spikes on hot chat
        - Join spikes on hot chat
      - Future plans
  4. LINE OpenChat: Discovery, Messaging, Entrance
  5. Overview. API requests: 10B requests / 1 day, 10M requests / 1 minute. Members: 5K+ members on 1 OpenChat. 200K requests / 1 minute on 1 OpenChat.
  6. Hot chat on LINE OpenChat. Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“hot chat”). Send message: 100 messages / 1 sec. API requests: 200K API requests / 1 min on 1 hot chat (x100 traffic spikes).
  7. Event delivery architecture. Steps: 1. Store (storage); 2. Server push (5K members); 3. Fetch events. 1 event x 5,000 members = 5K fetch events. Events on 1 chat: Send Message, Message Reaction, Message Mark As Read, Chat Status Update, Member Status Update, Note & Post Update.
  8. How LINE OpenChat server handles extreme traffic spikes on hot chat
  9. Case 1) fetchEvent spikes on hot chat. Components: LINE clients, OpenChat server, Publish server, Storages. Flow: 1. Send message; 2. Store on storages; 3. Publish event; 4. Server push; 5. Fetch events.
  10. Hot chat metrics
      › Storage (MySQL, Redis) traffic spikes on 1 shard
      › Response timeouts on 1 storage shard
      Charts (MySQL, Redis, OpenChat server): chat-related queries / 1 min; slow queries / 1 min; chat-related requests / 1 min; response timeouts / 1 min. Spikes of +200 ~ 300%.
  11. Hot chat metrics
      › Kafka offset lag spikes on 1 partition
      › GC and CPU usage spikes on some server groups
      Charts: response timeouts / 1 min; Kafka offset lag / 1 min; garbage collection time / 1 min.
  12. Hot key problem caused by hot chat. Sharding by key: chatId (Shard 1, Shard 2, Shard 3).
      › On a hot chat, all events go to the same shard and the same key
      Chart: API requests / 1 min on 1 hot chat.
  13. Hot key problem caused by hot chat. Sharding by key: chatId. More shards (Shard 1, Shard 2, Shard 3, Shard 4, …); more replications (Replication 1, Replication 2, Replication 3, …).
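The hot key problem on slides 12 and 13 can be illustrated with a small sketch. It assumes a hash-modulo router, which is an illustration only: the deck says sharding is by key `chatId` but does not say how the shard is computed. `ChatShardRouter`, `shardFor`, and `shardsTouched` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical router: the deck only states "sharding by key: chatId".
final class ChatShardRouter {
    private final int shardCount;

    ChatShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // Every request for the same chatId maps to the same shard index.
    int shardFor(String chatId) {
        return Math.floorMod(chatId.hashCode(), shardCount);
    }

    // Returns the set of shards touched by `requests` requests on one chat.
    Set<Integer> shardsTouched(String chatId, int requests) {
        Set<Integer> shards = new HashSet<>();
        for (int i = 0; i < requests; i++) {
            shards.add(shardFor(chatId));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Requests on one hot chat land on a single shard regardless of
        // how many shards the cluster has.
        System.out.println(new ChatShardRouter(3).shardsTouched("hot-chat", 200_000).size());
        System.out.println(new ChatShardRouter(300).shardsTouched("hot-chat", 200_000).size());
    }
}
```

Under this assumption, adding shards spreads different chats apart but does not split the load of one chat, which is why a hot chat needs separate handling.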
  14. Need to focus on hot chat: hot chat 0.1%, normal chat 99.9%
  15. Solution) Hot chat detection & throttling: throttle too many fetch events on hot chat. Diagram: LINE clients, OpenChat server, Publish server, Storages.
  16. Hot chat detection. Fetch events per chat (Chat A, Chat B) are compared with the hot chat threshold; set Chat B as hot chat. Diagram: LINE clients, OpenChat server, Publish server, Storages.
  17. Hot chat detection. Fetch events per chat (Chat A, Chat B) against the hot chat threshold.
        Cache<ChatId, Bucket> bucket;
        public Completable process(kafka topic) {
            boolean consume = bucket.get(chatId).tryConsume();
            if (!consume) {
                hotChatStorage.set(chatId, N seconds);
            }
        }
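The pseudocode on slide 17 can be sketched as self-contained Java. This is a minimal illustration, not the LINE implementation: the hand-written `TokenBucket` stands in for the `Bucket` on the slide, a plain map stands in for `hotChatStorage`, and the class name, constructor parameters, and injected clock are all assumptions made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Sketch only: stands in for Cache<ChatId, Bucket> and hotChatStorage on the slide.
final class HotChatDetector {
    // Minimal token bucket: `capacity` tokens, refilled evenly over `windowMillis`.
    private static final class TokenBucket {
        private final long capacity;
        private final double refillPerMilli;
        private double tokens;
        private long lastRefillMillis;

        TokenBucket(long capacity, long windowMillis, long nowMillis) {
            this.capacity = capacity;
            this.refillPerMilli = (double) capacity / windowMillis;
            this.tokens = capacity;
            this.lastRefillMillis = nowMillis;
        }

        synchronized boolean tryConsume(long nowMillis) {
            long elapsed = Math.max(0, nowMillis - lastRefillMillis);
            tokens = Math.min(capacity, tokens + elapsed * refillPerMilli);
            lastRefillMillis = nowMillis;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

    private final long threshold;
    private final long windowMillis;
    private final long expireMillis;
    private final LongSupplier clock;
    // Unbounded maps keep the sketch short; a real cache would evict idle chats.
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final Map<String, Long> hotUntilMillis = new ConcurrentHashMap<>();

    HotChatDetector(long threshold, long windowMillis, long expireMillis, LongSupplier clock) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.expireMillis = expireMillis;
        this.clock = clock;
    }

    // Call once per fetch event; marks the chat hot when its bucket is empty.
    void onFetchEvent(String chatId) {
        long now = clock.getAsLong();
        TokenBucket bucket = buckets.computeIfAbsent(
                chatId, id -> new TokenBucket(threshold, windowMillis, now));
        if (!bucket.tryConsume(now)) {
            hotUntilMillis.put(chatId, now + expireMillis);
        }
    }

    boolean isHot(String chatId) {
        Long until = hotUntilMillis.get(chatId);
        return until != null && clock.getAsLong() < until;
    }

    public static void main(String[] args) {
        HotChatDetector detector =
                new HotChatDetector(3, 10_000, 30_000, System::currentTimeMillis);
        for (int i = 0; i < 4; i++) {
            detector.onFetchEvent("chat-b");
        }
        System.out.println("chat-b hot: " + detector.isHot("chat-b"));
    }
}
```

The threshold of 3 is only for the demo. Passing the clock in as a `LongSupplier` is a design choice of this sketch so that expiry can be tested without sleeping.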
  18. Hot chat throttling: throttle too many fetch events on hot chat. Check hot chat; throttle server push by X%. Diagram: LINE clients, OpenChat server, Publish server, Storages.
  19. Hot chat throttling. New events arriving within 1 second on hot chat A (5K members) each trigger a server push, producing 5K fetch events on Shard 1. Throttling X% is applied to server push on hot chat A.
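The throttling step on slides 18 and 19 can be sketched as follows. The deck says only that server push is throttled by X% on hot chats; skipping each push independently with probability `throttleRate` is an assumption of this sketch, and `ServerPushThrottler`, `shouldPush`, and the injected random source are hypothetical names.

```java
import java.util.Random;
import java.util.function.DoubleSupplier;
import java.util.function.Predicate;

// Sketch only: decides per push whether a server push is sent.
final class ServerPushThrottler {
    private final double throttleRate;          // fraction of pushes to skip, 0.0 to 1.0
    private final Predicate<String> isHotChat;  // hot chat lookup, e.g. a detector
    private final DoubleSupplier random;        // uniform values in [0, 1)

    ServerPushThrottler(double throttleRate, Predicate<String> isHotChat, DoubleSupplier random) {
        if (throttleRate < 0.0 || throttleRate > 1.0) {
            throw new IllegalArgumentException("throttleRate must be between 0 and 1");
        }
        this.throttleRate = throttleRate;
        this.isHotChat = isHotChat;
        this.random = random;
    }

    // Normal chats are never throttled; hot chats skip about throttleRate of pushes.
    boolean shouldPush(String chatId) {
        if (!isHotChat.test(chatId)) {
            return true;
        }
        return random.getAsDouble() >= throttleRate;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        ServerPushThrottler throttler =
                new ServerPushThrottler(0.8, "hot-chat"::equals, rng::nextDouble);
        int pushed = 0;
        for (int member = 0; member < 5_000; member++) {
            if (throttler.shouldPush("hot-chat")) {
                pushed++;
            }
        }
        System.out.println("pushes sent to 5,000 members: " + pushed);
    }
}
```

Fewer pushes mean fewer clients fetching at the same instant, which is the effect the slide attributes to throttling. The 0.8 rate is an arbitrary demo value, not the X% from the deck.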
  20. Dynamic configuration
      › LINE Central Dogma enables dynamic configuration without restarting the server
        "hotChatConfigs": {
          "enabled": true,
          "dryRun": false,
          // hot chat detection
          "durationSeconds": 10,
          "threshold": X,
          // hot chat throttling
          "expireSeconds": 30,
          "throttleRate": Y%,
          // target chat
          "chatId": { .. }
        }
  21. Result
      › Reduce storage (MySQL, Redis) traffic spikes on 1 shard
      › Remove response timeouts caused by hot chat
      Charts (MySQL): chat-related queries / 1 min, spike of +200 ~ 300% reduced to +50 ~ 100%; slow queries / 1 min, no slow query.
  22. Result
      › Automated hot chat detection and throttling
      › Effectively control fetch event spikes on hot chat
      › No more service impact by hot chat
      Hot chat dashboard
  23. Case 2) Join spikes on hot chat

  24. Join LINE OpenChat: Discovery; Join; QR code, link
  25. Join spikes on hot chat
      › Max join traffic has increased significantly in recent years
      Chart: max join on 1 chat, 2018 to 2022: 100 joins / 10 sec; 2K joins / 1 sec.
  26. Join spikes on hot chat. Flow (LINE client, OpenChat server, Storages): 1. Join OpenChat; 2. Store chat member on MySQL.
      › Join spikes on 1 chat trigger heavy load on MySQL
  27. Join spikes metrics
      › Chat member insert query and CPU usage spikes on 1 MySQL shard
      Charts (MySQL): insert chat member queries / 1 sec (500+ joins / 1 sec); slow queries / 1 min; CPU usage and load average (CPU usage 80%).
  28. Join spikes metrics
      › Response timeouts and queued thread pool tasks on the OpenChat server
      Charts (OpenChat server): response timeouts / 1 min; queued threads / 1 min.
  29. Improve bottlenecks on MySQL

  30. “Insert chat member” query. Insert chat member query on MySQL (JOINED … … …): an “INSERT … SELECT” query counts as “bulk inserts” and takes the AUTO-INC table-level lock.
      › MySQL AUTO_INCREMENT handling in InnoDB: https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html
  31. “Insert chat member” query. Insert chat member query on MySQL (JOINED … … …) with the AUTO-INC table-level lock. Charts: MySQL QPS; AUTO-INC table lock; CPU usage (CPU usage 100%).
      › MySQL AUTO_INCREMENT handling in InnoDB: https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html
  32. Change innodb_autoinc_lock_mode
      › Change innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)
  33. Change innodb_autoinc_lock_mode
      › Change innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)
      Charts: MySQL QPS (2K joins / 1 sec); CPU usage (CPU usage 10 ~ 20%).
  34. “Get chat member count” query: get chat member count query and MySQL query cache. Chat member table, chat id 100: State = LEAVED, State = JOINED, State = JOINED. Query cache: member count: 2.
  35. “Get chat member count” query: MySQL query cache uses table-level lock. Chat member table, chat id 100: State = LEAVED, State = JOINED, State = JOINED, then 3 joins (State = JOINED x 3). Query cache: 3 updates (member count: 2 -> 3 -> 4 -> 5).
  36. “Get chat member count” query: MySQL query cache uses table-level lock. Charts: MySQL QPS; MySQL query cache table lock. Chat member table, chat id 100: State = LEAVED, State = JOINED, State = JOINED, then 3 joins (State = JOINED x 3), . . . Query cache: 3 updates (member count: 2 -> 3 -> 4 -> 5), . . .
  37. “Get chat member count” query. Remove MySQL query cache. Add chat member count table (chat id 100, member count 2).
      › Time complexity of get chat member count query: O(N) -> O(log N)
      Before: chat member table, chat id 100 (State = LEAVED, State = JOINED, State = JOINED) with query cache (member count: 2).
  38. Apply join throttling to protect MySQL

  39. Join throttling. Joins on one chat (Chat A, Chat B) are compared with the MySQL processing limit; throttle join requests X% for Y seconds. Diagram: LINE clients, OpenChat server, Publish server, Storages; Join OpenChat.
  40. › Join throttling can be delayed
      Chart: Kafka offset lag (delayed 1 second).
  41. › Delayed join throttling triggers response timeouts on a single MySQL shard
      2K joins / 1 sec, delayed 1 second. Charts: response timeouts; thread pool queued.
  42. Local cache for join throttling
      › A local cache limits the upper bound of join spikes without dependency or delay
      Local cache: limit X joins / Y seconds on a single chat. Join throttling (can be delayed). Diagram: LINE clients, OpenChat server, Publish server, Storages.
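The local limit on slide 42 ("limit X join / Y second on single chat") can be sketched as an in-process fixed-window counter. The deck does not say which algorithm or cache library is used, so the fixed window, the plain `ConcurrentHashMap`, and the names `LocalJoinLimiter` and `tryJoin` are assumptions of this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Sketch only: per-process limit of maxJoins per windowMillis for each chat.
final class LocalJoinLimiter {
    private static final class Window {
        final long startMillis;
        int joins;

        Window(long startMillis) {
            this.startMillis = startMillis;
        }
    }

    private final int maxJoins;
    private final long windowMillis;
    private final LongSupplier clock;
    // Unbounded map keeps the sketch short; a real local cache would evict idle chats.
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    LocalJoinLimiter(int maxJoins, long windowMillis, LongSupplier clock) {
        this.maxJoins = maxJoins;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    // Returns true if the join may proceed to storage, false if it is rejected.
    boolean tryJoin(String chatId) {
        long now = clock.getAsLong();
        boolean[] allowed = {false};
        // compute() runs atomically per key, so the counter needs no extra lock.
        windows.compute(chatId, (id, window) -> {
            if (window == null || now - window.startMillis >= windowMillis) {
                window = new Window(now);
            }
            if (window.joins < maxJoins) {
                window.joins++;
                allowed[0] = true;
            }
            return window;
        });
        return allowed[0];
    }

    public static void main(String[] args) {
        LocalJoinLimiter limiter = new LocalJoinLimiter(2, 1_000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            System.out.println("join " + i + " allowed: " + limiter.tryJoin("chat-a"));
        }
    }
}
```

Because the counter lives in the server process, the decision needs no network call and cannot lag behind Kafka, at the cost of enforcing the limit per server instance rather than globally.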
  43. Circuit breaker and bulkhead
      › Circuit breaker and bulkhead for service impact isolation
      Diagram: OpenChat server thread pool with a circuit breaker + bulkhead per shard (Shard 1, Shard 2, Shard 3).
  44. Circuit breaker and bulkhead
      › Circuit breaker and bulkhead for service impact isolation
      Diagram: OpenChat server thread pool with a circuit breaker and bulkhead per shard (Shard 1, Shard 2, Shard 3); failed or slow calls on Shard 1.
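The per-shard isolation on slides 43 and 44 can be sketched with one guard object per shard. This is a simplified illustration, not the deck's implementation or a specific library: the bulkhead is a `Semaphore`, the breaker opens after a number of consecutive failures and closes again after a fixed time, and slow-call detection is left out. `ShardGuard`, `RejectedException`, and all parameters are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Sketch only: one instance per storage shard.
final class ShardGuard {
    static final class RejectedException extends RuntimeException {
        RejectedException(String message) {
            super(message);
        }
    }

    private final Semaphore bulkhead;      // caps concurrent calls to this shard
    private final int failureThreshold;    // consecutive failures that open the circuit
    private final long openMillis;         // how long the circuit stays open
    private final LongSupplier clock;
    private int consecutiveFailures;
    private long openUntilMillis;

    ShardGuard(int maxConcurrent, int failureThreshold, long openMillis, LongSupplier clock) {
        this.bulkhead = new Semaphore(maxConcurrent);
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    <T> T call(Supplier<T> storageCall) {
        if (isOpen()) {
            throw new RejectedException("circuit open");
        }
        if (!bulkhead.tryAcquire()) {
            throw new RejectedException("bulkhead full");
        }
        try {
            T result = storageCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        } finally {
            bulkhead.release();
        }
    }

    private synchronized boolean isOpen() {
        return clock.getAsLong() < openUntilMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openUntilMillis = clock.getAsLong() + openMillis;
            consecutiveFailures = 0;
        }
    }

    public static void main(String[] args) {
        ShardGuard shard1 = new ShardGuard(10, 5, 1_000, System::currentTimeMillis);
        System.out.println(shard1.call(() -> "member stored"));
    }
}
```

With a separate guard per shard, an overloaded shard fails fast and holds at most `maxConcurrent` threads, so calls to the other shards keep their share of the thread pool.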
  45. Result
      › No more service impact by join spikes
      CPU usage 10 ~ 20%
  46. What we’ve learned
      Understand Hot Chat
      - Hot chat triggers a heavy load on storages with various patterns
      - It’s hard to predict hot chat patterns and bottlenecks
      Detection & Handling
      - Monitor requests by API, country, app type, chat
      - Prepare throttling beforehand and improve bottlenecks later
      - Local cache and dynamic configuration are an effective way to handle hot chat
      Isolation
      - Sharding is not enough; circuit breakers and bulkheads are needed on each storage’s shards
      - Hot chat is only 0.1%; choose a solution that fits well for the 0.1% hot chat
  47. Summary. Goal: isolate and minimize service impact by hot chat
      - Understand Hot Chat: choose a solution that fits well on 0.1% hot chat
      - Detection & Handling: fetch spikes on hot chat -> hot chat detection & throttling; join spikes on hot chat -> improve MySQL bottlenecks & apply join throttling
      - Isolation: circuit breaker, bulkhead
  48. Future plans
      - More traffic, more features, more hot chat patterns
      - Reliability: hot chat adaptive, multiple thresholds; integrate hot chat throttling with other methods that reduce server-side load
      - Hot key problem: hot chat storage dynamic isolation
  49. Thank you