How LINE OpenChat Server Handles Extreme Traffic Spikes

Tech-Verse2022

November 18, 2022

Transcript

  1. (No text on this slide)

  2. Introduction
    Hot chat on LINE OpenChat
    - Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“hot chat”)
    - x100 traffic spikes in API requests / 1 min on 1 hot chat

  3. Agenda
    - Introduction to LINE OpenChat
    - How LINE OpenChat server handles traffic spikes on hot chat
      - Fetch spikes on hot chat
      - Join spikes on hot chat
    - Future plans

  4. LINE OpenChat
    Discovery Messaging
    Entrance

  5. Overview
    - API requests: 10B requests / 1 day, 10M requests / 1 minute
    - Members on 1 OpenChat: 5K+ members
    - Requests on 1 OpenChat in 1 minute: 200K requests

  6. Hot chat on LINE OpenChat
    Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“hot chat”)
    - Send message: 100 messages / 1 sec
    - API requests: 200K API requests / 1 min on 1 hot chat (x100 traffic spikes)

  7. Event delivery architecture
    1. Store (Storage)
    2. Server push (5K members)
    3. Fetch events
    1 event x 5,000 members = 5K fetch events
    Events on 1 chat: Send Message, Message Reaction, Message Mark As Read, Chat Status Update, Member Status Update, Note & Post Update

  8. How LINE OpenChat server handles
    extreme traffic spikes on hot chat

  9. Case 1) fetchEvent spikes on hot chat
    Components: LINE clients, OpenChat server, Publish server, Storages
    1. Send message
    2. Store on storages
    3. Publish event
    4. Server push
    5. Fetch events

  10. Hot chat metrics
    › Storage (MySQL, Redis) traffic spikes on 1 shard
    › Response timeout on 1 storage shard
    Charts (MySQL, Redis, OpenChat server): Chat related query / 1 min, Slow query / 1 min, Chat related request / 1 min, Response timeout / 1 min
    Annotated spikes: +200 ~ 300% (on two charts)

  11. Hot chat metrics
    › Kafka offset lag spikes on 1 partition
    › GC, CPU usage spikes on some server group
    Charts: Kafka offset lag / 1 min, Garbage collection time / 1 min, Response timeout / 1 min

  12. Hot key problem by hot chat
    Sharding by key: chatId (Shard 1, Shard 2, Shard 3)
    › On hot chat, all events go to the same shard and the same key
    Chart: API requests / 1 min on 1 hot chat

  13. Hot key problem by hot chat
    Sharding by key: chatId
    More shards: Shard 1, Shard 2, Shard 3, Shard 4, …
    More replications: Replication 1, Replication 2, Replication 3, …
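
The hot key problem on the two slides above can be illustrated with a small sketch. It assumes a plain hash-modulo placement; the deck only says "Sharding by key: chatId", so `ShardRouter`, `shardFor`, and `loadPerShard` are hypothetical names, not the OpenChat implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: routing events to shards by chatId (hash-modulo placement is an assumption). */
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Every event of one chat maps to the same shard, however many shards exist. */
    int shardFor(String chatId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(chatId.hashCode(), shardCount);
    }

    /** Counts how many events land on each shard for the given event keys. */
    Map<Integer, Integer> loadPerShard(Iterable<String> eventChatIds) {
        Map<Integer, Integer> load = new HashMap<>();
        for (String chatId : eventChatIds) {
            load.merge(shardFor(chatId), 1, Integer::sum);
        }
        return load;
    }
}
```

With 5,000 events that all carry the same chatId, `loadPerShard` reports a single loaded shard whether the router is built with 3 shards or 4, which is why adding shards alone does not relieve a hot chat.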

  14. Need to focus on hot chat
    Hot chat: 0.1%
    Normal chat: 99.9%

  15. Solution) hot chat detection & throttling
    Components: LINE clients, OpenChat server, Publish server, Storages
    Throttle too many fetch events on hot chat

  16. Hot chat detection
    Components: LINE clients, OpenChat server, Publish server, Storages
    Chart: fetch events of Chat A and Chat B against the hot chat threshold
    Set chat B as hot chat

  17. Hot chat detection
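    Cache bucket;  // per-chat rate-limit buckets, keyed by chatId

    public Completable process(kafka topic) {
        // one token is consumed per record read from the topic
        boolean consumed = bucket.get(chatId).tryConsume();
        if (!consumed) {
            // bucket exhausted: the chat crossed the hot chat threshold
            hotChatStorage.set(chatId, N seconds);
        }
        return Completable.complete();
    }
    Chart: fetch events of Chat A and Chat B against the hot chat threshold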
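
The slide's snippet is pseudocode, so here is a minimal runnable sketch of the same idea. It swaps the slide's bucket for a hand-written fixed-window counter to stay dependency-free; `HotChatDetector`, its constructor parameters, and the in-memory `hotUntil` map (standing in for `hotChatStorage`) are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Sketch of hot chat detection with a fixed-window counter per chat. */
class HotChatDetector {
    private static final class Window {
        long startMillis;
        long count;
    }

    private final long threshold;     // max events per window before a chat counts as hot
    private final long windowMillis;  // length of the detection window
    private final long hotMillis;     // how long a chat stays flagged (the slide's "N seconds")
    private final LongSupplier clock; // injectable time source, useful in tests
    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    private final Map<String, Long> hotUntil = new ConcurrentHashMap<>();

    HotChatDetector(long threshold, long windowMillis, long hotMillis, LongSupplier clock) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.hotMillis = hotMillis;
        this.clock = clock;
    }

    /** Records one event; flags the chat as hot once the window budget is exhausted. */
    void onFetchEvent(String chatId) {
        long now = clock.getAsLong();
        boolean[] consumed = {true};
        windows.compute(chatId, (id, w) -> {
            if (w == null || now - w.startMillis >= windowMillis) {
                w = new Window();      // start a fresh window
                w.startMillis = now;
            }
            w.count++;
            consumed[0] = w.count <= threshold;
            return w;
        });
        if (!consumed[0]) {
            hotUntil.put(chatId, now + hotMillis);
        }
    }

    boolean isHot(String chatId) {
        Long until = hotUntil.get(chatId);
        return until != null && clock.getAsLong() < until;
    }
}
```

A chat is flagged only by the event that exceeds the threshold, and the flag expires by itself, so a chat that cools down returns to normal handling without any cleanup call.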

  18. Hot chat throttling
    Components: LINE clients, OpenChat server, Publish server, Storages
    Publish server: check hot chat, throttle server push by X%
    Throttle too many fetch events on hot chat

  19. Hot chat throttling
    Without throttling: New event … -> server push on hot chat A (5K members) -> Shard 1
    With throttling X%: New event, New event, New event, New event … (1-second intervals) -> server push on hot chat A (5K members) -> 5K fetch events -> Shard 1
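
One plausible reading of "Throttling X%" is that the publish server skips X% of server pushes for chats flagged as hot. The deck does not say which pushes are skipped or how X is chosen, so the evenly spread skip pattern below, and the names `ServerPushThrottler` and `shouldPush`, are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/** Sketch: skip a fixed percentage of server pushes for chats flagged as hot. */
class ServerPushThrottler {
    private final Predicate<String> isHotChat; // hypothetical lookup, e.g. against hot chat storage
    private final int throttlePercent;         // the slide's "X%", 0..100
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    ServerPushThrottler(Predicate<String> isHotChat, int throttlePercent) {
        if (throttlePercent < 0 || throttlePercent > 100) {
            throw new IllegalArgumentException("throttlePercent must be within 0..100");
        }
        this.isHotChat = isHotChat;
        this.throttlePercent = throttlePercent;
    }

    /** Returns true if the push for this event should be sent to the chat's members. */
    boolean shouldPush(String chatId) {
        if (!isHotChat.test(chatId)) {
            return true; // normal chats are never throttled
        }
        long n = counters.computeIfAbsent(chatId, id -> new AtomicLong()).getAndIncrement();
        long passPercent = 100 - throttlePercent;
        // Push only when the running quota of allowed pushes advances,
        // which spreads the allowed pushes evenly instead of in bursts.
        return (n + 1) * passPercent / 100 > n * passPercent / 100;
    }
}
```

With a throttle rate of 75, every fourth event on a hot chat triggers a push. Whether clients pick up the skipped events on their next fetch is not stated in the slides and is assumed here.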

  20. Dynamic configuration
    "hotChatConfigs": {
      "enabled": true,
      "dryRun": false,
      // hot chat detection
      "durationSeconds": 10,
      "threshold": X,
      // hot chat throttling
      "expireSeconds": 30,
      "throttleRate": Y%,
      // target chat
      "chatId": { .. }
    }
    › LINE Central Dogma enables dynamic configuration without restarting the server
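
On the server side, applying such a configuration without a restart usually comes down to swapping an immutable snapshot atomically. The sketch below shows only that swap; it does not use the Central Dogma client API. `HotChatConfig`, `HotChatConfigHolder`, and the reading of `dryRun` as "detect but do not throttle" are assumptions.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Immutable snapshot mirroring the fields of the "hotChatConfigs" block. */
final class HotChatConfig {
    final boolean enabled;
    final boolean dryRun;
    final int durationSeconds;
    final long threshold;
    final int expireSeconds;
    final int throttleRatePercent;

    HotChatConfig(boolean enabled, boolean dryRun, int durationSeconds,
                  long threshold, int expireSeconds, int throttleRatePercent) {
        this.enabled = enabled;
        this.dryRun = dryRun;
        this.durationSeconds = durationSeconds;
        this.threshold = threshold;
        this.expireSeconds = expireSeconds;
        this.throttleRatePercent = throttleRatePercent;
    }
}

/** Holds the current snapshot; a config watcher would call update() on every change. */
final class HotChatConfigHolder {
    private final AtomicReference<HotChatConfig> current;

    HotChatConfigHolder(HotChatConfig initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Replaces the whole snapshot at once, so readers never see a half-applied config. */
    void update(HotChatConfig next) {
        current.set(next);
    }

    HotChatConfig get() {
        return current.get();
    }

    /** Assumed semantics: throttling applies only when enabled and not in dry-run mode. */
    boolean shouldThrottle() {
        HotChatConfig c = current.get();
        return c.enabled && !c.dryRun;
    }
}
```

Request-handling code reads `get()` once per request and works on that snapshot, so a threshold change takes effect on the next request.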

  21. Result
    › Reduce storage (MySQL, Redis) traffic spikes on 1 shard
    › Remove response timeout by hot chat
    MySQL chat related query / 1 min: +200 ~ 300% -> +50 ~ 100%
    MySQL slow query / 1 min: no slow query

  22. Result
    › Automated hot chat detection and throttling
    › Effectively control fetch events spikes on hot chat
    › No more service impact by hot chat
    Hot chat dashboard

  23. Case 2) Join spikes on hot chat

  24. Join LINE OpenChat
    Discovery Join QR code, link

  25. Join spikes on hot chat
    › Recently, max join traffic has increased significantly
    Chart: max join on 1 chat (2018, 2019, 2020, 2021, 2022), annotated with 100 join / 10 sec and 2K join / 1 sec

  26. Join spikes on hot chat
    Components: LINE client, OpenChat server, Storages
    1. Join OpenChat
    2. Store chat member on MySQL
    › Join spikes on 1 chat trigger heavy load on MySQL

  27. Join spikes metrics
    › Chat member insert query, CPU usage spikes on 1 MySQL shard
    Charts (MySQL): Insert chat member query / 1 sec (500+ join / 1 sec), Slow query / 1 min, CPU usage (80%), Load average

  28. Join spikes metrics
    › Response timeout, thread pool queued
    Charts (OpenChat server): Response timeout / 1 min, Queued thread / 1 min

  29. Improve bottlenecks on MySQL

  30. “Insert chat member” query
    Insert chat member query on MySQL (state: JOINED)
    › “INSERT … SELECT” query is “Bulk inserts”, which takes the AUTO-INC table-level lock
    › MySQL AUTO_INCREMENT handling in InnoDB
      https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html

  31. “Insert chat member” query
    Insert chat member query on MySQL (state: JOINED) -> AUTO-INC table-level lock
    Charts: MySQL QPS, AUTO-INC table lock, CPU usage (100%)
    › MySQL AUTO_INCREMENT handling in InnoDB
      https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html

  32. Change innodb_autoinc_lock_mode
    › Change innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)

  33. Change innodb_autoinc_lock_mode
    › Change innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)
    Charts: MySQL QPS, CPU usage (10 ~ 20% at 2K join / 1 sec)

  34. “Get chat member count” query
    Get chat member count query and MySQL query cache
    Chat member (chat id 100): State = LEAVED, State = JOINED, State = JOINED
    Query cache: Member count: 2

  35. “Get chat member count” query
    MySQL query cache uses table-level lock
    Chat member (chat id 100): State = LEAVED, State = JOINED, State = JOINED
    Query cache: Member count: 2
    New rows: State = JOINED, State = JOINED, …, State = JOINED
    Query cache updates: Member count: 3, Member count: 4, Member count: 5
    3 joins, 3 updates

  36. “Get chat member count” query
    MySQL query cache uses table-level lock
    Charts: MySQL QPS, MySQL query cache table lock
    Chat member (chat id 100): State = LEAVED, State = JOINED, State = JOINED
    Query cache: Member count: 2
    New rows: State = JOINED, State = JOINED, State = JOINED, …
    Query cache updates: Member count: 3, Member count: 4, Member count: 5, …
    3 joins, 3 updates

  37. “Get chat member count” query
    › Remove MySQL query cache
    › Add chat member count table
    › Time complexity of get chat member count query: O(N) -> O(log N)
    Member count table: Chat id 100, Member count 2
    Chat member table (chat id 100): State = LEAVED, State = JOINED, State = JOINED
    Query cache: Member count: 2
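
The O(N) -> O(log N) change can be modelled in memory. This is a sketch, not the actual schema: `ChatMemberStore` and its methods are hypothetical, the list stands in for the chat member table, and a `TreeMap` stands in for an indexed member count table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory model contrasting the two ways to answer "get chat member count". */
class ChatMemberStore {
    enum State { JOINED, LEAVED }

    private static final class Row {
        final long chatId;
        final long memberId;
        State state = State.JOINED;

        Row(long chatId, long memberId) {
            this.chatId = chatId;
            this.memberId = memberId;
        }
    }

    private final List<Row> chatMember = new ArrayList<>();             // "chat member" table
    private final TreeMap<Long, Integer> memberCount = new TreeMap<>(); // "member count" table

    void join(long chatId, long memberId) {
        chatMember.add(new Row(chatId, memberId));
        memberCount.merge(chatId, 1, Integer::sum); // keep the counter in step with the rows
    }

    void leave(long chatId, long memberId) {
        for (Row r : chatMember) {
            if (r.chatId == chatId && r.memberId == memberId && r.state == State.JOINED) {
                r.state = State.LEAVED;
                memberCount.merge(chatId, -1, Integer::sum);
                return;
            }
        }
    }

    /** O(N): counts JOINED rows by scanning, like counting over the member table. */
    int countByScan(long chatId) {
        int n = 0;
        for (Row r : chatMember) {
            if (r.chatId == chatId && r.state == State.JOINED) {
                n++;
            }
        }
        return n;
    }

    /** O(log N): one lookup in a sorted structure, like a primary-key read. */
    int countByTable(long chatId) {
        return memberCount.getOrDefault(chatId, 0);
    }
}
```

The trade-off is an extra counter write on every join and leave, in exchange for a read whose cost no longer grows with the number of members.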

  38. Apply join throttling to protect MySQL

  39. Join throttling
    Components: LINE clients, OpenChat server, Publish server, Storages
    Chart: joins on one chat (Chat A, Chat B) against the MySQL processing limit
    Throttle join request X% for Y seconds

  40. Join throttling can be delayed
    Kafka offset lag: delayed 1 second

  41. Delayed join throttling triggers response timeout on single MySQL shard
    2K join / 1 sec, delayed 1 second
    Charts: Response timeout, Thread pool queued

  42. Local cache for join throttling
    Components: LINE clients, OpenChat server, Publish server, Storages
    › Limit the upper bound of join spikes with a local cache, without external dependency or delay
    Local cache: limit X join / Y second on single chat
    Join throttling (can be delayed)
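
A minimal sketch of the local limit, assuming a fixed window per chat held in process memory. `LocalJoinLimiter` and `tryJoin` are hypothetical names; a real local cache would also expire idle entries, which this sketch omits.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Sketch: in-process cap of X joins per Y seconds on a single chat. */
class LocalJoinLimiter {
    private static final class Window {
        long startMillis;
        int joins;
    }

    private final int maxJoins;       // the slide's "X join"
    private final long windowMillis;  // the slide's "Y second", in milliseconds
    private final LongSupplier clock; // injectable time source, useful in tests
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    LocalJoinLimiter(int maxJoins, long windowMillis, LongSupplier clock) {
        this.maxJoins = maxJoins;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true if the join may proceed to MySQL, false if it must be rejected. */
    boolean tryJoin(String chatId) {
        long now = clock.getAsLong();
        boolean[] allowed = {false};
        windows.compute(chatId, (id, w) -> {
            if (w == null || now - w.startMillis >= windowMillis) {
                w = new Window();      // start a fresh window
                w.startMillis = now;
            }
            if (w.joins < maxJoins) {
                w.joins++;
                allowed[0] = true;
            }
            return w;
        });
        return allowed[0];
    }
}
```

Because the counter lives in each server process, the decision needs no network call and cannot lag behind. The effective ceiling across the fleet is then the per-server limit times the number of servers.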

  43. Circuit breaker and bulkhead
    › Circuit breaker, bulkhead for service impact isolation
    OpenChat server: thread pool, with one circuit breaker + bulkhead per shard (Shard 1, Shard 2, Shard 3)

  44. Circuit breaker and bulkhead
    › Circuit breaker, bulkhead for service impact isolation
    OpenChat server: thread pool, with one circuit breaker + bulkhead per shard (Shard 1, Shard 2, Shard 3)
    Failed or slow calls: OpenChat server -> Shard 1
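
The per-shard isolation can be sketched with one guard object per shard. This is a hand-written simplification rather than the library the team used (the deck does not name one): `ShardGuard` and its parameters are hypothetical, the breaker trips on consecutive failures only, and there is no half-open state.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.function.LongSupplier;

/** Sketch: one guard per storage shard combining a bulkhead and a circuit breaker. */
class ShardGuard {
    private final Semaphore bulkhead;   // caps concurrent calls to this shard
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long the circuit stays open
    private final LongSupplier clock;   // injectable time source, useful in tests
    private int consecutiveFailures;
    private long openUntil;

    ShardGuard(int maxConcurrentCalls, int failureThreshold, long openMillis, LongSupplier clock) {
        this.bulkhead = new Semaphore(maxConcurrentCalls);
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    <T> T call(Callable<T> storageCall) throws Exception {
        synchronized (this) {
            if (clock.getAsLong() < openUntil) {
                throw new IllegalStateException("circuit open"); // fail fast, skip the shard
            }
        }
        if (!bulkhead.tryAcquire()) {
            throw new IllegalStateException("bulkhead full"); // do not queue more threads
        }
        try {
            T result = storageCall.call();
            synchronized (this) {
                consecutiveFailures = 0;
            }
            return result;
        } catch (Exception e) {
            synchronized (this) {
                if (++consecutiveFailures >= failureThreshold) {
                    openUntil = clock.getAsLong() + openMillis;
                    consecutiveFailures = 0;
                }
            }
            throw e;
        } finally {
            bulkhead.release();
        }
    }
}
```

Creating a separate `ShardGuard` for each shard means failures on Shard 1 open only Shard 1's circuit, while calls to Shard 2 and Shard 3 keep their own permits and continue normally.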

  45. Result
    › No more service impact by join spikes
    CPU usage 10 ~ 20%

  46. What we’ve learned
    Understand Hot Chat
    - Hot chat triggers heavy load on storages in various patterns
    - It’s hard to predict hot chat patterns and bottlenecks
    Detection & Handling
    - Monitor requests by API, country, app type, chat
    - Prepare throttling beforehand and improve bottlenecks later
    - Local cache and dynamic configuration are effective ways to handle hot chat
    Isolation
    - Sharding is not enough; need circuit breaker and bulkhead on each storage’s shards
    - Hot chat is only 0.1%; need to choose a solution that fits well for 0.1% hot chat

  47. Summary
    Goal: Isolate, minimize service impact by hot chat
    Understand Hot Chat
    - Fetch spikes on hot chat
    - Join spikes on hot chat
    Detection & Handling
    - Hot chat detection & throttling
    - Improve MySQL bottlenecks & apply join throttling
    Isolation
    - Circuit breaker, bulkhead
    - Choose solution that fits well on 0.1% hot chat

  48. Future plans
    Reliability
    - Hot chat adaptive, multiple threshold
    - Integrate hot chat throttling with other methods that reduce server-side load
    Hot key problem
    - More traffic, more features, more hot chat patterns
    - Hot chat storage dynamic isolation

  49. Thank you
