Handling Large-Scale Traffic: “Happy New Year LINE” and “PayPay Festival”

Shunsuke Nakamura (LINE / Messaging Platform Engineering department 2 / Senior Engineering Manager)
Shinji Kobayashi (Yahoo! JAPAN / Production Division 2, Shopping Services Group, Commerce Group / Senior Manager)
Yuki Matsumoto (ValueCommerce / Technology Development Division / CTO)

https://tech-verse.me/ja/sessions/295
https://tech-verse.me/en/sessions/295
https://tech-verse.me/ko/sessions/295

Tech-Verse2022

November 18, 2022

Transcript

  1. Handling Large-Scale Traffic: “Happy New Year LINE” and “PayPay Festival”

    Yuki Matsumoto / ValueCommerce Shinji Kobayashi / Yahoo! JAPAN Shunsuke Nakamura / LINE
  2. Introduction of Speakers
     Shunsuke Nakamura (LINE, Senior Manager), Shinji Kobayashi (Yahoo! JAPAN, Senior Manager), Yuki Matsumoto (ValueCommerce, CTO)
  3. Agenda
     • Load Comparison: Peak Times vs. Normal
     • Load Handling and Preparation for Surviving Heavy Traffic
     • Past Outage Handling / Continuous Efforts for Improvement
     • Thoughts and Ideas for Future Work
  4. Peak traffic at "Happy New Year LINE" (あけおめLINE)
     Once a year: x5 messages/sec, with 3 peaks across JP/TW/TH. (LINE)
  5. Peak messages/sec handled at "01-01 00:00:0x" (あけおめピークでさばいた瞬間メッセージ数)
     Couldn't accept all requests due to outages and traffic limitation/throttling; capable of handling more message requests/sec if the service requires it. (LINE)
  6. Comparison of normal load and event peak load
     ⎯ Event peak: 30x
     ⎯ End of event: 20x
     ⎯ Rapid trend change: 100x (Yahoo! JAPAN)
  7. New Year preparation that starts 3+ months in advance (あけおめ準備モード)
     Capacity planning
     ⎯ Estimate target traffic from previous years' data
     ⎯ Inspect big changes for this year
     ⎯ Expand each component if needed
     Improvement / Debug
     ⎯ Inspect dynamic toggles that can disable/throttle features in an emergency (see the sketch below)
     ⎯ Optimize RPC/IO by deduplication or batching
     ⎯ Mitigate load with caches
     ⎯ Verify failure-handling behavior through chaos engineering
     All devs work as SRE. (LINE)
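As an illustration of the dynamic-toggle idea above, here is a minimal sketch of a feature flag that can be flipped at runtime to shed optional work under load. The class and flag names are hypothetical; the slide does not show LINE's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal dynamic-toggle sketch: optional features consult a shared flag store
// so operators can disable or throttle them during the New Year burst.
public final class FeatureToggles {
    private static final Map<String, Boolean> FLAGS = new ConcurrentHashMap<>();

    // Called by an admin endpoint or config watcher when load gets dangerous.
    public static void set(String feature, boolean enabled) {
        FLAGS.put(feature, enabled);
    }

    public static boolean isEnabled(String feature) {
        return FLAGS.getOrDefault(feature, true);
    }
}

// Usage at a call site for an optional feature (feature name is hypothetical):
//   if (FeatureToggles.isEnabled("timeline-preview")) { renderPreview(); }
```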
  8. Load test (負荷テスト)
     Simulate New Year bursting on specified servers in production.
     → Find the latest performance bottleneck
     → Gain deep insight from metrics and thread/heap dumps
     → Notice unpredicted problems in advance
     → Notice missing metrics and safety nets
     HOW: Adjust request weight on the front-end gateway server (see the sketch below)
     PIC: SRE team and server developers (LINE)
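A minimal sketch of the "adjust request weight" idea: the gateway picks an upstream with a weighted random choice, so raising one server's weight concentrates real production traffic on it for the in-prod load test. The names and weights are illustrative, not LINE's actual gateway (LEGY) configuration.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Weighted upstream selection: increasing one server's weight funnels a larger
// share of real traffic to it, simulating New Year burst levels on that host.
public final class WeightedRouter {
    public record Upstream(String host, int weight) {}

    private final List<Upstream> upstreams;
    private final int totalWeight;

    public WeightedRouter(List<Upstream> upstreams) {
        this.upstreams = upstreams;
        this.totalWeight = upstreams.stream().mapToInt(Upstream::weight).sum();
    }

    public Upstream pick() {
        int r = ThreadLocalRandom.current().nextInt(totalWeight);
        for (Upstream u : upstreams) {
            r -= u.weight();
            if (r < 0) {
                return u;
            }
        }
        throw new IllegalStateException("unreachable");
    }
}

// Example: give one host 5x the share of the others for the test window.
//   new WeightedRouter(List.of(new Upstream("talk-server-1", 50),
//                              new Upstream("talk-server-2", 10),
//                              new Upstream("talk-server-3", 10)));
```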
  9. New Year preparation for Messaging storage (メッセージングストレージの新年準備)
     ⎯ Capacity planning for Redis and HBase clusters
     ⎯ TTL tuning to balance cache/storage usage for the New Year workload
     ⎯ Load simulation with a replayer based on RPC/transaction logs and Kafka (*1) (see the sketch below)
     ⎯ High-load simulation with HBase Region movement
     Architecture: LINE apps (iOS/Android) → L4 load balancer + LEGY (front-end) → talk-server (back-end) → Redis / HBase
     (*1) https://www.slideshare.net/linecorp/hbase-at-line-2017 (LINE)
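A minimal sketch in the spirit of the replayer bullet above: it consumes previously recorded request logs from a Kafka topic and re-issues them against the target path. The topic name, group id, and replay callback are assumptions; the actual replayer referenced in (*1) is not shown on the slide.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.function.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Replays recorded RPC/transaction log entries from Kafka against the target
// system to reproduce New Year-like write traffic.
public final class LogReplayer {
    public static void replay(String bootstrapServers, String topic,
                              Consumer<String> replayAction) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", "newyear-replayer");          // hypothetical group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of(topic));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    replayAction.accept(record.value()); // re-issue the recorded request
                }
            }
        }
    }
}
```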
  10. Chaos engineering (カオスエンジニアリング)
     Confirm the current failure-handling behavior in production on a schedule.
     → Check that there is no behavior diff from prior tests
     → Evaluate resiliency by measuring MTTR (see the sketch below)
     → Surface unexpected side effects
     → Confirm the reporting line on failure
     HOW: `kill` / `iptables -j DROP` against a master server/storage node
     PIC: server developers (LINE)
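A minimal sketch of the MTTR measurement idea: poll a health endpoint while the fault is injected and record how long the component stays unhealthy. The endpoint URL and polling interval are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Polls a health endpoint during a chaos experiment and reports how long the
// target stayed unhealthy (a rough proxy for MTTR).
public final class MttrProbe {
    public static Duration measure(String healthUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(healthUrl))
                .timeout(Duration.ofSeconds(2))
                .build();

        Instant firstFailure = null;
        while (true) {
            boolean healthy;
            try {
                HttpResponse<Void> res =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                healthy = res.statusCode() == 200;
            } catch (Exception e) {
                healthy = false;
            }
            if (!healthy && firstFailure == null) {
                firstFailure = Instant.now();            // outage starts
            } else if (healthy && firstFailure != null) {
                return Duration.between(firstFailure, Instant.now()); // recovered
            }
            Thread.sleep(1000);
        }
    }
}
```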
  11. Load testing and RateLimit configuration (負荷試験とRateLimitの設定)
     [Diagram] A prepare app crawls the FE and saves request/response pairs; a load-test app drives ~100 attack apps against the microservice apps. (Yahoo! JAPAN)
  12. Two kinds of load tests and RateLimit as preparation against a full service outage (2種の負荷試験とRateLimitによるサービス全断への備え)
     Full system load test
     ⎯ Test simulated traffic from FE to BE according to the load estimate for the event
     ⎯ Issue test orders to load DB writes on the order path
     ⎯ Run large-scale tests while services are suspended
     Automatic load test / Constant load test
     ⎯ Changed to generating a certain amount of load every Sunday (since changed again); tests run automatically every week at midnight
     ⎯ Run at a certain scale without service suspension
     RateLimit against overload traffic
     ⎯ Set a RateLimit in advance for major applications, based on the performance limits clarified by the load tests (see the sketch below)
     ⎯ Prevent all users from being affected even if the worst case happens (Yahoo! JAPAN)
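A minimal sketch of the pre-configured RateLimit idea, using Guava's RateLimiter as a stand-in; the slide does not name the limiter Yahoo! JAPAN actually uses, and the 500 req/s value is an arbitrary illustration of a limit derived from load-test results.

```java
import com.google.common.util.concurrent.RateLimiter;

// Per-application rate limit configured ahead of the event from load-test results.
// Requests beyond the limit are rejected quickly instead of overloading the backend.
public final class OrderApiGate {
    // Hypothetical limit; in practice this comes from the measured performance ceiling.
    private static final RateLimiter LIMITER = RateLimiter.create(500.0); // 500 req/s

    public static boolean tryHandle(Runnable handler) {
        if (!LIMITER.tryAcquire()) {
            return false; // shed load: respond with 429 / fallback instead of queueing
        }
        handler.run();
        return true;
    }
}
```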
  13. PRIDE of the developers who own the LINE Messaging Platform (開発者のプライド)
     Allow an unpredicted outage once, but make sure to prevent its recurrence the next year.
     Always develop a reliable product at LINE scale (3-5x load). (LINE)
  14. Previous outage handling 1/2 (過去に発生した障害対応 1/2)
     Storage write delay in HBase (*1): Redis's async job queue for writes to HBase accumulated → purged the queue and recovered with MapReduce by midnight.
     Push notification delay (*2): the thread pool for push notifications got full → purged the queue in order to accept the next burst.
     (*1) https://www.slideshare.net/linecorp/a-5-47983106
     (*2) https://engineering.linecorp.com/ja/blog/how-line-messaging-servers-prepare-for-new-year-traffic/ (LINE)
  15. Message Box
     [Diagram] Client (LINE app, local DB) ⇄ Server (talk-server, Message Box on HBase): active users sync their local DB via incremental-sync against talk-server; inactive users, or initialization on login, use full-sync from the Message Box (HBase). (See the sketch below.)
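A minimal sketch of the sync decision in the diagram: active clients fetch only the delta (incremental-sync), while inactive clients or a fresh login fall back to a full-sync from the Message Box. The type names and threshold logic are hypothetical; the actual talk-server protocol is not shown on the slide.

```java
// Chooses between incremental-sync and full-sync, mirroring the diagram above.
// All types and thresholds here are illustrative, not the real talk-server API.
public final class MessageSyncPlanner {
    public enum Strategy { INCREMENTAL_SYNC, FULL_SYNC }

    /**
     * @param isFreshLogin     true when the app was just installed or re-logged in
     * @param lastSyncedOffset last message offset the client's local DB has seen
     * @param oldestRetained   oldest offset still retained on the hot path
     */
    public static Strategy choose(boolean isFreshLogin,
                                  long lastSyncedOffset,
                                  long oldestRetained) {
        // Fresh logins, or clients that fell behind the hot retention window,
        // must rebuild their local DB from the Message Box (HBase).
        if (isFreshLogin || lastSyncedOffset < oldestRetained) {
            return Strategy.FULL_SYNC;
        }
        // Active clients only pull the messages they are missing.
        return Strategy.INCREMENTAL_SYNC;
    }
}
```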
  16. Outage prevention against MessageBox write delay (MessageBox書き込み遅延に対する改善)
     Root cause:
     ⎯ Worker threads got stuck due to heavy atomic sequence computation and frequent minor GC on HBase
     Resolution:
     ⎯ Redesigned a lightweight write schema (see the sketch below)
     ⎯ Replaced the storage during the next year (migrated within 1 year)
     [Diagram] AS-IS: Send Message (write) performs atomic increments for the sequence and unread count plus puts for indexes/metadata, causing frequent minor GC. TO-BE (V2 storage): Send Message (write) is a single put, and unreadCount is computed on read. (LINE)
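A minimal HBase-client sketch of the AS-IS vs. TO-BE write paths above: the old path does per-message atomic increments (sequence, unread count) plus puts, while the V2 path writes the message with a single put and derives unreadCount at read time. Table, column family, and qualifier names are made up for illustration.

```java
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class MessageBoxWriteSketch {
    private static final byte[] CF = Bytes.toBytes("m"); // hypothetical column family

    // AS-IS: atomic increments plus puts per message -> hot rows and GC pressure.
    static void writeAsIs(Table table, byte[] chatRow, byte[] msgRow, byte[] body)
            throws java.io.IOException {
        Increment inc = new Increment(chatRow);
        inc.addColumn(CF, Bytes.toBytes("sequence"), 1L);
        inc.addColumn(CF, Bytes.toBytes("unreadCount"), 1L);
        table.increment(inc);                                             // atomic increments
        table.put(new Put(msgRow).addColumn(CF, Bytes.toBytes("body"), body)); // + put
    }

    // TO-BE (V2): a single put per message; unreadCount is computed on read.
    static void writeV2(Table table, byte[] msgRow, byte[] body)
            throws java.io.IOException {
        table.put(new Put(msgRow).addColumn(CF, Bytes.toBytes("body"), body));
    }
}
```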
  17. Duplicate orders caused by Order DB write delay (注文DB書き込み遅延による二重注文発生)
     [Diagram] Before: FE App → Cart → Order PF writes the order asynchronously to the Order DB (Oracle); when the write is delayed, Order History can't read the order, which led to duplicate orders. After: the Order PF also writes synchronously to an Order DB on Cassandra, while keeping the async write to Oracle. (See the sketch below.) (Yahoo! JAPAN)
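A minimal sketch of the "After" write path, assuming the DataStax Java driver: the order is written synchronously to Cassandra so Order History can read it immediately, while the existing async write to Oracle stays on its own pipeline. Keyspace, table, and column names are hypothetical.

```java
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

// Synchronous write of the order to Cassandra on the order path; the read side
// (Order History) can see it right away, closing the duplicate-order window.
public final class OrderWriter {
    private final CqlSession session;
    private final PreparedStatement insertOrder;

    public OrderWriter() {
        this.session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("cassandra-1.example.com", 9042))
                .withLocalDatacenter("dc1")            // hypothetical datacenter name
                .withKeyspace("order_pf")              // hypothetical keyspace
                .build();
        this.insertOrder = session.prepare(
                "INSERT INTO orders (order_id, user_id, payload) VALUES (?, ?, ?)");
    }

    public void writeSync(String orderId, String userId, String payload) {
        // execute() blocks until the write is acknowledged (sync write).
        session.execute(insertOrder.bind(orderId, userId, payload));
    }
}
```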
  18. Previous outage handling 2/2 (過去に発生した障害対応 2/2)
     Storage write delay in HBase (*1): Redis's async job queue for writes to HBase accumulated → purged the queue and recovered with MapReduce by midnight.
     Push notification delay (*2): the thread pool for push notifications got full → purged the queue in order to accept the next burst.
     (*1) https://www.slideshare.net/linecorp/a-5-47983106
     (*2) https://engineering.linecorp.com/ja/blog/how-line-messaging-servers-prepare-for-new-year-traffic/ (LINE)
  19. Outage prevention against push notification delay (プッシュ通知遅延障害に対する改善)
     Root cause:
     ⎯ Too many connections opened by the async HTTP client caused full GC in the Armeria server
     Resolution:
     ⎯ Make connections reusable, with a fixed number of connections
     ⎯ Isolate the outage impact with queue separation and a circuit breaker (see the sketch below)
     ⎯ Enhance load test patterns for better prediction
     [Diagram] talk-server → push-server → NotificationService (Armeria) → external push services (APNS/GCM): #1 response delay → #2 thread pool full → #3 many HTTP connections → #4 memory pressure → #5 Full GC (LINE)
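A minimal Armeria circuit-breaker sketch for the "isolate the outage impact" bullet: a WebClient to the push gateway is decorated with a circuit breaker that opens on server errors, so a slow external push service stops consuming NotificationService resources. The endpoint URL is hypothetical, and the connection-pool tuning from the slide is not shown here.

```java
import com.linecorp.armeria.client.WebClient;
import com.linecorp.armeria.client.circuitbreaker.CircuitBreaker;
import com.linecorp.armeria.client.circuitbreaker.CircuitBreakerClient;
import com.linecorp.armeria.client.circuitbreaker.CircuitBreakerRule;

public final class PushClientSketch {
    public static WebClient buildPushClient() {
        // Hypothetical endpoint; the real push gateway address is not on the slide.
        CircuitBreaker breaker = CircuitBreaker.of("push-gateway");
        return WebClient.builder("https://push-gateway.example.com")
                // Open the breaker when the push path keeps returning server errors,
                // so failures are contained instead of piling up connections.
                .decorator(CircuitBreakerClient.newDecorator(
                        breaker, CircuitBreakerRule.onServerErrorStatus()))
                .build();
    }
}
```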
  20. Smartphone apps can't react to sudden backend failures (急なBEの障害にスマホアプリが対応できない)
     [Diagram] Before: Phone app → Application/BFF (FE) → Backend API; when the backend fails the app can't read, and the app itself can't be changed quickly. After: fault info is written to a JSON file on a CDN, and the phone app reads it and backs off. (See the sketch below.) (Yahoo! JAPAN)
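A minimal sketch of the "After" flow, written in Java for consistency with the other sketches (the real client would be the iOS/Android app): the client fetches a small status JSON from the CDN before calling the backend and backs off when a fault is flagged. The URL and JSON field are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Client-side backoff driven by a static fault-info file served from the CDN.
public final class FaultInfoGate {
    private static final URI STATUS_URL =
            URI.create("https://cdn.example.com/status/shopping.json"); // hypothetical

    private final HttpClient client = HttpClient.newHttpClient();

    /** Returns true when the backend is flagged as down and the app should back off. */
    public boolean shouldBackOff() {
        try {
            HttpRequest req = HttpRequest.newBuilder(STATUS_URL)
                    .timeout(Duration.ofSeconds(2))
                    .build();
            String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            // Naive check of a hypothetical {"backendDown": true} flag;
            // a real client would use a proper JSON parser.
            return body.contains("\"backendDown\":true")
                    || body.contains("\"backendDown\": true");
        } catch (Exception e) {
            return false; // if the status file itself is unreachable, don't block the user
        }
    }
}
```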
  21. New Year's Eve operations (大晦日業務)
     12/31 18:00 ~ 21:00: Sit in front of the PC, open dashboards, connect to Zoom
     12/31 21:00 ~ 23:55: Disable toggles for optional features
     1/1 00:00 ~ 00:10: Monitor the JP New Year and report status periodically; work to resolve/mitigate unexpected outages by 01:00!!
     1/1 01:00 ~ 01:10: TW New Year
     1/1 02:00 ~ 02:10: TH New Year
     1/x (first working day): Retrospective meeting
     End-of-year work: all devs work as SRE toward a common objective. (LINE)