Handling Large-Scale Traffic: “Happy New Year LINE” and “PayPay Festival”
Yuki Matsumoto / ValueCommerce
Shinji Kobayashi / Yahoo! JAPAN
Shunsuke Nakamura / LINE

Introduction of Speakers
Shunsuke Nakamura, Senior Manager, LINE
Shinji Kobayashi, Senior Manager, Yahoo! JAPAN
Yuki Matsumoto, CTO, ValueCommerce

Agenda
• Load Comparison: Peak Times vs Normal
• Load Handling and Preparation for Surviving Heavy Traffic
• Past Outage Handling / Continuous efforts for improvement
• Thoughts and ideas for future work

Load Comparison: Peak Times vs Normal (Introduction)

Regular peak traffic at LINE
LINE

Peak traffic at New Year LINE
Once a year, x5 messages/sec
3 peaks in JP/TW/TH
LINE

Peak messages/sec handled at the New Year peak, at "01-01 00:00:0x"
Couldn't accept all requests due to outages and traffic limitation/throttling
Capable of handling more message requests/sec if the service requires it
LINE

Purchase flow: Top, ItemDetail, Search, Cart, Purchase, Confirm, Complete

Comparison of normal load and event peak load
[Figure: regular traffic vs. peak traffic across Top, Search, ItemDetail, and Cart]
Yahoo! JAPAN

Comparison of normal load and event peak load
Event peak: 30x
End of event: 20x
Rapid trend change: 100x
Yahoo! JAPAN

Load Handling and Preparation for Surviving Heavy Traffic (Main Topic)

New Year preparation mode, starting 3+ months in advance
Capacity planning
⎯ Estimate target traffic with previous years' data
⎯ Inspect big changes for this year
⎯ Expand each component if needed
Improvement / Debug
⎯ Inspect the dynamic toggles used in emergencies to disable/throttle features
⎯ Optimize RPC/IO by deduplication or batching
⎯ Mitigate load with cache
⎯ Verify failure handling behavior through chaos engineering
All developers work as SREs
LINE
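
The first planning step, estimating target traffic from previous years' data and sizing components accordingly, could look roughly like the sketch below. The growth-based extrapolation, the `headroom` multiplier, and every number used are illustrative assumptions, not LINE's actual method or data.

```python
import math


def estimate_target_peak(previous_peaks, headroom=1.5):
    """Extrapolate next year's peak from the average year-over-year growth.

    previous_peaks: peak messages/sec of past years, oldest first (hypothetical data).
    headroom: safety multiplier applied on top of the extrapolation (assumed value).
    """
    if len(previous_peaks) < 2:
        raise ValueError("need at least two years of data")
    # Year-over-year growth ratios, e.g. [100, 200, 400] -> [2.0, 2.0].
    ratios = [b / a for a, b in zip(previous_peaks, previous_peaks[1:])]
    avg_growth = sum(ratios) / len(ratios)
    return previous_peaks[-1] * avg_growth * headroom


def servers_needed(target_rps, per_server_rps):
    """Servers required if one server sustains per_server_rps (e.g. from a load test)."""
    return math.ceil(target_rps / per_server_rps)
```

The per-server figure would come from load tests, which is one reason capacity planning and load testing are run together.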

Load test
Simulate the New Year burst on specified servers in prod.
→ Find the latest performance bottleneck
→ Gain deep insight from metrics and thread/heap
→ Notice unpredicted problems in advance
→ Notice missing metrics and safety nets
HOW : Adjust request weight on front-end gateway server
PIC : SRE team and server developers
LINE
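
The "adjust request weight" idea can be illustrated with weighted random routing: raising the weight of a few target servers concentrates production traffic on them. This is only a sketch under assumed names (`boost`, `pick_server`, the server names); it does not describe how LEGY or the actual gateway implements weights.

```python
import random


def boost(weights, targets, factor):
    """Return new weights with the target servers' weight multiplied by factor."""
    return {name: w * factor if name in targets else w for name, w in weights.items()}


def expected_share(weights, name):
    """Fraction of requests a server is expected to receive."""
    return weights[name] / sum(weights.values())


def pick_server(weights, rng=random):
    """Pick a server with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

With four equally weighted servers, boosting one of them by 5 moves its expected share from 1/4 to 5/8, which is the kind of multiple needed to imitate a peak on a single host.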

2 kinds of load tests
#1 gradual load increase
#2 rapid load increase
LINE
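
The two test shapes can be expressed as target-rate schedules for a load generator. The function names and the linear ramp are assumptions made for illustration; the slides do not specify the actual profiles.

```python
def gradual_profile(start_rps, end_rps, steps):
    """#1 gradual increase: linearly ramp the target rate over `steps` intervals."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    delta = (end_rps - start_rps) / (steps - 1)
    return [start_rps + delta * i for i in range(steps)]


def rapid_profile(base_rps, burst_rps, steps, burst_at):
    """#2 rapid increase: hold base_rps, then jump to burst_rps at index burst_at."""
    return [burst_rps if i >= burst_at else base_rps for i in range(steps)]
```

A gradual ramp helps locate the point where latency starts to degrade, while a step change is closer to what happens at 00:00:00 on New Year's Day.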

New Year preparation for messaging storage
Capacity planning for Redis and HBase clusters
TTL tuning to balance cache/storage usage for the New Year workload
Load simulation with replayer based on RPC/transaction log and Kafka (*1)
High load simulation with HBase Region movement
[Diagram: LINE Apps (iOS/Android), L4 load balancer + LEGY (front-end), talk-server (back-end), Redis, HBase (*1)]
LINE
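
A log-based replayer essentially turns recorded timestamps into a send schedule, optionally compressed in time to raise the load. The sketch below shows only that scheduling step; the `(timestamp, payload)` entry format and the `speedup` parameter are hypothetical, and reading from Kafka or the real transaction log is out of scope.

```python
def replay_schedule(entries, speedup=1.0):
    """Return (delay_from_start_seconds, payload) pairs for a recorded log.

    entries: list of (timestamp_seconds, payload), sorted by timestamp (assumed format).
    speedup: values above 1 compress the gaps between requests, increasing the rate.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not entries:
        return []
    first_ts = entries[0][0]
    return [((ts - first_ts) / speedup, payload) for ts, payload in entries]
```

Replaying a normal day's log with `speedup=5.0` would, for example, approximate a 5x request rate while keeping the recorded mix of operations.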

Chaos engineering
Confirm the recent behavior of failure handling in prod on schedule
→ Check that behavior does not differ from the prior test
→ Evaluate resiliency by measuring MTTR
→ Make sure there are no unexpected side effects
→ Confirm the reporting line on failure
HOW : `kill` / `iptables -j DROP` master server/storage node
PIC : server developers
LINE
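
Measuring MTTR from a series of fault injections reduces to averaging the time between each failure and its recovery. The sketch assumes the experiment records `(failed_at, recovered_at)` timestamps in seconds; that record format is an assumption, not something stated in the slides.

```python
def mttr(incidents):
    """Mean time to recovery, in seconds, from (failed_at, recovered_at) pairs."""
    if not incidents:
        raise ValueError("no incidents recorded")
    durations = [recovered - failed for failed, recovered in incidents]
    if any(d < 0 for d in durations):
        raise ValueError("recovery time precedes failure time")
    return sum(durations) / len(durations)
```

Comparing this value against the previous run is one way to check that failure handling has not regressed.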

Load testing and RateLimit settings
[Diagram: prepare app (crawls the FE from Top and saves request/response), load test app / attack app (x100~), micro service apps]
Yahoo! JAPAN

Preparing for a full service outage with two kinds of load tests and RateLimit
Full system load test
⎯ Test simulated traffic from FE to BE according to the load estimate of the event
⎯ Issue test orders to load DB writes on the order path
⎯ Run the large-scale test while services are suspended
Automatic load test / Constant load test
⎯ Since the event changed to generating a certain amount of load every Sunday (it has since changed again), tests are set to run automatically every week at midnight
⎯ Run at a certain scale without service suspension
RateLimit towards overload traffic
⎯ Set a RateLimit in advance for major applications based on the performance limits clarified by the load tests
⎯ Prevent all users from being affected even if the worst case happens
Yahoo! JAPAN
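
A RateLimit derived from a measured performance limit is often implemented as a token bucket. The slides do not say which algorithm is used, so the class below is a minimal sketch under that assumption; `rate` would be set from the limit found in the load test, and the injectable `clock` exists only to make the example testable.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        # Refill according to elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject instead of overloading the backend
```

Rejecting the excess early is what keeps the remaining users served when traffic exceeds the tested limit.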

Load Handling and Preparation for Surviving Heavy Traffic (Main Topic): Discussion

Past Outage Handling / Continuous efforts for improvement (Main Topic)

PRIDE of developers who own LINE Messaging Platform
Allow an unpredicted outage once, but make sure to prevent recurrence next year
Always develop reliable product at LINE scale (x3-5 load)
LINE

Previous outage handling 1/2
Storage write delay in HBase (*1)
⎯ Redis's async job queue of writes to HBase accumulated
⎯ Purged the queue and recovered with MapReduce within midnight
Push notification delay (*2)
⎯ Thread pool for push notification got full
⎯ Purged the queue in order to be able to accept the next burst
(*1) (*2) messaging-servers-prepare-for-new-year-traffic/
LINE

Message Box
[Diagram: Client (LINE app) with local DB; Server with talk-server and Message Box (HBase)]
⎯ Active users: incremental-sync between talk-server and local DB
⎯ Inactive users or initialization on login: full-sync between Message Box (HBase) and local DB
LINE

Outage prevention against MessageBox write delay
Root cause :
⎯ Worker threads got stuck due to heavy atomic sequence computation and frequent minor GC on HBase
Resolution :
⎯ Redesigned a lightweight schema on write
⎯ Replaced storage during the next year
ASIS: Send Message (write) touches indexes, metadata, sequence, and Unread Count with two atomic increments and two puts; Fetch (read); frequent minor GC
TOBE (V2 storage): Send Message (write) is a single put; Fetch (read) computes unreadCount on read
Migrated within 1 year
LINE
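
The move from "atomic increment on write" to "compute unreadCount on read" can be illustrated with an in-memory stand-in. This is not the actual V2 schema: the class name, the per-chat lists, and the assumption that message IDs increase monotonically are all hypothetical.

```python
class MessageBoxV2:
    """In-memory stand-in for the TOBE design: one append per send, no counters."""

    def __init__(self):
        self._messages = {}   # chat_id -> list of (message_id, text) in arrival order
        self._last_read = {}  # chat_id -> last message_id the user has read

    def send(self, chat_id, message_id, text):
        # Single put: no atomic increment of a sequence or an unread counter.
        self._messages.setdefault(chat_id, []).append((message_id, text))

    def mark_read(self, chat_id, message_id):
        self._last_read[chat_id] = message_id

    def unread_count(self, chat_id):
        # Computed on read: count messages newer than the read marker.
        # Assumes message IDs are monotonically increasing.
        messages = self._messages.get(chat_id, [])
        last = self._last_read.get(chat_id)
        if last is None:
            return len(messages)
        return sum(1 for mid, _ in messages if mid > last)
```

The write path becomes cheaper at the cost of extra work on read, which is a reasonable trade when the outage was caused by contention on writes.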

Duplicate orders caused by order DB write delay
Before: FE App (Cart, Order History), Order PF, Order DB (Oracle); the async write is delayed, so the order can't be read
After: FE App (Cart, Order History), Order PF, sync write to Order DB (Cassandra), async write to Order DB (Oracle)
Yahoo! JAPAN

Previous outage handling 2/2
Storage write delay in HBase (*1)
⎯ Redis's async job queue of writes to HBase accumulated
⎯ Purged the queue and recovered with MapReduce within midnight
Push notification delay (*2)
⎯ Thread pool for push notification got full
⎯ Purged the queue in order to be able to accept the next burst
(*1) (*2) messaging-servers-prepare-for-new-year-traffic/
LINE

Outage prevention against push notification delay
Root cause :
⎯ More connections by async HTTP client caused full GC in Armeria server
Resolution :
⎯ Make connections reusable and fix the number of connections
⎯ Isolate the outage impact by queue separation and circuit breaker
⎯ Enhance load test patterns for better prediction
[Diagram: talk-server, NotificationService (Armeria, JVM), push-server, external push services (APNS/GCM)]
#1 response delay, #2 threadPool full, #3 many HTTP connections, #4 memory pressure, #5 full GC
LINE
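
The circuit breaker named in the resolution can be sketched generically as follows. This is not Armeria's circuit breaker API; the class, its thresholds, and the `CircuitOpenError` exception are hypothetical and show only the basic closed, open, and half-open behavior.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_timeout, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream is failing; request rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the push services are slow keeps threads and connections from piling up, which is the chain (#1 to #5) that led to the full GC.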

Effect after outage prevention against push notification delay
Queue size stayed under the limit (20K)
Circuit breaker worked effectively
LINE

Smartphone app can't handle sudden BE failures
Before: Phone APP, FE App Application BFF, Backend API; when the Backend API can't be read, the phone app's behavior can't be changed
After: fault info is written to a JSON file on the CDN; the phone app reads it and backs off
Yahoo! JAPAN
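
On the client side, the "After" design amounts to checking the fault file and spacing out retries. The sketch below assumes a JSON document with a boolean `fault` field and uses exponential backoff with full jitter; both the field name and the backoff policy are assumptions, since the slide shows only "Write fault info" and "Backoff".

```python
import json
import random


def should_back_off(fault_json):
    """Return True if the fault file says the backend is degraded.

    The "fault" field name is hypothetical.
    """
    info = json.loads(fault_json)
    return bool(info.get("fault", False))


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Serving the fault file from a CDN means the app can learn about a backend failure without adding load to the failing backend itself.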

Thoughts and ideas for future work (Closing)

Work on New Year's Eve
12/31 18:00 ~ 21:00: Sit in front of PC and open dashboard; connect to Zoom
12/31 21:00 ~ 23:55: Disable toggle for optional feature
1/1 00:00 ~ 00:10: Monitor JP New Year and report status periodically; make an effort to resolve/mitigate unexpected outages by 1:00 !!
1/1 01:00 ~ 01:10: TW New Year
1/1 02:00 ~ 02:10: TH New Year
1/x (first working day): Retrospective meeting
All developers work as SREs for a common objective
LINE

Enhance RateLimit function
[Diagram: RateLimit, Function A, Function B, BE]
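
One possible reading of this slide is a RateLimit applied per function rather than per application, so that a burst on Function A does not consume Function B's share of the BE. That reading is an assumption, and the fixed-window limiter below is only a sketch with hypothetical names.

```python
class PerFunctionLimiter:
    """Fixed-window limiter with an independent budget for each function."""

    def __init__(self, limits):
        self.limits = dict(limits)  # function name -> max requests per window
        self.counts = {name: 0 for name in self.limits}

    def allow(self, function):
        if self.counts[function] >= self.limits[function]:
            return False
        self.counts[function] += 1
        return True

    def reset_window(self):
        # Called by a timer at the start of each window (timer not shown).
        for name in self.counts:
            self.counts[name] = 0
```

With separate budgets, throttling an overloaded function leaves the others unaffected.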