Slide 1

Slide 1 text

Handling Large-Scale Traffic: “Happy New Year LINE” and “PayPay Festival”
Yuki Matsumoto / ValueCommerce
Shinji Kobayashi / Yahoo! JAPAN
Shunsuke Nakamura / LINE

Slide 2

Slide 2 text

Introduction of Speakers
• Shunsuke Nakamura, Senior Manager, LINE
• Shinji Kobayashi, Senior Manager, Yahoo! JAPAN
• Yuki Matsumoto, CTO, ValueCommerce

Slide 3

Slide 3 text

Agenda
• Load Comparison: Peak Times vs Normal
• Load Handling and Preparation for Surviving Heavy Traffic
• Past Outage Handling / Continuous efforts for improvement
• Thoughts and ideas for future work

Slide 4

Slide 4 text

Introduction
Load Comparison: Peak Times vs Normal

Slide 5

Slide 5 text

Regular peak traffic at LINE
LINE

Slide 6

Slide 6 text

Peak traffic at New Year LINE ("Happy New Year LINE")
Once a year
x5 messages/sec
3 peaks in JP/TW/TH
LINE

Slide 7

Slide 7 text

Peak messages/sec handled at "01-01 00:00:0x"
Couldn't accept all requests due to outages and traffic limiting/throttling
Capable of handling more message requests/sec if the service requires it
LINE

Slide 8

Slide 8 text

Purchase flow: Top, ItemDetail, Search, Cart, Purchase Confirm, Complete

Slide 9

Slide 9 text

Comparison of normal load and event peak load
Regular traffic: ItemDetail, Cart, Search, Top
Peak traffic: ItemDetail, Cart, Search, Top
Yahoo! JAPAN

Slide 10

Slide 10 text

Comparison of normal load and event peak load
Event peak: 30x
End of event: 20x
Rapid trend change: 100x
Yahoo! JAPAN

Slide 11

Slide 11 text

Main Topic
Load Handling and Preparation for Surviving Heavy Traffic

Slide 12

Slide 12 text

New Year preparation mode, starting 3+ months in advance
Capacity planning
⎯ Estimate target traffic from previous years' data
⎯ Inspect big changes for this year
⎯ Expand each component if needed
Improvement / Debug
⎯ Inspect the dynamic toggles that disable or throttle features in an emergency
⎯ Optimize RPC/IO by deduplication or batching
⎯ Mitigate load with caching
⎯ Verify failure handling behavior through chaos engineering
All developers work as SREs
LINE
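The capacity-planning step (estimate from previous years, then expand if needed) can be sketched as simple arithmetic. This is only an illustration: the growth model, the `headroom` factor, and all numbers are assumptions, not figures from the talk.

```python
import math


def estimate_target_rps(previous_peaks, headroom=1.5):
    """Project the next peak from the last year-over-year growth, plus headroom.

    previous_peaks: peak requests/sec of past years, oldest first (hypothetical data).
    headroom: safety multiplier on top of the projection (an assumption).
    """
    if len(previous_peaks) < 2:
        raise ValueError("need at least two years of data")
    growth = previous_peaks[-1] / previous_peaks[-2]
    return previous_peaks[-1] * growth * headroom


def servers_needed(target_rps, per_server_rps):
    """Number of servers required if each one sustains per_server_rps."""
    return math.ceil(target_rps / per_server_rps)


if __name__ == "__main__":
    target = estimate_target_rps([100_000, 120_000])
    print(target, servers_needed(target, per_server_rps=5_000))
```

The per-server limit would come from a load test rather than from a guess.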

Slide 13

Slide 13 text

Load test
Simulate the New Year burst on specified servers in prod.
→ Find the latest performance bottleneck
→ Gain deep insight from metrics and thread/heap
→ Notice unpredicted problems in advance
→ Notice missing metrics and safety nets
HOW : Adjust request weight on the front-end gateway server
PIC : SRE team and server developers
LINE
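One way to reason about "adjust request weight" is to ask which routing weight gives a few target servers a chosen multiple of their normal load while total traffic stays the same. The sketch below assumes plain weighted routing where every non-target server keeps weight 1. The function names are hypothetical and this is not the LEGY configuration.

```python
def traffic_share(weights):
    """Fraction of total traffic each server receives under weighted routing."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def weight_for_multiplier(n_servers, n_targets, multiplier):
    """Weight for each target server so it receives `multiplier` times its
    normal (1 / n_servers) share, with all other servers at weight 1.

    Derived from w / (k*w + (n - k)) = m / n, solved for w.
    """
    denom = n_servers - multiplier * n_targets
    if denom <= 0:
        raise ValueError("targets cannot absorb that much traffic")
    return multiplier * (n_servers - n_targets) / denom


if __name__ == "__main__":
    w = weight_for_multiplier(n_servers=100, n_targets=5, multiplier=5)
    weights = {f"s{i}": (w if i < 5 else 1.0) for i in range(100)}
    print(round(w, 3), round(traffic_share(weights)["s0"], 4))
```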

Slide 14

Slide 14 text

2 kinds of load tests
#1 gradual load increase
#2 rapid load increase
LINE
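The two test shapes can be written down as load profiles, one value per time step. This is a minimal sketch; the step counts and load values are placeholders, not the ones used at LINE.

```python
def gradual_profile(base, peak, steps):
    """#1: linear ramp from base to peak over `steps` equal increments."""
    return [base + (peak - base) * i / steps for i in range(steps + 1)]


def rapid_profile(base, peak, steps, burst_at):
    """#2: stay at base, then jump straight to peak at step `burst_at`."""
    return [peak if i >= burst_at else base for i in range(steps + 1)]


if __name__ == "__main__":
    print(gradual_profile(100, 500, 4))
    print(rapid_profile(100, 500, 4, burst_at=2))
```

A gradual ramp helps locate the point where latency starts to degrade, while a sudden jump exercises queues, pools, and autoscaling under a burst.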

Slide 15

Slide 15 text

New Year preparation for Messaging storage
Capacity planning for Redis and HBase clusters
TTL tuning to balance cache/storage usage for the New Year workload
Load simulation with a replayer based on RPC/transaction logs and Kafka (*1)
High load simulation with HBase Region movement
Architecture: LINE Apps (iOS/Android) → L4 load balancer + LEGY (front-end) → talk-server (back-end) → Redis, HBase
(*1) https://www.slideshare.net/linecorp/hbase-at-line-2017
LINE
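The idea of a log-based replayer can be illustrated by compressing recorded timestamps so the same requests arrive faster. The sketch below only computes the replay schedule. The `speedup` parameter and the sample timestamps are assumptions, and reading the real logs or Kafka topics is out of scope here.

```python
def replay_schedule(timestamps, speedup):
    """Offsets (seconds from replay start) at which to re-send logged requests.

    timestamps: recorded request times in seconds, ascending (hypothetical log).
    speedup: a value above 1 compresses time, which raises the replayed rate.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not timestamps:
        return []
    start = timestamps[0]
    return [(t - start) / speedup for t in timestamps]


if __name__ == "__main__":
    print(replay_schedule([10, 12, 16], speedup=2))
```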

Slide 16

Slide 16 text

Chaos engineering
Confirm the recent behavior of failure handling in prod on schedule
→ Check that behavior has not changed since the prior test
→ Evaluate resiliency by measuring MTTR
→ Make sure there are no unexpected side effects
→ Confirm the reporting line on failure
HOW : `kill` / `iptables -j DROP` on master server/storage node
PIC : server developers
LINE
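MTTR here is the mean time between injecting a failure and observing recovery. A minimal sketch of that calculation follows, assuming each drill is recorded as a (failure time, recovery time) pair in seconds. The record format is hypothetical.

```python
def mttr_seconds(incidents):
    """Mean time to recovery over (failed_at, recovered_at) pairs, in seconds."""
    if not incidents:
        raise ValueError("no incidents recorded")
    for failed_at, recovered_at in incidents:
        if recovered_at < failed_at:
            raise ValueError("recovery precedes failure")
    return sum(rec - fail for fail, rec in incidents) / len(incidents)


if __name__ == "__main__":
    # Two hypothetical drills: recovery took 30 s and 50 s.
    print(mttr_seconds([(0, 30), (100, 150)]))
```

Comparing this value across scheduled drills shows whether resiliency has regressed since the previous test.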

Slide 17

Slide 17 text

Load testing and RateLimit settings
Diagram labels: Top, prepare app, Crawl FE, Save Request/Response, Load test app, Micro Service App (x7), x100~, attack app
Yahoo! JAPAN

Slide 18

Slide 18 text

Preparing for a total service outage with two kinds of load tests and RateLimit
Full system load test
⎯ Test simulated traffic from FE to BE according to the load estimate for the event
⎯ Issue test orders to put write load on the DB along the order path
⎯ Run the large-scale test while services are suspended
Automatic load test / Constant load test
⎯ Because the event changed to generate a certain amount of load every Sunday (it has since changed again), tests were set to run automatically every week at midnight
⎯ Run at a limited scale without service suspension
RateLimit towards overload traffic
⎯ Set a RateLimit in advance for major applications based on the performance limits clarified by the load tests
⎯ Keep the impact from reaching all users even if the worst case happens
Yahoo! JAPAN
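A rate limit derived from a measured performance limit is often implemented as a token bucket. The sketch below is one common form of that technique, not the limiter used at Yahoo! JAPAN. The class name and parameters are illustrative, and the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allow up to `rate` requests/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum stored tokens
        self.tokens = capacity    # start full
        self.last = now           # time of the last refill

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject instead of overloading the backend


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=2)
    print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)])
```

Rejecting the excess early keeps the requests that are accepted within the capacity the load test confirmed.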

Slide 19

Slide 19 text

Main Topic Discussion
Load Handling and Preparation for Surviving Heavy Traffic

Slide 20

Slide 20 text

Main Topic
Past Outage Handling / Continuous efforts for improvement

Slide 21

Slide 21 text

PRIDE of developers who own LINE Messaging Platform
Always develop a reliable product at LINE scale (x3-5 load)
Allow an unpredicted outage once, but make sure to prevent recurrence the next year
LINE

Slide 22

Slide 22 text

Previous outage handling 1/2
Storage write delay in HBase (*1)
⎯ Redis's async job queue for writes to HBase accumulated
⎯ Purged the queue and recovered with MapReduce within the night
Push notification delay (*2)
⎯ Thread pool for push notifications got full
⎯ Purged the queue in order to be able to accept the next burst
(*1) https://www.slideshare.net/linecorp/a-5-47983106
(*2) https://engineering.linecorp.com/ja/blog/how-line-messaging-servers-prepare-for-new-year-traffic/
LINE

Slide 23

Slide 23 text

Message Box
Client (LINE app) with local DB / Server (talk-server)
Active users: incremental-sync between talk-server and local DB
Inactive users or initialization on login: full-sync between Message Box (HBase) and local DB
LINE

Slide 24

Slide 24 text

Outage prevention against MessageBox write delay
Root cause :
⎯ Worker threads got stuck due to heavy atomic sequence computation and frequent minor GC on HBase
Resolution :
⎯ Redesigned a lightweight schema on write
⎯ Replaced the storage during the following year
ASIS: Send Message (write) touches indexes, metadata, sequence, and Unread Count (atomic increment x2, put x2); Fetch (read); frequent minor GC
TOBE (V2 storage): Send Message (write) is a single put; Fetch (read) computes unreadCount on read
Migrated within 1 year
LINE
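The TOBE design moves work from the write path to the read path: sending stores one record, and the unread count is derived when fetching. The in-memory sketch below illustrates that trade-off only. The class, its fields, and the assumption that message IDs increase monotonically are all hypothetical and do not reflect the actual HBase schema.

```python
class MessageBoxV2:
    """Toy model: single append on write, unread count computed on read."""

    def __init__(self):
        self._messages = {}   # chat_id -> list of message ids (assumed increasing)
        self._last_read = {}  # chat_id -> highest message id the user has read

    def send(self, chat_id, message_id):
        # Write path: one append, no counters to increment atomically.
        self._messages.setdefault(chat_id, []).append(message_id)

    def mark_read(self, chat_id, message_id):
        self._last_read[chat_id] = message_id

    def unread_count(self, chat_id):
        # Read path: count messages newer than the last one read.
        last = self._last_read.get(chat_id, 0)
        return sum(1 for m in self._messages.get(chat_id, []) if m > last)


if __name__ == "__main__":
    box = MessageBoxV2()
    for mid in (1, 2, 3):
        box.send("chat-a", mid)
    box.mark_read("chat-a", 2)
    print(box.unread_count("chat-a"))
```

The write path no longer contends on a shared counter, at the cost of a small computation on each read.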

Slide 25

Slide 25 text

Duplicate orders caused by Order DB write delay
Before: FE App (Cart, Order History), Order PF, Order DB (Oracle); async write is delayed, so the order can't be read
After: FE App (Cart, Order History), Order PF, Order DB (Cassandra) with sync write, Order DB (Oracle) with async write
Yahoo! JAPAN

Slide 26

Slide 26 text

Previous outage handling 2/2
Storage write delay in HBase (*1)
⎯ Redis's async job queue for writes to HBase accumulated
⎯ Purged the queue and recovered with MapReduce within the night
Push notification delay (*2)
⎯ Thread pool for push notifications got full
⎯ Purged the queue in order to be able to accept the next burst
(*1) https://www.slideshare.net/linecorp/a-5-47983106
(*2) https://engineering.linecorp.com/ja/blog/how-line-messaging-servers-prepare-for-new-year-traffic/
LINE

Slide 27

Slide 27 text

Outage prevention against push notification delay
Root cause :
⎯ More connections opened by the async HTTP client caused full GC in the Armeria server
Resolution :
⎯ Make connections reusable and fix the number of connections
⎯ Isolate the outage impact by queue separation and circuit breaker
⎯ Enhance load test patterns for better prediction
Components: talk-server (JVM), NotificationService (Armeria), push-server, external push services (APNS/GCM)
Failure sequence: #1 response delay, #2 threadPool full, #3 many HTTP connections, #4 memory pressure, #5 full GC
LINE
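A circuit breaker isolates a slow dependency by failing fast after repeated errors and retrying only after a cool-down. The sketch below is a generic minimal version, not Armeria's circuit breaker API. The class name, thresholds, and explicit `now` argument are assumptions made to keep the example small and deterministic.

```python
class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; allow a trial
    request again once `reset_timeout` seconds have passed."""

    def __init__(self, failure_threshold, reset_timeout):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let requests probe the dependency.
        return now - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed probe


if __name__ == "__main__":
    cb = CircuitBreaker(failure_threshold=3, reset_timeout=10)
    for _ in range(3):
        cb.record_failure(now=0)
    print(cb.allow(now=1), cb.allow(now=11))
```

While the circuit is open, callers skip the push request instead of holding a thread and a connection, which is the kind of build-up the failure sequence above describes.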

Slide 28

Slide 28 text

Effect after outage prevention against push notification delay
Queue size stayed under the limit (20K)
Circuit breaker worked effectively
LINE

Slide 29

Slide 29 text

Smartphone app cannot react to sudden BE failures
Before: Backend API, FE App, Application BFF, Phone APP (can't read; can't change)
After: Backend API, FE App, Application BFF, CDN JSON File (write fault info), Phone APP (backoff)
Yahoo! JAPAN
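The "After" side suggests that fault information is published as a JSON file on a CDN and that the app backs off when it sees a fault. The sketch below shows one way a client could act on such a file. The JSON field `degraded_apis`, the function names, and the backoff parameters are all hypothetical, since the slide does not show the actual format.

```python
import json


def should_back_off(fault_json, api_name):
    """True if the (hypothetical) fault file lists `api_name` as degraded."""
    info = json.loads(fault_json)
    return api_name in info.get("degraded_apis", [])


def backoff_delays(base, factor, cap, attempts):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


if __name__ == "__main__":
    fault_file = '{"degraded_apis": ["cart"]}'  # what the CDN might serve
    if should_back_off(fault_file, "cart"):
        print(backoff_delays(base=1, factor=2, cap=10, attempts=5))
```

Serving the file from a CDN means the app can still learn about the fault when the backend itself cannot answer.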

Slide 30

Slide 30 text

Closing
Thoughts and ideas for future work

Slide 31

Slide 31 text

APPENDIX

Slide 32

Slide 32 text

Work on New Year's Eve
12/31 18:00 ~ 21:00 Sit in front of PC and open dashboard; connect to Zoom
12/31 21:00 ~ 23:55 Disable toggle for optional feature
1/1 00:00 ~ 00:10 Monitor JP New Year and report status periodically; make every effort to resolve/mitigate unexpected outages by 1:00 !!
1/1 01:00 ~ 01:10 TW New Year
1/1 02:00 ~ 02:10 TH New Year
1/x (first working day) Retrospective meeting
All developers work as SREs for a common objective
LINE

Slide 33

Slide 33 text

Enhance RateLimit function
Diagram labels: RateLimit, Function A, Function B, BE