Slide 1

Slide 1 text


Slide 2

Slide 2 text

Agenda
› New Year’s Campaign
› Preparations for handling spikes of about 9x

Slide 3

Slide 3 text

New Year’s campaign in Japan

Slide 4

Slide 4 text

Challenges
› We need to use other teams’ APIs to send recommendations and fetch user information
› It’s the biggest sticker campaign, and we don’t want it to affect our existing sticker services during the New Year season

Slide 5

Slide 5 text

New Year’s campaign in Japan
› Buy a campaign sticker to get a fortune slip
› From 12/26, you can send a fortune slip to your friends or draw one yourself
› One of the biggest sticker campaigns in Japan

Slide 6

Slide 6 text

Design
› Utilize Kafka provided by IMF team
Fully asynchronous
Can adjust configuration dynamically for throttling
› Kafka message processing speed
› External teams’ APIs such as Messaging, Point, etc.
Isolated service
› Separate modules rather than implementing functionality in existing services
› Use Decaton to process Kafka events
› Use RxJava/R2DBC to make our code fully asynchronous
High throughput

Slide 7

Slide 7 text

System overview
› The API server does the minimum necessary processing and sends events that can be processed later to Kafka
[Diagram: API/Batch, Decaton Processor]

Slide 8

Slide 8 text

Rate limited services
› Make the Kafka processing speed and the API client call rate dynamically configurable
› Most API calls can be retried; those that cannot are logged and handled manually
› Agree with other teams on the maximum traffic they can handle, and stay within it
› Perform load testing with other teams
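
A rate limit that can be changed at runtime could look roughly like the token bucket below. This is only a sketch: the class name `AdjustableRateLimiter` and its methods are hypothetical, not the team's actual implementation, and the caller passes in the current time so the behaviour is easy to reason about.

```python
import threading


class AdjustableRateLimiter:
    """Token bucket whose refill rate can be changed while in use (hypothetical sketch)."""

    def __init__(self, rate_per_sec: float, burst: float):
        self._lock = threading.Lock()
        self._rate = rate_per_sec
        self._burst = burst
        self._tokens = burst  # start with a full bucket
        self._last = 0.0      # time of the last refill

    def _refill(self, now: float) -> None:
        # Add the tokens earned since the last refill, capped at the burst size.
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self._burst, self._tokens + elapsed * self._rate)
        self._last = now

    def set_rate(self, rate_per_sec: float, now: float) -> None:
        # Settle tokens earned under the old rate before switching to the new one.
        with self._lock:
            self._refill(now)
            self._rate = rate_per_sec

    def try_acquire(self, now: float) -> bool:
        # Returns True if one call (or one Kafka message) may proceed right now.
        with self._lock:
            self._refill(now)
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False
```

In a real service `now` would typically come from `time.monotonic()`, and `set_rate` would be driven by whatever dynamic configuration mechanism is in place.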

Slide 9

Slide 9 text

Failover testing
› For our storages
  › Redis
  › MySQL
  › MongoDB
› This year we discovered a race condition issue with the database client library during failover

Slide 10

Slide 10 text

Appropriate estimation from planning
› The event itself is not just a single day but a whole month
  › From 2022/12/1 to 2023/1/13
› Well estimated OA messages
  › Send campaign-related information during the campaign
› The system was stable during the campaign period

Slide 11

Slide 11 text

Preparations for handling spikes of about 9x

Slide 12

Slide 12 text

Features provided by LINE STICKER
Provider side (Official & Creator)
› Create and update the product
User side
› Listing, including search and recommendation
› Purchase and download the resources
› Send and receive stickers

Slide 13

Slide 13 text

System overview
[Diagram: Talk server Open Chat Home content CDN API gateway Web site API/Search server ES MySQL MongoDB Capability server Image server Object storage]

Slide 14

Slide 14 text

System overview
[Diagram: Talk server Open Chat Home content CDN API gateway Web site API/Search server ES MySQL MongoDB Capability server Image server Object storage]
[Flows: Send sticker, Listing/Recommend, Downloading images]

Slide 15

Slide 15 text

What happens on New Year’s Eve

Slide 16

Slide 16 text

Difficulty of New Year’s Eve
› Annual event
  › Hard to estimate the load due to implementation or architecture changes
  › Not easy to predict the traffic of new features
› Spikes all at once in a short time
  › About 9 times what it was a minute ago
  › Increased sales for a few hours
  › Japan (UTC+9) → Taiwan (UTC+8) → Thailand (UTC+7)

Slide 17

Slide 17 text

Average growth rate
The geometric mean of the year-over-year growth rates over multiple years.
It smooths out some of the ups and downs in the annual growth rate.
Average growth rate: +44%
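
As a rough illustration of the geometric mean, the sketch below computes an average yearly growth rate from a list of yearly peak values. The function name and the sample numbers are made up for the example; they are not the data behind the +44% figure.

```python
import math


def average_growth_rate(yearly_peaks):
    """Geometric mean of year-over-year growth, returned as a fraction (0.5 means +50%)."""
    if len(yearly_peaks) < 2:
        raise ValueError("need at least two years of data")
    # Ratio of each year to the year before it.
    ratios = [curr / prev for prev, curr in zip(yearly_peaks, yearly_peaks[1:])]
    # The n-th root of the product of n ratios is their geometric mean.
    return math.prod(ratios) ** (1 / len(ratios)) - 1


# Hypothetical peaks for three consecutive years (not real traffic numbers).
rate = average_growth_rate([1000, 1500, 2250])
```

Because the ratios multiply out, the result depends only on the first and last values and the number of years, which is why single-year swings are smoothed away.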

Slide 18

Slide 18 text

The year-on-year ratio of the number of accesses under normal conditions
Calculate the ratio of peak accesses on a November weekday between the previous year and this year. Treat this ratio as the growth rate and multiply it by the number of accesses on New Year's Day of the previous year. If the previous year's number of accesses is not reliable, this method cannot produce a prediction.

                     2021      2022      2023
Weekday of November  1000rps   2500rps   5000rps
NY’s Day             2000rps   4500rps   ???rps
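
Applied to the numbers on this slide, the calculation can be sketched as follows. The function name is hypothetical; the inputs are the table values.

```python
def estimate_by_year_on_year(prev_normal_rps, this_normal_rps, prev_new_year_rps):
    """Scale last New Year's Day traffic by the growth seen on normal November weekdays."""
    growth = this_normal_rps / prev_normal_rps
    return prev_new_year_rps * growth


# 2023 estimate: November weekday grew 2500 -> 5000 rps, NY's Day 2022 was 4500 rps.
estimate_2023 = estimate_by_year_on_year(2500, 5000, 4500)

# Back-test on the previous year: 1000 -> 2500 rps in November, NY's Day 2021 was 2000 rps.
backtest_2022 = estimate_by_year_on_year(1000, 2500, 2000)
```

With the table values this gives 9000 rps for New Year's Day 2023. The back-test gives 5000 rps for 2022 against the 4500 rps shown in the table, so with these numbers the method errs on the high side.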

Slide 19

Slide 19 text

Estimate the number
› We choose the maximum of the average growth rate and the year-on-year ratio
› If neither is available, refer to other similar services

                 A: Have Data    A: No Data
B: Have Data     Max(A, B)       B
B: No Data       A               Refer to other similar services
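
The decision table can be written as a small function. This sketch assumes A is the estimate from the average growth rate and B the estimate from the year-on-year ratio (the slide does not label them explicitly), and uses `None` to mean "no data".

```python
from typing import Optional


def choose_estimate(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Pick the traffic estimate to plan for; None means fall back to similar services."""
    if a is not None and b is not None:
        return max(a, b)  # both available: plan for the larger one
    if a is not None:
        return a          # only A available
    if b is not None:
        return b          # only B available
    return None           # neither available: refer to other similar services
```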

Slide 20

Slide 20 text

Preparing the instances
› Services with confidence in estimation
  › 70% CPU usage as target
› Services that have concerns about estimation
  › 50% CPU usage as target
  › Had an outage last year, not enough information, or other concerns
› Use metrics to find services where the rate of increase in CPU is far greater than the rate of increase in requests
  › Adjust the configuration and check the code to find problems
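
One way to turn a CPU target into an instance count is sketched below. It assumes CPU usage grows linearly with requests per second, which is exactly the assumption the last bullet warns may not hold for every service. The function name and the sample numbers are hypothetical.

```python
import math


def required_instances(current_instances, current_cpu_pct, current_rps,
                       expected_rps, target_cpu_pct):
    """Instances needed so that average CPU stays at or below the target at expected_rps."""
    # Total CPU demand scales with traffic; spread it so each instance sits at the target.
    numerator = current_instances * current_cpu_pct * expected_rps
    denominator = current_rps * target_cpu_pct
    return math.ceil(numerator / denominator)


# Hypothetical service: 10 instances at 30% CPU serving 1000 rps, expecting 9000 rps.
confident = required_instances(10, 30, 1000, 9000, 70)  # 70% target
cautious = required_instances(10, 30, 1000, 9000, 50)   # 50% target
```

The lower target for less certain services simply buys more headroom: here the same forecast needs 39 instances at a 70% target and 54 at 50%.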

Slide 21

Slide 21 text

Load testing in the production environment
› Scale in and check the latency and error rate
› Reach 70% CPU usage
› Resolve bottlenecks when they are easy to resolve
[Diagram: LB, Server 1, Server 2, Server 3, Server 4, Server 5]
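
The scale-in steps can be planned ahead of time. The sketch below assumes traffic stays constant during the test and is spread evenly across the remaining servers; the function name and the 28% starting point are made up for illustration.

```python
def scale_in_plan(servers, cpu_pct, target_cpu_pct=70.0):
    """List (remaining_servers, projected_cpu_pct) steps that stay within the CPU target."""
    plan = []
    for remaining in range(servers, 0, -1):
        # The same total load is shared by fewer servers.
        projected = servers * cpu_pct / remaining
        if projected > target_cpu_pct:
            break
        plan.append((remaining, projected))
    return plan


# Hypothetical starting point: 5 servers behind the LB, each at 28% CPU.
steps = scale_in_plan(5, 28)
```

For this example the plan ends at 2 servers with a projected 70% CPU; latency and error rate would be checked at every step before removing the next server.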

Slide 22

Slide 22 text

Monitoring and Operations

Slide 23

Slide 23 text

Assign monitoring members by context
› Overloading may affect related services in the same context
  › Sticker
  › Store
  › Image
  › etc.
› To avoid tying up all monitoring members on a single outage

Slide 24

Slide 24 text

Monitoring
› Dedicated dashboard for New Year
› Focus on the most used services, such as the APIs related to sending/receiving, listing, and image downloading
› Make a panel for each service to avoid overloading Grafana
› We use rps per node because total rps is not meaningful when we can scale out easily

Slide 25

Slide 25 text

Playbook for unexpected requests
› More than what we estimated
  › Share inside the team
  › Check whether it will increase further in a short time
› More than what we can handle
  › Share to the emergency channel in the company
  › Take immediate action as planned
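
The two thresholds in the playbook can be expressed as a simple check. This is a sketch that assumes the capacity is at least as high as the estimate; the function name and the returned labels are hypothetical.

```python
def playbook_action(observed_rps, estimated_rps, capacity_rps):
    """Map observed traffic to the playbook step that applies."""
    if observed_rps > capacity_rps:
        # More than we can handle: emergency channel and the planned immediate action.
        return "escalate"
    if observed_rps > estimated_rps:
        # More than estimated: share inside the team and watch the trend.
        return "share_in_team"
    return "monitor"
```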

Slide 26

Slide 26 text

Preparing for priority load shedding
› We have a throttling feature for each API
› How can we know which API affects which service, screen, API, etc.?
› What is the UX when an error occurs?
› Workshop with stakeholders such as engineers, QA, planners, etc.
  › Start from the most requested API
  › Check the result in the dev environment
  › It is not necessary to check all APIs, only to identify those that have the greatest impact
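
Once the workshop has ranked the APIs, choosing what to throttle first could look like the sketch below. The API names, priorities, and traffic numbers are invented for the example, and a real system would more likely lower per-API limits gradually than cut an API off entirely.

```python
def apis_to_throttle(apis, capacity_rps):
    """Pick the lowest-priority APIs to shed until total traffic fits the capacity.

    apis: list of (name, priority, rps); a higher priority number means more important.
    """
    total = sum(rps for _, _, rps in apis)
    shed = []
    # Walk from least to most important and stop as soon as the load fits.
    for name, _, rps in sorted(apis, key=lambda api: api[1]):
        if total <= capacity_rps:
            break
        shed.append(name)
        total -= rps
    return shed


# Hypothetical ranking: sending stickers matters most, recommendations least.
apis = [("send_sticker", 3, 500), ("listing", 2, 300), ("recommendation", 1, 400)]
to_shed = apis_to_throttle(apis, capacity_rps=900)
```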

Slide 27

Slide 27 text

Result
› Improved our services during preparation
› Able to predict the number more accurately each year
› Overall service was stable on New Year's Eve 2022 and 2023

Slide 28

Slide 28 text

Thank you