Our approach to New Year's traffic of LINE STICKER @ TECHPULSE 2023

- Speaker: Koji Lin
- Event: http://techpulse.line.me/

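Did you send New Year's greetings to friends and family with stickers over the holiday? This session looks, from several perspectives, at how we designed and planned our system architecture to handle New Year's traffic across different time zones, and at how monitoring and operational mechanisms kept the service running effectively during the New Year period.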

LINE Developers Taiwan

February 21, 2023
Transcript

  1. 1

  2. Challenges
    › We need to call other teams' APIs to send recommendations or fetch user information
    › It is the biggest sticker campaign, and we do not want it to affect our existing sticker services during the New Year season
  3. New Year’s campaign in Japan
    › Buy a campaign sticker to get a fortune slip
    › From 12/26, users can send fortune slips to friends or draw one for themselves
    › One of the biggest sticker campaigns in Japan
  4. Design
    › Utilize Kafka provided by the IMF team
    Fully asynchronous
    › Can adjust configuration dynamically for throttling
      › Kafka message processing speed
      › External teams’ APIs such as Messaging, Point, etc.
    Isolated service
    › Separate modules rather than implementing functionality in existing services
    › Use Decaton to process Kafka events
    › Use RxJava/R2DBC to make our code fully asynchronous
    High throughput
  5. System overview
    › The API server does the minimum necessary processing and sends events that can be processed later to Kafka
    › Diagram labels: API/Batch, Decaton Processor
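
The hand-off described on slide 5 can be sketched roughly as follows. This is a minimal illustration, not the team's code: an in-process `queue.Queue` stands in for the Kafka topic, and the function names and event fields (`handle_send_fortune_slip`, `drain`, `sender_id`, `receiver_id`) are hypothetical.

```python
import queue

# Stand-in for a Kafka topic (assumption: the real system publishes to Kafka
# and consumes with Decaton; this sketch only shows the shape of the hand-off).
campaign_events = queue.Queue()


def handle_send_fortune_slip(sender_id: str, receiver_id: str) -> str:
    """API side: validate, enqueue, and return without doing the slow work."""
    if not sender_id or not receiver_id:
        raise ValueError("sender_id and receiver_id are required")
    campaign_events.put(
        {"type": "send_fortune_slip", "sender_id": sender_id, "receiver_id": receiver_id}
    )
    return "accepted"


def drain(process_event) -> int:
    """Processor side: handle whatever is queued, later and at its own pace."""
    handled = 0
    while True:
        try:
            event = campaign_events.get_nowait()
        except queue.Empty:
            return handled
        process_event(event)
        handled += 1
```

Because the request path only enqueues, slow downstream calls (messaging, points, and so on) cannot hold API threads during a spike; the cost is that results arrive eventually rather than in the response.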
  6. Rate-limited services
    › Make our Kafka processing speed and API client call rate dynamically configurable
    › Most API calls can be retried; those that cannot are logged and handled manually
    › Agree with other teams on the maximum traffic they can handle, and stay within it
    › Perform load testing with other teams
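
One way to picture a call rate that "can be configured dynamically" is a token bucket whose rate can be changed while it is in use. The sketch below is an assumption about the general technique, not the mechanism used in the talk; `AdjustableRateLimiter` and `set_rate` are hypothetical names, and the injectable `clock` exists only to make the behaviour easy to check.

```python
import time


class AdjustableRateLimiter:
    """Token bucket whose rate can be changed at runtime."""

    def __init__(self, rate_per_sec: float, clock=time.monotonic):
        self._rate = rate_per_sec
        self._clock = clock
        self._tokens = rate_per_sec  # start with one second of budget
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Capacity is capped at one second of traffic at the current rate.
        self._tokens = min(self._rate, self._tokens + elapsed * self._rate)

    def set_rate(self, rate_per_sec: float) -> None:
        # Hypothetical hook: call this from whatever delivers config changes.
        self._refill()
        self._rate = rate_per_sec
        self._tokens = min(self._tokens, rate_per_sec)

    def try_acquire(self) -> bool:
        """Return True if one call may proceed now, False if it should wait."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A caller would check `try_acquire()` before each outbound API call and re-queue or delay the work when it returns `False`; lowering the rate with `set_rate` takes effect on the next call without a restart.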
  7. Failover testing
    › For our storages
      › Redis
      › MySQL
      › MongoDB
    › This year we discovered a race condition issue with the database client library during failover
  8. Appropriate estimation from planning
    › The event itself is not just a single day but a whole month
      › From 2022/12/1 to 2023/1/13
    › Well-estimated OA messages
      › Send campaign-related information during the campaign
    › The system was stable during the campaign period
  9. Features provided by LINE STICKER
    Provider side (Official & Creator)
    › Create and update the product
    User side
    › Listing, including search and recommendation
    › Purchase and download the resources
    › Send and receive stickers
  10. System overview
    › Diagram labels: Talk server Open Chat Home content CDN API gateway Web site API/Search server ES MySQL MongoDB Capability server Image server Object storage
  11. System overview
    › Diagram labels: Talk server Open Chat Home content CDN API gateway Web site API/Search server ES MySQL MongoDB Capability server Image server Object storage
    › Highlighted flows: Send sticker, Listing/Recommend, Downloading images
  12. Difficulty of New Year’s Eve
    › Annual event
      › Hard to estimate the load due to implementation or architecture changes
      › Not easy to figure out the traffic of new features
    › Traffic spikes all at once in a short time
      › About 9 times what it was a minute earlier
    › Increased sales for a few hours
      › Japan (UTC+9) → Taiwan (UTC+8) → Thailand (UTC+7)
  13. Average growth rate
    › The per-year geometric mean of the growth rates over multiple years
    › It absorbs some of the ups and downs in the annual growth rate
    › Average growth rate: +44%
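
The geometric-mean growth rate on slide 13 can be computed as below. This is a minimal sketch: the function name is hypothetical, and the sample values in the usage note are made up for illustration, not the data behind the +44% figure.

```python
def average_growth_rate(yearly_peaks: list[float]) -> float:
    """Geometric-mean growth per year across consecutive yearly peak values."""
    if len(yearly_peaks) < 2 or min(yearly_peaks) <= 0:
        raise ValueError("need at least two positive yearly values")
    years = len(yearly_peaks) - 1
    # The product of the year-over-year ratios telescopes to last / first,
    # so the geometric mean of the ratios is (last / first) ** (1 / years).
    return (yearly_peaks[-1] / yearly_peaks[0]) ** (1 / years) - 1
```

For example, hypothetical peaks of 100, 180 and 196 grow by +80% and then about +9%, yet `average_growth_rate([100, 180, 196])` comes out at roughly 0.40, which is the smoothing effect the slide describes.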
  14. The year-on-year ratio of the number of accesses under normal conditions
    › Calculate the ratio of the peak accesses on a November weekday between the previous year and this year
    › Treat this ratio as a growth rate and multiply it by the number of accesses on New Year's Day of the previous year
    › If the previous year's number of accesses is not reliable, the value cannot be predicted

                            2021      2022      2023
      Weekday of November   1000rps   2500rps   5000rps
      NY’s Day              2000rps   4500rps   ???rps
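
The rule on slide 14 is a single multiplication, sketched here with a hypothetical function name. The usage note applies it to the slide's sample figures, assuming each column pairs a November weekday peak with the New Year's Day peak of the same season; the slide itself leaves the final cell open.

```python
def estimate_from_yoy(prev_normal_peak: float,
                      this_normal_peak: float,
                      prev_new_year_peak: float) -> float:
    """Scale last New Year's peak by the growth seen on ordinary weekdays."""
    if prev_normal_peak <= 0:
        raise ValueError("previous year's baseline must be positive")
    growth = this_normal_peak / prev_normal_peak
    return prev_new_year_peak * growth
```

With the sample table, `estimate_from_yoy(2500, 5000, 4500)` gives 9000 rps. Running the same rule one season earlier, `estimate_from_yoy(1000, 2500, 2000)` gives 5000 rps against the 4500 rps shown, which illustrates why the estimate is treated as a planning figure rather than a forecast.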
  15. Estimate the number
    › We choose the larger of the average growth rate and the year-on-year ratio
    › If neither is available, refer to other similar services

                      A: Have Data   A: No Data
      B: Have Data    Max(A, B)      B
      B: No Data      A              refer to other similar services
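
The decision table on slide 15 maps directly onto a small function. In this sketch `a` is assumed to be the estimate from the average growth rate and `b` the estimate from the year-on-year ratio (the slide only labels them A and B); `None` stands for "no data", and the function name is hypothetical.

```python
from typing import Optional


def choose_estimate(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Return the estimate to plan for, or None when neither method has data."""
    if a is not None and b is not None:
        return max(a, b)  # both available: plan for the larger one
    if a is not None:
        return a
    # Either only b is available, or neither is; None means
    # "refer to other similar services" in the slide's table.
    return b
```

Taking the maximum biases the plan toward over-provisioning, which is the cheaper mistake for a once-a-year peak.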
  16. Preparing the instances
    › Services with confidence in estimation
      › 70% CPU usage as target
    › Services that have concerns about estimation
      › 50% CPU usage as target
      › Had an outage last year, lack enough information, or otherwise concern us
    › Use metrics to find services whose CPU usage grows far faster than their request rate
    › Adjust the configuration and check the code to find problems
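
The sizing and the CPU-versus-requests check on slide 16 could be approximated as follows. This is a sketch under a strong assumption that CPU usage scales linearly with request rate; the function names, the `factor` threshold, and the per-node capacity figure in the usage note are all hypothetical.

```python
import math


def required_instances(estimated_peak_rps: float,
                       rps_per_node_at_full_cpu: float,
                       target_cpu: float) -> int:
    """Instances needed so each node sits at target_cpu (0-1) at the peak."""
    usable_rps_per_node = rps_per_node_at_full_cpu * target_cpu
    return math.ceil(estimated_peak_rps / usable_rps_per_node)


def cpu_outpaces_requests(cpu_before: float, cpu_after: float,
                          rps_before: float, rps_after: float,
                          factor: float = 1.5) -> bool:
    """Flag a service whose CPU grew much faster than its request rate."""
    cpu_growth = cpu_after / cpu_before
    rps_growth = rps_after / rps_before
    return cpu_growth > factor * rps_growth
```

For a 9000 rps estimate and nodes that saturate at a hypothetical 500 rps, the 70% target needs 26 instances and the 50% target needs 36. A service flagged by `cpu_outpaces_requests` breaks the linear assumption, so it is a candidate for the configuration and code review the slide mentions rather than for simply adding nodes.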
  17. Load testing in the production environment
    › Scale in and check the latency and error rate
    › Reach 70% CPU usage
    › Resolve bottlenecks when they are easy to resolve
    › Diagram labels: LB, Server 1, Server 2, Server 3, Server 4, Server 5
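
Scaling in concentrates the same production traffic on fewer servers, so the per-server CPU can be projected before each step. The sketch below assumes traffic stays constant during the test and CPU scales linearly with load; `projected_cpu` and `scale_in_plan` are hypothetical names, and the starting CPU in the usage note is made up.

```python
def projected_cpu(current_servers: int, current_cpu: float, servers_kept: int) -> float:
    """Per-server CPU (0-1) if the same traffic is spread over servers_kept."""
    return current_cpu * current_servers / servers_kept


def scale_in_plan(current_servers: int, current_cpu: float,
                  target_cpu: float) -> list[tuple[int, float]]:
    """Remove one server at a time until the projected CPU reaches the target."""
    plan = []
    for kept in range(current_servers, 0, -1):
        cpu = projected_cpu(current_servers, current_cpu, kept)
        plan.append((kept, cpu))
        if cpu >= target_cpu:
            break
    return plan
```

Starting from five servers at a hypothetical 30% CPU, `scale_in_plan(5, 0.30, 0.70)` steps through 5, 4, 3 and 2 servers, with the last step projected at 75%. Latency and error rate would be checked at every step, as the slide says, before removing the next server.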
  18. Assign monitoring members by context
    › Overloading may affect related services in the same context
      › Sticker
      › Store
      › Image
      › etc.
    › This avoids using the whole monitoring team's resources on a single outage
  19. Monitoring
    › Dedicated dashboard for New Year
    › Focus on the APIs of the most used services, such as sending/receiving, listing, and image downloading
    › Make a panel for each service to avoid overloading Grafana
    › We use rps per node, because total rps is not meaningful when we can scale out easily
  20. Playbook for unexpected requests
    › More than what we estimated
      › Share inside the team
      › Check whether it will increase further in a short time
    › More than what we can handle
      › Share to the emergency channel in the company
      › Take immediate action as planned
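
The two thresholds in the playbook on slide 20 can be written as a small decision function. This sketch assumes the capacity is at least as large as the estimate; the `Action` enum and the function name are hypothetical, and the action texts paraphrase the slide.

```python
from enum import Enum


class Action(Enum):
    NONE = "within the estimate: no action"
    SHARE_IN_TEAM = "share inside the team and check the short-term trend"
    ESCALATE = "share to the emergency channel and act as planned"


def playbook_action(current_rps: float, estimated_rps: float,
                    capacity_rps: float) -> Action:
    """Pick the playbook step for the current request rate."""
    if current_rps > capacity_rps:
        return Action.ESCALATE
    if current_rps > estimated_rps:
        return Action.SHARE_IN_TEAM
    return Action.NONE
```

Checking the capacity threshold first means the more severe action wins when both conditions hold.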
  21. Preparing for priority load shedding
    › We have a throttling feature for each API
    › How can we know which API affects which service, screen, API, etc.?
    › What is the UX when an error occurs?
    › Workshop with stakeholders such as engineers, QA, planners, etc.
    › Start from the most requested APIs
    › Check the results in the dev environment
    › It is not necessary to check every API, only to identify those with the greatest impact
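
Once the workshop has ranked the APIs, the result can be kept as a priority map that tells operators what to throttle first. The sketch below is illustrative only: the API names and priority numbers are invented, and `apis_to_shed` is a hypothetical helper, not part of the per-API throttling feature mentioned on the slide.

```python
def apis_to_shed(priorities: dict[str, int], keep_up_to: int) -> list[str]:
    """APIs to throttle, least important first.

    priorities maps API name -> priority, where 1 is the most important.
    Every API with a priority number above keep_up_to is shed.
    """
    shed = [name for name, prio in priorities.items() if prio > keep_up_to]
    return sorted(shed, key=lambda name: (-priorities[name], name))


# Hypothetical ranking produced by a stakeholder workshop.
EXAMPLE_PRIORITIES = {
    "send_sticker": 1,
    "download_image": 1,
    "list_products": 2,
    "recommend": 3,
}
```

With this example map, `apis_to_shed(EXAMPLE_PRIORITIES, keep_up_to=1)` returns `["recommend", "list_products"]`, so sending and downloading stay available while lower-priority traffic is throttled.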
  22. Result
    › Improved our services during preparation
    › Able to predict the numbers more accurately each year
    › Overall, the service was stable on New Year's Eve 2022 and 2023