Problems and Solutions for Two Billion Recommendations Per Day

Naoki Watanabe (hackmylife) (LINE / CRS dev team / Software engineer)

https://tech-verse.me/ja/sessions/13
https://tech-verse.me/en/sessions/13
https://tech-verse.me/ko/sessions/13

Tech-Verse2022

November 18, 2022

Transcript

  2. Agenda
     - About Smart Channel
     - Architecture
     - Troubles & Solutions
     - Lessons learned
  3. Smart Channel

  9. Service Concept (diagram labels: Contents, Personalize)

  13. Stats
      - Requests / day: 2 billion
      - MAU: 167 million
      - Items / day: 16 million
  14. Architecture

  22. Flow (diagram)
      - Components: LINE APP, Services, CRS Engine, Event Tracker, Learning worker
      - Labels, in the order the slides add them: Contents, Request, Contents, Imp/Click, Log, Parameter
  26. Key function
      - Gathers recommended contents from services
      - Selects the best ones from the many contents
      - Learns users' preferences and uses them to make the next selection
  27. https://linedevday.linecorp.com/jp/2019/sessions/B1-2

  32. Peak Traffic
      - LINE’s peak traffic is at New Year
      - Especially JP (GMT+9)
      - 250k requests / second
      - Client-side cache is effective
  39. Request with client-side cache (New Year Peak of 2018)
      - TTL 0 s: estimated 250,000 RPS
      - TTL 60 s: estimated 148,000 RPS
      - TTL 180 s: estimated 121,000 RPS
      - TTL 300 s: estimated 107,000 RPS
      - TTL 600 s: estimated 71,000 RPS
      - TTL 1800 s: estimated 58,000 RPS
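
To make the effect of the client-side cache easier to compare, the following sketch computes how much each TTL reduces the estimated peak load relative to no cache (TTL 0). The figures are the ones on the slide; the names `ESTIMATED_RPS` and `reduction_percent` are introduced here for illustration only.

```python
# Estimated requests per second for each client-side cache TTL (seconds),
# copied from the slide's table.
ESTIMATED_RPS = {
    0: 250_000,
    60: 148_000,
    180: 121_000,
    300: 107_000,
    600: 71_000,
    1800: 58_000,
}


def reduction_percent(rps_by_ttl):
    """Return {ttl: percent fewer requests than with no cache (TTL 0)}."""
    baseline = rps_by_ttl[0]
    return {ttl: 100 * (baseline - rps) / baseline for ttl, rps in rps_by_ttl.items()}


REDUCTION = reduction_percent(ESTIMATED_RPS)

for ttl, percent in REDUCTION.items():
    print(f"TTL {ttl:>4}s: {ESTIMATED_RPS[ttl]:>7,} RPS ({percent:.1f}% fewer than no cache)")
```

On these figures, a 60-second TTL already removes about 41% of the peak requests, and an 1800-second TTL about 77%.
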
  44. Core system (diagram): LINE App, Services, Importer, CRS Engine, Redis Cluster

  49. Data Modeling: 167 million times

  54. Data Modeling (diagram): Targeting (Item-A ID, Item-B ID, Item-C ID, Item-A ID, Item-D ID) and Information
  61. Redis Cluster (Node 1, Node 2, Node 3)
      - Key: User-A ID + Service X, Value: Item-A ID, Item-B ID, ….
      - Key: User-A ID + Service Y, Value: Item-X ID, Item-Y ID, ….
  64. Redis Cluster (Node 1, Node 2, Node 3)
      - Key: {User-A ID} + Service X, Value: Item-A ID, Item-B ID, ….
      - Key: {User-A ID} + Service Y, Value: Item-X ID, Item-Y ID, ….
      - Key: {User-B ID} + Service X, Value: Item-C ID, Item-D ID, ….
  66. Data Modeling
      - Targeting
        - Type: Redis
        - Data format: {User-A}:Service X => Item-A ID, Item-B ID, …; {User-A}:Service Y => Item-X ID, Item-Y ID, …; {User-B}:Service X => Item-C ID, Item-D ID, …
        - Distribution key: User ID
      - Information
        - Type: Redis
        - Distribution key: Item ID
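
The braces in keys such as `{User-A}:Service X` match the Redis Cluster hash tag notation: when a key contains a non-empty `{...}`, only the text inside the braces is hashed to choose the slot. The sketch below reimplements that slot calculation (CRC16/XMODEM modulo 16384) in plain Python to show why all of one user's keys land in the same slot. It is an illustration, not the speakers' code, and the function names are chosen here.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_tag_part(key: str) -> str:
    """Return the text inside the first non-empty {...}, or the whole key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 Redis Cluster hash slots."""
    return crc16_xmodem(hash_tag_part(key).encode("utf-8")) % 16384


# Both of User-A's keys hash only "User-A", so they share a slot (and a node).
print(key_slot("{User-A}:Service X") == key_slot("{User-A}:Service Y"))
```

Without the braces, `User-A:Service X` and `User-A:Service Y` would be hashed as whole strings and could land on different nodes.
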
  70. Delivery flow: Fetch Targeting (from Targeting Redis) → Fetch Information (from Information Redis) → Ranking
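
As a rough illustration of the three delivery steps named on the slide (Fetch Targeting, Fetch Information, Ranking), here is a minimal sketch that uses in-memory dictionaries in place of the two Redis clusters. The data, the `score` field, and all function names are hypothetical; the slides do not say how ranking is computed.

```python
# Hypothetical stand-ins for Targeting Redis (user -> item IDs)
# and Information Redis (item ID -> item details).
TARGETING = {
    "User-A": ["Item-A", "Item-B", "Item-C"],
    "User-B": ["Item-A", "Item-D"],
}
INFORMATION = {
    "Item-A": {"score": 0.2},
    "Item-B": {"score": 0.9},
    "Item-C": {"score": 0.5},
    "Item-D": {"score": 0.7},
}


def fetch_targeting(user_id):
    """Step 1: look up the item IDs recommended for this user."""
    return list(TARGETING.get(user_id, []))


def fetch_information(item_ids):
    """Step 2: look up the details of each item, skipping unknown IDs."""
    return [dict(INFORMATION[i], id=i) for i in item_ids if i in INFORMATION]


def rank(items):
    """Step 3: order items best-first (here by a made-up score)."""
    return sorted(items, key=lambda item: item["score"], reverse=True)


def deliver(user_id):
    return rank(fetch_information(fetch_targeting(user_id)))


print([item["id"] for item in deliver("User-A")])
```
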
  75. Look before you leap
      - Measure performance per web server
      - Release gradually
      - Always A/B test if anything is unclear
  76. Troubles & Solutions

  78. 1st trouble (delivery flow diagram: Targeting Redis, Fetch Targeting, Information Redis, Fetch Information, Ranking)
  82. Sudden slowdown (response time)
      - 2020/11/10: 99%tile 50ms
      - 2020/12/1: 99%tile 200ms

  84. Information Redis Stats (chart; axes: Count, Time)

  85. Recommended items increasing: 8.3 items/user on 11/1 → 20.5 items/user on 12/5

  94. Hypothesis - 1 (diagram)
      - Fetch Targeting process: reads Item A ID, Item B ID, Item C ID, Item D ID, Item E ID, Item F ID from Targeting Redis
      - Fetch Information process: Lettuce (mget) against Information Redis (Node 1, Node 2, Node 3), which holds Item A to Item F
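
One way to reason about this hypothesis: if a multi-key read has to be split into one sub-request per node that owns any of the keys, the number of nodes touched grows with the number of item IDs per user. The sketch below estimates that number under the simplifying assumption that keys are spread uniformly at random over the nodes. The node count of 30 is hypothetical (the slides do not give the cluster size), and the key counts are rounded from the 8.3 and 20.5 items/user figures.

```python
def expected_nodes_touched(num_nodes: int, num_keys: int) -> float:
    """Expected number of distinct nodes owning at least one of num_keys keys,
    assuming each key lands on a uniformly random node."""
    return num_nodes * (1.0 - (1.0 - 1.0 / num_nodes) ** num_keys)


HYPOTHETICAL_NODES = 30  # assumption, not from the slides

for keys in (8, 20):
    nodes = expected_nodes_touched(HYPOTHETICAL_NODES, keys)
    print(f"{keys} keys -> about {nodes:.1f} of {HYPOTHETICAL_NODES} nodes per request")
```
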
  102. Verification - 1 (charts): Execution time (99%tile) and Command count, each bucketed by 1-9, 10-19, 20-29, 30-39, and over 40 items
  111. Hypothesis - 2 (diagram)
       - Targeting lists shown for X, Y, Z: Item A ID, Item B ID; Item A ID, Item F ID; Item D ID, Item B ID
       - Fetch Information process: reads from Information Redis (Node 1, Node 2, Node 3), which holds Item A to Item F
  114. Problem & Solution
       1. Issue: Burst of MGET command
          - Recommended items increased
          - Need access to multiple Redis nodes to retrieve information
       2. Issue: Concentration of Targeting
          - Some content is recommended to many people
          - Access concentrates on a specific node
  118. Solution1 - Burst of MGET command: add a Limit to the delivery flow (Fetch Targeting, Fetch Information, Ranking) ↓ Not Smart

  120. Solution1 - Burst of MGET command: add a Smart Filter to the delivery flow (Fetch Targeting, Fetch Information, Ranking)
  135. Smart Filter (diagram)
       - Stages: Fetch Targeting, Targeting Filter, Limit, Fetch Information, Information Filter, Ranking
       - Limit: N = 5, with items grouped into Tier 1 (user log based) and Tier 2 (user attribute based)
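
The slides do not spell out how the limit interacts with the two tiers. One possible reading is that item IDs are capped at N before Fetch Information, with Tier 1 (user log based) candidates taken ahead of Tier 2 (user attribute based) ones. The sketch below implements only that assumed reading; `apply_tiered_limit` and its arguments are hypothetical names.

```python
def apply_tiered_limit(tier1_ids, tier2_ids, limit=5):
    """Keep at most `limit` distinct item IDs, preferring Tier 1 over Tier 2.

    Assumed behavior for illustration; the actual Smart Filter rules
    are not described in the slides.
    """
    selected = []
    seen = set()
    for tier in (tier1_ids, tier2_ids):
        for item_id in tier:
            if len(selected) >= limit:
                return selected
            if item_id not in seen:
                seen.add(item_id)
                selected.append(item_id)
    return selected


tier1 = ["a", "b", "c"]          # user log based candidates
tier2 = ["c", "d", "e", "f"]     # user attribute based candidates
print(apply_tiered_limit(tier1, tier2, limit=5))
```

Capping the ID list before the information lookup bounds the number of keys per MGET, whatever the size of a user's targeting list.
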
  149. Solution2 - Concentration of Targeting (diagram)
       - Targeting lists shown for X, Y, Z: Item A ID, Item B ID; Item A ID, Item F ID; Item D ID, Item B ID
       - A cache (size=4) in front of Information Redis (Node 1, Node 2, Node 3) ends up holding Item A, Item B, Item F, Item D
       - Caffeine; cache time: 5s; eviction: size-based
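
The slide names Caffeine (a Java in-memory cache library) with a 5-second cache time and size-based eviction. As a language-neutral illustration of that idea, the sketch below is a small size-bounded cache with a fixed expiry per entry. It is a stand-in written for this note: it uses plain least-recently-used eviction, which is simpler than Caffeine's own eviction policy, and the class and function names are hypothetical.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Size-bounded cache whose entries expire a fixed time after being loaded."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, expires_at)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            self._entries.move_to_end(key)  # mark as most recently used
            return entry[0]
        value = loader(key)  # e.g. a read from Information Redis
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used
        return value


loads = []


def load_from_redis(item_id):
    """Hypothetical loader that records every backend read."""
    loads.append(item_id)
    return {"id": item_id}


cache = TtlLruCache(max_size=4, ttl_seconds=5)
# The item IDs requested for X, Y, Z on the slide: A, B / A, F / D, B.
for item_id in ["A", "B", "A", "F", "D", "B"]:
    cache.get(item_id, load_from_redis)
print(loads)  # only four backend reads for six lookups
```

With the slide's access pattern, the repeated lookups of Item A and Item B are served from the cache, so popular items stop concentrating reads on one node.
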
  151. Results: response time before and after (80% faster)
       - 99.9 percentile: 352ms → 67ms
       - 99.0 percentile: 176ms → 29ms
       - 95.0 percentile: 126ms → 22ms
       - 75.0 percentile: 92ms → 16ms
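
The "80% faster" figure can be checked directly against the before and after values on the slide. The sketch below computes the reduction in response time per percentile; the variable and function names are chosen here for illustration.

```python
# (before_ms, after_ms) per percentile, copied from the slide.
RESULTS_MS = {
    "99.9": (352, 67),
    "99.0": (176, 29),
    "95.0": (126, 22),
    "75.0": (92, 16),
}


def improvement_percent(before_ms, after_ms):
    """Percent reduction in response time."""
    return 100 * (before_ms - after_ms) / before_ms


IMPROVEMENT = {p: improvement_percent(b, a) for p, (b, a) in RESULTS_MS.items()}

for percentile, percent in IMPROVEMENT.items():
    print(f"p{percentile}: {percent:.1f}% lower response time")
```

Each percentile improves by roughly 81% to 84%, consistent with the slide's rounded figure.
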
  153. 2nd trouble: Earthquake, CHIBA M 5.9, 2021/10/7 22:41 https://www.jma.go.jp/jma/press/2110/08c/202110080050.html

  156. Traffic explosion (charts): Requests per second, Application server threads

  158. Service effect
       - Earthquake notice expected
       - Some requests timed out…

  162. Prioritized Delivery (diagram: Fetch Targeting, Fetch Information, Ranking, Prioritized Contents)
       - Disaster information threshold: Japanese earthquake scale >= 6
       - 10/7 earthquake scale = 5+
  165. Solution
       - Scale-out: servers 25 → 50
       - Change threshold: Japanese earthquake scale 6 → 5
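
The threshold change can be pictured as a comparison on the Japanese seismic intensity scale, which has ten levels: 0 to 4, 5 lower, 5 upper, 6 lower, 6 upper, and 7. The sketch below is an assumption-laden illustration: it writes the lower and upper levels as `-` and `+` (matching the slide's "5+"), and it reads the slide's thresholds "6" and "5" as 6 lower and 5 lower, which the slides do not confirm. The names are hypothetical.

```python
# Japanese seismic intensity levels in increasing order.
JMA_SCALE = ["0", "1", "2", "3", "4", "5-", "5+", "6-", "6+", "7"]


def is_prioritized(intensity: str, threshold: str) -> bool:
    """True if an earthquake of this intensity reaches the delivery threshold."""
    return JMA_SCALE.index(intensity) >= JMA_SCALE.index(threshold)


OLD_THRESHOLD = "6-"  # assumed reading of "scale >= 6"
NEW_THRESHOLD = "5-"  # assumed reading of the lowered threshold "5"

# The 10/7 earthquake was 5+: below the old threshold, above the new one.
print(is_prioritized("5+", OLD_THRESHOLD), is_prioritized("5+", NEW_THRESHOLD))
```
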
  167. Result: Earthquake, FUKUSHIMA M7.3, 2022/3/16 23:36 https://www.jma.go.jp/jma/press/2203/17a/202203170130.html

  171. Result
       - 105k req/sec
       - Response time 88ms (99.9%)
       - MGET 367µs

  174. Unexpected issue
       - My house lost power
       - Candles and tethering were used
  175. Lessons learned

  178. In the design
       - Define system requirements in numerical terms
       - Be prepared for errors in assumptions and for unexpected behavior, with measures that can be easily controlled
       - The growth of the service can sometimes be a problem, and it usually comes just when you’ve forgotten about it
  181. In the trouble
       - Logs and metrics are very important. If they are missing, add them as needed.
       - We have to make hypotheses and test them one by one.
       - The evolution of technology brings us new difficulties. There is no silver bullet yet.
  185. I have not failed. I’ve just found 10,000 ways that won’t work. - Thomas Edison