Building & Deploying CF
Improving Buzzvil’s Retargeting & my life quality by 10X
Slide 2
Slide 2 text
About me
● Loyalty, loyalty! (military-style salute)
● ML Engineer (~3 yrs)
● Ad Display Team
Slide 4
Slide 4 text
How did I spend my time?
● 99% Data Processing & Ops
● 1% Machine Learning
Slide 5
Slide 5 text
How do I spend my time now?
● 99% → 90% Data Processing & Ops
● 1% → 10% Machine Learning
Slide 6
Slide 6 text
Monthly bookings of ₩1B+ (>20% of total revenue)
Slide 7
Slide 7 text
Retargeting
Slide 8
Slide 8 text
Retargeting as a recommendation problem
Slide 9
Slide 9 text
As with any problem, there are many solutions.
● Best Selling
● Most Recently Viewed / Carted / Purchased
● BERT
● Collaborative Filtering (CF)
Slide 11
Slide 11 text
CF intuition: find similar users!
Slide 12
Slide 12 text
0 for Not Purchased, 1 for Purchased
Product 1 Product 2 Product 3 Product 4 Product 5
User A 1 0 1 0 1
User B 0 1 0 1 1
User C 1 1 1 0 0
User D 0 1 0 0 1
Slide 13
Slide 13 text
Q: Which products did User D buy?
Product 1 Product 2 Product 3 Product 4 Product 5
User A 1 0 1 0 1
User B 0 1 0 1 1
User C 1 1 1 0 0
User D 0 1 0 0 1
Slide 15
Slide 15 text
Item-to-Item CF: find similar items, not users!
Product 1 Product 2 Product 3 Product 4 Product 5
User A 1 0 1 0 1
User B 0 1 0 1 1
User C 1 1 1 0 0
User D 0 1 0 0 1
Slide 16
Slide 16 text
E.g. Similarity of Product 2 & Product 5?
Product 1 Product 2 Product 3 Product 4 Product 5
User A 1 0 1 0 1
User B 0 1 0 1 1
User C 1 1 1 0 0
User D 0 1 0 0 1
Slide 17
Slide 17 text
E.g. Similarity of Product 2 & Product 5?
Product 1 Product 2 Product 3 Product 4 Product 5
User A 1 0 1 0 1
User B 0 1 0 1 1
User C 1 1 1 0 0
User D 0 1 0 0 1
Sim(P2, P5) = 2
Slide 18
Slide 18 text
E.g. Similarity of Product 2 & Product 1?
Product 1 Product 2 Product 3 Product 4 Product 5
User A 1 0 1 0 1
User B 0 1 0 1 1
User C 1 1 1 0 0
User D 0 1 0 0 1
Sim(P2, P1) = 1
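The co-purchase similarity on these slides is just a dot product between two item columns of the 0/1 purchase matrix. A minimal sketch on the toy matrix (the `sim` helper is ours, for illustration):

```python
import numpy as np

# Toy purchase matrix from the slides: rows = Users A-D, columns = Products 1-5
# (1 = purchased, 0 = not purchased)
R = np.array([
    [1, 0, 1, 0, 1],  # User A
    [0, 1, 0, 1, 1],  # User B
    [1, 1, 1, 0, 0],  # User C
    [0, 1, 0, 0, 1],  # User D
])

def sim(i, j):
    """Number of users who purchased both product i and product j (1-indexed)."""
    return int(R[:, i - 1] @ R[:, j - 1])

print(sim(2, 5))  # → 2 (Users B and D bought both)
print(sim(2, 1))  # → 1 (only User C bought both)
```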
Slide 19
Slide 19 text
Many similarity functions can be used between two vectors
Product 4 = [0, 1, 0, 0]
Product 2 = [0, 1, 1, 1]
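Two common choices on 0/1 purchase vectors are cosine similarity (used later in this deck) and Jaccard similarity; a sketch on Product 2 and Product 4 (these helper functions are illustrative, not from the deck):

```python
import numpy as np

p2 = np.array([0, 1, 1, 1])  # Product 2's purchase column
p4 = np.array([0, 1, 0, 0])  # Product 4's purchase column

def cosine(a, b):
    # Dot product normalized by the vector lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(a, b):
    # For 0/1 vectors: co-purchasers / users who bought either item
    return float(np.sum(a & b) / np.sum(a | b))

print(round(cosine(p2, p4), 3))   # → 0.577
print(round(jaccard(p2, p4), 3))  # → 0.333
```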
Slide 20
Slide 20 text
So how do we build a
recommendation system with this?
Slide 21
Slide 21 text
1. Match each of the user’s purchased items to similar items
2. Combine them into a recommendation list
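The two steps above can be sketched as follows: given precomputed per-item neighbor lists, score each candidate by summing its similarity to everything the user bought. The neighbor lists and scoring rule here are a hypothetical minimal version, not the production logic:

```python
# Hypothetical precomputed neighbor lists: item -> {similar item: similarity}
item_neighbors = {
    "P1": {"P3": 0.9, "P5": 0.4},
    "P2": {"P5": 0.8, "P4": 0.6},
    "P5": {"P2": 0.8, "P1": 0.4},
}

def recommend(purchased, k=3):
    scores = {}
    # Step 1: match each purchased item to its similar items
    for item in purchased:
        for neighbor, sim in item_neighbors.get(item, {}).items():
            if neighbor in purchased:
                continue  # don't recommend what the user already bought
            # Step 2: combine into one list by accumulating similarity scores
            scores[neighbor] = scores.get(neighbor, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend({"P2", "P5"}))  # → ['P4', 'P1']
```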
Slide 22
Slide 22 text
Coupang
Slide 23
Slide 23 text
Netflix
Slide 24
Slide 24 text
Youtube
Slide 25
Slide 25 text
Amazon
Slide 26
Slide 26 text
For Buzzvil?
● Interim results from the 2nd test are positive; re-running the test to double-check
For details, see the Confluence docs and Redash: Doc #1, Doc #2
Slide 27
Slide 27 text
ItemCF was popularized by Amazon 20 years ago
Slide 28
Slide 28 text
Scalable algorithm to compute item similarities
Slide 29
Slide 29 text
Let’s compute item similarities on 30 days of SSG purchase data

              Product 1  ...  Product 400,000
User 1            1      ...         1
...              ...     ...        ...
User 600,000      0      ...         1
Slide 30
Slide 30 text
Version 1: Python Naive Version
Slide 31
Slide 31 text
Estimated runtime = 7,200 hours. Only 300 days... hmm
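The deck doesn’t show the naive code, but an all-pairs loop over dense item columns is the likely shape of it: O(P²) pairs, each costing a length-U dot product. A hypothetical sketch on the toy matrix:

```python
import numpy as np

def naive_item_similarities(R):
    """All-pairs cosine similarity over dense item columns: O(P^2 * U)."""
    n_users, n_items = R.shape
    norms = np.sqrt((R * R).sum(axis=0))
    S = np.zeros((n_items, n_items))
    for i in range(n_items):             # P iterations ...
        for j in range(i + 1, n_items):  # ... over ~P^2/2 pairs ...
            dot = float(R[:, i] @ R[:, j])  # ... each a length-U dot product
            if norms[i] and norms[j]:
                S[i, j] = S[j, i] = dot / (norms[i] * norms[j])
    return S

R = np.array([[1, 0, 1, 0, 1],
              [0, 1, 0, 1, 1],
              [1, 1, 1, 0, 0],
              [0, 1, 0, 0, 1]], dtype=float)
S = naive_item_similarities(R)
print(round(S[1, 4], 3))  # → 0.667, i.e. cosine Sim(P2, P5)
```

At P = 400,000 and U = 600,000, loops like this in pure Python are what produce runtimes measured in months.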
Slide 32
Slide 32 text
Find the bottleneck!
Slide 34
Slide 34 text
Q: How to optimize the cosine similarity computation?
Slide 35
Slide 35 text
Q: How to optimize the cosine similarity computation?
Product 4 = [0, 1, 0, 0]
Product 2 = [0, 1, 1, 1]
Slide 36
Slide 36 text
Hint: What is the difference between these two vectors?
Product 2 = [0, 1, 1, 1]
vs.
              Product 1  ...  Product 400,000
User 1            1      ...         1
...              ...     ...        ...
User 600,000      0      ...         1
Slide 37
Slide 37 text
Hint: What is the difference between these two vectors?
Product 2 = [0, 1, 1, 1]
vs.
              Product 1  ...  Product 400,000
User 1            1      ...         1
...              ...     ...        ...
User 600,000      0      ...         1
Sparse!
Slide 38
Slide 38 text
No content
Slide 39
Slide 39 text
Using sparse vectors reduces both memory and time taken
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
a = csr_matrix(...)
b = csr_matrix(...)
sim = cosine_similarity(a, b) # 6x faster
Slide 40
Slide 40 text
Using sparse vectors reduces both memory and time taken
Naive version now takes 1,200 hours (50 days), down from 300 days
I can finish before my military discharge
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
a = csr_matrix(...)
b = csr_matrix(...)
sim = cosine_similarity(a, b) # 6x faster
The wheel has been invented but it’s not yet ready
Go, DY!
Slide 51
Slide 51 text
Until then, I’ll build my own.
Slide 52
Slide 52 text
Inspiration
● DAGs & Scheduling from Airflow
○ Decoupling scheduling and task details
● Serverless from Lambda
○ No more SSH, Tmux, Crontab
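The “decoupling scheduling and task details” idea can be sketched as a tiny DAG runner: tasks declare only their upstream dependencies, and the scheduler runs them in topological order without knowing what each task does. The task names and structure below are hypothetical, not PeterFlow’s actual API:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical DAG in the Airflow spirit: task -> set of upstream tasks
dag = {
    "extract_purchases": set(),
    "build_matrix": {"extract_purchases"},
    "compute_similarities": {"build_matrix"},
    "upload_recommendations": {"compute_similarities"},
}

# Task details live apart from the schedule; here each is just a print
tasks = {name: (lambda n=name: print(f"running {n}")) for name in dag}

# The scheduler only sees the dependency graph
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```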
Slide 53
Slide 53 text
Let’s see it in action!
Slide 54
Slide 54 text
Lessons Learned
● Use sparse vectors to save memory and time
● Optimize algorithms at the single-core level first, then go multi-core
● Deploying & managing ML is hell but most of it can be automated
● Building your own tools can increase your life quality by 10X
○ Python + boto3 makes it very easy. PeterFlow is <200 lines.