0 = Not Purchased, 1 = Purchased

          Product 1  Product 2  Product 3  Product 4  Product 5
User A        1          0          1          0          1
User B        0          1          0          1          1
User C        1          1          1          0          0
User D        0          1          0          0          1
Item-to-Item CF: Find similar items, not similar users!
E.g. Similarity of Product 2 & Product 5?
Over Users A-D, Product 2 = (0, 1, 1, 1) and Product 5 = (1, 1, 0, 1). Users B and D bought both, so the dot product of the two columns is 2.
Sim(P2, P5) = 2
E.g. Similarity of Product 2 & Product 1?
Over Users A-D, Product 2 = (0, 1, 1, 1) and Product 1 = (1, 0, 1, 0). Only User C bought both, so the dot product of the two columns is 1.
Sim(P2, P1) = 1
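The two worked examples can be reproduced with a short NumPy sketch. It assumes, as the slide results suggest, that Sim is the number of users who bought both products, i.e. the dot product of the two product columns. The helper name `co_purchase_count` is made up for illustration.

```python
import numpy as np

# Purchase matrix from the slides: rows are Users A-D, columns are Products 1-5.
purchases = np.array([
    [1, 0, 1, 0, 1],  # User A
    [0, 1, 0, 1, 1],  # User B
    [1, 1, 1, 0, 0],  # User C
    [0, 1, 0, 0, 1],  # User D
])

def co_purchase_count(matrix, i, j):
    """Number of users who bought both product i and product j (1-indexed)."""
    return int(matrix[:, i - 1] @ matrix[:, j - 1])

# All pairwise counts at once: entry [i, j] is the count for products i+1 and j+1.
item_sim = purchases.T @ purchases

sim_p2_p5 = co_purchase_count(purchases, 2, 5)
sim_p2_p1 = co_purchase_count(purchases, 2, 1)
print(sim_p2_p5, sim_p2_p1)
```

The full `item_sim` matrix is symmetric, and its diagonal holds the number of buyers of each product.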
Hint: What is the difference between these two vectors?

Product 2 (toy example): (0, 1, 1, 1)

Vs.

              Product 1  ...  Product 400,000
User 1            1      ...        1
...              ...     ...       ...
User 600,000      0      ...        1

Sparse!
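To get a feel for why sparsity matters at this scale, here is a rough sketch comparing dense and CSR storage. The matrix is a scaled-down stand-in (6,000 x 4,000), and the density of 0.001 is an assumed value for illustration, not a figure from the talk.

```python
import numpy as np
from scipy.sparse import random as sparse_random

n_users, n_products = 6_000, 4_000  # scaled-down stand-in for 600,000 x 400,000
density = 0.001                     # assumed fraction of nonzero cells (hypothetical)

m = sparse_random(n_users, n_products, density=density, format="csr",
                  dtype=np.float32, random_state=0)

# Dense storage needs one value per cell, whether or not it is zero.
dense_bytes = n_users * n_products * m.dtype.itemsize
# CSR stores only the nonzero values, their column indices, and row pointers.
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# At full scale, even one byte per cell is 240 GB for a dense matrix.
full_dense_bytes = 600_000 * 400_000

print(dense_bytes, sparse_bytes, full_dense_bytes)
```

The actual saving depends on how sparse the real purchase data is, which the slides do not state.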
Using sparse vectors reduces both memory and time taken

Naive version now takes 1200 hours (50 days)
With 300 days left until military discharge, it can finish in time

    from scipy.sparse import csr_matrix
    from sklearn.metrics.pairwise import cosine_similarity

    a = csr_matrix(...)
    b = csr_matrix(...)
    sim = cosine_similarity(a, b)  # 6x faster
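The slide leaves the matrix contents elided. As a runnable illustration only, the sketch below applies the same two calls to the toy purchase table from the earlier slides; it is not the speaker's actual pipeline. Note that cosine similarity normalizes by vector length, so the values differ from the raw co-purchase counts shown earlier.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy purchase table: rows are Users A-D, columns are Products 1-5.
purchases = csr_matrix(np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 0, 1],
], dtype=float))

# Item-to-item CF compares products, so use one row per product.
item_vectors = purchases.T.tocsr()

# dense_output=False keeps the result sparse when the inputs are sparse.
sim = cosine_similarity(item_vectors, dense_output=False)

sim_p2_p5 = sim[1, 4]  # Product 2 vs. Product 5
sim_p2_p1 = sim[1, 0]  # Product 2 vs. Product 1
print(sim_p2_p5, sim_p2_p1)
```

Product 2 and Product 5 each have three buyers and share two, giving 2 / (√3 · √3) = 2/3. Product 1 has two buyers and shares one with Product 2, giving 1 / (√3 · √2).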
Lessons Learned
● Use sparse vectors to save memory and time
● Optimize algorithms from the single-core level, then multi-core
● Deploying & managing ML is hell but most of it can be automated
● Building your own tools can increase your life quality by 10X
  ○ Python + boto3 makes it very easy. PeterFlow is <200 lines.