Building & Deploying CF

Buzzvil
November 17, 2021

By Peter Kim


Transcript

  1. How did I spend my time?
    • 99% Data Processing & Ops
    • 1% Machine Learning
  2. How do I spend my time now?
    • 90% (previously 99%) Data Processing & Ops
    • 10% (previously 1%) Machine Learning
  3. As with any problem, there are many solutions.
    • Best Selling
    • Most Recently Viewed / Carted / Purchased
    • BERT
    • Collaborative Filtering (CF)
  5. 0 for Not Purchased, 1 for Purchased

              Product 1  Product 2  Product 3  Product 4  Product 5
    User A        1          0          1          0          1
    User B        0          1          0          1          1
    User C        1          1          1          0          0
    User D        0          1          0          0          1
  6. Q: Which products did User D buy?

              Product 1  Product 2  Product 3  Product 4  Product 5
    User A        1          0          1          0          1
    User B        0          1          0          1          1
    User C        1          1          1          0          0
    User D        0          1          0          0          1
  8. Item-to-Item CF: find similar items (not similar users)!

              Product 1  Product 2  Product 3  Product 4  Product 5
    User A        1          0          1          0          1
    User B        0          1          0          1          1
    User C        1          1          1          0          0
    User D        0          1          0          0          1
  9. E.g. Similarity of Product 2 & Product 5?

              Product 1  Product 2  Product 3  Product 4  Product 5
    User A        1          0          1          0          1
    User B        0          1          0          1          1
    User C        1          1          1          0          0
    User D        0          1          0          0          1
  10. E.g. Similarity of Product 2 & Product 5?

              Product 1  Product 2  Product 3  Product 4  Product 5
    User A        1          0          1          0          1
    User B        0          1          0          1          1
    User C        1          1          1          0          0
    User D        0          1          0          0          1

    Sim(P2, P5) = 2
  11. E.g. Similarity of Product 2 & Product 1?

              Product 1  Product 2  Product 3  Product 4  Product 5
    User A        1          0          1          0          1
    User B        0          1          0          1          1
    User C        1          1          1          0          0
    User D        0          1          0          0          1

    Sim(P2, P1) = 1
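The two values above are consistent with counting the users who bought both products, which is the dot product of the two product columns. A minimal sketch under that assumption (the helper names `product_vector` and `co_purchase_count` are illustrative, not from the deck):

```python
# Purchase matrix from the slides: one row per user, one column per product.
purchases = {
    "User A": [1, 0, 1, 0, 1],
    "User B": [0, 1, 0, 1, 1],
    "User C": [1, 1, 1, 0, 0],
    "User D": [0, 1, 0, 0, 1],
}


def product_vector(product_number):
    """Return the column for a 1-based product number, ordered User A..D."""
    return [row[product_number - 1] for row in purchases.values()]


def co_purchase_count(p, q):
    """Count the users who bought both product p and product q (dot product)."""
    return sum(a * b for a, b in zip(product_vector(p), product_vector(q)))


print(co_purchase_count(2, 5))  # 2 (User B and User D bought both)
print(co_purchase_count(2, 1))  # 1 (only User C bought both)
```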
  12. Many similarity functions can be used between two vectors

    Product 4    0  1  0  0
    Product 2    0  1  1  1
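The formulas on this slide did not survive in the transcript, so the choice below is an assumption: cosine similarity (which the deck uses later via scikit-learn) and Jaccard similarity are two common options for binary purchase vectors. A small pure-Python sketch applied to the Product 4 and Product 2 vectors:

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def jaccard_similarity(u, v):
    """Buyers of both products divided by buyers of at least one of them."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


product_4 = [0, 1, 0, 0]
product_2 = [0, 1, 1, 1]

print(round(cosine_similarity(product_4, product_2), 3))   # 0.577
print(round(jaccard_similarity(product_4, product_2), 3))  # 0.333
```

Unlike a raw co-purchase count, both functions normalize by how popular each product is, so a best seller does not look similar to everything.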
  13. 1. Match each of the user’s purchased items to similar items
      2. Combine them into a recommendation list
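The two steps can be sketched on the toy purchase matrix from the earlier slides. The deck does not say how the per-item matches are combined; summing cosine similarities over the user's purchased items is an assumption made here for illustration, and `recommend` is a hypothetical helper name.

```python
import math

# Toy purchase matrix from the earlier slides (rows: users, columns: Product 1..5).
purchases = {
    "User A": [1, 0, 1, 0, 1],
    "User B": [0, 1, 0, 1, 1],
    "User C": [1, 1, 1, 0, 0],
    "User D": [0, 1, 0, 0, 1],
}
N_PRODUCTS = 5


def column(product_number):
    """Purchase vector of a 1-based product number across all users."""
    return [row[product_number - 1] for row in purchases.values()]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user, top_n=3):
    """Step 1: score each unpurchased product against every purchased one.
    Step 2: combine by summing the scores (an assumption) and rank."""
    bought = [p for p in range(1, N_PRODUCTS + 1) if purchases[user][p - 1]]
    scores = {}
    for candidate in range(1, N_PRODUCTS + 1):
        if candidate in bought:
            continue
        scores[candidate] = sum(cosine(column(b), column(candidate)) for b in bought)
    # Highest score first; ties broken by product number.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_n]


print(recommend("User D"))  # [4, 1, 3]
```

User D bought Products 2 and 5; Product 4 ranks first because its only buyer (User B) also bought both of those products.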
  14. For Buzzvil?
    • Interim results of the second test are positive; re-testing to double-check
    For more details, see the Confluence documents and Redash: Document #1, Document #2,
  15. Let's compute item similarities from 30 days of SSG purchase data

                    Product 1  ...  Product 400,000
    User 1              1      ...         1
    ...                ...     ...        ...
    User 600,000        0      ...         1
  16. Hint: What is the difference between these two vectors?

    Product 2    0  1  1  1

    Vs.

                    Product 1  ...  Product 400,000
    User 1              1      ...         1
    ...                ...     ...        ...
    User 600,000        0      ...         1
  17. Hint: What is the difference between these two vectors?

    Product 2    0  1  1  1

    Vs.

                    Product 1  ...  Product 400,000
    User 1              1      ...         1
    ...                ...     ...        ...
    User 600,000        0      ...         1

    Sparse!
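To make the sparsity point concrete: a dense 600,000 x 400,000 matrix has 240 billion cells, while a sparse representation stores only the purchases that actually happened. The sketch below uses plain Python sets as a stand-in for a sparse format (the deck itself uses `scipy.sparse.csr_matrix`); the one-byte-per-cell figure is an assumption for the estimate.

```python
N_USERS = 600_000
N_PRODUCTS = 400_000

# Dense storage needs one cell per (user, product) pair, purchased or not.
dense_cells = N_USERS * N_PRODUCTS
dense_gb = dense_cells / 1e9  # assuming 1 byte per cell

print(dense_cells)  # 240000000000
print(dense_gb)     # 240.0


def to_sparse(dense_column):
    """Keep only the row indices whose value is non-zero."""
    return {i for i, value in enumerate(dense_column) if value}


# Product 2 and Product 5 columns from the toy matrix (User A..D -> 0..3).
product_2 = to_sparse([0, 1, 1, 1])
product_5 = to_sparse([1, 1, 0, 1])

# Co-purchase count is a set intersection: work is proportional to the
# number of purchases, not to the number of users.
print(len(product_2 & product_5))  # 2
```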
  18. Using sparse vectors reduces both memory and time taken

        from scipy.sparse import csr_matrix
        from sklearn.metrics.pairwise import cosine_similarity

        a = csr_matrix(...)
        b = csr_matrix(...)
        sim = cosine_similarity(a, b)  # 6x faster
  19. Using sparse vectors reduces both memory and time taken

    Naive version now takes 1200 hours (50 days, down from 300 days)
    It can finish before I'm discharged from the military

        from scipy.sparse import csr_matrix
        from sklearn.metrics.pairwise import cosine_similarity

        a = csr_matrix(...)
        b = csr_matrix(...)
        sim = cosine_similarity(a, b)  # 6x faster
  20. Python Distributed Applications API:

        import ray

        ray.init()

        @ray.remote
        def compute_partial(...):
            # compute partial similarities table

        # a @ray.remote function is invoked with .remote(), which returns a future
        futures = [compute_partial.remote(), compute_partial.remote(), ...]
        ray.get(futures)
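The pattern on this slide is to split the pairwise work into chunks, compute a partial similarity table per chunk, and merge the results. The sketch below shows that pattern with the standard library only, not Ray: `map_fn` defaults to the built-in `map`, and the chunking scheme, the `buyers` layout, and the function names other than `compute_partial` are assumptions made for illustration.

```python
from itertools import combinations

# Product -> set of buyers, taken from the toy purchase matrix.
buyers = {
    1: {"A", "C"},
    2: {"B", "C", "D"},
    3: {"A", "C"},
    4: {"B"},
    5: {"A", "B", "D"},
}


def compute_partial(pairs):
    """Compute co-purchase counts for one chunk of product pairs."""
    return {(p, q): len(buyers[p] & buyers[q]) for p, q in pairs}


def chunked(items, n_chunks):
    """Split a list into n_chunks interleaved slices."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def all_similarities(n_chunks=4, map_fn=map):
    """Fan the chunks out through map_fn, then merge the partial tables."""
    pairs = list(combinations(sorted(buyers), 2))
    table = {}
    for partial in map_fn(compute_partial, chunked(pairs, n_chunks)):
        table.update(partial)
    return table


table = all_similarities()
print(len(table))     # 10 product pairs
print(table[(2, 5)])  # 2
```

Because each chunk is independent, `map_fn` can be swapped for a parallel map such as `concurrent.futures.ProcessPoolExecutor().map`, or the chunks can be submitted as Ray tasks as on the slide.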
  21. 7200 hours → 1200 hours → 4 hours

    Python multiprocessing with Ray on an r5.12xlarge instance (48 cores) takes ~4 hours
  22. 7200 hours → 1200 hours → 4 hours

    Python multiprocessing with Ray on an r5.12xlarge instance (48 cores) takes ~4 hours

    Now ready for production!
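As a quick check of the figures quoted in the deck, the arithmetic below converts the runtimes to days and computes the speedup at each step. Only the three hour values come from the slides; the variable names are illustrative.

```python
naive_hours = 7200     # dense, single process
sparse_hours = 1200    # after switching to sparse vectors
parallel_hours = 4     # Ray on a 48-core r5.12xlarge

print(naive_hours / 24)               # 300.0 days
print(sparse_hours / 24)              # 50.0 days
print(naive_hours / sparse_hours)     # 6.0x from sparse vectors
print(sparse_hours / parallel_hours)  # 300.0x from the final step
print(naive_hours / parallel_hours)   # 1800.0x overall
```

The last step is a 300x improvement on a 48-core machine, so parallelism alone does not account for it; some of the gain presumably came from single-core optimization as well.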
  23. How many EC2 instances do we need?
    • train-ssg
    • train-emart
    • train-hs
    • train-ns
    • infer-ssg
    • infer-emart
    • infer-hs
    • infer-ns
  24. Inspiration
    • DAGs & Scheduling from AirFlow
      ◦ Decoupling scheduling and task details
    • Serverless from Lambda
      ◦ No more SSH, Tmux, Crontab
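One way to picture "decoupling scheduling and task details" is a runner that only knows the dependency graph, while the task bodies live elsewhere. The sketch below is not the author's tool: it uses the standard-library `graphlib` (Python 3.9+), borrows task names from the earlier slide, and assumes that each `infer-*` task depends on the matching `train-*` task.

```python
from graphlib import TopologicalSorter

# Scheduling: task -> set of tasks that must finish first.
# The train -> infer dependency is an assumption, not stated in the deck.
dag = {
    "train-ssg": set(),
    "infer-ssg": {"train-ssg"},
    "train-emart": set(),
    "infer-emart": {"train-emart"},
}


# Task details: what each kind of task actually does (placeholders here).
def train(client):
    return f"trained {client}"


def infer(client):
    return f"inferred {client}"


ACTIONS = {"train": train, "infer": infer}


def run(dag):
    """Run every task in an order that respects the dependencies."""
    log = []
    for task in TopologicalSorter(dag).static_order():
        action, client = task.split("-", 1)
        log.append(ACTIONS[action](client))
    return log


for line in run(dag):
    print(line)
```

Changing what `train` does never touches `dag`, and adding a new client only adds two entries to `dag`.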
  25. Lessons Learned
    • Use sparse vectors to save memory and time
    • Optimize algorithms at the single-core level first, then multi-core
    • Deploying & managing ML is hell, but most of it can be automated
    • Building your own tools can improve your quality of life by 10X
      ◦ Python + boto3 makes it very easy. PeterFlow is <200 lines.