Rakuten - Viki Data Challenge Solution.

Dat Le
September 14, 2015

A presentation of my solution for the Rakuten Viki Data Science Challenge.
For more info: http://www.dextra.sg/challenges/rakuten-viki-video-challenge/

Transcript

  1. Rakuten – Viki Challenge
    Le Nguyen The Dat

  2. About me
    o  2010: MSc. Computer Science – Oxford University
    o  2011: Research Engineer – A*STAR DSI
    o  2013: Data – ZALORA Group
    o  2015: Data – Commercialize TV
    https://github.com/lenguyenthedat
    https://sg.linkedin.com/in/lenguyenthedat
  3. Challenge descriptions https://www.viki.com/

  4. Challenge descriptions http://www.dextra.sg/challenges/rakuten-viki-video-challenge/

  5. Challenge descriptions
    Data:
    o  User attributes (880,000): country, gender
    o  Video attributes (600): country, language, genre, owner, casts
    o  User viewing behavior (4,880,000): video, user, score
    Task:
    o  Recommendation engine: a prediction for each user (user, top 3 videos)
    o  Insights
    Case study:
    o  http://www.dextra.sg/wp-content/uploads/2015/09/CaseStudy_Viki.pdf
  6. Useful Links
    o  Tableau Public Visualization: http://tiny.cc/viki-viz
    o  Source Code: http://tiny.cc/viki-src

  7. Preliminary Analysis

  8. Analysis – Gender

  9. Analysis – Gender

  10. Analysis – Genre

  11. Analysis – Genre

  12. Analysis – Content Owner

  13. Analysis – Content Owner

  14. Analysis – Videos Traffic

  15. Algorithm Overview

  16. Training phase
    Recommendation Engine
    o  Videos Overall Performances
      •  Hotness
      •  Freshness
    o  Videos Similarity Matrix
      •  Content Similarity
      •  Collaborative Filtering
  17. Training phase: Videos Overall Performances
    •  Hotness
    •  Freshness
  18. Training phase
    Videos overall performances:
    o  $\text{Hotness}^{*} = \frac{\sum \text{usersWatched}}{\text{firstDate} - \text{lastDate}}$
    o  $\text{Freshness} = \frac{1}{(\text{broadcastDate} - \text{currentDate})^{2}}$
    *With gender filter applied
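The two scores on slide 18 can be sketched in plain Python. This is a minimal illustration, not the author's code: the function names and the `(user_id, view_date)` input shape are hypothetical, usersWatched is assumed to mean distinct users, and the date span is taken as an absolute number of days (clamped to at least one day) so the score stays positive and never divides by zero. A gender filter, as noted on the slide, would be applied to the events before calling `hotness`.

```python
from datetime import date

def hotness(view_events):
    """Distinct users who watched a video, divided by its viewing window in days.

    view_events: iterable of (user_id, view_date) pairs for ONE video
    (hypothetical input shape, assumed for this sketch).
    """
    events = list(view_events)
    users_watched = len({user for user, _ in events})
    dates = [view_date for _, view_date in events]
    span_days = (max(dates) - min(dates)).days
    # Assumption: clamp to one day so a single-day window does not divide by zero.
    return users_watched / max(span_days, 1)

def freshness(broadcast_date, current_date):
    """1 / (broadcastDate - currentDate)^2, with the gap measured in days."""
    days = abs((broadcast_date - current_date).days)
    # Assumption: clamp to one day so the broadcast day itself is well defined.
    return 1.0 / max(days, 1) ** 2

events = [(1, date(2015, 1, 1)), (2, date(2015, 1, 3)), (2, date(2015, 1, 5))]
video_hotness = hotness(events)  # 2 distinct users over a 4-day window
video_freshness = freshness(date(2015, 1, 1), date(2015, 1, 3))
```

Because the freshness gap is squared, its sign does not matter; only the hotness denominator needs the absolute value.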
  19. Training phase: Videos Similarity Matrix
    •  Content Similarity
    •  Collaborative Filtering
  20. Training phase
    Videos similarity matrix: Content Similarity
    o  Original Country: V1.country == V2.country
    o  Original Language: V1.language == V2.language
    o  Adult Content: (V1.adult == 1) & (V2.adult == 1)
    o  Content Owner: V1.contentOwner == V2.contentOwner
  21. Training phase
    Videos similarity matrix: Content Similarity
    o  Episode Count: min(V1.episodeCount, V2.episodeCount) / max(V1.episodeCount, V2.episodeCount)
    o  Genre: $J(v_1, v_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$
    o  Cast: $J(v_1, v_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
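The content features on slides 20 and 21 could be computed per video pair roughly as follows. This is a sketch under assumptions: the `Video` dataclass, its field names, and the dictionary output are hypothetical, and returning 0.0 for empty sets or zero episode counts is a choice made here, not something the slides specify.

```python
from dataclasses import dataclass, field

@dataclass
class Video:
    # Hypothetical record; field names mirror the attributes named on the slides.
    country: str
    language: str
    adult: int
    content_owner: str
    episode_count: int
    genres: set = field(default_factory=set)
    casts: set = field(default_factory=set)

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; 0.0 when both sets are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_features(v1, v2):
    """One similarity value per content feature for a pair of videos."""
    max_episodes = max(v1.episode_count, v2.episode_count)
    return {
        "country": float(v1.country == v2.country),
        "language": float(v1.language == v2.language),
        "adult": float(v1.adult == 1 and v2.adult == 1),
        "content_owner": float(v1.content_owner == v2.content_owner),
        "episode_count": (min(v1.episode_count, v2.episode_count) / max_episodes
                          if max_episodes else 0.0),
        "genre": jaccard(v1.genres, v2.genres),
        "cast": jaccard(v1.casts, v2.casts),
    }

a = Video("kr", "ko", 0, "owner1", 16, {"drama", "romance"}, {"A", "B"})
b = Video("kr", "ko", 0, "owner2", 20, {"drama"}, {"B", "C"})
features = content_features(a, b)
```

Keeping the features separate, rather than summing them straight away, leaves room for a per-feature weight to be applied later.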
  22. Training phase: Videos Similarity Matrix
    •  Content Similarity
    •  Collaborative Filtering
  23. Training phase
    Videos similarity matrix: Collaborative Filtering
    o  Jaccard Index - https://en.wikipedia.org/wiki/Jaccard_index:
      •  Set theory
      •  Ratio of intersection to union gives the similarity score
      •  Sensitive to sparse input: limit to only the top 25% of videos
      $J(v_1, v_2) = \frac{|U_1 \cap U_2|}{|U_1 \cup U_2|}$
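A minimal sketch of the Jaccard-based collaborative filtering step, assuming each U is the set of users who watched a video and that "top 25%" means the quarter of videos with the most viewers. Both readings are assumptions, and `jaccard_matrix` and its `top_fraction` parameter are hypothetical names.

```python
from itertools import combinations

def jaccard_matrix(watchers, top_fraction=0.25):
    """Pairwise Jaccard index over the most-watched videos.

    watchers: dict mapping video_id -> set of user_ids who watched it.
    Only the top `top_fraction` of videos by viewer count are compared,
    since the index is unreliable on sparsely watched videos.
    """
    ranked = sorted(watchers, key=lambda v: len(watchers[v]), reverse=True)
    keep = ranked[: max(1, int(len(ranked) * top_fraction))]
    sims = {}
    for v1, v2 in combinations(keep, 2):
        union = watchers[v1] | watchers[v2]
        score = len(watchers[v1] & watchers[v2]) / len(union) if union else 0.0
        sims[(v1, v2)] = sims[(v2, v1)] = score  # the index is symmetric
    return sims

watchers = {
    "a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {1}, "d": {2},
    "e": {5}, "f": {6}, "g": {7}, "h": {8},
}
sims = jaccard_matrix(watchers)  # keeps "a" and "b", the top 2 of 8 videos
```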
  24. Training phase
    Videos similarity matrix: Collaborative Filtering
    o  Cosine Similarity - https://en.wikipedia.org/wiki/Cosine_similarity:
      •  Vector space model
      •  Angle between 2 vectors gives similarity score.
      •  Good for sparse input: can apply gender filter.
      $\cos(v_1, v_2) = \frac{\vec{U}_1 \cdot \vec{U}_2}{\|\vec{U}_1\| \, \|\vec{U}_2\|} = \frac{\sum_{i=1}^{n} U_{1,i} \, U_{2,i}}{\sqrt{\sum_{i=1}^{n} U_{1,i}^{2}} \times \sqrt{\sum_{i=1}^{n} U_{2,i}^{2}}}$
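The cosine formula can be sketched over sparse vectors, assuming each video is represented as a dictionary from user id to that user's score, with missing users counted as 0. That representation and the function name are assumptions for illustration. A gender filter would amount to dropping users of the other gender from both dictionaries first.

```python
from math import sqrt

def cosine(u1, u2):
    """Cosine similarity between two sparse user-score vectors.

    u1, u2: dict mapping user_id -> score for one video each.
    Only users present in both vectors contribute to the dot product.
    """
    dot = sum(score * u2[user] for user, score in u1.items() if user in u2)
    norm1 = sqrt(sum(s * s for s in u1.values()))
    norm2 = sqrt(sum(s * s for s in u2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm1 * norm2)

same = cosine({1: 3, 2: 4}, {1: 3, 2: 4})   # identical vectors
disjoint = cosine({1: 1}, {2: 1})           # no shared users
partial = cosine({1: 1, 2: 1}, {1: 1})      # one shared user out of two
```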
  25. Personalization phase
    o  Inputs: User History, Videos Overall Performances, Videos Similarity Matrix
    o  Recommendation Engine
    o  Output: Personalized Recommendations
  26. Performance
    Overall time & space complexity: $O(uv^{2})$
    o  u: number of users (880,000)
    o  v: number of videos (600)
    Advantages:
    o  Lightweight: fits in an 8GB MacBook Air!
    o  Scalable (fully distributed with SparklingPandas)
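As a rough worked example of what the bound means at this data size, treating $O(uv^{2})$ as an operation count up to a constant factor and assuming (this is not stated on the slide) a dense similarity matrix of 8-byte floats:

```latex
u v^{2} = 880{,}000 \times 600^{2} = 880{,}000 \times 360{,}000 \approx 3.17 \times 10^{11}
\qquad
v^{2} \times 8\ \text{bytes} = 360{,}000 \times 8\ \text{bytes} = 2.88\ \text{MB}
```

Under those assumptions the video-by-video matrix itself is tiny, so the per-user work dominates.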
  27. Applications
    Flexibility:
    o  Custom weightages for:
      •  Features
      •  Collaborative filtering similarity scores
      •  Video performances (hotness or freshness)
      •  Individual User - Video scores
    Not just an engine but a framework:
    o  To create different recommendation engines.
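One way the custom weightages could feed a top-3 recommendation is sketched below. The slides do not give the combining formula, so the linear blend, the `recommend` function, and every parameter name here are assumptions for illustration only.

```python
def recommend(history, similarity, hotness, freshness, weights, top_n=3):
    """Rank unseen videos by a weighted blend of similarity and performance.

    history:    dict video_id -> this user's score for a watched video
    similarity: dict (watched_id, candidate_id) -> similarity score
    hotness, freshness: dict video_id -> overall performance score
    weights:    dict with keys "similarity", "hotness", "freshness"
    """
    scores = {}
    for candidate in hotness:
        if candidate in history:
            continue  # skip videos the user has already watched
        # Similarity to each watched video, weighted by the user's own score.
        sim = sum(score * similarity.get((watched, candidate), 0.0)
                  for watched, score in history.items())
        scores[candidate] = (weights["similarity"] * sim
                             + weights["hotness"] * hotness[candidate]
                             + weights["freshness"] * freshness.get(candidate, 0.0))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

history = {"a": 3}
similarity = {("a", "b"): 0.9, ("a", "c"): 0.1, ("a", "d"): 0.5}
hotness = {"a": 1.0, "b": 0.1, "c": 0.2, "d": 0.1, "e": 0.05}
weights = {"similarity": 1.0, "hotness": 1.0, "freshness": 1.0}
top3 = recommend(history, similarity, hotness, {}, weights)
```

Changing `weights` alone, for instance raising "freshness" for a new-releases shelf, yields a different engine from the same trained scores, which is the framework idea on the slide.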
  28. Applications
    o  Personally picked for you
    o  Discovery Recommendations
    o  Shows with similar Genres & Actors, Actresses
  29. Suggestions
    •  Additional useful data sets:
      o  Explicit user ratings are also very important.
      o  Users' contribution data (subtitles).
      o  User and video interaction data (live comments).
    •  Training & testing data:
      o  Should exclude top videos (promoted on the front page or banners).
    •  Evaluation method:
      o  Equal test set splits will give a better overall result (models that work well with Feb 2015 data might not work as well with March 2015 data).
  30. Technology stack
    •  Tableau Public
      o  Free to download
      o  Publicly shared workbooks
      o  Interactive visualizations and insights
    •  Python
      o  Pandas: data analysis library
      o  Scikit-Learn: machine learning library
      o  IPython Notebook: IDE for data analysis
      o  Other libraries:
        •  Spotify's annoy: approximate nearest neighbors calculation
        •  PySpark's MLlib: Spark's machine learning library
        •  panns: approximate nearest neighbors search
        •  python-recsys: recommendation system
  31. Thank you!