Rakuten - Viki Data Challenge Solution.

Rakuten – Viki Challenge Le Nguyen The Dat

About me
o  2010: MSc. Computer Science – Oxford University
o  2011: Research Engineer – A*STAR DSI
o  2013: Data – ZALORA Group
o  2015: Data – Commercialize TV
https://github.com/lenguyenthedat
https://sg.linkedin.com/in/lenguyenthedat

Challenge descriptions https://www.viki.com/

Challenge descriptions http://www.dextra.sg/challenges/rakuten-viki-video-challenge/

Challenge descriptions
Data:
o  (880,000) User attributes (country – gender)
o  (600) Video attributes (country – language – genre – owner – casts)
o  (4,880,000) User viewing behavior (video – user – score)
Task:
o  Recommendation engine - prediction for each user (user – top 3 videos)
o  Insights
Case study:
o  http://www.dextra.sg/wp-content/uploads/2015/09/CaseStudy_Viki.pdf

Useful Links
o  Tableau Public Visualization: http://tiny.cc/viki-viz
o  Source Code: http://tiny.cc/viki-src

Preliminary Analysis

Analysis – Gender

Analysis – Genre

Analysis – Content Owner

Analysis – Videos Traffic

Algorithm Overview

Training phase
Recommendation Engine
o  Videos Overall Performances
   •  Hotness
   •  Freshness
o  Videos Similarity Matrix
   •  Content Similarity
   •  Collaborative Filtering

Training phase
Videos overall performances:

$$\text{Hotness}^{*} = \frac{\sum \text{usersWatched}}{\text{firstDate} - \text{lastDate}}$$

$$\text{Freshness} = \frac{1}{(\text{broadcastDate} - \text{currentDate})^{2}}$$

*With gender filter applied
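A minimal Python sketch of the two performance scores may help make the formulas concrete. It assumes the date differences are measured as an absolute number of days and clamps them to at least one day to avoid division by zero; both choices, and the function and parameter names, are illustrative assumptions rather than details taken from the slides.

```python
from datetime import date


def hotness(users_watched: int, first_date: date, last_date: date) -> float:
    """Viewers per day over the observed viewing window.

    Assumption: the window is the absolute day count between the first
    and last viewing dates, clamped to at least 1 day.
    """
    days = abs((last_date - first_date).days)
    return users_watched / max(days, 1)


def freshness(broadcast_date: date, current_date: date) -> float:
    """Inverse squared age of the video in days.

    Assumption: age is the absolute day count, clamped to at least 1 day.
    """
    days = abs((current_date - broadcast_date).days)
    return 1.0 / max(days, 1) ** 2
```

With these definitions, a video watched by many users over a short window scores high on hotness, while freshness decays quadratically as the broadcast date recedes.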

Training phase
Videos similarity matrix – Content Similarity
o  Original Country: V1.country == V2.country
o  Original Language: V1.language == V2.language
o  Adult Content: (V1.adult == 1) & (V2.adult == 1)
o  Content Owner: V1.contentOwner == V2.contentOwner

Training phase
Videos similarity matrix – Content Similarity
o  Episode Count:

$$\frac{\min(V_1.\text{episodeCount},\ V_2.\text{episodeCount})}{\max(V_1.\text{episodeCount},\ V_2.\text{episodeCount})}$$

o  Genre:

$$J(v_1, v_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$

o  Cast:

$$J(v_1, v_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$$
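The content-similarity features from the slides can be sketched in Python roughly as follows. The dictionary keys (`country`, `language`, `adult`, `owner`, `episodes`, `genres`, `cast`) are hypothetical field names chosen for illustration, and the function returns per-feature scores without combining them, since the slides do not state how the features are weighted.

```python
def jaccard(a, b) -> float:
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def episode_ratio(n1: int, n2: int) -> float:
    """min/max ratio of two episode counts (1.0 when equal)."""
    hi = max(n1, n2)
    return min(n1, n2) / hi if hi else 0.0


def content_similarity(v1: dict, v2: dict) -> dict:
    """Per-feature similarity scores for two video records.

    Field names are assumptions made for this sketch.
    """
    return {
        "country": float(v1["country"] == v2["country"]),
        "language": float(v1["language"] == v2["language"]),
        "adult": float(v1["adult"] == 1 and v2["adult"] == 1),
        "owner": float(v1["owner"] == v2["owner"]),
        "episodes": episode_ratio(v1["episodes"], v2["episodes"]),
        "genres": jaccard(v1["genres"], v2["genres"]),
        "cast": jaccard(v1["cast"], v2["cast"]),
    }
```

Keeping the scores separate leaves room to apply custom weights per feature later.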

Training phase
Videos similarity matrix – Collaborative Filtering
o  Jaccard Index - https://en.wikipedia.org/wiki/Jaccard_index:
   •  Set theory
   •  Ratio of intersection gives similarity score
   •  Sensitive to sparse input – limit to only top 25% videos

$$J(v_1, v_2) = \frac{|U_1 \cap U_2|}{|U_1 \cup U_2|}$$
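As an illustration, the Jaccard-based collaborative filtering step could look like the sketch below. It assumes "top 25% videos" means the quarter of videos with the most viewers, and that the input is a mapping from video id to the set of user ids who watched it; both are assumptions of this sketch, as are all names.

```python
from itertools import combinations


def top_quartile(viewers_by_video: dict) -> list:
    """Ids of the 25% most-watched videos (at least one).

    Assumption: popularity is measured by number of distinct viewers.
    """
    ranked = sorted(viewers_by_video,
                    key=lambda v: len(viewers_by_video[v]),
                    reverse=True)
    return ranked[:max(1, len(ranked) // 4)]


def jaccard_matrix(viewers_by_video: dict) -> dict:
    """Symmetric Jaccard scores between the top-quartile videos.

    Returns a dict keyed by (video_a, video_b) pairs.
    """
    sim = {}
    for a, b in combinations(top_quartile(viewers_by_video), 2):
        ua, ub = viewers_by_video[a], viewers_by_video[b]
        union = len(ua | ub)
        score = len(ua & ub) / union if union else 0.0
        sim[(a, b)] = sim[(b, a)] = score
    return sim
```

Restricting the computation to popular videos keeps the viewer sets large enough for the ratio to be meaningful.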

Training phase
Videos similarity matrix – Collaborative Filtering
o  Cosine Similarity - https://en.wikipedia.org/wiki/Cosine_similarity:
   •  Vector space model
   •  Angle between 2 vectors gives similarity score
   •  Good for sparse input – can apply gender filter

$$\cos(v_1, v_2) = \frac{\vec{U}_1 \cdot \vec{U}_2}{\|\vec{U}_1\|\,\|\vec{U}_2\|} = \frac{\sum_{i=1}^{n} U_{1,i}\, U_{2,i}}{\sqrt{\sum_{i=1}^{n} U_{1,i}^{2}} \times \sqrt{\sum_{i=1}^{n} U_{2,i}^{2}}}$$
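The cosine formula can be sketched for sparse inputs as follows. Each video is represented here as a dictionary from user id to that user's score, so users who never watched the video simply contribute zero; this representation and the function name are assumptions of the sketch.

```python
import math


def cosine(u1: dict, u2: dict) -> float:
    """Cosine similarity of two sparse user-score vectors.

    u1 and u2 map user id -> score; missing users count as 0.
    Returns 0.0 when either vector is all zeros.
    """
    dot = sum(score * u2[user] for user, score in u1.items() if user in u2)
    norm1 = math.sqrt(sum(s * s for s in u1.values()))
    norm2 = math.sqrt(sum(s * s for s in u2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Only users present in both vectors contribute to the dot product, which is why the measure copes well with sparse viewing data.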

Personalization phase
Recommendation Engine
o  Inputs: User History, Videos Overall Performances, Videos Similarity Matrix
o  Output: Personalized Recommendations
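The slides do not spell out how the three inputs are combined, so the following is only one plausible sketch of the personalization phase. It assumes each unwatched video is scored by the similarity-weighted sum of the user's history scores plus a weighted overall-performance term, and that the top 3 are returned as the challenge requires; the scoring rule, `perf_weight`, and all names are hypothetical.

```python
def recommend(history: dict, similarity: dict, performance: dict,
              k: int = 3, perf_weight: float = 0.1) -> list:
    """Top-k unwatched videos for one user.

    history:     video id -> this user's score for that video
    similarity:  (watched id, candidate id) -> similarity score
    performance: video id -> overall performance (e.g. hotness)
    The additive scoring rule is an assumption of this sketch.
    """
    scores = {}
    for candidate, perf in performance.items():
        if candidate in history:
            continue  # never recommend something already watched
        affinity = sum(score * similarity.get((watched, candidate), 0.0)
                       for watched, score in history.items())
        scores[candidate] = affinity + perf_weight * perf
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A user with an empty history falls back to ranking purely by overall performance, which gives a sensible default for new users.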

Performance
Overall time & space complexity: $O(uv^{2})$
o  u: number of users (880,000)
o  v: number of videos (600)
Advantages:
o  Lightweight – fits in 8GB MacBook Air!
o  Scalable (fully distributed with SparklingPandas)

Applications
Flexibility:
o  Custom weightages for:
   •  Features
   •  Collaborative filtering similarity scores
   •  Video performances (hotness or freshness)
   •  Individual User - Video scores
Not just an engine but a framework:
o  To create different recommendation engines.
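To illustrate the idea of custom weightages, the sketch below blends named scores with a configurable weight profile. The normalized weighted average and the names used are assumptions; the slides only state that weights are customizable.

```python
def blend(scores: dict, weights: dict) -> float:
    """Weighted average of named scores.

    scores:  score name -> value (e.g. "genre", "cosine", "hotness")
    weights: score name -> weight; names without a weight count as 0
    Returns 0.0 if no score carries any weight.
    """
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        return 0.0
    weighted = sum(weights.get(name, 0.0) * value
                   for name, value in scores.items())
    return weighted / total
```

Swapping the weight profile (for example, emphasizing genre and cast for "similar shows", or hotness for a trending list) yields a different recommendation engine from the same building blocks.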

Applications
o  Personally picked for you:
o  Discovery Recommendations:
o  Shows with similar Genres & Actors, Actresses:

Suggestions
•  Additional useful data sets:
   o  Explicit user rating is also very important.
   o  User’s contributions data (subtitles).
   o  User’s and video’s interactions data (live comments).
•  Training & Testing data:
   o  Should exclude top videos. (Promoted on front-page or banners.)
•  Evaluation method:
   o  Equal test set splits will give an overall better result. (Models that work well with Feb 2015 data might not work very well with March 2015 data)

Technology stack
•  Tableau Public
   o  Free to download
   o  Publicly shared workbooks
   o  Interactive visualizations and insights
•  Python
   o  Pandas: data analysis library
   o  Scikit-Learn: machine learning library
   o  IPython Notebook: IDE for data analysis
   o  Other libraries:
      •  Spotify’s annoy: approx. nearest neighbors calculation
      •  PySpark’s MLlib: Spark’s machine learning library
      •  panns: approx. nearest neighbors search
      •  python-recsys: recommendation system

Thank you!
