My day KKBOX
Survival Log as a Data Scientist
羅經凱, Assistant Manager, KKStream Data Analytics, Nov. 28, 2017
"A Data Scientist's 1000-Day Survival Log"
National Central University, CE6143 - Introduction to Data Science
Slide 2
Slide 2 text
KKBOX since 2004
Music streaming service operated
in Taiwan, Hong Kong, Japan,
Singapore, and Malaysia.
We have rich content (40M high
quality songs) and deliver
customers personalized
experiences.
Slide 3
Slide 3 text
Next wave: Video since 2016
B2C service, aim to provide new
TV experience.
B2B service, cloud based video
solutions to provide the best video
experiences that engage your
valuable customers on every
screen.
Slide 4
Slide 4 text
Me
• 2010, KDD Cup Champion as a member
• 2014, NTU EE PhD.
• 2014, KKBOX Data scientist
• 2015, KKBOX DS team lead
• 2017, KKStream Data team lead
Slide 5
Slide 5 text
What I do in KKBOX
• [Consultant] Support business decision making
• strategy to expand video content library?
a trade off between price and customer satisfaction
• [Developer] Enhance product perceived quality
• system to deliver personalized experience?
recommender system, notification optimization, …
3 Stages of Me
Fledgling: how I dig out insight
Collaborating: how I work with others
Advocating: how I ask others to enjoy us
Slide 13
Slide 13 text
No content
Slide 14
Slide 14 text
Collaborating
Data Aggregating
Data Mining
Application development
Slide 15
Slide 15 text
Proof of Concept
Slide 16
Slide 16 text
Performance Eval.
Slide 17
Slide 17 text
Fledgling
Fledgling: how I dig out insight
Slide 18
Slide 18 text
User Behavior Analytics
first step to explore
Slide 19
Slide 19 text
How do people discover new music?
Slide 20
Slide 20 text
How do people discover new music?
Survey in HK, 2014:
52.5% (31% in 2004) by popularity
57.4% by recommendations from other people
Sources:
Xiao Hu, Jin Ha Lee and Leanne Ka Yan Wong (2014), Music Information Behaviors and System
Preferences of University Students in Hong Kong
[Citation 174] JH Lee, JS Downie (2004), Survey of music information needs, uses, and seeking
behaviours: preliminary findings
Slide 21
Slide 21 text
Social influence is great,
and so is popularity.
Slide 22
Slide 22 text
No content
Slide 23
Slide 23 text
Technology
always comes from laziness
Slide 24
Slide 24 text
[Figure: play count versus song number (proportion of songs), with curves for 2004, 2008, and 2015]
Do users listen regularly?
Trace: users who purchase with mycard credits
Y-axis: listening time
X-axis: the 168 hours of a week
[Figure: User A (User 67158956), usage over the hours in a week, with Mon, Wed, Fri and a 24hr interval marked]
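A weekly trace like the one described above can be built by folding play timestamps into the 168 hours of a week. The following is a minimal sketch, assuming a hypothetical log of (start time, seconds listened) records; the record layout and the `weekly_profile` name are illustrative and do not come from the slides.

```python
from datetime import datetime

def weekly_profile(plays):
    """Fold (start_time, seconds) play records into 168 hourly bins.

    Bin 0 is Monday 00:00-00:59 and bin 167 is Sunday 23:00-23:59.
    For simplicity, each play is credited to the hour in which it starts.
    """
    bins = [0.0] * 168
    for start, seconds in plays:
        hour_of_week = start.weekday() * 24 + start.hour
        bins[hour_of_week] += seconds
    return bins

# Hypothetical plays (made-up data): two Monday mornings and one Wednesday evening.
plays = [
    (datetime(2017, 11, 20, 8, 5), 1200),   # Monday 08:05
    (datetime(2017, 11, 27, 8, 10), 900),   # the following Monday 08:10
    (datetime(2017, 11, 22, 18, 30), 600),  # Wednesday 18:30
]
profile = weekly_profile(plays)
```

A user who listens regularly shows the same bins filling up week after week, while an irregular user spreads usage across many bins.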
Slide 41
Slide 41 text
[Figure: weekly usage traces (x-axis: hours in a week, y-axis: usage) for Users 67158956, 8729390, 21570083, 21566513, 21574953, 9058153, 69277857, 11757913, and 44551330, labeled Regular and Irregular]
[Figure: daily usage profiles (x-axis: hours in a day, y-axis: usage) for nine user groups]
Group 1: 5.8%
Group 2: 7.3%
Group 3: 11.8%
Group 4: 16.0%
Group 5: 12.8%
Group 6: 13.4%
Group 7: 14.2%
Group 8: 12.4%
Group 9: 6.3%
Multiple lifestyles
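The slides do not say how the nine groups were obtained. One plausible way to produce groups like these is to normalize each user's 24-hour profile and cluster the shapes. The sketch below uses plain k-means with a deterministic farthest-first initialization; the choice of k-means, the function names, and the synthetic profiles are all assumptions for illustration.

```python
def normalize(profile):
    """Scale a 24-bin daily profile to sum to 1, so shape matters, not volume."""
    total = sum(profile)
    return [v / total for v in profile] if total else list(profile)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Lloyd's k-means with farthest-first initialization; returns (centroids, labels)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: mean of the members; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def commuter(scale):
    """Synthetic profile (made up): short peaks at 08:00 and 18:00."""
    p = [1.0] * 24
    p[8] += 10 * scale
    p[18] += 10 * scale
    return p

def office(scale):
    """Synthetic profile (made up): a plateau from 10:00 to 18:00."""
    p = [1.0] * 24
    for h in range(10, 18):
        p[h] += 3 * scale
    return p

users = [commuter(s) for s in (1.0, 1.2, 0.8)] + [office(s) for s in (1.0, 1.2, 0.8)]
points = [normalize(u) for u in users]
centroids, labels = kmeans(points, k=2)
# Share of users per group, comparable to the percentages on the slide.
shares = {c: labels.count(c) / len(labels) for c in range(2)}
```

Each centroid is an average daily curve for its group, which is what lets a group be read as a lifestyle.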
Slide 44
Slide 44 text
Commuters
[Figure: daily usage profiles (x-axis: hours in a day, y-axis: usage) for Group 2 (7.3%) and Group 5 (12.8%), showing average and median]
Usage peaks at 8 a.m. and 6 p.m.
Peaks are short, lasting only 20 to 30 minutes
Slide 45
Slide 45 text
Office workers
[Figure: daily usage profiles (x-axis: hours in a day, y-axis: usage) for Group 4 (16.0%) and Group 7 (14.2%)]
Peak usage runs from 10:00 to 18:00
Peaks are long, lasting 4 to 5 hours
Slide 46
Slide 46 text
Popularity
Time-sensitive
Slide 47
Slide 47 text
Data Visualization
preference as vectors
Slide 48
Slide 48 text
How do you describe preference?
• Latent Representation
• A multi-dimensional vector, learned from
the crowd, that places each object at a point
in a latent space
• The similarity between two objects is
reflected in their distance in the latent
space
Slide 49
Slide 49 text
Word to Vector
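• Words are represented as points in space, with the following properties:
• Words with similar meanings lie close together
• The relative directions between words preserve their meaning, so vector arithmetic works
• King - Man + Woman = Queen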
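The vector arithmetic above can be tried on a toy example. The sketch below uses hand-made 2-D vectors (first coordinate roughly "royalty", second roughly "gender"); these values are invented for illustration and are not learned embeddings, and the `analogy` helper is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (assumption, not trained).
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "song": [-0.8, 0.1],
}
result = analogy(vectors, "king", "man", "woman")
```

With real embeddings the result is only approximately the expected word, which is why the nearest neighbour is searched rather than looked up exactly.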
Slide 50
Slide 50 text
Listening history → text paragraphs
Slide 51
Slide 51 text
Music Experience as Words
• A continuous listening session is like a sentence.
• Both "listeners" and "songs" are treated as words.
user song user song
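The "user song user song" layout can be sketched as follows: plays are grouped per user, split into sessions, and each session becomes one sentence of alternating user and song tokens. The 30-minute gap used to split sessions, the token prefixes, and the `to_sentences` name are assumptions; the slides do not specify them.

```python
from collections import defaultdict

def to_sentences(plays, gap_seconds=1800):
    """Turn (user, song, unix_time) plays into token sentences, one per session.

    A new session starts when the gap between two consecutive plays of the
    same user exceeds gap_seconds (an assumed threshold).
    """
    by_user = defaultdict(list)
    for user, song, ts in plays:
        by_user[user].append((ts, song))
    sentences = []
    for user, events in by_user.items():
        events.sort()  # chronological order within each user
        sentence, last_ts = [], None
        for ts, song in events:
            if last_ts is not None and ts - last_ts > gap_seconds:
                sentences.append(sentence)
                sentence = []
            # Interleave the listener and the song, as on the slide.
            sentence += [f"user:{user}", f"song:{song}"]
            last_ts = ts
        if sentence:
            sentences.append(sentence)
    return sentences

# Made-up plays for two users.
plays = [("u1", "s1", 0), ("u1", "s2", 200), ("u1", "s3", 10000), ("u2", "s2", 50)]
sentences = to_sentences(plays)
```

Sentences in this form can be fed to any word-embedding trainer, so that users and songs end up in the same vector space.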
Slide 52
Slide 52 text
Constructing DeepWalk Graph
Slide 53
Slide 53 text
Including Session
Slide 54
Slide 54 text
Multiple Sessions
Slide 55
Slide 55 text
Multiple Users
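The graph figures did not survive extraction, so the exact construction is unknown. As a rough sketch of the DeepWalk idea, the code below links each user to the songs they played and generates truncated random walks over that graph; the walks then play the role of sentences for a skip-gram model (training is omitted). The user-song-only graph, the function names, and the parameters are assumptions, and session nodes are left out.

```python
import random
from collections import defaultdict

def build_graph(plays):
    """Undirected user-song graph as adjacency lists.

    Repeated plays add repeated edges, so frequently played songs are
    visited more often by the random walks.
    """
    adj = defaultdict(list)
    for user, song in plays:
        adj[user].append(song)
        adj[song].append(user)
    return adj

def random_walks(adj, walk_length=6, walks_per_node=2, seed=0):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

# Made-up plays: two users who share one song.
plays = [
    ("user:A", "song:1"),
    ("user:A", "song:2"),
    ("user:B", "song:2"),
    ("user:B", "song:3"),
]
adj = build_graph(plays)
walks = random_walks(adj)
```

Because user:A and user:B both connect to song:2, walks can cross from one user's songs to the other's, which is how co-listening pulls related users and songs together in the latent space.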
Slide 56
Slide 56 text
In 2-D Latent Space
Users
Songs
蘇打綠
陳綺貞
五月天
John Mayer
OneRepublic
Maroon 5
Slide 57
Slide 57 text
Data → network → vectors
Slide 58
Slide 58 text
Visualisation Framework
• Global Trend
• Album clusters,
• Artist clusters,
• …
• Individual Preference
• Diversity of preference
• Factors related to preference
• …
Slide 59
Slide 59 text
Representation (TW)
Slide 60
Slide 60 text
An Example
Slide 61
Slide 61 text
Relaxing songs
Japanese drama songs
Western drama songs
Mandarin drama songs
Slide 62
Slide 62 text
Considering time
(session)
Slide 63
Slide 63 text
Day and Night
Slide 64
Slide 64 text
Relaxing songs for baby
Mandarin pop songs
Slide 65
Slide 65 text
Considering device
Slide 66
Slide 66 text
Account sharing?
Slide 67
Slide 67 text
Korean and Western pop songs
Mandarin old songs
Slide 68
Slide 68 text
Personal Preference
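• There are both users with a single music preference and users with several music preferences.
• Account sharing among multiple people can be detected.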
Slide 69
Slide 69 text
More applications
Slide 70
Slide 70 text
song / artist / genre
Slide 71
Slide 71 text
Advocating: how I ask others to enjoy us
Advocating
Slide 72
Slide 72 text
The easiest way to win an
argument: help the other person see things
from your perspective
Slide 73
Slide 73 text
Data + Game →
Raise awareness of, and interest in,
the data we have…
This 14-day game has
63 teams
81 players
334 downloads
835 submissions
Slide 76
Slide 76 text
Gains & Insightful
findings
from this public game
Slide 77
Slide 77 text
Happy
Boss :)
Hardworking
Boss :)
Slide 78
Slide 78 text
Champion’s secret sauce
Slide 79
Slide 79 text
First-step Observation
In the training dataset,
27% of customers’ labels = the last title seen in their view history
37% of customers’ labels = a title that appeared in their view history
18% of customers’ labels = a title that never appeared in the training set
Slide 80
Slide 80 text
Naïve Baseline
Just fill in the last title id in each
individual’s view history
You get 27%, which ranks 20th
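The naïve baseline is small enough to sketch in a few lines. The data layout (a dict from customer to an ordered list of viewed title ids) and the function names are assumptions, and the toy data is made up; the 27% figure comes from the competition data, not from this example.

```python
def last_seen_baseline(history):
    """Predict each customer's label as the last title id in their view history."""
    return {customer: views[-1] for customer, views in history.items() if views}

def hit_rate(predictions, labels):
    """Fraction of customers whose predicted title id equals the true label."""
    hits = sum(1 for customer, label in labels.items() if predictions.get(customer) == label)
    return hits / len(labels)

# Made-up view histories (oldest first) and true labels.
history = {"c1": [10, 11, 12], "c2": [7, 7, 3], "c3": [5], "c4": [1, 2]}
labels = {"c1": 12, "c2": 7, "c3": 99, "c4": 2}

predictions = last_seen_baseline(history)
score = hit_rate(predictions, labels)
```

A baseline like this costs almost nothing to build and gives every later model a number to beat.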