My 1000-day KKBOX Survival Log as a Data Scientist

Jing-Kai Lou
December 18, 2017

A talk on how I survive in a company as a data scientist, from contributing individually to cultivating a data culture.

Transcript

  1. My 1000-day KKBOX Survival Log as a Data Scientist. Jing-Kai Lou (羅經凱), Assistant Manager, KKStream Data Analytics, Nov. 28, 2017
    "A Data Scientist's 1000-Day Survival Log", National Central University, CE6143 - Introduction to Data Science
  2. KKBOX, since 2004
    A music streaming service operating in Taiwan, Hong Kong, Japan, Singapore, and Malaysia. We have rich content (40M high-quality songs) and deliver personalized experiences to customers.
  3. Next wave: Video, since 2016
    B2C service: aims to provide a new TV experience.
    B2B service: cloud-based video solutions to provide the best video experiences that engage your valuable customers on every screen.
  4. Me
    • 2010, KDD Cup Champion as a member
    • 2014, NTU EE PhD
    • 2014, KKBOX Data scientist
    • 2015, KKBOX DS team lead
    • 2017, KKStream Data team lead
  5. What I do in KKBOX
    • [Consultant] Support business decision making
      • Strategy to expand the video content library? A trade-off between price and customer satisfaction
    • [Developer] Enhance the product's perceived quality
      • System to deliver a personalized experience? Recommender system, notification optimization, …
  6. 3 Stages of Me
    Fledgling: how I dig out insight
    Collaborating: how I work with others
    Advocating: how I ask others to enjoy us
  7. How does everyone discover new music?
    Survey in HK, 2014: 52.5% (31% in 2004) by popularity; 57.4% by recommendations from other people.
    Xiao Hu, Jin Ha Lee and Leanne Ka Yan Wong (2014), Music Information Behaviors and System Preferences of University Students in Hong Kong
    [Citation 174] JH Lee, JS Downie (2004), Survey of music information needs, uses, and seeking behaviours: preliminary findings
  8. Social influence is great, and so is popularity.
    How does everyone discover new music? Survey in HK, 2014: 52.5% (31% in 2004) by popularity; 57.4% by recommendations from other people.
    Xiao Hu, Jin Ha Lee and Leanne Ka Yan Wong (2014), Music Information Behaviors and System Preferences of University Students in Hong Kong
    [Citation 174] JH Lee, JS Downie (2004), Survey of music information needs, uses, and seeking behaviours: preliminary findings
  9. [Figure: axes labelled "play count" (0 to 10000) and "song number" (0.0 to 1.0), annotated as play count and proportion of songs; series for 2004, 2008, and 2015.]
  10. [Figure: 2-D scatter plot with axes dim1 and dim2 (each −50 to 50), coloured by cluster, 1 to 15.]
  11. [Figure: the same dim1/dim2 plot of clusters 1 to 15, with one cluster annotated: 張學友, 張宇, 信樂團, …]
  12. [Figure: the same dim1/dim2 plot of clusters 1 to 15, with artist annotations: 張學友, 張宇, 信樂團, 范逸臣, 陶吉吉, 蕭… / 彭佳慧, 齊秦, 杜德偉, 周杰倫, 陳零九, 無印良品, 嚴爵 / MC Hot Dog, 張震嶽, 謝和弦 / MP魔幻力量, 黃鴻…]
  13. [Figure: the plot of clusters 1 to 15.] Does not jump far outside the echo chamber.
  14. Two Different Subjects
    [Figure: two heatmaps of activity counts ("acts") by weekday (Monday to Sunday) and hour (00 to 23), for Subject A and Subject B. Annotations: late-night hours; working hours.]
  15. Do users listen regularly?
    Trace: users who purchase with MyCard credits. Y-axis: listening time. X-axis: the 168 hours in a week.
    [Figure: usage over the hours in a week for User 67158956 (User A), with ticks at Mon, Wed, Fri and a 24hr marker.]
  16. [Figure: nine panels of usage over the hours in a week, for Users 67158956, 8729390, 21570083, 21566513, 21574953, 9058153, 69277857, 11757913, and 44551330; panels are labelled as regular or irregular.]
  17. Most users are periodic.
    [Figure: a grid of per-user period labels; nearly all read 24 hr, with a few exceptions of 16 hr, 23 hr, 25 hr, and 26 hr.]
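The slide does not say how each user's period was estimated. As a minimal sketch, assuming an autocorrelation approach over hourly usage (the function names and the toy user below are hypothetical, not taken from the talk), a dominant period could be found like this:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series has no periodic structure to measure
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, min_lag=2, max_lag=48):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    max_lag = min(max_lag, len(series) - 1)
    return max(
        range(min_lag, max_lag + 1),
        key=lambda lag: autocorrelation(series, lag),
    )


# Hypothetical user: 60 minutes of listening at 08:00 and at 18:00,
# every day of a 168-hour week.
hourly_usage = [60 if hour % 24 in (8, 18) else 0 for hour in range(168)]
period = dominant_period(hourly_usage)
print(f"{period} hr")
```

Real traces are noisier than this toy series, so a production version would likely need smoothing and a threshold below which a user is labelled irregular.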
  18. Multiple lifestyles
    [Figure: nine panels of usage over the hours in a day (0, 6, 12, 18), one per group. Group 1: 5.8%, Group 2: 7.3%, Group 3: 11.8%, Group 4: 16.0%, Group 5: 12.8%, Group 6: 13.4%, Group 7: 14.2%, Group 8: 12.4%, Group 9: 6.3%.]
  19. Commuters: Group 2 (7.3%) and Group 5 (12.8%)
    • Usage peaks fall at 8 a.m. and 6 p.m.
    • Peaks are short, lasting only 20 to 30 minutes
    [Figure: usage over the hours in a day for both groups, with average and median curves.]
  20. Office workers: Group 4 (16.0%) and Group 7 (14.2%)
    • The usage peak runs from 10:00 to 18:00
    • The peak is long, lasting 4 to 5 hours
    [Figure: usage over the hours in a day for both groups.]
  21. How you describe preference
    • Latent Representation
      • A multi-dimensional vector, learned from the crowd, specifies a point in a latent space
      • The similarity between two objects is reflected in their distance in the latent space
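To make the distance idea concrete, here is a minimal sketch assuming Euclidean distance and three hand-written, hypothetical latent vectors. The talk does not state which distance measure or embedding model was used, and real vectors would be learned from listening data rather than typed in.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two points in the latent space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(target, vectors):
    """Return the other keys, ordered from closest to farthest from `target`."""
    return sorted(
        (name for name in vectors if name != target),
        key=lambda name: euclidean_distance(vectors[target], vectors[name]),
    )


# Hypothetical 3-dimensional latent vectors, for illustration only.
latent = {
    "artist_a": [0.9, 0.1, 0.0],
    "artist_b": [0.8, 0.2, 0.1],
    "artist_c": [-0.7, 0.9, 0.3],
}

ranking = nearest("artist_a", latent)
print(ranking)
```

Here `artist_b` sits closer to `artist_a` than `artist_c` does, so it ranks first as the more similar object.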
  22. Visualisation Framework
    • Global Trend
      • Album clusters
      • Artist clusters
      • …
    • Individual Preference
      • Diversity of preference
      • Factors related to preference
      • …
  23. First-step Observation
    In the training dataset:
    • 27% of customers' labels = the last title seen in their view history
    • 37% of customers' labels = a title that appeared in their view history
    • 18% of customers' labels = a title that never appeared in the training set
  24. Naïve Baseline
    Just fill in the last title id in the view history for each individual. You get 27%, namely, rank 20th.
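The baseline above can be sketched in a few lines. This is a minimal illustration on made-up data: the user ids, title ids, and the use of plain accuracy as the score are assumptions, since the transcript does not name the dataset schema or the competition metric.

```python
def last_view_baseline(view_history):
    """Predict, for each user, the last title id in that user's view history."""
    return {user: titles[-1] for user, titles in view_history.items() if titles}


def accuracy(predictions, labels):
    """Fraction of labelled users whose prediction equals the true label."""
    hits = sum(
        1 for user, label in labels.items() if predictions.get(user) == label
    )
    return hits / len(labels)


# Hypothetical toy data: each user's viewed title ids in chronological order.
view_history = {
    "u1": ["t10", "t11", "t12"],
    "u2": ["t20", "t21"],
    "u3": ["t30"],
    "u4": ["t40", "t41"],
}
# Hypothetical labels: the title each user actually watched next.
labels = {"u1": "t12", "u2": "t20", "u3": "t99", "u4": "t41"}

predictions = last_view_baseline(view_history)
score = accuracy(predictions, labels)
print(score)
```

In this toy data, `u2` rewatches an earlier title and `u3` watches a title never seen before, so the baseline misses both. Those are the same kinds of cases the First-step Observation slide quantifies.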