
Building Data Culture in your workplace

Jing-Kai Lou

July 13, 2017


Transcript

  1. Self Introduction
     • 羅經凱, Jing-Kai Lou
     • Mining knowledge from data and turning it into decisions.
     • Data Scientist, KKStream, 2016 - now
     • Data Scientist, KKBOX Research Center, 2014 - 2016
     • PhD, NTU EE, 2014
     • Ronald Coase: “If you torture the data long enough, it will confess.”
  2. Smart content purchase (smart expansion of the title library)
     • Slide labels: Communicate, Observe & Measure, Analysis & Forecast
     • A way to find niche markets
     • How our titles perform
     • Which kinds are alluring
     • How to measure satisfaction instead of popularity?
     • How users binge-watch
     • This measure leads us to find fascinating titles
     • Better understanding of user preferences
  3. Precise marketing (getting the marketing timing right)
     • Slide labels: Communicate, Analysis & Forecast, Observe & Measure
     • At an early stage, know which titles to promote and which to abandon
     • Which titles to promote?
     • Give exposure to the ones likely to catch viewers' eyes
     • Monitor the watch hours
     • Build a trend forecast model (based on the first 3 weeks of performance)
  4. Product Optimization
     • Slide labels: Communicate, Observe & Measure, Analysis & Forecast
     • Be confident when releasing a better version
     • Which recommender system performs better?
     • CTR of related items
     • Apply A/B testing to tell which one is better
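One common way to run the A/B comparison of CTR mentioned on this slide is a two-proportion z-test. The sketch below is an assumption about how such a comparison could be done, not the method used in the talk; the function name and the click and view counts are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the CTR difference of variant B minus variant A."""
    ctr_a = clicks_a / views_a
    ctr_b = clicks_b / views_b
    # Pooled CTR under the null hypothesis that both variants share one CTR.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (ctr_b - ctr_a) / std_err
    # Standard normal CDF expressed with math.erf.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(clicks_a=200, views_a=10_000,
                             clicks_b=260, views_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the CTR gap between the two recommender variants is unlikely to be noise; the threshold to act on is a product decision.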
  5. Slide labels: Communicate, Analysis & Forecast, Observe & Measure
     • Making decisions
     • Verifying hypotheses
     • Asking the right questions
     • Acquiring the needed data
     • Making hypotheses
     • Finding patterns
     • Building predictive models
  6. The problem: what do we do with data? The solution: train your org. to adopt a data culture
     • People, Process, Product
     • Data access
     • Regular meetings
     • Reporting structure
     • Set measurable goals
     • Act on insights
     • Utilize analytics and productivity tools
  7. 2 Factors for Effectively Using Data
     • Selecting the right data
     • Building the capacity, competence, and confidence of staff to effectively use data
  8. Data democratization: Data democratization is the ability for information in a digital format to be accessible to the average end user. The goal of data democratization is to allow non-specialists to gather and analyze data without requiring outside help.
  9. The benefits of giving company-wide access to data outweigh the costs
     • Facebook was one of the first companies to give its employees access to data at scale.
     • The World Bank makes its data open so volunteers can come together to clean and interpret it.
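  10. Accuracy / Diversity / Novelty
      • Accuracy: broadly, whether we can predict which title a user will watch next from their past history. For example, given six months of data, we hide the last two months and predict what happens in those two months using only the first four.
      • Diversity: recommendations should avoid catering to a single taste and should make full use of the catalog we have (coverage). I may know that this user likes superhero titles, but I cannot recommend only superhero movies forever.
      • Novelty: the new items we surface (new series, or titles that few people have watched) should trigger a positive emotional response in the user. This is currently the hardest part, and it is one of the topics everyone is working on. (Researchers argue that relying on CTR alone exaggerates the effect of a recommender system; see the linked article.)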
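The accuracy check described on this slide (hide the last two of six months, predict from the first four) amounts to a time-based holdout split. Below is a minimal sketch, assuming events are stored as (user, title, date) tuples; the function name, the tuple layout, and the sample events are hypothetical.

```python
from datetime import date


def split_by_cutoff(events, cutoff):
    """Split (user, title, day) events into history (before cutoff) and held-out (on or after)."""
    history = [event for event in events if event[2] < cutoff]
    held_out = [event for event in events if event[2] >= cutoff]
    return history, held_out


# Hypothetical six months of viewing events, for illustration only.
events = [
    ("user 1", "title A", date(2017, 1, 15)),
    ("user 1", "title B", date(2017, 3, 2)),
    ("user 2", "title A", date(2017, 4, 20)),
    ("user 1", "title C", date(2017, 5, 9)),
    ("user 2", "title B", date(2017, 6, 1)),
]

# Keep the first four months as history; hide the last two for evaluation.
history, held_out = split_by_cutoff(events, cutoff=date(2017, 5, 1))
print(len(history), len(held_out))
```

Splitting by time rather than at random keeps future views out of the history that the model learns from.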
  11. Training set, Training labels, Testing set, Testing labels; Submission, Testing labels, Public Score, Private Score
      • Evaluation: accuracy
      • (user 1, title A) (user 2, title B) (user 2, title B): Acc. = 0.33
      • You can make another guess depending on your previous ones
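The accuracy metric on this slide can be sketched as the fraction of users whose submitted title matches the hidden label. The dictionaries below are hypothetical and only chosen so that one of three guesses is right, which reproduces an accuracy of 0.33; the function name is likewise an assumption.

```python
def accuracy(predictions, labels):
    """Fraction of labeled users whose predicted title equals the true label."""
    correct = sum(
        1 for user, title in labels.items() if predictions.get(user) == title
    )
    return correct / len(labels)


# Hypothetical labels and submission, for illustration only.
labels = {"user 1": "title A", "user 2": "title B", "user 3": "title C"}
submission = {"user 1": "title A", "user 2": "title C", "user 3": "title A"}

print(round(accuracy(submission, labels), 2))  # 0.33
```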
  12. This 14-day game has 63 teams, 81 players, 334 downloads, and 835 submissions. Internal champion.
  13. First-step Observation: in the training dataset,
      • for 27% of customers, the label is the last title in their view history
      • for 37% of customers, the label is a title that appeared in their view history
      • for 18% of customers, the label is a title that never appeared in the training set
  14. Naïve Baseline: just fill in the last title id in each individual's view history. You get 27%, which ranks 20th.
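The naive baseline is short enough to sketch directly. The code assumes each user's view history is a chronologically ordered list of title ids; the function name and the sample histories are hypothetical.

```python
def last_title_baseline(view_history):
    """Predict, for each user, the last title in their chronologically ordered history."""
    return {
        user: titles[-1]
        for user, titles in view_history.items()
        if titles  # users with an empty history get no prediction
    }


# Hypothetical view histories, for illustration only.
view_history = {
    "user 1": ["title A", "title A", "title B"],
    "user 2": ["title C"],
    "user 3": [],
}

print(last_title_baseline(view_history))
```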
  15. Transition Matrix: in the training data, we observe how users move from title to title over time. Example titles shown: 甄嬛傳, 甄嬛傳, 甄嬛傳, 琅琊榜, 月薪嬌妻
  16. Benefits from Transition Matrix: collaborative filtering and matrix factorization do not reflect our finding (the last title viewed is the answer in most cases), whereas the transition matrix method supports it. Based on that, we are highly confident that we can push the score above the baseline. 0.27421
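A transition matrix of this kind can be sketched as counts of consecutive title pairs, with the prediction being the most frequent follower of the user's last title. This is an illustrative reading of the slides, not the speaker's actual code; the function names and the histories are hypothetical. Repeated views of the same title count as self-transitions, which is how the method stays consistent with the "last title" finding.

```python
from collections import Counter, defaultdict


def build_transition_counts(view_history):
    """Count how often each title is immediately followed by each other title."""
    counts = defaultdict(Counter)
    for titles in view_history.values():
        for current, following in zip(titles, titles[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, last_title):
    """Return the most frequent follower of last_title, or last_title if it was never followed."""
    followers = counts.get(last_title)
    if not followers:
        return last_title  # fall back to the naive baseline
    return followers.most_common(1)[0][0]


# Hypothetical ordered view histories, for illustration only.
view_history = {
    "user 1": ["A", "A", "A", "B"],
    "user 2": ["A", "A", "C"],
}

counts = build_transition_counts(view_history)
print(dict(counts["A"]))
print(predict_next(counts, "A"))
```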
  17. Next Observation: treating this as a purely sequential problem overlooks the time spent on each title. We find that individuals spend time differently on titles. Some titles they view for no longer than 5 minutes and never watch again. Longer time spent = favorite.
  18. XGBoost: a decision-tree-based model commonly used in competitions. It runs fast, so you can iterate on the data quickly. We formulate the question as a multi-class classification problem. Using the time each customer spent on titles as the feature, we take the top 40 titles plus 1 extra class (the "other") as the labels to classify.
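The label setup described here (top 40 titles plus one "other" class) can be sketched without the classifier itself. The code below only builds the class ids and a simple time-spent feature vector; the function names, the feature layout, and the sample data are assumptions, and training an XGBoost model on these arrays is left out.

```python
from collections import Counter


def build_label_map(target_titles, top_n=40):
    """Map the top_n most frequent target titles to class ids; all others share one extra id."""
    top_titles = [title for title, _ in Counter(target_titles).most_common(top_n)]
    label_map = {title: class_id for class_id, title in enumerate(top_titles)}
    other_class = len(label_map)  # single catch-all class for every remaining title
    return label_map, other_class


def encode_label(title, label_map, other_class):
    """Return the class id of a title, falling back to the 'other' class."""
    return label_map.get(title, other_class)


def time_features(seconds_by_title, label_map):
    """One feature per top title: total seconds the customer spent on it (assumed layout)."""
    features = [0.0] * len(label_map)
    for title, seconds in seconds_by_title.items():
        if title in label_map:
            features[label_map[title]] += seconds
    return features


# Hypothetical target titles and watch times, for illustration only.
targets = ["A", "A", "A", "B", "B", "C"]
label_map, other_class = build_label_map(targets, top_n=2)
print(label_map, other_class)
print(encode_label("C", label_map, other_class))
print(time_features({"A": 3600, "C": 120}, label_map))
```

With `top_n=40` this yields 41 classes, matching the slide's "top 40 titles and 1 (as the other)".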
  19. How to select your submissions
      • Offline evaluation
      • Do NOT overfit your results!
      • Cross-validation would be your best friend.
      • Training set, Training set, Validation set
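The offline evaluation advice can be illustrated with a plain k-fold split, where each fold takes a turn as the validation set while the rest is used for training. This is a minimal sketch with a hypothetical helper name; it uses contiguous folds and no shuffling, which is an assumption rather than something stated in the talk.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# Three folds over ten samples: two parts train, one part validates, in rotation.
for train, validation in k_fold_indices(n_samples=10, k=3):
    print(len(train), validation)
```

Averaging the validation score over the folds gives a steadier signal for choosing submissions than the public leaderboard alone.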