[Paper introduction] A Method to Anonymize Business Metrics to Publishing Implicit Feedback Datasets (RecSys 2020) / recsys20-reading-gunosy-datapub

ysekky
October 18, 2020

I'm the author of this paper.
This presentation also covers the research process and the collaboration with my co-author.

Transcript

  1. Gunosy Inc., Gunosy Tech Lab, Senior Researcher Yoshifumi Seki. October 27, 2020. RecSys 2020 reading group:

    A Method to Anonymize Business Metrics to Publishing Implicit Feedback Datasets
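  2. (C) Gunosy Inc. All Rights Reserved. PAGE | 2 Self-introduction

    ▪ Yoshifumi Seki
      – Born in 1987
      – Toyama National College of Maritime Technology (Information Engineering) -> University of Tokyo (Systems Innovation) -> University of Tokyo (Technology Management)
      – Doctor of Engineering (March 2017)
      – 2011 MITOU Creator
      – MITOU Junior mentor (since 2017)
    ▪ Gunosy Inc.
      – Gunosy Tech Lab, R&D team
      – Senior Researcher
      – Co-founder
    ▪ Areas of expertise
      – Recommender systems
      – User behavior analysis
    ▪ Hobbies
      – Baseball, tennis
      – Idols, anime, manga
      – Shogi, board games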
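  3. Research motivation

    The problems we run into in day-to-day news recommendation work are not treated as very important by the recommender systems community.
    • The difficulty of recommending new items
    • Coping with sudden changes in item popularity trends
    • The high freshness required of recommendation lists
    • To bring these problems to the recommender systems community, we need to start by publishing the data itself
    I want Japanese companies to publish more data.
    • I often feel that companies are not responding to the enthusiasm of students who want to build recommender systems
    • I want data that students who want to study recommender systems can actually work with
    I had long wanted to publish Gunosy's data.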
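  4. Research motivation

    https://twitter.com/YoshifumiSeki/status/1181416279967580160
    The barriers to publishing a dataset are privacy and the leakage of business information.
    • For privacy, ultimately, isn't it enough to include no information related to it?
    • There is a lot of research on de-anonymization of personal information
    As far as I could find, no work dealt with the leakage of business information.
    • If that could be handled, couldn't we publish data?
    • Couldn't that proposal itself be a research contribution?
    I wrote something along these lines on Twitter.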
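  5. Research motivation

    https://twitter.com/tmaehara/status/1181427511294644231
    I got an indirect reply (a subtweet) from Maehara-san.
    • We had never met at this point
    • We followed each other on Twitter
    • I recall him praising me in a subtweet around the time of my KDD acceptance the previous year
    The study he mentioned is work by Maehara-san's team that was accepted at AAAI-19.
    • It samples users' attribute information to build training data for fairness
    • The problem setting does look similar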
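  6. Research process

    Thinking about what must be protected
    • Personal information
      ◦ Naturally, no such information is included
    • Attribute information (gender, age, etc.)
      ◦ This kind of information has been pointed out to lead to de-anonymization, so it is not included
    • Article information is not published
      ◦ We are in the position of borrowing the articles from media companies
      ◦ It can also lead to de-anonymization
    • Conclusion: publishing IDs only is not a problem
    • We do not want business KPIs to be reconstructed, but listing every KPI is impossible
    • With random sampling, averages can be recovered
    • If the number of clicks per user is made to follow a specific distribution, wouldn't all the other information be distorted as well?
    (Figure: users, media, Gunosy)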
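  7. Research process

    I wanted to do this, so I worked hard on internal coordination.
    • Consultation with the legal team
      ◦ To make reconstruction impossible, we decided not to publish user attribute information
      ◦ This is how the fairness perspective came into the research
    • Consultation with the media communication team
      ◦ To prevent anyone from estimating how a particular article performed, we decided to publish article IDs only
      ◦ This is how popularity bias came into the proposal
      ◦ We would like to add something like BERT vectors eventually
    • Sorting out the benefits
      ◦ We needed something beyond branding
      ◦ The latest algorithms become easier to apply, and our problems can become problems of the research community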
  8. Research process

    In the end I prepared an internal approval document like this one and obtained approval at the management meeting.
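  9. Research process

    Starting the joint research with Maehara-san
    • I first contacted him on 2020/02/20
      ◦ By Twitter DM
      ◦ All later communication was also by Twitter DM only
      ◦ Meetings on Google Meet about once every two weeks
    • We agreed on the goal and the tasks at the start, so things went smoothly in a short time
      ◦ We agreed to submit a paper and publish the data
      ◦ Maehara-san worked out the details of the sampling method and implemented it
      ◦ I was responsible for the overall paper writing and the validation experiments with recommender systems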
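  10. Review results

    4/3/3 -> the meta-reviewer gave 4, Accept (honestly, I expected better scores)
    • Every reviewer asked us to tone down the claims
      ◦ The original title was "Challenge and Solution to publish implicit datasets from commercial service.", and the comment was that we had not actually provided a solution
      ◦ A fair point
      ◦ For the camera-ready version we changed to the current title and rewrote every phrase such as "solve" or "solution"
    • Points that were appreciated
      ◦ The motivation and the optimization-based approach were accepted
      ◦ Publishing the dataset and the code was highly rated
        ▪ Files could be attached at submission time, so we attached them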
  11. A Method to Anonymize Business Metrics to Publishing Implicit Feedback Datasets

    Yoshifumi SEKI (Gunosy Inc., Japan), Takanori MAEHARA (RIKEN, Japan)
  12. Background

    Datasets have contributed to the development of recommender system research.
    • MovieLens, Netflix Prize
    • In recent years, data science competitions such as Kaggle, KDD Cup, and the RecSys Challenge have promoted dataset publication.
    There are not enough implicit feedback datasets from commercial services.
    • Recommender systems have been adopted in many different services, so many different datasets are needed.
    Dataset publication is important for recsys studies.
  13. Motivation

    Publishing a dataset carries business risks.
    • Leaking confidential business metrics.
    • Reputation risks.
    Before publishing a dataset, researchers must get approval from a business manager.
    • Many business managers are not specialists in machine learning or recommender systems.
    • The researchers are responsible for explaining the risks and benefits.
    We focus on implicit feedback datasets.
    • Implicit feedback datasets include confidential business information and users' personal information.
    • Explicit feedback datasets are often constructed by crawling public web resources, such as user reviews and ratings available online.
    We would like to make it easier for commercial services to publish datasets.
  14. Contributions

    • We summarize the challenges of building and publishing datasets from a commercial service.
    • We formulate the problem of building and publishing a dataset as an optimization problem that seeks the sampling weight of users.
    • We applied our method to build datasets from the raw data of our real-world mobile news delivery service Gunosy, which is a popular news delivery service in Japan.
      ◦ The raw data has more than 1,000,000 users with 100,000,000 interactions.
    • The implementation of our proposed method and a dataset built with it are public: https://github.com/gunosy/publishing-dataset-recsys20
  15. Tasks

    To simplify the situation, we focus only on the following three kinds of data.
    • User behavior logs: when user u clicks article a at time t, the triplet (u, a, t) is recorded as a log.
    • User attributes: each user has attributes, such as age and gender.
    • Article category: each news article has a category, such as sports, entertainment, or politics.
    Our task is to publish a subset of the user behavior logs. We build the dataset with a "sampling user" approach:
    1. Sample users from the user behavior logs.
    2. Collect all the user behavior logs associated with the sampled users.
  16. Sampling Approach

    User behavior logs
    • User A: (item A, item C, item D, item G)
    • User B: (item B, item C, item D, item F, item G)
    • User C: (item B, item C, item E)
  17. Sampling Approach

    User behavior logs
    • User A: (item A, item C, item D, item G)
    • User B: (item B, item C, item D, item F, item G)
    • User C: (item B, item C, item E)
    Dataset built by sampling behavior logs: parts of the users' consumption histories are missing.
  18. Sampling Approach

    User behavior logs
    • User A: (item A, item C, item D, item G)
    • User B: (item B, item C, item D, item F, item G)
    • User C: (item B, item C, item E)
    Dataset built by sampling users: the consumption histories of the sampled users are kept intact.
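
The difference between the two sampling approaches can be sketched in a few lines of Python. This is an illustration only: the toy logs mirror the three users on the slide, while the timestamps, the sampling rate, and the function names are assumptions made up for the example.

```python
import random

# Toy behavior logs as (user, article, time) triplets. The users and items
# mirror the slide; the integer timestamps are invented for illustration.
LOGS = [
    ("A", "item A", 1), ("A", "item C", 2), ("A", "item D", 3), ("A", "item G", 4),
    ("B", "item B", 1), ("B", "item C", 2), ("B", "item D", 3),
    ("B", "item F", 4), ("B", "item G", 5),
    ("C", "item B", 1), ("C", "item C", 2), ("C", "item E", 3),
]


def sample_logs(logs, rate, rng):
    """Sample individual log rows: a user's history may be only partly kept."""
    return [row for row in logs if rng.random() < rate]


def sample_users(logs, rate, rng):
    """Sample users, then collect every log row of each sampled user."""
    users = sorted({row[0] for row in logs})
    kept = {u for u in users if rng.random() < rate}
    return [row for row in logs if row[0] in kept]


def history_sizes(logs):
    """Number of log rows per user."""
    sizes = {}
    for row in logs:
        sizes[row[0]] = sizes.get(row[0], 0) + 1
    return sizes


rng = random.Random(0)
full = history_sizes(LOGS)
by_log = history_sizes(sample_logs(LOGS, 0.5, rng))
by_user = history_sizes(sample_users(LOGS, 0.5, rng))
print("full histories:   ", full)
print("sampled by log:   ", by_log)
print("sampled by user:  ", by_user)
```

Whatever the random seed, every user that survives `sample_users` keeps the complete history, whereas `sample_logs` can only shrink histories.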
  19. Challenges

    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
  20. Challenges

    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
    Challenge 1: Anonymize the Business Metrics
    • We do not want to disclose confidential business metrics.
      ◦ operating income
      ◦ the average number of clicks
      ◦ the average active rate of users
    • If the users are sampled uniformly, some business metrics can be estimated easily.
      ◦ the average number of clicks
      ◦ the average active rate of users
    • We must sample users with a non-uniform distribution.
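
A small simulation shows why uniform sampling leaks an average. Everything here is synthetic: the click counts are drawn from an arbitrary distribution, and the inverse-click weighting is just one hypothetical non-uniform scheme, not the method proposed in the paper.

```python
import random

rng = random.Random(42)

# Synthetic per-user click counts (hypothetical, not Gunosy data).
clicks = [1 + int(rng.expovariate(0.05)) for _ in range(50_000)]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(clicks)

# Uniform sampling: the sample mean is an unbiased estimate of the true mean,
# so anyone holding the published sample can read the metric off directly.
uniform_mean = mean(rng.sample(clicks, 5_000))

# Non-uniform sampling: the weights favour light users (an arbitrary choice
# for illustration), so the sample mean no longer tracks the true mean.
# rng.choices samples with replacement, which is enough for this sketch.
weights = [1.0 / c for c in clicks]
skewed_mean = mean(rng.choices(clicks, weights=weights, k=5_000))

print(f"true={true_mean:.2f} uniform={uniform_mean:.2f} skewed={skewed_mean:.2f}")
```

The uniform sample reproduces the average number of clicks closely, while the weighted sample gives a visibly different value.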
  21. Challenges

    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
    Challenge 2: Maintain Fairness
    • Publishing a fair dataset is very important.
      ◦ Some existing methods that maintain fairness rely on user attributes, but publishing user attributes can cause de-anonymization.
      ◦ Publishing an unfair dataset indirectly contributes to creating unfair machine learning models.
    • This risk can damage the company's reputation.
  22. Challenges

    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
    Challenge 3: Reduce Popularity Bias
    • Recommender systems are expected to match long-tail items with users; algorithms that suffer from popularity bias cannot fulfil this role.
    • We believe popularity bias is also a problem when building a dataset.
      ◦ If the dataset is built by uniform sampling, items in unpopular categories are sampled less frequently.
      ◦ Because researchers cannot increase the number of interactions afterwards, the publisher must keep a certain number of interactions with items in unpopular categories.
  23. Mathematical Formulation

    We formulate our task as the problem of finding the sampling weight of users: w(u).
    We assume that our business metrics are anonymized if the distribution of the number of clicks in the dataset differs from the one in the raw data.
    • Formulating the anonymization challenge directly is impossible because it would require enumerating all the metrics that we should anonymize.
    • Several important metrics are strongly correlated with the distribution of the number of clicks.
    We sample users so that the click distribution of the dataset becomes closer to a target distribution.
    (Figure: user sampling turns the click distribution of the raw data into the target distribution)
  24. Finding Sampling Weight

    User behavior logs
    • User A: (item A, item C, item D, item G), sampled with weight w(User A)
    • User B: (item B, item C, item D, item F, item G), sampled with weight w(User B)
    • User C: (item B, item C, item E), sampled with weight w(User C)
    The dataset is built by sampling users with weight w(u). We look for the optimal w(u) that brings the dataset close to the target distribution.
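
One way to read "sampling by weight" is that user u is drawn with probability proportional to w(u), so the expected click distribution of the dataset is a weighted histogram of per-user click counts. The sketch below follows that reading, which is an assumption; the weight values and the function name are made up, and only the click counts (4, 5 and 3) come from the slide's example.

```python
from collections import defaultdict


def expected_click_distribution(clicks_per_user, weights):
    """Expected share of sampled users for each click count, assuming user u
    is drawn with probability proportional to weights[u]."""
    total = sum(weights.values())
    dist = defaultdict(float)
    for user, n_clicks in clicks_per_user.items():
        dist[n_clicks] += weights[user] / total
    return dict(dist)


# Click counts taken from the slide's example histories.
clicks_per_user = {"User A": 4, "User B": 5, "User C": 3}

# Hypothetical weights: uniform versus skewed toward the lightest user.
uniform = {"User A": 1.0, "User B": 1.0, "User C": 1.0}
skewed = {"User A": 1.0, "User B": 0.5, "User C": 2.5}

uniform_dist = expected_click_distribution(clicks_per_user, uniform)
skewed_dist = expected_click_distribution(clicks_per_user, skewed)
print(uniform_dist)
print(skewed_dist)
```

Changing w(u) reshapes the click distribution without touching any individual history, which is what makes the weight a convenient optimization variable.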
  25. Mathematical Formulation

    We sample users so that the click distribution of the dataset becomes closer to a target distribution. The formulation uses:
    • the expected click distribution on the dataset
    • the target distribution
    • the Wasserstein distance
  26. Mathematical Formulation

    We sample users so that the click distribution of the dataset becomes closer to a target distribution. The formulation uses:
    • the expected click distribution on the dataset
    • the target distribution
    • the Wasserstein distance on the real line
    We also sample users so that the distribution of user attributes and the distribution of clicks in each article category approach specific distributions. Here D is the KL divergence.
    Each expected distribution is simply calculated using the sampling weight.
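
The two distances named on this slide can be written compactly for discrete distributions. The sketch below is a generic textbook version, not the authors' code: it assumes both distributions are given as probability lists over the same unit-spaced grid, and the argument order of the KL divergence is a guess because the slide does not state it.

```python
import math


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two pmfs on the same unit-spaced grid.
    On the real line it equals the summed absolute gap between the CDFs."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def kl_divergence(p, q):
    """KL divergence D(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Moving all mass from the first bin to the third costs two grid steps.
print(wasserstein_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))

# Identical distributions have zero divergence.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
```

The Wasserstein distance suits click counts because it respects the ordering of the bins, whereas KL divergence fits unordered groups such as attributes and categories.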
  27. Mathematical Formulation

    We find a sampling weight at which all the loss functions have small values.
    We apply gradient-descent-type algorithms to minimize the loss function.
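
To make the optimization step concrete, here is a minimal gradient-descent sketch on a toy problem. It departs from the slides in several labeled ways: the weights are parameterized as a softmax over free parameters `theta`, the click loss is the squared CDF gap (a smooth stand-in for the Wasserstein term), the attribute and category losses are omitted, and the learning rate is tuned only for this toy input. None of the names come from the published implementation.

```python
import math


def zipf_pmf(s, max_clicks):
    """Zipf-like pmf over click counts 1..max_clicks with exponent s."""
    raw = [k ** -s for k in range(1, max_clicks + 1)]
    z = sum(raw)
    return [r / z for r in raw]


def expected_pmf(theta, clicks, max_clicks):
    """Sampling weights softmax(theta) and the click-count pmf they induce."""
    m = max(theta)
    expw = [math.exp(t - m) for t in theta]
    z = sum(expw)
    w = [e / z for e in expw]
    pmf = [0.0] * max_clicks
    for wu, c in zip(w, clicks):
        pmf[c - 1] += wu
    return w, pmf


def loss_and_grad(theta, clicks, target):
    """Squared CDF gap to the target and its gradient with respect to theta."""
    k = len(target)
    w, pmf = expected_pmf(theta, clicks, k)
    gaps, acc = [], 0.0
    for p, t in zip(pmf, target):
        acc += p - t
        gaps.append(acc)
    loss = sum(g * g for g in gaps)
    # dL/dpmf[i] is twice the sum of the CDF gaps at positions >= i.
    dpmf, suffix = [0.0] * k, 0.0
    for i in range(k - 1, -1, -1):
        suffix += 2.0 * gaps[i]
        dpmf[i] = suffix
    # Chain rule through the softmax.
    mean = sum(p * d for p, d in zip(pmf, dpmf))
    grad = [wu * (dpmf[c - 1] - mean) for wu, c in zip(w, clicks)]
    return loss, grad


def fit_weights(clicks, target, lr=5.0, steps=2000):
    """Plain gradient descent; lr is tuned for the toy data below."""
    theta = [0.0] * len(clicks)
    for _ in range(steps):
        _, grad = loss_and_grad(theta, clicks, target)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy raw data: 40 users whose click counts are uniform over 1..4.
clicks = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10
target = zipf_pmf(1.0, 4)

initial_loss, _ = loss_and_grad([0.0] * len(clicks), clicks, target)
theta = fit_weights(clicks, target)
final_loss, _ = loss_and_grad(theta, clicks, target)
_, fitted_pmf = expected_pmf(theta, clicks, 4)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

Starting from uniform weights, the fitted weights shift probability toward users with few clicks until the expected click distribution approaches the Zipf-shaped target.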
  28. Experiments

    We built datasets from the raw data of our news delivery service.
    We built eight datasets from the user behavior logs:
    • 60,000 users sampled from the raw data.
    • Two types of target click distribution.
      ◦ Zipf(1) and Zipf(2)
    • Controlled or uncontrolled target distributions for attributes and for categories.
  29. Experiments

    We built datasets from the raw data of our news delivery service.
    We built eight datasets from the user behavior logs:
    • 60,000 users sampled from the raw data.
    • Two types of target click distribution.
      ◦ Zipf(1) and Zipf(2)
    • Controlled or uncontrolled target distributions for attributes and for categories.
    The Zipf(2) datasets are sparser than the Zipf(1) datasets.
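
A quick calculation illustrates why a Zipf(2) target yields a sparser dataset than Zipf(1). The sketch assumes Zipf(s) means a pmf proportional to k^(-s) over click counts, truncated at a hypothetical cap `MAX_CLICKS`; neither the truncation nor the cap value comes from the slides.

```python
def zipf_pmf(s, max_clicks):
    """Pmf proportional to k**(-s) over click counts 1..max_clicks."""
    weights = [k ** -s for k in range(1, max_clicks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_clicks(pmf):
    """Expected number of clicks per user under the pmf."""
    return sum(k * p for k, p in enumerate(pmf, start=1))


MAX_CLICKS = 1000  # hypothetical cap on clicks per user

zipf1 = zipf_pmf(1, MAX_CLICKS)
zipf2 = zipf_pmf(2, MAX_CLICKS)

print(f"Zipf(1): mean clicks per user = {mean_clicks(zipf1):.1f}")
print(f"Zipf(2): mean clicks per user = {mean_clicks(zipf2):.1f}")
```

With the same number of sampled users, the heavier concentration of Zipf(2) on users with very few clicks means far fewer interactions in total.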
  30. Experiments

    We built datasets from the raw data of our news delivery service.
    We built eight datasets from the user behavior logs:
    • 60,000 users sampled from the raw data.
    • Two types of target click distribution.
      ◦ Zipf(1) and Zipf(2)
    • Controlled or uncontrolled target distributions for attributes and for categories.
    The category-controlled datasets are sparser than the uncontrolled datasets.
  31. Experiments

    We successfully controlled the click distributions.
    (Figure panels: Zipf(1) distribution, Zipf(2) distribution)
  32. Experiments

    We successfully controlled the click distributions.
    (Figure panels: Zipf(1) distribution, Zipf(2) distribution)
    The distributions of both user attributes and article categories are also controlled successfully.
  33. Experiment

    Comparing algorithm evaluations on each dataset
    The performance of the algorithms differed depending on how the datasets were built.
  34. Experiment

    Comparing algorithm evaluations on each dataset
    Evaluation results on the Zipf(1) datasets were similar to those on the uniformly sampled dataset.
    (Highlighted in the table: best and second-best results)
  35. Experiment

    Comparing algorithm evaluations on each dataset
    Evaluation results on the Zipf(2) datasets were worse than those on the Zipf(1) datasets. This may be because the Zipf(2) datasets were sparse.
    (Highlighted in the table: best results)
  36. Experiment

    Comparing algorithm evaluations on each dataset
    It is necessary to select sampling settings according to the purpose, and it may be important to publish datasets with various settings.
  37. Conclusion

    This study is the first attempt to reduce business risks in publishing datasets.
    1. Summarizing the challenges of building and publishing datasets from a commercial service.
    2. Formulating the problem of building and publishing a dataset as an optimization problem that seeks the sampling weight of users.
    3. Applying our method to build datasets from the raw data of our real-world mobile news delivery service.
    Limitations & Future Work
    • We did not give a theoretical guarantee that the metrics are impossible to estimate. Providing such a guarantee is important.
    • This study only considered user-item interactions. However, real-world services may have different types of behavior logs.
  38. Conclusion

    This study is the first attempt to reduce business risks in publishing datasets.
    Previously, researchers have not disclosed how they built their datasets and have not shared that knowledge with the community.
    We hope that our work will lead to more discussion on the process of building and publishing datasets and that many datasets will be published.
    Our implementation and dataset are available: https://github.com/gunosy/publishing-dataset-recsys20
    Feel free to contact me: yoshifumi.seki@gunosy.com
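  39. Summary

    We presented a study that treats sampling for data publication as an optimization problem designed to anonymize KPIs.
    The Q&A and the communication afterwards were quite lively.
    • Some researchers contacted me directly to discuss the work further
    • I felt that Q&A gets lively at online conferences
    • RecSys's single-track format with two presentations per paper seems to have worked in our favor
    I think this became a successful example of paper-based industry-academia collaboration.
    • When the task and the goal (a paper) are fixed, it is easier for the researcher's side
    • It is important that the company side takes on a clear role
    We created a precedent for publishing data, so research can now be planned on that premise.
    • Even with an intricate problem setting, the data can be published along with it, which makes contributions easier to build
    • It is easier to dig into the problems of past research
    • Because publication is the premise, joint research is also easier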
  40. Optimally deliver information to people around the world