[Paper introduction] A Method to Anonymize Business Metrics to Publishing Implicit Feedback Datasets (Recsys2020) / recsys20-reading-gunosy-datapub

ysekky
October 18, 2020

I'm the author of this paper.
This presentation also covers the research process and the collaboration with my co-author.

Transcript

  1. Gunosy Inc.
    Gunosy Tech Lab
    Senior Researcher, Yoshifumi Seki
    October 27, 2020
    RecSys 2020 Reading Group:
    A Method to Anonymize Business Metrics
    to Publishing Implicit Feedback Datasets

  2. (C) Gunosy Inc. All Rights Reserved. PAGE | 2
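    ■ Yoshifumi Seki
    – Born in 1987
    – Toyama College of Maritime Technology (Information Engineering) -> University of Tokyo (Systems Innovation) -> University of Tokyo (Technology Management)
    – Ph.D. in Engineering (March 2017)
    – MITOU creator, 2011
    – MITOU Junior mentor (2017-)
    ■ Gunosy Inc.
    – Gunosy Tech Lab, R&D team
    – Senior Researcher
    – Co-founder
    ■ Areas of expertise
    – Recommender systems
    – User behavior analysis
    ■ Hobbies
    – Baseball, tennis
    – Idols, anime, manga
    – Shogi, board games
    Self-introduction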

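  3. Research motivation
    Problems we run into in production news recommendation are not given
    much weight in the recommender systems community
    ● The difficulty of recommending new items
    ● Handling sudden shifts in item popularity trends
    ● The high refresh rate demanded of recommendation lists
    ● To bring these problems to the recommender systems community, we need
    to start by publishing the data itself
    I want Japanese companies to publish more data
    ● I often feel that companies are not living up to the enthusiasm of
    students who want to build recommender systems
    ● I want data that students who wish to research recommender systems can
    actually get their hands on
    I had long wanted to publish Gunosy's data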

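  4. Research motivation
    https://twitter.com/YoshifumiSeki/status/1181416279967580160
    The barriers to publishing a dataset come down to privacy and the
    leakage of confidential business information
    ● For privacy, isn't it ultimately enough to leave out any information
    related to it?
    ● There is a lot of research on the de-anonymization of personal
    information
    As far as I could find, no work dealt with the leakage of business
    information
    ● If we could handle this, couldn't we publish the data?
    ● Couldn't that proposal itself be a piece of research?
    That is roughly what I wrote on Twitter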

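  5. Research motivation
    https://twitter.com/tmaehara/status/1181427511294644231
    I got an indirect reply from Maehara-san (a tweet responding without mentioning me)
    ● We had not met at this point
    ● We followed each other on Twitter
    ● I recall him praising me in a similar indirect reply around the time
    of my KDD acceptance the previous year
    The work he pointed to is a paper by Maehara-san's team accepted at AAAI-19
    ● It samples user attribute information to build training data for fairness
    ● The problem setting does indeed look similar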

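  6. Research process
    ● Personal information
    ○ Naturally, no such information is included
    ● Attribute information (gender, age, etc.)
    ○ Not included, because such information has been shown to lead to
    de-anonymization
    ● Article information is not published
    ○ We are in the position of borrowing the articles
    ○ It can also lead to de-anonymization
    ● Our conclusion: IDs alone pose no problem
    ● We do not want business KPIs to be reconstructed, but listing every
    KPI is impossible
    ● With random sampling, the averages can be read off
    ● If we make the number of clicks per user follow a specific
    distribution, wouldn't all the other information be distorted too?
    Think about what must be protected
    Users
    Media
    Gunosy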

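  7. Research process
    ● Consultation with the legal team
    ○ To make reconstruction impossible, we decided not to publish user
    attribute information
    ○ From this angle, the fairness perspective was brought into the research
    ● Consultation with the media communication team
    ○ To prevent anyone from inferring how individual articles performed,
    we decided to publish article IDs only
    ○ From this angle, popularity bias entered the proposal
    ○ Eventually we would like to attach something like BERT vectors
    ● Sorting out the benefits
    ○ We need something beyond branding
    ○ It becomes easier to apply state-of-the-art algorithms, and our
    problems can become problems of the research community
    I wanted to do this, so I worked hard on internal coordination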

  8. Research process
    In the end I prepared an approval document like this one and obtained approval from the management meeting

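  9. Research process
    ● First contact was on 2020/02/20
    ○ By Twitter DM
    ○ Subsequent communication was also over Twitter DM
    ○ Meetings on Google Meet about once every two weeks
    ● We agreed on the goals and tasks at the outset, so things moved
    smoothly in a short time
    ○ Agreed to submit a paper and publish the data
    ○ Maehara-san: fleshing out and implementing the sampling method
    ○ Me: overall paper writing and the validation experiments using
    recommender systems
    Start of the joint research with Maehara-san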

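  10. Review results
    ● All reviewers asked us to tone down our claims
    ○ The original title was "Challenge and Solution to publish implicit
    datasets from commercial service.", and the comment was that we had not
    actually delivered a solution
    ○ Fair enough.
    ○ For the camera-ready version we switched to the current title and
    rewrote every statement along the lines of "solve" or "solution"
    ● What was appreciated
    ○ The motivation and the optimization-based approach were accepted
    ○ Publishing the dataset and the code was highly valued
    ■ Files could be attached at submission time, so we attached them.
    Scores 4/3/3 -> the meta-reviewer gave 4, Accept (honestly, I expected better scores)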

  11. Yoshifumi SEKI (Gunosy Inc., Japan)
    Takanori MAEHARA (RIKEN, Japan)
    A Method to Anonymize Business Metrics
    to Publishing Implicit Feedback Datasets

  12. Background
    Datasets have contributed to the development of recommender system research.
    ● MovieLens, Netflix Prize
    ● In recent years, data science competitions such as Kaggle,
    KDD Cup, and the RecSys Challenge have promoted dataset publication.
    There are not enough implicit feedback datasets from commercial services.
    ● Recommender systems have been adopted in a wide variety of
    services, so a wide variety of datasets is needed.
    Dataset publication is important for recsys studies.

  13. Motivation
    Publishing a dataset carries business risks.
    ● Leaking confidential business metrics.
    ● Reputational risks.
    Before publishing a dataset, researchers must get approval from a
    business manager.
    ● Many business managers are not specialists in machine learning or
    recommender systems.
    ● The researchers are responsible for explaining the risks and
    benefits.
    We focus on implicit feedback datasets.
    ● Implicit feedback datasets include confidential business information
    and users’ personal information.
    ● Explicit feedback datasets are often constructed by crawling public
    web resources, such as user reviews and ratings available online.
    We would like to make it easier for commercial services to publish datasets.

  14. Contributions
    ● We summarize the challenges of building and publishing datasets
    from commercial services.
    ● We formulate the problem of building and publishing a dataset as an
    optimization problem that seeks the sampling weight of users.
    ● We applied our method to build datasets from the raw data of our
    real-world mobile news delivery service Gunosy, which is a popular
    news delivery service in Japan.
    ○ The raw data has more than 1,000,000 users with 100,000,000
    interactions.
    ● The implementation of our proposed method and a dataset built by
    our proposed method are public:
    https://github.com/gunosy/publishing-dataset-recsys20

  15. Tasks
    To simplify the situation, we focus on only the following three kinds of data.
    ● User behavior logs: when user u clicks article a at time t, the
    triplet (u, a, t) is recorded as a log.
    ● User attributes: each user has attributes, such as age and gender.
    ● Article category: each news article has a category, such as
    sports, entertainment, and politics.
    Our task is to publish a subset of the user behavior logs.
    We build the dataset with a “sampling user” approach:
    1. Sample users from the user behavior logs.
    2. Collect all the user behavior logs associated with the sampled
    users.

  16. Sampling Approach
    User behavior logs
    User A: (item A, item C, item D, item G)
    User B: (item B, item C, item D, item F, item G)
    User C: (item B, item C, item E)

  17. Sampling Approach
    User behavior logs
    User A: (item A, item C, item D, item G)
    User B: (item B, item C, item D, item F, item G)
    User C: (item B, item C, item E)
    Dataset built by sampling behavior logs
    The users' consumption histories become incomplete.

  18. Sampling Approach
    User behavior logs
    User A: (item A, item C, item D, item G)
    User B: (item B, item C, item D, item F, item G)
    User C: (item B, item C, item E)
    Dataset built by sampling users
    The users' consumption histories are kept intact.
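
The difference between the two sampling approaches can be sketched in a few lines of Python. This is only an illustration on the three toy users from the slides; the function names (`sample_behavior_logs`, `sample_users`, `history`) and the sample sizes are assumptions made for this example and are not taken from the published implementation.

```python
import random

# Toy (user, item) logs matching the three users shown on the slides.
LOGS = [
    ("A", "item A"), ("A", "item C"), ("A", "item D"), ("A", "item G"),
    ("B", "item B"), ("B", "item C"), ("B", "item D"), ("B", "item F"), ("B", "item G"),
    ("C", "item B"), ("C", "item C"), ("C", "item E"),
]


def history(logs, user):
    """Return the items a user clicked, in log order."""
    return [item for u, item in logs if u == user]


def sample_behavior_logs(logs, n_logs, rng):
    """Sample individual log rows; per-user histories may lose items."""
    return rng.sample(logs, n_logs)


def sample_users(logs, n_users, rng):
    """Sample users, then keep every log row of each sampled user."""
    users = sorted({u for u, _ in logs})
    chosen = set(rng.sample(users, n_users))
    return [(u, item) for u, item in logs if u in chosen]


rng = random.Random(0)  # fixed seed so the illustration is reproducible
by_log = sample_behavior_logs(LOGS, 6, rng)
by_user = sample_users(LOGS, 2, rng)
sampled_users = sorted({u for u, _ in by_user})

# Every sampled user keeps the full history under user sampling.
for user in sampled_users:
    print(user, history(by_user, user) == history(LOGS, user))
```

With user sampling the comparison is true for every sampled user by construction, whereas row-level sampling gives no such guarantee.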

  19. Challenges
    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias

  20. Challenges
    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
    ● We do not want to disclose confidential business metrics.
    ○ operating income
    ○ the average number of clicks
    ○ the average active rate of users
    ● If the users are sampled uniformly, some business metrics can be
    easily estimated.
    ○ the average number of clicks
    ○ the average active rate of users
    ● We must therefore sample users with a non-uniform distribution.
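
A small numerical sketch of why uniform sampling leaks an average. The click counts and the inverse-click weighting below are made-up values for illustration only; `expected_mean_clicks` is a hypothetical helper, not part of the published code.

```python
def expected_mean_clicks(clicks, weights):
    """Expected clicks per user in the dataset when user u is drawn
    with probability proportional to weights[u]."""
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, clicks)) / total


# Hypothetical raw data: 900 light users (1 click) and 100 heavy users (100 clicks).
clicks = [1] * 900 + [100] * 100
population_mean = sum(clicks) / len(clicks)

# Uniform weights reproduce the confidential average exactly in expectation.
uniform_mean = expected_mean_clicks(clicks, [1.0] * len(clicks))

# A non-uniform weighting (here inversely proportional to clicks, an
# arbitrary choice) moves the observable average away from the true one.
skewed_mean = expected_mean_clicks(clicks, [1.0 / c for c in clicks])

print(population_mean, uniform_mean, skewed_mean)
```

Under uniform weights the expected dataset average equals the raw-data average, which is the leak the slide describes; any non-uniform weighting breaks that equality.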

  21. Challenges
    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
    ● Publishing a fair dataset is very important.
    ○ Some existing methods that maintain fairness use user
    attributes; however, publishing user attributes can enable
    de-anonymization.
    ○ Publishing an unfair dataset indirectly contributes to creating
    unfair machine learning models.
    ● This risk can damage the company's reputation.

  22. Challenges
    We pose the following three challenges.
    1. Anonymize the Business Metrics
    2. Maintain Fairness
    3. Reduce Popularity Bias
    ● Recommender systems are expected to match long-tail items
    with users; thus, algorithms that suffer from popularity bias cannot
    fulfill this role.
    ● We believe popularity bias is also a problem when building a dataset.
    ○ If the dataset is built by uniform sampling, items in
    unpopular categories are sampled less frequently.
    ○ Because researchers cannot increase the number of
    interactions afterwards, the publisher must keep a certain number of
    interactions with items in unpopular categories.

  23. Mathematical Formulation
    We formulate our task as the problem of finding the sampling weight of
    users, w(u).
    We assume that our business metrics are anonymized if the distribution of
    the number of clicks in the dataset differs from that in the raw data.
    ● Formulating this challenge directly is impossible because it would
    require enumerating all the metrics that we should anonymize.
    ● Several important metrics are strongly correlated with the distribution
    of the number of clicks.
    We sample users so that the click distribution of the dataset moves
    closer to a target distribution.
    [Figure: click distribution of raw data -> user sampling -> target distribution]

  24. Finding Sampling Weight
    User behavior logs
    User A, weight w(User A): (item A, item C, item D, item G)
    User B, weight w(User B): (item B, item C, item D, item F, item G)
    User C, weight w(User C): (item B, item C, item E)
    Dataset built by sampling users with weight w(u)
    Find the w(u) that brings the dataset close to the target distribution.

  25. Mathematical Formulation
    We sample users to make the click distribution of the dataset closer to
    a target distribution.
    [Equation shown on the slide is not captured in this transcript. Its terms:]
    ● the expected click distribution on the dataset
    ● the target distribution
    ● the Wasserstein distance

  26. Mathematical Formulation
    We sample users to make the click distribution of the dataset closer to
    a target distribution.
    [Equation shown on the slide is not captured in this transcript. Its terms:]
    ● the expected click distribution on the dataset
    ● the target distribution
    ● the Wasserstein distance on the real line
    We also sample users so that the distribution of user attributes and the
    distribution of clicks over article categories each match a specified
    distribution.
    D is the KL divergence.
    Each expected distribution is simply calculated using the sampling weight.
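
As a rough sketch of the quantities named on this slide: the code below computes an expected distribution from sampling weights, the Wasserstein-1 distance between two discrete distributions on the real line (via the area between their CDFs), and a KL divergence. The slide's own equations are not available here, so the function names, the normalisation of the weights, and the direction of the KL divergence are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def expected_distribution(weights, values):
    """Expected distribution of `values` (e.g. clicks per user) in the
    dataset when user u is sampled proportionally to weights[u]."""
    total = sum(weights)
    dist = defaultdict(float)
    for w, v in zip(weights, values):
        dist[v] += w / total
    return dict(dist)


def wasserstein_1d(p, q):
    """Wasserstein-1 distance on the real line: area between the two CDFs."""
    support = sorted(set(p) | set(q))
    cdf_p = cdf_q = 0.0
    distance = 0.0
    for left, right in zip(support, support[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        distance += abs(cdf_p - cdf_q) * (right - left)
    return distance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); the direction is an assumption, the slide only names KL."""
    return sum(pv * math.log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0.0)


# Four hypothetical users with 1, 1, 2 and 5 clicks, sampled uniformly.
expected = expected_distribution([1.0, 1.0, 1.0, 1.0], [1, 1, 2, 5])
target = {1: 0.25, 2: 0.25, 5: 0.5}  # made-up target distribution

w1 = wasserstein_1d(expected, target)
kl = kl_divergence(expected, target)
print(w1, kl)
```

Here the expected distribution has 0.25 too much mass at 1 click and 0.25 too little at 5 clicks, so moving that mass a distance of 4 gives a Wasserstein distance of 1.0.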

  27. Mathematical Formulation
    We find a sampling weight at which all the loss functions have small
    values.
    We apply gradient-descent-type algorithms to minimize the
    loss function.
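
A minimal sketch of the gradient-descent idea, restricted to a single loss term. It assumes the weights are parametrised as a softmax over per-user logits and minimises only KL(target || expected attribute distribution); the parametrisation, step size, and the choice of plain gradient descent are assumptions for this example, and the click and category terms of the actual method are omitted.

```python
import math


def softmax(theta):
    """Map unconstrained logits to positive weights that sum to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def group_distribution(weights, groups, n_groups):
    """Expected share of each attribute group under the sampling weights."""
    dist = [0.0] * n_groups
    for w, g in zip(weights, groups):
        dist[g] += w
    return dist


def fit_sampling_weights(groups, target, steps=500, lr=1.0):
    """Plain gradient descent on KL(target || expected group distribution)."""
    theta = [0.0] * len(groups)
    for _ in range(steps):
        w = softmax(theta)
        p = group_distribution(w, groups, len(target))
        # d KL / d theta_u = w_u * (1 - target[g_u] / p[g_u]) for softmax weights.
        theta = [t - lr * wu * (1.0 - target[g] / p[g])
                 for t, wu, g in zip(theta, w, groups)]
    return softmax(theta)


# Hypothetical raw data: three users in group 0 and one user in group 1.
groups = [0, 0, 0, 1]
weights = fit_sampling_weights(groups, target=[0.5, 0.5])
fitted = group_distribution(weights, groups, 2)
print(weights, fitted)
```

Starting from uniform weights (group shares 0.75 and 0.25), the updates raise the weight of the single group-1 user until both groups are expected to contribute equally.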

  28. Experiments
    We built datasets from the raw data of our news delivery service.
    We built eight datasets from the user behavior logs.
    ● Sample 60,000 users from the raw data.
    ● Two types of target click distribution.
    ○ Zipf(1) and Zipf(2)
    ● Controlled/uncontrolled target distributions of attributes and categories
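
For intuition about the two target click distributions, here is a sketch of a truncated Zipf distribution over the number of clicks per user. The truncation at `max_clicks` and the helper name `zipf_target` are assumptions for illustration; the exact parametrisation used in the experiments is not given on the slides.

```python
def zipf_target(s, max_clicks):
    """Truncated Zipf(s): P(k clicks) proportional to k ** (-s), k = 1..max_clicks."""
    raw = [k ** (-s) for k in range(1, max_clicks + 1)]
    z = sum(raw)
    return {k: r / z for k, r in zip(range(1, max_clicks + 1), raw)}


def mean_clicks(dist):
    """Average number of clicks per user under the distribution."""
    return sum(k * p for k, p in dist.items())


zipf1 = zipf_target(1, max_clicks=100)
zipf2 = zipf_target(2, max_clicks=100)

# Zipf(2) puts more mass on users with very few clicks than Zipf(1).
print(zipf1[1], zipf2[1])
print(mean_clicks(zipf1), mean_clicks(zipf2))
```

Because Zipf(2) concentrates more probability on low click counts, a dataset matched to it has fewer interactions per user, which fits the later observation that the Zipf(2) datasets are sparser.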

  29. Experiments
    We built datasets from the raw data of our news delivery service.
    We built eight datasets from the user behavior logs.
    ● Sample 60,000 users from the raw data.
    ● Two types of target click distribution.
    ○ Zipf(1) and Zipf(2)
    ● Controlled/uncontrolled target distributions of attributes and categories
    Zipf(2) datasets are sparser than Zipf(1) datasets.

  30. Experiments
    We built datasets from the raw data of our news delivery service.
    We built eight datasets from the user behavior logs.
    ● Sample 60,000 users from the raw data.
    ● Two types of target click distribution.
    ○ Zipf(1) and Zipf(2)
    ● Controlled/uncontrolled target distributions of attributes and categories
    Category-controlled datasets are sparser than uncontrolled datasets.

  31. Experiments
    We successfully controlled the click distributions.
    [Figures: click distribution for Zipf(1); click distribution for Zipf(2)]

  32. Experiments
    We successfully controlled the click distributions.
    [Figures: click distribution for Zipf(1); click distribution for Zipf(2)]
    The distributions of both user attributes and article
    categories are also controlled successfully.

  33. Experiment
    Comparing algorithm evaluations across the datasets
    Algorithm performance differed depending on how the datasets were built.

  34. Experiment
    Comparing algorithm evaluations across the datasets
    Evaluations on the Zipf(1) datasets were similar to those on the uniformly sampled dataset.
    [Results table: best and second-best results highlighted]

  35. Experiment
    Comparing algorithm evaluations across the datasets
    Evaluation results on the Zipf(2) datasets were worse than on the Zipf(1) datasets.
    This may be because the Zipf(2) datasets were sparse.
    [Results table: best result highlighted]

  36. Experiment
    Comparing algorithm evaluations across the datasets
    It is necessary to select sampling settings according to the purpose, and it
    may be important to publish datasets with various settings.

  37. Conclusion
    This study is the first attempt to reduce business risks in publishing datasets, by:
    1. summarizing the challenges of building and publishing datasets
    from commercial services.
    2. formulating the problem of building and publishing a dataset as an
    optimization problem that seeks the sampling weight of users.
    3. applying our method to build datasets from the raw data of our
    real-world mobile news delivery service.
    Limitations & Future Work
    ● We did not give a theoretical guarantee that the business metrics
    cannot be estimated. Providing such a guarantee is important.
    ● This study only considered user-item interactions. However, real-world
    services may have different types of behavior logs.

  38. Conclusion
    This study is the first attempt to reduce business risks in publishing datasets
    Previously, researchers have not disclosed how they built
    their datasets and have not shared that knowledge with the
    community.
    We hope that our work will lead to more discussions on
    the process of building and publishing datasets and that
    many datasets will be published.
    Our implementation and dataset are available:
    https://github.com/gunosy/publishing-dataset-recsys20
    Feel free to contact me: [email protected]

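  39. Summary
    We presented a study that treats sampling for data publication as an
    optimization problem so that KPIs are anonymized
    The Q&A and the follow-up conversations were very lively
    ● Some researchers contacted me directly wanting to discuss further
    ● It struck me that online conferences make for lively Q&A
    ● RecSys's single-track format, with each talk presented twice, seems to
    have worked in our favor
    I think this became a successful example of paper-driven industry-academia collaboration
    ● Having the task and the paper fixed as goals makes things easier for the researcher side
    ● It is important that the company side takes on a defined role
    Having set a precedent for publishing data, we can now do research that assumes publication
    ● Even with intricate problem settings, the data can be published along
    with the work, which makes it easier to establish contributions
    ● It is easier to dig into the shortcomings of past research
    ● Since publication is assumed from the start, joint research is also easier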

  40. Delivering information optimally to people around the world