Residual2Vec: Debiasing graph embedding with random graphs

Residual2Vec: Debiasing graph embedding with random graphs Sansan株式会社技術本部研究開発部
⿊⽊裕鷹 NeurIPS 2021 論⽂読み会 2/25

Data Strategy and Operation Center ⾃⼰紹介⿊⽊裕鷹オンライン名刺 •
2020年4⽉⼊社 • 4⽉の⻑野マラソンに出場 Yutaka Kuroki Sansan 株式会社技術本部研究開発部 Data Analysis Group 研究員プロダクト戦略室兼務 @kur0cky_y

Data Strategy and Operation Center 論⽂情報 • 筆頭著者は Indiana⼤学のポスドク
• 複雑ネットワークと CS (ML) の交差点に興味がありそう • 剣道 3段らしい著者： NeurIPS 2021 ポスター学会：

Data Strategy and Operation Center 1ページでまとめ • ランダムウォークベースの node embedding
では，ノードのサンプリングに偏りが⽣じる • 偏り除去のための埋め込みアルゴリズム “residual2vec” を提案 > word2vec の Negative Sampling から着想 > 任意のグラフモデルを仮定し，それを超える情報を⾏列分解 > 偏りの除去により，後段タスクの性能が向上することを確認 • コードは github ( https://github.com/skojaku/residual2vec )

背景友達のパラドックス

Data Strategy and Operation Center 友情のパラドックス「友達の⽅が友達が多い」

Data Strategy and Operation Center 友情のパラドックス「友達の⽅が友達が多い」 • 任意のノードについて，その隣接ノードの⽅が平均的に次数が⼤きい •
いわゆる “スケールフリー性” に起因 > ⼀部のノードが多数のエッジを占有する状態（ハブの存在） > 現実の多くのグラフデータに認められる性質 à グラフ上のランダムウォークも次数由来の “偏り” を受ける

Data Strategy and Operation Center 簡単な例 • 単純な⽣成モデル (Stochastic Block
Model; SBM) でのシミュレーション • 20％のコアノードが，ランダムウォーク系列の過半数を占める • 次数の⼩さいノードが過⼩評価されることにつながる Figure 4 (Kojaku et al., 2021)

Data Strategy and Operation Center 研究の⽬的 Q1: 次数由来の “偏り” が悪いとすれば，どうすれば取り除けるのか
Q2: そもそも ”偏り” は node embedding にとって良いのか，悪いのか

背景 Skip-Gram Negative Sampling

Data Strategy and Operation Center word2vec (Skip-Gram) 注⽬する単語 𝑖 の周辺に，別の単語
𝑗 が現れる確率を softmax で表す • 最尤法で推定するが，語彙数 𝑁 の増加に応じて分⺟の計算が厳しい Negative Sampling • 実際に現れる単語ペアを正例 (𝑌 ! = 1) に，ランダムな関係ないワード ℓ とのペアを負例 (𝑌 ! = 0) にとり，分類問題を解く • ℓ は，ノイズ分布 𝑝" ℓ = 𝑃 ℓ|𝑌ℓ = 0 に従ってサンプリングする à … à

Data Strategy and Operation Center アイデア • 𝑝! 𝑗 が，ランダムウォークにおける次数の偏りを軽減している
• 𝑝" 𝑗 がベースラインとなり，類似度 𝑢$ %𝑣! がそこからの偏差とみなせる • SGNS では 𝑝" 𝑗 が単語の頻度によって決まるため，得られる 𝑢$ %𝑣! では頻度の偏りがかかっていない • 𝑝! ℓ は⾃由に設定することができる • 任意のグラフモデルを仮定し利⽤することができる • また，よく知られたnull model であれば，解析的に求まる

提案⼿法

Data Strategy and Operation Center 提案⼿法：residual2vec 1. グラフの “null model”
を与える 2. 埋め込むグラフデータ上のランダムウォーク， null model の⽣成グラフ上のランダムウォーク系列をそれぞれ考える 3. 2つの系列を⽐較し null model に含まれない残情報 (residual information) を抽出 > ここでの情報とは，PMI (Pointwise Mutual Information) ⾏列 > null model によっては，null 側の系列を作成せずに済む（解析的に求める） 4. PMI⾏列の差（残情報）を⾏列因⼦分解して低次元近似

Data Strategy and Operation Center 提案⼿法：residual2vec Figure 2 (Kojaku et
al., 2021)

実験・結果

Data Strategy and Operation Center 実験 1. Link prediction >
様々なグラフデータセットで評価（実験の詳細がちょっと謎） > 次数の偏りを除くため Configuration model (Fosdick et al., 2018) を null model に > AUC-ROC で評価 2. Community detection > データはベンチマークモデル (Lancichinetti et al., 2009)からの⽣成 > 1 vs rest の AUC-ROC で評価 3. Case Study > 掲載論⽂誌と出版年ごとにノードとした引⽤ネットワーク (Web of Science) > 次数の偏りだけでなく，昔の論⽂ほど引⽤が多くなる偏りがある > 時系列の偏りを取り除くため，出版年をブロックとした degree-corrected SBM (Karrer and Newman, 2011) を null model に

Data Strategy and Operation Center 結果 (Link prediction, Community detection)
Figure 3 (Kojaku et al., 2021) 提案⼿法がほぼ最良の結果に

Data Strategy and Operation Center 結果（引⽤ネットワークの埋め込み） • Glove, node2vec では昔のノードが中⼼に来ている（時間バイアス）
• 提案⼿法では取り除けている Figure 4 (Kojaku et al., 2021)

Data Strategy and Operation Center 結果（引⽤ネットワークの埋め込み）分野・IFの予測では提案⼿法がほぼ最良の結果に Figure 4 (Kojaku
et al., 2021)

Data Strategy and Operation Center まとめ・感想 • 次数に起因する “偏り” を除くことを⽬的とした
node embedding を開発 • Negative Sampling から着想を得て，ノイズ分布に任意のランダムグラフを仮定する residual2vec を提案 • PMI の残差⾏列を因⼦分解 • シンプルなアイデアで効果的な結果を出していてすごい • 複雑ネットワークとグラフ ML の知識を有機的に繋いでいきたい（願望） • 既存研究との⽐較があまりなされていなかった • personalized pagerank を利⽤した debias への⾔及がない • 実験の詳細にかけるところがある

Data Strategy and Operation Center 参考⽂献 • Fosdick, B. K.,
Larremore, D. B., Nishimura, J., & Ugander, J. (2018). Configuring random graph models with fixed degree sequences. Siam Review, 60(2), 315-355. • Karrer, B., & Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical review E, 83(1), 016107. • Kojaku, S., Yoon, J., Constantino, I., & Ahn, Y. Y. (2021). Residual2Vec: Debiasing graph embedding with random graphs. Advances in Neural Information Processing Systems, 34. • Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: a comparative analysis. Physical review E, 80(5), 056117.

Residual2Vec: Debiasing graph embedding with r...

Residual2Vec: Debiasing graph embedding with random graphs

Sansan DSOC

More Decks by Sansan DSOC

Other Decks in Technology

Featured

Transcript