
Graph Representation Learning: From Knowledge Graphs To Recommender Systems

wing.nus
October 01, 2021


Graphs are ubiquitous in the real world. To help machine learning algorithms make use of graph-structured data, researchers have proposed graph representation learning (GRL) methods, which learn a low-dimensional real-valued vector for each node in a graph. In this talk, I will first briefly introduce graph representation learning, graph neural networks (GNNs, a special type of GRL method), and knowledge graphs (KGs, a special type of graph). Then my talk will consist of two parts: (1) Knowledge graph completion. I will introduce PathCon, a GNN-based method that combines relational context and relational path information to predict the relation type of an edge in a KG. (2) Knowledge-graph-aware recommendation. Knowledge graphs can provide additional item-item relationships and thus alleviate the cold-start problem in recommender systems. I will introduce three KG-aware recommendation algorithms: an embedding-based method, DKN, and two structure-based methods, KGCN and KGNN-LS.


Transcript

  1. Graph Representation Learning: From Knowledge Graphs To Recommender Systems Hongwei

    Wang University of Illinois Urbana-Champaign Sep 28, 2021
  2. A Short Bio 2 o Education o B.E., Computer Science

    Shanghai Jiao Tong University, 2010-2014 o Ph.D., Computer Science Shanghai Jiao Tong University, 2014-2018 o Postdoc, Computer Science Stanford University, 2019-2021 o Postdoc, Computer Science University of Illinois Urbana-Champaign, 2021- o Awards o 2018 Google Ph.D. Fellowship o 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation o Research Interests o Graph neural networks, knowledge graphs, recommender systems
  3. Content 3 o Graph representation learning o Graph neural networks

    o Knowledge graphs o Knowledge graph completion o Embedding-based methods o Knowledge-graph-aware recommendation o Embedding-based methods: DKN o Structure-based methods: KGCN and KGNN-LS
  4. Graphs are Ubiquitous 4 A graph is a structure amounting

    to a set of objects in which some pairs of objects are in some sense “related” Molecule Protein Protein-protein interaction Synthetic routes Social networks Knowledge graphs Navigation map Flight routes Atom-level Molecule-level Human-level World-level
  5. Representing a Graph 5 𝐺 = (𝑉, 𝐸)

    [Figure: an example graph 𝐺 and its 6×6 adjacency matrix 𝐴, where 𝐴[𝑖][𝑗] = 1 iff nodes 𝑖 and 𝑗 are connected]
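
As a minimal sketch of the adjacency-matrix representation above (the graph and edge list here are illustrative, not the exact graph on the slide):

```python
import numpy as np

# Illustrative undirected graph with 6 nodes (hypothetical edge list).
num_nodes = 6
edges = [(0, 1), (0, 3), (1, 2), (2, 4), (3, 4), (4, 5)]

# Dense adjacency matrix A: A[i, j] = 1 iff nodes i and j are connected.
A = np.zeros((num_nodes, num_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph, so A is symmetric

print(A)
```
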
  6. Representing a Graph 6 When the graph is very large…

    o Storage inefficiency: 𝑂(𝑁²) o Hard to compute node similarity
  7. Graph Representation Learning 7 Node embeddings in ℝ^𝑑 (𝑑 ≪

    #𝑛𝑜𝑑𝑒𝑠) Graph 𝐺: Nodes / Edges / Subgraphs / Graphs → Points (embeddings) in low-dimensional space ℝ^𝑑 Graph representation learning (GRL) Structural information Semantic information
  8. Downstream Tasks of GRL Link Prediction 8 ? 𝑣₁ 𝑣₂

    Learning a mapping: 𝑓: [𝐞_node1, 𝐞_node2] ↦ {0,1} o Are the two users friends in a social network? o Is there a flight between the two airports? ……
  9. Downstream Tasks of GRL Node Classification 9 label? 𝑣 Learning

    a mapping: 𝑓: 𝐞_node ↦ set of node labels o Is a user male or female in a social network? o What research field does a paper belong to in a citation network? ……
  10. Downstream Tasks of GRL Graph Classification 10 Learning a mapping:

    𝑓: 𝐞_graph ↦ set of graph labels toxic nontoxic Toxic or nontoxic?
  11. Graph Neural Networks (GNNs) 11 𝑥ᵢ: initial feature of

    node 𝑣ᵢ GNNs follow a neighborhood aggregation strategy: ℎᵢ⁰ = 𝑥ᵢ for each node 𝑣ᵢ; for 𝑘 = 1, …, 𝐾: aggregate over the neighbors of each node 𝑣ᵢ to update ℎᵢᵏ; return ℎᵢᴷ for each node 𝑣ᵢ
  12. Graph Neural Networks (GNNs) 12 AGGREGATE function in GCN: 𝑊ₖ

    is a learnable transformation matrix for layer 𝑘, 𝛼ᵢⱼ = 1/|𝒩(𝑖)| is a normalization factor, and 𝜎 is an activation function such as ReLU(𝑥) = max(𝑥, 0) Graph Convolutional Network (GCN) Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." The 5th International Conference on Learning Representations (2017).
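
The AGGREGATE step above can be sketched in a few lines of NumPy. This is a simplified mean-aggregation variant following the slide's 𝛼ᵢⱼ = 1/|𝒩(𝑖)| (the full GCN of Kipf & Welling uses symmetric normalization); all names below are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gcn_layer(A, H, W):
    """One simplified GCN layer: h_i <- ReLU( sum_j alpha_ij * W h_j ),
    with alpha_ij = 1/|N(i)| and a self-loop added for each node.
    A: (N, N) adjacency matrix, H: (N, d_in) node features,
    W: (d_in, d_out) learnable transformation matrix for this layer."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # |N(i)| including the node itself
    return relu((A_hat / deg) @ H @ W)

# Toy example: 4 nodes with 3-dim features mapped to 2-dim embeddings.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H0 = np.random.randn(4, 3)
W1 = np.random.randn(3, 2)
H1 = gcn_layer(A, H0, W1)                   # node embeddings after one layer
```
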
  13. Relational Message Passing for Knowledge Graph Completion The 27th SIGKDD

    Conference on Knowledge Discovery and Data Mining (KDD 2021) Hongwei Wang, Hongyu Ren, Jure Leskovec Stanford University GRL in Knowledge Graphs 13
  14. o Knowledge graphs (KGs) store structured information of real-world entities

    and facts Cast Away The Green Mile Tom Hanks Robert Zemeckis Adventure Back to the Future Steven Spielberg genre genre style starred 𝐺 = {(ℎ, 𝑟, 𝑡)} Head entity Tail entity Relation directed collaborate Knowledge Graphs 14
  15. o Knowledge graphs are usually incomplete and noisy o KG

    completion: given (ℎ, ? , 𝑡), predict 𝑟 o Modeling the distribution over relation types: 𝑝 𝑟 ℎ, 𝑡) ? Cast Away The Green Mile Tom Hanks Robert Zemeckis Adventure Back to the Future Steven Spielberg genre genre style starred directed collaborate Knowledge Graph Completion 15
  16. o Relational context (neighbor edges of a given edge) graduated

    from a person a school person.birthplace person.gender institution.location university.founder university.president movie.language Relations are Correlated… 16
  17. o Relational paths (paths connecting the two endpoints of a

    given edge) graduated from has alumni schoolmate of graduated from Relations are Correlated… 17
  18. 𝑟? head ℎ tail 𝑡 Relational context module The Proposed

    Method: PathCon 18
  19. 𝑟? head ℎ tail 𝑡 Relational context module first-order neighbor

    second-order neighbor relational message passing The Proposed Method: PathCon 19
  20. 𝑟? head ℎ tail 𝑡 Relational paths module connecting path

    The Proposed Method: PathCon 20
  21. o Aggregates neighbor nodes information: o Updates node information: o

    Does not work well on KGs because: o In most KGs, edges have features (relation types) but nodes don’t o Making use of node identities fails in the inductive setting o The number of nodes is much larger than the number of relation types Node-based Message Passing 21
  22. o Aggregates neighbor edge information: o Updates edge information: o

    Avoids the drawbacks of node-based message passing, but introduces a new issue of computational efficiency Relational (Edge-Based) Message Passing 22
  23. Consider a graph with 𝑁 nodes and 𝑀 edges o

    Complexity of node-based message passing in each iteration: 𝟐𝑵 + 𝟐𝑴 o Complexity of relational message passing: 𝑵 ⋅ 𝐕𝐚𝐫(𝒅) + 𝟒𝑴²/𝑵, where Var(𝑑) is the variance of node degrees Message Passing Complexity 23
  24. o Aggregates neighbor edge information to nodes: o Aggregates neighbor

    node information to edges: o Updates edge information: Alternate Message Passing 24
  25. Consider a graph with 𝑁 nodes and 𝑀 edges o

    Complexity of node-based message passing in each iteration: 𝟐𝑵 + 𝟐𝑴 o Complexity of relational message passing: 𝑵 ⋅ 𝐕𝐚𝐫(𝒅) + 𝟒𝑴²/𝑵, where Var(𝑑) is the variance of node degrees o Complexity of alternate relational message passing: 𝟔𝑴 Message Passing Complexity 25
  26. 𝑠ₑⁱ: the hidden state of edge 𝑒 in iteration

    𝑖 (𝑠ₑ⁰ is 𝑒’s initial feature) 𝑚ᵥⁱ: the message stored at node 𝑣 in iteration 𝑖 Making use of relational context: o Final messages of (ℎ, 𝑡): 𝒎𝒉 and 𝒎𝒕 from the final iteration, where 𝐾 is the number of message-passing iterations o Message passing in each iteration: Relational Context 26
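
A rough NumPy sketch of the alternating scheme described on slides 24–26 (edge states → node messages → updated edge states); the update rule and all names here are assumptions for illustration, not PathCon's exact equations:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def alternate_message_passing(edges, edge_feats, num_nodes, weights):
    """Alternate relational message passing (illustrative sketch).
    edges: list of (u, v) endpoint pairs; edge_feats: (M, d) initial edge
    features, e.g. one-hot relation types; weights: one (2d, d) matrix per
    iteration. Returns the final node messages m and edge states s."""
    s = edge_feats.astype(float)                   # s_e^0: initial edge states
    d = s.shape[1]
    for W in weights:
        # (1) aggregate neighbor-edge states into node messages m_v^i
        m = np.zeros((num_nodes, d))
        for (u, v), s_e in zip(edges, s):
            m[u] += s_e
            m[v] += s_e
        # (2) pass node messages back to edges and update each edge state
        s = np.stack([relu(np.concatenate([m[u] + m[v], s_e]) @ W)
                      for (u, v), s_e in zip(edges, s)])
    return m, s

# Toy KG: 4 entities, 3 edges, 2 relation types (one-hot edge features).
edges = [(0, 1), (1, 2), (2, 3)]
edge_feats = np.array([[1, 0], [0, 1], [1, 0]])
weights = [np.random.randn(4, 2) for _ in range(2)]   # K = 2 iterations
m, s = alternate_message_passing(edges, edge_feats, 4, weights)
```
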
  27. A raw path from ℎ to 𝑡: Making use of

    relational paths: The corresponding relational path: o Enumerate all relational paths with length ≤ 𝐿 o Assign an independent embedding vector 𝑠_𝑃 for each relational path 𝑃 Relational Paths 27
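
Enumerating the relational paths between a head and a tail entity can be done with a bounded depth-first search; a minimal sketch (treating KG edges as undirected for the search, which is an assumption here):

```python
from collections import defaultdict

def relational_paths(triples, h, t, max_len=2):
    """Enumerate the relation-type sequences of all paths of length <= max_len
    from head entity h to tail entity t. triples: iterable of (head, rel, tail)."""
    adj = defaultdict(list)
    for a, r, b in triples:
        adj[a].append((r, b))
        adj[b].append((r, a))   # follow edges in both directions (assumption)

    paths = []
    def dfs(node, rel_path, visited):
        if node == t and rel_path:
            paths.append(tuple(rel_path))
        if len(rel_path) == max_len:
            return
        for r, nxt in adj[node]:
            if nxt not in visited:
                dfs(nxt, rel_path + [r], visited | {nxt})
    dfs(h, [], {h})
    return paths

# Each distinct relational path P would then be assigned its own embedding s_P.
triples = [("Alice", "graduated_from", "SJTU"), ("SJTU", "has_alumni", "Bob")]
print(relational_paths(triples, "Alice", "Bob"))   # [('graduated_from', 'has_alumni')]
```
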
  28. o Combine the final messages 𝑚ₕ and 𝑚ₜ together

    to get the context information of (ℎ, 𝑡): o Aggregate the information of all paths from ℎ to 𝑡 with attention: o Make the prediction by combining the above two: o Train the model: Combining Relational Context and Paths 28
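
One way the two signals could be combined, roughly following the slide (the exact projection and attention layers here are assumptions, not PathCon's published equations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_relation(m_h, m_t, path_embs, W_ctx, W_out):
    """m_h, m_t: final messages of head/tail, shape (d,); path_embs: (P, d)
    embeddings of the relational paths from h to t; W_ctx: (2d, d), W_out: (d, R).
    Returns a distribution over the R relation types."""
    ctx = np.tanh(np.concatenate([m_h, m_t]) @ W_ctx)   # context of (h, t)
    alpha = softmax(path_embs @ ctx)                     # attention over paths
    path = alpha @ path_embs                             # aggregated path information
    return softmax((ctx + path) @ W_out)                 # p(r | h, t)
```

Training would then minimize the cross-entropy between this distribution and the true relation type.
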
  29. Datasets Our proposed new dataset Table 1: Statistics of all

    datasets. Experiments 29
  30. Baselines Embedding-based models o TransE o ComplEx o DistMult o

    RotatE o SimplE o QuatE Path-based models o DRUM Ablation studies o Path o Con Table 2: Number of model parameters on DDB14. Experiments 30
  31. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. Hit@1 gain: 0.2% 0.6% 0.9% 16.7% 6.3% 1.8% Experiments 31
  32. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. The performance variance is very small Experiments 32
  33. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. The performance of Path and Con is already quite good Experiments 33
  34. Inductiveness fully transductive → fully inductive random guessing 0.954 →

    0.922 Figure 1: Hit@1 results on WN18RR. Experiments 34
  35. Explainability Indices of all relations in DDB14. Experiments 35

  36. Explainability Figure 4: The learned correlation between all relation paths

    with length ≤ 2 and the predicted relations on DDB14. (a, is associated with, b) ∧ (b, is associated with, c) ⟹ (a, is associated with, c) (a, may be allelic with, b) ∧ (b, may be allelic with, c) ⟹ (a, may be allelic with, c) Experiments 36
  37. Explainability Figure 4: The learned correlation between all relation paths

    with length ≤ 2 and the predicted relations on DDB14. (a, belong(s) to the category of, b) ⟺ (a, is a subtype of, b) (a, is a risk factor for, b) ⟹ (a, may cause, b) (a, may cause, c) ∧ (b, may cause, c) ⟹ (a, may be allelic with, b) Experiments 37
  38. Recommender Systems Movie Recommender systems (RS) intend to address the

    information explosion by finding a small set of items that meet users’ personalized interests 38
  39. Recommender Systems Book Recommender systems (RS) intend to address the

    information explosion by finding a small set of items that meet users’ personalized interests 39
  40. Recommender Systems Trip Recommender systems (RS) intend to address the

    information explosion by finding a small set of items that meet users’ personalized interests 40
  41. Recommender Systems 41 QA Short video Music

  42. Rating/CTR Prediction 42

    [Figure: a user–item rating matrix with missing entries (explicit feedback → rating prediction) and a 0/1 matrix with missing entries (implicit feedback → click-through rate (CTR) prediction)]
  43. Collaborative Filtering 43

    [Figure: a 4×4 user–item rating matrix for users u1–u4 and items i1–i4; u1, u2, u3 have similarity 0.7, 0.1, 0.2 with u4] The missing rating of u4 is predicted as the similarity-weighted average ? = 0.7×2 + 0.1×3 + 0.2×2 = 2.1
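
The weighted average on this slide, as a tiny sketch:

```python
import numpy as np

def predict_rating(sims, ratings):
    """User-based CF: predict a missing rating as the similarity-weighted
    average of the other users' ratings on the same item."""
    sims, ratings = np.asarray(sims, float), np.asarray(ratings, float)
    return (sims * ratings).sum() / sims.sum()

# Numbers from the slide: similarities with u4 are 0.7, 0.1, 0.2 and the
# other users rated the item 2, 3, 2.
print(predict_rating([0.7, 0.1, 0.2], [2, 3, 2]))   # -> 2.1
```
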
  44. CF Cannot Handle... 44

    o Sparsity of user-item interactions o Cold start problem [Figure: a sparse rating matrix (sparsity) and a new item with no ratings (cold start)]
  45. CF + Side Information Social networks User/item attributes Alice Female

    California … Multimedia (images, texts, videos, audios ...) Contexts purchase Time: 20:10 Location: Beijing What else in the cart:… iPhone X 2017 5.8 inch $999 … 45
  46. Why Using KGs in Recommender Systems? 46 Cast Away The

    Green Mile Tom Hanks Robert Zemeckis Adventure Back to the Future Steven Spielberg genre starred genre Forrest Gump Raiders of the Lost Ark Interstellar include include star starred directed direct direct style collaborate items (movies) non-item entities a user watched items (movies)
  47. 47 Boris Johnson Donald Trump Iran Nuclear Congress EMP ……

    Politician Weapon United States North Korea North Korean EMP Attack Would Cause Mass U.S. Starvation, Says Congressional Report News the user may also like Boris Johnson Has Warned Donald Trump To Stick To The Iran Nuclear Deal News the user has read Why Using KGs in Recommender Systems?
  48. 48 Users Items User engagement labels 𝑦ᵤᵥ ∈ {0,1} ……

    Non-item entities Knowledge graph 𝒢 Goal: Learn the predicted engagement probability ŷᵤᵥ Problem Formulation
  49. KG-Enhanced Recommender Systems Embedding-based methods 49 Knowledge Graphs Recommender systems

    Entity embeddings Relation embeddings User embeddings Item embeddings Model
  50. KG-Enhanced Recommender Systems Structure-based methods 50 User-item interactions Knowledge graphs

    Model Structure information
  51. Deep Knowledge-Aware Network for News Recommendation The 2018 Web Conference

    (WWW 2018) Hongwei Wang, Fuzheng Zhang, Xing Xie, Minyi Guo Shanghai Jiao Tong University Embedding-Based Method 51
  52. 52 Trump praises Las Vegas medical team Apple CEO Tim

    Cook: iPhone 8 and Apple Watch Series 3 are sold out in some places EU Spain: Juncker does not want Catalonian independence …… Donald Trump: Donald Trump is the 45th president … Las Vegas: Las Vegas is the 28th-most populated city … Apple Inc.: Apple Inc. is an American multinational … CEO: A chief executive officer is the position of the … Tim Cook: Timothy Cook is an American business … iPhone 8: iPhone 8 is smartphone designed, … Entity linking …… Knowledge subgraph construction Knowledge graph embedding Donald Trump: (0.32, 0.48) Las Vegas: (0.71, -0.49) Apple Inc.: (-0.48, -0.41) CEO: (-0.57, 0.06) Tim Cook: (-0.61, -0.59) iPhone 8: (-0.46, -0.75) Entity embeddings Knowledge Distillation
  53. 53 Context of entities “Fight Club” Context Embedding

  54. 54 𝑤₁:ₙ = [Donald Trump praises Las Vegas medical team]

    𝒅×𝒏 word embedding matrix Sentence Feature maps Max pooling Sentence representation Convolution Kim CNN
  55. 55 𝒅×𝒏 word embeddings 𝒅×𝒏 entity embeddings 𝒅×𝒏 context embeddings

    CNN layer pooling multiple channels Knowledge-Aware CNN
  56. 56 User’s clicked news Candidate news Attention Net KCNN KCNN

    KCNN KCNN concat. User embedding Candidate news embedding Click probability element-wise + element-wise × DKN Attention net: User interest extraction: CTR prediction: Attention-Based User Interest Extraction
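
A hedged sketch of the attention-based user interest extraction described on this slide; a plain dot-product attention and a sigmoid stand in for DKN's attention network and CTR predictor (both stand-ins are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dkn_click_probability(clicked_news_embs, candidate_emb):
    """clicked_news_embs: (n, d) KCNN embeddings of the user's clicked news;
    candidate_emb: (d,) KCNN embedding of the candidate news."""
    attn = softmax(clicked_news_embs @ candidate_emb)   # weight per clicked news
    user_emb = attn @ clicked_news_embs                 # user interest embedding
    score = user_emb @ candidate_emb
    return 1.0 / (1.0 + np.exp(-score))                 # click probability
```
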
  57. o Dataset: Bing News o (timestamp, user_id, news_url, news_title, click_label)

    o Training set: October 16, 2016 ~ June 11, 2017 o Test set: June 12, 2017 ~ August 11, 2017 o Knowledge graph: Microsoft Satori Dataset 57 Table: Dataset statistics.
  58. Experimental Results 58 Table: F1 and AUC scores of DKN

    and baselines. Comparison with baselines
  59. Knowledge Graph Convolutional Networks for Recommender Systems The 2019 Web

    Conference (WWW 2019) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, Minyi Guo Shanghai Jiao Tong University Structure-Based Method 59 Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems The 25th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2019) Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, Zhongyuan Wang Stanford University
  60. Relation Scoring Function 60 o No explicit weights for edges

    (relations) in a KG o Transforming a KG to a weighted graph by introducing a trainable and personalized relation scoring function 𝑠ᵤ(𝑟) o 𝑢: a user; 𝑟: a type of relation o 𝑠ᵤ(𝑟) identifies important relations for a given user o E.g., 𝑠ᵤ(𝑟) = 𝐮ᵀ𝐫 Knowledge graph 𝒢 Adjacency matrix 𝐴ᵤ
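
A minimal sketch of how the scoring function 𝑠ᵤ(𝑟) = 𝐮ᵀ𝐫 could turn a KG into a user-specific weighted adjacency matrix 𝐴ᵤ (the entity indexing and the symmetric treatment of edges are assumptions here):

```python
import numpy as np

def user_adjacency(triples, entity_index, rel_embs, user_emb, num_entities):
    """triples: (head, relation_id, tail); entity_index: entity name -> row index;
    rel_embs: relation_id -> relation embedding r; user_emb: the user's embedding u."""
    A_u = np.zeros((num_entities, num_entities))
    for h, r, t in triples:
        w = user_emb @ rel_embs[r]              # s_u(r) = u^T r
        i, j = entity_index[h], entity_index[t]
        A_u[i, j] = A_u[j, i] = w               # weighted (symmetric) edge
    return A_u
```
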
  61. Knowledge Graph Convolutional Networks 61 o Layer-wise forward propagation: Adjacency

    matrix of the KG for a particular user 𝑢
  62. 62 o Layer-wise forward propagation: Diagonal degree matrix of 𝐴ᵤ

    Knowledge Graph Convolutional Networks
  63. 63 o Layer-wise forward propagation: Trainable transformation matrix Knowledge Graph

    Convolutional Networks
  64. 64 o Layer-wise forward propagation: Entity embedding matrix Knowledge Graph

    Convolutional Networks
  65. 65 o Layer-wise forward propagation: 𝐻ₗ₊₁ = 𝜎(𝐷ᵤ^(−1/2) 𝐴ᵤ 𝐷ᵤ^(−1/2) 𝐻ₗ 𝑊ₗ) Knowledge

    Graph Convolutional Networks
  66. 66 o Layer-wise feature propagation: Knowledge Graph Convolutional Networks
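
Putting the pieces from slides 61–66 together, a hedged NumPy sketch of one propagation layer 𝐻ₗ₊₁ = 𝜎(𝐷ᵤ^(−1/2) 𝐴ᵤ 𝐷ᵤ^(−1/2) 𝐻ₗ 𝑊ₗ); the added self-loops and the tanh nonlinearity are assumptions for illustration:

```python
import numpy as np

def kgnn_layer(A_u, H, W):
    """A_u: user-specific weighted adjacency matrix; H: entity embedding matrix;
    W: trainable transformation matrix; D_u: diagonal degree matrix of A_u."""
    A_hat = A_u + np.eye(A_u.shape[0])                   # self-loops (assumption)
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.tanh(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```
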

  67. Predicting Engagement Probability 67 User embeddings

  68. Predicting Engagement Probability 68 User embeddings Entity (item) embeddings from

    the last KGNN layer
  69. Predicting Engagement Probability 69 User embeddings Entity (item) embeddings from

    the last KGNN layer Inner product, MLP, etc.
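
The simplest of the options on this slide, as a short sketch:

```python
import numpy as np

def predict_engagement(user_emb, item_emb):
    """Inner product of the user embedding and the item's entity embedding
    from the last KGNN layer, squashed to a probability with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(user_emb @ item_emb)))
```
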
  70. In Traditional GNNs… 70 Trainable Fixed

  71. But in KGCN… 71 How to solve the problem of

    overfitting? Trainable Trainable
  72. User Engagement Labels 72 0 1 1 User-engagement labels 𝑦ᵤᵥ

    for a particular user Negative items Positive items Non-item entities (unlabeled) 0
  73. 73 0 1 1 User-engagement labels 𝑦ᵤᵥ for a particular

    user ? How can we get the label for an unlabeled node? User Engagement Labels 0 Negative items Positive items Non-item entities (unlabeled)
  74. 74 0 1 1 ? weighted average o For a

    given node, take the weighted average of its neighborhood labels as its own label Label Propagation Algorithm (LPA) 0 How can we get the label for an unlabeled node?
  75. 75 0 1 1 ? o For a given node,

    take the weighted average of its neighborhood labels as its own label o Repeat the first step for every unlabeled node until convergence Label Propagation Algorithm (LPA) 0 How can we get the label for an unlabeled node?
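
A minimal sketch of the label propagation algorithm described above, run on the user-specific weighted graph (the fixed iteration count and zero-initialized unlabeled entries are assumptions):

```python
import numpy as np

def label_propagation(A_u, labels, labeled_mask, num_iters=20):
    """A_u: (N, N) weighted adjacency matrix; labels: (N,) array with known
    engagement labels (unlabeled entries can start at 0); labeled_mask: (N,)
    boolean array marking which labels are known."""
    y = labels.astype(float).copy()
    P = A_u / np.maximum(A_u.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    for _ in range(num_iters):
        y = P @ y                               # weighted average of neighbor labels
        y[labeled_mask] = labels[labeled_mask]  # clamp the known labels
    return y
```
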
  76. Label Smoothness Assumption 76 o LPA minimizes the following objective:

    𝐸 = ½ Σ_{(𝑖,𝑗)∈ℰ} 𝐴ᵤ[𝑖, 𝑗] (ŷᵤᵢ − ŷᵤⱼ)², where ŷ is the label predicted by LPA o Adjacent entities in the KG are more likely to have similar labels
  77. Label Smoothness Regularization 77 Hold out the label of 𝑣

    0 1 1 0 0 𝒗
  78. 78 0 1 1 0 𝒗 Hold out the label

    of 𝑣 Label Smoothness Regularization 0
  79. 79 Predict the label of 𝑣 by label propagation algorithm

    0 1 1 𝒗 Label Smoothness Regularization 0
  80. 80 0 1 1 𝒗 Label Smoothness Regularization 0 True

    label of 𝑣: 𝑦ᵤᵥ Predicted label of 𝑣: ŷᵤᵥ Cross-entropy loss 𝐽(𝑦ᵤᵥ, ŷᵤᵥ)
  81. Label Smoothness Regularization 81 𝑅(𝐴) = Σᵤ 𝑅(𝐴ᵤ) = Σᵤ Σᵥ 𝐽(𝑦ᵤᵥ, ŷᵤᵥ),

    where ŷᵤᵥ = 𝐿𝑃𝐴(𝑌∖{𝑦ᵤᵥ}; 𝐴ᵤ)
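
A hedged sketch of this regularizer: hold out each known label in turn, re-predict it with LPA, and accumulate the cross-entropy (the `lpa` argument is a routine like the label-propagation sketch above):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def ls_regularization(A_u, labels, labeled_mask, lpa):
    """R(A_u) = sum over labeled items v of J(y_uv, y_hat_uv), where y_hat_uv
    is the LPA prediction computed with y_uv held out."""
    R = 0.0
    for v in np.where(labeled_mask)[0]:
        mask = labeled_mask.copy()
        mask[v] = False                      # hold out the label of v
        y_hat = lpa(A_u, labels, mask)[v]    # LPA(Y \ {y_uv}; A_u)
        R += cross_entropy(labels[v], y_hat)
    return R
```
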
  82. The Unified Model: KGNN-LS 82 Step 1: learn edge weights, turning the original KG into an adjacency matrix

    Step 2: a GNN over it produces entity (item) embeddings, combined with user embeddings Step 3: predict ŷᵤᵥ (predicted labels by GNN) Step 4: label propagation produces another prediction (predicted labels by LPA) Given the ground truth 𝑦ᵤᵥ, loss(ŷ_GNN, 𝑦) updates W and A, while loss(ŷ_LPA, 𝑦) updates A
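
The two losses on this slide then combine into one training objective; a sketch (the weighting factor `lam` is an assumption):

```python
def kgnn_ls_objective(y, y_gnn, y_lpa, ce, lam=1.0):
    """y: ground-truth engagement labels; y_gnn: predictions from the GNN
    (updates W and A); y_lpa: predictions from label propagation (updates A
    only); ce: a cross-entropy loss function."""
    return ce(y, y_gnn) + lam * ce(y, y_lpa)
```
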
  83. Click-through Rate Prediction 5.1% 6.9% 8.3% 4.3% Average improvements in

    AUC 83
  84. LS Regularization 84 without LS regularization with LS regularization o

    Dataset: Last.FM
  85. Cold Start Scenario 85 o Dataset: MovieLens-20M o Varying the

    size of training set from 𝑟 = 100% to 𝑟 = 20% AUC decreases by 8.4% 5.9% 5.4% 3.6% 2.8% 4.1% 1.8% More sparse
  86. Comparison o Performance o KGNN-LS (Aug 2019) > KGCN (May

    2019) > DKN (Apr 2018) o Scalability o Embedding-based methods > structure-based methods o User-item interactions change with time, but KGs don’t o Knowledge graph embeddings can be reused o Explainability o Structure-based methods > embedding-based methods o Graph structures are more intuitive than embeddings 86
  87. Take-Aways o Graph representation learning is a fundamental step in

    graph-related tasks o Graph neural networks are a special type of GRL method o Knowledge graphs are a special type of graph o Knowledge graph completion o PathCon: combining context and path information o Knowledge-graph-aware recommendation o DKN for news recommendation o KGCN/KGNN-LS for aggregating neighboring entity information on KGs using GNNs 87
  88. Q & A More information is available at: https://hongweiw.net All

    the source code is available at: https://github.com/hwwang55 Thanks! 88