Slide 1

Graph Representation Learning: From Knowledge Graphs to Recommender Systems
Hongwei Wang
University of Illinois Urbana-Champaign
Sep 28, 2021

Slide 2

A Short Bio
o Education
  o B.E., Computer Science, Shanghai Jiao Tong University, 2010-2014
  o Ph.D., Computer Science, Shanghai Jiao Tong University, 2014-2018
  o Postdoc, Computer Science, Stanford University, 2019-2021
  o Postdoc, Computer Science, University of Illinois Urbana-Champaign, 2021-
o Awards
  o 2018 Google Ph.D. Fellowship
  o 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation
o Research Interests
  o Graph neural networks, knowledge graphs, recommender systems

Slide 3

Content
o Graph representation learning
  o Graph neural networks
o Knowledge graphs
  o Knowledge graph completion
  o Embedding-based methods
o Knowledge-graph-aware recommendation
  o Embedding-based methods: DKN
  o Structure-based methods: KGCN and KGNN-LS

Slide 4

Graphs are Ubiquitous
A graph is a structure amounting to a set of objects in which some pairs of objects are in some sense "related". Examples at different scales:
o Atom-level: molecules
o Molecule-level: proteins, protein-protein interactions, synthetic routes
o Human-level: social networks, knowledge graphs
o World-level: navigation maps, flight routes

Slide 5

Representing a Graph
Graph G = (V, E) and its adjacency matrix A:

A = [ 0 1 1 0 0 0
      1 0 1 0 1 0
      0 1 0 0 0 1
      1 0 0 0 1 1
      1 1 0 0 0 1
      0 1 0 0 0 0 ]
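As a concrete illustration, here is a minimal sketch of building a dense adjacency matrix in Python (the edge list and node count are hypothetical, not the graph on the slide):

```python
import numpy as np

# Hypothetical edge list for a small graph with 6 nodes (0..5).
edges = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 5), (3, 4), (3, 5)]
n = 6

A = np.zeros((n, n), dtype=int)   # dense n x n adjacency matrix
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1                   # undirected graph: A is symmetric

print(A)                          # n x n entries: O(n^2) storage, as the next slide notes
```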

Slide 6

Representing a Graph
When the graph is very large...
o Storage inefficiency: O(N^2)
o Hard to compute node similarity

Slide 7

Graph Representation Learning
Graph representation learning (GRL) maps a graph G (its nodes, edges, subgraphs, or the whole graph) to points (embeddings) in a low-dimensional space R^d, where d << #nodes, preserving both structural and semantic information.

Slide 8

Downstream Tasks of GRL: Link Prediction
Learning a mapping f: [e_{v_i}, e_{v_j}] -> {0, 1}
o Are the two users friends in a social network?
o Is there a flight between the two airports?
o ...

Slide 9

Downstream Tasks of GRL: Node Classification
Learning a mapping f: e_v -> set of node labels
o Is a user male or female in a social network?
o What research field does a paper belong to in a citation network?
o ...

Slide 10

Downstream Tasks of GRL: Graph Classification
Learning a mapping f: e_graph -> set of graph labels
o Is a molecule toxic or nontoxic?

Slide 11

Graph Neural Networks (GNNs)
x_i: the initial feature of node v_i
GNNs follow a neighborhood aggregation strategy (see the sketch below):
  initialize h_v^0 = x_v for each node v
  for k = 1, ..., K:
    for each node v: h_v^k = AGGREGATE(h_v^{k-1}, {h_u^{k-1} : u in N(v)})
  return h_v^K for each node v
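A minimal sketch of this loop in Python; a mean aggregator with an untrained mixing step stands in for the learnable AGGREGATE (GNN variants differ in exactly this function):

```python
import numpy as np

def gnn_forward(x, neighbors, K):
    """x: {node: initial feature vector}; neighbors: {node: list of neighbor nodes}."""
    h = {v: np.asarray(x[v], dtype=float) for v in x}   # h_v^0 = x_v
    for _ in range(K):
        h = {
            # combine v's own state with the mean of its neighbors' states
            v: 0.5 * (h[v] + np.mean([h[u] for u in neighbors[v]], axis=0))
            for v in x
        }
    return h                                            # h_v^K for every node v

# Tiny triangle graph with 2-dim features
x = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(gnn_forward(x, nbrs, K=2))
```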

Slide 12

Graph Neural Networks (GNNs)
AGGREGATE function in the Graph Convolutional Network (GCN):
o W_k is a learnable transformation matrix for layer k
o alpha_ij = 1/|N(i)|, where 1/|N(i)| is a normalization factor
o sigma is an activation function such as ReLU(x) = max(x, 0)
(a sketch follows)

Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." The 5th International Conference on Learning Representations (2017).
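A sketch of one GCN-style layer using the slide's normalization alpha_ij = 1/|N(i)| (the original paper uses symmetric normalization with self-loops; W here is a random stand-in for the learned matrix):

```python
import numpy as np

def gcn_layer(A, H, W):
    """A: n x n adjacency; H: n x d_in node states; W: d_in x d_out weights."""
    deg = A.sum(axis=1, keepdims=True)   # |N(i)| per node
    H_agg = (A / deg) @ H                # alpha_ij = 1 / |N(i)|
    return np.maximum(H_agg @ W, 0)      # sigma = ReLU(x) = max(x, 0)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
print(gcn_layer(A, H, W))                # new 3 x 2 node states
```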

Slide 13

GRL in Knowledge Graphs
Relational Message Passing for Knowledge Graph Completion
Hongwei Wang, Hongyu Ren, Jure Leskovec
Stanford University
The 27th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021)

Slide 14

Knowledge Graphs
o Knowledge graphs (KGs) store structured information about real-world entities and facts: G = {(h, r, t)}, where h is the head entity, r the relation, and t the tail entity
o Example entities: Cast Away, The Green Mile, Back to the Future, Tom Hanks, Robert Zemeckis, Steven Spielberg, Adventure
o Example relations: starred, directed, genre, style, collaborate

Slide 15

Knowledge Graph Completion
o Knowledge graphs are usually incomplete and noisy
o KG completion: given (h, ?, t), predict r
o Modeling the distribution over relation types: p(r | h, t)

Slide 16

Relations are Correlated...
o Relational context (the neighbor edges of a given edge) is informative: the relation "graduated from" connects a person and a school, so it tends to co-occur with relations such as person.birthplace, person.gender, institution.location, university.founder, and university.president, but not with movie.language

Slide 17

Relations are Correlated...
o Relational paths (paths connecting the two endpoints of a given edge) are informative: a path "graduated from" followed by "has alumni" between two people supports the relation "schoolmate of"

Slide 18

The Proposed Method: PathCon
Predict the unknown relation r? between head h and tail t using a relational context module.

Slide 19

The Proposed Method: PathCon
The relational context module performs relational message passing over the first-order and second-order neighbor edges of h and t.

Slide 20

The Proposed Method: PathCon
The relational paths module considers the paths connecting h and t.

Slide 21

Node-Based Message Passing
o Aggregates neighbor node information, then updates node information (standard forms reconstructed below)
o Does not work well on KGs because:
  o In most KGs, edges have features (relation types) but nodes don't
  o Making use of node identity fails in the inductive setting
  o The number of nodes is much larger than the number of relation types
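The aggregate and update equations appear as images on the original slide; in the notation of slide 11 they take the standard form below (a reconstruction, not the slide's exact rendering):

```latex
m_{\mathcal{N}(v)}^{k} = \mathrm{AGGREGATE}\big(\{\, h_u^{k-1} : u \in \mathcal{N}(v) \,\}\big),
\qquad
h_v^{k} = \mathrm{UPDATE}\big(h_v^{k-1},\, m_{\mathcal{N}(v)}^{k}\big)
```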

Slide 22

Relational (Edge-Based) Message Passing
o Aggregates neighbor edge information, then updates edge information
o Avoids the drawbacks of node-based message passing, but introduces a new issue of computational efficiency

Slide 23

Message Passing Complexity
Consider a graph with N nodes and M edges
o Complexity of node-based message passing in each iteration: 2N + 2M
o Complexity of relational message passing: N Var(d) + 4M^2/N, where Var(d) is the variance of node degrees

Slide 24

Alternate Message Passing
o Step 1: aggregate neighbor edge information to nodes
o Step 2: aggregate node information back to neighbor edges and update edge information
(a code sketch follows)
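A minimal sketch of the alternating scheme (sum aggregation, matching the next slides; the edge update is an untrained stand-in for the learned one):

```python
import numpy as np

def alternate_mp(edge_feat, incident, endpoints, K):
    """edge_feat: {edge_id: feature vector}; incident: {node: [edge_ids]};
    endpoints: {edge_id: (u, v)}. Returns final node messages and edge states."""
    s = {e: np.asarray(f, dtype=float) for e, f in edge_feat.items()}  # s_e^0
    m = {}
    for _ in range(K):
        # Step 1: aggregate edge states into node messages
        m = {v: np.sum([s[e] for e in es], axis=0) for v, es in incident.items()}
        # Step 2: pass endpoint messages back to each edge and update its state
        s = {e: 0.5 * (m[u] + m[v]) + s[e] for e, (u, v) in endpoints.items()}
    return m, s
```

Each iteration touches every edge only a constant number of times (once per endpoint in each step), which is roughly where the 6M count on the next slide comes from.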

Slide 25

Message Passing Complexity
Consider a graph with N nodes and M edges
o Complexity of node-based message passing in each iteration: 2N + 2M
o Complexity of relational message passing: N Var(d) + 4M^2/N, where Var(d) is the variance of node degrees
o Complexity of alternate relational message passing: 6M

Slide 26

Relational Context
o s_e^i: the hidden state of edge e in iteration i (s_e^0 is e's initial feature)
o m_v^i: the message stored at node v in iteration i
Making use of relational context:
o Message passing in each iteration: see the reconstruction below
o Final messages of (h, t): m_h^(K-1) and m_t^(K-1), where K is the number of message passing iterations

Slide 27

Relational Paths
Making use of relational paths:
o A raw path from h to t is a sequence of edges; the corresponding relational path is its sequence of relation types
o Enumerate all relational paths with length <= L (a sketch follows)
o Assign an independent embedding vector s_P to each relational path P
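A sketch of enumerating relational paths by depth-first search (the triples and relation names are hypothetical; real KGs would also need cycle handling and inverse relations):

```python
def relational_paths(triples, h, t, L):
    """Return all relation-type sequences of length <= L along directed
    paths from entity h to entity t. triples: list of (head, rel, tail)."""
    out_edges = {}
    for s, r, o in triples:
        out_edges.setdefault(s, []).append((r, o))

    paths = []
    def dfs(node, rels):
        if node == t and rels:
            paths.append(tuple(rels))
        if len(rels) == L:
            return
        for r, nxt in out_edges.get(node, []):
            dfs(nxt, rels + [r])
    dfs(h, [])
    return paths

triples = [("Alice", "graduated_from", "MIT"), ("MIT", "has_alumni", "Bob")]
print(relational_paths(triples, "Alice", "Bob", L=2))
# [('graduated_from', 'has_alumni')]
```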

Slide 28

Combining Relational Context and Paths
o Combine the final messages m_h^(K-1) and m_t^(K-1) to obtain the context information of (h, t)
o Aggregate the information of all paths from h to t with attention over the path embeddings s_P
o Make the prediction by combining the above two
o Train the model

Slide 29

Experiments: Datasets
Table 1: Statistics of all datasets, including our proposed new dataset (DDB14).

Slide 30

Experiments: Baselines
o Embedding-based models: TransE, ComplEx, DistMult, RotatE, SimplE, QuatE
o Path-based models: DRUM
o Ablation studies: Path, Con
Table 2: Number of model parameters on DDB14.

Slide 31

Experiments: Comparison with Baselines
Table 3: MRR and Hit@1 results on all datasets.
Hit@1 gains (one per dataset): 0.2%, 0.6%, 0.9%, 16.7%, 6.3%, 1.8%

Slide 32

Experiments: Comparison with Baselines
Table 3: MRR and Hit@1 results on all datasets.
The performance variance is very small.

Slide 33

Experiments: Comparison with Baselines
Table 3: MRR and Hit@1 results on all datasets.
The performance of the Path and Con ablations is already quite good.

Slide 34

Experiments: Inductiveness
Figure 1: Hit@1 results on WN18RR, moving from the fully transductive to the fully inductive setting. PathCon's Hit@1 drops only from 0.954 to 0.922; random guessing is shown for reference.

Slide 35

Experiments: Explainability
Indices of all relations in DDB14.

Slide 36

Experiments: Explainability
Figure 4: The learned correlation between all relational paths with length <= 2 and the predicted relations on DDB14. Examples:
o (a, is associated with, b) ∧ (b, is associated with, c) ⟹ (a, is associated with, c)
o (a, may be allelic with, b) ∧ (b, may be allelic with, c) ⟹ (a, may be allelic with, c)

Slide 37

Experiments: Explainability
Figure 4: The learned correlation between all relational paths with length <= 2 and the predicted relations on DDB14. Examples:
o (a, belong(s) to the category of, b) ⟺ (a, is a subtype of, b)
o (a, is a risk factor for, b) ⟹ (a, may cause, b)
o (a, may cause, c) ∧ (b, may cause, c) ⟹ (a, may be allelic with, b)

Slide 38

Recommender Systems: Movies
Recommender systems (RS) address the information explosion by finding a small set of items that match users' personalized interests.

Slide 39

Recommender Systems: Books
Recommender systems (RS) address the information explosion by finding a small set of items that match users' personalized interests.

Slide 40

Recommender Systems: Trips
Recommender systems (RS) address the information explosion by finding a small set of items that match users' personalized interests.

Slide 41

Recommender Systems
More domains: QA, short video, music.

Slide 42

Rating/CTR Prediction
o Rating prediction: explicit feedback, a user-item matrix of ratings (e.g., 2, 3, 4, 5) with unknown entries to predict
o Click-through rate (CTR) prediction: implicit feedback, a user-item matrix of 0/1 click labels with unknown entries to predict

Slide 43

Collaborative Filtering
Example: predict u4's missing rating of i1 from similar users.

      i1  i2  i3  i4   similarity with u4
 u1    2   3   3   1        0.7
 u2    3   1   1   4        0.1
 u3    2   5   4   5        0.2
 u4    ?   3   4   1

? = 0.7×2 + 0.1×3 + 0.2×2 = 2.1 (a code sketch follows)
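The prediction above is just a similarity-weighted average; a few lines reproduce it (values taken from the slide's matrix):

```python
import numpy as np

ratings_i1 = np.array([2.0, 3.0, 2.0])    # u1, u2, u3 ratings of item i1
sims_with_u4 = np.array([0.7, 0.1, 0.2])  # similarity of u1, u2, u3 with u4

# Here the similarities already sum to 1, so no extra normalization is needed.
pred = sims_with_u4 @ ratings_i1          # 0.7*2 + 0.1*3 + 0.2*2
print(pred)                               # 2.1
```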

Slide 44

CF Cannot Handle...
o Sparsity of user-item interactions
o The cold start problem

Slide 45

CF + Side Information
o Social networks
o User/item attributes (e.g., Alice: female, California, ...)
o Multimedia (images, texts, videos, audios, ...)
o Contexts (e.g., a purchase of an iPhone X at 20:10 in Beijing, together with what else is in the cart; item attributes: 2017, 5.8 inch, $999, ...)

Slide 46

Why Use KGs in Recommender Systems?
A KG connects the items (movies) a user watched, e.g., Cast Away and Back to the Future, to non-item entities such as Tom Hanks, Robert Zemeckis, Steven Spielberg, and Adventure through relations such as starred, directed, genre, style, and collaborate, and thereby to other candidate items such as The Green Mile, Forrest Gump, Raiders of the Lost Ark, and Interstellar.

Slide 47

Why Use KGs in Recommender Systems?
News the user has read: "Boris Johnson Has Warned Donald Trump To Stick To The Iran Nuclear Deal"
News the user may also like: "North Korean EMP Attack Would Cause Mass U.S. Starvation, Says Congressional Report"
The KG links the two through entities such as Boris Johnson, Donald Trump, Iran, Nuclear, Congress, EMP, North Korea, United States, Politician, and Weapon.

Slide 48

Problem Formulation
o Given: users, items, user engagement labels y_uv ∈ {0, 1}, and a knowledge graph G containing the items and other non-item entities
o Goal: learn the predicted engagement probability ŷ_uv

Slide 49

KG-Enhanced Recommender Systems: Embedding-Based Methods
The knowledge graph side supplies entity and relation embeddings; the recommender side supplies user and item embeddings; a model combines both.

Slide 50

KG-Enhanced Recommender Systems: Structure-Based Methods
The model directly consumes the structure information of the user-item interactions and the knowledge graph.

Slide 51

Embedding-Based Method
Deep Knowledge-Aware Network for News Recommendation
Hongwei Wang, Fuzheng Zhang, Xing Xie, Minyi Guo
Shanghai Jiao Tong University
The 2018 Web Conference (WWW 2018)

Slide 52

Knowledge Distillation
From news titles (e.g., "Trump praises Las Vegas medical team"; "Apple CEO Tim Cook: iPhone 8 and Apple Watch Series 3 are sold out in some places"; "EU Spain: Juncker does not want Catalonian independence"; ...):
o Entity linking, e.g., Donald Trump ("Donald Trump is the 45th president ..."), Las Vegas ("Las Vegas is the 28th-most populated city ..."), Apple Inc., CEO, Tim Cook, iPhone 8 ("iPhone 8 is a smartphone designed ...")
o Knowledge subgraph construction
o Knowledge graph embedding, yielding entity embeddings, e.g., Donald Trump: (0.32, 0.48); Las Vegas: (0.71, -0.49); Apple Inc.: (-0.48, -0.41); CEO: (-0.57, 0.06); Tim Cook: (-0.61, -0.59); iPhone 8: (-0.46, -0.75)

Slide 53

Context Embedding
The context of an entity (e.g., "Fight Club") is its set of immediate neighbors in the KG; the entity's context embedding summarizes the embeddings of these context entities.

Slide 54

Kim CNN
o Input sentence w_{1:n} = [Donald Trump praises Las Vegas medical team], represented as a d×n word embedding matrix
o Convolution produces feature maps; max pooling over the feature maps yields the sentence representation

Slide 55

Knowledge-Aware CNN
Stack the d×n word embeddings, d×n entity embeddings, and d×n context embeddings as multiple channels of the CNN layer, followed by pooling.

Slide 56

Attention-Based User Interest Extraction (DKN)
o KCNN encodes each of the user's clicked news and the candidate news
o Attention net: scores each clicked news against the candidate news (inputs combined via concatenation, element-wise +, element-wise ×)
o User interest extraction: the user embedding is the attention-weighted combination of the clicked-news embeddings
o CTR prediction: combine the user embedding and the candidate news embedding to output the click probability
(a sketch follows)
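A sketch of the attention step; dot-product attention stands in for DKN's learned attention network, and the embeddings are random placeholders:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def dkn_score(clicked_emb, cand_emb):
    """clicked_emb: (num_clicked, d) KCNN embeddings of the user's clicked news;
    cand_emb: (d,) KCNN embedding of the candidate news."""
    attn = softmax(clicked_emb @ cand_emb)               # weight each clicked news
    user_emb = attn @ clicked_emb                        # attention-weighted user embedding
    return 1.0 / (1.0 + np.exp(-user_emb @ cand_emb))   # click probability

rng = np.random.default_rng(0)
print(dkn_score(rng.normal(size=(5, 8)), rng.normal(size=8)))
```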

Slide 57

Dataset
o Bing News: (timestamp, user_id, news_url, news_title, click_label)
o Training set: October 16, 2016 ~ June 11, 2017
o Test set: June 12, 2017 ~ August 11, 2017
o Knowledge graph: Microsoft Satori
Table: Dataset statistics.

Slide 58

Experimental Results
Comparison with baselines. Table: F1 and AUC scores of DKN and baselines.

Slide 59

Structure-Based Method
Knowledge Graph Convolutional Networks for Recommender Systems
Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, Minyi Guo
Shanghai Jiao Tong University
The 2019 Web Conference (WWW 2019)

Knowledge-Aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems
Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, Zhongyuan Wang
Stanford University
The 25th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2019)

Slide 60

Relation Scoring Function
o A KG has no explicit weights on its edges (relations)
o Transform the KG G into a weighted graph with adjacency matrix A_u by introducing a trainable, personalized relation scoring function s_u(r)
  o u: a user; r: a relation type
  o s_u(r) identifies the relations that are important to the given user
  o E.g., s_u(r) = u^T r (a sketch follows)
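A sketch of turning a KG into a user-specific weighted adjacency matrix with s_u(r) = u^T r (integer entity ids, the embedding setup, and the undirected treatment are assumptions for illustration):

```python
import numpy as np

def personalized_adjacency(triples, user_emb, rel_emb, n_entities):
    """triples: list of (h, r, t) with integer entity ids; rel_emb: {r: vector}."""
    A_u = np.zeros((n_entities, n_entities))
    for h, r, t in triples:
        w = float(user_emb @ rel_emb[r])   # s_u(r) = u^T r
        A_u[h, t] = A_u[t, h] = w          # treat the KG as undirected here
    return A_u
```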

Slide 61

Knowledge Graph Convolutional Networks
o Layer-wise forward propagation, built from the adjacency matrix A_u of the KG for the particular user u

Slide 62

Knowledge Graph Convolutional Networks
o Layer-wise forward propagation, built from D_u, the diagonal degree matrix of A_u

Slide 63

Knowledge Graph Convolutional Networks
o Layer-wise forward propagation, built from W, a trainable transformation matrix

Slide 64

Knowledge Graph Convolutional Networks
o Layer-wise forward propagation, built from E, the entity embedding matrix

Slide 65

Knowledge Graph Convolutional Networks
o Layer-wise forward propagation: the propagated embeddings are passed through the nonlinearity σ(·)

Slide 66

Knowledge Graph Convolutional Networks
o Layer-wise feature propagation, combining all of the pieces above (a sketch follows)
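Assembled from the pieces named on slides 61-66, one propagation layer can be sketched as below; the symmetric GCN-style normalization is an assumption, since the slides only name A_u, D_u, W, E, and σ:

```python
import numpy as np

def kgcn_layer(A_u, E, W):
    """A_u: user-specific weighted adjacency (assumed nonnegative here);
    E: entity embedding matrix; W: trainable transformation matrix.
    Returns sigma(D^{-1/2} A_u D^{-1/2} E W) with sigma = ReLU."""
    d = A_u.sum(axis=1)
    d[d == 0] = 1.0                          # guard isolated entities
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_u @ D_inv_sqrt @ E @ W, 0)
```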

Slide 67

Predicting Engagement Probability
o User embeddings

Slide 68

Predicting Engagement Probability
o User embeddings
o Entity (item) embeddings from the last KGNN layer

Slide 69

Predicting Engagement Probability
o User embeddings
o Entity (item) embeddings from the last KGNN layer
o Combine the two via inner product, MLP, etc.

Slide 70

In Traditional GNNs...
o Transformation matrices: trainable
o Adjacency matrix: fixed

Slide 71

But in KGCN...
o Transformation matrices: trainable
o Adjacency matrix (edge weights s_u(r)): also trainable
How do we solve the resulting problem of overfitting?

Slide 72

User Engagement Labels
User-engagement labels y_uv for a particular user:
o Positive items: label 1
o Negative items: label 0
o Non-item entities: unlabeled

Slide 73

User Engagement Labels
User-engagement labels y_uv for a particular user: positive items (1), negative items (0), non-item entities (unlabeled).
How can we get the label for an unlabeled node?

Slide 74

Label Propagation Algorithm (LPA)
How can we get the label for an unlabeled node?
o For a given node, take the weighted average of its neighborhood labels as its own label

Slide 75

Label Propagation Algorithm (LPA)
How can we get the label for an unlabeled node?
o For a given node, take the weighted average of its neighborhood labels as its own label
o Repeat the first step for every unlabeled node until convergence
(a sketch follows)
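A minimal sketch of LPA with row-normalized weights; the known labels are clamped after every step and iteration stops at a fixed point:

```python
import numpy as np

def label_propagation(A, labels, known, n_iter=100, tol=1e-6):
    """A: (n, n) nonnegative weighted adjacency; labels: (n,) initial labels
    (entries for unknown nodes are arbitrary); known: (n,) boolean mask."""
    W = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    y = labels.astype(float).copy()
    for _ in range(n_iter):
        y_new = W @ y                 # weighted average of neighbor labels
        y_new[known] = labels[known]  # clamp the observed labels
        if np.max(np.abs(y_new - y)) < tol:
            break
        y = y_new
    return y
```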

Slide 76

Label Smoothness Assumption
o Adjacent entities in the KG are more likely to have similar labels
o LPA minimizes the following objective, where ŷ is the label predicted by LPA:
  E = (1/2) Σ_{(i,j)∈ℰ} A_u[i, j] (ŷ_ui - ŷ_uj)^2

Slide 77

Label Smoothness Regularization
o Hold out the label of a labeled node v

Slide 78

Label Smoothness Regularization
o Hold out the label of v, i.e., treat v as unlabeled

Slide 79

Label Smoothness Regularization
o Predict the label of v by the label propagation algorithm

Slide 80

Label Smoothness Regularization
o Compare the true label y_uv of v with its predicted label ŷ_uv via the cross-entropy loss J(y_uv, ŷ_uv)

Slide 81

Label Smoothness Regularization
o Sum this loss over all users and items:
  R(A) = Σ_u R(A_u) = Σ_u Σ_v J(y_uv, ŷ_uv), where ŷ_uv = LPA(Y \ {y_uv}; A_u)

Slide 82

The Unified Model: KGNN-LS
o Step 1: learn edge weights from the original KG to form the adjacency matrix A_u
o Step 2: run the GNN over A_u to get entity (item) embeddings
o Step 3: combine with user embeddings to predict ŷ_uv (predicted labels by GNN), giving loss(ŷ, y) against the ground truth y_uv
o Step 4: run label propagation over A_u to get the LPA-predicted labels, giving a second loss against y_uv
o The first loss updates W and A; the second updates A
(a sketch of the combined objective follows)
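Putting the branches together, the training objective can be sketched as below; lam and gamma are assumed hyperparameters, since the slide only names the two loss terms:

```python
import numpy as np

def kgnn_ls_loss(y, y_gnn, y_lpa, weights, lam=1.0, gamma=1e-4):
    """y: ground-truth labels; y_gnn: GNN predictions (steps 1-3);
    y_lpa: LPA predictions over the same learned A_u (step 4)."""
    bce = lambda t, p: -np.mean(t * np.log(p + 1e-9) + (1 - t) * np.log(1 - p + 1e-9))
    l2 = sum(np.sum(w ** 2) for w in weights)   # regularize model parameters
    return bce(y, y_gnn) + lam * bce(y, y_lpa) + gamma * l2
```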

Slide 83

Click-Through Rate Prediction
Average improvements in AUC: 5.1%, 6.9%, 8.3%, 4.3%

Slide 84

LS Regularization
o Dataset: Last.FM
o Results without vs. with LS regularization

Slide 85

Cold Start Scenario
o Dataset: MovieLens-20M
o Varying the size of the training set from r = 100% down to r = 20% (more sparse)
o AUC decreases by 8.4%, 5.9%, 5.4%, 3.6%, 2.8%, 4.1%, 1.8%

Slide 86

Comparison
o Performance: KGNN-LS (Aug 2019) > KGCN (May 2019) > DKN (Apr 2018)
o Scalability: embedding-based methods > structure-based methods
  o User-item interactions change with time, but KGs don't
  o Knowledge graph embeddings can be reused
o Explainability: structure-based methods > embedding-based methods
  o Graph structures are more intuitive than embeddings

Slide 87

Take-Aways
o Graph representation learning is a fundamental step in graph-related tasks
o Graph neural networks are a special type of GRL method
o Knowledge graphs are a special type of graph
o Knowledge graph completion
  o PathCon: combining relational context and relational paths
o Knowledge-graph-aware recommendation
  o DKN for news recommendation
  o KGCN/KGNN-LS for aggregating neighboring entity information on KGs using GNNs

Slide 88

Q & A
Thanks!
More information is available at: https://hongweiw.net
All the source codes are available at: https://github.com/hwwang55