
Graph Representation Learning: From Knowledge Graphs To Recommender Systems

wing.nus
October 01, 2021


Graphs are ubiquitous in the real world. To help machine learning algorithms make use of graph-structured data, researchers have proposed graph representation learning (GRL) methods, which learn a low-dimensional real-valued vector for each node in a graph. In this talk, I will first briefly introduce graph representation learning, graph neural networks (GNNs, a special type of GRL method), and knowledge graphs (KGs, a special type of graph). Then my talk will consist of two parts: (1) Knowledge graph completion. I will introduce PathCon, a GNN-based method that combines relational context and relational path information to predict the relation type of an edge in a KG. (2) Knowledge-graph-aware recommendation. Knowledge graphs can provide additional item-item relationships and thus alleviate the cold-start problem in recommender systems. I will introduce three KG-aware recommendation algorithms: an embedding-based method, DKN, and two structure-based methods, KGCN and KGNN-LS.


Transcript

  1. Graph Representation Learning: From Knowledge Graphs To Recommender Systems Hongwei

    Wang University of Illinois Urbana-Champaign Sep 28, 2021
  2. A Short Bio 2 o Education o B.E., Computer Science

    Shanghai Jiao Tong University, 2010-2014 o Ph.D., Computer Science Shanghai Jiao Tong University, 2014-2018 o Postdoc, Computer Science Stanford University, 2019-2021 o Postdoc, Computer Science University of Illinois Urbana-Champaign, 2021- o Awards o 2018 Google Ph.D. Fellowship o 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation o Research Interests o Graph neural networks, knowledge graphs, recommender systems
  3. Content 3 o Graph representation learning o Graph neural networks

    o Knowledge graphs o Knowledge graph completion o Embedding-based methods o Knowledge-graph-aware recommendation o Embedding-based methods: DKN o Structure-based methods: KGCN and KGNN-LS
  4. Graphs are Ubiquitous 4 A graph is a structure amounting

    to a set of objects in which some pairs of objects are in some sense “related” Molecule Protein Protein-protein interaction Synthetic routes Social networks Knowledge graphs Navigation map Flight routes Atom-level Molecule-level Human-level World-level
  5. Representing a Graph 5 𝐺 = (𝑉, 𝐸)

    [Figure: an example graph 𝐺 and its 6×6 adjacency matrix 𝐴, where 𝐴[𝑖][𝑗] = 1 iff nodes 𝑖 and 𝑗 are connected]
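
As a minimal sketch of the adjacency-matrix representation above (the graph and edge list here are illustrative, not the exact graph on the slide):

```python
import numpy as np

# Illustrative undirected graph with 6 nodes (hypothetical edge list).
num_nodes = 6
edges = [(0, 1), (0, 3), (1, 2), (2, 4), (3, 4), (4, 5)]

# Dense adjacency matrix A: A[i, j] = 1 iff nodes i and j are connected.
A = np.zeros((num_nodes, num_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph, so A is symmetric

print(A)
```
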
  6. Representing a Graph 6 When the graph is very large…

    o Storage inefficiency: 𝑂(𝑁²) o Hard to compute node similarity
  7. Graph Representation Learning 7 Node embeddings in ℝ^𝑑 (𝑑 ≪

    #𝑛𝑜𝑑𝑒𝑠) Graph 𝐺: Nodes / Edges / Subgraphs / Graphs → Points (embeddings) in low-dimensional space ℝ^𝑑 Graph representation learning (GRL) Structural information Semantic information
  8. Downstream Tasks of GRL Link Prediction 8 ? 𝑣₁ 𝑣₂

    Learning a mapping: 𝑓: [𝐞_node1, 𝐞_node2] ↦ {0,1} o Are the two users friends in a social network? o Is there a flight between the two airports? ……
  9. Downstream Tasks of GRL Node Classification 9 label? 𝑣 Learning

    a mapping: 𝑓: 𝐞_node ↦ set of node labels o Is a user male or female in a social network? o What research field does a paper belong to in a citation network? ……
  10. Downstream Tasks of GRL Graph Classification 10 Learning a mapping:

    𝑓: 𝐞_graph ↦ set of graph labels toxic nontoxic Toxic or nontoxic?
  11. Graph Neural Networks (GNNs) 11 𝑥ᵢ: initial feature of

    node 𝑣ᵢ GNNs follow a neighborhood aggregation strategy: ℎᵢ⁰ = 𝑥ᵢ for each node 𝑣ᵢ; for 𝑘 = 1, …, 𝐾: aggregate over the neighbors of each node 𝑣ᵢ to update ℎᵢᵏ; return ℎᵢᴷ for each node 𝑣ᵢ
  12. Graph Neural Networks (GNNs) 12 AGGREGATE function in GCN: 𝑊ₖ

    is a learnable transformation matrix for layer 𝑘, 𝛼ᵢⱼ = 1/|𝒩(𝑖)| is a normalization factor, and 𝜎 is an activation function such as ReLU(𝑥) = max(𝑥, 0) Graph Convolutional Network (GCN) Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." The 5th International Conference on Learning Representations (2017).
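
The AGGREGATE step above can be sketched in a few lines of NumPy. This is a simplified mean-aggregation variant following the slide's 𝛼ᵢⱼ = 1/|𝒩(𝑖)| (the full GCN of Kipf & Welling uses symmetric normalization); all names below are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gcn_layer(A, H, W):
    """One simplified GCN layer: h_i <- ReLU( sum_j alpha_ij * W h_j ),
    with alpha_ij = 1/|N(i)| and a self-loop added for each node.
    A: (N, N) adjacency matrix, H: (N, d_in) node features,
    W: (d_in, d_out) learnable transformation matrix for this layer."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # |N(i)| including the node itself
    return relu((A_hat / deg) @ H @ W)

# Toy example: 4 nodes with 3-dim features mapped to 2-dim embeddings.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H0 = np.random.randn(4, 3)
W1 = np.random.randn(3, 2)
H1 = gcn_layer(A, H0, W1)                   # node embeddings after one layer
```
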
  13. Relational Message Passing for Knowledge Graph Completion The 27th SIGKDD

    Conference on Knowledge Discovery and Data Mining (KDD 2021) Hongwei Wang, Hongyu Ren, Jure Leskovec Stanford University GRL in Knowledge Graphs 13
  14. o Knowledge graphs (KGs) store structured information of real-world entities

    and facts Cast Away The Green Mile Tom Hanks Robert Zemeckis Adventure Back to the Future Steven Spielberg genre genre style starred 𝐺 = {(ℎ, 𝑟, 𝑡)} Head entity Tail entity Relation directed collaborate Knowledge Graphs 14
  15. o Knowledge graphs are usually incomplete and noisy o KG

    completion: given (ℎ, ? , 𝑡), predict 𝑟 o Modeling the distribution over relation types: 𝑝 𝑟 ℎ, 𝑡) ? Cast Away The Green Mile Tom Hanks Robert Zemeckis Adventure Back to the Future Steven Spielberg genre genre style starred directed collaborate Knowledge Graph Completion 15
  16. o Relational context (neighbor edges of a given edge) graduated

    from a person a school person.birthplace person.gender institution.location university.founder university.president movie.language Relations are Correlated… 16
  17. o Relational paths (paths connecting the two endpoints of a

    given edge) graduated from has alumni schoolmate of graduated from Relations are Correlated… 17
  18. 𝑟? head ℎ tail 𝑡 Relational context module The Proposed

    Method: PathCon 18
  19. 𝑟? head ℎ tail 𝑡 Relational context module first-order neighbor

    second-order neighbor relational message passing The Proposed Method: PathCon 19
  20. 𝑟? head ℎ tail 𝑡 Relational paths module connecting path

    The Proposed Method: PathCon 20
  21. o Aggregates neighbor nodes information: o Updates node information: o

    Does not work well on KGs because: o In most KGs, edges have features (relation types) but nodes don’t o Making use of node identities fails in the inductive setting o The number of nodes is much larger than the number of relation types Node-based Message Passing 21
  22. o Aggregates neighbor edge information: o Updates edge information: o

    Avoids the drawbacks of node-based message passing, but introduces a new issue of computational efficiency Relational (Edge-Based) Message Passing 22
  23. Consider a graph with 𝑁 nodes and 𝑀 edges o

    Complexity of node-based message passing in each iteration: 𝟐𝑵 + 𝟐𝑴 o Complexity of relational message passing: 𝑵 ⋅ 𝐕𝐚𝐫(𝒅) + 𝟒𝑴²/𝑵, where Var(𝑑) is the variance of node degrees Message Passing Complexity 23
  24. o Aggregates neighbor edge information to nodes: o Aggregates neighbor

    node information to edges: o Updates edge information: Alternate Message Passing 24
  25. Consider a graph with 𝑁 nodes and 𝑀 edges o

    Complexity of node-based message passing in each iteration: 𝟐𝑵 + 𝟐𝑴 o Complexity of relational message passing: 𝑵 ⋅ 𝐕𝐚𝐫(𝒅) + 𝟒𝑴²/𝑵, where Var(𝑑) is the variance of node degrees o Complexity of alternate relational message passing: 𝟔𝑴 Message Passing Complexity 25
  26. 𝑠ₑⁱ: the hidden state of edge 𝑒 in iteration

    𝑖 (𝑠ₑ⁰ is 𝑒’s initial feature) 𝑚ᵥⁱ: the message stored at node 𝑣 in iteration 𝑖 Making use of relational context: o Final messages of (ℎ, 𝑡): 𝒎𝒉 and 𝒎𝒕 from the final iteration, where 𝐾 is the number of message-passing iterations o Message passing in each iteration: Relational Context 26
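
A rough NumPy sketch of the alternating scheme described on slides 24–26 (edge states → node messages → updated edge states); the update rule and all names here are assumptions for illustration, not PathCon's exact equations:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def alternate_message_passing(edges, edge_feats, num_nodes, weights):
    """Alternate relational message passing (illustrative sketch).
    edges: list of (u, v) endpoint pairs; edge_feats: (M, d) initial edge
    features, e.g. one-hot relation types; weights: one (2d, d) matrix per
    iteration. Returns the final node messages m and edge states s."""
    s = edge_feats.astype(float)                   # s_e^0: initial edge states
    d = s.shape[1]
    for W in weights:
        # (1) aggregate neighbor-edge states into node messages m_v^i
        m = np.zeros((num_nodes, d))
        for (u, v), s_e in zip(edges, s):
            m[u] += s_e
            m[v] += s_e
        # (2) pass node messages back to edges and update each edge state
        s = np.stack([relu(np.concatenate([m[u] + m[v], s_e]) @ W)
                      for (u, v), s_e in zip(edges, s)])
    return m, s

# Toy KG: 4 entities, 3 edges, 2 relation types (one-hot edge features).
edges = [(0, 1), (1, 2), (2, 3)]
edge_feats = np.array([[1, 0], [0, 1], [1, 0]])
weights = [np.random.randn(4, 2) for _ in range(2)]   # K = 2 iterations
m, s = alternate_message_passing(edges, edge_feats, 4, weights)
```
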
  27. A raw path from ℎ to 𝑡: Making use of

    relational paths: The corresponding relational path: o Enumerate all relational paths with length ≤ 𝐿 o Assign an independent embedding vector 𝑠_𝑃 for each relational path 𝑃 Relational Paths 27
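
Enumerating the relational paths between a head and a tail entity can be done with a bounded depth-first search; a minimal sketch (treating KG edges as undirected for the search, which is an assumption here):

```python
from collections import defaultdict

def relational_paths(triples, h, t, max_len=2):
    """Enumerate the relation-type sequences of all paths of length <= max_len
    from head entity h to tail entity t. triples: iterable of (head, rel, tail)."""
    adj = defaultdict(list)
    for a, r, b in triples:
        adj[a].append((r, b))
        adj[b].append((r, a))   # follow edges in both directions (assumption)

    paths = []
    def dfs(node, rel_path, visited):
        if node == t and rel_path:
            paths.append(tuple(rel_path))
        if len(rel_path) == max_len:
            return
        for r, nxt in adj[node]:
            if nxt not in visited:
                dfs(nxt, rel_path + [r], visited | {nxt})
    dfs(h, [], {h})
    return paths

# Each distinct relational path P would then be assigned its own embedding s_P.
triples = [("Alice", "graduated_from", "SJTU"), ("SJTU", "has_alumni", "Bob")]
print(relational_paths(triples, "Alice", "Bob"))   # [('graduated_from', 'has_alumni')]
```
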
  28. o Combine the final messages 𝑚ₕ and 𝑚ₜ together

    to get the context information of (ℎ, 𝑡): o Aggregate the information of all paths from ℎ to 𝑡 with attention: o Make the prediction by combining the above two: o Train the model: Combining Relational Context and Paths 28
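
One way the two signals could be combined, roughly following the slide (the exact projection and attention layers here are assumptions, not PathCon's published equations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_relation(m_h, m_t, path_embs, W_ctx, W_out):
    """m_h, m_t: final messages of head/tail, shape (d,); path_embs: (P, d)
    embeddings of the relational paths from h to t; W_ctx: (2d, d), W_out: (d, R).
    Returns a distribution over the R relation types."""
    ctx = np.tanh(np.concatenate([m_h, m_t]) @ W_ctx)   # context of (h, t)
    alpha = softmax(path_embs @ ctx)                     # attention over paths
    path = alpha @ path_embs                             # aggregated path information
    return softmax((ctx + path) @ W_out)                 # p(r | h, t)
```

Training would then minimize the cross-entropy between this distribution and the true relation type.
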
  29. Datasets Our proposed new dataset Table 1: Statistics of all

    datasets. Experiments 29
  30. Baselines Embedding-based models o TransE o ComplEx o DistMult o

    RotatE o SimplE o QuatE Path-based models o DRUM Ablation studies o Path o Con Table 2: Number of model parameters on DDB14. Experiments 30
  31. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. Hit@1 gain: 0.2% 0.6% 0.9% 16.7% 6.3% 1.8% Experiments 31
  32. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. The performance variance is very small Experiments 32
  33. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. The performance of Path and Con is already quite good Experiments 33
  34. Inductiveness fully transductive → fully inductive random guessing 0.954 →

    0.922 Figure 1: Hit@1 results on WN18RR. Experiments 34
  35. Explainability Indices of all relations in DDB14. Experiments 35

  36. Explainability Figure 4: The learned correlation between all relation paths

    with length ≤ 2 and the predicted relations on DDB14. (a, is associated with, b) ∧ (b, is associated with, c) ⟹ (a, is associated with, c) (a, may be allelic with, b) ∧ (b, may be allelic with, c) ⟹ (a, may be allelic with, c) Experiments 36
  37. Explainability Figure 4: The learned correlation between all relation paths

    with length ≤ 2 and the predicted relations on DDB14. (a, belong(s) to the category of, b) ⟺ (a, is a subtype of, b) (a, is a risk factor for, b) ⟹ (a, may cause, b) (a, may cause, c) ∧ (b, may cause, c) ⟹ (a, may be allelic with, b) Experiments 37
  38. Recommender Systems Movie Recommender systems (RS) intend to address the

    information explosion by finding a small set of items that meet users’ personalized interests 38
  39. Recommender Systems Book Recommender systems (RS) intend to address the

    information explosion by finding a small set of items that meet users’ personalized interests 39
  40. Recommender Systems Trip Recommender systems (RS) intend to address the

    information explosion by finding a small set of items that meet users’ personalized interests 40
  41. Recommender Systems 41 QA Short video Music

  42. Rating/CTR Prediction 42

    [Figure: a user–item rating matrix with missing entries (explicit feedback → rating prediction) and a 0/1 matrix with missing entries (implicit feedback → click-through rate (CTR) prediction)]
  43. Collaborative Filtering 43

    [Figure: a 4×4 user–item rating matrix for users u1–u4 and items i1–i4; u1, u2, u3 have similarity 0.7, 0.1, 0.2 with u4] The missing rating of u4 is predicted as the similarity-weighted average ? = 0.7×2 + 0.1×3 + 0.2×2 = 2.1
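
The weighted average on this slide, as a tiny sketch:

```python
import numpy as np

def predict_rating(sims, ratings):
    """User-based CF: predict a missing rating as the similarity-weighted
    average of the other users' ratings on the same item."""
    sims, ratings = np.asarray(sims, float), np.asarray(ratings, float)
    return (sims * ratings).sum() / sims.sum()

# Numbers from the slide: similarities with u4 are 0.7, 0.1, 0.2 and the
# other users rated the item 2, 3, 2.
print(predict_rating([0.7, 0.1, 0.2], [2, 3, 2]))   # -> 2.1
```
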
  44. CF Cannot Handle... 44

    o Sparsity of user-item interactions o Cold start problem [Figure: a sparse rating matrix (sparsity) and a new item with no ratings (cold start)]
  45. CF + Side Information Social networks User/item attributes Alice Female

    California … Multimedia (images, texts, videos, audios ...) Contexts purchase Time: 20:10 Location: Beijing What else in the cart:… iPhone X 2017 5.8 inch $999 … 45
  46. Why Using KGs in Recommender Systems? 46 Cast Away The

    Green Mile Tom Hanks Robert Zemeckis Adventure Back to the Future Steven Spielberg genre starred genre Forrest Gump Raiders of the Lost Ark Interstellar include include star starred directed direct direct style collaborate items (movies) non-item entities a user watched items (movies)
  47. 47 Boris Johnson Donald Trump Iran Nuclear Congress EMP ……

    Politician Weapon United States North Korea North Korean EMP Attack Would Cause Mass U.S. Starvation, Says Congressional Report News the user may also like Boris Johnson Has Warned Donald Trump To Stick To The Iran Nuclear Deal News the user has read Why Using KGs in Recommender Systems?
  48. 48 Users Items User engagement labels 𝑦ᵤᵥ ∈ {0,1} ……

    Non-item entities Knowledge graph 𝒢 Goal: Learn the predicted engagement probability ŷᵤᵥ Problem Formulation
  49. KG-Enhanced Recommender Systems Embedding-based methods 49 Knowledge Graphs Recommender systems

    Entity embeddings Relation embeddings User embeddings Item embeddings Model
  50. KG-Enhanced Recommender Systems Structure-based methods 50 User-item interactions Knowledge graphs

    Model Structure information
  51. Deep Knowledge-Aware Network for News Recommendation The 2018 Web Conference

    (WWW 2018) Hongwei Wang, Fuzheng Zhang, Xing Xie, Minyi Guo Shanghai Jiao Tong University Embedding-Based Method 51
  52. 52 Trump praises Las Vegas medical team Apple CEO Tim

    Cook: iPhone 8 and Apple Watch Series 3 are sold out in some places EU Spain: Juncker does not want Catalonian independence …… Donald Trump: Donald Trump is the 45th president … Las Vegas: Las Vegas is the 28th-most populated city … Apple Inc.: Apple Inc. is an American multinational … CEO: A chief executive officer is the position of the … Tim Cook: Timothy Cook is an American business … iPhone 8: iPhone 8 is smartphone designed, … Entity linking …… Knowledge subgraph construction Knowledge graph embedding Donald Trump: (0.32, 0.48) Las Vegas: (0.71, -0.49) Apple Inc.: (-0.48, -0.41) CEO: (-0.57, 0.06) Tim Cook: (-0.61, -0.59) iPhone 8: (-0.46, -0.75) Entity embeddings Knowledge Distillation
  53. 53 Context of entities “Fight Club” Context Embedding

  54. 54 𝑤₁:ₙ = [Donald Trump praises Las Vegas medical team]

    𝒅×𝒏 word embedding matrix Sentence Feature maps Max pooling Sentence representation Convolution Kim CNN
  55. 55 𝒅×𝒏 word embeddings 𝒅×𝒏 entity embeddings 𝒅×𝒏 context embeddings

    CNN layer pooling multiple channels Knowledge-Aware CNN
  56. 56 User’s clicked news Candidate news Attention Net KCNN KCNN

    KCNN KCNN concat. User embedding Candidate news embedding Click probability element-wise + element-wise × DKN Attention net: User interest extraction: CTR prediction: Attention-Based User Interest Extraction
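
A hedged sketch of the attention-based user interest extraction described on this slide; a plain dot-product attention and a sigmoid stand in for DKN's attention network and CTR predictor (both stand-ins are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dkn_click_probability(clicked_news_embs, candidate_emb):
    """clicked_news_embs: (n, d) KCNN embeddings of the user's clicked news;
    candidate_emb: (d,) KCNN embedding of the candidate news."""
    attn = softmax(clicked_news_embs @ candidate_emb)   # weight per clicked news
    user_emb = attn @ clicked_news_embs                 # user interest embedding
    score = user_emb @ candidate_emb
    return 1.0 / (1.0 + np.exp(-score))                 # click probability
```
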
  57. o Dataset: Bing News o (timestamp, user_id, news_url, news_title, click_label)

    o Training set: October 16, 2016 ~ June 11, 2017 o Test set: June 12, 2017 ~ August 11, 2017 o Knowledge graph: Microsoft Satori Dataset 57 Table: Dataset statistics.
  58. Experimental Results 58 Table: F1 and AUC scores of DKN

    and baselines. Comparison with baselines
  59. Knowledge Graph Convolutional Networks for Recommender Systems The 2019 Web

    Conference (WWW 2019) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, Minyi Guo Shanghai Jiao Tong University Structure-Based Method 59 Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems The 25th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2019) Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, Zhongyuan Wang Stanford University
  60. Relation Scoring Function 60 o No explicit weights for edges

    (relations) in a KG o Transforming a KG to a weighted graph by introducing a trainable and personalized relation scoring function 𝑠ᵤ(𝑟) o 𝑢: a user; 𝑟: a type of relation o 𝑠ᵤ(𝑟) identifies important relations for a given user o E.g., 𝑠ᵤ(𝑟) = 𝐮ᵀ𝐫 Knowledge graph 𝒢 Adjacency matrix 𝐴ᵤ
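
A minimal sketch of how the scoring function 𝑠ᵤ(𝑟) = 𝐮ᵀ𝐫 could turn a KG into a user-specific weighted adjacency matrix 𝐴ᵤ (the entity indexing and the symmetric treatment of edges are assumptions here):

```python
import numpy as np

def user_adjacency(triples, entity_index, rel_embs, user_emb, num_entities):
    """triples: (head, relation_id, tail); entity_index: entity name -> row index;
    rel_embs: relation_id -> relation embedding r; user_emb: the user's embedding u."""
    A_u = np.zeros((num_entities, num_entities))
    for h, r, t in triples:
        w = user_emb @ rel_embs[r]              # s_u(r) = u^T r
        i, j = entity_index[h], entity_index[t]
        A_u[i, j] = A_u[j, i] = w               # weighted (symmetric) edge
    return A_u
```
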
  61. Knowledge Graph Convolutional Networks 61 o Layer-wise forward propagation: Adjacency

    matrix of the KG for a particular user 𝑢
  62. 62 o Layer-wise forward propagation: Diagonal degree matrix of 𝐴ᵤ

    Knowledge Graph Convolutional Networks
  63. 63 o Layer-wise forward propagation: Trainable transformation matrix Knowledge Graph

    Convolutional Networks
  64. 64 o Layer-wise forward propagation: Entity embedding matrix Knowledge Graph

    Convolutional Networks
  65. 65 o Layer-wise forward propagation: 𝐻ₗ₊₁ = 𝜎(𝐷ᵤ^(−1/2) 𝐴ᵤ 𝐷ᵤ^(−1/2) 𝐻ₗ 𝑊ₗ) Knowledge

    Graph Convolutional Networks
  66. 66 o Layer-wise feature propagation: Knowledge Graph Convolutional Networks
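
Putting the pieces from slides 61–66 together, a hedged NumPy sketch of one propagation layer 𝐻ₗ₊₁ = 𝜎(𝐷ᵤ^(−1/2) 𝐴ᵤ 𝐷ᵤ^(−1/2) 𝐻ₗ 𝑊ₗ); the added self-loops and the tanh nonlinearity are assumptions for illustration:

```python
import numpy as np

def kgnn_layer(A_u, H, W):
    """A_u: user-specific weighted adjacency matrix; H: entity embedding matrix;
    W: trainable transformation matrix; D_u: diagonal degree matrix of A_u."""
    A_hat = A_u + np.eye(A_u.shape[0])                   # self-loops (assumption)
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.tanh(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```
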

  67. Predicting Engagement Probability 67 User embeddings

  68. Predicting Engagement Probability 68 User embeddings Entity (item) embeddings from

    the last KGNN layer
  69. Predicting Engagement Probability 69 User embeddings Entity (item) embeddings from

    the last KGNN layer Inner product, MLP, etc.
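
The simplest of the options on this slide, as a short sketch:

```python
import numpy as np

def predict_engagement(user_emb, item_emb):
    """Inner product of the user embedding and the item's entity embedding
    from the last KGNN layer, squashed to a probability with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(user_emb @ item_emb)))
```
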
  70. In Traditional GNNs… 70 Trainable Fixed

  71. But in KGCN… 71 How to solve the problem of

    overfitting? Trainable Trainable
  72. User Engagement Labels 72 0 1 1 User-engagement labels 𝑦ᵤᵥ

    for a particular user Negative items Positive items Non-item entities (unlabeled) 0
  73. 73 0 1 1 User-engagement labels 𝑦ᵤᵥ for a particular

    user ? How can we get the label for an unlabeled node? User Engagement Labels 0 Negative items Positive items Non-item entities (unlabeled)
  74. 74 0 1 1 ? weighted average o For a

    given node, take the weighted average of its neighborhood labels as its own label Label Propagation Algorithm (LPA) 0 How can we get the label for an unlabeled node?
  75. 75 0 1 1 ? o For a given node,

    take the weighted average of its neighborhood labels as its own label o Repeat the first step for every unlabeled node until convergence Label Propagation Algorithm (LPA) 0 How can we get the label for an unlabeled node?
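
A minimal sketch of the label propagation algorithm described above, run on the user-specific weighted graph (the fixed iteration count and zero-initialized unlabeled entries are assumptions):

```python
import numpy as np

def label_propagation(A_u, labels, labeled_mask, num_iters=20):
    """A_u: (N, N) weighted adjacency matrix; labels: (N,) array with known
    engagement labels (unlabeled entries can start at 0); labeled_mask: (N,)
    boolean array marking which labels are known."""
    y = labels.astype(float).copy()
    P = A_u / np.maximum(A_u.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    for _ in range(num_iters):
        y = P @ y                               # weighted average of neighbor labels
        y[labeled_mask] = labels[labeled_mask]  # clamp the known labels
    return y
```
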
  76. Label Smoothness Assumption 76 o LPA minimizes the following objective:

    𝐸 = ½ Σ_{(𝑖,𝑗)∈ℰ} 𝐴ᵤ[𝑖, 𝑗] (ŷᵤᵢ − ŷᵤⱼ)², where ŷ is the label predicted by LPA o Adjacent entities in the KG are more likely to have similar labels
  77. Label Smoothness Regularization 77 Hold out the label of 𝑣

    0 1 1 0 0 𝒗
  78. 78 0 1 1 0 𝒗 Hold out the label

    of 𝑣 Label Smoothness Regularization 0
  79. 79 Predict the label of 𝑣 by label propagation algorithm

    0 1 1 𝒗 Label Smoothness Regularization 0
  80. 80 0 1 1 𝒗 Label Smoothness Regularization 0 True

    label of 𝑣: 𝑦ᵤᵥ Predicted label of 𝑣: ŷᵤᵥ Cross-entropy loss 𝐽(𝑦ᵤᵥ, ŷᵤᵥ)
  81. Label Smoothness Regularization 81 𝑅(𝐴) = Σᵤ 𝑅(𝐴ᵤ) = Σᵤ Σᵥ 𝐽(𝑦ᵤᵥ, ŷᵤᵥ),

    where ŷᵤᵥ = 𝐿𝑃𝐴(𝑌∖{𝑦ᵤᵥ}; 𝐴ᵤ)
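
A hedged sketch of this regularizer: hold out each known label in turn, re-predict it with LPA, and accumulate the cross-entropy (the `lpa` argument is a routine like the label-propagation sketch above):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def ls_regularization(A_u, labels, labeled_mask, lpa):
    """R(A_u) = sum over labeled items v of J(y_uv, y_hat_uv), where y_hat_uv
    is the LPA prediction computed with y_uv held out."""
    R = 0.0
    for v in np.where(labeled_mask)[0]:
        mask = labeled_mask.copy()
        mask[v] = False                      # hold out the label of v
        y_hat = lpa(A_u, labels, mask)[v]    # LPA(Y \ {y_uv}; A_u)
        R += cross_entropy(labels[v], y_hat)
    return R
```
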
  82. The Unified Model: KGNN-LS 82 Step 1: learn edge weights, turning the original KG into an adjacency matrix

    Step 2: a GNN over it produces entity (item) embeddings, combined with user embeddings Step 3: predict ŷᵤᵥ (predicted labels by GNN) Step 4: label propagation produces another prediction (predicted labels by LPA) Given the ground truth 𝑦ᵤᵥ, loss(ŷ_GNN, 𝑦) updates W and A, while loss(ŷ_LPA, 𝑦) updates A
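
The two losses on this slide then combine into one training objective; a sketch (the weighting factor `lam` is an assumption):

```python
def kgnn_ls_objective(y, y_gnn, y_lpa, ce, lam=1.0):
    """y: ground-truth engagement labels; y_gnn: predictions from the GNN
    (updates W and A); y_lpa: predictions from label propagation (updates A
    only); ce: a cross-entropy loss function."""
    return ce(y, y_gnn) + lam * ce(y, y_lpa)
```
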
  83. Click-through Rate Prediction 5.1% 6.9% 8.3% 4.3% Average improvements in

    AUC 83
  84. LS Regularization 84 without LS regularization with LS regularization o

    Dataset: Last.FM
  85. Cold Start Scenario 85 o Dataset: MovieLens-20M o Varying the

    size of training set from 𝑟 = 100% to 𝑟 = 20% AUC decreases by 8.4% 5.9% 5.4% 3.6% 2.8% 4.1% 1.8% More sparse
  86. Comparison o Performance o KGNN-LS (Aug 2019) > KGCN (May

    2019) > DKN (Apr 2018) o Scalability o Embedding-based methods > structure-based methods o User-item interactions change with time, but KGs don’t o Knowledge graph embeddings can be reused o Explainability o Structure-based methods > embedding-based methods o Graph structures are more intuitive than embeddings 86
  87. Take-Aways o Graph representation learning is a fundamental step in

    graph-related tasks o Graph neural networks are a special type of GRL method o Knowledge graphs are a special type of graph o Knowledge graph completion o PathCon: combining context and path information o Knowledge-graph-aware recommendation o DKN for news recommendation o KGCN/KGNN-LS for aggregating neighboring entity information on KGs using GNNs 87
  88. Q & A More information is available at: https://hongweiw.net All

    the source code is available at: https://github.com/hwwang55 Thanks! 88