
Graph Representation Learning: From Knowledge Graphs To Recommender Systems

Graphs are ubiquitous in the real world. To help machine learning algorithms make use of graph-structured data, researchers have proposed graph representation learning (GRL) methods, which learn a low-dimensional real-valued vector for each node in a graph. In this talk, I will first briefly introduce graph representation learning, graph neural networks (GNNs, a special type of GRL method), and knowledge graphs (KGs, a special type of graph). The talk then consists of two parts: (1) Knowledge graph completion. I will introduce PathCon, a GNN-based method that combines relational context and relational path information to predict the relation type of an edge in a KG. (2) Knowledge-graph-aware recommendation. Knowledge graphs provide additional item-item relationships and thus alleviate the cold-start problem in recommender systems. I will introduce three KG-aware recommendation algorithms: an embedding-based method, DKN, and two structure-based methods, KGCN and KGNN-LS.

wing.nus

October 01, 2021

Transcript

  1. Graph Representation Learning: From Knowledge Graphs To Recommender Systems Hongwei

    Wang University of Illinois Urbana-Champaign Sep 28, 2021
  2. A Short Bio 2 o Education o B.E., Computer Science

    Shanghai Jiao Tong University, 2010-2014 o Ph.D., Computer Science Shanghai Jiao Tong University, 2014-2018 o Postdoc, Computer Science Stanford University, 2019-2021 o Postdoc, Computer Science University of Illinois Urbana-Champaign, 2021- o Awards o 2018 Google Ph.D. Fellowship o 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation o Research Interests o Graph neural networks, knowledge graphs, recommender systems
  3. Content 3 o Graph representation learning o Graph neural networks

    o Knowledge graphs o Knowledge graph completion o Embedding-based methods o Knowledge-graph-aware recommendation o Embedding-based methods: DKN o Structure-based methods: KGCN and KGNN-LS
  4. Graphs are Ubiquitous 4 A graph is a structure amounting

    to a set of objects in which some pairs of objects are in some sense “related” Molecule Protein Protein-protein interaction Synthetic routes Social networks Knowledge graphs Navigation map Flight routes Atom-level Molecule-level Human-level World-level
  5. Representing a Graph 5 𝐺 = (𝑉, 𝐸)

    [Figure: an example graph 𝐺 and its 0/1 adjacency matrix 𝐴, where 𝐴[𝑖, 𝑗] = 1 iff nodes 𝑖 and 𝑗 are connected]
  6. Representing a Graph 6 When the graph is very large…

    o Storage inefficiency: 𝑂(𝑁²) o Hard to compute node similarity
  7. Graph Representation Learning 7 Node embeddings in ℝᵈ (𝑑 ≪

    #nodes) Graph 𝐺 Nodes Edges Subgraphs Graphs Points (embeddings) in low-dimensional space ℝᵈ Graph representation learning (GRL) Structural information Semantic information
  8. Downstream Tasks of GRL Link Prediction 8 ? 𝑣₁ 𝑣₂

    Learning a mapping 𝑓: [𝐞ᵥ₁, 𝐞ᵥ₂] ↦ {0,1} o Are the two users friends in a social network? o Is there a flight between the two airports? ……
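As an illustration of such a mapping 𝑓 (a minimal sketch, not from the talk; the embeddings and the 0.5 threshold are hypothetical):

```python
import numpy as np

def predict_link(e_u: np.ndarray, e_v: np.ndarray, threshold: float = 0.5) -> int:
    """Score a candidate edge as the sigmoid of the embeddings' dot product."""
    score = 1.0 / (1.0 + np.exp(-(e_u @ e_v)))
    return int(score > threshold)

# Hypothetical 4-dimensional node embeddings for v1 and v2
e_v1 = np.array([0.2, -0.1, 0.7, 0.3])
e_v2 = np.array([0.1, 0.0, 0.9, -0.2])
print(predict_link(e_v1, e_v2))  # -> 1: predict that the edge exists
```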
  9. Downstream Tasks of GRL Node Classification 9 label? 𝑣 Learning

    a mapping 𝑓: 𝐞ᵥ ↦ set of node labels o Is a user male or female in a social network? o What research field does a paper belong to in a citation network? ……
  10. Downstream Tasks of GRL Graph Classification 10 Learning a mapping

    𝑓: 𝐞_graph ↦ set of graph labels toxic nontoxic Toxic or nontoxic?
  11. Graph Neural Networks (GNNs) 11 𝑥ᵢ: initial feature of

    node 𝑣ᵢ [Figure: a small graph with node features 𝑥₁, …, 𝑥₉] GNNs follow a neighborhood aggregation strategy: for each node 𝑣ᵢ: ℎᵢ⁽⁰⁾ = 𝑥ᵢ; for 𝑘 = 1, …, 𝐾: for each node 𝑣ᵢ: ℎᵢ⁽ᵏ⁾ = AGGREGATE(ℎᵢ⁽ᵏ⁻¹⁾, {ℎⱼ⁽ᵏ⁻¹⁾ : 𝑣ⱼ ∈ 𝒩(𝑖)}); return ℎᵢ⁽ᴷ⁾ for each node 𝑣ᵢ
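A minimal Python sketch of this aggregation loop (the mean aggregator and the 0.5/0.5 update are illustrative choices, not the talk's):

```python
import numpy as np

def gnn_forward(x, neighbors, K):
    """x: dict node -> feature vector; neighbors: dict node -> list of nodes."""
    h = {v: x[v] for v in x}                   # h_v^(0) = x_v
    for _ in range(K):                         # k = 1, ..., K
        h_new = {}
        for v in h:
            if neighbors[v]:
                agg = np.mean([h[u] for u in neighbors[v]], axis=0)
            else:
                agg = h[v]                     # isolated node: keep own state
            h_new[v] = 0.5 * h[v] + 0.5 * agg  # toy AGGREGATE/UPDATE step
        h = h_new
    return h                                   # h_v^(K) for each node v
```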
  12. Graph Neural Networks (GNNs) 12 AGGREGATE function in GCN:

    ℎᵢ⁽ᵏ⁾ = 𝜎(𝑊ₖ Σⱼ∈𝒩(ᵢ) 𝛼ᵢⱼ ℎⱼ⁽ᵏ⁻¹⁾), where 𝑊ₖ is a learnable transformation matrix for layer 𝑘, 𝛼ᵢⱼ = 1/|𝒩(𝑖)| is a normalization factor, and 𝜎 is an activation function such as ReLU(𝑥) = max(𝑥, 0). Graph Convolutional Network (GCN). Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." The 5th International Conference on Learning Representations (2017).
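A one-layer GCN sketch under the slide's normalization 𝛼ᵢⱼ = 1/|𝒩(𝑖)| (self-loops added for stability; the array names are illustrative):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1 (A + I) H W).
    A: (N, N) adjacency matrix; H: (N, d_in) features; W: (d_in, d_out)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)   # row degrees |N(i)|
    H_agg = (A_hat / deg) @ H                # alpha_ij = 1/|N(i)| aggregation
    return np.maximum(H_agg @ W, 0.0)        # ReLU activation
```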
  13. Relational Message Passing for Knowledge Graph Completion The 27th SIGKDD

    Conference on Knowledge Discovery and Data Mining (KDD 2021) Hongwei Wang, Hongyu Ren, Jure Leskovec Stanford University GRL in Knowledge Graphs 13
  14. o Knowledge graphs (KGs) store structured information about real-world entities

    and facts as a set of triples 𝐺 = {(ℎ, 𝑟, 𝑡)}: head entity ℎ, relation 𝑟, tail entity 𝑡 [Figure: an example movie KG with entities Cast Away, The Green Mile, Back to the Future, Tom Hanks, Robert Zemeckis, Steven Spielberg, Adventure and relations starred, directed, collaborate, genre, style] Knowledge Graphs 14
  15. o Knowledge graphs are usually incomplete and noisy o KG

    completion: given (ℎ, ?, 𝑡), predict 𝑟 o Modeling the distribution over relation types: 𝑝(𝑟 | ℎ, 𝑡) [Figure: the example movie KG with one relation missing] Knowledge Graph Completion 15
  16. o Relational context (neighbor edges of a given edge) [Figure: the edge "graduated

    from" connects a person and a school; plausible neighboring relations include person.birthplace, person.gender, institution.location, university.founder, university.president — but not movie.language] Relations are Correlated… 16
  17. o Relational paths (paths connecting the two endpoints of a

    given edge) [Figure: the relation "graduated from" between a person and a school is supported by paths built from relations such as graduated from, has alumni, and schoolmate of] Relations are Correlated… 17
  18. [Figure: PathCon overview — to predict the relation 𝑟? between head ℎ and tail 𝑡, a

    relational context module performs relational message passing over the edge's first-order and second-order neighbor edges] The Proposed Method: PathCon 19
  19. o Aggregates neighbor node information: o Updates node information: (both

    sketched below) o Does not work well on KGs because: o In most KGs, edges have features (relation types) but nodes don't o Making use of node identity fails in the inductive setting o The number of nodes is much larger than the number of relation types Node-based Message Passing 21
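The two elided formulas are standard node-based message passing; a sketch in the usual notation (the slide's exact equations were images, so this is a reconstruction):

```latex
m_v^{(i)}   = \mathrm{AGGREGATE}\big(\{\, h_u^{(i)} : u \in \mathcal{N}(v) \,\}\big)
h_v^{(i+1)} = \mathrm{UPDATE}\big(h_v^{(i)},\, m_v^{(i)}\big)
```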
  20. o Aggregates neighbor edge information: o Updates edge information: o

    Avoids the drawbacks of node-based message passing, but introduces a new issue of computational efficiency Relational (Edge-Based) Message Passing 22
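Correspondingly, a sketch of the elided edge-based rules, where 𝒩(𝑒) denotes the edges sharing an endpoint with 𝑒 (again a reconstruction of formulas that were images):

```latex
m_e^{(i)}   = \mathrm{AGGREGATE}\big(\{\, s_{e'}^{(i)} : e' \in \mathcal{N}(e) \,\}\big)
s_e^{(i+1)} = \mathrm{UPDATE}\big(s_e^{(i)},\, m_e^{(i)}\big)
```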
  21. Consider a graph with 𝑁 nodes and 𝑀 edges o

    Complexity of node-based message passing in each iteration: 𝟐𝑵 + 𝟐𝑴 o Complexity of relational message passing: 𝑵 ⋅ 𝐕𝐚𝐫(𝒅) + 𝟒𝑴²/𝑵, where Var(𝑑) is the variance of node degrees Message Passing Complexity 23
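The relational cost follows from summing squared node degrees, since each edge aggregates over all edges adjacent to its endpoints; with mean degree 𝑑̄ = 2𝑀/𝑁:

```latex
\sum_{v} d_v^2 \;=\; N\,\mathbb{E}[d^2] \;=\; N\big(\mathrm{Var}(d) + \bar{d}^{\,2}\big)
\;=\; N\,\mathrm{Var}(d) + \frac{4M^2}{N}
```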
  22. o Aggregates neighbor edge information to nodes: o Aggregates neighbor

    node information to edges: o Updates edge information: (both steps sketched below) Alternate Message Passing 24
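A sketch of the alternate scheme in the form PathCon uses, with 𝑢 and 𝑣 the endpoints of edge 𝑒 (my reconstruction of the image formulas; the paper's exact parameterization may differ):

```latex
m_v^{(i)}   = \sum_{e \in \mathcal{N}(v)} s_e^{(i)}                                            % (1) edges -> nodes
s_e^{(i+1)} = \sigma\big(\,[\, m_v^{(i)};\, m_u^{(i)};\, s_e^{(i)} \,]\, W^{(i)} + b^{(i)}\big) % (2) nodes -> edges
```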
  23. Consider a graph with 𝑁 nodes and 𝑀 edges o

    Complexity of node-based message passing in each iteration: 𝟐𝑵 + 𝟐𝑴 o Complexity of relational message passing: 𝑵 ⋅ 𝐕𝐚𝐫(𝒅) + 𝟒𝑴²/𝑵, where Var(𝑑) is the variance of node degrees o Complexity of alternate relational message passing: 𝟔𝑴 Message Passing Complexity 25
  24. 𝑠ₑⁱ: the hidden state of edge 𝑒 in iteration

    𝑖 (𝑠ₑ⁰ is 𝑒's initial feature) 𝑚ᵥⁱ: the message stored at node 𝑣 in iteration 𝑖 Making use of relational context: o Final messages of (ℎ, 𝑡): 𝑚ₕᴷ⁻¹ and 𝑚ₜᴷ⁻¹, where 𝐾 is the number of message-passing iterations o Message passing in each iteration: as in the alternate message passing sketched above Relational Context 26
  25. A raw path from ℎ to 𝑡: Making use of

    relational paths: The corresponding relational path: o Enumerate all relational paths with length ≤ 𝐿 o Assign an independent embedding vector 𝑠ₚ to each relational path 𝑃 Relational Paths 27
  26. o Combine the final messages 𝑚ₕᴷ⁻¹ and 𝑚ₜᴷ⁻¹ together

    to get the context information of (ℎ, 𝑡): o Aggregate the information of all paths from ℎ to 𝑡 with attention: o Make the prediction by combining the above two: o Train the model: (see the sketch below) Combining Relational Context and Paths 28
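A sketch of how the two signals can be combined, consistent with PathCon's design but with the formulas reconstructed rather than copied (the slide's equations were images):

```latex
e_{(h,t)}      = \sigma\big(\,[\, m_h^{K-1};\, m_t^{K-1} \,]\, W + b\big)                 % context of (h, t)
\alpha_P       = \mathrm{softmax}_P\big(s_P^{\top} e_{(h,t)}\big)                         % attention over paths
p(r \mid h, t) = \mathrm{softmax}\big(e_{(h,t)} + \textstyle\sum_P \alpha_P\, s_P\big)    % prediction
```

Training (the last bullet) would then minimize the cross-entropy between p(r | h, t) and the observed relation.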
  27. Baselines Embedding-based models o TransE o ComplEx o DistMult o

    RotatE o SimplE o QuatE Path-based models o DRUM Ablation studies o Path (relational paths only) o Con (relational context only) Table 2: Number of model parameters on DDB14. Experiments 30
  28. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. Hit@1 gain: 0.2% 0.6% 0.9% 16.7% 6.3% 1.8% Experiments 31
  29. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. The performance variance is very small Experiments 32
  30. Comparison with baselines Table 3: MRR and Hit@1 results on

    all datasets. The performance of Path and Con is already quite good Experiments 33
  31. Inductiveness fully transductive → fully inductive: Hit@1 drops only from 0.954 →

    0.922, while embedding-based baselines fall to the level of random guessing Figure 1: Hit@1 results on WN18RR. Experiments 34
  32. Explainability Figure 4: The learned correlation between all relation paths

    with length ≤ 2 and the predicted relations on DDB14. (a, is associated with, b) ∧ (b, is associated with, c) ⟹ (a, is associated with, c) (a, may be allelic with, b) ∧ (b, may be allelic with, c) ⟹ (a, may be allelic with, c) Experiments 36
  33. Explainability Figure 4: The learned correlation between all relation paths

    with length ≤ 2 and the predicted relations on DDB14. (a, belong(s) to the category of, b) ⟺ (a, is a subtype of, b) (a, is a risk factor for, b) ⟹ (a, may cause, b) (a, may cause, c) ∧ (b, may cause, c) ⟹ (a, may be allelic with, b) Experiments 37
  34. Recommender Systems Movie Recommender systems (RS) intend to address the

    information explosion by finding a small set of items for users to meet their personalized interests 38
  35. Recommender Systems Book Recommender systems (RS) intend to address the

    information explosion by finding a small set of items for users to meet their personalized interests 39
  36. Recommender Systems Trip Recommender systems (RS) intend to address the

    information explosion by finding a small set of items for users to meet their personalized interests 40
  37. Rating/CTR Prediction 42 [Figure: a user-item rating matrix with values

    1-5 and many unknown entries (explicit feedback) for rating prediction, and a 0/1 user-item click matrix with unknown entries (implicit feedback) for click-through rate (CTR) prediction]
  38. Collaborative Filtering 43

         i1  i2  i3  i4  | similarity with u4
    u1:   2   3   3   1  | 0.7
    u2:   3   1   1   4  | 0.1
    u3:   2   5   4   5  | 0.2
    u4:   ?   3   4   1  |
    ? = 0.7×2 + 0.1×3 + 0.2×2 = 2.1
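A minimal sketch of this user-based CF computation (toy data from the slide; the similarity weights are taken as given rather than computed):

```python
import numpy as np

# Ratings from the slide (rows u1..u4, columns i1..i4); NaN = unknown
R = np.array([
    [2, 3, 3, 1],
    [3, 1, 1, 4],
    [2, 5, 4, 5],
    [np.nan, 3, 4, 1],
])
sim_with_u4 = np.array([0.7, 0.1, 0.2])  # similarity of u1..u3 with u4

# Predict u4's rating for i1 as a similarity-weighted average of other users
pred = sim_with_u4 @ R[:3, 0]
print(pred)  # ≈ 2.1, matching the slide
```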
  39. o Sparsity of user-item interactions o Cold start problem CF

    Cannot Handle... [Figure: left, a sparse rating matrix (sparsity); right, a new user/item with a single unknown rating (cold start)] 44
  40. CF + Side Information Social networks User/item attributes Alice Female

    California … Multimedia (images, texts, videos, audios ...) Contexts purchase Time: 20:10 Location: Beijing What else in the cart:… iPhone X 2017 5.8 inch $999 … 45
  41. Why Using KGs in Recommender Systems? 46 [Figure: a KG in which the

    movies a user watched (items) connect through non-item entities to other items: Cast Away, The Green Mile, Back to the Future, Forrest Gump, Raiders of the Lost Ark, Interstellar; Tom Hanks, Robert Zemeckis, Steven Spielberg, Adventure; relations include starred, directed, collaborate, genre, style, include]
  42. 47 News the user has read: "Boris Johnson Has Warned Donald Trump To Stick

    To The Iran Nuclear Deal" News the user may also like: "North Korean EMP Attack Would Cause Mass U.S. Starvation, Says Congressional Report" [Figure: the two titles linked in a KG through entities Boris Johnson, Donald Trump, Iran, Nuclear, Congress, EMP and categories Politician, Weapon, United States, North Korea] Why Using KGs in Recommender Systems?
  43. 48 Users Items User engagement labels 𝑦ᵤᵥ ∈ {0,1} ……

    Non-item entities Knowledge graph 𝒢 Goal: learn the predicted engagement probability ŷᵤᵥ Problem Formulation
  44. KG-Enhanced Recommender Systems Embedding-based methods 49 Knowledge Graphs Recommender systems

    Entity embeddings Relation embeddings User embeddings Item embeddings Model
  45. Deep Knowledge-Aware Network for News Recommendation The 2018 Web Conference

    (WWW 2018) Hongwei Wang, Fuzheng Zhang, Xing Xie, Minyi Guo Shanghai Jiao Tong University Embedding-Based Method 51
  46. 52 Trump praises Las Vegas medical team Apple CEO Tim

    Cook: iPhone 8 and Apple Watch Series 3 are sold out in some places EU Spain: Juncker does not want Catalonian independence …… Donald Trump: Donald Trump is the 45th president … Las Vegas: Las Vegas is the 28th-most populated city … Apple Inc.: Apple Inc. is an American multinational … CEO: A chief executive officer is the position of the … Tim Cook: Timothy Cook is an American business … iPhone 8: iPhone 8 is a smartphone designed … Entity linking …… Knowledge subgraph construction Knowledge graph embedding Donald Trump: (0.32, 0.48) Las Vegas: (0.71, -0.49) Apple Inc.: (-0.48, -0.41) CEO: (-0.57, 0.06) Tim Cook: (-0.61, -0.59) iPhone 8: (-0.46, -0.75) Entity embeddings Knowledge Distillation
  47. 54 𝑤₁:ₙ = [Donald Trump praises Las Vegas medical team]

    𝒅×𝒏 word embedding matrix [Figure: Kim CNN — sentence → convolution → feature maps → max pooling → sentence representation]
  48. 56 User’s clicked news Candidate news Attention Net KCNN KCNN

    KCNN KCNN concat. User embedding Candidate news embedding Click probability element-wise + element-wise × DKN Attention net: User interest extraction: CTR prediction: Attention-Based User Interest Extraction
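A minimal sketch of the attention step (hypothetical shapes and scoring function; DKN's actual attention net is a learned DNN rather than a dot product):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def user_embedding(clicked, candidate):
    """clicked: (n, d) embeddings of clicked news; candidate: (d,) embedding.
    Weighs each clicked news by its relevance to the candidate news."""
    scores = clicked @ candidate          # toy attention: dot-product relevance
    weights = softmax(scores)             # normalize to attention weights
    return weights @ clicked              # weighted sum -> user embedding

clicked = np.random.randn(5, 8)           # 5 clicked news, 8-dim embeddings
candidate = np.random.randn(8)
u = user_embedding(clicked, candidate)
ctr = 1 / (1 + np.exp(-(u @ candidate)))  # toy click-probability head
```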
  49. o Dataset: Bing News o (timestamp, user_id, news_url, news_title, click_label)

    o Training set: October 16, 2016 ~ June 11, 2017 o Test set: June 12, 2017 ~ August 11, 2017 o Knowledge graph: Microsoft Satori Dataset 57 Table: Dataset statistics.
  50. Experimental Results 58 Table: F1 and AUC scores of DKN

    and baselines. Comparison with baselines
  51. Knowledge Graph Convolutional Networks for Recommender Systems The 2019 Web

    Conference (WWW 2019) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, Minyi Guo Shanghai Jiao Tong University Structure-Based Method 59 Knowledge-Aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems The 25th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2019) Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, Zhongyuan Wang Stanford University
  52. Relation Scoring Function 60 o No explicit weights for edges

    (relations) in a KG o Transform a KG into a weighted graph by introducing a trainable and personalized relation scoring function 𝑠ᵤ(𝑟) o 𝑢: a user; 𝑟: a type of relation o 𝑠ᵤ(𝑟) identifies important relations for a given user o E.g., 𝑠ᵤ(𝑟) = 𝐮ᵀ𝐫 Knowledge graph 𝒢 → Adjacency matrix 𝐴ᵤ
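A sketch of how the scoring function turns a KG into a user-specific weighted adjacency matrix (the entity/relation embeddings and the undirected treatment are illustrative assumptions):

```python
import numpy as np

def user_adjacency(triples, num_entities, rel_emb, u_emb):
    """Build A_u: entry (h, t) holds s_u(r) = u^T r for each KG triple (h, r, t).
    triples: list of (h, r, t) integer ids; rel_emb: (num_relations, d); u_emb: (d,)."""
    A_u = np.zeros((num_entities, num_entities))
    for h, r, t in triples:
        w = u_emb @ rel_emb[r]       # s_u(r) = u . r
        A_u[h, t] = A_u[t, h] = w    # treat the KG as undirected here
    return A_u
```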
  53. But in KGCN… 71 both the edge weights 𝑠ᵤ(𝑟) and the GNN transformation

    are trainable. How to solve the problem of overfitting?
  54. User Engagement Labels 72 [Figure: user-engagement labels 𝑦ᵤᵥ

    for a particular user — positive items (label 1), negative items (label 0), and non-item entities (unlabeled)]
  55. 73 User-engagement labels 𝑦ᵤᵥ for a particular

    user [Figure: positive items (1), negative items (0), non-item entities (unlabeled), and one node marked "?"] How can we get the label for an unlabeled node? User Engagement Labels
  56. 74 How can we get the label for an unlabeled node? o For a

    given node, take the weighted average of its neighborhood labels as its own label [Figure: the "?" node receives the weighted average of its neighbors' labels] Label Propagation Algorithm (LPA)
  57. 75 How can we get the label for an unlabeled node? o For a given node,

    take the weighted average of its neighborhood labels as its own label o Repeat the first step for every unlabeled node until convergence Label Propagation Algorithm (LPA)
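A minimal sketch of LPA on a weighted adjacency matrix (toy convergence criterion and initialization; not the papers' exact implementation):

```python
import numpy as np

def label_propagation(A, labels, labeled_mask, iters=100):
    """A: (N, N) weighted adjacency; labels: (N,) with known entries where
    labeled_mask is True (unknowns can start at 0)."""
    y = labels.astype(float)
    for _ in range(iters):
        deg = A.sum(axis=1)
        y_new = (A @ y) / np.maximum(deg, 1e-12)    # weighted neighborhood average
        y_new[labeled_mask] = labels[labeled_mask]  # clamp the known labels
        if np.allclose(y_new, y, atol=1e-6):        # stop at convergence
            break
        y = y_new
    return y
```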
  58. Label Smoothness Assumption 76 o LPA minimizes the following objective:

    𝐸 = ½ Σ_{(𝑖,𝑗)∈ℰ} 𝐴ᵤ[𝑖, 𝑗] (ỹᵤᵢ − ỹᵤⱼ)², where ỹ is the label predicted by LPA o Adjacent entities in the KG are more likely to have similar labels
  59. 78 Hold out the label

    of a labeled node 𝒗 [Figure: the labeled graph with 𝑣's label held out] Label Smoothness Regularization
  60. 79 Predict the label of 𝑣 by the label propagation algorithm

    [Figure: LPA run on the graph with 𝑣's label held out] Label Smoothness Regularization
  61. 80 Label Smoothness Regularization True

    label of 𝑣: 𝑦ᵤᵥ Predicted label of 𝑣: ỹᵤᵥ Cross-entropy loss 𝐽(𝑦ᵤᵥ, ỹᵤᵥ)
  62. 81 𝑅(𝐴) = Σᵤ

    𝑅(𝐴ᵤ) = Σᵤ Σᵥ 𝐽(𝑦ᵤᵥ, ỹᵤᵥ), where ỹᵤᵥ = 𝐿𝑃𝐴(𝑌∖{𝑦ᵤᵥ}; 𝐴ᵤ) Label Smoothness Regularization
  63. The Unified Model: KGNN-LS 82 Original KG → adjacency matrix (Step

    1: learn edge weights 𝐴ᵤ) → GNN over entity (item) embeddings and user embeddings (Step 2) → predict ŷᵤᵥ, the labels predicted by the GNN (Step 3) → label propagation yields ỹᵤᵥ, the labels predicted by LPA (Step 4); loss(ŷ, 𝑦) against the ground truth 𝑦ᵤᵥ updates 𝑊 and 𝐴, while loss(ỹ, 𝑦) updates 𝐴
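Putting the two losses together, the overall objective has the following shape (consistent with the KGNN-LS paper; λ and γ are trade-off weights and ℱ the trainable parameters):

```latex
\min_{W,\, A} \; \sum_{u,v} J\big(y_{uv},\, \hat{y}_{uv}\big) \;+\; \lambda\, R(A) \;+\; \gamma\, \|\mathcal{F}\|_2^2
```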
  64. Cold Start Scenario 85 o Dataset: MovieLens-20M o Varying the

    size of the training set from 𝑟 = 100% down to 𝑟 = 20% (more sparse): AUC decreases by 8.4%, 5.9%, 5.4%, 3.6%, 2.8%, 4.1%, and 1.8% across the compared methods
  65. Comparison o Performance o KGNN-LS (Aug 2019) > KGCN (May

    2019) > DKN (Apr 2018) o Scalability o Embedding-based methods > structure-based methods o User-item interactions change with time, but KGs don’t o Knowledge graph embeddings can be reused o Explainability o Structure-based methods > embedding-based methods o Graph structures are more intuitive than embeddings 86
  66. Take-Aways o Graph representation learning is a fundamental step in

    graph-related tasks o Graph neural networks are a special type of GRL method o Knowledge graphs are a special type of graph o Knowledge graph completion o PathCon: combining context and path information o Knowledge-graph-aware recommendation o DKN for news recommendation o KGCN/KGNN-LS for aggregating neighboring entity information on KGs using GNNs 87
  67. Q & A More information is available at: https://hongweiw.net All

    the source code is available at: https://github.com/hwwang55 Thanks! 88