
Recommender Systems Part 4 - 2022.01.24

Y. Yamamoto
January 17, 2022

1. PageRank
2. Topic-sensitive PageRank
3. Programming Work


Transcript

  1. Link Analysis: Find important nodes in large-scale graphs

    Yusuke Yamamoto, Associate Professor, Faculty of Informatics (yusuke_yamamoto@acm.org). Data Engineering (Recommender Systems 4), 2022.01.24
  2. Graph data

    A graph is a data structure consisting of a collection of nodes and edges (links). Each edge represents the relation between two nodes.
  3. Graph data is often observed in real life

    Examples: paper citation networks, the Web. (Image from William L. Hamilton’s COMP551 special topic lecture.)
  4. Important nodes in graphs

    We often want to find which nodes are important in a graph. Who is the most influential person? Which is the best paper? Which is the most popular webpage? Examples: paper citation graphs, the Web, social networks. (Image from William L. Hamilton’s COMP551 special topic lecture.)
  5. Important nodes in graphs

    Q. How can we compute the importance of nodes in graphs? A. Link analysis can help us!
  6. What do we learn today?

    1. PageRank 2. Topic-sensitive PageRank
  7. Part 1: PageRank

    Google introduced a new method to evaluate webpages
  8. The objective of PageRank

    Based on graph structure, PageRank evaluates and ranks webpages. Example web graph (hyperlink structure) with nodes A, B, C, D, E. Importance ranking: 1. node B (0.40pt), 2. node D (0.26pt), 3. node A (0.20pt), 4. node C (0.11pt), 5. node E (0.03pt).
  9. Simple method to evaluate webpage importance

    Simple assumption (majority voting): if a webpage is linked from many webpages, the webpage may be important. Example graph with nodes A, B, C, D, E; in-link counts shown on the figure: 3, 2, 2.
  10. Simple method to evaluate webpage importance

    Is this assumption really good?
  11. Problems on simple link counting (1/2)

    Malicious websites can easily boost their scores by creating a ‘spam farm’ of a million pages. (Figure: a target page with #in-links: 2.)
  12. Problems on simple link counting (1/2)

    With a spam farm of 98 pages (M) linking to the target page, its #in-links goes from 2 to 100.
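
    The spam-farm effect can be sketched in a few lines. This is only an illustration: the edge list and the page name "T" for the target are hypothetical, since the deck states only that the target starts with 2 in-links and that the farm has 98 pages.

```python
from collections import Counter

def in_link_counts(edges):
    """Count in-links per page from (source, target) pairs."""
    return Counter(target for _, target in edges)

# Hypothetical edge list: the target page "T" starts with 2 in-links.
edges = [("A", "T"), ("B", "T"), ("A", "B")]
before = in_link_counts(edges)["T"]

# Spam farm: 98 throwaway pages (M0 ... M97) that all link to the target.
spam_edges = [(f"M{i}", "T") for i in range(98)]
after = in_link_counts(edges + spam_edges)["T"]

print(before, "->", after)
```

    Because every in-link counts as one vote regardless of who casts it, the score is as cheap to inflate as pages are to create.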
  13. Problems on simple link counting (2/2)

    The simple method doesn’t consider whether a webpage is linked from important pages or unimportant ones. In the figure, B has 3 in-links, while C and D have 2 each; D is linked from B (whose #in-links = 3) and C is linked from E (whose #in-links = 0). Which is more important, page C or D?
  14. Basic idea of PageRank

    Assumption: if a page is linked from many IMPORTANT pages, the page may be important. C and D both have 2 in-links, but B is more important than E, so D is more important than C: D is linked from a more important node (B) than C is.
  15. Another interpretation of basic idea of PageRank

    1. When people are browsing a page, we assume that they randomly select a link on it for their next move. 2. They are likely to move to a page via links from more important pages. 3. They are more likely to stay on more important pages. How can we calculate the likelihood to stay?
  16. Toy example to check the basic idea of PageRank

    Q. Suppose that a random surfer is now at A. He randomly selects one of the links on each page to decide which page to visit next. Which page is he most likely to be on after he moves repeatedly? Initial probabilities: A = 1, B = 0, C = 0, D = 0.
  18. Toy example to check the basic idea of PageRank

    A surfer randomly selects a link to move. Page A has three out-links (to B, C, and D), so each transition probability from A is 1/3.
  19. Toy example to check the basic idea of PageRank

    How large is the chance that he will be on page B after his first transition?
  20. Toy example to check the basic idea of PageRank

    After the first transition: B = 1 × (1/3) = 1/3, C = 1 × (1/3) = 1/3, D = 1 × (1/3) = 1/3, A = 0.
  21. Toy example to check the basic idea of PageRank

    How likely is he to be on page B after his SECOND transition?
  22. Toy example to check the basic idea of PageRank

    Probabilities after the first transition: A = 0, B = 1/3, C = 1/3, D = 1/3.
  23. Toy example to check the basic idea of PageRank

    Transition probabilities on the remaining edges (figure labels): 1/2, 1/2, 1, 1/2, 1/2.
  24. Toy example to check the basic idea of PageRank

    Current probabilities: A = 0, B = 1/3, C = 1/3, D = 1/3, shown together with the transition probabilities 1/2, 1/2, 1, 1/2, 1/2.
  25. Toy example to check the basic idea of PageRank

    The probability of A→B→A is 1/3 × 1/2.
  26. Toy example to check the basic idea of PageRank

    The probability of A→C→A is 1/3 × 1.
  27. Toy example to check the basic idea of PageRank

    The probability of A→?→A is 1/3 × 1/2 + 1/3 × 1 = 1/2.
  28. Toy example to check the basic idea of PageRank

    After the second transition: A = 1/3 × 1/2 + 1/3 × 1 = 1/2. The other three pages (B, C, D) each get 1/6: 1/3 × 1/2 + 1/3 × 0 = 1/6, 1/3 × 1/2 + 0 × 1/3 = 1/6, 0 × 1/3 + 1/3 × 1/2 = 1/6.
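
    The two transitions above can be replayed with exact fractions. This is a sketch under an assumption: the walk-through states A→B, C, D (1/3 each), B→A (1/2), and C→A (1), while the remaining targets (B→D, D→B, D→C) are inferred here because they reproduce the 1/6 values; the figure itself is not in the transcript.

```python
from fractions import Fraction as F

# Out-link probabilities of the toy graph.
# Stated in the walk-through: A -> B, C, D (1/3 each), B -> A (1/2), C -> A (1).
# Assumed (inferred from the 1/6 results): B -> D, D -> B, D -> C.
out_links = {
    "A": {"B": F(1, 3), "C": F(1, 3), "D": F(1, 3)},
    "B": {"A": F(1, 2), "D": F(1, 2)},
    "C": {"A": F(1)},
    "D": {"B": F(1, 2), "C": F(1, 2)},
}

def step(prob):
    """Move the surfer one link forward and return the new distribution."""
    nxt = {page: F(0) for page in out_links}
    for src, p in prob.items():
        for dst, t in out_links[src].items():
            nxt[dst] += p * t
    return nxt

r0 = {"A": F(1), "B": F(0), "C": F(0), "D": F(0)}
r1 = step(r0)  # after the first transition
r2 = step(r1)  # after the second transition
print(r1)
print(r2)
```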
  29. Toy example to check the basic idea of PageRank

    Probability change in each iteration:

    Node | Iter. 0 | 1     | 2     | 3    | 4     | 5
    A    | 1       | 0     | 0.5   | 0.25 | 0.375 | 0.313
    B    | 0       | 0.333 | 0.167 | 0.25 | 0.208 | 0.229
    C    | 0       | 0.333 | 0.167 | 0.25 | 0.208 | 0.229
    D    | 0       | 0.333 | 0.167 | 0.25 | 0.208 | 0.229
  30. Toy example to check the basic idea of PageRank

    Probability change in each iteration:

    Node | Iter. 0 | 5     | 10    | 20    | … | 1000
    A    | 1       | 0.313 | 0.334 | 0.333 | … | 0.333
    B    | 0       | 0.229 | 0.222 | 0.222 | … | 0.222
    C    | 0       | 0.229 | 0.222 | 0.222 | … | 0.222
    D    | 0       | 0.229 | 0.222 | 0.222 | … | 0.222

    As the transitions repeat, each probability converges. The converged probabilities represent how likely people are to visit each page (i.e., PageRank).
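
    The iteration table can be regenerated with a short loop. This is a minimal sketch: the matrix encodes the toy graph with C linking back to A, and the link targets of B and D are an assumption chosen because they match the table. The helper names `mat_vec` and `iterate` are illustrative.

```python
# Column-stochastic matrix for the toy graph, node order A, B, C, D.
# M[i][j] is the probability of moving from node j to node i.
# Assumption: B -> A, D and D -> B, C (inferred from the table values).
M = [
    [0,   1/2, 1, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0  ],
]

def mat_vec(M, r):
    """Multiply matrix M by vector r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

def iterate(M, r0, n):
    """Apply the transition n times, starting from r0."""
    r = list(r0)
    for _ in range(n):
        r = mat_vec(M, r)
    return r

r0 = [1.0, 0.0, 0.0, 0.0]  # the surfer starts at A
for n in (0, 1, 2, 3, 4, 5, 10, 20, 1000):
    print(n, [round(x, 3) for x in iterate(M, r0, n)])
```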
  31. Mathematical procedure to calculate simple PageRank (1/4)

    Initial probability of users being on each node (A = 1, B = 0, C = 0, D = 0): $\mathbf{r}_0 = (1, 0, 0, 0)^\top$. Transition probability from node to node: $\mathbf{M} = \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 1 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$
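  32. Mathematical procedure to calculate simple PageRank (2/4)

    $\mathbf{r}_1 = \mathbf{M}\mathbf{r}_0 = \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 1 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$
  33. Mathematical procedure to calculate simple PageRank (3/4)

    $\mathbf{r}_2 = \mathbf{M}\mathbf{r}_1 = \mathbf{M}\mathbf{M}\mathbf{r}_0 = \mathbf{M}^2\mathbf{r}_0 = \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 1 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}^{2} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$
  34. Mathematical procedure to calculate simple PageRank (4/4)

    $\mathbf{r}_n = \mathbf{M}\mathbf{r}_{n-1} = \mathbf{M}\mathbf{M}\mathbf{r}_{n-2} = \mathbf{M}^2\mathbf{r}_{n-2} = \dots = \mathbf{M}^n\mathbf{r}_0$. If n is large enough, $\mathbf{r}_n$ represents the likelihood of people visiting each node.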
  35. Problems of simple PageRank (1/3)

    Several link structures violate the PageRank assumption: dead ends and spider traps.
  36. Problems of simple PageRank (2/3)

    Dead end: several link structures violate the PageRank assumption. Probability change in each iteration:

    Node | Iter. 0 | 1     | 10    | 100
    A    | 1       | 0     | 0.01  | 0
    B    | 0       | 0.333 | 0.015 | 0
    C    | 0       | 0.333 | 0.015 | 0
    D    | 0       | 0.333 | 0.015 | 0
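
    A dead end is a page with no out-links, so any probability that reaches it is lost on the next step. The sketch below assumes the dead-end graph is the four-node matrix printed earlier with column C set to zero; that choice is an assumption, made because it reproduces the table.

```python
# Assumed dead-end variant (node order A, B, C, D): C has no out-links,
# so column C is all zeros and the columns no longer all sum to 1.
M_dead = [
    [0,   1/2, 0, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0  ],
]

def iterate(M, r0, n):
    """Apply r <- M r exactly n times."""
    r = list(r0)
    for _ in range(n):
        r = [sum(m * x for m, x in zip(row, r)) for row in M]
    return r

r0 = [1.0, 0.0, 0.0, 0.0]
for n in (0, 1, 10, 100):
    r = iterate(M_dead, r0, n)
    # The total probability shrinks because it leaks out through C.
    print(n, [round(x, 3) for x in r], "total =", round(sum(r), 3))
```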
  37. Problems of simple PageRank (3/3)

    Spider trap: several link structures violate the PageRank assumption. Probability change in each iteration:

    Node | Iter. 0 | 1     | 10    | 100
    A    | 1       | 0     | 0.01  | 0
    B    | 0       | 0.333 | 0.015 | 0
    C    | 0       | 0.333 | 0.961 | 1
    D    | 0       | 0.333 | 0.015 | 0
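
    A spider trap is a page (or group of pages) whose links never lead back out, so it eventually absorbs all the probability. The sketch below assumes the trap is a self-loop on C, which is what the four-node matrix printed in the deck encodes and which reproduces the table.

```python
# Spider-trap matrix (node order A, B, C, D). The 1 on the diagonal means
# C links only to itself. Assumption: this is the graph behind the table.
M_trap = [
    [0,   1/2, 0, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   1, 1/2],
    [1/3, 1/2, 0, 0  ],
]

def iterate(M, r0, n):
    """Apply r <- M r exactly n times."""
    r = list(r0)
    for _ in range(n):
        r = [sum(m * x for m, x in zip(row, r)) for row in M]
    return r

r0 = [1.0, 0.0, 0.0, 0.0]
for n in (0, 1, 10, 100):
    print(n, [round(x, 3) for x in iterate(M_trap, r0, n)])
```

    Unlike the dead end, the total probability stays at 1 here; it simply piles up on C.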
  38. Revision of PageRank assumption (Complete PageRank)

    1. When people are browsing a page, we assume that they randomly select a link on it for the next transition (most cases: people use links). 2. Sometimes, people directly visit pages without using hyperlinks (called a random jump).
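  40. Algorithm of complete PageRank (1/5)

    1. Initialize $\mathbf{r}_0$ (randomly assign values to $\mathbf{r}_0$). Set the transition matrix $\mathbf{M}$ and the random surf vector $\mathbf{d}$. 2. Starting from $\mathbf{r}_0$, repeatedly update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$.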
  42. Algorithm of complete PageRank (1/5)

    In $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$, the term $\alpha \mathbf{M} \mathbf{r}_{n-1}$ corresponds to the case where people use links to visit pages.
  43. Algorithm of complete PageRank (2/5)

    Transition matrix (derived from the link structure): $\mathbf{M} = \begin{pmatrix} 0 & 1/2 & 0 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 1 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$
  44. Algorithm of complete PageRank (1/5)

    In $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$, the term $(1 - \alpha)\,\mathbf{d}$ corresponds to the case where people directly visit pages.
  45. Algorithm of complete PageRank (3/5)

    Random surf vector: the probability of people directly visiting each page (a uniform distribution): $\mathbf{d} = (1/4, 1/4, 1/4, 1/4)^\top$
  47. Algorithm of complete PageRank (4/5)

    In $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$, $\alpha$ and $1 - \alpha$ are the probabilities (parameters) that decide which of the two factors people use: they follow a link with probability $\alpha$ and jump directly with probability $1 - \alpha$. (Empirically, $\alpha$ is set in the range from 0.8 to 0.9.)
  48. Algorithm of complete PageRank (5/5)

    3. If $\mathbf{r}_n$ has converged (it no longer changes), the algorithm finishes. The converged $\mathbf{r}_n$ is the PageRank!
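
    The three steps can be put together as follows. This is a minimal sketch, not the course's reference code: the function name `pagerank`, the L1 convergence test, the tolerance, and the uniform start vector are all choices made here (the slides say to initialize r0 randomly), and the example matrix is the toy graph with C linking back to A.

```python
def pagerank(M, d, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate r_n = alpha * M r_{n-1} + (1 - alpha) * d until r stops changing."""
    n = len(d)
    r = [1.0 / n] * n  # step 1: initialize r_0 (uniform here, an assumption)
    for _ in range(max_iter):
        # step 2: one update with the formula
        new_r = [
            alpha * sum(M[i][j] * r[j] for j in range(n)) + (1 - alpha) * d[i]
            for i in range(n)
        ]
        # step 3: stop once the update no longer changes r
        if sum(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r

# Toy graph, node order A, B, C, D (assumed links: B -> A, D; C -> A; D -> B, C).
M = [
    [0,   1/2, 1, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0  ],
]
d = [1/4] * 4  # uniform random surf vector
ranks = pagerank(M, d)
print([round(x, 3) for x in ranks])
```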
  49. Simple PageRank vs. complete PageRank

    Spider trap example. Probability change in each iteration.

    Simple PageRank:

    Node | Iter. 0 | 1     | 10    | 100
    A    | 1       | 0     | 0.01  | 0
    B    | 0       | 0.333 | 0.015 | 0
    C    | 0       | 0.333 | 0.961 | 1
    D    | 0       | 0.333 | 0.015 | 0

    Complete PageRank:

    Node | Iter. 0 | 1     | 10    | 100
    A    | 1       | 0.05  | 0.102 | 0.101
    B    | 0       | 0.316 | 0.129 | 0.128
    C    | 0       | 0.316 | 0.639 | 0.642
    D    | 0       | 0.316 | 0.129 | 0.128
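
    The comparison can be reproduced numerically. Two assumptions are made in this sketch: the spider trap is a self-loop on C, and α = 0.8, a value the deck does not state but which follows from the table (after one step A holds (1 − α)/4 = 0.05).

```python
# Spider-trap matrix (node order A, B, C, D); C links only to itself.
M_trap = [
    [0,   1/2, 0, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   1, 1/2],
    [1/3, 1/2, 0, 0  ],
]
d = [1/4] * 4  # uniform random surf vector

def run(M, r0, n, alpha=1.0):
    """n updates of r <- alpha * M r + (1 - alpha) * d; alpha=1 is simple PageRank."""
    r = list(r0)
    for _ in range(n):
        r = [
            alpha * sum(m * x for m, x in zip(row, r)) + (1 - alpha) * d[i]
            for i, row in enumerate(M)
        ]
    return r

r0 = [1.0, 0.0, 0.0, 0.0]
simple = run(M_trap, r0, 100)               # everything ends up on C
complete = run(M_trap, r0, 100, alpha=0.8)  # alpha inferred from the table
print([round(x, 3) for x in simple])
print([round(x, 3) for x in complete])
```

    With the random jump, C still ranks first, but the other pages keep a nonzero score.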
  50. What can PageRank provide us?

    PageRank can evaluate the centrality of nodes in graph (network) data: influential people, good papers to cite (paper citation networks), popular webpages (the Web).
  51. Part 2: Topic-sensitive PageRank

    An improved PageRank that considers each node’s topic

  52. Issues of normal PageRank

    Normal PageRank ignores what kinds of topics each node is related to. Example graph with nodes A to G; some pages are about medicine and the others are about cosmetics. Normal PageRank: 1. C 0.282pt, 2. A 0.174pt, 3. F 0.133pt, 4. D 0.132pt, 5. B 0.093pt, 6. E 0.092pt, 7. G 0.092pt.
  54. Which node is the most important about medicine?

    Many pages link to C, but only one of them is about medicine. A is linked from more pages about medicine than C is.
  55. Which node is the most important about medicine?

    Assumption: if people often move to a page from important pages about a topic, that page should be important for the topic!
  56. Assumption of Topic-sensitive PageRank

    Normal PageRank: • People follow links on pages to visit other pages. • They sometimes randomly visit pages without links. Topic-sensitive PageRank: • People follow links on pages to visit other pages. • They sometimes randomly visit only a certain kind of page (pages on the target topic) without links.
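  57. Algorithm of Topic-sensitive PageRank (1/2)

    Starting from $\mathbf{r}_0$, repeatedly update $\mathbf{r}_n$ with the formula $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$. Transition probabilities for the seven-node graph (node order A, B, C, D, E, F, G): $\begin{pmatrix} 0 & 0 & 1/4 & 1/4 & 1/4 & 0 & 1/4 \\ 1/2 & 0 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\ 0 & 0 & 1/2 & 0 & 0 & 1/2 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$. As printed, row i lists the out-link probabilities of node i (for example, A links to C, D, E, and G with probability 1/4 each, and C has no out-links), so the update uses the transpose of this table as $\mathbf{M}$, following the column convention of the four-node example.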
  58. Algorithm of Topic-sensitive PageRank (2/2)

    In $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$, normal PageRank uses a uniform random surf vector over A, B, C, D, E, F, G: $\mathbf{d} = (1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7)^\top$.
  59. Algorithm of Topic-sensitive PageRank (2/2)

    Normal PageRank: $\mathbf{d} = (1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7)^\top$. Topic-sensitive PageRank: $\mathbf{d} = (1/4, 1/4, 1/4, 0, 0, 0, 1/4)^\top$, i.e. random jumps go only to the pages on the target topic (A, B, C, G).
  60. Results of Topic-sensitive PageRank (TsPR)

    • TsPR gives high scores to pages about a target topic. • Even if a page is not about the target topic, TsPR gives it a high score if it is linked from important pages.

    Rank | Normal PageRank | Topic-sensitive PR
    1    | C 0.282pt       | A 0.266pt
    2    | A 0.174pt       | C 0.248pt
    3    | F 0.133pt       | G 0.147pt
    4    | D 0.132pt       | B 0.121pt
    5    | B 0.093pt       | D 0.108pt
    6    | E 0.092pt       | E 0.057pt
    7    | G 0.092pt       | F 0.055pt
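
    Both rankings can be approximated with the same update rule and two different random surf vectors. This sketch rests on assumptions the deck does not state: α = 0.85, and the rank held by the dead-end page C is re-injected along d. They are used here because together they reproduce the scores above to about three decimals. The out-links are read off the seven-node transition table.

```python
NODES = ["A", "B", "C", "D", "E", "F", "G"]
# Out-links read off the transition table; C has no out-links (dead end).
OUT = {"A": "CDEG", "B": "AD", "C": "", "D": "BCF", "E": "CF", "F": "C", "G": "A"}

def topic_pagerank(d, alpha=0.85, iters=200):
    """Iterate r = alpha * M r + (1 - alpha) * d for a fixed number of steps."""
    r = dict(d)
    for _ in range(iters):
        # Assumption: rank sitting on dead ends is redistributed along d.
        dangling = sum(r[v] for v in NODES if not OUT[v])
        new_r = {v: (1 - alpha + alpha * dangling) * d[v] for v in NODES}
        for src in NODES:
            for dst in OUT[src]:
                new_r[dst] += alpha * r[src] / len(OUT[src])
        r = new_r
    return r

uniform = {v: 1 / 7 for v in NODES}                             # normal PageRank
medicine = {v: (1 / 4 if v in "ABCG" else 0.0) for v in NODES}  # topic: medicine

normal = topic_pagerank(uniform)
tspr = topic_pagerank(medicine)
for name, scores in (("normal", normal), ("topic-sensitive", tspr)):
    ranking = sorted(scores.items(), key=lambda kv: -kv[1])
    print(name, [(v, round(s, 3)) for v, s in ranking])
```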
  61. When do we use Topic-sensitive PageRank?

    1. Finding important nodes on a graph for target topics. - For that, give random surf values only to the nodes for the target topics. 2. Finding important nodes for individual users (personalized PageRank). - If you know the nodes that a user frequently visits, give random surf values only to those nodes. Example: with pages which a user likes, $\mathbf{r}_n = \alpha \mathbf{M} \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$ uses $\mathbf{d} = (0, 1/3, 0, 0, 1/3, 0, 1/3)^\top$ (node order A to G, so the liked pages are B, E, and G).
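
    Personalization changes only the random surf vector. The sketch below reuses the seven-node graph with d concentrated on the pages a user likes (B, E, G). As before, α = 0.85 and the re-injection of the dead end's rank along d are assumptions, and `personalized_pagerank` is an illustrative name.

```python
NODES = ["A", "B", "C", "D", "E", "F", "G"]
# Out-links of the seven-node example graph; C is a dead end.
OUT = {"A": "CDEG", "B": "AD", "C": "", "D": "BCF", "E": "CF", "F": "C", "G": "A"}

def personalized_pagerank(liked, alpha=0.85, iters=200):
    """PageRank whose random jumps land only on the pages in `liked`."""
    d = {v: (1 / len(liked) if v in liked else 0.0) for v in NODES}
    r = dict(d)
    for _ in range(iters):
        # Assumption: rank on dead ends is redistributed along d.
        dangling = sum(r[v] for v in NODES if not OUT[v])
        new_r = {v: (1 - alpha + alpha * dangling) * d[v] for v in NODES}
        for src in NODES:
            for dst in OUT[src]:
                new_r[dst] += alpha * r[src] / len(OUT[src])
        r = new_r
    return r

baseline = personalized_pagerank(set(NODES))   # uniform d = normal PageRank
for_user = personalized_pagerank({"B", "E", "G"})
for v in NODES:
    print(v, round(baseline[v], 3), "->", round(for_user[v], 3))
```

    The liked pages gain score relative to the uniform baseline, while heavily linked pages such as C can still rank high.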
  62. Part 3: Programming Work

  63. Visit the following URL: https://recsys2021.hontolab.org/

  64. Click this link to learn today’s contents