Y. Yamamoto
January 17, 2022

# Recommender Systems Part 4 - 2022.01.24

1. PageRank
2. Topic-sensitive PageRank
3. Programming Work


## Transcript

1. ### Link Analysis: Find important nodes in large-scale graphs

Yusuke Yamamoto, Associate Professor, Faculty of Informatics (yusuke_yamamoto@acm.org). Data Engineering (Recommender Systems 4), 2022.01.24
2. ### Graph data

A graph is a data structure consisting of a collection of nodes and edges (links). Each edge represents the relation between two nodes.
3. ### Graph data is often observed in real life

Examples: paper citation networks, the Web. (Image from William L. Hamilton's COMP551 special topic lecture.)
4. ### Important nodes in graphs

We often want to find which nodes are important in graphs. In social networks: who is the most influential person? In paper citation graphs: which is the best paper? On the Web: which is the most popular webpage? (Image from William L. Hamilton's COMP551 special topic lecture.)
5. ### Important nodes in graphs

We often want to know which nodes are important in graphs: the most influential person, the best paper, the most popular webpage.

Q. How can we compute the importance of nodes in graphs?

A. Link analysis can help us!

8. ### The objective of PageRank

Importance ranking for the example Web graph (hyperlink structure) with nodes A to E:

| Rank | Node | Score (pt) |
| --- | --- | --- |
| 1 | B | 0.40 |
| 2 | D | 0.26 |
| 3 | A | 0.20 |
| 4 | C | 0.11 |
| 5 | E | 0.03 |

Based on graph structure, PageRank evaluates and ranks webpages.
9. ### Simple method to evaluate webpage importance

Simple assumption (majority voting): if a webpage is linked to by many webpages, it is likely to be important. In the example graph (nodes A to E), three pages are annotated with in-link counts of 3, 2, and 2.
10. ### Simple method to evaluate webpage importance

Simple assumption (majority voting): if a webpage is linked to by many webpages, it is likely to be important. In the example graph (nodes A to E), three pages are annotated with in-link counts of 3, 2, and 2.

Is this assumption really good?
11. ### Problems on simple link counting (1/2)

Malicious websites can easily inflate their scores by creating a 'spam farm' of a million pages. (Figure: graph of nodes A to E; the target page has 2 in-links.)
12. ### Problems on simple link counting (1/2)

Malicious websites can easily inflate their scores by creating a 'spam farm' of a million pages. In the figure, a spam farm of 98 pages (each marked M) links to the target page, raising its in-link count from 2 to 100.
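
The effect of a spam farm on plain link counting can be sketched in a few lines. The edge list below is hypothetical (the slides do not list the edges of the example graph), and `E` is simply chosen as the promoted page; only the numbers 2, 98, and 100 come from the slide.

```python
from collections import Counter


def count_in_links(edges):
    """Return {page: number of in-links} for a list of (source, target) edges."""
    return Counter(target for _, target in edges)


# Hypothetical edge list for illustration only (not taken from the slides).
edges = [
    ("A", "B"), ("C", "B"), ("D", "B"),
    ("A", "D"), ("B", "D"),
    ("B", "E"), ("C", "E"),
]
before = count_in_links(edges)

# A spam farm: 98 throwaway pages that all link to the page being promoted.
spam_edges = [(f"M{i}", "E") for i in range(98)]
after = count_in_links(edges + spam_edges)

print(before["E"], "->", after["E"])  # prints "2 -> 100"
```

Because every in-link counts the same, the promoted page overtakes every honest page without any change in how other pages regard it.
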

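14. ### Basic idea of PageRank

Assumption: if a page is linked to by many IMPORTANT pages, the page is likely to be important.

In the example graph (nodes A to E), two pages can have the same number of in-links (#in-links: 2) and still differ in importance: D is more important than C because D is linked from a more important node (B) than C is. (The figure also marks a node as more important than E.)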
15. ### Another interpretation of basic idea of PageRank

1. When people are browsing a page, we assume that they randomly select a link on it for their next page.
2. They are likely to move to a page through links from more important pages.
3. They are more likely to stay on more important pages.

How can we calculate the likelihood to stay?
16. ### Toy example to check the basic idea of PageRank

Q. Suppose that a random surfer is now at A. On each page, he randomly selects one of its links to decide which page to visit next. Which page is he most likely to be on after moving repeatedly?

Initial probabilities (graph of nodes A, B, C, D): A = 1, B = 0, C = 0, D = 0.
17. ### Toy example to check the basic idea of PageRank

Q. Which page is he most likely to be on after moving repeatedly?

The surfer starts at A with probability 1.
18. ### Toy example to check the basic idea of PageRank

The surfer randomly selects a link to move. Transition probability: each of the three links on A is followed with probability 1/3.

Q. Which page is he most likely to be on after moving repeatedly?
19. ### Toy example to check the basic idea of PageRank

Transition probability: 1/3 for each of the three links on A. How large is the chance that he will be on page B after his first transition?

Q. Which page is he most likely to be on after moving repeatedly?
20. ### Toy example to check the basic idea of PageRank

After the first transition: A = 0, B = 1 × (1/3) = 1/3, C = 1 × (1/3) = 1/3, D = 1 × (1/3) = 1/3.

Q. Which page is he most likely to be on after moving repeatedly?
21. ### Toy example to check the basic idea of PageRank

After the first transition: A = 0, B = C = D = 1/3. How likely is he to be on page B after his SECOND transition?

Q. Which page is he most likely to be on after moving repeatedly?
22. ### Toy example to check the basic idea of PageRank

Probabilities after the first transition: A = 0, B = 1/3, C = 1/3, D = 1/3.

Q. Which page is he most likely to be on after moving repeatedly?
23. ### Toy example to check the basic idea of PageRank

Transition probabilities for the second move: 1/2 for each of the two links on B, 1 for the single link on C, and 1/2 for each of the two links on D.

Q. Which page is he most likely to be on after moving repeatedly?
24. ### Toy example to check the basic idea of PageRank

Transition probabilities: 1/2 per link on B, 1 on C, 1/2 per link on D. Current probabilities: A = 0, B = C = D = 1/3.

Q. Which page is he most likely to be on after moving repeatedly?
25. ### Toy example to check the basic idea of PageRank

The probability of the path A→B→A is $\frac{1}{3} \times \frac{1}{2}$.

Q. Which page is he most likely to be on after moving repeatedly?
26. ### Toy example to check the basic idea of PageRank

The probability of the path A→C→A is $\frac{1}{3} \times 1$.

Q. Which page is he most likely to be on after moving repeatedly?
27. ### Toy example to check the basic idea of PageRank

The probability of returning to A in two moves (A→?→A) is $\frac{1}{3} \times \frac{1}{2} + \frac{1}{3} \times 1 = \frac{1}{2}$.

Q. Which page is he most likely to be on after moving repeatedly?
28. ### Toy example to check the basic idea of PageRank

Probabilities after the second transition:

- A: $\frac{1}{3} \times \frac{1}{2} + \frac{1}{3} \times 1 = \frac{1}{2}$
- Each of the other three nodes gets $\frac{1}{6}$: $\frac{1}{3} \times \frac{1}{2} + \frac{1}{3} \times 0 = \frac{1}{6}$, $\frac{1}{3} \times \frac{1}{2} + 0 \times \frac{1}{3} = \frac{1}{6}$, and $0 \times \frac{1}{3} + \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$

Q. Which page is he most likely to be on after moving repeatedly?
29. ### Toy example to check the basic idea of PageRank

Probability change in each iteration:

| Node | Iter. 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 0 | 0.5 | 0.25 | 0.375 | 0.313 |
| B | 0 | 0.333 | 0.167 | 0.25 | 0.208 | 0.229 |
| C | 0 | 0.333 | 0.167 | 0.25 | 0.208 | 0.229 |
| D | 0 | 0.333 | 0.167 | 0.25 | 0.208 | 0.229 |

Q. Which page is he most likely to be on after moving repeatedly?
30. ### Toy example to check the basic idea of PageRank

Probability change in each iteration:

| Node | Iter. 0 | 5 | 10 | 20 | … | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 0.313 | 0.334 | 0.333 | … | 0.333 |
| B | 0 | 0.229 | 0.222 | 0.222 | … | 0.222 |
| C | 0 | 0.229 | 0.222 | 0.222 | … | 0.222 |
| D | 0 | 0.229 | 0.222 | 0.222 | … | 0.222 |

As the transitions repeat, each probability converges. The converged probabilities represent how likely people are to visit each page (i.e., PageRank).

Q. Which page is he most likely to be on after moving repeatedly?
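
The iteration table can be reproduced with a short sketch that pushes the surfer's probability along the out-links. The slides state A's three links, B→A, and C→A explicitly; the remaining targets in `out_links` (B→D, D→B, D→C) are an inference chosen so that the numbers match the table, and the function names are illustrative.

```python
# Out-links of the toy graph. B->D, D->B and D->C are inferred from the
# iteration table rather than stated on the slides.
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}


def step(prob):
    """Move every node's probability evenly along its out-links."""
    nxt = {node: 0.0 for node in out_links}
    for node, targets in out_links.items():
        share = prob[node] / len(targets)
        for target in targets:
            nxt[target] += share
    return nxt


def surf(n_steps, start="A"):
    """Return the probability of being on each node after n_steps moves."""
    prob = {node: 1.0 if node == start else 0.0 for node in out_links}
    for _ in range(n_steps):
        prob = step(prob)
    return prob


for n in (1, 2, 5, 1000):
    print(n, {node: round(p, 3) for node, p in surf(n).items()})
```

After many steps the values settle at 1/3 for A and 2/9 for each of B, C, and D, which are the 0.333 and 0.222 entries in the table.
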
31. ### Mathematical procedure to calculate simple PageRank (1/4)

Initial probability of users being on each node (node order A, B, C, D):

$$\mathbf{r}_0 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$

Transition probability from node to node, where column $j$ holds the probabilities of moving out of node $j$:

$$M = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
32. ### Mathematical procedure to calculate simple PageRank (2/4)

$$\mathbf{r}_1 = M\mathbf{r}_0 = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$$
33. ### Mathematical procedure to calculate simple PageRank (3/4)

$$\mathbf{r}_2 = M\mathbf{r}_1 = M(M\mathbf{r}_0) = M^2\mathbf{r}_0$$
34. ### Mathematical procedure to calculate simple PageRank (4/4)

$$\mathbf{r}_n = M\mathbf{r}_{n-1} = MM\mathbf{r}_{n-2} = M^2\mathbf{r}_{n-2} = \dots = M^n\mathbf{r}_0$$

If $n$ is large enough, we regard $\mathbf{r}_n$ as the likelihood of people visiting each node.
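
How large is "large enough" can be decided by a stopping rule rather than a fixed $n$. The sketch below applies $\mathbf{r}_n = M\mathbf{r}_{n-1}$ until the L1 change drops below a tolerance. The matrix encodes the toy graph from the worked example (node order A, B, C, D, with C sending all of its probability to A, as in the A→C→A step); the tolerance, the iteration cap, and the function names are assumptions made for illustration.

```python
# Column j holds the out-link probabilities of node j (order A, B, C, D),
# so every column sums to 1.
M = [
    [0,   1/2, 1, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0],
]


def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def power_iteration(matrix, r0, tol=1e-10, max_iter=1000):
    """Apply r_n = M r_{n-1} until the L1 change falls below tol."""
    r = list(r0)
    for n in range(1, max_iter + 1):
        r_next = mat_vec(matrix, r)
        if sum(abs(a - b) for a, b in zip(r_next, r)) < tol:
            return r_next, n
        r = r_next
    return r, max_iter


r, n_iter = power_iteration(M, [1, 0, 0, 0])
print(n_iter, [round(x, 3) for x in r])
```

On this graph the loop stops well before the iteration cap, and the result agrees with the converged values 0.333 and 0.222.
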
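35. ### Problems of simple PageRank (1/3)

Several link structures violate the PageRank assumption, for example a dead end and a spider trap. (Figure: two graphs of nodes A, B, C, D, one labeled "Dead end" and one labeled "Spider trap".)
36. ### Problems of simple PageRank (2/3)

Dead end: several link structures violate the PageRank assumption. Probability change in each iteration:

| Node | Iter. 0 | 1 | 10 | 100 |
| --- | --- | --- | --- | --- |
| A | 1 | 0 | 0.01 | 0 |
| B | 0 | 0.333 | 0.015 | 0 |
| C | 0 | 0.333 | 0.015 | 0 |
| D | 0 | 0.333 | 0.015 | 0 |

With a dead end, every probability decays to 0.
37. ### Problems of simple PageRank (3/3)

Spider trap: several link structures violate the PageRank assumption. Probability change in each iteration:

| Node | Iter. 0 | 1 | 10 | 100 |
| --- | --- | --- | --- | --- |
| A | 1 | 0 | 0.01 | 0 |
| B | 0 | 0.333 | 0.015 | 0 |
| C | 0 | 0.333 | 0.961 | 1 |
| D | 0 | 0.333 | 0.015 | 0 |

With a spider trap, C absorbs all of the probability.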
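
Both failure modes can be checked numerically. The sketch below assumes the dead end is node C with no out-links and the spider trap is node C linking only to itself; the slides do not spell these structures out, but the two variants give numbers close to the tables above. Function and variable names are illustrative.

```python
def iterate(out_links, n_steps, start="A"):
    """Propagate the surfer's probability n_steps times.

    Probability sitting on a node without out-links is simply lost.
    """
    prob = {node: 1.0 if node == start else 0.0 for node in out_links}
    for _ in range(n_steps):
        nxt = {node: 0.0 for node in out_links}
        for node, targets in out_links.items():
            for target in targets:
                nxt[target] += prob[node] / len(targets)
        prob = nxt
    return prob


# Links shared by both variants (assumed to match the toy graph).
base = {"A": ["B", "C", "D"], "B": ["A", "D"], "D": ["B", "C"]}
dead_end = {**base, "C": []}        # assumption: C has no out-links
spider_trap = {**base, "C": ["C"]}  # assumption: C links only to itself

for name, graph in (("dead end", dead_end), ("spider trap", spider_trap)):
    for n in (1, 10, 100):
        prob = iterate(graph, n)
        rounded = {node: round(prob[node], 3) for node in "ABCD"}
        print(name, n, rounded, "total:", round(sum(prob.values()), 3))
```

In the dead-end case the total probability itself shrinks toward 0, while in the spider-trap case the total stays at 1 but ends up entirely on C.
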
38. ### Revision of PageRank assumption (Complete PageRank)

1. When people are browsing a page, we assume that they randomly select a link on it for the next transition. (Most cases: people use links.)
2. Sometimes, people directly visit pages without using hyperlinks; this is called a random jump. (Sometimes: people directly jump.)
39. ### Revision of PageRank assumption (Complete PageRank)

1. When people are browsing a page, we assume that they randomly select a link on it for the next transition. (Most cases: people use links.)
2. Sometimes, people directly visit pages without using hyperlinks; this is called a random jump. (Sometimes: people directly jump.)
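40. ### Algorithm of complete PageRank (1/5)

1. Initialize $\mathbf{r}_0$ (randomly assign values to $\mathbf{r}_0$). Set the transition matrix $M$ and the random surf vector $\mathbf{d}$.
2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ for $n = 1, 2, \dots$ with the formula below:

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$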
42. ### Algorithm of complete PageRank (1/5)

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

The first term, $\alpha M \mathbf{r}_{n-1}$, corresponds to the case where people use links to visit pages.
43. ### Algorithm of complete PageRank (2/5)

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

$M$ is the transition matrix, which is derived from the link structure (node order A, B, C, D; column $j$ holds the probabilities of moving out of node $j$):

$$M = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}$$
44. ### Algorithm of complete PageRank (1/5)

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

The second term, $(1 - \alpha)\,\mathbf{d}$, corresponds to the case where people directly visit pages.
45. ### Algorithm of complete PageRank (3/5)

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

$\mathbf{d}$ is the random surf vector: the probability of people directly visiting each page (a uniform distribution). For the four nodes A, B, C, D:

$$\mathbf{d} = \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix}$$
47. ### Algorithm of complete PageRank (4/5)

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

$\alpha$ and $1 - \alpha$ are the probabilities (parameters) that decide which of the two behaviors people use: following a link or jumping directly. (Empirically, $\alpha$ is set in the range from 0.8 to 0.9.)
48. ### Algorithm of complete PageRank (5/5)

1. Initialize $\mathbf{r}_0$ (randomly assign values to $\mathbf{r}_0$). Set the transition matrix $M$ and the random surf vector $\mathbf{d}$.
2. Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ for $n = 1, 2, \dots$ with the formula $\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$.
3. If $\mathbf{r}_n$ has converged (it no longer changes), the algorithm finishes. The converged $\mathbf{r}_n$ is the PageRank!
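
The three steps can be sketched as a small function. This is a minimal illustration, not the lecture's reference implementation: the value `alpha=0.8` is an assumption taken from the 0.8 to 0.9 range, the uniform starting vector stands in for the random initialization, and the spider-trap matrix assumes that C links only to itself.

```python
def pagerank(M, d, alpha=0.8, tol=1e-10, max_iter=1000):
    """Iterate r_n = alpha * M r_{n-1} + (1 - alpha) * d until r stops changing."""
    n = len(d)
    r = [1.0 / n] * n  # step 1: start from some probability vector
    for _ in range(max_iter):
        Mr = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        r_next = [alpha * Mr[i] + (1 - alpha) * d[i] for i in range(n)]  # step 2
        if sum(abs(a - b) for a, b in zip(r_next, r)) < tol:  # step 3
            return r_next
        r = r_next
    return r


# Spider-trap graph, node order A, B, C, D (assumption: C links only to itself).
M_trap = [
    [0,   1/2, 0, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   1, 1/2],
    [1/3, 1/2, 0, 0],
]
d = [1/4] * 4  # uniform random surf vector

scores = pagerank(M_trap, d, alpha=0.8)
print([round(s, 3) for s in scores])
```

With these assumptions, C still ranks first but no longer holds all of the probability, and the scores come out close to the complete PageRank values in the comparison on the next slide.
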
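49. ### Simple PageRank vs. complete PageRank

Probability change in each iteration on the spider-trap graph (nodes A, B, C, D).

Simple PageRank:

| Node | Iter. 0 | 1 | 10 | 100 |
| --- | --- | --- | --- | --- |
| A | 1 | 0 | 0.01 | 0 |
| B | 0 | 0.333 | 0.015 | 0 |
| C | 0 | 0.333 | 0.961 | 1 |
| D | 0 | 0.333 | 0.015 | 0 |

Complete PageRank:

| Node | Iter. 0 | 1 | 10 | 100 |
| --- | --- | --- | --- | --- |
| A | 1 | 0.05 | 0.102 | 0.101 |
| B | 0 | 0.316 | 0.129 | 0.128 |
| C | 0 | 0.316 | 0.639 | 0.642 |
| D | 0 | 0.316 | 0.129 | 0.128 |

With complete PageRank, C no longer absorbs all of the probability.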
50. ### What can PageRank provide us?

PageRank can evaluate the centrality of nodes in graph (network) data:

- Social networks: influential people
- Paper citation networks: good papers to cite
- Web: popular webpages

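52. ### Issues of normal PageRank

Normal PageRank ignores what kinds of topics each node is related to. Normal PageRank scores for the example graph (pages A to G):

| Rank | Page | Score (pt) |
| --- | --- | --- |
| 1 | C | 0.282 |
| 2 | A | 0.174 |
| 3 | F | 0.133 |
| 4 | D | 0.132 |
| 5 | B | 0.093 |
| 6 | E | 0.092 |
| 7 | G | 0.092 |

(In the figure, each page is marked as a page about medicine or a page about cosmetics.)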
54. ### Which node is the most important about medicine?

Many pages link to C, but only one of them is about medicine. A is linked from more pages about medicine than C is. (Figure: graph of pages A to G, each marked as a page about medicine or a page about cosmetics.)
55. ### Which node is the most important about medicine?

Many pages link to C, but only one of them is about medicine. A is linked from more pages about medicine than C is.

Assumption: if people often move to a page from important pages about the topic, that page should be important for the topic!
56. ### Assumption of Topic-sensitive PageRank

Normal PageRank:

- People follow links on pages to visit other pages.
- They sometimes randomly visit pages without using links.

Topic-sensitive PageRank:

- People follow links on pages to visit other pages.
- They sometimes randomly visit only a certain kind of pages (pages about the target topic) without using links.
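57. ### Algorithm of Topic-sensitive PageRank (1/2)

Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula below:

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

Transition probabilities of the example graph (nodes A to G). In this matrix each row holds the out-link probabilities of one node, so the rows sum to 1 (the all-zero third row is a node with no out-links); with the column convention used for simple PageRank, $M$ is the transpose of this matrix:

$$\begin{pmatrix} 0 & 0 & 1/4 & 1/4 & 1/4 & 0 & 1/4 \\ 1/2 & 0 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\ 0 & 0 & 1/2 & 0 & 0 & 1/2 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$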
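58. ### Algorithm of Topic-sensitive PageRank (2/2)

Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula below:

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

Normal PageRank uses a uniform random surf vector over the seven nodes A to G:

$$\mathbf{d} = (1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7)^\top$$
59. ### Algorithm of Topic-sensitive PageRank (2/2)

Starting from $\mathbf{r}_0$, update $\mathbf{r}_n$ with the formula below:

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}$$

Normal PageRank: $\mathbf{d} = (1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7,\ 1/7)^\top$

Topic-sensitive PageRank: $\mathbf{d} = (1/4,\ 1/4,\ 1/4,\ 0,\ 0,\ 0,\ 1/4)^\top$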
60. ### Results of Topic-sensitive PageRank (TsPR)

Scores for the example graph (pages A to G, each about medicine or about cosmetics):

| Rank | Normal PageRank | Topic-sensitive PR |
| --- | --- | --- |
| 1 | C 0.282pt | A 0.266pt |
| 2 | A 0.174pt | C 0.248pt |
| 3 | F 0.133pt | G 0.147pt |
| 4 | D 0.132pt | B 0.121pt |
| 5 | B 0.093pt | D 0.108pt |
| 6 | E 0.092pt | E 0.057pt |
| 7 | G 0.092pt | F 0.055pt |

- TsPR gives high scores to pages about a target topic.
- Even if a page is not about a target topic, TsPR gives it a high score if it is linked from important pages.
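
The two rankings can be approximated with a short sketch. Several things here are assumptions rather than statements from the slides: the rows of the transition matrix are read in node order A to G, `alpha` is set to 0.85, the probability sitting on the page without out-links (C) is redistributed according to $\mathbf{d}$, and the medicine pages are taken to be the nodes with non-zero entries in the topic-sensitive $\mathbf{d}$ (A, B, C, G).

```python
NODES = ["A", "B", "C", "D", "E", "F", "G"]

# Out-links read off the transition matrix, assuming row order A..G.
OUT_LINKS = {
    "A": ["C", "D", "E", "G"],
    "B": ["A", "D"],
    "C": [],
    "D": ["B", "C", "F"],
    "E": ["C", "F"],
    "F": ["C"],
    "G": ["A"],
}


def pagerank(out_links, d, alpha=0.85, tol=1e-12, max_iter=1000):
    """r_n = alpha * M r_{n-1} + (1 - alpha) * d, with the probability of
    dangling pages (no out-links) handed back according to d (an assumption)."""
    r = dict(d)
    for _ in range(max_iter):
        dangling = sum(r[n] for n, targets in out_links.items() if not targets)
        nxt = {n: (1 - alpha) * d[n] + alpha * dangling * d[n] for n in out_links}
        for n, targets in out_links.items():
            for t in targets:
                nxt[t] += alpha * r[n] / len(targets)
        if sum(abs(nxt[n] - r[n]) for n in out_links) < tol:
            return nxt
        r = nxt
    return r


uniform = {n: 1 / 7 for n in NODES}
medicine = {n: (1 / 4 if n in {"A", "B", "C", "G"} else 0.0) for n in NODES}

normal = pagerank(OUT_LINKS, uniform)
topic = pagerank(OUT_LINKS, medicine)
for name, scores in (("normal", normal), ("topic-sensitive", topic)):
    ranking = sorted(scores.items(), key=lambda kv: -kv[1])
    print(name, [(n, round(s, 3)) for n, s in ranking])
```

Under these assumptions C ranks first with the uniform vector and A ranks first with the medicine vector, in line with the table above; changing `alpha` or the treatment of the dangling page shifts the exact scores.
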
61. ### When do we use Topic-sensitive PageRank?

1. Finding important nodes on a graph for target topics
   - For that, give random surf values only to nodes about the target topics.
2. Finding important nodes for individual users (personalized PageRank)
   - If you know the nodes that a user frequently visits, give random surf values only to those nodes.

Example with pages which a user likes (nodes A to G):

$$\mathbf{r}_n = \alpha M \mathbf{r}_{n-1} + (1 - \alpha)\,\mathbf{d}, \qquad \mathbf{d} = (0,\ 1/3,\ 0,\ 0,\ 1/3,\ 0,\ 1/3)^\top$$