of PageRank
▪ Developed as part of an academic project at Stanford University
  ▪ research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
▪ Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998), which finally led to a functional prototype called Google
Search Until 1998
▪ Find all documents using a query term
  ▪ use information retrieval (IR) solutions
  ▪ ranking based on "on the page factors"
  → problem: poor quality of search results (order)
▪ Page and Brin proposed to compute the absolute quality of a page (PageRank)
  ▪ based on the number and quality of pages linking to a page (votes)
▪ A page has a high PageRank R if
  ▪ there are many pages linking to it
  ▪ or, if there are some pages with a high PageRank linking to it
▪ Total score = IR score × PageRank
[Figure: link graph of pages P1 to P8 with PageRanks R1 to R8]
Algorithm
$$R(P_i) = \sum_{P_j \in B_i} \frac{R(P_j)}{L_j}$$
▪ where
  ▪ $B_i$ is the set of pages that link to page $P_i$
  ▪ $L_j$ is the number of outgoing links for page $P_j$
[Figure: three-page example (P1, P2, P3); the ranks start at 1, 1, 1 and settle at 1.5, 1.5, 0.75]
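The recurrence can be tried out on the three-page example with a few lines of Python. This is only a minimal sketch: the link structure (P1 → P2, P2 → P1 and P3, P3 → P1) is an assumption read off the example matrix H, the names `links`, `inbound` and `sweep` are made up for illustration, and ranks are updated in place, page by page, which is one possible way to iterate.

```python
# Assumed link structure of the three-page example:
# P1 -> P2, P2 -> P1 and P3, P3 -> P1.
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"]}


def inbound(links):
    """Return B: for every page, the list of pages that link to it."""
    b = {page: [] for page in links}
    for src, targets in links.items():
        for dst in targets:
            b[dst].append(src)
    return b


def sweep(rank, links):
    """One in-place pass of R(Pi) = sum over Pj in Bi of R(Pj) / Lj."""
    b = inbound(links)
    for page in rank:
        # Pages are visited in order, so later pages already see
        # the freshly updated ranks of earlier pages.
        rank[page] = sum(rank[src] / len(links[src]) for src in b[page])
    return rank


rank = {"P1": 1.0, "P2": 1.0, "P3": 1.0}
sweep(rank, links)
print(rank)  # {'P1': 1.5, 'P2': 1.5, 'P3': 0.75}
```

A second call to `sweep` leaves the values unchanged, so under these assumptions the ranks 1.5, 1.5 and 0.75 are already a fixed point of the recurrence.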
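Representation
▪ Let us define a hyperlink matrix H
$$H_{ij} = \begin{cases} 1/L_j & \text{if } P_j \in B_i \\ 0 & \text{otherwise} \end{cases} \qquad H = \begin{pmatrix} 0 & \frac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \end{pmatrix}$$
▪ and $R = [R(P_i)]$
$$R = HR$$
→ R is an eigenvector of H with eigenvalue 1
[Figure: example graph with pages P1, P2, P3]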
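To make the matrix form concrete, the following sketch builds H from a list of outgoing links and checks that a candidate vector satisfies R = HR. The link structure is again an assumption read off the example matrix (page 1 → 2, page 2 → 1 and 3, page 3 → 1), `hyperlink_matrix` and `matvec` are hypothetical helper names, and exact fractions are used so the equality test is not disturbed by rounding.

```python
from fractions import Fraction

# Assumed outgoing links of the three-page example (page -> linked pages).
links = {1: [2], 2: [1, 3], 3: [1]}


def hyperlink_matrix(links):
    """H[i][j] = 1/L_j if page j links to page i, otherwise 0."""
    pages = sorted(links)
    return [
        [Fraction(1, len(links[j])) if i in links[j] else Fraction(0)
         for j in pages]
        for i in pages
    ]


def matvec(m, v):
    """Multiply matrix m (list of rows) by column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


H = hyperlink_matrix(links)
R = [2, 2, 1]

# R is an eigenvector of H with eigenvalue 1 exactly when HR == R.
print(matvec(H, R) == R)  # True
```

Any positive multiple of R passes the same check, which is why the ranks are usually normalised, for instance so that they sum to 1.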
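Representation ...
▪ We can use the power method to find R
$$R_{t+1} = H R_t$$
▪ For our example
$$H = \begin{pmatrix} 0 & \frac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \end{pmatrix}$$
this results in
$$R = \begin{pmatrix} 2 \\ 2 \\ 1 \end{pmatrix} \quad \text{or} \quad R = \begin{pmatrix} 0.4 \\ 0.4 \\ 0.2 \end{pmatrix}$$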
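A minimal sketch of the power method for the example matrix could look as follows. The starting vector, the tolerance `tol`, the iteration cap `max_iter` and the function name `power_method` are illustrative assumptions, not part of the slides.

```python
def power_method(H, r, tol=1e-12, max_iter=1000):
    """Repeat R_{t+1} = H R_t until the vector stops changing."""
    for _ in range(max_iter):
        nxt = [sum(h * x for h, x in zip(row, r)) for row in H]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Example hyperlink matrix H (each column sums to 1).
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

# Start from a uniform vector whose entries sum to 1.
R = power_method(H, [1 / 3, 1 / 3, 1 / 3])
print([round(x, 6) for x in R])  # [0.4, 0.4, 0.2]
```

Because every column of H sums to 1, each multiplication preserves the sum of the entries, so starting from a vector that sums to 1 yields the normalised form of the result rather than (2, 2, 1).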
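Pages
▪ Problem with pages that have no outbound links (P2)
$$H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad R = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
$$C = \begin{pmatrix} 0 & \frac{1}{2} \\ 0 & \frac{1}{2} \end{pmatrix} \quad \text{and} \quad S = H + C = \begin{pmatrix} 0 & \frac{1}{2} \\ 1 & \frac{1}{2} \end{pmatrix}$$
[Figure: two-page example with pages P1 and P2]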
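The correction S = H + C can be sketched in code by replacing every all-zero column of H (a page without outbound links) with a uniform column 1/n, which matches the C shown for the two-page example. The function name `fix_dangling` is hypothetical, and treating the fix as a column replacement is an interpretation of the example rather than something the slide spells out.

```python
from fractions import Fraction


def fix_dangling(H):
    """Return S = H + C: all-zero columns of H become uniform 1/n columns."""
    n = len(H)
    S = [row[:] for row in H]  # copy so that H itself is left untouched
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):  # page j has no outbound links
            for i in range(n):
                S[i][j] = Fraction(1, n)
    return S


# Two-page example: P1 links to P2, and P2 has no outbound links.
H = [
    [Fraction(0), Fraction(0)],
    [Fraction(1), Fraction(0)],
]
S = fix_dangling(H)
print(S == [[0, Fraction(1, 2)], [1, Fraction(1, 2)]])  # True
```

After the fix every column of S sums to 1, so repeated multiplication no longer drains the rank vector towards zero.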
Connected Pages (Graph)
▪ Add new transition probabilities between all pages
  ▪ with probability d we follow the hyperlink structure S
  ▪ with probability 1-d we choose a random page
$$G = dS + (1 - d)\frac{1}{n}\mathbf{1}$$
where $\mathbf{1}$ is the $n \times n$ matrix of ones
$$R = GR$$
[Figure: graph of pages P1 to P5 with additional transitions labelled 1-d]
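The construction of G and the resulting ranks can be sketched as follows. The slides do not give a value for d; d = 0.85 is used here purely as an assumption. The two-page matrix S, the fixed iteration count and the function names are likewise only illustrative.

```python
def google_matrix(S, d):
    """G = d*S + (1 - d) * (1/n) * ones, computed entry by entry."""
    n = len(S)
    return [[d * S[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]


def power_method(G, iterations=100):
    """Iterate R <- G R, starting from the uniform vector."""
    n = len(G)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(g * x for g, x in zip(row, r)) for row in G]
    return r


# Two-page matrix S in which P2's empty column has already been
# replaced by a uniform column.
S = [
    [0.0, 0.5],
    [1.0, 0.5],
]

d = 0.85  # assumed value, not taken from the slides
G = google_matrix(S, d)
R = power_method(G)
print(R)
```

Since every entry of G is positive and every column sums to 1, the iteration settles on a single rank vector whose entries sum to 1, whatever positive starting vector with that sum is used.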
Implications for Website Development
▪ First make sure that your page gets indexed
  ▪ "on the page factors"
▪ Think about your site's internal link structure
  ▪ create many internal links for important pages
  ▪ be "careful" about where to put outgoing links
▪ Increase the number of pages
▪ Ensure that webpages are addressed consistently
  ▪ http://www.vub.ac.be http://www.vub.ac.be/index.php
▪ Make sure that you get links from good websites
Search Engine Optimisations (SEO)
▪ Internet marketing has become big business
  ▪ white hat and black hat optimisations
▪ Black hat optimisations
  ▪ link farms
  ▪ spamdexing in guestbooks etc.
    <a rel="nofollow" href="…">…</a>
  ▪ selling/buying links
  ▪ …
▪ Is PageRank fair?
▪ PageRank algorithm
  ▪ absolute quality of a page based on incoming links
  ▪ random surfer model
  ▪ computed as eigenvector of Google matrix G
▪ Implications for website development and SEO
  ▪ PageRank is just one (important) factor
▪ L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998
▪ S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998