usually focused on libraries
• Web search is a major application that everyone cares about
• Challenges
  ◦ Scalability (users as well as content)
  ◦ Ensure high-quality results (fighting SPAM)
  ◦ Dynamic nature (constantly changing content)
Web
• Three broad categories:
  ◦ Private sites
    • No incoming links, or may require logging in with a valid account
  ◦ Form results
    • Sites that can be reached only after entering some data into a form
  ◦ Scripted pages
    • Pages that use JavaScript, Flash, or another client-side language to generate links
as a separate document field
• Tends to be short, descriptive, and similar to query text
  ◦ Can be thought of as a description of the page “written by others”
• Has a significant impact on effectiveness for some types of queries
school, PhD, IR, DB, [...] PROMISE Winter School 2013, [...]
headings: PROMISE Winter School 2013 Bridging between Information Retrieval and Databases Bressanone, Italy 4-8 February 2013
body: The aim of the PROMISE Winter School 2013 on "Bridging between Information Retrieval and Databases" is to give participants a grounding in the core topics that constitute the multidisciplinary area of information access and retrieval to unstructured, semistructured, and structured information. The school is a week-long event consisting of guest lectures from invited speakers who are recognized experts in the field. The school is intended for PhD students, Masters students or senior researchers such as postdoctoral researchers from the fields of databases, information retrieval, and related fields. [...]
anchors: winter school information retrieval IR lectures
that are popular and useful to many people?
• Use the links between web pages as a way to measure popularity
• The most obvious measure is to count the number of inlinks
  ◦ Quite effective, but very susceptible to SPAM
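Inlink counting can be sketched in a few lines. This is a minimal illustration, not a production link analyzer: the function name `inlink_counts`, the `(source, target)` pair representation, and the toy graph are all assumptions made for the example. The second half shows why the measure is easy to manipulate: a handful of made-up spam pages is enough to inflate a target's count.

```python
from collections import Counter


def inlink_counts(links):
    """Count inlinks per page from (source, target) link pairs."""
    counts = Counter()
    for source, target in links:
        counts[source] += 0  # register the source so pages without inlinks show up as 0
        counts[target] += 1
    return dict(counts)


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
counts = inlink_counts(links)

# Five made-up spam pages that all link to b inflate its count.
spam_links = links + [(f"spam{i}", "b") for i in range(5)]
spam_counts = inlink_counts(spam_links)
```

In the toy graph, c has the most inlinks; after the spam pages are added, b overtakes it even though no reputable page changed its links.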
Proposed by Google founders Sergey Brin and Larry Page in 1998
• Main idea: A web page is important if it is pointed to by other important web pages
• PageRank is a numeric value that represents the importance of a web page
  ◦ When one page links to another page, it is effectively casting a vote for the other page
  ◦ More votes imply more importance
  ◦ Importance of each vote is taken into account when a page’s PageRank is calculated
the Web randomly as follows
• The user is currently at page a
  ◦ She moves to one of the pages linked from a with probability 1 − q
  ◦ She jumps to a random web page with probability q
• This is to ensure that the user doesn’t “get stuck” on any given page (i.e., on a page with no outlinks)
• Repeat the process for the page she moved to
• The PageRank score of a page is the average probability of the random surfer visiting that page
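The random surfer description corresponds to the commonly used PageRank formula, sketched here for reference. The symbols beyond q are assumptions of this sketch: T is the total number of pages, p_1, ..., p_n are the pages that link to a, and L(p_i) is the number of outlinks of p_i.

```latex
PR(a) = \frac{q}{T} + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{L(p_i)}
```

The first term is the probability of arriving at a by a random jump; the second is the probability of arriving by following a link, where each linking page splits its own score evenly across its outlinks.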
need to be computed iteratively
  ◦ We don’t know the PageRank values at the start, so we can assume equal values (1/T)
• Number of iterations?
  ◦ A good approximation is already reached after a small number of iterations; stop when the absolute change in values falls below a given threshold
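The iterative computation can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the graph is a hypothetical dictionary mapping each page to its outlinks, every page is assumed to have at least one outlink, every link target is assumed to be a key of the dictionary, and the stopping rule (sum of absolute changes below `tol`) is one possible reading of the threshold criterion. The names `pagerank`, `tol`, and `max_iter` and the default q = 0.15 are illustrative choices.

```python
def pagerank(graph, q=0.15, tol=1e-6, max_iter=100):
    """Iteratively compute PageRank for a dict {page: [outlinked pages]}.

    Assumes every page has at least one outlink (no dead ends).
    """
    pages = list(graph)
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}  # start from equal values 1/T

    for _ in range(max_iter):
        # Every page receives the random-jump share q/T ...
        new = {p: q / T for p in pages}
        # ... plus an equal share of (1 - q) * PR from each page linking to it.
        for p, outlinks in graph.items():
            share = (1 - q) * pr[p] / len(outlinks)
            for target in outlinks:
                new[target] += share

        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:  # stop once the scores barely change
            break
    return pr


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

On this toy graph the scores sum to 1, and c ends up ranked above a, which in turn is ranked above b: c collects votes from both other pages, while b only receives half of a's vote.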
(“dead ends”), i.e., pages that have no outlinks?
• Assume that such a page links to all pages in the collection (including itself) when computing PageRank scores
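The dead-end fix can be sketched as a preprocessing step on the link graph. The helper names `patch_dead_ends` and `step`, the dictionary representation, and the toy graph are assumptions made for illustration. The single update step shows what the fix buys: without it, the score held by a dead end leaks out of the system; with it, the scores still sum to 1.

```python
def patch_dead_ends(graph):
    """Return a copy of the graph where pages without outlinks link to every page."""
    pages = list(graph)
    return {
        p: (list(outlinks) if outlinks else list(pages))
        for p, outlinks in graph.items()
    }


def step(graph, pr, q=0.15):
    """One PageRank update; pages without outlinks pass on nothing."""
    T = len(graph)
    new = {p: q / T for p in graph}
    for p, outlinks in graph.items():
        for target in outlinks:
            new[target] += (1 - q) * pr[p] / len(outlinks)
    return new


# Hypothetical toy graph in which c is a dead end.
graph = {"a": ["b"], "b": ["c"], "c": []}
patched = patch_dead_ends(graph)

uniform = {p: 1.0 / len(graph) for p in graph}
leaky_total = sum(step(graph, uniform).values())      # score mass is lost at c
patched_total = sum(step(patched, uniform).values())  # score mass is preserved
```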
Web pages with high PageRank are preferred
• It is, however, not as important as conventional wisdom holds
  ◦ Just one of the many features a modern web search engine uses
  ◦ It tends to have the most impact on popular queries
document importance into the ranking?
• As a query-independent (“static”) score component
    score′(d, q) = score(d, q) × score(d)
• In case of Language Models, document importance is encoded as the document prior P(d)
    P(d|q) ∝ P(q|d) P(d)
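The effect of a document prior on a ranking can be sketched as follows. All numbers are made up for illustration, and the function name `rank_with_prior` is hypothetical. The prior could come from a normalized PageRank score, but that choice is an assumption of this sketch rather than something the formula requires.

```python
def rank_with_prior(query_likelihood, prior):
    """Rank documents by P(q|d) * P(d), which is proportional to P(d|q)."""
    scores = {d: query_likelihood[d] * prior[d] for d in query_likelihood}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up query likelihoods P(q|d) and document priors P(d).
query_likelihood = {"d1": 0.020, "d2": 0.015, "d3": 0.001}
prior = {"d1": 0.1, "d2": 0.6, "d3": 0.3}

without_prior = sorted(query_likelihood, key=query_likelihood.get, reverse=True)
with_prior = rank_with_prior(query_likelihood, prior)
```

Here d1 matches the query slightly better than d2, but d2's much higher prior moves it to the top once document importance is taken into account. In practice the product is usually computed as a sum of logarithms to avoid numerical underflow.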