Slide 1

2 December 2005
Web Technologies: Web Search and SEO
Prof. Beat Signer
Department of Computer Science, Vrije Universiteit Brussel
beatsigner.com

Slide 2

Beat Signer - Department of Computer Science - [email protected] - November 26, 2024
Search Engine Result Pages (SERP)

Slide 3

Search Engine Result Pages (SERP) …

Slide 4

Vertical Search Result Pages

Slide 5

Search Engine Result Page
▪ There is a variety of information shown on a search engine result page (SERP)
▪ organic search results
▪ non-organic search results
▪ meta-information about the result (e.g. number of result pages)
▪ vertical navigation
▪ advanced search options
▪ query refinement suggestions
▪ ...

Slide 6

Global Search Engine Market Share (2020)
[https://alphametic.com/global-search-engine-market-share]

Slide 7

Search Engine History
▪ Early "search engines" include various systems starting with Bush's Memex
▪ Archie (1990)
- first Internet search engine
- indexing of files on FTP servers
▪ W3Catalog (September 1993)
- first "web search engine"
- mirroring and integration of manually maintained catalogues
▪ JumpStation (December 1993)
- first web search engine combining crawling, indexing and searching

Slide 8

Search Engine History ...
▪ In the following two years (1994/1995) many new search engines appeared
▪ AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
▪ Two categories of early Web search solutions
▪ full-text search
- based on an index that is automatically created by a web crawler in combination with an indexer
- e.g. AltaVista or InfoSeek
▪ manually maintained classification (hierarchy) of webpages
- significant human editing effort
- e.g. Yahoo! (until 2014)

Slide 9

Information Retrieval
▪ Precision and recall can be used to measure the performance of different information retrieval algorithms

  precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
  recall = |relevant documents ∩ retrieved documents| / |relevant documents|

▪ example: the relevant documents are {D3, D5, D8, D9} and the query retrieves {D1, D3, D8, D9, D10}

  precision = 3/5 = 0.6
  recall = 3/4 = 0.75
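The computation above can be checked with a few lines of Python, using the slide's example sets of relevant and retrieved documents:

```python
# Precision and recall for the slide's example:
# relevant documents {D3, D5, D8, D9}, retrieved documents {D1, D3, D8, D9, D10}.
relevant = {"D3", "D5", "D8", "D9"}
retrieved = {"D1", "D3", "D8", "D9", "D10"}

hits = relevant & retrieved              # relevant documents that were retrieved
precision = len(hits) / len(retrieved)   # 3/5 = 0.6
recall = len(hits) / len(relevant)       # 3/4 = 0.75
```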

Slide 10

Information Retrieval ...
▪ Often a combination of precision and recall, the so-called F-score (harmonic mean), is used as a single measure

  F-score = (2 · precision · recall) / (precision + recall)

▪ example 1: precision = 0.57, recall = 1 → F-score = 0.73
▪ example 2: precision = 0.6, recall = 0.75 → F-score = 0.67
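A short sketch of the F-score computation for the two examples on the slide:

```python
# F-score: harmonic mean of precision and recall.
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

f1 = f_score(0.57, 1.0)   # first example on the slide, approx. 0.73
f2 = f_score(0.6, 0.75)   # second example on the slide, approx. 0.67
```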

Slide 11

Boolean Model
▪ Based on set theory and Boolean logic
▪ Exact matching of documents to a user query
▪ Uses the Boolean AND, OR and NOT operators
▪ inverted index with one bit vector per term (terms Bank, Delhaize, Ghent, Metro, Shopping and Train over documents D1 to D6); e.g. Shopping = 101110, Ghent = 100111, Delhaize = 111000
▪ query: Shopping AND Ghent AND NOT Delhaize
▪ computation: 101110 AND 100111 AND 000111 = 000110
▪ result: document set {D4, D5}
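The bitwise evaluation can be sketched in Python; only the three term vectors used by the query are taken from the slide, and the leftmost bit corresponds to document D1:

```python
# Boolean retrieval over bit vectors (one bit per document D1..D6).
shopping = 0b101110
ghent    = 0b100111
delhaize = 0b111000          # NOT delhaize = 0b000111

mask = 0b111111              # restrict NOT to the six existing documents
result = shopping & ghent & (~delhaize & mask)   # = 0b000110

# Decode the result bit vector into document identifiers.
docs = [f"D{i + 1}" for i in range(6) if result & (1 << (5 - i))]
```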

Slide 12

Boolean Model ...
▪ Advantages
▪ relatively easy to implement and scalable
▪ fast query processing based on parallel scanning of indexes
▪ Disadvantages
▪ no ranking of output
▪ often the user has to learn a special syntax, such as the use of double quotes to search for phrases
▪ Variants of the Boolean model form the basis of many search engines
▪ inverted index

Slide 13

Web Search Engines
▪ Most web search engines are based on traditional information retrieval techniques, but they must be adapted to deal with the characteristics of the Web
▪ immense amount of web resources (>150 billion webpages)
▪ hyperlinked resources
▪ dynamic content with frequent updates
▪ self-organised web resources
▪ Evaluation of performance
▪ no standard collections
▪ often based on user studies (satisfaction)
▪ Of course, not only precision and recall but also the query answer time is an important issue

Slide 14

Web Search Engine Architecture
▪ architecture diagram showing the main components: a Crawler fetching pages from the WWW, fed by a URL Pool; a Storage Manager writing to the Page Repository (checking whether content has already been added); Indexers building the Document Index, Special Indexes and the inverted index; a URL Handler (filter, normalisation and duplicate elimination) feeding the URL Repository; and a Query Handler with Ranking serving the Client

Slide 15

Web Crawler
▪ A web crawler or spider is used to create an index of webpages to be used by a web search engine
▪ any web search is then based on this index
▪ A web crawler has to deal with the following issues
▪ freshness
- the index should be updated regularly (based on webpage update frequency)
▪ quality
- since not all webpages can be indexed, the crawler should give priority to "high quality" pages
▪ scalability
- it should be possible to increase the crawl rate by just adding additional servers (modular architecture)
- e.g. the estimated number of Google servers in 2016 was 2.5 million (including not only the crawler but the entire Google platform)

Slide 16

Web Crawler ...
▪ distribution
- the crawler should be able to run in a distributed manner (computer centres all over the world)
▪ robustness
- the Web contains a lot of pages with errors and a crawler must deal with these problems
- e.g. deal with a web server that creates an unlimited number of "virtual web pages" (crawler trap)
▪ efficiency
- resources (e.g. network bandwidth) should be used in the most efficient way
▪ crawl rates
- the crawler should pay attention to existing web server policies (e.g. revisit-after HTML meta tag or robots.txt file)

  robots.txt:
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
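Python's standard library can parse such a robots.txt file; a minimal sketch using the slide's rules (the crawler name and site URL are made up for illustration):

```python
from urllib import robotparser

# Parse the robots.txt shown on the slide and check whether a crawler
# may fetch given paths.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
])

allowed_home = rp.can_fetch("MyCrawler", "https://example.com/index.html")
blocked_tmp = rp.can_fetch("MyCrawler", "https://example.com/tmp/x")
```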

Slide 17

Pre-1998 Web Search
▪ Find all documents for a given query term
▪ use information retrieval (IR) solutions
- Boolean model
- vector space model
- ...
▪ ranking based on "on-page factors"
→ problem: poor quality of search results (order)
▪ Larry Page and Sergey Brin proposed to compute the absolute quality of a page called PageRank
▪ based on the number and quality of pages linking to a page (votes)
▪ query-independent

Slide 18

Origins of PageRank
▪ Developed as part of an academic project at Stanford University
▪ research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
▪ Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998), which finally led to a functional prototype called Google

Slide 19

PageRank
▪ A page Pi has a high PageRank Ri if
▪ there are many pages linking to it
▪ or, if there are some pages with a high PageRank linking to it
▪ Total score = IR score × PageRank
▪ (example graph with pages P1 to P8 and their ranks R1 to R8)

Slide 20

Basic PageRank Algorithm

  R(Pi) = Σ_{Pj ∈ Bi} R(Pj) / Lj

▪ where
▪ Bi is the set of pages that link to page Pi
▪ Lj is the number of outgoing links of page Pj
▪ example on the slide: three pages P1, P2 and P3 start with rank 1 each and iterate to the ranks R(P1) = 1.5, R(P2) = 1.5 and R(P3) = 0.75
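A minimal sketch of this iterative update in Python, assuming the three-page link structure of the matrix-representation example (P1 → P2; P2 → P1 and P3; P3 → P1); the slide's own example graph is not fully recoverable from the transcript:

```python
# Iterative PageRank update R(Pi) = sum over Pj in Bi of R(Pj) / Lj,
# for an assumed three-page graph: P1 -> P2, P2 -> P1 and P3, P3 -> P1.
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"]}
rank = {p: 1.0 for p in links}            # start with R = 1 for every page

for _ in range(100):                      # iterate until (approximate) convergence
    new_rank = {p: 0.0 for p in links}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)   # each outgoing link passes R(Pj)/Lj
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank
# rank converges to P1 = 1.2, P2 = 1.2, P3 = 0.6 (proportional to (2, 2, 1))
```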

Slide 21

Matrix Representation
▪ Let us define a hyperlink matrix H with

  Hij = 1/Lj if Pj ∈ Bi, 0 otherwise

▪ for the example with pages P1, P2 and P3:

  H = ( 0  1/2  1 )
      ( 1   0   0 )
      ( 0  1/2  0 )

▪ with R = (R(Pi)) we get R = HR
→ R is an eigenvector of H with eigenvalue 1

Slide 22

Matrix Representation ...
▪ We can use the power method to find R

  R_{t+1} = H R_t

▪ sparse matrix H with 150 billion columns and rows, but only an average of 10 non-zero entries in each column
▪ for our example this results in R = (2, 2, 1) or, normalised, R = (0.4, 0.4, 0.2)
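The power method can be sketched directly with the 3 × 3 example matrix H (assumed link structure: P1 → P2; P2 → P1 and P3; P3 → P1):

```python
# Power method R_{t+1} = H * R_t for the three-page example.
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

r = [1 / 3, 1 / 3, 1 / 3]        # start from the uniform distribution
for _ in range(100):
    r = matvec(H, r)
# r converges to (0.4, 0.4, 0.2), i.e. R = (2, 2, 1) up to scaling
```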

Slide 23

Dangling Pages (Rank Sink)
▪ Problem with pages that have no outgoing links (e.g. P2)

  H = ( 0  0 )  leads to R = (0, 0)
      ( 1  0 )

▪ Stochastic adjustment
▪ if page Pj has no outgoing links, replace column j with entries 1/n (here 1/2)

  C = ( 0  1/2 )  and  S = H + C = ( 0  1/2 )
      ( 0  1/2 )                   ( 1  1/2 )

▪ New stochastic matrix S always has a stationary vector R
▪ can also be interpreted as a Markov chain

Slide 24

Strongly Connected Pages (Graph)
▪ Add new transition probabilities between all pages
▪ with probability d we follow the hyperlink structure S
▪ with probability 1-d we choose a random page
▪ matrix G becomes irreducible

  G = d S + (1-d) (1/n) 1   with   R = GR

  (where 1 denotes the n × n matrix of ones)

▪ Google matrix G reflects a random surfer
▪ no modelling of the back button
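A sketch of building G = dS + (1-d)(1/n)·1 and iterating R = GR; the 3 × 3 matrix used as S and the damping factor d = 0.85 are assumptions for illustration, not values from the slides:

```python
# Google matrix G = d*S + (1-d)/n for a column-stochastic matrix S.
d = 0.85                      # assumed damping factor

S = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]
n = len(S)
G = [[d * S[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

# Power iteration on G yields the damped PageRank vector.
r = [1 / n] * n
for _ in range(100):
    r = [sum(G[i][j] * r[j] for j in range(n)) for i in range(n)]
# the ranks still sum to 1, and P1 keeps the highest rank
```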

Slide 25

Examples
▪ example graph with G = d S + (1-d)(1/n) 1 resulting in the PageRanks A1 = 0.26, A2 = 0.37 and A3 = 0.37

Slide 26

Examples ...
▪ A1 = 0.13, A2 = 0.185, A3 = 0.185, B1 = 0.13, B2 = 0.185, B3 = 0.185
▪ P(A) = 0.5 and P(B) = 0.5

Slide 27

Examples
▪ PageRank leakage
▪ A1 = 0.10, A2 = 0.14, A3 = 0.14, B1 = 0.22, B2 = 0.20, B3 = 0.20
▪ P(A) = 0.38 and P(B) = 0.62

Slide 28

Examples ...
▪ A1 = 0.3, A2 = 0.23, A3 = 0.18, B1 = 0.10, B2 = 0.095, B3 = 0.095
▪ P(A) = 0.71 and P(B) = 0.29

Slide 29

Examples
▪ PageRank feedback
▪ A1 = 0.35, A2 = 0.24, A3 = 0.18, B1 = 0.09, B2 = 0.07, B3 = 0.07
▪ P(A) = 0.77 and P(B) = 0.23

Slide 30

Examples ...
▪ A1 = 0.33, A2 = 0.17, A3 = 0.175, A4 = 0.125, B1 = 0.08, B2 = 0.06, B3 = 0.06
▪ P(A) = 0.80 and P(B) = 0.20

Slide 31

Google Search Central
▪ Various services and information about a website
▪ Site configuration
▪ submission of sitemap
▪ crawler access
▪ URLs of indexed pages
▪ Performance
▪ search queries
▪ countries
▪ devices
▪ ...

Slide 32

Google Search Central ...
▪ Enhancements
▪ core web vitals (speed)
- mobile as well as desktop
▪ mobile usability
▪ Security issues
▪ Similar tools offered by other search engines
▪ e.g. Bing Webmaster Tools

Slide 33

XML Sitemaps
▪ List of URLs that should be crawled and indexed (markup restored following the sitemaps.org protocol):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://beatsigner.com/</loc>
      <lastmod>2024-11-24</lastmod>
      <changefreq>weekly</changefreq>
      <priority>1.0</priority>
    </url>
    <url>
      <loc>https://beatsigner.com/publications.html</loc>
      <lastmod>2024-11-24</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.9</priority>
    </url>
    ...
  </urlset>

Slide 34

XML Sitemaps ...
▪ All major search engines support the sitemap format
▪ The URLs of a sitemap are not guaranteed to be added to a search engine's index
▪ helps a search engine to find pages that are not yet indexed
▪ Additional metadata might be provided to search engines
▪ relative page relevance (priority)
▪ date of last modification (lastmod)
▪ update frequency (changefreq)
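Such a sitemap can also be generated programmatically; a sketch using Python's xml.etree with the example URLs and metadata from the sitemap slide:

```python
import xml.etree.ElementTree as ET

# Build a sitemap following the sitemaps.org protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)

entries = [
    ("https://beatsigner.com/", "2024-11-24", "weekly", "1.0"),
    ("https://beatsigner.com/publications.html", "2024-11-24", "weekly", "0.9"),
]
for loc, lastmod, changefreq, priority in entries:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = priority

sitemap_xml = ET.tostring(urlset, encoding="unicode")
```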

Slide 35

Questions
▪ Is PageRank fair?
▪ What about Google's power and influence?
▪ What about Web 2.0 or Web 3.0 and web search?
▪ "non-existent" webpages such as those offered by Rich Internet Applications (e.g. using AJAX) may bring problems for traditional search engines (hidden web)
▪ new forms of social search
- social bookmarking
- ...
▪ social marketing

Slide 36

The Google Effect
▪ A study by Sparrow et al. shows that people are less likely to remember things that they believe to be accessible online
▪ Internet as a transactive memory
▪ Does our memory work differently in the age of Google?
▪ What implications will the future of the Internet and new search have?

Slide 37

Search Engine Marketing (SEM)
▪ For many companies Internet marketing has become a big business
▪ Search engine marketing (SEM) aims to increase the visibility of a website
▪ search engine optimisation (SEO)
▪ paid search advertising (non-organic search)
▪ social media marketing
▪ SEO should not be decoupled from a website's content, structure, design and used technologies
▪ SEO has to be seen as a continuous process in a rapidly changing environment
▪ different search engines with regular changes in ranking

Slide 38

Structural Choices
▪ Keep the website structure as flat as possible
▪ minimise link depth
▪ avoid pages with much more than 100 links
▪ Think about your website's internal link structure
▪ which pages are directly linked from the homepage?
▪ create many internal links for important pages
▪ be "careful" about where to put outgoing links
- PageRank leakage
▪ use keyword-rich anchor texts
▪ dynamically create links between related content
- e.g. "customers who bought this also bought ..." or "visitors who viewed this also viewed ..."
▪ Increase the number of pages

Slide 39

Technological Choices
▪ Use an SEO-friendly content management system (CMS)
▪ Dynamic URLs vs. static URLs
▪ avoid session IDs and parameters in the URL
▪ use URL rewriting to get descriptive URLs containing keywords
▪ Think carefully about the use of dynamic content
▪ Rich Internet Applications (RIAs) based on AJAX etc.
▪ content hidden behind pull-down menus etc.
▪ Address webpages consistently
▪ e.g. https://www.vub.ac.be vs. https://www.vub.ac.be/index.php

Slide 40

Search Engine Optimisations
▪ Different things can be optimised
▪ on-page factors
▪ off-page factors
▪ It is assumed that some search engines use more than 200 on-page and off-page factors for their ranking
▪ Difference between optimisation and breaking the "search engine rules"
▪ white hat and black hat optimisations
▪ A bad ranking or removal from the index can cost a company a lot of money or even mark the end of the company
▪ e.g. supplemental index ("Google hell")

Slide 41

Positive On-Page Factors
▪ Use of keywords at relevant places
▪ in title tag (preferably one of the first words)
▪ in URL and domain name
▪ in header tags (e.g. h1) and multiple times in body text
▪ Mobile usability
▪ mobile-first indexing by Google since 2016
▪ Fast page load times
▪ mobile as well as desktop
▪ Provide metadata
▪ e.g. also used by search engines to create the text snippets on the SERPs

Slide 42

Positive On-Page Factors ...
▪ Quality of HTML code
▪ Security and accessibility
▪ Uniqueness of content across the website
▪ ...

Slide 43

Negative On-Page Factors
▪ Links to "bad neighbourhood"
▪ Link selling
▪ in 2007 Google announced a campaign against paid links that transfer PageRank
▪ Over-optimisation penalty (keyword stuffing)
▪ Text with the same colour as the background (hidden content)
▪ Automatic redirect via the refresh meta tag
▪ Cloaking
▪ different pages for spider and user
▪ Malware being hosted on the page

Slide 44

Negative On-Page Factors ...
▪ Duplicate or similar content
▪ Duplicate page titles or meta tags
▪ Slow page load time
▪ Any copyright violations
▪ ...

Slide 45

Positive Off-Page Factors
▪ Links from pages with a high PageRank
▪ Keywords in anchor text of inbound links
▪ Links from topically relevant sites
▪ High clickthrough rate (CTR) from search engine for a given keyword
▪ High number of shares on social media (social signals)
▪ e.g. Facebook or Twitter
▪ Site age (stability)
▪ Domain expiration date
▪ ...

Slide 46

Negative Off-Page Factors
▪ Site often not accessible to crawlers
▪ e.g. server problem
▪ High bounce rate
▪ users immediately press the back button
▪ Link buying
▪ rapidly increasing number of inbound links
▪ Use of link farms
▪ Participation in link sharing programmes
▪ Links from bad neighbourhood?
▪ Competitor attack (e.g. via duplicate content)?

Slide 47

Black Hat Optimisations (Don'ts)
▪ Link farms
▪ Spamdexing in guestbooks, Wikipedia etc.
▪ "solution": ...
▪ Keyword stuffing
▪ overuse of keywords
- content keyword stuffing
- image keyword stuffing
- keywords in meta tags
- invisible text with keywords
▪ Selling/buying links
▪ "big" business until 2007
▪ costs based on the PageRank of the linking site

Slide 48

Black Hat Optimisations (Don'ts) ...
▪ Doorway pages (cloaking)
▪ doorway pages are normally just designed for search engines
- user is automatically redirected to the target page
▪ e.g. BMW Germany and Ricoh Germany banned in February 2006

Slide 49

Nofollow Link Example
▪ nofollow value for hyperlinks introduced by Google in 2005 to avoid spamdexing
▪ <a rel="nofollow" href="...">...</a>
▪ Links with a nofollow value were not counted in the PageRank computation
▪ division by the number of outgoing links
▪ e.g. a page with 9 outgoing links, 3 of them nofollow links
- PageRank divided by 6 and distributed across the 6 "really linked pages"
▪ SEO experts started to use (misuse) nofollow links for PageRank sculpting
▪ control the flow of PageRank within a website

Slide 50

Nofollow Link Example ...
▪ In June 2009 Google decided to treat nofollow links differently to avoid PageRank sculpting
▪ division by the total number of outgoing links
▪ e.g. a page with 9 outgoing links, 3 of them nofollow links
- PageRank divided by 9 and distributed across the 6 "really linked pages"
▪ no longer a good solution to prevent spamdexing, since we lose (diffuse) some PageRank
▪ SEO experts started to use alternative techniques to replace nofollow links
▪ e.g. obfuscated JavaScript links
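The before/after division rule is simple arithmetic; a sketch for the slides' example of 9 outgoing links, 3 of which are nofollow links:

```python
# PageRank share passed per followed link, before and after Google's
# June 2009 change (page PageRank normalised to 1.0 for illustration).
pagerank = 1.0
outgoing = 9
nofollow = 3
followed = outgoing - nofollow   # 6 "really linked pages"

before_2009 = pagerank / followed   # divide by the 6 followed links only
after_2009 = pagerank / outgoing    # divide by all 9 links; the nofollow
                                    # share is lost (diffused)
```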

Slide 51

Non-Organic Search
▪ In addition to the so-called organic search, websites can also participate in non-organic web search
▪ cost per impression (CPI)
▪ cost per click (CPC)
▪ The non-organic web search should not be treated independently from the organic web search
▪ the quality of the landing page can have an impact on the non-organic web search performance!
▪ The Google Ads programme is an example of a commercial non-organic web search service
▪ other services include Yahoo! Advertising Solutions, Facebook Ads, ...

Slide 52

Google Ads and Google AdSense
▪ Pay-per-click (PPC) or cost-per-thousand (CPM)
▪ Campaigns and ad groups
▪ Two types of advertising
▪ search
▪ content network
- Google AdSense
▪ Highly customisable ads
▪ region
▪ language
▪ daytime
▪ ...

Slide 53

Google Ads ...
▪ Excellent control and monitoring for Ads users
▪ cost per conversion
▪ Google advertising revenues
▪ 2023: USD 237.86 billion (total revenues USD 305.6 billion)

Slide 54

Conclusions
▪ Web information retrieval techniques have to deal with the specific characteristics of the Web
▪ PageRank algorithm
▪ absolute quality of a page based on incoming links
▪ based on the random surfer model
▪ computed as an eigenvector of the Google matrix G
▪ PageRank is just one factor
▪ Various implications for website development and SEO

Slide 55

Exercise 7
▪ XML and Related Technologies

Slide 56

References
▪ L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998
▪ S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998
- http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf
▪ A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006

Slide 57

References …
▪ B. Sparrow, J. Liu and D.M. Wegner, Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips, Science, July 2011
- https://doi.org/10.1126/science.1207745
▪ Google Search Central
- https://developers.google.com/search
▪ The W3C Markup Validation Service
- https://validator.w3.org
▪ SEO Book
- https://www.seobook.com

Slide 58

Next Lecture: Security, Privacy and Trust