system that finds web pages that match a web search. It is to web pages roughly what a library is to books. Google, Amazon, Facebook, TikTok, etc. are all search engines in some form.
computers to crawl the web, using a program called Googlebot (also known as a spider, search engine bot, or crawler)
• Google crawls the web and makes a copy of it. This copy is called an index.
• Algorithms then “rank” the documents in the index and serve the results on the SERP (search engine results page). Documents ranked higher are judged to be of higher quality.
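The crawl-and-copy step can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `TOY_WEB` is a hypothetical in-memory stand-in for the web, and `crawl` is a simple breadth-first traversal, not how Googlebot actually works.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "https://example.com/": ("home page", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", ["https://example.com/b"]),
    "https://example.com/b": ("page b", ["https://example.com/"]),
    "https://example.com/orphan": ("never linked", []),
}


def crawl(seed, web):
    """Breadth-first crawl from `seed`; returns {url: copied page text}."""
    stored = {}            # the "copy of the web" that later gets indexed
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        if url not in web:  # unreachable or missing page: skip it
            continue
        text, links = web[url]
        stored[url] = text
        for link in links:
            if link not in seen:  # follow each discovered link only once
                seen.add(link)
                queue.append(link)
    return stored


copies = crawl("https://example.com/", TOY_WEB)
```

Note that the orphan page is never copied: a crawler that only follows links cannot discover a page nothing links to.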
that deals with retrieving information from a set of documents
• IR comes with a large cost, called the “Cost of Retrieval”
• Google can easily crawl text, images, and videos. However, it struggles with JavaScript-based sites
• There are many types of crawlers, split by content type, device, and location
◦ Adbots, image bots, page resource load bots, mobile bots, desktop bots
documents
• Text transformation turns documents into index terms or features. Index terms are the parts of a document that are stored in the index and used in searching.
• The simplest index term is a word
• Phrases, names of people, dates, links, etc. can also be index terms
• This leads to index creation, which enables fast searching
in this process. In many cases, tokens are the same as words
2. Stopping: Removal of stop words like “is”, “to”, “of”, etc.
3. Stemming: Grouping words that are derived from a common stem. E.g.: sofa, sofas, sopha
4. Link extraction and analysis: Links and anchor text are stored separately and analysed
5. Information extraction: Captures the semantics and nuances of the tokens (Named Entity Recognizers)
6. Classifier: Classifies documents into categories like Furniture, Politics - Knowledge Graph
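The first three steps (tokenizing, stopping, stemming) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's; neither reflects what any production search engine uses.

```python
import re

# Assumed, illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"is", "to", "of", "the", "a", "an", "and", "in"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Stopping: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_index_terms(text):
    """Run tokenizing, stopping, and stemming in sequence."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


terms = to_index_terms("The sofas of the house")
```

Here "sofas" and "sofa" end up as the same index term, which is the point of stemming. A spelling variant like "sopha" would not be merged by suffix stripping alone and would need a separate normalization step.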
search engines.
• An inverted index contains, for every index term, a list of the documents that contain that term.
• It is inverted in the sense that it maps index terms to documents, rather than documents to their terms
• Each entry is given a weight that reflects how important the term is in that document (for example, tf.idf)
• The index is then distributed across multiple computers and network locations for faster retrieval
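A small sketch of an inverted index with tf.idf weights follows. The function names, the toy documents, and the particular tf.idf variant (raw term count times the natural log of N divided by document frequency) are assumptions chosen for illustration; many other weighting variants exist.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: [index terms]} -> {term: {doc_id: tf.idf weight}}."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])  # rarer terms weigh more
            index[term][doc_id] = tf * idf
    return dict(index)


def search(index, query_terms):
    """Sum the weights of matching terms; return doc ids, best score first."""
    scores = Counter()
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return [doc_id for doc_id, _ in scores.most_common()]


docs = {
    "d1": ["sofa", "leather", "sofa"],
    "d2": ["sofa", "table"],
    "d3": ["politics", "news"],
}
index = build_inverted_index(docs)
results = search(index, ["sofa"])
```

A query only touches the lists for its own terms, which is why lookup stays fast even when the collection is large.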
100s of factors - called Scoring
• The output, a ranked list of documents, is then displayed on the SERP
• Scoring must be very fast so that retrieval stays quick - Performance Optimization
• Ranking can also be distributed like the index, since location and other factors change the ranks
• Evaluation uses user data to measure and improve ranking performance
(HITS algorithm)
3. Page quality (E-E-A-T)
4. Reliability
5. Localization
6. Navboost
Core Algorithms: Google uses core algorithms to reduce the number of matches for a query down to “several hundred” documents. Those core algorithms give the documents initial rankings or scores.
Navboost: It memorizes all the clicks on queries from the past 13 months. Navboost is a ranking signal that can only exist after users have clicked on a document/page. It also measures user interactions with other SERP features like PAA, images, map listings, etc. (Tangram, or Tetris, is the system that pulls all SERP features together on the SERP.)
Deep learning systems: RankBrain, DeepRank (BERT), RankEmbed, MUM
Page and Sergey Brin, assigns a numerical weight to each element of a hyperlinked set of documents
2. PageRank emphasizes the importance of quality links: a link from a high-ranking page contributes more to the linked page's own ranking
3. Keyword independent
Hyperlink-Induced Topic Search (HITS)
1. HITS, also known as Hubs and Authorities, was developed by Jon Kleinberg. Instead of assigning a single global importance to web pages, HITS identifies two types of pages: hubs and authorities.
2. Hubs and authorities are mutually reinforcing: a page is a good hub if it links to many good authorities, and it is a good authority if it is linked to by many good hubs
3. Keyword dependent
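Both algorithms can be sketched on a tiny link graph. This is a textbook-style illustration, not Google's implementation: the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the three-page graph are all assumptions. In practice HITS is run on a query-specific subgraph, which is what makes it keyword dependent; here it simply runs on the whole toy graph.

```python
import math


def all_pages(links):
    """Every page that appears as a source or a target."""
    return set(links) | {t for targets in links.values() for t in targets}


def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]} -> {page: score}; scores sum to 1."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Returns (hub, authority) score dicts, each normalised to unit length."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
hub_scores, auth_scores = hits(graph)
```

On this graph, page c gets the highest PageRank and the highest authority score because two pages link to it, while a and b come out as the best hubs because they point at it.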
• Google has historical data on user behaviour that can be modelled to accommodate even new users
• Google uses previous user interactions to rank pages on the SERP
• This is where personalized results come in
Takeaway: Create content or pages that resonate with users. Make your content superior to the rest of the results on the SERP. In short, earn “Long Clicks” from users.
by human raters
• IS rating is done when there are changes to the ranking system, or to evaluate the SERP
• Google does IS rating on mobile, since mobile indexing is the norm
• The Search Quality Rater Guidelines are part of this
someone saw a link to your site on Google. Depending on the result type, the link might need to be scrolled or expanded into view.
2. Clicks: How often someone clicked a link from Google to your site.
3. (Average) Position: A relative ranking of the position of your link on Google, where 1 is the topmost position, 2 is the next position, and so on. Shown only for Google Search results.
Nuances to understand: Aggregated Impressions by Property & Page