Slide 1

Slide 1 text

How Does Google Search Work? Crawling, Indexing & Ranking 101 Sachin Shaji Kalloothara

Slide 2

Slide 2 text

What is a Search Engine? A search engine is a software system that finds web pages matching a search query. It is to web pages roughly what a library is to books. Google, Amazon, Facebook, TikTok, etc. are all search engines in some form.

Slide 3

Slide 3 text

Crawling, Indexing & Ranking
● Google uses huge clusters of computers to crawl the web, using a program called Googlebot (also known as a spider, search engine bot, or crawler)
● Google crawls the web and makes a copy of it. This is called an index.
● Algorithms then “rank” the documents in the index and serve the results on the SERP (search engine results page). Documents judged to be of higher quality rank higher.

Slide 4

Slide 4 text

Crawling
● Comes under Information Retrieval (IR), a science that deals with retrieving information from a set of documents
● IR comes with a huge cost, called the “Cost of Retrieval”
● Google can easily crawl images, text & videos. However, it struggles with JavaScript-based sites
● There are many types of crawlers, based on content type, device, and location
○ Adbots, image bots, page resource load bots, mobile bots, desktop bots
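The crawl loop described above can be sketched in a few lines. This is only an illustration, not how Googlebot is implemented: `TOY_WEB`, `LinkExtractor` and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical toy "web": URL -> HTML. A real crawler would fetch pages over HTTP.
TOY_WEB = {
    "https://example.com/": '<a href="https://example.com/a">A</a> '
                            '<a href="https://example.com/b">B</a>',
    "https://example.com/a": '<a href="https://example.com/b">B</a>',
    "https://example.com/b": '<a href="https://example.com/">Home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, web):
    """Breadth-first crawl from a seed URL; returns pages in visit order."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        html = web.get(url)
        if html is None:  # page not available in the toy web
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return order


crawl_order = crawl("https://example.com/", TOY_WEB)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.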

Slide 5

Slide 5 text

Indexing Process
● Text Acquisition is crawling the web for documents
● Text Transformation transforms documents into index terms or features. Index terms are the parts of a document that are stored in the index and used in searching.
● The simplest index term is a word
● Phrases, names of people, dates, links, etc. can also be index terms
● This leads to Index Creation, which enables fast searching

Slide 6

Slide 6 text

Parsing
1. Tokenizing the text is an important first step in this process. In many cases, tokens are the same as words
2. Stopping: Removal of stop words like “is”, “to”, “of”, etc.
3. Stemming groups words that are derived from a common stem, e.g. sofa, sofas
4. Link extraction and analysis: Links & anchor text are stored separately & analysed
5. Information extraction: Captures the semantics & nuances of the tokens (Named Entity Recognizers)
6. Classifier: Classifies documents into categories like Furniture, Politics - Knowledge Graph
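The first three parsing steps can be sketched as below. The stop-word list and the plural-stripping stemmer are deliberately naive assumptions made for illustration; production systems use much larger stop lists and proper stemmers such as Porter's. All function names here are hypothetical.

```python
import re

# Assumed, tiny stop-word list for illustration only.
STOP_WORDS = {"is", "to", "of", "the", "a", "and", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Naive stemmer: strips a plural 's' only (not the Porter algorithm)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def parse(text):
    """Tokenize, stop, then stem."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


terms = parse("The sofas of the shop and the sofa")
```

Here "sofas" and "sofa" end up as the same index term, which is the point of stemming.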

Slide 7

Slide 7 text

Inverted Indexes
● Most common form of index used by search engines
● An inverted index contains, for every index term, a list of the documents that contain that term
● It is inverted in the sense that it maps index terms to documents, rather than documents to their terms
● Entries are “weighted” (for example with tf.idf) so they can be used for scoring
● The index is then distributed across multiple machines and networks for easier retrieval
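A minimal sketch of an inverted index with tf.idf weights follows. The documents in `DOCS` are made up, and the weighting used (raw term frequency times `log(N / df)`) is just one common textbook variant, not a statement of what Google uses.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, already-parsed documents: doc_id -> list of index terms.
DOCS = {
    "d1": ["sofa", "shop", "sofa"],
    "d2": ["sofa", "politics"],
    "d3": ["politics", "news"],
}


def build_inverted_index(docs):
    """Return term -> {doc_id: tf-idf weight}."""
    n_docs = len(docs)
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf

    index = {}
    for term, doc_tfs in postings.items():
        # Terms that appear in fewer documents get a higher idf.
        idf = math.log(n_docs / len(doc_tfs))
        index[term] = {doc_id: tf * idf for doc_id, tf in doc_tfs.items()}
    return index


index = build_inverted_index(DOCS)
```

Looking up `index["sofa"]` returns only the documents containing "sofa", so a query never has to scan the documents that cannot match.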

Slide 8

Slide 8 text

Querying & Ranking
● Algorithms rank the documents based on hundreds of factors. This is called Scoring
● The output, a ranked list of documents, is then displayed on the SERP
● Scoring has to be extremely fast for retrieval - Performance Optimization
● Ranking can also be distributed like the index, since location & other factors change the ranks
● Evaluation takes in user data to measure and improve ranking performance
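Scoring against an inverted index can be sketched as follows. Real ranking combines hundreds of signals; this example assumes a single signal (a precomputed term weight per document), and `INDEX` and `rank` are hypothetical names with made-up numbers.

```python
from collections import defaultdict

# Hypothetical precomputed index: term -> {doc_id: weight}.
INDEX = {
    "sofa": {"d1": 0.8, "d2": 0.4},
    "cheap": {"d2": 0.5, "d3": 0.7},
}


def rank(query_terms, index, top_k=10):
    """Sum the weights of matching terms per document, highest score first."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Sort by descending score; break ties by doc_id for a stable order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


results = rank(["cheap", "sofa"], INDEX)
```

Only documents that appear in at least one posting list are ever scored, which is a large part of why retrieval stays fast.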

Slide 9

Slide 9 text

Important Ranking Factors
Important Signals
1. The document
2. Topicality (HITS algorithm)
3. Page quality (E-E-A-T)
4. Reliability
5. Localization
6. Navboost
Core Algorithms: Google uses core algorithms to reduce the number of matches for a query down to “several hundred” documents. Those core algorithms give the documents initial rankings or scores.
Navboost: It memorizes all the clicks on queries from the past 13 months. Navboost is a ranking signal that can only exist after users have clicked on a document/page. It also measures user interactions on other SERP features like PAA (People Also Ask), images, map listings, etc. (Tangram or Tetris is the system that pulls all SERP features into the SERP)
Deep learning systems: RankBrain, DeepRank (BERT), RankEmbed, MUM

Slide 10

Slide 10 text

User Signals - Modeled for Click Prediction
● CTR (click-through rate) is critical when it comes to ranking
● Run CTR experiments to improve CTR
● Brands with good awareness have an advantage here

Slide 11

Slide 11 text

PageRank vs HITS
PageRank
1. PageRank, developed by Google founders Larry Page and Sergey Brin, assigns a numerical weight to each element of a hyperlinked set of documents
2. PageRank emphasizes the importance of quality links, and a link from a high-ranking page contributes more to the linked page's own ranking
3. Keyword independent
HITS (Hyperlink-Induced Topic Search)
1. HITS, also known as Hubs and Authorities, was developed by Jon Kleinberg. Instead of assigning a global importance to web pages, HITS identifies two types of pages: hubs and authorities.
2. Pages are mutually reinforcing - a page is a good hub if it links to many authorities, and it is a good authority if it is linked to by many hubs
3. Keyword dependent
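Both algorithms can be sketched as simple iterative updates over a link graph. The four-page `GRAPH`, the damping factor of 0.85 and the fixed iteration count are assumptions for illustration. For HITS, the graph is assumed to be the small set of pages already retrieved for one query, which is what makes it keyword dependent.

```python
# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively pass each page's rank along its outgoing links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if not out_links:
                # A page with no links spreads its rank evenly over all pages.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
            else:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
        rank = new_rank
    return rank


def normalize(scores):
    """Scale scores to unit Euclidean length so they do not grow unbounded."""
    norm = sum(value * value for value in scores.values()) ** 0.5 or 1.0
    return {key: value / norm for key, value in scores.items()}


def hits(graph, iterations=50):
    """Return (hub scores, authority scores)."""
    nodes = list(graph)
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auths = {node: 0.0 for node in nodes}
        for node, out_links in graph.items():
            for target in out_links:
                auths[target] += hubs[node]
        auths = normalize(auths)
        # Hub: sum of the authority scores of the pages linked to.
        hubs = normalize(
            {node: sum(auths[t] for t in graph[node]) for node in nodes}
        )
    return hubs, auths


ranks = pagerank(GRAPH)
hub_scores, authority_scores = hits(GRAPH)
```

In this toy graph, page C is linked to by three of the four pages, so it comes out with both the highest PageRank and the highest authority score.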

Slide 12

Slide 12 text

Priors Algorithm
● Deals with the choices users make on the SERP
● Google has historic data on user behaviour that can be modelled to accommodate even new users
● Google uses previous user interactions to rank pages on the SERP
● This is where personalized results come in
Takeaway: Create content or pages that resonate with users. Make your content superior to the rest of the results on the SERP - basically, get “Long Clicks” from users

Slide 13

Slide 13 text

Information Satisfaction - IS
● A set of queries is rated by human raters
● IS rating is done when there are changes to the ranking system, or to evaluate the SERP
● Google uses mobile to do IS - mobile-first indexing is the norm
● The Search Quality Raters Guidelines are part of this

Slide 14

Slide 14 text

Does Google Understand All Types of Content? With newer models like RankBrain, Google can now understand content in much more of its nuance. But Google still predominantly uses user signals to rank or modify ranks

Slide 15

Slide 15 text

What are Impressions, Clicks & Position?
1. Impressions: How often someone saw a link to your site on Google. Depending on the result type, the link might need to be scrolled or expanded into view.
2. Clicks: How often someone clicked a link from Google to your site.
3. (Average) Position: A relative ranking of the position of your link on Google, where 1 is the topmost position, 2 is the next position, and so on. Shown only for Google Search results.
Nuances to understand: Impressions are aggregated differently by Property & by Page
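How these three metrics combine can be sketched with a few rows of made-up data. The row layout and the `summarize` helper are hypothetical, and the impression-weighted average position is a simplified assumption rather than the exact Search Console calculation.

```python
# Hypothetical per-query rows, similar in spirit to a performance export.
ROWS = [
    {"query": "sofa", "impressions": 100, "clicks": 10, "position": 2.0},
    {"query": "cheap sofa", "impressions": 300, "clicks": 6, "position": 6.0},
]


def summarize(rows):
    """Total impressions and clicks, overall CTR, and average position."""
    impressions = sum(row["impressions"] for row in rows)
    clicks = sum(row["clicks"] for row in rows)
    if impressions == 0:
        return {"impressions": 0, "clicks": clicks,
                "ctr": 0.0, "avg_position": None}
    # Weight each row's position by how often it was actually seen.
    weighted = sum(row["position"] * row["impressions"] for row in rows)
    return {
        "impressions": impressions,
        "clicks": clicks,
        "ctr": clicks / impressions,
        "avg_position": weighted / impressions,
    }


summary = summarize(ROWS)
```

Note that the average position is pulled towards the query with more impressions, so a site can rank well for one query and still show a weak average.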

Slide 16

Slide 16 text

Thank You
Follow Sachin K on:
https://www.linkedin.com/in/sachinksa/
https://twitter.com/sachinshajik
https://www.reddit.com/user/sachinksa/
https://semkaizen.com/