How does Google Search Work?

This presentation titled "How does Google Search Work?" delves into the detailed workings of the Google search engine.

Sachin Shaji Kalloothara

November 06, 2024

Transcript

  1. What is a Search Engine?

    A search engine is a software system that finds web pages that match a web search. It is to the web roughly what a library is to books. Google, Amazon, Facebook, TikTok, etc. are all search engines in some form.
  2. Crawling, Indexing & Ranking

    • Google uses huge sets of computers to crawl the web, using programs called Googlebots (also known as spiders, search engine bots, or crawlers)
    • Google crawls the web and makes a copy of it. This copy is called an index.
    • Algorithms “rank” the documents in the index and then serve the results to the SERP (a higher-ranked document is judged to be of higher quality)
  3. Crawling

    • Crawling comes under Information Retrieval (IR) - a science that deals with retrieving information from a set of documents
    • IR comes at a huge cost - called the “Cost of Retrieval”
    • Google can easily crawl images, text & videos. However, it struggles with JavaScript-based sites
    • There are many types of crawlers, based on content type, device, and location
      ◦ Adbots, image bots, page resource load bots, mobile bots, desktop bots
  4. Indexing Process

    • Text Acquisition is crawling the web for documents
    • Text Transformation transforms documents into index terms or features. Index terms are the parts of a document that are stored in the index and used in searching.
    • The simplest index term is a word
    • Phrases, names of people, dates, links, etc. can also be index terms
    • This leads to Index Creation, which enables fast searching
  5. Parsing

    1. Tokenizing: Tokenizing the text is an important first step in this process. In many cases, tokens are the same as words
    2. Stopping: Removal of stop words like “is”, “to”, “of”, etc.
    3. Stemming: Grouping words that are derived from a common stem. E.g.: sofa, sofas, sopha
    4. Link extraction and analysis: Links & anchors are stored separately & analysed
    5. Information extraction: More about the semantics & nuances of the tokens (Named Entity Recognizers)
    6. Classifier: Classifies documents into categories like Furniture, Politics - Knowledge Graph
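The first three parsing steps above (tokenizing, stopping, stemming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a tiny hand-picked sample, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer; neither reflects how Google actually parses documents.

```python
import re

# Tiny illustrative stop list (an assumption); real lists are much larger.
STOP_WORDS = {"is", "to", "of", "the", "a", "an", "and", "in"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def parse(text: str) -> list[str]:
    """Tokenize, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(parse("Google is crawling the sofas"))  # → ['google', 'crawl', 'sofa']
```

Note that a suffix stripper like this groups "sofa" and "sofas" but would not catch a spelling variant such as "sopha"; that kind of grouping needs a dictionary or a learned model.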
  6. Inverted Indexes

    • The most common form of index used by search engines.
    • An inverted index contains, for every index term, a list of the documents that contain that index term.
    • It is inverted in the sense that it maps index terms to documents, rather than documents to the terms they contain
    • Building it involves “Weighting” index terms within documents (like tf.idf)
    • The index is then distributed across multiple networks for easier retrieval
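A minimal sketch of an inverted index with tf.idf weights, assuming documents have already been parsed into token lists. The document IDs and tokens are made up for illustration, and the weighting shown (raw term frequency times `log(N / df)`) is one common textbook variant, not necessarily what any production engine uses.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Map each term to {doc_id: tf-idf weight} for the documents containing it."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    index: dict[str, dict[str, float]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            idf = math.log(n_docs / df[term])
            index[term][doc_id] = tf * idf
    return dict(index)


# Hypothetical, already-parsed documents.
docs = {
    "d1": ["sofa", "leather", "sofa"],
    "d2": ["sofa", "fabric"],
    "d3": ["election", "result"],
}
index = build_inverted_index(docs)
print(sorted(index["sofa"]))  # → ['d1', 'd2']
```

Looking up a term is a single dictionary access, which is what makes retrieval fast: the engine never scans documents that do not contain the query term.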
  7. Querying & Ranking

    • Algorithms rank the documents based on 100s of factors - called Scoring
    • The output, a ranked list of documents, is then displayed in the SERP
    • Scoring must be done very fast for retrieval - Performance Optimization
    • Ranking can also be distributed like the index, since location & other factors change ranks
    • Evaluation takes in user data to feed back into ranking performance
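To make the scoring step concrete, here is a toy sketch that scores documents by summing the weights of the query terms they contain and returns the top results. The index contents and weights are hypothetical, and a single additive score is a deliberate simplification of the "100s of factors" mentioned above.

```python
import heapq
from collections import defaultdict

# Hypothetical precomputed inverted index: term -> {doc_id: weight}.
INDEX = {
    "sofa": {"d1": 0.75, "d2": 0.5},
    "leather": {"d1": 1.0},
    "fabric": {"d2": 1.0},
}


def rank(
    query_terms: list[str],
    index: dict[str, dict[str, float]],
    k: int = 10,
) -> list[tuple[str, float]]:
    """Score each document by summing the weights of matching query terms."""
    scores: dict[str, float] = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Return the k best (doc_id, score) pairs, highest score first.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


results = rank(["fabric", "sofa"], INDEX)
print(results)  # → [('d2', 1.5), ('d1', 0.75)]
```

Only documents that appear in the posting lists of the query terms are ever scored, which is one reason scoring can be fast even over a very large index.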
  8. Important Ranking Factors

    Important Signals:
    1. The document
    2. Topicality (HITS algorithm)
    3. Page quality (E-E-A-T)
    4. Reliability
    5. Localization
    6. Navboost

    Core Algorithms: Google uses core algorithms to reduce the number of matches for a query down to “several hundred” documents. Those core algorithms give the documents initial rankings or scores.

    Navboost: It memorizes all the clicks on queries from the past 13 months. Navboost is a ranking signal that can only exist after users have clicked on a document/page. It also measures user interactions with other SERP features like PAA, images, map listings, etc. (Tangram, or Tetris, is the system that pulls all SERP features into the SERP)

    Deep learning systems: RankBrain, DeepRank (BERT), RankEmbed, MUM
  9. User Signals - Modeled for Click Prediction

    • CTR is critical when it comes to ranking
    • Run CTR experiments to improve CTR
    • Brands with good awareness have an advantage here
  10. PageRank vs HITS

    PageRank
    1. PageRank, developed by Google founders Larry Page and Sergey Brin, assigns a numerical weight to each element of a hyperlinked set of documents
    2. PageRank emphasizes the importance of quality links, and a link from a high-ranking page contributes more to the linked page's own ranking
    3. Keyword independent

    Hyperlink-Induced Topic Search (HITS)
    1. HITS, also known as Hubs and Authorities, was developed by Jon Kleinberg. Instead of assigning a global importance to web pages, HITS identifies two types of pages: hubs and authorities.
    2. Pages are mutually reinforcing - a page is a good hub if it links to many authorities, and it is a good authority if it is linked to by many hubs
    3. Keyword dependent
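The contrast between the two algorithms can be sketched with simple power iteration over a toy link graph. Assumptions: the graph, the damping factor of 0.85, and the iteration count are illustrative; every link target must also appear as a key in `links`; and for HITS the graph stands in for the query-specific subgraph, which is what makes it keyword dependent. This is a textbook-style sketch, not Google's implementation.

```python
def pagerank(
    links: dict[str, list[str]], damping: float = 0.85, iterations: int = 50
) -> dict[str, float]:
    """Power-iteration PageRank over a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def hits(
    links: dict[str, list[str]], iterations: int = 50
) -> tuple[dict[str, float], dict[str, float]]:
    """Return (hub, authority) scores via mutually reinforcing updates."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                auth[target] += hub[page]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Toy graph: pages a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
hub, auth = hits(links)
print(max(ranks, key=ranks.get))  # → c
print(max(auth, key=auth.get))    # → c
```

In this graph both methods agree that "c" is the most important page, but HITS additionally separates the roles: "a", "b" and "d" score as hubs because they point at the authority, while PageRank produces a single score per page.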
  11. Priors Algorithm

    • Deals with the choices users make on the SERP
    • Google has historic data on user behaviour that can be modelled to accommodate even new users
    • Google follows previous user interactions to rank pages in the SERP
    • This is where personalized results come in

    Takeaway: Create content or pages that resonate with users. Make content superior to the rest of the results in the SERP - basically, get “Long Clicks” from users
  12. Information Satisfaction - IS

    • A set of queries is rated by human raters
    • IS rating is done when there are changes to the ranking system or to evaluate the SERP
    • Google uses mobile to do IS - mobile indexing is the norm
    • The Search Quality Raters Guidelines are part of this
  13. Does Google Understand All Types of Content?

    With newer models like RankBrain, Google can now understand content in all its nuances. But Google predominantly uses user signals to rank pages or modify ranks
  14. What are Impressions, Clicks & Positions?

    1. Impressions: How often someone saw a link to your site on Google. Depending on the result type, the link might need to be scrolled or expanded into view.
    2. Clicks: How often someone clicked a link from Google to your site.
    3. (Average) Position: A relative ranking of the position of your link on Google, where 1 is the topmost position, 2 is the next position, and so on. Shown only for Google Search results.

    Nuances to understand: Aggregated Impressions by Property & Page
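A small sketch of how these three metrics combine when rows are aggregated. The `Row` type and the sample numbers are hypothetical, and weighting each row's position by its impressions is an assumption about how to roll up per-query averages; it is not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Row:
    query: str
    impressions: int
    clicks: int
    position: float  # average position for this query


def summarize(rows: list[Row]) -> tuple[int, int, float, float]:
    """Return (impressions, clicks, CTR, impression-weighted average position)."""
    impressions = sum(r.impressions for r in rows)
    clicks = sum(r.clicks for r in rows)
    if impressions == 0:
        return 0, clicks, 0.0, 0.0
    ctr = clicks / impressions
    avg_position = sum(r.position * r.impressions for r in rows) / impressions
    return impressions, clicks, ctr, avg_position


# Hypothetical report rows.
rows = [
    Row("leather sofa", impressions=100, clicks=10, position=2.0),
    Row("fabric sofa", impressions=300, clicks=6, position=6.0),
]
print(summarize(rows))  # → (400, 16, 0.04, 5.0)
```

The example shows why aggregated numbers need care: the site ranks second for one query, yet the overall average position is 5.0 because the lower-ranked query accounts for most of the impressions.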