of SEO: Decoding Search Engine Algorithms. This is the cover that my publisher sent me. I’m not a fan, but we’ll see what happens. Anyway, you can preorder it wherever books are sold. Here’s the Amazon link. https://amzn.to/3T9qkYN
in 1951 because he felt the Dewey Decimal System could not keep up with the pace of information after the war. This is the basis of what is called an Inverted Index, the data structure behind what we think of as an “index” in search engines.
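As a rough illustration of the inverted index idea, here is a minimal sketch in Python. The function names and the three sample documents are made up for the example, and real search engines store far richer posting lists than this.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "search engines build an index",
    2: "an inverted index maps terms to documents",
    3: "crawlers fetch documents for search engines",
}
index = build_inverted_index(docs)
print(search(index, "search engines"))  # [1, 3]
print(search(index, "index"))           # [1, 2]
```

The lookup goes from term to documents rather than from document to terms, which is why it is called "inverted."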
the vector-space model, documents and queries are converted to vector representations and plotted in multi-dimensional space. The query and document vectors are then compared based on cosine similarity and the ones that are closest to the query are the most relevant. The main takeaway here is that relevance is a quantitative value. This is perhaps the most important concept to understand about how search works.
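To make "relevance is a quantitative value" concrete, here is a minimal sketch of cosine similarity over simple term-count vectors. The sample query and documents are invented, and raw term counts are a stand-in for whatever weighting or embeddings a real engine uses.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query, highest similarity first."""
    q = vectorize(query)
    scored = [(cosine_similarity(q, vectorize(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical query and documents.
docs = [
    "how search engines rank pages",
    "recipes for chocolate cake",
    "search engines crawl the web",
]
for score, doc in rank("search engines rank", docs):
    print(round(score, 3), doc)
```

Each document ends up with a number between 0 and 1, and the ranking is just a sort on that number.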
available web-scale search engine based on a crawler called WebCrawler in 1994. He wrote about it extensively for his PhD thesis. http://www.thinkpink.com/bp/Thesis/Thesis.pdf
engine can trace its roots back to WebCrawler to some degree. In fact, Lycos, AltaVista, and Google all reference it in their early papers and patents. You know why page titles have been so important for so long? Early search engines only indexed page titles.
Google from AltaVista https://www.quora.com/What-was-it-like-to-work-on-the-AltaVista-team-in-the-90s?ch=10&oid=960520&share=21e5a871&srid=uHsr&target_type=question
software across eco can be installed on any machine and any process can be run on any machine. For example, a crawler could also run on a machine that is managing rendering, processing, or anything else.
Panda Panda was a group-specific modification factor that was computed as a function of the number of independent links divided by the number of reference queries. Penguin built on top of that quality score and applied it to links.
Search Engines Do Fundamentally, this is the basis of how search engines function. Google has developed many layers on top of this, but this is the core of what they all do.
Primary Search Models What we as the SEO community do not have a strong enough handle on is that most of what Google’s doing is on the semantic side and that has all improved dramatically over the last 10 years based on machine learning.
the vector space model again. This model is a lot stronger in the neural network environment because Google can capture more meaning in the vector representations.
we talk about relevance, it’s a question of similarity, determined by how similar the vectors are between documents and queries. This is a quantitative measure, not the qualitative idea of how we typically think of relevance.
by Tomas Mikolov and Jeff Dean that yielded an improvement in natural language understanding by using neural networks to compute word vectors. These were better at capturing meaning. Follow-on innovations like Sentence2Vec and Doc2Vec came later.
the idea of “aspect embeddings,” which is a series of embeddings that represent the full elements of both the query and the document and give stronger access to deeper information.
magic happens in the URL manager. The crawler simply accesses a page and extracts it. The processing pipeline handles most of the actual parsing. Source: Distributed Crawling of Hyperlinked Documents https://patents.google.com/patent/US8812478B1/en
“state.” Although it has the capability to, it does not maintain cookies, fill out forms, or make POST requests. Every page it looks at is as though it logged on to the web for the first time and
level is reviewed) or depth-first (the last node of every path is reached before moving on). Google uses a “best-first” model that follows PageRank. [Diagram: depth-first vs. breadth-first traversal]
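Here is a minimal sketch of what a best-first crawl order looks like compared with plain breadth-first or depth-first. The link graph and the priority scores are hypothetical; the scores are only a stand-in for PageRank, and this is not Google's scheduler.

```python
import heapq


def best_first_crawl(graph, scores, seed, limit):
    """Visit URLs in order of priority score, highest first.

    graph:  dict mapping a URL to the URLs it links to (hypothetical data)
    scores: dict mapping a URL to a priority score (stand-in for PageRank)
    """
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-scores.get(seed, 0.0), seed)]
    seen = {seed}
    order = []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-scores.get(link, 0.0), link))
    return order


graph = {
    "/": ["/blog", "/products", "/about"],
    "/products": ["/products/widget"],
    "/blog": ["/blog/post-1"],
}
scores = {
    "/": 1.0, "/products": 0.8, "/products/widget": 0.7,
    "/blog": 0.5, "/blog/post-1": 0.2, "/about": 0.1,
}
print(best_first_crawl(graph, scores, "/", limit=4))
# ['/', '/products', '/products/widget', '/blog']
```

Notice that the deep but high-scoring "/products/widget" is fetched before the shallow "/about" page, which neither breadth-first nor depth-first would guarantee.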
Levels Deep” Come From? A paper by IR legend and Yahoo researcher Ricardo Baeza-Yates entitled “Crawling the Infinite Web” identified that crawling only five levels deep is enough to get the most valuable content on the web. https://chato.cl/papers/baeza04_crawling_infinite_web.pdf
your dates from Schema and your lastmod from your sitemap, but they can’t trust them. So, they keep every version of your content that they crawl and make determinations on how frequently pages change to decide how often to crawl the page.
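A minimal sketch of that idea: compare stored versions of a page to estimate how often it changes, then turn that into a recrawl interval. The function names, the day-based timestamps, and the 1 to 30 day bounds are all assumptions for illustration, not anything Google has published.

```python
import hashlib


def content_hash(html):
    """Fingerprint a stored version of a page."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def observed_change_rate(snapshots):
    """Changes per day across stored versions.

    snapshots: list of (day_number, html) tuples sorted by day (hypothetical).
    """
    if len(snapshots) < 2:
        return 0.0
    changes = sum(
        1
        for (_, old), (_, new) in zip(snapshots, snapshots[1:])
        if content_hash(old) != content_hash(new)
    )
    elapsed_days = snapshots[-1][0] - snapshots[0][0]
    return changes / elapsed_days if elapsed_days > 0 else 0.0


def recrawl_interval_days(rate, min_days=1.0, max_days=30.0):
    """Crawl roughly once per expected change, within assumed bounds."""
    if rate <= 0:
        return max_days
    return max(min_days, min(max_days, 1.0 / rate))


snapshots = [(0, "v1"), (10, "v1"), (20, "v2"), (30, "v3")]
rate = observed_change_rate(snapshots)
print(rate, recrawl_interval_days(rate))
```

The point is that the observed history, not the declared lastmod, drives the schedule.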
XML sitemaps regularly from a separate crawler to update their “per site” database. That database informs the list of URLs that go to the scheduler and it treats “differential sitemaps” with higher priority. There’s also a secondary crawler system for URLs in XML Sitemaps.
a Page A good way to improve crawl is by updating your pages regularly. An automated way to make a page change is by putting an NLG summary at the top of the page and updating it frequently.
Time / Error Rate = TTFB x Duration / %Server Error = (Avg. TTFB x Duration / %Server Error) * (CTR x Average Time between page updates) = (Avg # of Crawled URLs x Frequency) / Time @JoriFord
that have either explicitly or implicitly indicated that they don’t change (304 response code) are basically put on a timeout for a while and Google will reuse what it has in the index. That cache expiry refreshes on a set interval.
this initiative because of the cross-search engine URL submission requirement. I could imagine them coming up with their own version of the spec though.
Crawl Activity
• Load Balance – Route Googlebot to its own autoscaling instances by IP
• Submit Differential Sitemaps
• Update your pages regularly
• Align lastmod with structured data date and on-page date
• Make sure your robots.txt never returns a 500
• Track your crawl budget metrics
Anna Paterson led the phrase-based indexing initiative, search engines built inverted indexes on single phrases and then built posting lists at the intersections of phrases in queries. Phrase-based indexing upended this and introduced phrase co-occurrence and predictive modeling based on those phrases.
the Document Server There are a variety of operations that Google does based on your content over time. So they have cached versions from the first time a page appeared.
in multiple tiers across many machines and split into three dimensions based on how important the page is. Super important and regularly accessed pages are stored in memory. Pages of medium importance are stored on solid state drives for fast reads. Pages that are not so important are stored on standard HDDs since they are cheap and don’t need to be fast. Source: Distributed Crawling of Hyperlinked Documents https://patents.google.com/patent/US8812478B1/en
through a series of fingerprints and comparisons. There are many signals that inform this process, such as links, redirects, alternates, etc. Google uses a machine learning classifier to make the final canonical determination.
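To give a feel for fingerprint comparison, here is a minimal sketch that hashes overlapping word shingles and compares two pages by Jaccard overlap. Shingling with Jaccard similarity is a common textbook technique swapped in for illustration; the 3-word shingle size and 0.8 threshold are assumptions, and this is not Google's actual canonicalization logic.

```python
import hashlib


def fingerprint(text, size=3):
    """Return a set of short hashes, one per overlapping word shingle."""
    words = text.lower().split()
    count = max(1, len(words) - size + 1)
    grams = [" ".join(words[i:i + size]) for i in range(count)]
    return {hashlib.md5(g.encode("utf-8")).hexdigest()[:8] for g in grams}


def jaccard(a, b):
    """Share of fingerprints the two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Assumed rule: pages above the threshold are duplicate candidates."""
    return jaccard(fingerprint(text_a), fingerprint(text_b)) >= threshold


page = "the quick brown fox jumps over the lazy dog"
print(is_near_duplicate(page, page))
print(is_near_duplicate(page, "a completely different page about search engines"))
```

A comparison like this only produces duplicate candidates; the other signals then help decide which URL in the cluster is treated as canonical.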
Web Rendering System uses a modified version of headless Chromium to render pages. It behaves differently from a user’s browser in how it handles random numbers, dates, and service workers. It doesn’t paint pixels because there’s no reason to, but it will stop executing if a process takes up too much CPU.
links, it’s very likely that they have ramped up the capabilities around relevance between pages for links. They are likely discounting pages that are not close relevance matches anymore.
function. Google scores content and links in a variety of different ways and then chooses the best results. There is not just one “algorithm.” This is why different queries seem to value signals differently. HOW SEARCH ENGINES REALLY WORK IN 2023
scoring functions with different results to choose from, Google may make further re-ranking adjustments based on any number of features and factors. So, really, anything could happen in the SERPs.
Snippet is Bigger
1. The query is more natural language and no longer Orwellian Newspeak. It can be much longer than the 32 words that it has historically been limited to.
2. The Featured Snippet has become the “AI snapshot,” which takes 3 results and builds a summary.
3. Users can also ask follow-up questions in conversational mode.
Neeva, Bing, and now Google’s Search Generative Experience all pull documents based on search queries and feed them to a language model to generate a response.
PaLM 2 and MUM MUM is the Multitask Unified Model that Google announced in 2021 as a way to do retrieval augmented generation. PaLM 2 is their latest state-of-the-art large language model.
lit a competitive fire under Google, but they have been working on these technologies for years. They were slow to release them because LLMs are prone to returning disinformation.
Quality The experience of a response from Google suggests that there is a person giving the response. The generative text may also conflict with other aspects returned in search.
change in the level of natural language query that Google can support, we’re going to see a lot fewer head terms and a lot more long-tail terms. [Chart labels: going down, going up]
the search results being pushed down by the AI snapshot experience, what is considered #1 will change. We should also expect that any organic result will be clicked less and the standard organic will drop dramatically. However, this will likely yield query displacement.
industry, we’ll need to decide what is considered the #1 result. Based on this screenshot, positions 1-3 are now the citations for the AI snapshot and #4 is below it. However, the AI snapshot loads on the client side, so rank tracking tools will need to change their approach.
Doc Length Normalization Google has always had the idea of making sure content length isn’t an overpowering factor. Amit Singhal recognized longer documents inherently outperform shorter ones in retrieval tasks, so it’s always been a fundamental thing that Google looked at.
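One well-known formulation from Singhal's published research is pivoted length normalization, sketched below. Whether Google uses this exact form today is an assumption; the slope value of 0.2 and the function names are illustrative only.

```python
def pivoted_normalizer(doc_len, avg_len, slope=0.2):
    """Pivoted length normalizer: 1.0 for an average-length document.

    Longer-than-average documents get a normalizer above 1.0, shorter
    ones below 1.0. The slope controls how strongly length is penalized.
    """
    return (1.0 - slope) + slope * (doc_len / avg_len)


def normalized_tf(term_freq, doc_len, avg_len, slope=0.2):
    """Term frequency adjusted so raw length alone does not win."""
    return term_freq / pivoted_normalizer(doc_len, avg_len, slope)


avg_len = 1000  # hypothetical average document length in words
print(normalized_tf(10, doc_len=1000, avg_len=avg_len))  # average-length page
print(normalized_tf(10, doc_len=4000, avg_len=avg_len))  # 4x longer page
```

With the same raw term count, the longer page scores lower, which is the effect the normalization is meant to have.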
built a tool that allows someone to put in a keyword or a topic and it will generate robust content based on what is currently ranking. https://www.frac.tl/interactives/long-form-article-generator/
Into Play Conceptually, as it relates to search engines, Information Gain is the measure of how much unique information a given document adds to the ranking set of documents. In other words, what are you talking about that your competitors are not?
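A very rough way to think about this in code: what share of the terms in your document do not appear anywhere in the documents that already rank? The sketch below is a simplified proxy with made-up sample text. It uses plain words where a real analysis would use entities, and it is not how Google computes Information Gain.

```python
def novel_terms(candidate, ranking_set):
    """Terms in the candidate that none of the ranking documents use."""
    seen = set()
    for doc in ranking_set:
        seen.update(doc.lower().split())
    return set(candidate.lower().split()) - seen


def information_gain_proxy(candidate, ranking_set):
    """Share of the candidate's distinct terms that are new to the set."""
    terms = set(candidate.lower().split())
    if not terms:
        return 0.0
    return len(novel_terms(candidate, ranking_set)) / len(terms)


# Hypothetical ranking set and candidate document.
ranking_set = ["crawl budget basics", "crawl budget tips"]
candidate = "crawl budget log analysis"
print(sorted(novel_terms(candidate, ranking_set)))     # ['analysis', 'log']
print(information_gain_proxy(candidate, ranking_set))  # 0.5
```

A page that only repeats what the ranking set already says scores 0.0 on this proxy.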
Looking at Across the Entity Graph •Thus far, there is a very limited set of tools in the SEO space that are specifically looking at entities and their relationships. A non-SEO tool called EntiTree visualizes related entities from Wikidata. https://www.entitree.com/ Using this will give you insights into what entities are being considered for your target entity.
a SERP for Term Co-occurrence •While it’s possible that it may yield the same or similar results, tools like this are not looking across relationships of entities.
and Talk About it In Your Content •Ultimately, the process is the same. Work the entities under discussion, their attributes, and related entities into all the relevant places in your content.
I whipped up a quick tool in Colab where you can see how entities are appearing in your own content. You can paste text, upload a file, or select a URL. Compare the usage of entities in your content with your competitors’. https://colab.research.google.com/drive/18QXrdAPoKhUl76gGzuxk_vDiUqMeRyqx?usp=sharing
• Use qualitative measures in the places where Google is using quantitative measures
• Use tools that calculate embeddings
• Improve the management of your XML sitemaps
• Leverage generative AI to scale content optimization
• Build links contextually
• Start actually using entities
Download the AI Guide: https://ipullrank.com/ai-seo-guide Use Orbitwise: https://ipullrank.com/tools/orbitwise