
Michael King - Accounting for Gaps in SEO Software

Tech SEO Connect
October 23, 2024


Transcript

  3. 6 If You Use a Tool That Does This, It Does Not Make Sense
  4. 8 I made the mistake of giving Marcus Tober a compliment the other day…
  5. 10 Phrase-Based Indexing + Query Expansion Means Google Is Considering Content Across the Topic Cluster for Posting Lists
  6. 11 This is what (was) happening with AI Overviews. AI Overviews also look at related queries, not just your primary search query. h/t https://richsanger.com/google-ai-overviews-do-ranking-studies-tell-the-whole-story/
  7. 12 All the Other Content Editor Tools Look Vertically at a SERP for Just the Target Keyword and Basically Just Do TF-IDF
  8. 13 Here's the Right Way: 1. Build a graph of keywords based on the target keyword. 2. Crawl the top 10 rankings for all keywords across the graph. 3. Extract features for entities and co-occurring terms. 4. Compare against your page. 5. Optimize. Colab: https://colab.research.google.com/drive/1s5bB0vVsTFFTVfJz1ZUiwcKpyK7vsmy8#scrollTo=mDJvmCtTxucv
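
     For illustration, a minimal sketch of that loop. The keyword-graph expansion, SERP lookups, and fetch function are hypothetical stand-ins to be wired to whatever data source you use; the linked Colab has the actual implementation.

        # Sketch of the keyword-graph approach from the slide above.
        # get_related_keywords(), get_top_urls(), and fetch_text() are
        # hypothetical stand-ins for your keyword, SERP, and crawl sources.
        from collections import Counter

        def build_keyword_graph(target_keyword, get_related_keywords, depth=1):
            """Expand the target keyword into a set of related keywords."""
            keywords = {target_keyword}
            frontier = [target_keyword]
            for _ in range(depth):
                frontier = [kw for seed in frontier for kw in get_related_keywords(seed)]
                keywords.update(frontier)
            return keywords

        def competitor_term_profile(keywords, get_top_urls, fetch_text, top_n=10):
            """Crawl the top results for every keyword and count co-occurring terms."""
            counts = Counter()
            for kw in keywords:
                for url in get_top_urls(kw, top_n):
                    counts.update(fetch_text(url).lower().split())
            return counts

        def gap_terms(profile, your_page_text, min_count=5):
            """Terms competitors use repeatedly that your page never mentions."""
            present = set(your_page_text.lower().split())
            return [t for t, c in profile.most_common() if c >= min_count and t not in present]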
  9. 14 Is Anyone Looking at This? Google is telling you why results rank, based on specific lexical, semantic, and link signals.
  10. 22 The reality is that the existence of “Python SEO” is a function of the failings of SEO software.
  11. 25 Search Engines Work Based on the Vector Space Model. Documents and queries are plotted in multidimensional vector space. The closer a document vector is to a query vector, the more relevant it is.
  12. 26 TF-IDF Vectors. The vectors in the vector space model were built from TF-IDF. These were simplistic, based on the Bag-of-Words model, and they did not do much to encapsulate meaning.
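
     To make the vectors concrete, a quick scikit-learn example: documents and a query become sparse TF-IDF vectors, and relevance is just vector similarity.

        # TF-IDF in a few lines: fit on documents, project the query into
        # the same space, and score by cosine similarity.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        docs = [
            "technical seo audit checklist",
            "how to bake sourdough bread",
            "seo software and crawling tools",
        ]
        vectorizer = TfidfVectorizer()
        doc_vectors = vectorizer.fit_transform(docs)

        query_vector = vectorizer.transform(["seo tools"])
        print(cosine_similarity(query_vector, doc_vectors))  # the SEO docs score highest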
  13. 27 Relevance is a Function of Cosine Similarity. When we talk about relevance, what counts as similar is determined by how similar the vectors are between documents and queries. This is a quantitative measure, not the qualitative idea of relevance we typically think of.
  14. 28 It Does Not Have to be Cosine Similarity. There are a lot of ways to compute nearest-neighbor distance. Cosine similarity is the most popular, with Euclidean distance second, but there are many distance functions to consider.
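
     For instance, the same pair of vectors scored three ways with SciPy:

        # Cosine similarity vs. two alternative distance functions.
        import numpy as np
        from scipy.spatial.distance import cosine, euclidean, cityblock

        doc = np.array([0.2, 0.7, 0.1])
        query = np.array([0.25, 0.6, 0.15])

        print("cosine similarity:", 1 - cosine(doc, query))  # scipy's cosine() is a distance
        print("euclidean distance:", euclidean(doc, query))
        print("manhattan distance:", cityblock(doc, query))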
  15. 30 Google Shifted from Just Lexical to Include Semantic a Decade Ago. The lexical model counts the presence and distribution of words, whereas the semantic model captures meaning. This was the huge quantum leap behind Google’s Hummingbird update, and most SEO software has been behind for over a decade.
  16. 31 Word2Vec Gave Us Embeddings. Word2Vec was an innovation led by Tomas Mikolov and Jeff Dean that yielded an improvement in natural language understanding by using neural networks to compute word vectors. These were better at capturing meaning. Many follow-on innovations, like Sentence2Vec and Doc2Vec, came after.
  17. 34 This Allows for Mathematical Operations. Comparisons of content and keywords become linear algebraic operations.
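
     The classic demonstration, using gensim's downloadable GloVe vectors: king - man + woman lands near "queen".

        # Word embeddings support linear algebra over meaning.
        import gensim.downloader as api

        vectors = api.load("glove-wiki-gigaword-50")  # small pretrained model
        print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))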
  18. 37 This is a huge problem because most SEO software still operates on the lexical model.
  19. 39 8 Google Employees Are Responsible for Generative AI. https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/
  20. 40 The Transformer. The transformer is a deep learning model used in natural language processing (NLP) that relies on self-attention mechanisms to process sequences of data simultaneously, improving efficiency and understanding in tasks like translation and text generation. Its architecture enables it to capture complex relationships within the text, making it a foundational model for many state-of-the-art NLP applications.
  21. 44 Google Has Been Public About Using Hybrid Models Since 2020. This is why some of the search results feel so weird: a re-ranking of documents with a mix of lexical and semantic. https://arxiv.org/pdf/2010.01195.pdf
  22. 45 Hybrid retrieval is a big part of why you often can’t tell why something ranks.
  23. 47 Dense Retrieval. You remember “passage ranking”? This is built on the concept of dense retrieval, wherein more embeddings represent more of the query and the document to uncover deeper meaning.
  24. 49 Introducing Google’s Version of Dense Retrieval. Google introduces the idea of “aspect embeddings,” a series of embeddings representing the full elements of both the query and the document, giving stronger access to deeper information.
  25. 50 Dense Representations for Entities. Google has improved its entity resolution using embeddings, giving it stronger access to information in documents.
  26. 52 Website Representation Vectors. Just as there are representations of pages as embeddings, there are vectors representing websites, and Google has recently made improvements in understanding when content is not relevant to a given site.
  27. 53 Author Vectors. Similarly, Google has author vectors, wherein it is able to identify an author and the subject matter they discuss. This allows Google to fingerprint an author and their expertise.
  28. 54 So, really, E-E-A-T is a function of information associated with vector representations of websites and authors.
  29. 56 Embeddings are one of the most fascinating things we can leverage as SEOs to catch up to what Google is doing. SEO tools don’t use them.
  30. 58 Google’s Algorithms’ Inner Workings Have Been Put on Full Display Lately. Through a combination of what’s come out of Google’s DOJ antitrust trial and the Google API documentation leak, we have a much clearer picture of how Google actually functions.
  31. 59 I Was the First to Publish on the Google Leak, but It Was a Team Effort
  32. 60 We Now Have a Much Stronger Understanding of the Architecture. https://searchengineland.com/how-google-search-ranking-works-445141
  33. 62 The Primary Takeaway is the Value of User Behavior in Organic Search. Google’s Navboost system keeps track of user click behavior and uses that to inform what should rank in various contexts.
  34. 68 I remain adamant that both Google and the SEO community owe @randfish an apology.
  35. 71 User Click Data is What Makes Google More Powerful Than Any Other Search Engine. The court opinion in the DOJ antitrust trial, Google’s leaked documents, and Google’s own internal documentation all support the fact that click behavior is what makes Google perform the way that it does.
  36. 72 There are Only Two Tools That We Have with the Clickstream Data Needed Here
  37. 75 Hanns Kronenberg Found IR Metrics in Google. Hanns found, tracked, and optimized for: IRScore (the final result of Google’s scoring functions), nav_fraction (expected CTR), and irscore_pre_twiddle (the initial Ascorer value). The leak has since been plugged by Google.
  38. 78 He Found That the Nav Fraction Metric Updates Slowly, in Alignment with What We Know About Navboost and Glue
  39. 85 TF-Ranking, One of Google’s Learning-to-Rank Mechanisms. Google has expectations of performance for every position in the SERP. The user behavior signals collected reinforce what should rank and demote what doesn’t perform, just like a social media channel. The best way to scale this is by generating highly relevant content with a strong user experience.
  40. 86 13 Months of Google Data = 17 Years of Bing Data (Sorry, Fabrice)
  41. 88 Modern SEO Needs UX Baked In. Google has expectations of performance for every position in the SERP. The user behavior signals collected reinforce what should rank and demote what doesn’t perform, just like a social media channel. The best way to scale this is by generating highly relevant content with a strong user experience.
  42. 91 We Should Agree the “200 Ranking Factors” Thing is Dead. Coincidentally, Semrush asked me to help them update this article. It’s so wrong that I told them to start from scratch.
  43. 92 Site Embeddings Are Used to Measure How On-Topic a Page Is. Google is specifically vectorizing pages and sites and comparing the page embeddings to the site embeddings to see how off-topic the page is. Learn more about embeddings: https://ipullrank.com/content-relevance
  44. 94 MixedBread’s Open Source Embeddings are Highly Performant. Last week @dejanseo shared his research on how MixedBread’s embedding models perform better than anything else for his SEO use cases. He also talked about lowering the dimensionality and converting them to binary representations to save space.
  45. 95 You Can Compute This Too. I’m using the embeddings with cosine similarity and clustering to examine, in two ways, how pages relate or don’t relate to the site average of the embeddings. Notice how my recent posts on AI-related topics for SEO all have high PageSiteSimilarity, whereas my post about MozCon from 2011 does not.
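
     A stripped-down sketch of the PageSiteSimilarity idea, assuming the MixedBread model via sentence-transformers; the full version, with crawling and clustering, is in the Colab on the next slide.

        # Embed every page, average the embeddings into a "site vector,"
        # then score each page against that average.
        import numpy as np
        from sentence_transformers import SentenceTransformer
        from sklearn.metrics.pairwise import cosine_similarity

        model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

        page_texts = {
            "/ai-seo-post": "How to use embeddings and LLMs for SEO ...",
            "/mozcon-2011": "Recapping my trip to MozCon in 2011 ...",
        }
        embeddings = model.encode(list(page_texts.values()))
        site_vector = embeddings.mean(axis=0, keepdims=True)  # the site average

        for url, emb in zip(page_texts, embeddings):
            score = cosine_similarity(emb.reshape(1, -1), site_vector)[0][0]
            print(url, round(float(score), 3))  # low scores flag off-topic pages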
  46. 96 Check Out the Colab. This uses Requests, Trafilatura, Transformers, PyTorch, scikit-learn, and Sentence Transformers to compute a SiteScore and a dataframe of cosine similarities and cluster-based scores for all URLs crawled. https://colab.research.google.com/drive/19PJiFmv8oyjhB-jwzEK9TPlbfK-xB573 You can remove the outliers to improve your site focus score. Add this to your content audits.
  47. 97 When I Run It on My Whole Sitemap, My Site is Not Very Focused
  48. 99 Mark Has a Lot More Data to Share; He’ll be Sharing It at SERPConf Later This Month
  49. 100 Mark Has a SHIT LOAD of Data. Data from 90 million SERPs, to be exact.
  50. 101 He Hasn’t Shared Much with Me, But There are Some Interesting Ones
  51. 103 What I love most is that these data points can give us more clarity.
  52. 104 MFs Love to Throw Around “Information Gain.” Conceptually, as it relates to search engines, information gain is the measure of how much unique information a given document adds to the ranking set of documents. In other words, what are you talking about that your competitors are not?
  53. 105 Most SEOs Are Referencing This Patent. This patent actually talks more about re-ranking based on results a user has previously clicked on.
  54. 106 Information Gain was Discussed as Early as Phrase-Based Indexing, as a function of co-occurrence used to predict relevance.
  55. 109 Here’s How I Calculate Information Gain. I calculate it as a function of embedding comparisons across the SERP, identifying the unique entities on each page to help inform what to do. Note: this is an old version of the code; we use MixedBread embeddings for this now.
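
     A rough sketch of that calculation, not the code from the deck: spaCy stands in as the entity extractor, and a small sentence-transformers model supplies the embeddings.

        # For each ranking page: how far does it sit from the rest of the SERP
        # (embedding distinctiveness), and which entities appear only there?
        import numpy as np
        import spacy
        from sentence_transformers import SentenceTransformer
        from sklearn.metrics.pairwise import cosine_similarity

        nlp = spacy.load("en_core_web_sm")
        model = SentenceTransformer("all-MiniLM-L6-v2")

        serp_texts = ["text of result 1 ...", "text of result 2 ...", "text of result 3 ..."]
        embeddings = model.encode(serp_texts)
        entity_sets = [{ent.text.lower() for ent in nlp(t).ents} for t in serp_texts]

        for i in range(len(serp_texts)):
            rest = np.delete(embeddings, i, axis=0).mean(axis=0, keepdims=True)
            distinctiveness = 1 - cosine_similarity(embeddings[i].reshape(1, -1), rest)[0][0]
            unique_entities = entity_sets[i] - set().union(*(s for j, s in enumerate(entity_sets) if j != i))
            print(i, round(float(distinctiveness), 3), unique_entities)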
  56. 111 The Source Type Metric. A metric called sourceType shows a loose relationship between where a page is indexed and how valuable it is. For quick background, Google’s index is stratified into tiers where the most important, regularly updated, and accessed content is stored in flash memory. Less important content is stored on solid state drives, and irregularly updated content is stored on standard hard drives. The higher the tier, the more valuable the link. Pages that are considered “fresh” are also considered high quality. Suffice it to say, you want your links to come from pages that are either fresh or otherwise featured in the top tier. Get links from pages that live in the higher tiers by modeling a composite score based on data that is available.
  57. 112 Source Type is a Proxy Metric. Using weighted rankings, traffic, URL Rating, and Domain Rating, I build a composite metric to estimate where in Google’s tiered index a page may live.
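
     One possible shape for that composite, with illustrative weights and tier cutoffs rather than the actual model:

        # Blend normalized position, traffic, URL Rating, and Domain Rating
        # into one score, then bucket it into Google's storage tiers.
        def source_type_proxy(avg_position, monthly_traffic, url_rating, domain_rating,
                              weights=(0.3, 0.3, 0.2, 0.2)):
            position_score = max(0.0, (100 - avg_position) / 99)  # position 1 is best
            traffic_score = min(monthly_traffic / 100_000, 1.0)   # cap at an assumed ceiling
            composite = (weights[0] * position_score +
                         weights[1] * traffic_score +
                         weights[2] * url_rating / 100 +
                         weights[3] * domain_rating / 100)
            if composite >= 0.7:
                return composite, "flash (top tier)"
            if composite >= 0.4:
                return composite, "solid state (middle tier)"
            return composite, "hard drive (bottom tier)"

        print(source_type_proxy(avg_position=3, monthly_traffic=40_000, url_rating=55, domain_rating=70))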
  58. 113 There are Gold Standard Documents. There is no indication of what this means, but the description mentions “human-labeled documents” versus “automatically labeled annotations.” I wonder if this is a function of quality ratings, but Google says quality ratings don’t impact rankings. So, we may never know. 🤔
  59. 114 Measure Your Content Against the Quality Rater Guidelines. Elias Dabbas created a Python script and tool that uses the Helpful Content recommendations to show a proof-of-concept way to analyze your articles. We’d use the Search Quality Rater Guidelines, which serve as the golden document standard. I’ll be turning this into a golden document metric soon. Code: https://blog.adver.tools/posts/llm-content-evaluation/ Tools: https://adver.tools/llm-content-evaluation/
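
     A proof-of-concept in the same spirit, with an illustrative rubric and model choice rather than Elias’ actual script:

        # Ask an LLM to score an article against quality-rater-style questions.
        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        RUBRIC = """Score the article below from 1-5 on each question, with a brief reason:
        1. Does the content demonstrate first-hand experience?
        2. Is the page's purpose clear, and is it achieved?
        3. Would a reader feel satisfied after reading this?"""

        def evaluate_article(article_text: str) -> str:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": f"{RUBRIC}\n\nARTICLE:\n{article_text}"}],
            )
            return response.choices[0].message.content

        print(evaluate_article("Your article text here..."))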
  60. 115 The Search Telemetry Project. What I’m building is a Python library for computing as many as possible of the meaningful metrics that Google is deriving. This will help us expand our understanding of why things rank the way they do. https://github.com/ipullrank/search-telemetry
  61. 118 Content Decay. The web is a rapidly changing organism. Google always wants the most relevant content, with the best user experience and the most authority. Unless you stay on top of these measures, you will see traffic fall off over time. Measuring content decay is as simple as comparing page performance period over period in analytics or GSC. Just knowing content has decayed is not enough to be strategic.
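
     For example, flagging decayed pages from two GSC exports with pandas; the file and column names are assumptions, so rename them to match your exports.

        # Join this period's export to the prior year's and flag pages whose clicks fell.
        import pandas as pd

        current = pd.read_csv("gsc_last_3_months.csv")  # assumed columns: Page, Clicks
        previous = pd.read_csv("gsc_prior_year.csv")

        merged = current.merge(previous, on="Page", suffixes=("_now", "_prior"))
        merged["click_diff"] = merged["Clicks_now"] - merged["Clicks_prior"]

        decaying = merged[merged["click_diff"] < 0].sort_values("click_diff")
        print(decaying[["Page", "Clicks_prior", "Clicks_now", "click_diff"]].head(20))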
  62. 119 It’s not enough to know that the page has lost traffic.
  63. 123 Interpreting the Content Potential Rating. 80-100: High Priority for Optimization; 60-79: Moderate Priority for Optimization; 40-59: Selective Optimization; 20-39: Low Priority for Optimization; 0-19: Minimal Benefit from Optimization. If you want quick and dirty, you can prune everything below a 40 that is not driving significant traffic.
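
     Those bands as a simple lookup function:

        def cpr_priority(cpr: float) -> str:
            """Map a Content Potential Rating to the band from the slide above."""
            if cpr >= 80:
                return "High Priority for Optimization"
            if cpr >= 60:
                return "Moderate Priority for Optimization"
            if cpr >= 40:
                return "Selective Optimization"
            if cpr >= 20:
                return "Low Priority for Optimization"
            return "Minimal Benefit from Optimization"

        print(cpr_priority(72))  # -> Moderate Priority for Optimization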
  64. 124 Combining CPR with pages that lost traffic helps you understand whether it’s worth it to optimize.
  65. 125 Step 1: Pull the Rankings Data from Semrush. Organic Research > Positions > Export
  66. 126 Step 2: Pull the Decaying Content from GSC. Google Search Console is a great source to spot content decay by comparing the last three months year over year. Filter for those pages where the click difference is negative (smaller than 0), then export.
  67. 127 Step 3: Drop Them in the Spreadsheet and Press the Magic Button
  68. 128 The Output is a List of URLs Prioritized by Action. Each URL is marked as Keep, Revise, Kill, or Review based on the keyword opportunities available and the effort required to capitalize on them. Sorting the URLs marked “Revise” by aggregated SV and CPR will give you the best opportunities first.
  69. 129 Get your copy of the Content Pruning Workbook: https://ipullrank.com/cpr-sheet
  70. 130 Add this data to your content audits to make data-driven decisions about what to cut.
  71. 134 In SEO, Our Outputs are ALL OVER THE PLACE!!! One provider oddly returns a semicolon-separated CSV via its API, while another provides JSON. The data is the same but formatted dramatically differently. WHY?!
  72. 136 All the Link Indices Crawl a Different Subset of the Web, but There’s No Real Way to Consolidate or Compare the Data. The metrics are also wildly different, and much of what Google is looking at is not accounted for.
  73. 138 The Gateway Specification. This is a draft of standards for data portability and open link metrics. https://github.com/ipullrank/gateway Contribute!
  74. 141 I’ve always felt rankings and link data should be free, but compute and storage cost money.
  75. 145 Nodes are Hosted on Trusted Machines. Each node is a simple TSR that downloads lists of URLs, crawls and extracts information, and phones it home to Majestic.
  76. 146 I decompiled it to see how hard it might be to build. This is the crawl function.
  77. 148 What if we replicated that for rankings, links, and an embeddings index, but used the nodes for storage too?
  78. 149 We Could Mirror the Spanner Architecture. Google uses many distributed machines as a single machine using the Spanner architecture. We could mirror this idea by building a network of trusted SEOs who run redundant nodes on their machines for crawling and storage.
  79. 151 Coming Soon… Although I have been working furiously on this in Cursor, I decided to drink with y’all instead of finishing it in time for this.
  80. Thank You | Q&A. Contact me if you want to get better results from your SEO: [email protected] Download the slides: https://speakerdeck.com/ipullrank Mike King, Chief Executive Officer, @iPullRank. Award Winning, #GirlDad.
  81. 155 The Three Laws of Generative AI Content. 1. Generative AI is not the end-all-be-all solution. It is not the replacement for a content strategy or your content team. 2. Generative AI for content creation should be a force multiplier to be utilized to improve workflow and augment strategy. 3. You should consider generative AI content for awareness efforts, but continue to leverage subject matter experts for lower-funnel content.
  82. 160 It’s Not Difficult to Build with LlamaIndex:

        import advertools as adv
        from llama_index.core import VectorStoreIndex
        from llama_index.core.query_engine import CitationQueryEngine

        sitemap_url = "[SITEMAP URL]"
        sitemap = adv.sitemap_to_df(sitemap_url)
        urls_to_crawl = sitemap['loc'].tolist()
        # ... (crawl the URLs and load the pages into `documents`)

        # Make an index from your documents
        index = VectorStoreIndex.from_documents(documents)

        # Setup your index for citations
        query_engine = CitationQueryEngine.from_args(
            index,
            # indicate how many document chunks it should return
            similarity_top_k=5,
            # here we can control how granular citation sources are, the default is 512
            citation_chunk_size=155,
        )

        response = query_engine.query("YOUR PROMPT HERE")
  83. 179 With AI, I’m giving y’all Legos. What you build is up to you, but I’m going to show things to consider.
  84. 182 LLaMa 3.2 is SOTA on Several Benchmarks. Facebook’s open source model is outperforming the best closed-source models on a variety of different evaluation metrics. New open source models pop up weekly that continue to shift the state of the art.
  85. 188 You can now unlock state-of-the-art generative AI use cases from your laptop for free.
  86. 189 Make Sure You Hook It Up to Your GPU. On a Windows machine you’ll need to go to the NVIDIA Control Panel and add the Ollama server application under Manage 3D Settings.
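
     Once the Ollama server is running, generation is one HTTP call against its local API; this assumes you have already pulled the model with "ollama pull llama3.2".

        # Call the local Ollama server's generate endpoint.
        import requests

        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2",
                "prompt": "Write a title tag for a page about technical SEO.",
                "stream": False,
            },
        )
        print(response.json()["response"])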
  87. 190 Octoparse - Combine a scraper with Generative AI - https://www.octoparse.ai/
  88. Thank You | Q&A. Contact me if you want to get better results from your SEO: [email protected] Download the slides: https://speakerdeck.com/ipullrank Mike King, Chief Executive Officer, @iPullRank. Award Winning, #GirlDad.