Trashcat's guide to black boxes: tech seo for LLMs

Jamie Indigo
September 26, 2025

What can we learn from a feral adult cat posing as a precious kitten? New friends aren't always what they seem. Let's take a critical look at LLMs and break down our assumptions about bot behaviors, search engines, and what we can do to impact how we're represented in AI search.

Transcript

  1. audience_guide
    • Not a robot; speaks bot
    • Director of Technical SEO at Cox Automotive
    • Author of Rich Snippets
    • Excitable nerd from New York
    • /in/jamie-indigo
    • not-a-robot.com
    • Proud pet parent
  2. meet_keyleth
    Expert Evaluation
    • Adult
    • Stray
    • Ravenous
    • Breed: shorthair domestic terrorist
    • Favorite toys: airpods and grasshoppers
    did do that. will do it again
  3. bot_covenant
    let me crawl you. I'll:
    1. crawl politely
    2. declare who i am
    3. send you traffic
  4. bot_covenant
    let me crawl you. I'll:
    1. crawl politely
    2. declare who i am
    3. send you traffic
    assumptions (Fool me once, Kiki)
  5. 1. promise(polite)
    RFC 9309 - Robots Exclusion Protocol; All about robots | Google Search Central Blog
  6. robots.txt reports that is lie
    Reddit disallows all crawlers after $60M deal with Google Feb 24
    Credit: Josh Blyskal, LinkedIn
  7. fwd: fwd: fwd: plausible deniability
    Common Crawl's massive dataset is more than 9.5 petabytes large and makes up a significant portion of the training data for many Large Language Models (LLMs).
    80% of GPT-3 tokens (a representation unit of text data) stemmed from Common Crawl.
    Mozilla Report: How Common Crawl’s Data Infrastructure Shaped the Battle Royale over Generative AI
  8. "65% of our most expensive traffic comes from bots"
    Wikimedia Foundation, How crawlers impact the operations of the Wikimedia projects
    meta name = "ravenous"
  9. let me love you
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
    Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Gemini-Deep-Research; +https://gemini[dot]google/overview/deep-research/) Chrome/135.0.0.0 Safari/537.36
    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleAgent-Mariner; +https://developers.google[dot]com/search/docs/crawling-indexing/google-agent-mariner) Chrome/135.0.0.0 Safari/537.36
    Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleAgent-Search; +https://developers.google[dot]com/search/docs/crawling-indexing/google-agent-search) Chrome/114.0.0.0 Safari/537.36
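A minimal sketch of how the strings above could be sorted into "declares who it is" and "looks like a plain browser". It assumes the `compatible; <token>;` convention visible in the Gemini and GoogleAgent strings; the function name `declared_bot` and the regular expression are illustrative, not part of any official tooling.

```python
import re
from typing import Optional

# Assumption: a self-declaring bot places its token right after "compatible;",
# as the Gemini-Deep-Research and GoogleAgent strings on the slide do.
DECLARED_BOT = re.compile(r"compatible;\s*([A-Za-z0-9._-]+)")


def declared_bot(user_agent: str) -> Optional[str]:
    """Return the self-declared bot token, or None when nothing is declared."""
    match = DECLARED_BOT.search(user_agent)
    return match.group(1) if match else None


samples = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "Gemini-Deep-Research; +https://gemini[dot]google/overview/deep-research/) "
    "Chrome/135.0.0.0 Safari/537.36",
]

for ua in samples:
    print(declared_bot(ua))
```

A `None` result only means the string carries no declaration; it says nothing about whether the request came from a person or a bot.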
  10. "This data is in contrast to third-party reports that inaccurately suggest dramatic declines in aggregate traffic — often based on flawed methodologies, isolated examples, or traffic changes that occurred prior to the roll out of AI features in Search."
  11. "This data is in contrast to third-party reports that inaccurately suggest dramatic declines in aggregate traffic — often based on flawed methodologies, isolated examples, or traffic changes that occurred prior to the roll out of AI features in Search."
    THEN HELP US LEARN BETTER
  12. bot_covenant
    let me crawl you. I'll:
    1. crawl politely
    2. declare who i am
    3. send you traffic
    do it whether you like it or not
  13. ai_search
    ask ai for things, it
    1. is search engine
    2. does the thing I ask
    3. uses search engine mechanics
    assumptions
  14. search engine = information retrieval system
    ai search = model trained on corpus + retrieval augmentation (sometimes)
  15. AI rank tracking
    • uses synthetic personas as part of the prompt
    • these are not the same as user embedding vectors
    • intentionally resets persistent memory
    AI search
    • is intent aware
    • has ambient persistent memory
    • is personalized using the same user embedding vectors
    but ai rank tracking…
  16. analytics.jk
    1. Analytics referrer
    2. User-initiated UAs in log files (ex: chatgpt-user)
    3. Text fragments in landing pages (#:~:text=)
    These are you appearing in AI search, my friend.
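The three signals above can be sketched as one check per visit. This is an illustration under stated assumptions: the referrer hosts in `AI_REFERRER_HOSTS` are example values to replace with whatever shows up in your own analytics, and `landing_url` is assumed to come from a source that actually records the fragment (URL fragments are not sent to the server in HTTP requests, so server logs alone will not contain them).

```python
from urllib.parse import urlsplit

# Assumption: example referrer hosts; replace with the ones in your analytics.
AI_REFERRER_HOSTS = {"chatgpt.com", "www.perplexity.ai"}
# From the slide: user-initiated fetchers such as chatgpt-user.
USER_INITIATED_UA_TOKENS = ("chatgpt-user",)
# Marker that introduces a text fragment directive in a URL.
TEXT_FRAGMENT = ":~:text="


def ai_search_signals(referrer: str, user_agent: str, landing_url: str) -> list:
    """Return which of the three signals are present for one visit."""
    signals = []
    if urlsplit(referrer).hostname in AI_REFERRER_HOSTS:
        signals.append("referrer")
    if any(token in user_agent.lower() for token in USER_INITIATED_UA_TOKENS):
        signals.append("user_initiated_ua")
    if TEXT_FRAGMENT in urlsplit(landing_url).fragment:
        signals.append("text_fragment")
    return signals


print(ai_search_signals(
    "https://chatgpt.com/",
    "Mozilla/5.0",
    "https://example.com/guide#:~:text=blue%20book%20value",
))
```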
  17. Effective Resource restriction
    Robots.txt Introduction and Guide | Google Search Central | Documentation
    1. If you're disallowing it for an AI crawler, repeat the statement for CCBot
    2. Use X-Robots-Tag directives to keep non-HTML resources from being indexed independently while still allowing them to contribute to rendered page content.
    3. Seriously, block your API endpoints for non-rendering bots
    4. Be sure to block only non-critical resources (that is, resources that aren't important to understanding the meaning of the page).
    5. Block personalization resources used for returning users to conserve crawl budget
    6. Hide logins from the crawl path
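Point 1 in the list above can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` to find AI crawlers that are blocked from a URL while CCBot is still allowed. The robots.txt contents, the agent list, and `https://example.com/page` are made-up examples, and the helper name `agents_missing_ccbot_rule` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that repeats the AI-crawler rule for CCBot.
ROBOTS_WITH_CCBOT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /api/
Disallow: /login
"""

# Same file, but the CCBot group was forgotten.
ROBOTS_WITHOUT_CCBOT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /api/
Disallow: /login
"""


def agents_missing_ccbot_rule(robots_txt, ai_agents, test_url="https://example.com/page"):
    """List AI agents blocked from test_url while CCBot may still fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [agent for agent in ai_agents if not parser.can_fetch(agent, test_url)]
    # If CCBot is blocked too, the rule has been repeated and nothing is missing.
    return blocked if parser.can_fetch("CCBot", test_url) else []


print(agents_missing_ccbot_rule(ROBOTS_WITH_CCBOT, ["GPTBot", "ClaudeBot"]))
print(agents_missing_ccbot_rule(ROBOTS_WITHOUT_CCBOT, ["GPTBot", "ClaudeBot"]))
```

The standard-library parser is a convenient approximation; individual crawlers may interpret edge cases differently, so treat the result as a lint check rather than proof of behavior.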
  18. render_mechanics
    AI Crawler | Requests | Renders
    Google (ecosystem) | ✅ | ✅
    Claude (Claude-SearchBot, Claude-User, Claude-Web, ClaudeBot) | ✅ | 🤔
    OpenAI (OAI-SearchBot, ChatGPT-User, GPTBot, ChatGPT Agent) | ✅ | ❌
    Meta (Meta-ExternalAgent) | ✅ | ❌
    Perplexity (PerplexityBot) | ✅ | ❌
    ByteDance (Bytespider) | ✅ | ❌
    Common Crawl (ccbot) | ✅ | ❌
    Source: Rise of the AI Crawler, Vercel; reverified 29 July 2025 (thank you, Ryan Siddle)
  19. “The decision on which pages to crawl is primarily influenced by the relevance of the title, the content within the snippet, the freshness of the information, and the credibility of the domain.”
    ChatGPT support team
    How does ChatGPT Search select the sources to crawl? Jérôme Salomon
  20. 1. URL
      2. Title
      3. Snippet (usually meta description)
      4. Ranking position
      5. Metadata

    event: delta
    data: {"v": [
      {
        "type": "search_result_group",
        "domain": "www.kbb.com",
        "entries": [
          {
            "type": "search_result",
            "url": "https://www.kbb.com/cars-for-sale/all/2025/nissan/frontier/pro-4x",
            "title": "2025 Nissan Frontier PRO-4X for Sale",
            "snippet": "... 3572 Nissan Frontier cars for sale, including a New 2025 Nissan Frontier PRO-4X and a Used 2025 Nissan Frontier PRO-4X ranging in price from $13795 to $68619.",
            "ref_id": null,
            "pub_date": null,
            "attribution": "www.kbb.com"
          }
        ]
      }
    ]}
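A small sketch of pulling those fields out of a payload shaped like the one on the slide. It assumes you have already isolated the JSON after `data:`; treating the order of `entries` as the ranking position is also an assumption, since the payload carries no explicit position field. The function name `flatten_results` is illustrative.

```python
import json

# The "data:" payload from the slide, as a JSON string.
PAYLOAD = """
{"v": [{"type": "search_result_group", "domain": "www.kbb.com", "entries": [
  {"type": "search_result",
   "url": "https://www.kbb.com/cars-for-sale/all/2025/nissan/frontier/pro-4x",
   "title": "2025 Nissan Frontier PRO-4X for Sale",
   "snippet": "... 3572 Nissan Frontier cars for sale, including a New 2025 Nissan Frontier PRO-4X and a Used 2025 Nissan Frontier PRO-4X ranging in price from $13795 to $68619.",
   "ref_id": null, "pub_date": null, "attribution": "www.kbb.com"}
]}]}
"""


def flatten_results(raw: str) -> list:
    """Flatten search_result_group entries into one row per result."""
    rows = []
    for group in json.loads(raw)["v"]:
        if group.get("type") != "search_result_group":
            continue
        # Assumption: array order stands in for ranking position.
        for position, entry in enumerate(group["entries"], start=1):
            rows.append({
                "domain": group["domain"],
                "position": position,
                "url": entry["url"],
                "title": entry["title"],
                "snippet": entry["snippet"],
                "pub_date": entry["pub_date"],
            })
    return rows


for row in flatten_results(PAYLOAD):
    print(row["position"], row["title"], row["url"])
```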
  21. leverage analytics and logs as a proxy for AI rank
    1. Referring
    2. User-initiated crawls
    3. Fragments
    English Google SEO office-hours from January 7, 2022
  22. want new pages crawled? link from homepage
    “So for the most part, for example, we would refresh crawl the homepage, I don’t know, once a day, or every couple of hours, or something like that. And if we find new links on their home page then we’ll go off and crawl those with the discovery crawl as well. And because of that you will always see a mix of discover and refresh happening with regard to crawling. And you’ll see some baseline of crawling happening every day. But if we recognize that individual pages change very rarely, then we realize we don’t have to crawl them all the time.”
    English Google SEO office-hours from January 7, 2022
  23. performance still matters
    Technical SEO for AI Search - SALT.agency®
    • Sites with CLS ≤ 0.1 recorded a 29.8% higher inclusion rate in generative summaries compared with sites above this threshold.
    • Pages delivering LCP ≤ 2.5 seconds were 1.47 times more likely to appear in AI outputs than slower pages.
    • Crawlers abandoned requests for 18% of pages larger than 1 MB of HTML, highlighting the need for lean markup.
    • TTFB under 200 ms correlated with a 22% increase in citation density, particularly when paired with robust caching strategies.
    • His study shows that performance improvements do more than enhance user experience. They directly increase the probability of being cited or surfaced by AI systems.
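The thresholds quoted above can be turned into a quick per-page check. This is only a sketch: the `PageMetrics` fields and the `missed_thresholds` helper are hypothetical names, the measurements are assumed to come from whatever lab or field tooling you already use, and 1 MB is taken here as 1,000,000 bytes.

```python
from dataclasses import dataclass


@dataclass
class PageMetrics:
    cls: float          # Cumulative Layout Shift (unitless)
    lcp_seconds: float  # Largest Contentful Paint, in seconds
    html_bytes: int     # size of the HTML response body
    ttfb_ms: float      # Time to First Byte, in milliseconds


def missed_thresholds(metrics: PageMetrics) -> list:
    """Return the names of the quoted thresholds this page does not meet."""
    missed = []
    if metrics.cls > 0.1:
        missed.append("CLS")
    if metrics.lcp_seconds > 2.5:
        missed.append("LCP")
    if metrics.html_bytes > 1_000_000:  # assumption: 1 MB = 1,000,000 bytes
        missed.append("HTML size")
    if metrics.ttfb_ms >= 200:
        missed.append("TTFB")
    return missed


print(missed_thresholds(PageMetrics(cls=0.05, lcp_seconds=3.1, html_bytes=1_400_000, ttfb_ms=150)))
```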
  24. Real-time genAI search can't afford slow pages
    HTTP 499 status code indicates that the client closed the connection before the server could respond. In this context, the client is a bot sent by ChatGPT, terminating the request due to delayed server responses.
    ChatGPT has no time for your slow pages ⏱ | Jérôme Salomon
    499 = i give up
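To see who is giving up on your pages, you can count 499 responses per user agent in your access logs. The sketch below assumes an nginx-style combined log format (499 is a non-standard code that nginx logs when the client closes the connection); the sample lines, IP addresses, and shortened user agent strings are invented for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, i.e.
# ip - - [time] "METHOD path PROTO" status bytes "referrer" "user agent"
LOG_LINE = re.compile(
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def abandoned_requests(lines) -> Counter:
    """Count 499 responses per user agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "499":
            counts[match.group("ua")] += 1
    return counts


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [26/Sep/2025:10:00:00 +0000] "GET /slow-page HTTP/1.1" 499 0 "-" "ChatGPT-User/1.0"',
    '203.0.113.7 - - [26/Sep/2025:10:00:05 +0000] "GET /other-slow-page HTTP/1.1" 499 0 "-" "ChatGPT-User/1.0"',
    '198.51.100.2 - - [26/Sep/2025:10:00:09 +0000] "GET /fast-page HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for user_agent, count in abandoned_requests(SAMPLE_LOG).most_common():
    print(count, user_agent)
```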
  25. tech seo still applies
    • crawlable links
    • semantic html
    • consistent urls
    • not rendered, not discovered
  26. If our existing queries move to AI conversations, can we answer?
    Row 1
      question: What is the value according to the Blue Book?
      original_query: blue book value
      score: 6
      assessment: partially_answered
      explanation: The content mentions that Kelley Blue Book provides various values (Trade In Range and Private Party Value) but does not specify what the 'Blue Book Value' is or how it is determined in a clear manner.
      suggestions: Clarify what the 'Blue Book Value' specifically refers to and how it is calculated, including examples of different types of values (trade-in vs. private party).
    Row 2
      question: How can I find out the value of my car?
      original_query: car value
      score: 9
      assessment: fully_answered
      explanation: The content clearly explains that users can get their car's value by providing their VIN or license plate and that they will receive an email with the value within 24 hours. This is a direct and actionable answer.
      suggestions: -
    Row 3
      question: How much is my car worth according to Kelley Blue Book?
      original_query: how much is my car worth
      score: 7
      assessment: partially_answered
      explanation: The content implies that users can find out their car's worth through Kelley Blue Book but does not provide a direct answer or method to obtain that specific value without further context.
      suggestions: Include a more explicit statement or example of how to find out the worth of a specific car using Kelley Blue Book.
  27. https://not-a-robot.com/ai-transparency
    Each doc has:
    1. Assertions (Rules, Constraints, and Stated Facts)
    2. Functionalities (The AI's Capabilities)
    3. Testing strategies using Chrome DevTools to verify