
SEOIRL - Bot Behavior Decoded (complete deck)

Jori Ford
October 17, 2025


Understand how to cross-map LLM & search bots so you can generate a measurement map & plan for testing.


Transcript

  1. So Why Spend Time Mapping Bots?
     • Guesswork isn’t an option when KPIs are down
     • Measure what matters
     • Measurement turns noise into a plan
  2. Principles for the next 30 minutes
     • TEST EVERYTHING!
     • Logs mostly don’t lie
     • Unify search + LLM measurement
     • Create your own tooling so you can validate results
     • Collect → Normalize → Attribute → Visualize → Alert
  3. Looking at Log Files
     Logs reveal what bots care about, and can indicate the value of what you published.
  4. Anatomy of a Log Line

     104.28.10.1 - - [11/Apr/2025:14:22:11 +0000] "GET /faq/refund-policy HTTP/1.1" 200 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; GPTBot/1.0; +https://openai.com/gptbot)"

     IP address of GPTBot · Timestamp of request · HTTP method & URL · Status · User-Agent string
  5. Anatomy of a Log Line, annotated

     Who & Where?  104.28.10.1 (IP address of GPTBot)
     When?         [11/Apr/2025:14:22:11 +0000] (timestamp of request)
     What & How?   "GET /faq/refund-policy HTTP/1.1" 200 (HTTP method, URL, status)
     User-Agent:   "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; GPTBot/1.0; +https://openai.com/gptbot)"
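     The fields on this slide can be pulled apart with a single regex. A minimal Node.js sketch, assuming the combined-log-style line shown above (no bytes field, referrer then user agent):

```javascript
// Parse an access-log line of the shape shown on the slide.
// Captures: ip, timestamp, method, path, status, referrer, user agent.
function parseLogLine(line) {
  var re = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) "([^"]*)" "([^"]*)"$/;
  var m = line.match(re);
  if (!m) return null; // line does not match the assumed format
  return {
    ip: m[1],
    timestamp: m[2],
    method: m[3],
    path: m[4],
    status: Number(m[5]),
    referrer: m[6],
    userAgent: m[7]
  };
}
```

     Real log formats vary (bytes field, extra headers), so treat the regex as a starting point and adjust it to your server's access-log format.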
  6. Logs Can Answer
     Who? – Which user agent visited?
     What? – What pages are important?
     When? – How frequently am I getting crawled?
     Where? – Where is my audience looking for my brand?
     How? – How is my plan doing?
  7. Bot Reference: official UA token, robots behavior, IP/verification guidance, notes, and the tokens to use in rules

     Googlebot (token: Googlebot)
       Robots: respects robots.txt; crawl-delay not supported; uses desktop & smartphone crawlers; fetches sitemaps.
       Verification: reverse DNS verification recommended; Google documents the method.
       Notes: many UA variants (mobile/desktop, Google-InspectionTool); crawl budget applies; watch robots.txt, sitemaps, canonicals, 301s.
       Rules: User-agent: Googlebot (optionally also Google-InspectionTool)

     Bingbot (token: bingbot)
       Robots: respects robots.txt; supports crawl control in Bing Webmaster Tools; fetches sitemaps.
       Verification: reverse DNS method documented by Microsoft.
       Notes: UA variants exist; can switch to new UA formats; also see AdIdxBot/MSNBot in some cases.
       Rules: User-agent: bingbot

     GPTBot (OpenAI) (token: GPTBot)
       Robots: respects robots.txt; allows global or path-level opt-out; will not crawl disallowed content.
       Verification: OpenAI documents allow/deny rules; IP ranges are periodically published but can change, so treat UA + robots as primary.
       Notes: fetches typical HTML and many text formats; shows up heavily on answer-dense pages.
       Rules: User-agent: GPTBot

     ClaudeBot / Claude-Search (Anthropic) (tokens: ClaudeBot, Claude-Search or Claude-SearchBot)
       Robots: respects robots.txt; supports site-level and path-level disallow.
       Verification: no public reverse-DNS method; rely on robots + behavior patterns.
       Notes: you may see separate tokens for training vs. search; keep rules broad unless you need granularity.
       Rules: User-agent: ClaudeBot and/or Claude-Search*

     PerplexityBot (token: PerplexityBot)
       Robots: respects robots.txt; supports site/path disallow.
       Verification: no official reverse-DNS method published; rely on UA token + robots adherence.
       Notes: often crawls sources it later cites; intensity can spike on fresh, answerable content.
       Rules: User-agent: PerplexityBot
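     The "Rules" tokens above drop straight into robots.txt. A sketch that leaves search crawlers unrestricted while keeping LLM crawlers out of one section; the /internal/ path and example.com sitemap URL are placeholders:

```
# Illustrative robots.txt using the UA tokens above; paths are placeholders
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: GPTBot
Disallow: /internal/

User-agent: ClaudeBot
Disallow: /internal/

User-agent: PerplexityBot
Disallow: /internal/

Sitemap: https://www.example.com/sitemap.xml
```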
  8. Like a librarian: knows the books in each section and gives you a list of recommended reading so you can find your answer.
  9. Search vs. LLM Bot Comparison

     Behavior              | Search                                                              | LLM
     Crawl frequency       | Per-URL scheduling: minutes to months based on PageRank + freshness | No transparent scheduling; burst activity during training cycles
     Crawl budget system   | Sophisticated: capacity limit + demand signals                      | No documented budget system; often overloads servers
     Depth of crawl        | Follows internal link structure                                     | Less systematic; may focus on specific high-value pages
     Revisit logic         | Based on budget & freshness                                         | Training cycles: periodic bulk collection, then dormant
     Time-of-day patterns  | Distributed to respect crawl capacity limits                        | Burst patterns: spikes during training data collection
  10. Separate Crawlers vs. Agent Browsers

      Server-only crawlers (source = server | cdn, agent_browser = false)
        • No JS execution
        • Fetches robots.txt/sitemap first (usually)
        • Asset mix is HTML-heavy; few JS/CSS/img requests
        • Regular cadence, parallelized fetching
        • UA token (e.g., Googlebot, GPTBot) may be present, but is spoofable

      JS-executing crawlers (source = web, agent_browser = true)
        • JS runs (beacon fires)
        • Loads assets like a user (HTML + JS/CSS/img)
        • Often has a referrer (chat app, aggregator, preview)
        • Can appear as a Chrome/Safari user agent
        • May carry custom headers

      Trust layer: DNS-verify Google/Bing to avoid spoofed UAs.
      Behavior cues: robots/sitemap fetch sequence; bursty page hits; low asset mix.
      Heuristics: referrer patterns, asset mix, JS execution.
      Metric: direct agent percent = COUNT(agent_browser = true) / COUNT(all)

      { "event": "bot_visit", "bot_type": "search|llm", "bot_name": "Googlebot|GPTBot",
        "page_path": "/path", "agent_browser": false }

      // Optional: infer LLM source from referrer
      var ref = document.referrer || '';
      var btype = 'unknown', bname = 'browser_like_agent';
      if (ref.indexOf('chat.openai.com') > -1) { btype = 'llm'; bname = 'ChatGPT'; }
      if (ref.indexOf('claude.ai') > -1) { btype = 'llm'; bname = 'Claude'; }
      if (ref.indexOf('perplexity.ai') > -1) { btype = 'llm'; bname = 'Perplexity'; }
      if (ref.indexOf('gemini.google') > -1 || ref.indexOf('bard.google') > -1) { btype = 'llm'; bname = 'Gemini'; }
  11. Validate Before You Celebrate
      ✓ Canary URLs in sitemap only (used to verify visits, NOT for security)
      ✓ Header experiments (robots, cache-control)
      ✓ Agent spoof lab checks
      ✓ Time-to-recrawl
      ✓ Double-entry reconciliation (logs, CDN, GA4, dashboard)
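      Time-to-recrawl from the checklist falls straight out of parsed log timestamps. A sketch, assuming visit records shaped like { path, bot, time } (field names are illustrative, not from the deck):

```javascript
// Median gap, in hours, between successive visits by one bot to one path.
// Records are assumed to look like { path, bot, time: Date }.
function timeToRecrawlHours(records, path, bot) {
  var times = records
    .filter(function (r) { return r.path === path && r.bot === bot; })
    .map(function (r) { return r.time.getTime(); })
    .sort(function (a, b) { return a - b; });
  if (times.length < 2) return null; // need at least two visits to measure a gap
  var gaps = [];
  for (var i = 1; i < times.length; i++) {
    gaps.push((times[i] - times[i - 1]) / 36e5); // ms -> hours
  }
  gaps.sort(function (a, b) { return a - b; });
  var mid = Math.floor(gaps.length / 2);
  return gaps.length % 2 ? gaps[mid] : (gaps[mid - 1] + gaps[mid]) / 2;
}
```

      The median (rather than the mean) keeps one long dormant stretch from dominating the metric, which matters for the bursty LLM-bot patterns described in slide 9.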
  12. After 30-90 Days
      • Measure KPIs you can defend & test against
      • Referrals by bot/LLM
      • Conversion by bot/LLM
      • % of touches with a data-driven attribution view
      • Crawl intensity by bot
      • Bot distribution
      • Time to recrawl
      • Launch your tests & re-run for new LLM bots
  13. Sources
      • Google Search Central – “Verifying Googlebot” (reverse DNS + forward-confirm)
      • Microsoft Learn – “Verify that Bingbot is the real one”
      • OpenAI – “GPTBot” documentation (robots control & UA token)
      • Anthropic – ClaudeBot / Claude-Search crawler notes (UA tokens; robots)
      • Perplexity – PerplexityBot page (UA token; robots)
      • Cloudflare blog/investigation – reporting alleged “stealth crawling” patterns (Perplexity disputed)
      • Manning, Raghavan, Schütze – Introduction to Information Retrieval (crawl/index fundamentals)
      • Dejan Marketing – query fan-out / deep-learning variant coverage
      • iPullRank – “Vector Embeddings Is All You Need”
  14. AI Bot & Search Engine Tracking: let’s set up GTM, GA4 & GSC so we can see what’s going on!
  15. Create a Variable to Capture the Bot Name
      Variables → New → Custom JavaScript

      function() {
        var ua = navigator.userAgent.toLowerCase();
        if (ua.indexOf('gptbot') > -1) return 'GPTBot';
        if (ua.indexOf('googlebot') > -1) return 'Googlebot';
        if (ua.indexOf('perplexitybot') > -1) return 'PerplexityBot';
        if (ua.indexOf('ccbot') > -1) return 'CCBot';
        if (ua.indexOf('bytespider') > -1) return 'Bytespider';
        if (ua.indexOf('amazonbot') > -1) return 'Amazonbot';
        return 'Unknown';
      }
  16. Regex-Based Trigger
      Triggers → New → Page View → Some Page Views

      gptbot|googlebot|perplexitybot|ccbot|bytespider|amazonbot
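      The trigger regex can be sanity-checked outside GTM before you ship it. A sketch pairing it with the lowercasing done by the slide 15 variable:

```javascript
// The trigger regex from the slide, applied to a lowercased user-agent string,
// mirroring how the GTM variable lowercases navigator.userAgent first.
var botRegex = /gptbot|googlebot|perplexitybot|ccbot|bytespider|amazonbot/;

function firesTrigger(userAgent) {
  return botRegex.test(userAgent.toLowerCase());
}
```

      Note the regex must contain no stray spaces around the pipes: "gptbot| googlebot" would require a literal leading space and silently miss Googlebot.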
  17. GA4 Event Tag
      Tags → New → GA4
      1. Add event name
      2. Update event parameters
      3. Connect the trigger
  18. So Let’s Get Hybrid About This
      1. Continue to optimize for search results (Google)
      2. Log & tag important bots (search & AI)
      3. Train LLMs on YOU
      4. Create “answer assets” for LLM retrieval
  19. Train LLMs on You
      Why: LLMs like ChatGPT and Perplexity use web snapshots + RAG pipelines. You need to make your site “learnable.”
      Actions:
      • Include clear brand + product definitions above the fold (on homepage and key pages)
      • Publish FAQ-style Q&A for your domain (“What is X?” “How does Y work?”)
      • Add named entity references (your brand, features, categories) with internal links
      • Submit to Perplexity’s “Pro” knowledge base or cite-able resources
      • Ensure your sitemap.xml and robots.txt expose these pages (get ’em crawled)
  20. Create “Answer Assets” for LLM Retrieval
      Why: Most LLMs retrieve passages, not whole pages. You need snippet-optimized content.
      Actions:
      • Build high-authority explainer blocks: start with the answer → support with examples → cite yourself
      • Use bullet summaries, step-by-step guides, or glossaries
      • Add linkable anchors (#how-it-works, #pricing-explained)
      • Use OpenGraph and meta descriptions that summarize function, audience, and value
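      The anchor and OpenGraph items above look like this in page markup. A sketch with placeholder copy and a hypothetical product name; the anchor ids are the ones named on the slide:

```html
<!-- Illustrative only: placeholder product name and copy -->
<meta property="og:title" content="Acme Widget: bot-log analysis for SEO teams">
<meta property="og:description" content="What it does, who it is for, and why it matters, in one sentence.">
<meta name="description" content="Acme Widget maps search and LLM bot behavior from your server logs.">

<section id="how-it-works">
  <h2>How it works</h2>
  <p>Start with the answer, then support it with examples and cite yourself.</p>
</section>

<section id="pricing-explained">
  <h2>Pricing explained</h2>
  <p>Lead with the number, then the conditions.</p>
</section>
```

      The fragment anchors matter because retrieval pipelines that cite sources can link directly to the passage (e.g., /product#how-it-works) rather than the page top.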