
SEOIRL - Bot Behavior Decoded (complete deck)

Jori Ford
October 17, 2025


Understand how to cross-map LLM & search bots so you can generate a measurement map & plan for testing.


Transcript

  1. So Why Spend Time Mapping Bots?
     • Guesswork isn’t an option when KPIs are down
     • Measure what matters
     • Measurement turns noise into a plan
  2. Principles for the next 30 minutes
     • TEST EVERYTHING!
     • Logs mostly don’t lie
     • Unify search + LLM measurement
     • Create your own tooling so you can validate results
     • Collect → Normalize → Attribute → Visualize → Alert
  3. Looking at Log Files
     Logs reveal what bots care about, and can indicate the value of what you published.
  4. Anatomy of a Log Line

     104.28.10.1 - - [11/Apr/2025:14:22:11 +0000] "GET /faq/refund-policy HTTP/1.1" 200 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; GPTBot/1.0; +https://openai.com/gptbot)"

     IP address of GPTBot · Timestamp of request · HTTP method & URL · Status · User-Agent string
  5. Anatomy of a Log Line, annotated

     Who & Where?  104.28.10.1 (IP address of GPTBot)
     When?         [11/Apr/2025:14:22:11 +0000] (timestamp of request)
     What & How?   "GET /faq/refund-policy HTTP/1.1" 200 (HTTP method, URL, status)
     User-Agent:   "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; GPTBot/1.0; +https://openai.com/gptbot)"
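     The fields on this slide can be pulled apart with a single regex. A minimal Node.js sketch, assuming the combined-log-style line shown above (no bytes field, referrer then user agent):

```javascript
// Parse an access-log line of the shape shown on the slide.
// Captures: ip, timestamp, method, path, status, referrer, user agent.
function parseLogLine(line) {
  var re = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) "([^"]*)" "([^"]*)"$/;
  var m = line.match(re);
  if (!m) return null; // line does not match the assumed format
  return {
    ip: m[1],
    timestamp: m[2],
    method: m[3],
    path: m[4],
    status: Number(m[5]),
    referrer: m[6],
    userAgent: m[7]
  };
}
```

     Real log formats vary (bytes field, extra headers), so treat the regex as a starting point and adjust it to your server's access-log format.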
  6. Logs Can Answer
     Who? – Which user agent visited?
     What? – What pages are important?
     When? – How frequently am I getting crawled?
     Where? – Where is my audience looking for my brand?
     How? – How is my plan doing?
  7. Bot Reference: official UA token, robots behavior, IP/verification guidance, notes, and the tokens to use in rules

     Googlebot (token: Googlebot)
       Robots: respects robots.txt; crawl-delay not supported; uses desktop & smartphone crawlers; fetches sitemaps.
       Verification: reverse DNS verification recommended; Google documents the method.
       Notes: many UA variants (mobile/desktop, Google-InspectionTool); crawl budget applies; watch robots.txt, sitemaps, canonicals, 301s.
       Rules: User-agent: Googlebot (optionally also Google-InspectionTool)

     Bingbot (token: bingbot)
       Robots: respects robots.txt; supports crawl control in Bing Webmaster Tools; fetches sitemaps.
       Verification: reverse DNS method documented by Microsoft.
       Notes: UA variants exist; can switch to new UA formats; also see AdIdxBot/MSNBot in some cases.
       Rules: User-agent: bingbot

     GPTBot (OpenAI) (token: GPTBot)
       Robots: respects robots.txt; allows global or path-level opt-out; will not crawl disallowed content.
       Verification: OpenAI documents allow/deny rules; IP ranges are periodically published but can change, so treat UA + robots as primary.
       Notes: fetches typical HTML and many text formats; shows up heavily on answer-dense pages.
       Rules: User-agent: GPTBot

     ClaudeBot / Claude-Search (Anthropic) (tokens: ClaudeBot, Claude-Search or Claude-SearchBot)
       Robots: respects robots.txt; supports site-level and path-level disallow.
       Verification: no public reverse-DNS method; rely on robots + behavior patterns.
       Notes: you may see separate tokens for training vs. search; keep rules broad unless you need granularity.
       Rules: User-agent: ClaudeBot and/or Claude-Search*

     PerplexityBot (token: PerplexityBot)
       Robots: respects robots.txt; supports site/path disallow.
       Verification: no official reverse-DNS method published; rely on UA token + robots adherence.
       Notes: often crawls sources it later cites; intensity can spike on fresh, answerable content.
       Rules: User-agent: PerplexityBot
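     The "Rules" tokens above drop straight into robots.txt. A sketch that leaves search crawlers unrestricted while keeping LLM crawlers out of one section; the /internal/ path and example.com sitemap URL are placeholders:

```
# Illustrative robots.txt using the UA tokens above; paths are placeholders
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: GPTBot
Disallow: /internal/

User-agent: ClaudeBot
Disallow: /internal/

User-agent: PerplexityBot
Disallow: /internal/

Sitemap: https://www.example.com/sitemap.xml
```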
  8. Like a librarian: knows the books in each section and gives you a list of recommended reading so you can find your answer.
  9. Search vs. LLM Bot Comparison

     Behavior              | Search                                                              | LLM
     Crawl frequency       | Per-URL scheduling: minutes to months based on PageRank + freshness | No transparent scheduling; burst activity during training cycles
     Crawl budget system   | Sophisticated: capacity limit + demand signals                      | No documented budget system; often overloads servers
     Depth of crawl        | Follows internal link structure                                     | Less systematic; may focus on specific high-value pages
     Revisit logic         | Based on budget & freshness                                         | Training cycles: periodic bulk collection, then dormant
     Time-of-day patterns  | Distributed to respect crawl capacity limits                        | Burst patterns: spikes during training data collection
  10. Separate Crawlers vs. Agent Browsers

      Server-only crawlers (source = server | cdn, agent_browser = false)
        • No JS execution
        • Fetches robots.txt/sitemap first (usually)
        • Asset mix is HTML-heavy; few JS/CSS/img requests
        • Regular cadence, parallelized fetching
        • UA token (e.g., Googlebot, GPTBot) may be present, but is spoofable

      JS-executing crawlers (source = web, agent_browser = true)
        • JS runs (beacon fires)
        • Loads assets like a user (HTML + JS/CSS/img)
        • Often has a referrer (chat app, aggregator, preview)
        • Can appear as a Chrome/Safari user agent
        • May carry custom headers

      Trust layer: DNS-verify Google/Bing to avoid spoofed UAs.
      Behavior cues: robots/sitemap fetch sequence; bursty page hits; low asset mix.
      Heuristics: referrer patterns, asset mix, JS execution.
      Metric: direct agent percent = COUNT(agent_browser = true) / COUNT(all)

      { "event": "bot_visit", "bot_type": "search|llm", "bot_name": "Googlebot|GPTBot",
        "page_path": "/path", "agent_browser": false }

      // Optional: infer LLM source from referrer
      var ref = document.referrer || '';
      var btype = 'unknown', bname = 'browser_like_agent';
      if (ref.indexOf('chat.openai.com') > -1) { btype = 'llm'; bname = 'ChatGPT'; }
      if (ref.indexOf('claude.ai') > -1) { btype = 'llm'; bname = 'Claude'; }
      if (ref.indexOf('perplexity.ai') > -1) { btype = 'llm'; bname = 'Perplexity'; }
      if (ref.indexOf('gemini.google') > -1 || ref.indexOf('bard.google') > -1) { btype = 'llm'; bname = 'Gemini'; }
  11. Validate Before You Celebrate
      ✓ Canary URLs in sitemap only (used to verify visits, NOT for security)
      ✓ Header experiments (robots, cache-control)
      ✓ Agent spoof lab checks
      ✓ Time-to-recrawl
      ✓ Double-entry reconciliation (logs, CDN, GA4, dashboard)
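      Time-to-recrawl from the checklist falls straight out of parsed log timestamps. A sketch, assuming visit records shaped like { path, bot, time } (field names are illustrative, not from the deck):

```javascript
// Median gap, in hours, between successive visits by one bot to one path.
// Records are assumed to look like { path, bot, time: Date }.
function timeToRecrawlHours(records, path, bot) {
  var times = records
    .filter(function (r) { return r.path === path && r.bot === bot; })
    .map(function (r) { return r.time.getTime(); })
    .sort(function (a, b) { return a - b; });
  if (times.length < 2) return null; // need at least two visits to measure a gap
  var gaps = [];
  for (var i = 1; i < times.length; i++) {
    gaps.push((times[i] - times[i - 1]) / 36e5); // ms -> hours
  }
  gaps.sort(function (a, b) { return a - b; });
  var mid = Math.floor(gaps.length / 2);
  return gaps.length % 2 ? gaps[mid] : (gaps[mid - 1] + gaps[mid]) / 2;
}
```

      The median (rather than the mean) keeps one long dormant stretch from dominating the metric, which matters for the bursty LLM-bot patterns described in slide 9.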
  12. After 30-90 Days
      • Measure KPIs you can defend & test against
      • Referrals by bot/LLM
      • Conversion by bot/LLM
      • % of touches with a data-driven attribution view
      • Crawl intensity by bot
      • Bot distribution
      • Time to recrawl
      • Launch your tests & re-run for new LLM bots
  13. Sources
      • Google Search Central – “Verifying Googlebot” (reverse DNS + forward-confirm)
      • Microsoft Learn – “Verify that Bingbot is the real one”
      • OpenAI – “GPTBot” documentation (robots control & UA token)
      • Anthropic – ClaudeBot / Claude-Search crawler notes (UA tokens; robots)
      • Perplexity – PerplexityBot page (UA token; robots)
      • Cloudflare blog/investigation – reporting alleged “stealth crawling” patterns (Perplexity disputed)
      • Manning, Raghavan, Schütze – Introduction to Information Retrieval (crawl/index fundamentals)
      • Dejan Marketing – query fan-out / deep-learning variant coverage
      • iPullRank – “Vector Embeddings Is All You Need”
  14. AI Bot & Search Engine Tracking: let’s set up GTM, GA4 & GSC so we can see what’s going on!
  15. Create a Variable to Capture the Bot Name
      Variables → New → Custom JavaScript

      function() {
        var ua = navigator.userAgent.toLowerCase();
        if (ua.indexOf('gptbot') > -1) return 'GPTBot';
        if (ua.indexOf('googlebot') > -1) return 'Googlebot';
        if (ua.indexOf('perplexitybot') > -1) return 'PerplexityBot';
        if (ua.indexOf('ccbot') > -1) return 'CCBot';
        if (ua.indexOf('bytespider') > -1) return 'Bytespider';
        if (ua.indexOf('amazonbot') > -1) return 'Amazonbot';
        return 'Unknown';
      }
  16. Regex-Based Trigger
      Triggers → New → Page View → Some Page Views

      gptbot|googlebot|perplexitybot|ccbot|bytespider|amazonbot
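      The trigger regex can be sanity-checked outside GTM before you ship it. A sketch pairing it with the lowercasing done by the slide 15 variable:

```javascript
// The trigger regex from the slide, applied to a lowercased user-agent string,
// mirroring how the GTM variable lowercases navigator.userAgent first.
var botRegex = /gptbot|googlebot|perplexitybot|ccbot|bytespider|amazonbot/;

function firesTrigger(userAgent) {
  return botRegex.test(userAgent.toLowerCase());
}
```

      Note the regex must contain no stray spaces around the pipes: "gptbot| googlebot" would require a literal leading space and silently miss Googlebot.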
  17. GA4 Event Tag
      Tags → New → GA4
      1. Add event name
      2. Update event parameters
      3. Connect the trigger
  18. So Let’s Get Hybrid About This
      1. Continue to optimize for search results (Google)
      2. Log & tag important bots (search & AI)
      3. Train LLMs on YOU
      4. Create “answer assets” for LLM retrieval
  19. Train LLMs on You
      Why: LLMs like ChatGPT and Perplexity use web snapshots + RAG pipelines. You need to make your site “learnable.”
      Actions:
      • Include clear brand + product definitions above the fold (on homepage and key pages)
      • Publish FAQ-style Q&A for your domain (“What is X?” “How does Y work?”)
      • Add named entity references (your brand, features, categories) with internal links
      • Submit to Perplexity’s “Pro” knowledge base or cite-able resources
      • Ensure your sitemap.xml and robots.txt expose these pages (get ’em crawled)
  20. Create “Answer Assets” for LLM Retrieval
      Why: Most LLMs retrieve passages, not whole pages. You need snippet-optimized content.
      Actions:
      • Build high-authority explainer blocks: start with the answer → support with examples → cite yourself
      • Use bullet summaries, step-by-step guides, or glossaries
      • Add linkable anchors (#how-it-works, #pricing-explained)
      • Use OpenGraph and meta descriptions that summarize function, audience, and value
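      The anchor and OpenGraph items above look like this in page markup. A sketch with placeholder copy and a hypothetical product name; the anchor ids are the ones named on the slide:

```html
<!-- Illustrative only: placeholder product name and copy -->
<meta property="og:title" content="Acme Widget: bot-log analysis for SEO teams">
<meta property="og:description" content="What it does, who it is for, and why it matters, in one sentence.">
<meta name="description" content="Acme Widget maps search and LLM bot behavior from your server logs.">

<section id="how-it-works">
  <h2>How it works</h2>
  <p>Start with the answer, then support it with examples and cite yourself.</p>
</section>

<section id="pricing-explained">
  <h2>Pricing explained</h2>
  <p>Lead with the number, then the conditions.</p>
</section>
```

      The fragment anchors matter because retrieval pipelines that cite sources can link directly to the passage (e.g., /product#how-it-works) rather than the page top.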