Taming the Bots: AI Crawler Management for the Modern SEO

As AI crawlers from ChatGPT, Perplexity, Claude, and Google multiply, your server logs, crawl budget, and content strategy are all affected. Learn how to identify, segment, and manage AI bot traffic — and turn what looks like a threat into a visibility opportunity.

Athens SEO PRO

May 27, 2026

Transcript

  1. Athens SEO 2026 Welcomes On Stage 100% Human Technical SEO,

    Not a Robot Jamie Indigo Your photo here
  2. JAMIE INDIGO Not a robot; speaks bot Doesn't care about

    your AI product Director of Technical SEO Cox Automotive Author of While You Were Offline 80% documentation, 20% snark /in/jamie-indigo not-a-robot.com
  3. It's not just Google anymore. Each platform has its own rules.

    ChatGPT
    • Limited search documentation (2 dev docs)
    • No 1st party reporting hub
    • Documented user-agents
    • Declared IP ranges

    Perplexity
    • Limited search documentation (1 dev doc)
    • No 1st party reporting hub
    • Documented user-agents
    • Declared IP ranges

    Gemini
    • Limited search documentation
    • No 1st party reporting hub
    • No published IP range list (uses Google Cloud and Workspace IPs)

    Claude
    • Limited search documentation (2 dev docs)
    • No 1st party reporting hub
    • Declared IP ranges

    Grok
    • No 1st party reporting hub
    • Laughable documentation
    • No declared IP ranges
    • No declared user-agent

    AI Overviews/Mode
    • A first party reporting hub… that you can't filter to AIO/M
    • Some documentation
    • Uses Google infrastructure and cache
  4. Let's build a model

    1. Training crawler requests a page 👾
    2. Training crawlers often do not render the page (meaning they can't see content created by scripts) 🛠
    3. Training crawler feeds the page into the corpus 📚
    4. The corpus is used to train the model 🔮
    5. Users ask the model a question. The model uses the user's prior interactions and user profile information to interpret the question. 👤
    6. If the model is confident it knows the answer, it responds based on information from the corpus. If it's not confident, it executes Retrieval-Augmented Generation (RAG) 🤖
    7. The model surfaces an answer in its response 💻
  5. Agenda Identify Let's read the latin Segment Let's play with

    data Triage + Opportunity Let's take action
  6. Server logs are a record of every request a server

    receives. They are a source of truth. (And my most favorite thing)
  7. Edge: CDN, DNS, and WAF handle requests before they even hit your infrastructure. Each generates its own logs (cache hits, query traces, blocked requests).

    Network: The load balancer sits here, logging every upstream request, health check, and routing decision.

    Web server: nginx or Apache write access and error logs; the TLS/proxy tier adds SSL handshake records.

    Application: Your app servers, auth service, and API gateway each generate their own logs covering business logic, login events, rate limiting, etc.

    Data: Databases (query/audit), message queues (producer/consumer), and caches (hit/miss/eviction) all produce logs that are easy to overlook.

    Infrastructure: The foundation: OS-level syslog, container stdout/stderr and Kubernetes pod events, and the log aggregator (ELK, Splunk, etc.) that pulls everything together.
  8. DIY: Accessing Log Files Apache (Linux Server) NGINX (Linux Server)

    IIS log files (Windows Server) AWS Load Balancer (Load Balancer) Google Cloud Load Balancer (Load Balancer) AWS Cloudfront (CDN) Accessing CloudFlare log files (CDN) Incapsula (CDN/DDoS Mitigation) Akamai logs (CDN/DDoS Mitigation)
  9. Internal Log Requests (Someone else already has them)

    • Ask: Is there already a log management platform in place?
    • Be clear: We do not want Personally Identifiable Information (PII); request that it be removed
    • Check your CDN for data on edge node (cached) vs server (uncached) hits
    • Get: Log format and definitions
    • You want enough log data to get an accurate picture
  10. 216.150.168.131 emeasrvr003 [07/Mar/2018:16:11:58 -0800] 66.249.66.1 GET /twiki/bin/view/TWiki/WikiSyntax?q=ntoon HTTP/1.1 www.example.com 200 7352 616 - Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) https://www.example.com/en/ indiegogo

    Slides 10 to 21 repeat this log line, highlighting one field per slide: Server IP (10), Server Name (11), Date & Time (12), Requester's IP (13), Request Method (14), Hostname (15), Response Code (16), Response Size (17, 18), Response Time (19), Requester's User Agent (20), Referring URL (21).
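The field walk-through above can be turned into a small parser. This is a minimal sketch in Python, assuming the fields are whitespace-separated and appear in the order of the sample line; the function name `parse_line` and the dictionary keys are illustrative, not part of the talk. The three tokens between the status code and the user agent (`7352 616 -`) are left unnamed because the transcript does not make their mapping to "Response Size" and "Response Time" certain.

```python
from datetime import datetime

# Sample line from the slides, with the line-wrap gaps in the user agent closed up.
LINE = (
    "216.150.168.131 emeasrvr003 [07/Mar/2018:16:11:58 -0800] 66.249.66.1 "
    "GET /twiki/bin/view/TWiki/WikiSyntax?q=ntoon HTTP/1.1 www.example.com 200 "
    "7352 616 - "
    "Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+"
    "(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+"
    "(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) "
    "https://www.example.com/en/ indiegogo"
)


def parse_line(line):
    """Split one log line into named fields (field order is an assumption)."""
    tokens = line.split()
    # The timestamp is the only field that contains a space, so it spans two tokens.
    raw_time = tokens[2].lstrip("[") + " " + tokens[3].rstrip("]")
    return {
        "server_ip": tokens[0],
        "server_name": tokens[1],
        "timestamp": datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z"),
        "client_ip": tokens[4],
        "method": tokens[5],
        "path": tokens[6],
        "protocol": tokens[7],
        "host": tokens[8],
        "status": int(tokens[9]),
        # Kept in its raw "+"-encoded form; decoding would be lossy here.
        "user_agent": tokens[13],
        "referrer": tokens[14],
    }


record = parse_line(LINE)
print(record["client_ip"], record["status"], record["timestamp"].isoformat())
```

Real log formats vary by server and CDN, so the token positions should be checked against the format definition you get from whoever owns the logs.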
  22. Get to know your log format

    Data element (JSON key): examples. Description.

    • HTTP status codes (statusCode): 0, 200, 206, 404. The HTTP status code sent in the response. Returns 0 if the TCP connection to the client ended before the server sent a response.
    • Client IP (cliIP): 198.18.77.18. The IPv4 or IPv6 address of the requesting client. See IPv6 in RFC 5952.
    • Bytes (bytes): 4995. The content bytes served in the response body. For HTTP/2, this includes framing overhead bytes used when multiplexing multiple responses on one connection.
    • Protocol type (proto): HTTP/1.1, HTTPS/1.1, HTTP/2, HTTP3. The protocol of the response-request cycle.
    • Query string (queryStr): q=foo&submit=true. The query string in the incoming URI from the client. To monitor this parameter in your logs, you need to update your stream's property configuration to set the Cache Key Query Parameters behavior to include all parameters. See Cache Key Query Parameters.
    • Request host (reqHost): splat-traffic.205400.akamai.com. The value of the Host header in the request with the domain name of the server and the TCP port number on which the server is listening. If no port is included, the default port for the service requested is implied. For example, 443 for an HTTPS URL, and 80 for an HTTP URL. A Host header must be present in HTTP/1.1 requests. If a request lacks this header or has more than one, the server may respond with a 400 status code. See Host in RFC 7230.
    • Request method (reqMethod): GET, POST, PUT, OPTIONS. The HTTP method of the request.
    • Request path (reqPath): path1/path2/file.ext. The path to a resource in the incoming URI without query parameters. See the Query string field.
    • Response Content-Length (rspContentLen): -1, 0, 5000. The size of object data returned to the client without HTTP response headers. The Akamai Edge logs the object size even if there is no Content-Length header. Returns -1 if the size can't be determined, for example, when the connection ended before the edge server received the complete object from the origin.
    • Response Content-Type (rspContentType): text-plain, text-html. The value of the Content-Type header in the response with the media type of the returned content. Returns - if unknown or not set. The 304 Not Modified response usually does not return this header. See Content-Type in RFC 7231, Partial Content in RFC 7233, and Media types.
    • User-Agent (UA): Mozilla%2F5.0+%28Macintosh. The URI-encoded value of the User-Agent header in the request. It lets edge servers identify the application, operating system, vendor, or version of the requesting user agent. This field is RFC-1738 escaped. See the note on hex-encoding above the table, User-Agent in RFC 7231 and RFC 2616.
  23. How do we analyze logs at scale?

    • Paid: CDN-level (Akamai, Cloudflare), Botify, Logz.io, Sumo Logic, Splunk
    • Free(mium): Screaming Frog Log Analyzer, BigQuery
    • Masochistic: Excel, Command Line

    Example record:
    • @timestamp: May 23, 2026 @ 08:38:38.000
    • IP: 74.7.36.71
    • LogSize: 1.5KB
    • Method: GET
    • base_domain: www.athenseo.com
    • bytes: 163,281
    • full_request: /speakers/jamie-indigo
    • name: com/bot
    • os: Other
    • os_name: Other
    • referrer: "-", -
    • request: /speakers/jamie-indigo
    • response: 200
    • status_code: 200
    • subfolder_1: speakers
    • subfolder_2: jamie-indigo
    • subfolder_3:
    • type: apache
    • user_agent: Mozilla/5.0%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko);%20compatible;%20ChatGPT-User/1.0;%20+https://openai.com/bot
  24. The AI Crawler Zoo

    • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
    • Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; [email protected])
    • Mozilla/5.0 (compatible; Bytespider; [email protected]) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
    • Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.7778.96 Mobile Safari/537.36 (compatible; GoogleOther)
    • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)
    • Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; TikTokSpider; [email protected])
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])
    • meta-webindexer/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
    • Verity/1.1 (https://gumgum.com/verity; [email protected])
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; [email protected])
    • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; ClaudeBot/1.0; [email protected])
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
    • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36
    • Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (HTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
    • CCBot/2.0 (https://commoncrawl.org/faq/)
    • meta-externalfetcher/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
    • DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html)
    • Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
    • Google-NotebookLM
  25. Let's find out who's crawling with 6 query blocks

    1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit

    SELECT UA, count(*) AS requests
    FROM myLogs
    WHERE UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%'
    GROUP BY UA
    ORDER BY requests DESC
    LIMIT 100
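For anyone in the "masochistic" command-line camp, the same count can be approximated without a query engine. This is a minimal Python sketch, assuming the user-agent strings have already been pulled out of the logs; `BOT_TOKENS` and `count_bot_hits` are illustrative names. Unlike the SQL above, which groups by the full UA string, it groups by bot token, which is closer to the summary table on the next slide.

```python
from collections import Counter

# Tokens to look for; extend with whichever crawlers matter to you.
BOT_TOKENS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


def count_bot_hits(user_agents, tokens=BOT_TOKENS):
    """Count requests per bot token, matching case-insensitively."""
    hits = Counter()
    for ua in user_agents:
        ua_lower = ua.lower()
        for token in tokens:
            if token.lower() in ua_lower:
                hits[token] += 1
                break  # one request counts toward one bot only
    return hits.most_common()  # sorted by request count, descending


sample_uas = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36",
]
print(count_bot_hits(sample_uas))
```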
  26. Use case: Who's crawling?

    UA: Hits
    • bytespider: 40,164,250
    • oai-searchbot: 7,417,706
    • gptbot: 5,080,958
    • chatgpt-user: 3,499,647
    • claudebot: 2,210,478
    • claude-searchbot: 459,281
    • PerplexityBot: 276,516
    • ChatGPT Agent: 180,021
    • claude-user: 7,486
    • perplexity-user: 5,641
    • google-agent: 1,400
    • mistralai-user: 249
    • gemini-deep-research: 232
  27. Use case: What are they crawling?

    1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit

    SELECT concat('https://', reqHost, reqPath) AS url, count(*) AS requests
    FROM myLogs
    WHERE UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%'
    GROUP BY url
    ORDER BY requests DESC
    LIMIT 100

    Building a full URL for our results!
  28. Use case: What are they crawling?

    url: requests
    • https://www.example.com: 14,076
    • https://www.example.com/hi-new-friends: 5,857
    • https://www.example.com/api/product-data: 1,101
    • https://www.example.com/my-account: 983
    • https://www.example.com/best-sellers: 699
    • https://www.example.com/sale///: 689
    • https://www.example.com/blog//: 650
    • https://www.example.com/new//: 476
  29. Yeah but I want to see it over time

    1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit

    SELECT date, url, requests
    FROM (
    SELECT DATE_TRUNC('day', reqTimeSec) AS date, concat('https://', reqHost, reqPath) AS url, count(*) AS requests, ROW_NUMBER() OVER (PARTITION BY date ORDER BY requests DESC) AS rank
    FROM myLogs
    WHERE ( UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%' )
    AND reqTimeSec >= '2026-04-01 00:00:00' AND reqTimeSec < '2026-05-01 00:00:00'
    GROUP BY date, url
    )
    WHERE rank <= 10
    ORDER BY date ASC, requests DESC

    Ooh! A window function performs calculations across a set of rows that are related to the current row.
    AND lets us add more conditions, like start and end dates. The UA conditions are wrapped in parentheses so the date range applies to all of them.
    The outer WHERE acts as our limit thanks to the window function.
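The "top N per day" idea behind the window function can also be sketched outside SQL. This is a minimal Python illustration, assuming each request has already been reduced to a (day, url) pair; `top_urls_per_day` is a hypothetical helper name, not something from the talk.

```python
from collections import Counter, defaultdict


def top_urls_per_day(rows, n=10):
    """rows: iterable of (day, url) pairs, one pair per bot request.

    Returns {day: [(url, requests), ...]} with at most n URLs per day,
    which mirrors PARTITION BY date ... WHERE rank <= n.
    """
    per_day = defaultdict(Counter)
    for day, url in rows:
        per_day[day][url] += 1
    return {day: per_day[day].most_common(n) for day in sorted(per_day)}


sample_rows = [
    ("2026-04-01", "https://www.example.com/best-sellers"),
    ("2026-04-01", "https://www.example.com/best-sellers"),
    ("2026-04-01", "https://www.example.com/my-account"),
    ("2026-04-02", "https://www.example.com/my-account"),
]
print(top_urls_per_day(sample_rows, n=1))
```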
  30. Use case: Does the requested content exist?

    1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit

    SELECT $__timeInterval(reqTimeSec) AS time,
    countIf(statusCode between 200 and 299) / count(*) AS rate2xx,
    countIf(statusCode between 300 and 399) / count(*) AS rate3xx,
    countIf(statusCode between 400 and 499) / count(*) AS rate4xx,
    countIf(statusCode between 500 and 599) / count(*) AS rate5xx
    FROM myLogs
    WHERE ( UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%' )
    AND $__timeFilter(reqTimeSec)
    GROUP BY time
    ORDER BY time ASC
    LIMIT 25
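The rate columns in that query are plain ratios, which a short Python sketch makes explicit. It assumes a list of status codes for one time bucket; `status_class_rates` is an illustrative name. As in the SQL, the denominator is every request in the bucket, so codes outside 200 to 599 (such as the 0 that some CDNs log for dropped connections) lower all four rates.

```python
from collections import Counter


def status_class_rates(status_codes):
    """Return the share of 2xx/3xx/4xx/5xx responses among all requests."""
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return {cls: 0.0 for cls in ("2xx", "3xx", "4xx", "5xx")}
    return {cls: classes.get(cls, 0) / total for cls in ("2xx", "3xx", "4xx", "5xx")}


print(status_class_rates([200, 200, 301, 404]))
```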
  31. All these UAs can be segmented into 4 buckets

    Columns: Training · User-Initiated · Search Indexer · Ads · Deprecated

    • OpenAI: GPTBot (training), ChatGPT-User (user-initiated), OAI-SearchBot (search indexer), OAI-AdsBot (ads)
    • Gemini: Google-Extended (training), Googlebot, Googlebot
    • Claude: ClaudeBot (training), Claude-User (user-initiated), Claude-SearchBot (search indexer); deprecated: Anthropic-AI, Claude-Web
    • Perplexity: uses a variety of models from Google, OpenAI, Anthropic, and their internal "Sonar" models; Perplexity-User (user-initiated), PerplexityBot (search indexer, not used to train models)
    • Microsoft Copilot: uses GPT 4 & 5 series + Claude for GitHub and 365 products; Bingbot, Bingbot
    • Comet: standard Chromium + Referrer-Policy: no-referrer
    • Atlas: ChatGPTBrowser, ChatGPT Atlas, standard Chromium
    • AI Overviews: uses Gemini 3 models; Googlebot, Googlebot
    • AI Mode: uses Gemini 3 models; Googlebot, Googlebot
  32. SELECT multiIf(
    UA like '%GPTBot%', 'Training',
    UA like '%Google-Extended%', 'Training',
    ...
    UA like '%Claude-SearchBot%', 'Search Indexer',
    UA like '%PerplexityBot%', 'Search Indexer',
    'Other'
    ) AS crawlerPurpose,
    count(*) AS requests, count(distinct(reqPath)) AS uniqueURLs, count(distinct(UA)) AS uniqueAgents
    FROM myLogz
    WHERE ( UA like '%DuckAssistBot%' OR UA like '%ClaudeBot%' OR UA like '%GPTBot%' ... OR UA like '%OAI-SearchBot%' OR UA like '%ChatGPT-User%' )
    GROUP BY crawlerPurpose
    ORDER BY requests DESC
    LIMIT 5

    1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit

    Bonus scalar function! multiIf needs a final fallback value ('Other' here) after its condition/result pairs.
    Count distinct and you know how many unique UAs!
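  33. Use case: Crawl by purpose

    crawlerPurpose: requests, uniqueURLs, uniqueAgents
    • Search Indexer: 300,619 requests, 132,092 unique URLs, 21 unique agents
    • User-Initiated: 41,982 requests, 37,131 unique URLs, 13 unique agents
    • Training: 16,841 requests, 15,064 unique URLs, 4 unique agents
    • Ads: 49 requests, 4 unique URLs, 1 unique agent

    Share of requests: Search Indexer 83.6% · User-initiated 11.7% · Training 4.7%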
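The percentages on this slide follow directly from the request counts. A short Python check, using only the numbers shown above (the dictionary and function names are illustrative):

```python
# Request counts per crawler purpose, taken from the slide.
REQUESTS = {
    "Search Indexer": 300_619,
    "User-Initiated": 41_982,
    "Training": 16_841,
    "Ads": 49,
}


def request_shares(counts):
    """Return each bucket's share of total requests, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {purpose: round(100 * n / total, 1) for purpose, n in counts.items()}


print(request_shares(REQUESTS))
```

Ads traffic rounds to 0.0% of requests, which is presumably why it does not appear in the chart.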
  34. url: requests
    • https://www.example.com: 14,076
    • https://www.example.com/hi-new-friends: 5,857
    • https://www.example.com/api/product-data: 1,101
    • https://www.example.com/my-account: 983
    • https://www.example.com/best-sellers: 699

    But should it be crawled? Should it be trained on? These bots are lazy and dumb. If that API endpoint has everything they need, then that's where they'll send the user. The user will close the tab and move to the next source. (Plus, it's expensive to keep feeding these greedy little monsters.)
  35. Controlling Training/"Index"

    • data-nosnippet attribute
    ◦ Blocks specific page elements (div, span, section) from appearing in snippets/grounding
    ◦ Preserves page ranking while hiding selected fragments
    • <meta name="robots" content="nosnippet">
    ◦ Blocks the whole page from appearing in snippets/grounding
    • X-Robots-Tag: nosnippet
    ◦ For non-HTML resources, such as PDF files, video files, or image files
  36. Source:AI Insights | Cloudflare Radar Server strain is real and

    costly Crawl-to-refer ratio by AI platform Ratio of HTML page crawl requests to HTML page referrals by platform
  37. Block training crawlers = reduce presence in AI-generated answers over

    time (model doesn't "know" you). Block live inference crawlers = present in the model, no citation traffic. These are opposite problems. Make these decisions deliberate, not default. The training vs. inference tradeoff — the strategic question your clients haven't asked yet
  38. Does it even use the scripts it keeps requesting!?

    AI tools don't provide first-party documentation, so technical SEOs are constantly testing. This chart shows rendering capabilities as of December 2025, as documented by Merj using middleware. Source: Layout, Style, and Rendering - When Things Go Wrong, Giacomo Zecchini

    Crawler: Rendering
    • Googlebot ✅ · Bingbot ✅ · Yandex ✅ · Naver ✅ · Baidu ✅
    • CCBot ❌ · GPTBot ❌ · OAI-SearchBot ❌ · ChatGPT-User ❌ · ClaudeBot ✅ · Claude-User ❌ · Claude-SearchBot ❌ · PerplexityBot ❌
    • GrokBot ❌ · MistralAI-User ✅ · Bytespider ✅ · meta-externalagent ✅ · Applebot ✅ · AdsBot-Google ✅ · adidxbot ✅
  39. 85% of AI referral traffic comes from platforms that do not execute JavaScript

    Does not render 85.3% · Renders 11.6% · Unknown 3.2%

    Source: AI Chatbot Market Share Worldwide | Statcounter Global Stats
  40. Control Crawl • Robots.txt ◦ Blocks crawling… for polite bots

    ◦ Removes page from search index • X-Robots-Tag ◦ Index control for non-HTML resources ◦ Does not appear to be respected by OpenAI crawlers
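How a polite bot reads robots.txt can be checked locally with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt content below is a made-up example (not from the talk) that blocks an API path for GPTBot and repeats the rule for CCBot, and `can_crawl` is an illustrative helper. It only tells you what a compliant crawler should do; it says nothing about bots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /api/

User-agent: CCBot
Disallow: /api/

User-agent: *
Allow: /
"""


def can_crawl(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler with this agent name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(can_crawl("GPTBot", "https://www.example.com/api/product-data"))
print(can_crawl("GPTBot", "https://www.example.com/best-sellers"))
print(can_crawl("Googlebot", "https://www.example.com/api/product-data"))
```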
  41. Speaking of "Index", what if we could make one?

    We can query how many unique URLs we have. We can query how many unique URLs GPTBot has requested.
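One way to read those two numbers together is as a coverage ratio: the share of your known URLs that a given bot has requested at least once. This is a minimal Python sketch of that interpretation, which is an assumption about where the slide is heading; `crawl_coverage` and the sample URLs are illustrative.

```python
def crawl_coverage(site_urls, bot_requested_urls):
    """Share of known site URLs that the bot has requested at least once."""
    site = set(site_urls)
    if not site:
        return 0.0
    # URLs the bot requested that are not in the site list are ignored here.
    seen = site & set(bot_requested_urls)
    return len(seen) / len(site)


site = ["/", "/best-sellers", "/blog/", "/new/"]
gptbot_requests = ["/", "/best-sellers", "/api/product-data", "/"]
print(crawl_coverage(site, gptbot_requests))
```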
  42. We can even track the impact of performance • Sites

    with CLS ≤ 0.1 recorded a 29.8% higher inclusion rate in generative summaries compared with sites above this threshold • Pages delivering LCP ≤ 2.5 seconds were 1.47X more likely to appear in AI outputs than slower pages • Crawlers abandoned requests for 18% of pages >1MB • TTFB under 200 ms correlated with a 22% increase in citation density, particularly when paired with robust caching strategies. • A SALT.agency study shows that performance improvements do more than enhance user experience. They directly increase the probability of being cited or surfaced by AI systems. Technical SEO for AI Search - SALT.agency®
  43. Treat yourself to analytics while we're at it

    SELECT multiIf(
    referer like '%chat.openai.com%' OR queryStr like '%utm_source=chatgpt%', 'ChatGPT',
    referer like '%perplexity.ai%', 'Perplexity',
    ...
    referer like '%copilot.microsoft.com%', 'Microsoft Copilot',
    'Other'
    ) AS aiSource,
    count(*) AS sessions, count(distinct(cliIP)) AS uniqueUsers, countIf(statusCode between 200 and 299) / count(*) AS successRate, ${aggregation}(totalBytes) AS avgBytes
    FROM myLogz
    WHERE $__timeFilter(reqTimeSec) AND ( referer like '%chat.openai.com%' OR queryStr like '%utm_source=chatgpt%' ... OR referer like '%copilot.microsoft.com%' )
    GROUP BY aiSource
    ORDER BY sessions DESC
    LIMIT 10
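The referrer matching in that query can be prototyped in Python before it goes into a dashboard. This is a minimal sketch that uses the three sources named on the slide; `AI_REFERRER_HOSTS` and `classify_ai_source` are illustrative names, and matching on the parsed hostname (rather than a substring anywhere in the URL) is a design choice of this sketch, not of the original query.

```python
from urllib.parse import parse_qs, urlparse

# Referrer domains named on the slide; the elided ones are left out.
AI_REFERRER_HOSTS = {
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "copilot.microsoft.com": "Microsoft Copilot",
}


def classify_ai_source(referer, query_string=""):
    """Return the AI platform a visit came from, or None if it is not recognized."""
    host = urlparse(referer).hostname or ""
    for domain, name in AI_REFERRER_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            return name
    # Fall back to the utm_source parameter, as the SQL does for ChatGPT.
    sources = parse_qs(query_string).get("utm_source", [])
    if any(value.startswith("chatgpt") for value in sources):
        return "ChatGPT"
    return None


print(classify_ai_source("https://www.perplexity.ai/search?q=athens+seo"))
print(classify_ai_source("-", "utm_source=chatgpt.com"))
```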
  44. We can even see AI agents that don't trigger CSR analytics

    SELECT multiIf(
    UA like '%ChatGPT Agent%', 'ChatGPT Agent',
    UA like '%google-agent%', 'Google Agent',
    'Other'
    ) AS agentName,
    count(*) AS requests, count(distinct(cliIP)) AS uniqueSessions, count(distinct(reqPath)) AS uniquePaths, requests / uniqueSessions AS avgRequestsPerSession,
    countIf(reqMethod = 'POST') AS postRequests, countIf(reqMethod = 'POST') / count(*) AS postRate,
    countIf(statusCode between 200 and 299) / count(*) AS successRate, countIf(statusCode between 400 and 499) / count(*) AS rate4xx,
    ${aggregation}(turnAroundTimeMSec) AS avgTurnaround, sum(totalBytes) AS totalBandwidth
    FROM myLogz
    WHERE $__timeFilter(reqTimeSec) AND ( BotnetID like '%Google-Agent%' OR BotnetID like '%Claude-User%' )
    GROUP BY agentName
    ORDER BY requests DESC
    LIMIT 10

    POST method is a key indicator of an agent
  45. Let me love you

    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)
    • ChatGPT%20Atlas/20251021184832000 CFNetwork/3860.100.1 Darwin/25.0.0
    • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36
    • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
    • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Gemini-Deep-Research; +https://gemini[dot]google/overview/deep-research/) Chrome/135.0.0.0 Safari/537.36
    • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleAgent-Mariner;+https://developers.google[dot]com/search/docs/crawling-indexing/google-agent-mariner) Chrome/135.0.0.0 Safari/537.36
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleAgent-Search; +https://developers.google[dot]com/search/docs/crawling-indexing/google-agent-search) Chrome/114.0.0.0 Safari/537.36
  46. Trust but verify Ways to validate the IP of requests:

    • Manually with host command • In bulk with a script • Natively in platform • Third-party tools like Tame the Bots Real LLM Bot IP?
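The "in bulk with a script" option can be as small as a range check against the IP lists a platform declares. This is a minimal Python sketch using the standard `ipaddress` module. The CIDR ranges below are reserved documentation ranges standing in for a real published list, and `ip_in_declared_ranges` is an illustrative name; in practice you would load the ranges each vendor publishes and refresh them regularly.

```python
import ipaddress

# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks), NOT real bot ranges.
DECLARED_RANGES = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]


def ip_in_declared_ranges(ip, cidr_ranges=DECLARED_RANGES):
    """Return True if the requester's IP falls inside any declared CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# A request claiming to be a bot passes only if its IP is in the declared list.
print(ip_in_declared_ranges("192.0.2.17"))
print(ip_in_declared_ranges("203.0.113.5"))
```

A user agent is trivially spoofed, so a request whose UA names a bot but whose IP fails this check is worth segmenting separately.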
  47. Top tips for getting log file access

    1. Make allies. That salty SRE is about to be your new BFF.
    2. Until you understand why they're so protective, you probably shouldn't have access.
    3. More permissions in a platform provide a better view. And more ways to seriously mess things up.
    4. Seriously, a single CSV export is not going to cut it here. Make that friend and earn that trust.
    5. Get the log file format details.
  48. Top tips for managing AI crawl

    1. If you're disallowing it for an AI crawler, repeat the statement for CCBot.
    2. Use X-Robots-Tag directives to keep non-HTML resources from being indexed independently while still allowing them to contribute to rendered page content.
    3. Seriously, block your API endpoints for non-rendering bots.
    4. Be sure to block only non-critical resources, that is, resources that aren't important to understanding the meaning of the page.
    5. Block personalization resources used for returning users to conserve crawl budget.
    6. Hide auth-required links from the crawl path.
  49. | ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄| HYDRATE WEAR SUNSCREEN BE KIND BE CURIOUS |________| (\__/)

    || (•ㅅ•) || /   づ Στο καλό - Thank you! https://not-a-robot.com/athenseo