Taming the Bots: AI Crawler Management for the Modern SEO

Athens SEO 2026 Welcomes On Stage
Jamie Indigo: 100% Human Technical SEO, Not a Robot

JAMIE INDIGO
• Not a robot; speaks bot
• Doesn't care about your AI product
• Director of Technical SEO, Cox Automotive
• Author of While You Were Offline (80% documentation, 20% snark)
• /in/jamie-indigo
• not-a-robot.com

All of us share a common goal: To appear where
users are searching

It's not just Google anymore. Each platform has its own rules.
ChatGPT
• Limited search documentation (2 dev docs)
• No first-party reporting hub
• Documented user-agents
• Declared IP ranges
Perplexity
• Limited search documentation (1 dev doc)
• No first-party reporting hub
• Documented user-agents
• Declared IP ranges
Gemini
• Limited search documentation
• No first-party reporting hub
• No published IP range list (uses Google Cloud and Workspace IPs)
Claude
• Limited search documentation (2 dev docs)
• No first-party reporting hub
• Declared IP ranges
Grok
• No first-party reporting hub
• Laughable documentation
• No declared IP ranges
• No declared user-agent
AI Overviews/Mode
• A first-party reporting hub… that you can't filter to AIO/M
• Some documentation
• Uses Google infrastructure and cache

AI is a revolution! AI is the future! AI is
still a damn bot.

Let's build a model
1. Training crawler requests a page 👾
2. Training crawlers often do not render the page (meaning they can't see content created by scripts) 🛠
3. Training crawler feeds the page into the corpus 📚
4. The corpus is used to train the model 🔮
5. Users ask the model a question. The model uses the user's prior interactions and user profile information to interpret the question. 👤
6. If the model is confident it knows the answer, it responds based on information from the corpus. If it's not confident, it runs real-time Retrieval-Augmented Generation (RAG) 🤖
7. The model surfaces an answer in its response 💻

Agenda
• Identify: Let's read the Latin
• Segment: Let's play with data
• Triage + Opportunity: Let's take action

Server logs are a record of every request a server
receives. They are a source of truth. (And my most favorite thing)

Logs can come from multiple places in your stack.

• Edge: CDN, DNS, and WAF handle requests before they even hit your infrastructure. Each generates its own logs (cache hits, query traces, blocked requests).
• Network: The load balancer sits here, logging every upstream request, health check, and routing decision.
• Web server: nginx or Apache write access and error logs; the TLS/proxy tier adds SSL handshake records.
• Application: Your app servers, auth service, and API gateway each generate their own logs covering business logic, login events, rate limiting, etc.
• Data: Databases (query/audit), message queues (producer/consumer), and caches (hit/miss/eviction) all produce logs that are easy to overlook.
• Infrastructure: The foundation. OS-level syslog, container stdout/stderr and Kubernetes pod events, and the log aggregator (ELK, Splunk, etc.) that pulls everything together.

DIY: Accessing Log Files
• Apache (Linux Server)
• NGINX (Linux Server)
• IIS log files (Windows Server)
• AWS Load Balancer (Load Balancer)
• Google Cloud Load Balancer (Load Balancer)
• AWS CloudFront (CDN)
• Cloudflare log files (CDN)
• Incapsula (CDN/DDoS Mitigation)
• Akamai logs (CDN/DDoS Mitigation)

Standard WordPress site? Log into your hosting provider and look
for Raw Access

Internal Log Requests (Someone else already has them)
• Ask: Is there already a log management platform in place?
• Be clear: We do not want Personally Identifiable Information (PII), and request that it be removed
• Check your CDN for data on edge node (cached) vs server (uncached) hits
• Get: Log format and definitions
• You want enough log data to get an accurate picture

216.150.168.131 emeasrvr003 [07/Mar/2018:16:11:58 -0800] 66.249.66.1 GET /twiki/bin/view/TWiki/WikiSyntax?q=ntoon HTTP/1.1 www.example.com 200 7352 616 - Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) https://www.example.com/en/ indiegogo
• Server IP: 216.150.168.131
• Server Name: emeasrvr003
• Date & Time: [07/Mar/2018:16:11:58 -0800]
• Requester's IP: 66.249.66.1
• Request Method: GET
• Hostname: www.example.com
• Response Code: 200
• Response Size: 7352
• Response Time: 616
• Requester's User Agent: Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
• Referring URL: https://www.example.com/en/
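Splitting a log line like the one above into named fields can be sketched in a few lines of Python. This is a minimal sketch, not a general log parser: the field names are my own labels, the pattern only fits this particular space-delimited layout, and the sample line has the line-wrap spaces inside the user agent removed. Your own format definition should drive the real pattern.

```python
import re

# Pattern for the space-delimited sample layout shown above (assumed field order).
LOG_PATTERN = re.compile(
    r"(?P<server_ip>\S+) (?P<server_name>\S+) \[(?P<timestamp>[^\]]+)\] "
    r"(?P<client_ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+) "
    r"(?P<host>\S+) (?P<status>\d{3}) (?P<response_size>\d+) (?P<response_time>\d+) "
    r"(?P<unlabeled>\S+) (?P<user_agent>\S+) (?P<referrer>\S+)"
)

SAMPLE_LINE = (
    "216.150.168.131 emeasrvr003 [07/Mar/2018:16:11:58 -0800] 66.249.66.1 GET "
    "/twiki/bin/view/TWiki/WikiSyntax?q=ntoon HTTP/1.1 www.example.com 200 7352 616 - "
    "Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+"
    "(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+"
    "(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) "
    "https://www.example.com/en/ indiegogo"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_line(SAMPLE_LINE)
print(fields["client_ip"], fields["status"], fields["host"])
```

Returning None for lines that do not match keeps malformed rows from silently skewing later counts.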

Get to know your log format
• HTTP status codes (JSON key: statusCode). Examples: 0, 200, 206, 404. The HTTP status code sent in the response. Returns 0 if the TCP connection to the client ended before the server sent a response.
• Client IP (JSON key: cliIP). Example: 198.18.77.18. The IPv4 or IPv6 address of the requesting client. See IPv6 in RFC 5952.
• Bytes (JSON key: bytes). Example: 4995. The content bytes served in the response body. For HTTP/2, this includes framing overhead bytes used when multiplexing multiple responses on one connection.
• Protocol type (JSON key: proto). Examples: HTTP/1.1, HTTPS/1.1, HTTP/2, HTTP3. The protocol of the response-request cycle.
• Query string (JSON key: queryStr). Example: q=foo&submit=true. The query string in the incoming URI from the client. To monitor this parameter in your logs, you need to update your stream's property configuration to set the Cache Key Query Parameters behavior to include all parameters. See Cache Key Query Parameters.
• Request host (JSON key: reqHost). Example: splat-traffic.205400.akamai.com. The value of the Host header in the request with the domain name of the server and the TCP port number on which the server is listening. If no port is included, the default port for the service requested is implied. For example, 443 for an HTTPS URL, and 80 for an HTTP URL. A Host header must be present in HTTP/1.1 requests. If a request lacks this header or has more than one, the server may respond with a 400 status code. See Host in RFC 7230.
• Request method (JSON key: reqMethod). Examples: GET, POST, PUT, OPTIONS. The HTTP method of the request.
• Request path (JSON key: reqPath). Example: path1/path2/file.ext. The path to a resource in the incoming URI without query parameters. See the Query string field.
• Response Content-Length (JSON key: rspContentLen). Examples: -1, 0, 5000. The size of object data returned to the client without HTTP response headers. The Akamai Edge logs the object size even if there is no Content-Length header. Returns -1 if the size can't be determined (for example, when the connection ended before the edge server received the complete object from the origin).
• Response Content-Type (JSON key: rspContentType). Examples: text-plain, text-html. The value of the Content-Type header in the response with the media type of the returned content. Returns - if unknown or not set. The 304 Not Modified response usually does not return this header. See: Content-Type in RFC 7231, Partial Content in RFC 7233, Media types.
• User-Agent (JSON key: UA). Example: Mozilla%2F5.0+%28Macintosh. The URI-encoded value of the User-Agent header in the request. It lets edge servers identify the application, operating system, vendor, or version of the requesting user agent. This field is RFC-1738 escaped. See the note on hex-encoding above the table, User-Agent in RFC 7231 and RFC 2616.

How do we analyze logs at scale?
• Paid: CDN-level (Akamai, Cloudflare), Botify, Logz.io, Sumo Logic, Splunk
• Free(mium): Screaming Frog Log Analyzer, BigQuery
• Masochistic: Excel, Command Line
Example log record:
@timestamp: May 23, 2026 @ 08:38:38.000
IP: 74.7.36.71
LogSize: 1.5KB
Method: GET
base_domain: www.athenseo.com
bytes: 163,281
full_request: /speakers/jamie-indigo
name: com/bot
os: Other
os_name: Other
referrer: "-", -
request: /speakers/jamie-indigo
response: 200
status_code: 200
subfolder_1: speakers
subfolder_2: jamie-indigo
subfolder_3:
type: apache
user_agent: Mozilla/5.0%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko);%20compatible;%20ChatGPT-User/1.0;%20+https://openai.com/bot

The best option is the one already in place

The AI Crawler Zoo
• Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
• Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; [email protected])
• Mozilla/5.0 (compatible; Bytespider; [email protected]) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
• Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.7778.96 Mobile Safari/537.36 (compatible; GoogleOther)
• Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)
• Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; TikTokSpider; [email protected])
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])
• meta-webindexer/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
• Verity/1.1 (https://gumgum.com/verity; [email protected])
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; [email protected])
• Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; ClaudeBot/1.0; [email protected])
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
• Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36
• Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (HTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
• CCBot/2.0 (https://commoncrawl.org/faq/)
• meta-externalfetcher/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
• DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html)
• Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
• Google-NotebookLM

1,850 approximate number of search and AI user agents seen
in the last 30 days

Let's find out who's crawling with 6 query blocks
SELECT UA, count(*) AS requests    -- 1. Desired metrics
FROM myLogs    -- 2. Data source
WHERE UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%'    -- 3. Condition
GROUP BY UA    -- 4. Aggregate
ORDER BY requests DESC    -- 5. Sort
LIMIT 100    -- 6. Limit

(Pssst: LIMIT is how you keep your log access)

Use case: Who's crawling?
UA: Hits
• bytespider: 40,164,250
• oai-searchbot: 7,417,706
• gptbot: 5,080,958
• chatgpt-user: 3,499,647
• claudebot: 2,210,478
• claude-searchbot: 459,281
• PerplexityBot: 276,516
• ChatGPT Agent: 180,021
• claude-user: 7,486
• perplexity-user: 5,641
• google-agent: 1,400
• mistralai-user: 249
• gemini-deep-research: 232
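If your logs live in flat files rather than a queryable platform, the same "who's crawling?" tally can be approximated in Python. A minimal sketch, assuming you already have the user-agent strings in a list; the token list is a small illustrative subset and the function name is hypothetical.

```python
from collections import Counter

# Illustrative subset of crawler tokens; extend it with the agents you care about.
BOT_TOKENS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]


def count_bot_hits(user_agents, tokens=BOT_TOKENS, limit=100):
    """Count requests per crawler token (case-insensitive), most frequent first."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in tokens:
            if token.lower() in lowered:
                counts[token] += 1
                break  # one bucket per request
    return counts.most_common(limit)


sample_user_agents = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36",
]
top_bots = count_bot_hits(sample_user_agents)
print(top_bots)
```

Matching is done on lowercase text because, as the table above shows, logged agent names are not consistently cased.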

Use case: What are they crawling?
Query blocks: 1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit
SELECT concat('https://', reqHost, reqPath) AS url, count(*) AS requests    -- Building a full URL for our results!
FROM myLogs
WHERE UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%'
GROUP BY url
ORDER BY requests DESC
LIMIT 100

Use case: What are they crawling?
url: requests
• https://www.example.com: 14,076
• https://www.example.com/hi-new-friends: 5,857
• https://www.example.com/api/product-data: 1,101
• https://www.example.com/my-account: 983
• https://www.example.com/best-sellers: 699
• https://www.example.com/sale///: 689
• https://www.example.com/blog//: 650
• https://www.example.com/new//: 476

Yeah but I want to see it over time
Query blocks: 1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit
SELECT date, url, requests
FROM (
  SELECT
    DATE_TRUNC('day', reqTimeSec) AS date,
    concat('https://', reqHost, reqPath) AS url,
    count(*) AS requests,
    ROW_NUMBER() OVER (PARTITION BY date ORDER BY requests DESC) AS rank
  FROM myLogs
  WHERE (UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%')
    AND reqTimeSec >= '2026-04-01 00:00:00'
    AND reqTimeSec < '2026-05-01 00:00:00'
  GROUP BY date, url
)
WHERE rank <= 10
ORDER BY date ASC, requests DESC
• Ooh! A window function performs calculations across a set of rows that are related to the current row
• AND lets us add more conditions like start and end dates. The user-agent conditions are wrapped in parentheses so the date range applies to every one of them, not just the last.
• The outer WHERE acts as our limit thanks to the window function

Use case: Top crawled URLs over time

Use case: Does the requested content exist?
Query blocks: 1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit
SELECT
  $__timeInterval(reqTimeSec) AS time,
  countIf(statusCode between 200 and 299) / count(*) AS rate2xx,
  countIf(statusCode between 300 and 399) / count(*) AS rate3xx,
  countIf(statusCode between 400 and 499) / count(*) AS rate4xx,
  countIf(statusCode between 500 and 599) / count(*) AS rate5xx
FROM myLogs
WHERE (UA like '%GPTBot%' OR UA like '%ChatGPT-User%' ... OR UA like '%PerplexityBot%')
  AND $__timeFilter(reqTimeSec)
GROUP BY time
ORDER BY time ASC
LIMIT 25

Use case: Does the requested content exist?
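The status-class rates behind this use case are simple ratios, and they can be sanity-checked outside the dashboard. A minimal Python sketch, assuming a plain list of integer status codes; the function name is hypothetical.

```python
from collections import Counter

STATUS_CLASSES = ("2xx", "3xx", "4xx", "5xx")


def status_class_rates(status_codes):
    """Share of requests in each status class; the denominator is all requests."""
    total = len(status_codes)
    if total == 0:
        return {}
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    return {cls: classes.get(cls, 0) / total for cls in STATUS_CLASSES}


rates = status_class_rates([200, 200, 301, 404])
print(rates)
```

Codes outside 200 to 599 (such as the 0 some CDNs log for dropped connections) still count toward the denominator, which mirrors dividing by count(*).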

All these UAs can be segmented into 4 buckets
Buckets: Training, User-Initiated, Search Indexer, Ads (plus Deprecated)
• OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot
• Gemini: Google-Extended, Googlebot, Googlebot
• Claude: ClaudeBot, Claude-User, Claude-SearchBot, Anthropic-AI, Claude-Web
• Perplexity: Uses a variety of models from Google, OpenAI, Anthropic, and their internal "Sonar" models. Perplexity-User, PerplexityBot (not used to train model)
• Microsoft Copilot: Uses GPT 4 & 5 series + Claude for GitHub and 365 products. Bingbot, Bingbot
• Comet: Standard Chromium + Referrer-Policy: no-referrer
• Atlas: ChatGPTBrowser, ChatGPT Atlas, Standard Chromium
• AI Overviews: Uses Gemini 3 models. Googlebot, Googlebot
• AI Mode: Uses Gemini 3 models. Googlebot, Googlebot

Query blocks: 1. Desired metrics 2. Data Source 3. Condition 4. Aggregate 5. Sort 6. Limit
SELECT
  multiIf(
    UA like '%GPTBot%', 'Training',
    UA like '%Google-Extended%', 'Training',
    ...
    UA like '%Claude-SearchBot%', 'Search Indexer',
    UA like '%PerplexityBot%', 'Search Indexer',
    'Other'
  ) AS crawlerPurpose,
  count(*) AS requests,
  count(distinct(reqPath)) AS uniqueURLs,
  count(distinct(UA)) AS uniqueAgents
FROM myLogz
WHERE ( UA like '%DuckAssistBot%' OR UA like '%ClaudeBot%' OR UA like '%GPTBot%' ... OR UA like '%OAI-SearchBot%' OR UA like '%ChatGPT-User%' )
GROUP BY crawlerPurpose
ORDER BY requests DESC
LIMIT 5
• Bonus scalar function! multiIf needs a final fallback value, here 'Other'.
• Count distinct and you know how many unique UAs!

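Use case: Crawl by purpose
crawlerPurpose: requests, uniqueURLs, uniqueAgents
• Search Indexer: 300,619 requests, 132,092 uniqueURLs, 21 uniqueAgents
• User-Initiated: 41,982 requests, 37,131 uniqueURLs, 13 uniqueAgents
• Training: 16,841 requests, 15,064 uniqueURLs, 4 uniqueAgents
• Ads: 49 requests, 4 uniqueURLs, 1 uniqueAgent
Share of requests: Search Indexer 83.6%, User-Initiated 11.7%, Training 4.7%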
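The purpose bucketing can also be sketched in Python for quick checks against exported rows. The GPTBot, Google-Extended, Claude-SearchBot, and PerplexityBot mappings come from the query in this deck; the remaining entries are my reading of the bucket slide and should be treated as assumptions. Function names are hypothetical.

```python
from collections import Counter

# (token, purpose) pairs. None of these tokens is a substring of another,
# so the order of the list does not change the result.
PURPOSE_BY_TOKEN = [
    ("GPTBot", "Training"),
    ("Google-Extended", "Training"),
    ("ChatGPT-User", "User-Initiated"),     # assumption from the bucket slide
    ("Perplexity-User", "User-Initiated"),  # assumption from the bucket slide
    ("OAI-SearchBot", "Search Indexer"),    # assumption from the bucket slide
    ("Claude-SearchBot", "Search Indexer"),
    ("PerplexityBot", "Search Indexer"),
    ("OAI-AdsBot", "Ads"),                  # assumption from the bucket slide
]


def crawler_purpose(user_agent):
    """Map a user-agent string to a purpose bucket, or 'Other' if nothing matches."""
    lowered = user_agent.lower()
    for token, purpose in PURPOSE_BY_TOKEN:
        if token.lower() in lowered:
            return purpose
    return "Other"


def purpose_shares(user_agents):
    """Fraction of requests per purpose bucket."""
    counts = Counter(crawler_purpose(ua) for ua in user_agents)
    total = sum(counts.values())
    return {purpose: hits / total for purpose, hits in counts.items()}


shares = purpose_shares([
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
])
print(shares)
```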

url: requests
• https://www.example.com: 14,076
• https://www.example.com/hi-new-friends: 5,857
• https://www.example.com/api/product-data: 1,101
• https://www.example.com/my-account: 983
• https://www.example.com/best-sellers: 699
But should it be crawled? Should it be trained on?
These bots are lazy and dumb. If that API endpoint has everything they need, then that's where they'll send the user. The user will close the tab and move to the next source. (Plus, it's expensive to keep feeding these greedy little monsters.)

Controlling Training/"Index"
• data-nosnippet attribute (for example, <span data-nosnippet>)
◦ Blocks specific page elements (div, span, section) from appearing in snippets/grounding
◦ Preserves page ranking while hiding selected fragments
• <meta name="robots" content="nosnippet">
◦ Blocks the whole page from appearing in snippets/grounding
• X-Robots-Tag: nosnippet
◦ For non-HTML resources, such as PDF files, video files, or image files

Server strain is real and costly
Crawl-to-refer ratio by AI platform: ratio of HTML page crawl requests to HTML page referrals by platform
Source: AI Insights | Cloudflare Radar

The training vs. inference tradeoff: the strategic question your clients haven't asked yet
• Block training crawlers = reduce presence in AI-generated answers over time (model doesn't "know" you).
• Block live inference crawlers = present in the model, no citation traffic.
These are opposite problems. Make these decisions deliberate, not default.

Does it even use the scripts it keeps requesting!?
AI tools don't provide first-party documentation, so technical SEOs are constantly testing. This chart shows rendering capabilities as of December 2025, as documented by Merj using middleware.
Crawler: Rendering
• Googlebot ✅
• Bingbot ✅
• Yandex ✅
• Naver ✅
• Baidu ✅
• AdsBot-Google ✅
• adidxbot ✅
• Applebot ✅
• CCBot ❌
• GPTBot ❌
• OAI-SearchBot ❌
• ChatGPT-User ❌
• ClaudeBot ✅
• Claude-User ❌
• Claude-SearchBot ❌
• PerplexityBot ❌
• GrokBot ❌
• MistralAI-User ✅
• Bytespider ✅
• meta-externalagent ✅
• Applebot ✅
Source: Layout, Style, and Rendering - When Things Go Wrong, Giacomo Zecchini

85% of AI referral traffic comes from platforms that do not execute JavaScript
• Does not render: 85.3%
• Renders: 11.6%
• Unknown: 3.2%
Source: AI Chatbot Market Share Worldwide | Statcounter Global Stats

Control Crawl
• Robots.txt
◦ Blocks crawling… for polite bots
◦ Removes page from search index
• X-Robots-Tag
◦ Index control for non-HTML resources
◦ Does not appear to be respected by OpenAI crawlers
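Before shipping robots.txt changes, it can help to check how a polite bot would read them. The sketch below uses Python's standard `urllib.robotparser`; the rules are a hypothetical policy invented for illustration (block a training crawler and CCBot entirely, keep a search crawler out of `/api/`), not a recommendation from this deck, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /api/

User-agent: *
Disallow: /my-account
"""


def allowed(bot_token, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets this bot token fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_token, url)


for bot in ("GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot"):
    print(bot, allowed(bot, "https://www.example.com/api/product-data"))
```

Pass the short product token (for example `GPTBot`), not the full user-agent string: the parser matches against the part of the name before the first slash.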

Speaking of "Index", what if we could make one?
• We can query how many unique URLs we have.
• We can query how many unique URLs GPTBot has requested.
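One way to turn those two counts into a homemade "index" figure is a coverage ratio: the share of your known URLs a given crawler has requested. The slide does not define a formula, so the ratio below is an assumption on my part, and the function name and sample URLs are hypothetical.

```python
def crawl_coverage(site_urls, crawled_urls):
    """Fraction of known site URLs that the crawler has requested at least once."""
    site = set(site_urls)
    if not site:
        return 0.0
    return len(site & set(crawled_urls)) / len(site)


site_urls = [
    "https://www.example.com/",
    "https://www.example.com/best-sellers",
    "https://www.example.com/blog/",
    "https://www.example.com/new/",
]
gptbot_urls = [
    "https://www.example.com/",
    "https://www.example.com/best-sellers",
    "https://www.example.com/api/product-data",  # requested, but not in the site list
]
coverage = crawl_coverage(site_urls, gptbot_urls)
print(f"{coverage:.0%}")
```

URLs the crawler requested that are not in your own list (like the API endpoint here) are ignored by the ratio, but they are worth reviewing separately.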

We can keep building from there

We can even track the impact of performance
• Sites with CLS ≤ 0.1 recorded a 29.8% higher inclusion rate in generative summaries compared with sites above this threshold
• Pages delivering LCP ≤ 2.5 seconds were 1.47X more likely to appear in AI outputs than slower pages
• Crawlers abandoned requests for 18% of pages >1MB
• TTFB under 200 ms correlated with a 22% increase in citation density, particularly when paired with robust caching strategies.
• A SALT.agency study shows that performance improvements do more than enhance user experience. They directly increase the probability of being cited or surfaced by AI systems.
Source: Technical SEO for AI Search - SALT.agency®

Use case: Impact of performance on AI crawlers

Treat yourself to analytics while we're at it
SELECT
  multiIf(
    referer like '%chat.openai.com%' OR queryStr like '%utm_source=chatgpt%', 'ChatGPT',
    referer like '%perplexity.ai%', 'Perplexity',
    ...
    referer like '%copilot.microsoft.com%', 'Microsoft Copilot',
    'Other'
  ) AS aiSource,
  count(*) AS sessions,
  count(distinct(cliIP)) AS uniqueUsers,
  countIf(statusCode between 200 and 299) / count(*) AS successRate,
  ${aggregation}(totalBytes) AS avgBytes
FROM myLogz
WHERE $__timeFilter(reqTimeSec)
  AND ( referer like '%chat.openai.com%' OR queryStr like '%utm_source=chatgpt%' ... OR referer like '%copilot.microsoft.com%' )
GROUP BY aiSource
ORDER BY sessions DESC
LIMIT 10

Use case: AI referral analytics
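The referral labeling can be prototyped in Python before committing it to a dashboard query. A minimal sketch using the three sources named in this deck; it parses the referrer host and the `utm_source` parameter rather than doing substring matches, which is my own design choice, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse

# Referrer domains taken from the deck's query; extend as needed.
REFERRER_DOMAINS = {
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "copilot.microsoft.com": "Microsoft Copilot",
}


def ai_source(referrer, query_string=""):
    """Label a request with its AI referral source, or None if it has none."""
    # utm_source values beginning with "chatgpt" count as ChatGPT referrals.
    for value in parse_qs(query_string).get("utm_source", []):
        if value.lower().startswith("chatgpt"):
            return "ChatGPT"
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in REFERRER_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return label
    return None


print(ai_source("https://www.perplexity.ai/search", ""))
print(ai_source("-", "q=foo&utm_source=chatgpt.com"))
```

Matching on the parsed hostname avoids counting a URL that merely contains one of these domains somewhere in its path.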

We can even see AI agents that don't trigger CSR analytics
SELECT
  multiIf(
    UA like '%ChatGPT Agent%', 'ChatGPT Agent',
    UA like '%google-agent%', 'Google Agent',
    'Other'
  ) AS agentName,
  count(*) AS requests,
  count(distinct(cliIP)) AS uniqueSessions,
  count(distinct(reqPath)) AS uniquePaths,
  requests / uniqueSessions AS avgRequestsPerSession,
  countIf(reqMethod = 'POST') AS postRequests,
  countIf(reqMethod = 'POST') / count(*) AS postRate,
  countIf(statusCode between 200 and 299) / count(*) AS successRate,
  countIf(statusCode between 400 and 499) / count(*) AS rate4xx,
  ${aggregation}(turnAroundTimeMSec) AS avgTurnaround,
  sum(totalBytes) AS totalBandwidth
FROM myLogz
WHERE $__timeFilter(reqTimeSec)
  AND ( BotnetID like '%Google-Agent%' OR BotnetID like '%Claude-User%' )
GROUP BY agentName
ORDER BY requests DESC
LIMIT 10
POST method is a key indicator of an agent

Use case: AI Agent activity

Logs are perfect beautiful angel babies. Crawlers however…

May 2025 Non-compliance is real and documented

• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)
• ChatGPT%20Atlas/20251021184832000 CFNetwork/3860.100.1 Darwin/25.0.0
• Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36
• Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
• Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Gemini-Deep-Research; +https://gemini[dot]google/overview/deep-research/) Chrome/135.0.0.0 Safari/537.36
• Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleAgent-Mariner;+https://developers.google[dot]com/search/docs/crawling-indexing/google-agent-mariner) Chrome/135.0.0.0 Safari/537.36
• Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GoogleAgent-Search; +https://developers.google[dot]com/search/docs/crawling-indexing/google-agent-search) Chrome/114.0.0.0 Safari/537.36
Let me love you

Trust but verify
Ways to validate the IP of requests:
• Manually with host command
• In bulk with a script
• Natively in platform
• Third-party tools like Tame the Bots Real LLM Bot IP?
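For platforms that declare IP ranges, the bulk check can be a short script that tests each requesting IP against the published CIDR blocks. The sketch below uses Python's standard `ipaddress` module. The bot name and ranges are placeholders drawn from the reserved documentation address blocks, not real crawler ranges; in practice you would load each platform's published list.

```python
import ipaddress

# Placeholder data: "ExampleBot" and these documentation-only ranges are hypothetical.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def ip_in_published_ranges(ip, bot, ranges=PUBLISHED_RANGES):
    """Return True if the IP falls inside any range declared for the given bot."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges.get(bot, []))


for ip in ("192.0.2.10", "198.51.100.7", "2001:db8::1"):
    print(ip, ip_in_published_ranges(ip, "ExampleBot"))
```

A request whose user agent claims to be a known crawler but whose IP fails this check is a candidate for spoofing and deserves a closer look.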

Top tips for getting log file access
1. Make allies. That salty SRE is about to be your new BFF.
2. Until you understand why they're so protective, you probably shouldn't have access
3. More permissions in a platform provide a better view. And more ways to seriously mess things up.
4. Seriously, a single CSV export is not going to cut it here. Make that friend and earn that trust.
5. Get the log file format details

Top tips for managing AI crawl
1. If you're disallowing it for an AI crawler, repeat the statement for CCBot
2. Use X-Robots-Tag directives to keep non-HTML resources from being indexed independently while still allowing them to contribute to rendered page content.
3. Seriously, block your API endpoints for non-rendering bots
4. Be sure to block only non-critical resources: that is, resources that aren't important to understanding the meaning of the page.
5. Block personalization resources used for returning users to conserve crawl budget
6. Hide auth-required links from the crawl path

|￣￣￣￣￣￣￣￣|
  HYDRATE
  WEAR SUNSCREEN
  BE KIND
  BE CURIOUS
|＿＿＿＿＿＿＿＿|
(\__/) ||
(•ㅅ•) ||
/ 　づ
Στο καλό - Thank you!
https://not-a-robot.com/athenseo
