How AI bots discover your site

Why ChatGPT, Claude, Perplexity, and Google AI need to fetch your pages before they can cite you — and the robots.txt and headers that decide whether they can.


Before an AI engine can cite your site in an answer, it has to fetch your pages. That sounds obvious, but it's the most common reason indie SaaS sites are invisible in AI search: the content is fine, the brand is real, but robots.txt quietly tells OAI-SearchBot or PerplexityBot to go away. This article is the conceptual backdrop for the AI Crawlability Monitor feature — what these bots are, why they matter, and what specifically on your site they read.

Two kinds of AI bots: search and training

Every major AI vendor runs two flavors of crawler. Blocking one does not have the same effect as blocking the other.

| Flavor | What it does | Effect of blocking |
| --- | --- | --- |
| Search / live retrieval | Fetches pages in real time when a user asks the model a question; the result feeds directly into the answer. | The model cannot cite you in answers. The most damaging block for AI visibility. |
| Training | Bulk crawl that ingests pages into the next model checkpoint; the effect appears months later, when the next version is trained. | The next model won't learn about you, but live citations from existing search bots still work. |

For example, OpenAI runs three bots:

  • GPTBot — training. Blocking it stops OpenAI from training future models on your content. Live ChatGPT answers are unaffected.
  • OAI-SearchBot — powers ChatGPT Search. Blocking it removes you from ChatGPT's live answers entirely.
  • ChatGPT-User — on-demand fetch when a user asks ChatGPT to read a specific URL. Blocking it kills explicit user-driven citations.
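In robots.txt terms, the split looks like this. A minimal sketch that opts out of OpenAI training while staying citable in live answers:

```text
# Opt out of training future OpenAI models
User-agent: GPTBot
Disallow: /

# Stay visible in ChatGPT Search and user-requested browsing
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```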

The same split applies to Anthropic (ClaudeBot for training, Claude-Web for live retrieval), Google (Google-Extended is the AI training opt-out token; Googlebot powers both classic Search and AI Mode), Perplexity (PerplexityBot indexes; Perplexity-User does on-demand fetches), and so on.
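Google is worth a snippet of its own, because Google-Extended is a robots.txt token rather than a separate crawler you'll see in your logs. Blocking it opts you out of AI training without touching the Googlebot crawl that classic Search rankings depend on:

```text
# AI training opt-out; Googlebot and classic Search are unaffected
User-agent: Google-Extended
Disallow: /
```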

The 16 bots we check

These are the canonical User-Agent tokens we look for in your robots.txt and headers. The list is reviewed periodically against vendor docs.

| Vendor | Bot | Purpose |
| --- | --- | --- |
| OpenAI | GPTBot | Training |
| OpenAI | OAI-SearchBot | ChatGPT Search live retrieval |
| OpenAI | ChatGPT-User | On-demand browse |
| Anthropic | ClaudeBot | Training |
| Anthropic | Claude-Web | Claude.ai live retrieval |
| Anthropic | anthropic-ai | Legacy UA |
| Perplexity | PerplexityBot | Index for Perplexity search |
| Perplexity | Perplexity-User | On-demand fetch |
| Google | Google-Extended | Gemini + AI Mode opt-out token |
| Google | Googlebot | Standard Google crawler (foundation for AI Mode) |
| Apple | Applebot-Extended | Apple Intelligence training opt-out |
| ByteDance | Bytespider | Doubao / TikTok AI |
| Meta | Meta-ExternalAgent | Llama, WhatsApp AI, Instagram AI |
| Common Crawl | CCBot | Open-source training corpus |
| Cohere | cohere-ai | Cohere training |
| Diffbot | Diffbot | Knowledge graphs powering third-party AI tools |
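If you want to reproduce the per-bot part of this check yourself, a rough sketch using only Python's standard library is below. Note that urllib.robotparser predates RFC 9309 and applies rules in file order (first match wins) rather than longest-match precedence, so treat its verdicts as an approximation of what the bots actually do:

```python
from urllib import robotparser

# The UA tokens from the table above
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Googlebot",
    "Applebot-Extended", "Bytespider", "Meta-ExternalAgent",
    "CCBot", "cohere-ai", "Diffbot",
]

def check_site(origin: str) -> None:
    """Print each bot's allow/block verdict for the site root."""
    rp = robotparser.RobotFileParser(f"{origin}/robots.txt")
    rp.read()  # fetch and parse robots.txt
    for bot in AI_BOTS:
        verdict = "allowed" if rp.can_fetch(bot, f"{origin}/") else "BLOCKED"
        print(f"{bot:<20} {verdict}")

check_site("https://example.com")  # hypothetical target site
```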

What "blocked" actually means

Four signals decide whether an AI bot can fetch a page and whether it may use what it fetches:

  1. robots.txt per-UA group. A User-agent: GPTBot group containing Disallow: / blocks GPTBot site-wide. Per RFC 9309, the longest-matching path wins, and Allow: beats Disallow: on a tie (see the worked example after this list).
  2. Wildcard User-agent: * group. A wildcard Disallow: / blocks every bot that doesn't have its own override — including all the AI bots. This is the most common accidental block.
  3. X-Robots-Tag HTTP header. If your homepage responds with X-Robots-Tag: noindex, every search engine and AI crawler is told to drop the page from their index — regardless of robots.txt. This one usually comes from a CDN config or a leftover staging rule.
  4. <meta name="robots" content="noindex">. Same effect as the header, but in the HTML. AI crawlers honor it.
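The precedence rules in (1) and (2) are easiest to see in a worked example. A hypothetical robots.txt:

```text
User-agent: *
Disallow: /            # blocks every bot without a group of its own...

User-agent: GPTBot     # ...but this group overrides the wildcard for GPTBot
Disallow: /drafts
Allow: /drafts/public
```

For GPTBot fetching /drafts/public/pricing.html, both rules in its group match; Allow: /drafts/public (14 characters) is longer than Disallow: /drafts (7 characters), so the fetch is allowed. Every other AI bot falls through to the wildcard group and is blocked everywhere.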

Either (3) or (4) caps your crawlability score at zero: they make the robots.txt details moot. The Monitor flags them as site-wide critical findings ahead of the per-bot matrix. A quick way to check both is sketched below.
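A minimal sketch using only the Python standard library; the URL is a hypothetical stand-in for your homepage:

```python
import re
import urllib.request

def find_noindex(url: str) -> None:
    """Report noindex signals in the response headers and HTML of one page."""
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get("X-Robots-Tag")
        html = resp.read(500_000).decode("utf-8", errors="replace")

    if header and "noindex" in header.lower():
        print(f"critical: X-Robots-Tag header says {header!r}")

    # Crude scan for <meta name="robots" content="...noindex...">;
    # assumes the name attribute comes before content
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if meta and "noindex" in meta.group(1).lower():
        print(f"critical: robots meta tag says {meta.group(1)!r}")

find_noindex("https://example.com/")  # hypothetical homepage
```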

Where crawlability fits

  • Discovery readiness covers the page-level signals that make an answer-ready page (schema, headings, citations). Crawlability is the gate before that — if the bot can't fetch, none of the page-level signals matter.
  • AI engines we cover explains the live retrieval pipeline: which bots feed which engines, and what each engine quotes when it cites.
  • llms.txt is an emerging standard for giving AI assistants a curated, machine-readable summary of your product. We track its presence as an info-level signal — first-mover advantage is real, but it's not yet required by any bot.