How AI bots discover your site
Why ChatGPT, Claude, Perplexity, and Google AI need to fetch your pages before they can cite you — and the robots.txt and headers that decide whether they can.
Before an AI engine can cite your site in an answer, it has to fetch
your pages. That sounds obvious, but it's the most common reason
indie SaaS sites are invisible in AI search: the content is fine, the
brand is real, but robots.txt quietly tells OAI-SearchBot or
PerplexityBot to go away. This article is the conceptual backdrop for
the AI Crawlability Monitor
feature — what these bots are, why they matter, and what specifically
on your site they read.
Two kinds of AI bots: search and training
Every major AI vendor runs two flavors of crawler. Blocking one does not have the same effect as blocking the other.
| Flavor | What it does | Effect of blocking |
|---|---|---|
| Search / live retrieval | Fetches pages in real time when a user asks the model a question. Result feeds directly into the answer. | The model cannot cite you in answers. Most damaging block for AI visibility. |
| Training | Bulk crawl that ingests pages into the next model checkpoint. Effect appears months later, when the next version is trained. | The next model won't learn about you, but live citations from existing search bots still work. |
For example, OpenAI runs three bots:
- GPTBot — training. Blocking it stops OpenAI from training future models on your content. Live ChatGPT answers are unaffected.
- OAI-SearchBot — powers ChatGPT Search. Blocking it removes you from ChatGPT's live answers entirely.
- ChatGPT-User — on-demand fetch when a user asks ChatGPT to read a specific URL. Blocking it kills explicit user-driven citations.
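Putting that split into practice, a robots.txt that opts out of OpenAI training while staying visible in ChatGPT Search could look like this (a sketch for illustration, not a recommendation for every site):

```
# Opt out of model training
User-agent: GPTBot
Disallow: /

# Keep live ChatGPT Search retrieval open
User-agent: OAI-SearchBot
Allow: /

# Keep user-driven "read this URL" fetches open
User-agent: ChatGPT-User
Allow: /
```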
The same split applies to Anthropic (ClaudeBot for training, Claude-Web
for search), Google (Google-Extended is the AI-training opt-out token,
while Googlebot powers both classic search and AI Mode), Perplexity
(PerplexityBot indexes, Perplexity-User is the on-demand fetch),
and so on.
The 15 bots we check
These are the canonical User-Agent tokens we look for in your robots.txt and headers. The list is reviewed periodically against vendor docs.
| Vendor | Bot | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training |
| OpenAI | OAI-SearchBot | ChatGPT Search live retrieval |
| OpenAI | ChatGPT-User | On-demand browse |
| Anthropic | ClaudeBot | Training |
| Anthropic | Claude-Web | Claude.ai live retrieval |
| Anthropic | anthropic-ai | Legacy UA |
| Perplexity | PerplexityBot | Index for Perplexity search |
| Perplexity | Perplexity-User | On-demand fetch |
| Google | Google-Extended | Gemini + AI Mode opt-out token |
| Google | Googlebot | Standard Google crawler (foundation for AI Mode) |
| Apple | Applebot-Extended | Apple Intelligence training opt-out |
| ByteDance | Bytespider | Doubao / TikTok AI |
| Meta | Meta-ExternalAgent | Llama, WhatsApp AI, Instagram AI |
| Common Crawl | CCBot | Open-source training corpus |
| Cohere | cohere-ai | Cohere training |
| Diffbot | Diffbot | Knowledge graphs powering third-party AI tools |
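As a sketch of how such a per-bot check works, Python's standard-library `urllib.robotparser` can evaluate a robots.txt against each User-Agent token. The robots.txt and bot list below are illustrative, not the Monitor's actual implementation:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: training bots blocked, everything else open
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# A subset of the UA tokens from the table above
BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
        "Claude-Web", "PerplexityBot", "Perplexity-User", "Googlebot"]

def check_bots(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {bot: True if that UA may fetch url} under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in BOTS}

for bot, allowed in check_bots(ROBOTS_TXT).items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

One caveat: `urllib.robotparser` predates RFC 9309, so its rule-precedence behavior may differ from the longest-match semantics described below; a production checker may need its own matcher for `Allow`/`Disallow` ties.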
What "blocked" actually means
There are four signals that decide whether an AI bot can fetch a page:
1. `robots.txt` per-UA group. A `User-agent: GPTBot` group followed by `Disallow: /` blocks GPTBot at the root. Per RFC 9309, the longest-matching path wins, and `Allow:` beats `Disallow:` on a tie.
2. Wildcard `User-agent: *` group. A wildcard `Disallow: /` blocks every bot that doesn't have its own override, including all the AI bots. This is the most common accidental block.
3. `X-Robots-Tag` HTTP header. If your homepage responds with `X-Robots-Tag: noindex`, every search engine and AI crawler is told to drop the page from their index, regardless of robots.txt. This one usually comes from a CDN config or a leftover staging rule.
4. `<meta name="robots" content="noindex">`. Same effect as the header, but in the HTML. AI crawlers honor it.
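To make the RFC 9309 precedence in signal (1) concrete, here is a minimal sketch: the longest matching rule wins, and `Allow` beats `Disallow` on a tie. It treats patterns as plain prefixes; a real matcher would also handle the `*` and `$` wildcards the RFC defines:

```python
def rfc9309_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, path_pattern) pairs, directive in {"allow", "disallow"}.

    Simplified sketch: patterns are plain prefixes (no * or $ wildcards).
    """
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if path.startswith(pattern):
            tie_allow = len(pattern) == best_len and directive == "allow"
            if len(pattern) > best_len or tie_allow:
                best_len, allowed = len(pattern), directive == "allow"
    return allowed

group = [("disallow", "/"), ("allow", "/blog")]
print(rfc9309_allowed(group, "/blog/post"))  # True: the longer /blog rule wins
print(rfc9309_allowed(group, "/pricing"))    # False: only / matches
```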
Either (3) or (4) caps your crawlability score at zero — they make the robots.txt details moot. The Monitor flags them as site-wide critical findings ahead of the per-bot matrix.
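Signals (3) and (4) are cheap to detect once you have a page's response headers and HTML. A sketch of such a check, using only the standard library (the function names are ours, not a published API):

```python
from html.parser import HTMLParser

class _RobotsMetaParser(HTMLParser):
    """Flags a noindex directive in any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = {k: (v or "") for k, v in attrs}  # attr names arrive lowercased
            if d.get("name", "").lower() == "robots" \
                    and "noindex" in d.get("content", "").lower():
                self.noindex = True

def page_is_noindexed(headers: dict, html: str) -> bool:
    """True if the X-Robots-Tag header or a robots meta tag says noindex."""
    header = next((v for k, v in headers.items()
                   if k.lower() == "x-robots-tag"), "")
    if "noindex" in header.lower():
        return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex

print(page_is_noindexed({"X-Robots-Tag": "noindex"}, "<html></html>"))  # True
print(page_is_noindexed({}, '<meta name="robots" content="noindex, nofollow">'))  # True
print(page_is_noindexed({}, '<meta name="robots" content="index, follow">'))  # False
```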
What we don't (and shouldn't) do
Related concepts
- Discovery readiness covers the page-level signals that make an answer-ready page (schema, headings, citations). Crawlability is the gate before that — if the bot can't fetch, none of the page-level signals matter.
- AI engines we cover explains the live retrieval pipeline: which bots feed which engines, and what each engine quotes when it cites.
- llms.txt is an emerging standard for giving AI assistants a curated, machine-readable summary of your product. We track its presence as an info-level signal — first-mover advantage is real, but it's not yet required by any bot.