What robots.txt actually does (and does not) for AI bots
robots.txt is a plain-text file at the root of your site that tells crawlers which paths they may or may not fetch. It is a 30-year-old convention that predates AI search by decades, but in 2026 it is the primary lever site owners use to control whether their content trains models, appears in AI citations, or gets fetched live when a user asks an assistant a question.
What it is not
- Not enforcement. robots.txt is a polite request. Well-behaved crawlers (OpenAI, Anthropic, Google, Perplexity, Common Crawl) honour it. Less polite ones (some training scrapers, low-quality data brokers) ignore it. Block their IP ranges or use bot-detection if that's a real problem (see the enforcement sketch after this list).
- Not a ranking signal. Blocking GPTBot does not make Google's AI Overviews rank you differently — they use Googlebot. Blocking PerplexityBot does not affect Bing or ChatGPT.
- Not instant. Bots re-fetch robots.txt on their own schedule — often daily, sometimes weekly. A new rule typically takes 24–72 hours to take effect across major engines.
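If an impolite crawler keeps ignoring the file, enforcement has to happen at the edge or the origin instead. Here is a minimal sketch for a Next.js site, assuming the standard middleware API; the bot names are placeholders, not a vetted blocklist:
// middleware.ts — illustrative sketch, not a production blocklist
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'
// Placeholder names — populate from your own server logs
const BLOCKED_BOTS = ['Bytespider', 'ExampleScraperBot']
export function middleware(request: NextRequest) {
  const ua = (request.headers.get('user-agent') ?? '').toLowerCase()
  if (BLOCKED_BOTS.some((bot) => ua.includes(bot.toLowerCase()))) {
    // Refuse outright instead of relying on the crawler's manners
    return new NextResponse(null, { status: 403 })
  }
  return NextResponse.next()
}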
Why this matters in 2026
Three concrete reasons:
- Training and citation are now two separate decisions. OpenAI, Anthropic, and Google all split their AI bots into training-only crawlers (GPTBot, ClaudeBot, Google-Extended) and live-citation crawlers (OAI-SearchBot, Claude-User, Googlebot). You can opt out of training but stay citable — or vice versa.
- Unexamined default-allow is the wrong posture for most B2B SaaS. Some of the noisier AI bots fetch tens of thousands of pages per day with no buyer intent behind them — costing bandwidth, polluting analytics, and feeding low-quality training corpora. A considered allowlist beats a never-reviewed "allow everything" almost every time.
- AI bot names change. OpenAI introduced OAI-SearchBot in 2024 separately from GPTBot, then added ChatGPT-User. Anthropic added Claude-User and Claude-SearchBot in 2024–2025. Apple added Applebot-Extended. If your robots.txt hasn't been reviewed in 18+ months, you almost certainly have outdated rules.
The two big decisions
Most teams over-think this. There are two decisions worth making consciously:
Decision 1 — do you want your content used to train AI models?
Training-only bots include GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Bytespider, and several smaller ones. Blocking them stops your content from going into future training data — if the vendor honours it.
Reasons to allow training: you want your brand, documentation, and product details to be encoded into the foundation model itself. When an AI assistant answers a question about your category, "baked-in" knowledge from training matters more than any live retrieval — because models often answer without browsing the web at all.
Reasons to block training: you sell premium or licensed content, you have a strict legal / IP stance, you are a publisher whose business is the content itself, or you simply object to your work being used uncompensated.
Decision 2 — do you want to be cited in live AI search?
Live-citation bots include OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, and Google's standard Googlebot (which controls AI Overviews and AI Mode). These fetch your pages when a user asks a question and the AI engine needs context — and they cite the source URL in the answer.
Default for B2B SaaS, agencies, content sites: allow. You almost certainly want to be cited.
Exceptions: paid content behind a paywall, membership-only resources, private documentation. Block via robots.txt and via the actual auth layer — never rely on robots.txt alone for security.
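As an illustration, a members-only area can be disallowed for the AI user agents while the rest of the site stays open. The /members/ path below is a placeholder, and the real protection remains the auth layer:
# Hypothetical example: keep a /members/ area out of AI answers
User-agent: GPTBot
Disallow: /members/
User-agent: OAI-SearchBot
Disallow: /members/
User-agent: PerplexityBot
Disallow: /members/
# ...repeat for the other AI user agents covered below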
The bots — what each one is and what to do
The crawlers worth knowing about in 2026, grouped by vendor:
OpenAI
- GPTBot — collects training data for future OpenAI models. Block to opt out of training.
- OAI-SearchBot — fetches pages cited in ChatGPT's search-mode answers. Does not train. Block to stop being cited in ChatGPT search.
- ChatGPT-User — live retrieval when a user clicks "browse" or asks a real-time question. Does not train. Block to stop being live-fetched.
Anthropic
- ClaudeBot — Anthropic's general crawler for training Claude models.
- Claude-User — fetches pages when a Claude user asks a question that triggers retrieval.
- Claude-SearchBot — Anthropic's newer search-mode citation crawler.
- anthropic-ai — older crawler name still listed in Anthropic's docs for backward compatibility; rules for ClaudeBot now apply.
Perplexity
- PerplexityBot — main Perplexity crawler that indexes pages for citation in Sonar answers.
- Perplexity-User — live retrieval on user query.
Google
- Googlebot — regular Google indexing. Required for AI Mode, AI Overviews, and any Google Search visibility. Almost never block.
- Google-Extended — opt-out token for Bard / Gemini / Vertex AI training. Does not crawl on its own; Googlebot checks for this token before using your content for training. Blocking Google-Extended does not affect AI Mode or AI Overviews — those still rely on Googlebot.
Apple
- Applebot — Spotlight, Siri, Safari Suggestions, Apple search index. Allow.
- Applebot-Extended — opt-out for Apple Intelligence training. Mirrors Google-Extended's pattern.
Other notable bots
- CCBot — Common Crawl. Free public dataset that feeds many open-source AI models, plus some commercial ones. Block to be excluded from the open training corpus.
- Bytespider — ByteDance (TikTok) training crawler. Notorious for high volume and for occasionally ignoring robots.txt. Block, and consider rate-limiting at the CDN level too.
- Amazonbot — Alexa question-answering and Rufus shopping AI. Allow if you sell on Amazon or want to be cited there.
- Meta-ExternalAgent (formerly FacebookBot) — Meta AI training. Block to opt out.
- DuckAssistBot — DuckDuckGo's AI Assistant crawler.
- YouBot — You.com's AI-first search crawler.
- cohere-ai — Cohere's training crawler.
- Diffbot — knowledge-graph extractor used by many AI products downstream.
- Timpibot — Timpi search index, used by some AI assistants for retrieval.
Copy-paste robots.txt — three stances
Pick the stance that matches your business. Drop the snippet at the root of your site.
Stance 1 — Allow everything (default for most B2B SaaS)
Maximises both training inclusion and citation in live AI search. The right choice if you want AI engines to know everything about your brand. No Disallow rules needed beyond your normal site-specific ones.
# Allow all crawlers, AI included
User-agent: *
Allow: /
Disallow: /admin
Disallow: /api/private
Sitemap: https://yoursite.com/sitemap.xml
Stance 2 — Block training, allow live citation
Stops your content going into future model-training corpora but keeps you eligible for live AI-search citations. The most common stance for publishers, premium-content sites, and B2B SaaS with strong opinions on IP.
# Block training-only bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: cohere-ai
Disallow: /
# Allow live-citation bots (default)
# OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot,
# PerplexityBot, Perplexity-User, Googlebot, Applebot fall through
# to the default User-agent: * rule.
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Stance 3 — Block all AI
Stops both training and live citation across major AI engines. Right for paid-content businesses, news publishers with licensing deals, or sites whose entire value prop is exclusive content. Note: Googlebot is left allowed so regular Google search still works — you can't cleanly opt out of AI Mode / AI Overviews without also opting out of all of Google Search.
# Block all known AI training bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: cohere-ai
Disallow: /
# Block live-citation AI bots
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: DuckAssistBot
Disallow: /
User-agent: YouBot
Disallow: /
# Allow regular search (Google, Bing)
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Setup on Next.js / Vercel
Two options. Pick based on whether your rules are static or need to depend on an environment or feature flag.
Option A — static file
Drop the snippet at public/robots.txt. Vercel serves it at yourdomain.com/robots.txt automatically. Simpler, faster, no per-request cost.
Option B — dynamic Metadata API
Next.js exposes a typed app/robots.ts handler. Use this if you want to allow / block based on environment (e.g. block everything on staging, allow on prod) without maintaining two static files.
// app/robots.ts
import type { MetadataRoute } from 'next'
export default function robots(): MetadataRoute.Robots {
const isStaging = process.env.VERCEL_ENV !== 'production'
if (isStaging) {
return {
rules: [{ userAgent: '*', disallow: '/' }],
}
}
const trainingBots = [
'GPTBot', 'ClaudeBot', 'anthropic-ai',
'Google-Extended', 'Applebot-Extended',
'CCBot', 'Bytespider', 'Meta-ExternalAgent',
'cohere-ai',
]
return {
rules: [
...trainingBots.map((userAgent) => ({
userAgent,
disallow: '/',
})),
{ userAgent: '*', allow: '/' },
],
sitemap: 'https://yoursite.com/sitemap.xml',
}
}
Setup on WordPress, Webflow, Framer, static HTML
- WordPress — Yoast SEO, Rank Math, and AIOSEO all expose a robots.txt editor in their Tools / Settings. Otherwise upload via SFTP to your site root.
- Webflow — paid plans support custom robots.txt under Project Settings → "Indexing". Free plans serve a default rule; upgrade if you need control.
- Framer — Site Settings → SEO has a "Custom robots.txt" field.
- Static HTML / static hosts (S3, Cloudflare Pages, Netlify) — put the file in the root of your build output. Done.
- Cloudflare in front of any of the above — Workers and Page Rules can override robots.txt responses if your origin gives you no editor. Useful for emergency "block everything" rules without redeploying.
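For that emergency case, here is a minimal Cloudflare Worker sketch (modules syntax, routed to yoursite.com/robots.txt) that serves a block-everything file from the edge without redeploying the origin; adapt the body to your real rules:
// Illustrative Worker: serve a blocking robots.txt from the edge
export default {
  async fetch(): Promise<Response> {
    const body = 'User-agent: *\nDisallow: /\n'
    return new Response(body, {
      headers: { 'Content-Type': 'text/plain' },
    })
  },
}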
Common mistakes
- Blocking Googlebot by accident. A wildcard Disallow: / under User-agent: * with no separate group for Googlebot kills your Google Search visibility — including AI Mode and AI Overviews. Give every bot you want to keep its own explicit rules before you tighten the wildcard.
- Relying on robots.txt for security. It is a polite request. Pages listed as Disallow are more visible to attackers, not less, because they're publicly listed in the file. Use real auth for anything sensitive.
- Case sensitivity. User-agent matching is case-insensitive for most polite bots but not guaranteed. Match the official casing from the vendor docs.
- Forgetting the Sitemap: directive. Always include it — both Google and Bing use it for discovery, and several AI crawlers do too.
- Mixing Disallow: and Allow: in subtle ways. The most specific (longest) matching path wins, so an Allow: can carve an exception out of a broad Disallow: (see the example after this list). Test the result in Search Console's robots.txt report before deploying.
- Forgetting to update it. The bot landscape changes 2–3 times per year. If your file is from 2023, it probably misses 4+ bots that exist today.
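To make the precedence point concrete: assuming the crawler implements standard longest-match precedence, the longer Allow: path wins in this placeholder example, so GPTBot may fetch anything under /blog/ and nothing else.
User-agent: GPTBot
Disallow: /
Allow: /blog/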
How to verify it actually works
Three quick checks:
- Curl the file. Run curl -I https://yoursite.com/robots.txt and confirm Content-Type: text/plain and HTTP 200.
- Google Search Console — robots.txt report. Settings → Crawling → Open report. Shows the version Google has cached and lets you re-fetch on demand.
- Your own server logs. Filter by User-Agent on the blocked bot names. After 24–72 hours, polite bots should respect your new rules and stop fetching disallowed paths.
None of that tells you whether AI engines actually cite your brand more or less afterwards. For that, run our free AI audit on a buyer-question panel and compare before vs after.
What to ship next
- Ship llms.txt — the companion file giving AI agents a curated map of your important pages. Free generator at /tools/llms-txt-generator, setup guide at /guides/llms-txt-setup.
- Audit your Organisation JSON-LD — entity clarity is one of the few signals all AI engines consistently use. Keep name, url, and sameAs consistent across every page (a minimal example follows this list).
- Measure Share of Voice for 30 days before shipping more. Without a baseline you cannot tell whether any of the technical changes moved the needle.
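For the JSON-LD point above, here is a minimal Organization snippet with placeholder values, showing the three fields worth keeping identical everywhere:
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Company",
  "url": "https://yoursite.com",
  "sameAs": [
    "https://www.linkedin.com/company/yourcompany",
    "https://github.com/yourcompany"
  ]
}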