Blocked GPTBot to protect training data. Just realized I also blocked ChatGPT's live search.
Found this the embarrassing way. Was looking at why my GEO score on ChatGPT had flatlined for 8 weeks while Perplexity kept improving. Finally checked my robots.txt.
In 2024, I added a blanket Disallow: / under User-agent: GPTBot. Made sense at the time — I didn't want OpenAI training on my content.
Problem: GPTBot is the training scraper. OAI-SearchBot is the live-search retrieval bot. They're different user-agents. My block was killing both.
Same issue with Anthropic: ClaudeBot (training, fine to block) vs Claude-Web (retrieval, should allow). I had them listed as one.
The surgical 2026 robots.txt I ended up with:
```
# Block training scrapers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /
```
I also added User-agent: ChatGPT-User (the agent that fetches pages when a user asks for them inside ChatGPT) as an Allow, since that's another retrieval path people miss.
Verify your setup with curl -A 'OAI-SearchBot' https://yourdomain.com/ and check that it returns 200 with full HTML, not a redirect or error. This only catches server or WAF blocks, since curl never reads robots.txt.
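Since curl doesn't read robots.txt, here is a rough way to check the rules themselves offline. It's a minimal sketch using Python's standard `urllib.robotparser`. The `audit` helper, the `AGENTS` list, and the `yourdomain.com` URL are placeholders of mine, and Python's matching is only one interpretation of the file, so a real crawler may read it differently.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post, pasted in as a string so nothing is fetched.
ROBOTS_TXT = """\
# Block training scrapers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /
"""

# Agents to audit (illustrative list; extend it with whatever you care about).
AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "OAI-SearchBot",
    "Claude-Web",
    "PerplexityBot",
    "ChatGPT-User",
]


def audit(robots_txt, agents, url="https://yourdomain.com/"):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


for agent, allowed in audit(ROBOTS_TXT, AGENTS).items():
    print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

An agent with no matching group and no wildcard group comes back as allowed, which is why ChatGPT-User passes here even without its own entry.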
If you added any GPTBot blocks in 2023-2024, worth auditing whether you also killed retrieval.
3 replies
- Dave A.
For anyone considering the WAF route to block training scrapers: Cloudflare's 'AI scrapers' toggle is actually fairly well maintained and handles the training-vs-retrieval distinction better than most manual robots.txt configs I've seen. The manual route Milan describes is right, but it's also a maintenance burden as new bots spin up.
- Leo H.
wait, ChatGPT-User is a different user-agent from OAI-SearchBot? I only had OAI-SearchBot in my Allow list. Going to check if ChatGPT-User is being blocked by a wildcard rule.
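The wildcard case can be checked the same way without touching the live site. This is a sketch with a made-up robots.txt (one explicit Allow plus a blanket wildcard Disallow), again using Python's standard `urllib.robotparser`. The `is_allowed` helper and the sample file are assumptions for illustration, not anyone's real config.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only OAI-SearchBot is named, everything else
# falls through to the wildcard group.
WILDCARD_ROBOTS = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, agent, url="https://yourdomain.com/"):
    """True if the given agent may fetch url under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("OAI-SearchBot", "ChatGPT-User"):
    status = "allowed" if is_allowed(WILDCARD_ROBOTS, agent) else "BLOCKED"
    print(f"{agent:16} {status}")
```

On this sample file ChatGPT-User comes back blocked, because it has no group of its own and inherits the wildcard Disallow. Adding a separate User-agent: ChatGPT-User group with Allow: / would be the fix under that assumption.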
I made the exact same mistake. Worse actually — I had a WAF rule that blocked all crawlers matching 'GPT' in the user-agent string, which caught OAI-SearchBot too since the string appears there. Took me weeks to find it. Your curl diagnostic is the right first check.