AI Crawlability Monitor

Audit your robots.txt, X-Robots-Tag header, meta robots, and llms.txt against 15 AI crawlers. Score, severity, and copy-paste fix snippets per blocked bot.

7 min read

The Crawlability Monitor checks whether ChatGPT, Claude, Perplexity, Google AI, and 11 other AI bots can actually fetch your site. It runs in ~2 seconds, scores the result on a 0–100 gauge, and gives you a copy-paste robots.txt snippet for every blocked bot. Purely deterministic — zero LLM tokens.

If you're new to the concept, start with How AI bots discover your site for background on search-bot vs training-bot distinctions and what each signal means.

When to run it

The Monitor lives at /dashboard/crawlability and runs in three modes:

| Mode | When | What it does |
| --- | --- | --- |
| Onboarding | The moment you verify a domain | One audit fires automatically so you see a score on your very first dashboard visit. |
| Manual | You click Run first audit / Re-audit now | On-demand audit, rate-limited by tier. |
| Weekly cron | Sundays 03:00 UTC, Pro and Business | Automatic refresh — appears in score history without lifting a finger. |

The same code path runs all three. Manual is useful right after you ship a robots.txt change; weekly is the safety net for accidental regressions (a CDN deploy that flips on a noindex header, a Yoast plugin update that wipes your AI bot allows).
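
Because the same code path serves all three triggers, you can think of it as one audit function with a trigger flag. A minimal sketch under that assumption (names like runAudit and MANUAL_QUOTA are illustrative, not the actual implementation):

```ts
// Hypothetical sketch: one audit entry point, three callers (onboarding, manual, weekly cron).
type AuditTrigger = "onboarding" | "manual" | "weekly_cron";

// Manual audits allowed per tier window (Free: per week; Pro/Business: per day); see the plan-limits table further down.
const MANUAL_QUOTA: Record<string, number> = { free: 1, pro: 5, business: 10 };

async function runAudit(domain: string, trigger: AuditTrigger, tier: string, usedInWindow: number) {
  // Only manual runs hit the rate limit; onboarding fires once and the cron is scheduled server-side.
  if (trigger === "manual" && usedInWindow >= (MANUAL_QUOTA[tier] ?? 0)) {
    throw new Error(`Manual audit quota exhausted for tier "${tier}"`);
  }
  // ...the same deterministic fetch, parse, and score pipeline runs for every trigger...
  return { domain, trigger, ranAt: new Date().toISOString() };
}
```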

How the score is built

We start at 100 and weight every finding by severity:

  • Pass = weight: 1.0 — the bot is allowed at the root.
  • Info = weight: 0.95 — neutral signal (no robots.txt, no llms.txt, no Sitemap: directive). Default-allow behavior keeps the score high.
  • Warning = weight: 0.5 — partial block, training-only bot blocked, or excessive Crawl-delay. AI visibility partially affected.
  • Critical = weight: 0.0 — search bot blocked, noindex header or meta, robots unreachable. Direct hit on AI search visibility.

The composite score weights only the critical-priority bots (OpenAI, Anthropic, Perplexity, Google). Extra-priority bots (Apple, ByteDance, Meta, Common Crawl, Cohere, Diffbot) appear in the matrix but don't move the gauge — we deliberately leave them out of the score so a B2B SaaS targeting the US/EU isn't penalised for ignoring Bytespider.
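
The exact aggregation formula isn't spelled out above, but one plausible reading is that the gauge is 100 times the average weight across findings for the critical-priority bots. A sketch under that assumption (the weights mirror the list above; everything else is illustrative):

```ts
// A sketch of one plausible scoring scheme, not the product's exact formula.
type Severity = "pass" | "info" | "warning" | "critical";

const WEIGHT: Record<Severity, number> = {
  pass: 1.0,      // bot allowed at the root
  info: 0.95,     // neutral signal (no robots.txt, no llms.txt, no Sitemap:)
  warning: 0.5,   // partial block, training-only bot blocked, excessive Crawl-delay
  critical: 0.0,  // search bot blocked, noindex, robots.txt unreachable
};

interface Finding {
  bot: string;              // e.g. "GPTBot", "OAI-SearchBot"
  severity: Severity;
  criticalPriority: boolean; // true for the OpenAI / Anthropic / Perplexity / Google bots
}

// Composite = 100 × mean weight over critical-priority findings only;
// extra-priority bots show up in the matrix but never move the gauge.
function compositeScore(findings: Finding[]): number {
  const scored = findings.filter((f) => f.criticalPriority);
  if (scored.length === 0) return 100;
  const avg = scored.reduce((sum, f) => sum + WEIGHT[f.severity], 0) / scored.length;
  return Math.round(avg * 100);
}
```

Under that reading, a single training-bot warning among otherwise clean priority bots lands in the mid-90s, which lines up with the troubleshooting note about GPTBot further down.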

What each finding tells you

Findings break into two families:

Site-wide checks

These apply to your domain as a whole, not to a specific bot.

| Finding | Severity | What it means |
| --- | --- | --- |
| noindex_header | Critical | Your homepage responds with X-Robots-Tag: noindex. Look in your CDN, hosting platform, or framework config — on Vercel this often sits in vercel.json headers; on Cloudflare in a Transform Rule. |
| noindex_meta | Critical | The HTML contains a <meta name="robots" content="noindex">. Find the layout / template that renders it and remove or scope the directive. |
| robots_unreachable | Critical | robots.txt returns a 5xx or times out. AI crawlers conservatively assume the site is blocked — fix the underlying server issue. |
| no_robots_txt | Info | robots.txt returns a 404. Default-allow, so this doesn't block anything — but adding one lets you point bots at your sitemap and explicitly allow AI crawlers. |
| sitemap_missing | Info | No Sitemap: directive in your robots.txt. AI crawlers and search engines use it to discover deep pages. |
| crawl_delay_excessive | Warning | One of your robots.txt groups sets Crawl-delay: 30 or higher. Most major bots ignore Crawl-delay, but very high values can still slow legitimate indexers. |
| no_llms_txt | Info | No /llms.txt file. Not yet required by any bot, but first-mover advantage is real (llmstxt.org). |
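
For the two critical noindex findings, detection amounts to inspecting the X-Robots-Tag response header and scanning the HTML for a robots meta tag. A rough sketch of what that could look like (illustrative helpers, not the Monitor's actual parser):

```ts
// Hypothetical detection sketch for noindex_header and noindex_meta.
function hasNoindexHeader(headers: Headers): boolean {
  // X-Robots-Tag may carry several comma-separated directives,
  // optionally scoped to a bot ("googlebot: noindex").
  const value = headers.get("x-robots-tag") ?? "";
  return value
    .toLowerCase()
    .split(",")
    .some((directive) => directive.trim().split(":").pop()!.trim() === "noindex");
}

function hasNoindexMeta(html: string): boolean {
  // Matches <meta name="robots" content="... noindex ..."> regardless of attribute order.
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some(
    (tag) =>
      /name\s*=\s*["']robots["']/i.test(tag) &&
      /content\s*=\s*["'][^"']*noindex[^"']*["']/i.test(tag),
  );
}
```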

Per-bot findings

| Finding | Severity (typical) | What it means |
| --- | --- | --- |
| bot_allowed | Pass | The bot is allowed at the root. |
| bot_disallowed | Critical (search) / Warning (training) | An explicit User-agent: <bot> + Disallow: / group is blocking the bot. |
| wildcard_disallow | Critical / Warning | A User-agent: * + Disallow: / group is blocking the bot transitively (no per-bot override). |
| partial_disallow | Warning | The bot can fetch the homepage but specific paths (/blog, /docs, /pricing) are blocked. This is the most common reason competitors get cited and you don't. |
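
Under RFC 9309, a bot obeys the group that names it most specifically and only falls back to the * group when no group names it, which is why a per-bot allow overrides a wildcard Disallow: /. A simplified classifier along those lines (parsing and longest-path matching omitted; names are illustrative):

```ts
// Simplified classifier: which per-bot finding would a given bot get?
interface RobotsGroup {
  userAgents: string[]; // e.g. ["GPTBot"] or ["*"]
  disallow: string[];   // path prefixes; "/" blocks everything
  allow: string[];
}

type BotFinding = "bot_allowed" | "bot_disallowed" | "wildcard_disallow" | "partial_disallow";

function classifyBot(bot: string, groups: RobotsGroup[]): BotFinding {
  // Prefer a group that names the bot explicitly; otherwise fall back to the "*" group.
  const own = groups.find((g) => g.userAgents.some((ua) => ua.toLowerCase() === bot.toLowerCase()));
  const group = own ?? groups.find((g) => g.userAgents.includes("*"));
  if (!group) return "bot_allowed"; // no matching group means default allow

  const rootBlocked = group.disallow.includes("/") && !group.allow.includes("/");
  if (rootBlocked) return own ? "bot_disallowed" : "wildcard_disallow";
  return group.disallow.length > 0 ? "partial_disallow" : "bot_allowed";
}
```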

The fix drawer

Click any bot row and a side drawer slides in with everything you need to act:

  1. Markdown explanation. Why the finding matters for this specific bot — search vs training, and the blast radius if you don't fix it.
  2. Evidence quoted from your site. The exact robots.txt line or header value that triggered the finding, so you can find it in your codebase or CDN config.
  3. Copy-paste fix snippet. Ready to drop into your robots.txt. Single click to copy.
  4. Framework hints. Where to put the snippet for app/robots.ts (Next.js), public/robots.txt (Astro / Vercel / Cloudflare Pages), Yoast / Rank Math (WordPress), or a custom server route (the Next.js case is sketched below this list).
  5. Vendor documentation link. Direct link to the bot vendor's official crawler page — useful when you need to defend a config choice in a code review.
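
As an example of the Next.js framework hint, a drawer snippet would typically end up expressed in app/robots.ts, which Next.js serves as /robots.txt. A minimal sketch (the bot names, paths, and domain are placeholders; adapt them to the snippet the drawer gives you):

```ts
// app/robots.ts: a Next.js metadata route that generates /robots.txt.
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Explicitly allow the AI search crawlers you care about at the root...
      { userAgent: ["OAI-SearchBot", "PerplexityBot", "Googlebot"], allow: "/" },
      // ...and keep whatever general policy you already have for everyone else.
      { userAgent: "*", allow: "/", disallow: "/admin/" },
    ],
    sitemap: "https://yourdomain.com/sitemap.xml",
  };
}
```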

Plan limits

| Plan | Manual audits | Weekly cron | Bots visible | Fix snippets |
| --- | --- | --- | --- | --- |
| Free | 1 per UTC week (resets Mon 00:00) | ✗ | Top 3 critical issues, read-only | ✗ |
| Pro | 5 per UTC day | Sundays 03:00 UTC | All 15 | ✓ |
| Business | 10 per UTC day | Sundays 03:00 UTC | All 15 | ✓ + email alerts (coming soon) |

The free read-only view is intentional: it surfaces the most damaging findings so you know whether you have a problem, but copy-paste fix snippets and the full bot matrix are gated to paid plans.

Cost and infrastructure

Three plain HTTPS fetches per audit:

  • https://yourdomain/robots.txt — capped at 256 KB
  • https://yourdomain/ — homepage, capped at 512 KB
  • https://yourdomain/llms.txt — HEAD only, just for status

Total wall time on a healthy domain is 300–1500 ms. Vercel function budget is 300 s, so the weekly cron processes hundreds of domains sequentially without breaking a sweat.
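
A rough sketch of those three fetches with the caps applied, assuming plain fetch calls with a timeout (helper names and the timeout value are illustrative):

```ts
// Hypothetical sketch of the three capped fetches behind one audit.
async function fetchCapped(url: string, capBytes: number, timeoutMs = 10_000) {
  const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs), redirect: "follow" });
  const buf = await res.arrayBuffer();
  // Truncate oversized bodies instead of failing; status codes feed the findings
  // (a 404 on robots.txt becomes no_robots_txt, a 5xx becomes robots_unreachable).
  return { status: res.status, body: new TextDecoder().decode(buf.slice(0, capBytes)) };
}

async function fetchSignals(domain: string) {
  const [robots, homepage, llmsStatus] = await Promise.all([
    fetchCapped(`https://${domain}/robots.txt`, 256 * 1024),
    fetchCapped(`https://${domain}/`, 512 * 1024),
    // llms.txt only needs a status check, so a HEAD request is enough.
    fetch(`https://${domain}/llms.txt`, { method: "HEAD" }).then((r) => r.status),
  ]);
  return { robots, homepage, hasLlmsTxt: llmsStatus >= 200 && llmsStatus < 300 };
}
```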

Zero LLM tokens. All parsing is deterministic: an RFC 9309 robots parser, an X-Robots-Tag header parser, a regex extractor for <meta name="robots"> tags. Findings are composed from a static ruleset, fix snippets are static templates, framework hints are static strings. No model call at any point.

Troubleshooting

"My audit shows 100 but Cursor / Stripe / a tool I trust reports us as blocked." Different tools check different signals. We focus on the canonical 15 AI crawlers and the four signals above. If a third- party tool flags something else (e.g. an old OG meta tag, a missing canonical), it's measuring something different — both can be valid.

"Score went from 100 to 50 overnight without us touching robots.txt." Most likely a CDN or hosting platform update flipped on an X-Robots- Tag: noindex header. Check the latest finding in the site-wide checks section — the evidence field shows the exact header value we received.

"I want to opt out of training but stay in search." Block GPTBot and ClaudeBot (training crawlers), keep OAI-SearchBot, Claude-Web, PerplexityBot, Perplexity-User, and Googlebot allowed. The Monitor will surface the training blocks as warnings (intentional, not red) and keep your score in the high 80s — that's the correct visual.

"GPTBot is blocked but my score is still 95-ish, not red." Correct behavior. GPTBot is training-only — blocking it doesn't kill ChatGPT live answers (those go through OAI-SearchBot). The warning is informative, not catastrophic.

"Wildcard disallow flagged everything as critical, score is in the teens." Add explicit per-bot allows before the wildcard Disallow: /. The fix snippet in the drawer is exactly the block to paste. Order in robots.txt doesn't matter for grouping (the longest- match wins), but for human readability put per-bot blocks at the top.

Related reading

  • How AI bots discover your site — the conceptual backdrop and bot taxonomy.
  • Content audit — page-level signals (schema, headings, citations) that come into play after the bot can fetch the page.
  • Methodology — how we score every layer of the visibility stack, end-to-end.