5 Technical Mistakes That Reduce AI Citation Eligibility
Crawl policy, machine-readable clarity, and entity consistency are prerequisites for being cited. This guide separates vendor-documented bot roles so you do not optimize against the wrong lever.
You can have strong positioning and still see your brand omitted from AI answers. Sometimes the gap is not “thought leadership,” but eligibility: crawlers cannot fetch the right pages, machines cannot confidently interpret what you sell, or your ecosystem signals disagree.
This article lists five recurring technical and information-architecture issues that reduce citation eligibility. It is not a guarantee of inclusion — no third party can promise citations — but these fixes remove self-inflicted friction.
Methodology & sources
Editorial review for factual claims (as of 2026-04-11).
- Verified against vendor docs: OpenAI crawler roles, Perplexity crawler roles, Anthropic crawler roles, and Google Search guidance on AI features.
- Interpretation (labeled): where we describe “what practitioners commonly see,” it is heuristic — not a published ranking formula.
- Limits: AI answers vary by product surface, model, locale, personalization, and time; treat monitoring outputs as directional.
Why technical GEO matters
AI-mediated retrieval still depends on the open web in many product experiences: pages are fetched, parsed, summarized, and linked. Systems can fail on the same basics as any large-scale crawler: blocked access, heavy client-side rendering, slow responses, unclear structure, and contradictory canonical signals.
Mistake 1: Treating all “AI bots” as one thing in robots.txt
robots.txt is a crawl policy file. Different user agents exist for different purposes — training collection, search indexing, and user-initiated fetching are not interchangeable.
If you copy/paste rules without mapping them to outcomes, you can accidentally optimize the wrong lever (for example blocking search inclusion while believing you only blocked “training”).
OpenAI: OAI-SearchBot, GPTBot, and ChatGPT-User (distinct roles)
According to OpenAI’s crawler overview:
- `OAI-SearchBot` is for search. OpenAI states it is used to surface websites in ChatGPT search features, and that sites opted out of `OAI-SearchBot` will not be shown in ChatGPT search answers (with some nuance around navigational links).
- `GPTBot` is for training data collection for generative foundation models. Disallowing `GPTBot` signals that content should not be used in that training pipeline; this is not the same control surface as ChatGPT search inclusion.
- `ChatGPT-User` is user-initiated. OpenAI states it is used when users (or certain product actions) request page fetches, that it is not used for automatic crawling, and critically: "Because these actions are initiated by a user, robots.txt rules may not apply." OpenAI also states `ChatGPT-User` is not used to determine Search inclusion and points publishers to `OAI-SearchBot` for managing Search opt-outs.
Source: OpenAI — Overview of OpenAI Crawlers: developers.openai.com/api/docs/bots
Editorial implication: “If you block GPTBot, ChatGPT cannot see your site” is not a generally correct statement. If your goal is visibility in ChatGPT search results, the documented lever to audit first is typically OAI-SearchBot, not GPTBot.
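As an illustrative sketch only (the user-agent tokens are the ones OpenAI documents; whether to allow or disallow each is your policy decision, not a recommendation), a robots.txt that permits search surfacing while opting out of training collection could look like:

```txt
# Allow ChatGPT search surfacing
User-agent: OAI-SearchBot
Allow: /

# Opt out of training data collection
User-agent: GPTBot
Disallow: /
```

Verify the current tokens against OpenAI's crawler overview before shipping; vendor user-agent strings can change.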
Perplexity: PerplexityBot vs Perplexity-User
Perplexity documents:
- `PerplexityBot` is for indexing/surfacing sites in Perplexity search results (and explicitly not described there as crawling for foundation model training).
- `Perplexity-User` is for user-initiated access when answering questions.
Perplexity also states that because user-initiated fetch is user-requested, that fetcher generally ignores robots.txt rules.
Source: Perplexity — Perplexity crawlers: docs.perplexity.ai/guides/bots
Anthropic: ClaudeBot, Claude-SearchBot, and Claude-User
Anthropic’s Help Center describes three bots and what happens when each is disabled — including separate effects on training collection vs user-directed retrieval vs search indexing quality.
Source: Anthropic — Does Anthropic crawl data from the web, and how can site owners block the crawler?: support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
Google: AI features vs Google-Extended
Google’s guidance for AI Overviews / AI Mode is framed as part of Google Search eligibility and best practices.
Separately, Google-Extended is a crawl token used in a different policy context (training/use of content by Google’s generative AI products); it is not a substitute label for “the AI Overviews crawler” and should not be presented as one.
Sources:
- Google Search Central — AI features and your website: developers.google.com/search/docs/appearance/ai-features
- Google — Google-Extended: developers.google.com/search/docs/crawling-google-overview/google-extended
A simple decision checklist (policy-first)
- Search inclusion: which engines matter for discovery — and which search bot(s) does each vendor document for that outcome?
- Training collection: do you want to allow/disallow training crawlers independently of search bots (OpenAI explicitly frames these as independent `robots.txt` settings)?
- User-initiated fetch: understand that some user agents may not behave like classic “respect robots.txt” crawlers for all requests; your policy, auth, and edge protections matter.
How to check
Open https://yourdomain.com/robots.txt and map each User-agent group you maintain to the vendor’s description of what that token controls.
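To make that mapping mechanical, one sanity check is a short script (a minimal sketch using Python’s standard library; the inline rules and the yourdomain.com URL are placeholders — in practice you would fetch your live file with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Placeholder policy mirroring the example above; replace with your live robots.txt
rules = """
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Map each documented user-agent token to the outcome your policy implies
for agent in ["OAI-SearchBot", "GPTBot", "PerplexityBot"]:
    allowed = rp.can_fetch(agent, "https://yourdomain.com/pricing")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note the limits: this only tells you what your file says, and (per the vendor docs above) user-initiated fetchers may not honor it at all.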
For product-level differences in citation UX and documented crawl roles between assistants, read ChatGPT vs Perplexity. For how Google positions AI Overviews and AI Mode inside Search (and common measurement misconceptions), read Google AI Overviews vs AI Mode.
Mistake 2: Missing or misleading Schema.org JSON-LD
Structured data helps machines classify pages (organization vs article vs software product) and extract fields like name, description, and offers.
Google’s AI-features documentation states there is no special schema requirement to appear in AI Overviews / AI Mode as a supporting link — but incorrect structured data can still confuse parsers and undermine trust.
Common issues
- Wrong type for the page: product pages marked up like blog posts.
- Missing `Organization` (or equivalent) clarity on the canonical “who we are” URL.
- FAQ content without `FAQPage` markup when the visible content is Q/A-shaped (optional, but it can improve machine extractability; it is not a guaranteed inclusion mechanism).
Minimal examples (adapt fields to your truth)
Organization JSON-LD belongs wherever you define the canonical company entity:
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "Your Brand Name",
"url": "https://yourdomain.com",
"logo": "https://yourdomain.com/logo.png",
"description": "One clear sentence describing what your product does.",
"sameAs": [
"https://www.linkedin.com/company/yourbrand"
]
}
SoftwareApplication JSON-LD is a common fit for SaaS product pages:
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "Your Product Name",
"applicationCategory": "BusinessApplication",
"operatingSystem": "Web",
"description": "What your product does in one sentence."
}
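Before deploying blocks like these, it is worth checking that they at least parse and carry the basics. A minimal Python sketch (the required-field list here is an editorial choice for illustration, not a schema.org mandate):

```python
import json

def check_jsonld(raw: str, required: tuple = ("@context", "@type", "name")) -> list:
    """Return a list of problems found in a JSON-LD string (empty list = OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON-LD value is not an object"]
    return [f"missing or empty field: {key}" for key in required if not data.get(key)]

snippet = '{"@context": "https://schema.org", "@type": "Organization", "name": "Your Brand Name"}'
print(check_jsonld(snippet))
```

For anything beyond this smoke test, use a full validator (for example the Schema Markup Validator) rather than extending the sketch.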
Mistake 3: Treating llms.txt as mandatory (or skipping it without strategy)
llms.txt is a lightweight, community-oriented text file intended to help tools find curated pointers to important pages. It is not a Google Search ranking requirement.
Google’s Search Central guidance for AI features explicitly says you do not need new machine-readable files to appear in AI Overviews / AI Mode — foundational SEO practices and normal indexing eligibility are the stated baseline.
That said, llms.txt can still be useful as an internal “routing table” for humans and tools — especially if your site is large and your best answers are not obvious from navigation alone.
Source (community spec): llmstxt.org
Mistake 4: Inconsistent brand entity signals
Automated systems rely on corroboration. If your canonical site, LinkedIn, major business directories you actually use, GitHub org, and review profiles disagree on name, category, or what the product does, you increase ambiguity — and ambiguous entities are easier to omit than defend.
Audit checklist (30 minutes)
- Legal/marketing name (capitalization, suffixes)
- One-sentence category claim (what you are / are not)
- Primary product URL (canonical homepage vs product subdomain)
- Founding year / location (only if you publish them — keep them aligned)
- Leadership names (if you cite them publicly)
What “good” looks like
One canonical “brand truth” document internally, reflected consistently in public profiles and on-page copy.
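The profile comparison in the audit above can be made mechanical. A minimal sketch (the profile names and “Acme” strings are illustrative; paste in the strings you actually publish, and treat the suffix list as a starting point):

```python
from collections import Counter

# Illustrative values; replace with the strings actually published on each profile
profiles = {
    "website": "Acme Analytics",
    "linkedin": "Acme Analytics",
    "github": "acme-analytics",
    "review_site": "ACME Analytics Inc.",
}

def normalize(name: str) -> str:
    """Lowercase and strip punctuation/suffix noise so cosmetic variants match."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    tokens = [t for t in cleaned.split() if t not in {"inc", "llc", "ltd"}]
    return " ".join(tokens)

variants = Counter(normalize(n) for n in profiles.values())
for variant, count in variants.items():
    print(f"{count}x {variant!r}")
```

More than one surviving variant after normalization is the signal to investigate: either the difference is cosmetic (fix the normalizer) or your profiles genuinely disagree (fix the profiles).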
Mistake 5: Pages that never directly answer buyer questions
Many landing pages optimize for persuasion, not extractability. AI-mediated answers disproportionately reuse explicit statements: definitions, steps, comparisons, limitations, pricing mechanics (when published), and “who it is for.”
Information architecture signals
- Prefer headings that match real questions buyers ask.
- Put the direct answer first under the heading, then supporting detail.
- Use lists and comparison tables for “vs” intents.
This is not “tricking” a model — it is aligning your public documentation with how people ask questions.
Quick audit checklist
- [ ] Map `robots.txt` decisions to search vs training vs user-initiated bot roles (per vendor docs)
- [ ] Validate JSON-LD for major templates (homepage, pricing, product)
- [ ] Decide whether you will maintain `llms.txt` as a curated map (optional)
- [ ] Compare brand/category strings across your top public profiles
- [ ] Rewrite top headings into question form where appropriate
- [ ] Keep key URLs fast and stable on mobile (measure with your own field data / RUM — avoid universal “X seconds” thresholds)
Soft utility: if you want a reusable prompt panel, start from the query patterns in What Is GEO? and track the same 10–20 prompts weekly.
Closing note
Technical work removes blockers. Editorial and ecosystem credibility (reviews, independent writeups, documentation depth) still determine whether you are worth citing once you are eligible.
Sources and official documentation
- OpenAI — Overview of OpenAI Crawlers: developers.openai.com/api/docs/bots
- Perplexity — Perplexity crawlers: docs.perplexity.ai/guides/bots
- Anthropic — Does Anthropic crawl data from the web…?: support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
- Google Search Central — AI features and your website: developers.google.com/search/docs/appearance/ai-features
- Google — Google-Extended: developers.google.com/search/docs/crawling-google-overview/google-extended
Related articles
Visibility baseline
Establish an AI mention baseline you can defend
GEO Tracker AI runs repeatable checks for supported engines so you can see whether your brand is mentioned, what context shows up, and how that changes week over week — complementary to Search Console, not a replacement for it.