AI search visibility in 2026: the hygiene checklist, four free tools, and the honest math behind it
A practical 2026 hygiene checklist for AI search visibility — four free tools, eight guides, the per-engine differences, and the honest math on what each step delivers.
If you're trying to figure out whether ChatGPT, Perplexity, or Google AI Mode cite your brand for the questions your buyers ask — and what to do if they don't — there are exactly five signals worth thinking about. Four are technical hygiene you can ship yourself. The fifth is measurement.
Most B2B SaaS sites ship zero of them.
This post walks through all five, with one free tool per signal where applicable, and a link to the deep-dive guide if you want more than the summary. We just shipped the toolkit and guides — everything is free, runs in the browser, no signup. The whole thing took about two weeks to build. The math at the bottom is honest about what each step actually delivers.
The five signals, ranked by leverage
- Measurement first. Without a Share of Voice baseline you can't tell whether anything you ship moves the needle. Run a 60-second audit at /grader before doing the other four.
- Reddit / community footprint. The single highest-leverage citation source in 2026 — Reddit is 24% of Perplexity citations and powers Google AI Mode's new Expert Advice + Community Perspectives sections after the May 6 2026 update.
- JSON-LD on the homepage. Organization + SoftwareApplication (or Service / Product) schema. Not a ranking signal, but a citation-accuracy signal — affects how AI engines cite you, which matters more than people think.
- robots.txt for AI bots. Most sites either block everything or allow everything by default. Both are wrong for most cases. There are three sensible stances.
- llms.txt. A courtesy file for AI agents. Not a ranking signal, doesn't guarantee citation, but takes 60 seconds and the asymmetry is right.
The honest claim: technical hygiene alone caps your Share of Voice somewhere around 30%. Past that, community presence (Reddit, comparison pages, third-party reviews) and consistent content quality are what move the needle.
The framing — what AI engines actually decide on
Strip away the marketing and AI search engines decide three things on every query:
- Should this brand be in the answer at all? (Entity recognition + citation eligibility)
- In what position, with what facts about it? (Source quality + structured data extraction)
- From which third-party source are we pulling the supporting reference? (Authority + freshness + retrieval set composition)
The four technical hygiene signals influence those three decisions. The fifth signal — measurement — tells you whether any of it worked.
The funnel from "I shipped llms.txt" to "ChatGPT cites me" looks like this:
Technical hygiene → AI engine can ingest + understand your site
↓
Content + community footprint → AI engine considers you a valid candidate
↓
Authority signals → AI engine picks you over a competitor for the citation slot
↓
Measurement → you find out whether any of the above worked
Skip any layer and the next one fails silently. Skip measurement and the whole thing is faith-based.
Signal 1 — llms.txt (the navigational map)
What it is: A plain-text file at the root of your domain (yourdomain.com/llms.txt) that gives AI agents a Markdown map of which pages on your site matter. Think of it as a sitemap, but for AI.
What it does: AI agents that read it (Claude clients, Perplexity, MCP servers, smaller open-source crawlers) get a curated summary of your site instead of having to crawl page-by-page.
What it does NOT do: Google has stated explicitly it's not used for ranking in AI Mode or AI Overviews. Bing has not endorsed it as a ranking signal either. Shipping a perfect file does not make ChatGPT cite you.
When it matters: Small + medium sites where you can curate 10–30 highest-value links. Five minutes to ship. Asymmetric upside: if it helps even one AI agent cite you accurately, you've recovered the cost.
When to skip: You already auto-generate llms.txt from a CMS or content collection (Next.js dynamic route, Sanity, Hugo build step). Don't maintain a hand-written file alongside an auto-generated one.
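For reference, a minimal llms.txt sketch following the llmstxt.org shape (H1 name, blockquote summary, H2 sections of links). The domain, pages, and descriptions here are hypothetical placeholders:

```markdown
# Acme Analytics

> Acme Analytics is a product-analytics platform for B2B SaaS teams.

## Docs

- [Quickstart](https://acme.example/docs/quickstart): Install and send your first event
- [API reference](https://acme.example/docs/api): REST endpoints and authentication

## Product

- [Pricing](https://acme.example/pricing): Plans, limits, and billing FAQ
```

Curation is the point: 10–30 links you'd want an AI agent to read first, not a dump of the full sitemap.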
→ Free llms.txt generator (60 seconds, no signup) · Full setup guide (framework setup notes, common mistakes, verification workflow)
Signal 2 — robots.txt for AI bots (the access policy)
What it is: The 30-year-old crawler-policy file at yourdomain.com/robots.txt. In 2026 it's the primary lever for controlling whether your content trains AI models, appears in AI citations, or both.
What it does: Tells polite AI crawlers (OpenAI's GPTBot, OAI-SearchBot, ChatGPT-User; Anthropic's ClaudeBot, Claude-User, Claude-SearchBot; PerplexityBot, Perplexity-User; Google-Extended; CCBot; Bytespider; and ~10 more) which paths they may fetch.
Three sensible stances:
| Stance | Who picks this | What you keep / lose |
|---|---|---|
| Allow everything | Most B2B SaaS, agencies, content sites | Maximum training + citation inclusion |
| Block training, allow citation | Publishers, IP-sensitive sites | Opt out of model training but stay citable in live AI search |
| Block all AI | Paid content, news with licensing deals | Block both training and citation across major engines |
The most common mistake: Treating GPTBot as "the ChatGPT bot". It's not — GPTBot is training-only; ChatGPT citations come from OAI-SearchBot and ChatGPT-User. Blocking GPTBot alone does not stop you from being cited in ChatGPT search.
The second mistake: Assuming Google-Extended controls AI Mode. It controls Bard / Gemini / Vertex AI training; AI Mode and AI Overviews run on Googlebot, which is unaffected.
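As a sketch, the middle stance ("block training, allow citation") might look like the file below. The user-agent strings match the vendors' published bot names, but crawler lists change, so verify against the current OpenAI, Anthropic, and Google documentation before shipping:

```text
# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow live-search / citation crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```

Note how this stance depends on the GPTBot vs OAI-SearchBot split described above: blocking the training bot while allowing the search bot keeps you citable in ChatGPT search.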
→ Free robots.txt builder for AI bots (18 crawlers, three preset stances, browser-side) · Full guide with per-bot context
Signal 3 — JSON-LD (the machine-readable identity)
What it is: A <script type="application/ld+json"> block in your page HTML using schema.org vocabulary to declare what kind of thing the page is about. Browsers ignore it; AI engines parse it explicitly.
What it does: Ships your brand entity as structured facts — name, url, sameAs (social URLs), description, applicationCategory, price. AI engines pull facts about your brand preferentially from JSON-LD over inferring from prose.
What it does NOT do: Google has consistently said structured data does not boost ranking directly. It makes you eligible for rich result formats and helps the model understand the page. Same for AI engines — it affects how you're cited (name, description, url), not whether.
The four schemas worth shipping first:
- Organization — on every page. Foundation. Declares who you are.
- SoftwareApplication (SaaS) or Service (agency) or Product (e-commerce) — on homepage + key product pages.
- Article / BlogPosting — per blog post. Most CMS frameworks emit this automatically — verify yours does before hand-adding (duplicate blocks cause validation issues).
- FAQPage — pages with visible FAQ sections. Only include FAQs that are actually on the page; schema-only FAQs are a Google guidelines violation.
The common mistake nobody notices: Drift across pages. Organization name or url differs between homepage and blog posts. AI engines treat these as different entities. Use a shared partial / component in your codebase.
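A minimal Organization + SoftwareApplication pair might look like this (names, URLs, and category are placeholders; the `@id` cross-reference is one common pattern for keeping the two entities linked):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://acme.example/#org",
      "name": "Acme Analytics",
      "url": "https://acme.example",
      "sameAs": ["https://www.linkedin.com/company/acme-analytics"]
    },
    {
      "@type": "SoftwareApplication",
      "name": "Acme Analytics",
      "url": "https://acme.example",
      "applicationCategory": "BusinessApplication",
      "publisher": { "@id": "https://acme.example/#org" }
    }
  ]
}
</script>
```

Emitting this from one shared partial is what prevents the name/url drift described above: if every page renders the same component, the entity cannot diverge between homepage and blog posts.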
→ Free JSON-LD generator (4 entity types, one-click validate via Google Rich Results Test) · Full setup guide with worked examples and common mistakes
Signal 4 — Reddit / community footprint (the discovery surface)
This is the one most B2B SaaS founders skip and shouldn't.
The 2026 data (Tinuiti Q1 2026 AI Citation Trends Report):
- Reddit is 24% of all Perplexity citations. The single most-cited domain.
- ChatGPT cites Reddit in >5% of responses. Less concentrated than Perplexity but still the largest non-Wikipedia UGC source.
- In commercial tech categories, Reddit's citation share grew +73% YoY.
The May 6 2026 Google AI Mode update introduced two new named sections — "Expert Advice" and "Community Perspectives" — that pull preferentially from Reddit and similar forum content. BrightEdge measured AI Mode's UGC citation share at ~17.5%, 35× more than ChatGPT (0.5%) and 87× more than Gemini (0.2%).
If your brand isn't mentioned in the top-upvoted Reddit threads about your category, you are missing the single highest-leverage citation lever AI search offers in 2026.
The work is real — comment with disclosure on threads where the category gets discussed, value-first, no link-spam, no affiliate links. The shape of a comment that gets cited: direct answer → trade-off → concrete data point → disclosure. The 200-word version beats the 800-word version every time.
→ Full Reddit citation strategy guide with subreddit selection, comment patterns, disclosure etiquette, and measurement workflow
Cross-reference: our deeper unpack of the May 6 update is in Reddit Is Now Inside Google's AI Mode (May 6, 2026).
The 5th signal — measurement (where everything else lives or dies)
Here's the asymmetric thing about AI search optimisation: you can ship the four technical hygiene signals perfectly and have zero idea whether any of them moved the needle. Without measurement, every AI-search conversation collapses into anecdote.
The category has now converged on three measurements that matter:
- Share of Voice (sometimes Share of Model) — of all the buyer-questions your prospects ask AI engines, what percentage now cite your brand? Otterly, Profound, Peec AI, Tinuiti, BrightEdge — every tool publishes this metric with the same math.
- AI citation count over time — is the raw count rising, flat, or falling per engine (ChatGPT / Perplexity / Google AI Mode)? Different engines move on different time scales.
- Outcome of specific actions — when you ship llms.txt or a Reddit comment, does Share of Voice on the target prompts actually move 14 days later, or is it noise?
The third is the hardest, because AI engines re-rank weekly and a single scan is noise. The pattern that works: anchor a 14-day measurement window on a shipped action, compare pre-window vs post-window on the target prompts only, tag with a low / medium / high confidence band based on sample size.
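The three measurements above reduce to simple arithmetic. A minimal Python sketch, assuming a "scan" is just the list of brands an engine cited for one prompt; the confidence thresholds are illustrative, not a published standard:

```python
from statistics import mean

def share_of_voice(scans: list[list[str]], brand: str) -> float:
    """Share of Voice: fraction of prompt scans whose citation list
    includes the brand. Binary per scan (cited or not), as described above."""
    hits = sum(1 for cited in scans if brand in cited)
    return hits / len(scans)

def outcome_delta(pre: list[float], post: list[float], min_n: int = 10):
    """14-day Outcome Loop sketch: compare mean SoV on the target prompts
    before vs after a shipped action, with a crude confidence band based
    on sample size (thresholds here are assumptions for illustration)."""
    delta = mean(post) - mean(pre)
    n = min(len(pre), len(post))
    confidence = "high" if n >= 3 * min_n else "medium" if n >= min_n else "low"
    return delta, confidence
```

The binary-per-scan definition is why a single scan is noise: one extra citation swings SoV by a full 1/n, which is also why the confidence band should tighten only as the prompt panel grows.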
→ Full measurement guide — Share of Voice math, citation tracking, the 14-day Outcome Loop pattern · Run a free 60-second audit to get a baseline today
Per-engine differences — what each AI cites differently
The four hygiene signals + measurement apply universally, but the engines themselves differ materially in how they pick citations:
| Engine | Top citation sources (2026) | Reddit share | Retrieval model |
|---|---|---|---|
| ChatGPT | Wikipedia, YouTube, Bing-ranked sites, gov/edu | >5% | Trained knowledge + live (search mode via Bing) |
| Perplexity | Reddit, third-party reviews, fresh blog posts | 24% | Pure live retrieval (Sonar: 10 fetched → 3–4 cited) |
| Google AI Mode | Top-10 organic (54% overlap), Reddit, YouTube | ~17.5% UGC | Googlebot-driven, May 2026 update widened source mix |
| Gemini | Government / academic / institutional (~26%) | 0.1% | Authority-biased, conservative |
| Microsoft Copilot | Mixed; high brand mention rate (26.7%) | Modest | Bing-driven, similar to ChatGPT search |
Practical implication: the engine your buyers use determines which hygiene levers matter most. If they're on Perplexity, Reddit footprint is half the game. If they're on Gemini, Reddit barely matters and authority signals (government, academic, Wikipedia citations) dominate.
→ Per-engine playbooks: ChatGPT · Perplexity · Google AI Mode
For the deeper engine-by-engine breakdown, see AI Visibility Is Not One Channel and ChatGPT vs Perplexity citations.
The decision tree — where to start
This is the most useful section in the post. Pick the row that describes you today.
| If you're here today | Ship this next |
|---|---|
| No baseline. Don't know if AI cites you at all. | Run free 60-second audit. 5 minutes. You'll know whether you have a problem worth solving. |
| Baseline + bad result. Audit returned 0–20% Share of Voice. | Start with JSON-LD — entity clarity is the single highest-leverage technical lever. Then robots.txt sanity check. |
| Baseline + middling result. 20–50% Share of Voice. | llms.txt as cheap insurance, then the Reddit citation strategy — this is where most teams plateau and the answer is community presence. |
| Baseline + good result. 50%+ Share of Voice. | Pick your weakest engine (ChatGPT / Perplexity / AI Mode) and run the per-engine playbook. |
| Strong result, want to defend. 65%+ SoV. | Measurement guide — set up the 14-day Outcome Loop and watch for drift. New entrants in your category will erode your share unless you measure. |
The order matters. Each step assumes the previous one is done. Skip steps and you'll ship effort with no measurement to tell you whether anything moved.
The honest math — what 12 weeks of hygiene delivers
People reasonably want to know: if I do all four hygiene steps + the Reddit work, what happens?
Honest ranges based on customer outcomes + observed data across the GEO category:
- Technical hygiene alone (llms.txt + robots.txt sanity + JSON-LD) — moves Share of Voice from 0–10% to ~15–30% on a buyer-question panel over 4–8 weeks. The improvement comes from citation accuracy (right brand name, right description, right url) more than from getting newly cited.
- Technical + Reddit work (assuming 5–10 quality comments on relevant high-intent threads) — moves SoV from ~20% to ~35–55% over 6–12 weeks. The lag is real because AI engines take 7–21 days to incorporate fresh Reddit threads into their retrieval set.
- Technical + Reddit + comparison content + third-party reviews — moves SoV into the 50–70% range over 3–6 months. This is the durable plateau most B2B SaaS sites can reach without a brand budget.
- 65%+ SoV — typically requires brand authority and content gravity that take 12+ months to accumulate. There is no technical-hygiene shortcut past this point.
What this means for expectations:
- Don't promise yourself "ChatGPT will cite us next week" — the citation lag is 7–21 days and noise dominates single-week measurements.
- Don't conclude "this didn't work" after one scan — single scans are anecdote; trend over 30+ days is data.
- Don't conflate Share of Voice with rank — SoV is binary per scan (cited or not); position when cited is a separate metric.
- Don't expect Reddit comments to translate to traffic in week one — they translate to AI citations over weeks; the traffic shows up downstream.
Where the paid product fits
Honest about our own positioning: we built these tools for free because the technical-hygiene part of AI search visibility is a 60-second-to-five-minute exercise per signal, and gating that behind a paywall would have been gross.
The paid product (Pro $129/mo, Business $299/mo) covers what a static file can't:
- Multi-engine Share of Voice tracking across ChatGPT, Perplexity, and Google AI Mode on a stable buyer-question panel, run on a schedule.
- Why-You-Lost analysis per losing prompt — three evidence-backed signals + three ranked actions, draft-ready for Reddit / HN / email pitch / GitHub PR / podcast.
- 14-day Outcome Loop — mark an action shipped, we re-measure 14 days later on the target prompts only, with confidence bands.
- Crawlability + content + technical-GEO auditing on a schedule.
For deeper context on the tooling landscape: we audited 22 AI visibility tools — every one shows metrics; ours is the only one that drafts the Reddit comment, the journalist pitch, or the email. Compare us against Profound, Peec AI, Otterly.ai, and AthenaHQ if you want vendor-by-vendor breakdowns.
The full toolkit (one place)
| Tool | What it does |
|---|---|
| llms.txt generator | Browser-side form, three site-type presets, editable section labels, copy or download |
| robots.txt for AI bots | 18 AI crawlers, three preset stances, per-bot vendor context |
| JSON-LD generator | Organization / SoftwareApplication / Article / FAQPage, one-click validate |
| Free AI audit | 60-second Perplexity scan, Share of Voice baseline, cited brands |
| Guide | What it covers |
|---|---|
| llms.txt setup | Framework setup (Next.js / WordPress / Webflow / static), mistakes, verification |
| robots.txt for AI bots | 18-bot breakdown, three stances, framework setup |
| JSON-LD for AI search | Four schemas, worked examples, validation workflow |
| Optimise for ChatGPT | GPTBot / OAI-SearchBot split, Bing backbone, 8-step plan |
| Optimise for Perplexity | Sonar retrieval (10 → 4), Reddit 24%, 8-step plan |
| Optimise for Google AI Mode | May 6 update, AI Mode vs AI Overviews, 8-step plan |
| Reddit citation strategy | Subreddit selection, comment patterns, disclosure etiquette, measurement |
| Measuring AI visibility | Share of Voice, citations, 14-day Outcome Loop |
Subscribe to new guides at /guides/feed.xml if you prefer RSS, or the blog feed for long-form posts.
Sources and official documentation
- Tinuiti — Q1 2026 AI Citation Trends Report: tinuiti.com/research-insights/research/ai-citation-trends-report/
- BrightEdge — AI Search: Same Brands, Different Sources: brightedge.com/resources/weekly-ai-search-insights/ai-search-same-brands-different-sources
- BrightEdge — Rank Overlap After 16 Months of AIO: brightedge.com/resources/weekly-ai-search-insights/rank-overlap-after-16-months-of-aio
- Google Search Central Community thread — llms.txt clarification: support.google.com/webmasters/thread/356453254
- Google Blog — Explore the web with generative AI in Search (May 6 2026 update): blog.google/products-and-platforms/products/search/explore-web-generative-ai-search/
- Search Engine Land — AI Citation Data: No Universal Top Source (analysis of Tinuiti report): searchengineland.com/ai-citation-data-no-universal-top-source-brands-471285
- OpenAI — Bots: platform.openai.com/docs/bots
- llmstxt.org — llms.txt specification: llmstxt.org
- schema.org — Vocabulary reference: schema.org
- Google — Rich Results Test: search.google.com/test/rich-results
Related articles
- Paid Ads vs SEO vs GEO in 2026: What Actually Broke in the Last 12 Months (27 min read)
- Reddit Is Now Inside Google's AI Mode (May 6, 2026). Here's the Brand Playbook. (14 min read)
- AI Visibility Is Not One Channel: How ChatGPT, Google, Perplexity, Claude, Gemini, and Grok See the Web (13 min read)