Brand selection. 22 SaaS brands across 6 categories (Dev / DX, Productivity / PM, Design, Marketing / CRM, No-code / Integration / Comms, Analytics / BI). Selection criteria: well-known to indie SaaS / dev / marketing buyers, real product with a stable buyer-intent vocabulary AI engines can recognise. GEO Tracker AI is intentionally NOT in the ranking — we are the measurement tool, not a ranked item.
Question generation. For each brand we generated 10 buyer-intent questions using the OpenAI gpt-5.4-nano model with a deterministic prompt seeded by the brand's public homepage description and target buyer. The seed is hand-curated (not scraped at runtime) so the questions don't drift between runs — the benchmark stays re-runnable.
Why never the brand name in the question. A query like "is Linear good for engineering teams?" tests name recall (does the LLM know the brand exists). We don't care about that; we care about category-level recall (does the LLM cite the brand, unprompted, when a buyer asks the category question). Real buyer questions at this stage don't name the brand, and our generator strips the brand name explicitly.
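The brand-name rule can be enforced with a small guard on the generated questions. The following is a minimal sketch, not the pipeline's actual code: the function names `mentionsBrand` and `keepCategoryLevel` and the alias list are hypothetical, and it assumes a case-insensitive whole-token match is strict enough.

```typescript
// Hypothetical guard: drop any generated question that names the brand.
// Assumption: a brand is identified by a short list of aliases (name, domain).

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function mentionsBrand(question: string, aliases: string[]): boolean {
  const q = question.toLowerCase();
  return aliases.some((alias) => {
    const escaped = escapeRegExp(alias.toLowerCase());
    // Whole-token match: the alias must not be glued to other letters or digits.
    return new RegExp(`(?<![a-z0-9])${escaped}(?![a-z0-9])`).test(q);
  });
}

function keepCategoryLevel(questions: string[], aliases: string[]): string[] {
  return questions.filter((q) => !mentionsBrand(q, aliases));
}

const kept = keepCategoryLevel(
  [
    "Is Linear good for engineering teams?",
    "What's the best issue tracker for a small engineering team?",
  ],
  ["Linear", "linear.app"],
);
console.log(kept);
```

A guard like this errs on the side of over-filtering: a question containing the ordinary word "linear" would also be dropped, which is acceptable when the goal is to keep name recall out of the benchmark.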
Why 10 questions and not 3 or 50. At 3 questions per brand, one outlier swings the score by 33 points, which is too noisy for a benchmark. At 50, the cost triples and the marginal signal flattens (we tested in our own pipeline). 10 questions × 2 engines = 20 data points per brand, enough that two runs on the same brand land within ±4 points of each other.
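The arithmetic behind these numbers can be checked in a few lines. This sketch assumes each question contributes equally to a 0-100 score, which is a simplification of the real formula; the names `maxSwingPerQuestion` and `scanCounts` are illustrative only.

```typescript
// Simplifying assumption: n equally weighted questions on a 0-100 scale,
// so a single outlier answer can move the score by at most 100 / n points.
function maxSwingPerQuestion(questionsPerBrand: number): number {
  return 100 / questionsPerBrand;
}

// Scan counts follow directly from the benchmark dimensions.
function scanCounts(brands: number, questionsPerBrand: number, engines: number) {
  const perBrand = questionsPerBrand * engines;
  return { perBrand, total: brands * perBrand };
}

const swingAt3 = maxSwingPerQuestion(3);
const swingAt10 = maxSwingPerQuestion(10);
const counts = scanCounts(22, 10, 2);

console.log(swingAt3.toFixed(1), swingAt10.toFixed(1), counts.perBrand, counts.total);
```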
Sample — three of the ten questions we ran on Supabase
- What's the best open-source Firebase alternative for a Next.js app using Postgres with row-level security?
- Looking for an open-source BaaS that includes Auth, Storage, and SQL/table management in a web studio—best choice?
- For a React startup MVP, what’s the most cost-effective approach to a Firebase-equivalent Postgres backend with RLS?
None of these mention "Supabase" — they're category-level buyer questions, the kind a real prospect types into ChatGPT. The full list of 10 lives in the per-brand drilldown above.
Engines + models. Each question was run against ChatGPT (OpenAI) and Perplexity Sonar. Total: 440 scans across all brands (22 brands × 10 questions × 2 engines). Google AI Mode is intentionally excluded from this snapshot because it is a Pro-tier engine in the product (DataForSEO cost per query). When we re-run with AI Mode included, the snapshot id will rev (q2-2026 → q2-2026-full) so historical comparability stays clean.
Scoring. GEO Score follows the same formula GEO Tracker AI uses in the dashboard for paid users: mention rate × citation quality, weighted by engine market share (ChatGPT 0.45, Google AI Mode 0.30, Perplexity 0.25). Engines with no result are excluded from the denominator, so in this snapshot, which has no Google AI Mode results, ChatGPT and Perplexity are weighted 0.45 and 0.25 over a denominator of 0.70. Quality bands snap to {0, 40, 70, 90}. The math is documented at /dashboard/help/methodology.
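The engine weighting, including the rule that engines with no result drop out of the denominator, might look like the sketch below. It covers only the weighting step: the per-engine scores (mention rate × citation quality) are assumed to be computed already on a 0-100 scale, and the names `ENGINE_WEIGHTS` and `weightedGeoScore` are hypothetical rather than taken from the product code.

```typescript
type Engine = "chatgpt" | "google_ai_mode" | "perplexity";

// Market-share weights as stated in the methodology.
const ENGINE_WEIGHTS: Record<Engine, number> = {
  chatgpt: 0.45,
  google_ai_mode: 0.3,
  perplexity: 0.25,
};

// perEngine holds an already-computed 0-100 score per engine,
// or null / missing when the engine returned no result.
function weightedGeoScore(
  perEngine: Partial<Record<Engine, number | null>>,
): number | null {
  let weightedSum = 0;
  let denominator = 0;
  for (const engine of Object.keys(ENGINE_WEIGHTS) as Engine[]) {
    const score = perEngine[engine];
    if (score === undefined || score === null) continue; // excluded from denominator
    weightedSum += ENGINE_WEIGHTS[engine] * score;
    denominator += ENGINE_WEIGHTS[engine];
  }
  return denominator === 0 ? null : weightedSum / denominator;
}

// Two-engine case, as in a snapshot without Google AI Mode:
// (0.45 * 60 + 0.25 * 40) / 0.70
const twoEngine = weightedGeoScore({ chatgpt: 60, perplexity: 40 });
const threeEngine = weightedGeoScore({ chatgpt: 60, google_ai_mode: 50, perplexity: 40 });
console.log(twoEngine, threeEngine);
```

Renormalizing by the sum of the present weights keeps a two-engine score on the same 0-100 scale as a three-engine score, though the two are not strictly comparable because the mix of engines differs.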
How we detect a mention. Two layers. Layer 1 — deterministic regex against hostname + canonical brand token (case-insensitive, subdomain-aware). Layer 2 — for long-tail responses where Layer 1 confidence is ambiguous, an OpenAI helper classifies whether the brand is recommended vs. mentioned-in-passing vs. top-recommended. Layer 2 only fires when Layer 1 flags needs_llm_refine; the rest run deterministic-only. Full implementation at lib/llm/mention-parser.ts + lib/llm/mention-classifier.ts in the public methodology page.
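A rough sketch of what Layer 1 could look like follows. It is not the code in lib/llm/mention-parser.ts: the function name `detectMentionLayer1`, the result shape, and especially the rule for setting `needsLlmRefine` (brand token found without a hostname hit) are assumptions made for illustration.

```typescript
interface Layer1Result {
  hostnameHit: boolean;
  tokenHit: boolean;
  needsLlmRefine: boolean; // assumed rule: token without hostname is ambiguous
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function detectMentionLayer1(
  response: string,
  brandToken: string,
  hostname: string,
): Layer1Result {
  const host = escapeRegExp(hostname.toLowerCase());
  // Subdomain-aware: allow any number of "label." prefixes before the hostname,
  // but reject look-alikes such as "notbrand.com" or "brand.com.evil.io".
  const hostnameRe = new RegExp(
    `(?<![a-z0-9.-])(?:[a-z0-9-]+\\.)*${host}(?![a-z0-9-]|\\.[a-z0-9])`,
    "i",
  );
  // Case-insensitive whole-token match on the canonical brand token.
  const tokenRe = new RegExp(
    `(?<![a-z0-9])${escapeRegExp(brandToken)}(?![a-z0-9])`,
    "i",
  );

  const hostnameHit = hostnameRe.test(response);
  const tokenHit = tokenRe.test(response);
  return { hostnameHit, tokenHit, needsLlmRefine: tokenHit && !hostnameHit };
}

const cited = detectMentionLayer1("See https://docs.supabase.com/guides", "Supabase", "supabase.com");
const ambiguous = detectMentionLayer1("Fit a linear regression first.", "Linear", "linear.app");
const lookalike = detectMentionLayer1("notsupabase.com is unrelated", "Supabase", "supabase.com");
console.log(cited, ambiguous, lookalike);
```

The second call shows why a deterministic layer alone is not enough: a brand token that is also a common word matches the regex, so under this sketch's rule it would be handed to the Layer 2 classifier.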
What this benchmark does NOT measure. Paid placements (AI engines don't sell ad slots in cited-answer responses today, but if they ever do, this benchmark won't separate paid from organic). Sub-brand pages and microsites (we count blog.{brand}.com as self, but a microsite at {brand}-go.io would count as a separate domain). Exclusive partnerships and OEM relationships (an AI citing a "Vercel x Anthropic" collaboration counts as a mention of both). Voice / video / image AI outputs.
Reproducibility. Anyone running our free 60-second snapshot at /grader on these domains will get numbers in the same range — minor variance is expected because LLM responses are non-deterministic and citation graphs shift over time. The benchmark snapshot date is May 18, 2026. The pipeline that generated this report lives at scripts/benchmark/run-benchmark.ts — invokable by anyone with the same OpenAI + Perplexity creds.
Limitations. AI engines update their training data on rolling cycles; today's citation patterns may differ from next quarter's. We'll re-run this benchmark each quarter and publish all historical snapshots so the data is comparable over time.