Can AI See Your Business? Inside the New Measurement Problem

Your Google ranking for "dentist near me" is knowable. Type the query, note the position, repeat from an incognito browser in your city, average the results. The number is imperfect — personalisation and location shift it a few places — but it exists, it is stable enough to track, and it correlates with something measurable: clicks, calls, appointments. The entire discipline of local SEO is built on the legibility of that number.

AI visibility is not legible in the same way. Ask ChatGPT which dentist to visit in your city. Ask again. Ask from a different session, or rephrase the question slightly. The answer changes. There is no position one, no rank to defend, no single number to report in a monthly dashboard. The engine is nondeterministic — the same prompt produces different outputs across runs — and each major AI platform behaves differently from the others. A business that appears in Gemini's answer may be invisible to Perplexity. A business cited by ChatGPT today may not be cited next week, because the retrieval pipeline changed or a competitor's directory listing was updated.

This is the new measurement problem. AI is now a material local discovery channel — the BrightLocal Local Consumer Review Survey 2026 (n=1,002) found that consumer use of AI for local business recommendations rose from 6% in 2025 to 45% in 2026, making AI the third most-used discovery channel behind only Google and Facebook.¹ The channel is large enough to demand attention. And yet the tools being sold to measure performance on it are, in the main, producing numbers that sound like measurements without behaving like them.

This piece does four things. It explains why the measurement problem is structurally different from SEO. It separates what the published evidence actually proves from what is speculation. It describes what rigorous measurement of AI visibility looks like. And it explains why the "AEO score" products proliferating in the market are, with rare exceptions, selling precision they cannot deliver.

I. The invisible channel: why AI visibility cannot be measured the way SEO rank can

The SEO rank-tracking industry rests on a stable foundation: search engines return consistent, ordered results for a given query from a given location. Rank-tracking tools exploit this consistency by running automated queries, recording positions, and reporting trends. The output is noisy — it varies by device, IP geolocation, search history, and personalisation — but the underlying system is deterministic enough that averaging across multiple reads produces a reliable signal. A business at position 3 for "physiotherapy Sydney" is genuinely more visible than a business at position 8. The number means something.

AI answer engines break this foundation at three points.

Nondeterminism

Large language models, and the retrieval-augmented generation (RAG) pipelines that feed them local business data, do not return the same answer to the same prompt every time. Temperature settings, stochastic sampling, and variability in which retrieved documents happen to be selected on a given run all introduce output variance. A single run of "which chiropractor do you recommend in Austin?" is a single sample from a distribution, not a measurement of a stable state. Reporting a yes/no from that single run as an "AI visibility score" is reporting a coin flip as if it were a census.

Per-engine divergence

The three dominant AI recommendation surfaces — ChatGPT, Gemini, and Perplexity — source their local citations from different places and weight different signals. A multi-million citation analysis by Yext, cross-referenced with Qwairy and Profound studies, found only 11–25% citation overlap between engines — meaning that for any given local query, the three platforms are recommending mostly different businesses.² ChatGPT draws approximately 49% of local citations from third-party directories such as Yelp; Gemini draws approximately 52% from brand-owned sites.² A blended "AI visibility" score that aggregates across engines without per-engine breakdown is averaging incompatible surfaces into a number that conceals the actual story.

The personalisation blind spot

Some AI platforms personalise responses based on conversation history, user location signals, and inferred preferences. This introduces a measurement confound that rank trackers have partially solved for traditional search (by running queries from known IP locations in incognito mode) but that is harder to neutralise for AI systems where the personalisation layer is less documented and less consistent.

These three structural features mean that the question "does AI see my business?" cannot be answered by running a query once. The honest answer is a probabilistic one: "our business appears in X% of relevant AI queries, measured across Y runs, on engine Z." The stakes of getting this right are considerable. The SOCi 2026 Local Visibility Index — covering 350,000+ locations across 2,751 brands — found that ChatGPT recommends only 1.2% of local business locations, compared with 11% for Gemini, 7.4% for Perplexity, and 35.9% for Google's Local 3-Pack.³ Being invisible to Google Maps is one magnitude of problem. Being invisible to ChatGPT is a different but compound problem: the channel is harder to penetrate, the tools to measure it are less reliable, and the feedback loops for improvement are less legible.

Exhibit 1

Local Discovery: Citation Rate by Channel

Share of local business locations recommended by each discovery surface. The gap between traditional local search and AI channels is not a rounding error — it is a structural feature of how AI systems select and cite local businesses. The divergence across AI engines means each platform must be measured separately.

Source: SOCi 2026 Local Visibility Index (350,000+ locations, 2,751 brands). Figures represent share of business locations recommended by each surface. Per-engine figures are not directly comparable due to different query sets and methodologies — the pattern matters more than exact cross-engine comparison.

The traffic economics compound the stakes. Research by Ahrefs (February 2026, 300,000 keywords) found that AI Overviews cut the click-through rate to the top organic result by 58% — but businesses actually cited within an AI answer see organic clicks rise 35% and paid clicks rise 91%.⁴ Inclusion in an AI answer is not merely neutral visibility. It functions as an endorsement, and the downstream commercial effect is larger than being at position one in a traditional result set. The asymmetry between cited and uncited is therefore larger than the visibility gap suggests.

"There is no rank position to check. AI visibility is a probability, not a number — and most of the market is selling the number."

II. The honest state of evidence: what is proven versus what is myth

The AEO (answer engine optimisation) industry has grown faster than its evidence base. Vendors have strong commercial incentives to identify new optimisation levers, and practitioners have a natural tendency to interpret correlation as causation when a tactic coincides with an observed improvement. The result is a market where controlled evidence is scarce and confident assertion is abundant. A careful reader deserves a clear account of what the data actually shows.

What the evidence proves

Organic rank is the strongest measurable predictor of AI citation. An Authoritas study measuring AI Overview citation probability by organic rank position found that a business at position #1 has a 53% chance of being cited in a relevant AI Overview, compared with 37% at position #10.⁵ The relationship is not surprising, since AI systems retrieve from the same web the search engines index, but it is the most rigorously supported finding in the field. The finding is correlational, so the safe reading is that work which improves organic rank is likely to improve AI citation probability as well. Traditional local SEO and AEO are not competing disciplines.

Review rating floors are measurable. The SOCi 2026 Local Visibility Index identified effective thresholds below which the major AI platforms rarely recommend a business: ChatGPT tends to recommend businesses rated 4.3 stars or above; Perplexity 4.1 stars; Gemini 3.9 stars.³ These are empirical observations from a very large dataset, not published specifications from the platforms — but the pattern across 350,000+ locations is too consistent to dismiss.

Directory presence drives ChatGPT citations specifically. The Yext citation analysis (6.8 million citations) found that ChatGPT draws approximately 49% of local citations from third-party directories such as Yelp, whereas Gemini draws approximately 52% from brand-owned sites.² A business absent from Yelp and similar directories is structurally disadvantaged for ChatGPT visibility in a way it may not be for Gemini. Per-engine sourcing divergence is real, not theoretical.

Citation-rich content earns more AI visibility. A Princeton GEO (Generative Engine Optimisation) study found that content containing citations, statistics, and quotes earns approximately 40% more AI visibility than content making equivalent claims without evidence.⁶ A plausible mechanism is that AI systems trained on academic and journalistic text learn to associate cited, specific claims with reliable sources, and weight them accordingly. Content written to inform rather than to impress performs better in AI retrieval.

Brand-managed sources dominate AI citations. Yext's analysis found that 86% of AI citations come from brand-managed or brand-influenced sources — the business's own website, its Google Business Profile, its claimed directory listings.² This finding is reassuring in one sense (the levers are within a business's control) and clarifying in another (passive web presence, including unclaimed profiles with outdated data, is a liability not an asset).

What the evidence does not support

Schema markup as an AEO lever is not supported by controlled evidence. A difference-in-differences study by Ahrefs, covering 1,885 pages, measured the independent effect of adding LocalBusiness schema markup on AI citation rates. The result: a range of −4.6% to +2.2% — a range that overlaps zero and is not statistically meaningful.⁷ Schema remains sound practice for information accuracy and rich snippet eligibility. It is not a citation driver in any controlled sense.

llms.txt has no measured effect and is not read. llms.txt — a proposed convention allowing websites to signal preferences to AI crawlers — is functionally irrelevant to AI citation at present. An Ahrefs crawl study covering 137,210 domains found that 97% of llms.txt files are never read by AI bots.⁷ Google explicitly debunked the convention at I/O 2026, stating that it does not use llms.txt files and does not plan to. The opportunity cost of optimising for llms.txt is low — the file takes minutes to create — but practitioners selling it as a visibility lever are trading on its theoretical logic rather than any evidence of its practical effect.

Most "AEO best practices" circulating in the market are opinion, not evidence. Recommendations to write "conversational content," to structure pages as FAQ pairs, to target "featured snippet formats," and to add entity markup beyond LocalBusiness schema are based on plausible reasoning about how LLMs process text, not controlled studies of what moves AI citation rates. Some may be correct. None has been validated in a controlled experiment comparable to the Ahrefs schema study. They should be implemented, if at all, after the proven levers — rank, reviews, directory presence, answer-ready content — have been exhausted.

"Schema markup moved AI citation rates by −4.6% to +2.2% in the only controlled study to date. llms.txt is unread by 97% of AI bots. The tactical energy devoted to both belongs on organic rank and review velocity."

III. How to measure AI visibility rigorously

The only reliable method for measuring AI visibility is the same method a scientist would use to measure any nondeterministic system: ask the questions your customers actually ask, repeat each question enough times to establish a confidence interval, report per-engine results separately, and validate that what the engine says about your business is accurate.

Build the prompt basket from real customer intent

Start with the queries your customers actually use to find businesses like yours. These are not the keyword-optimised phrases that appear in your SEO strategy; they are the natural-language questions a consumer would speak or type into an AI interface. For a physiotherapy clinic in Melbourne, the relevant basket might include: "which physio should I see for a knee injury in Melbourne," "best sports physiotherapist in the CBD," "how do I choose a physiotherapist," and "how much does physio cost in Melbourne."

The composition of the basket matters for a specific reason. Whitespark's May 2025 study of 540 queries across three cities and six verticals found that 68% of local searches surface a Google AI Overview overall, but the rate varies sharply by intent: 92–97% for informational and cost-comparison queries against only 15% for transactional "near me" queries.⁸ Measuring AI visibility only on "X near me" queries dramatically underestimates the relevant surface. Most AI visibility happens on the research queries that precede a purchase decision, not the transactional queries that close one. A prompt basket weighted toward "near me" queries will show low AI presence not because the business is invisible, but because the basket asks the wrong type of question.
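One way to keep the basket honest is to tag each prompt with an intent and check the mix before running anything. The sketch below is illustrative only: the `Prompt` record, the intent labels, and the 50% minimum are assumptions made for the example, not part of any cited study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    text: str
    intent: str  # hypothetical labels: informational, comparison, cost, transactional


# The example basket from the text; the intent labels are assumptions.
BASKET = [
    Prompt("which physio should I see for a knee injury in Melbourne", "informational"),
    Prompt("best sports physiotherapist in the CBD", "comparison"),
    Prompt("how do I choose a physiotherapist", "informational"),
    Prompt("how much does physio cost in Melbourne", "cost"),
]

RESEARCH_INTENTS = ("informational", "comparison", "cost")


def intent_share(basket):
    """Return the fraction of prompts carrying each intent label."""
    counts = Counter(p.intent for p in basket)
    return {intent: n / len(basket) for intent, n in counts.items()}


def research_weighted(basket, minimum=0.5):
    """True when research-type prompts make up at least `minimum` of the basket."""
    share = intent_share(basket)
    return sum(share.get(intent, 0.0) for intent in RESEARCH_INTENTS) >= minimum


print(intent_share(BASKET))
print(research_weighted(BASKET))
```

A basket that fails this check is probably measuring the transactional surface, where AI answers appear least often.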

Run each prompt multiple times and report a confidence interval

A single run of a prompt is a single data point from a probabilistic system. To measure AI visibility meaningfully, each prompt in the basket should be run multiple times (a minimum of ten, ideally twenty or more) and the citation rate reported with a statistical confidence interval. The Wilson score interval for a binomial proportion is an appropriate estimator: if a business appears in 8 of 20 runs of a given prompt, the point estimate is 40%, but the 95% confidence interval spans roughly 22% to 61%, a range wide enough to make the point estimate operationally meaningless without the interval attached.
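The Wilson interval is simple enough to compute without a statistics package. The following is a minimal sketch using the standard formula with z = 1.96 for roughly 95% coverage; the function name is illustrative. Exact (Clopper-Pearson) intervals are more conservative and come out somewhat wider for the same data.

```python
import math


def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    z2 = z * z
    denom = 1 + z2 / runs
    centre = (p + z2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z2 / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same 40% observed rate, different numbers of runs: the interval narrows slowly.
for runs in (10, 20, 50, 100):
    low, high = wilson_interval(round(0.4 * runs), runs)
    print(f"{runs:>3} runs: {low:.0%} to {high:.0%}")
```

Printing the interval at several run counts makes the cost of precision visible: halving the width takes roughly four times as many runs.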

The practical implication: an AI visibility report that shows a single yes/no or a single percentage without a confidence interval is reporting noise as signal. The confidence interval is not a statistical nicety — it is the minimum information required to distinguish a real observation from a sampling artefact.

Report per-engine, never blended

With only 11–25% citation overlap across engines, a blended AI visibility score is averaging incompatible systems.² A business that appears in 60% of Gemini answers and 5% of ChatGPT answers has a blended score of roughly 32% — a number that is uninformative about what is actually happening and what to fix. The 60% Gemini figure reflects strong brand-site presence and good organic rank. The 5% ChatGPT figure may reflect weak directory presence — Yelp, in particular. The blended number obscures both findings.

Per-engine reporting also enables per-engine diagnosis. If ChatGPT citation is low, the likely gap is directories. If Perplexity citation is low, the gap may be community content (Reddit threads, forums, social proof) or direct web crawl content. If Gemini citation is low despite strong Google presence, the gap may be a brand-site issue — thin content, blocked crawling, or a mismatch between what the site says and what the Google Business Profile says.
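The per-engine reporting described above amounts to grouping the run log by engine before computing any rate. The sketch below uses a hypothetical run log that reproduces the 60% and 5% figures from the example; the log format is an assumption.

```python
from collections import defaultdict

# Hypothetical run log: (engine, business_was_cited) pairs for one prompt basket.
run_log = (
    [("gemini", True)] * 12 + [("gemini", False)] * 8
    + [("chatgpt", True)] * 1 + [("chatgpt", False)] * 19
)


def per_engine_rates(log):
    """Citation rate per engine, kept separate rather than pooled."""
    cited = defaultdict(int)
    total = defaultdict(int)
    for engine, was_cited in log:
        total[engine] += 1
        cited[engine] += int(was_cited)
    return {engine: cited[engine] / total[engine] for engine in total}


rates = per_engine_rates(run_log)
blended = sum(rates.values()) / len(rates)  # simple average across engines

for engine, rate in sorted(rates.items()):
    print(f"{engine}: {rate:.0%}")
print(f"blended: {blended:.1%}")  # hides the 55-point gap between the two engines
```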

Validate the facts, not just the citation

A significant and underappreciated problem with AI visibility measurement is the accuracy of what is cited. The SOCi and Yext 2026 studies found that ChatGPT and Perplexity get approximately 32% of local business facts wrong — wrong phone numbers, incorrect hours, misattributed services, outdated addresses — while Gemini achieves near-100% accuracy because it is grounded directly in Google Maps data.³ A measurement that records "business appeared in AI answer" without checking whether the cited facts are correct is potentially flagging hallucinated appearances as positive visibility events. Being cited with wrong information is not visibility. It is noise that may actively mislead customers.

Rigorous measurement therefore includes fact-checking the citation: does the address match? Are the hours correct? Are the rating and review count accurate? Is the business categorised correctly? These are verifiable against ground-truth data and should be part of any measurement protocol.
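A fact check of this kind can be as plain as comparing each cited field against a ground-truth record after light normalisation. The sketch below assumes the cited facts have already been extracted from the answer into a dictionary; the field names, normalisers, and sample records are all hypothetical.

```python
import re


def normalise_phone(value):
    """Keep digits only so formatting differences do not count as errors."""
    return re.sub(r"\D", "", value)


def normalise_text(value):
    """Lowercase and collapse whitespace for a forgiving string comparison."""
    return " ".join(value.lower().split())


# Hypothetical field names; adapt to whatever ground-truth record you hold.
NORMALISERS = {"phone": normalise_phone}


def check_facts(cited, ground_truth):
    """Return {field: True/False} for every ground-truth field the engine cited."""
    results = {}
    for field, truth in ground_truth.items():
        if field not in cited:
            continue  # not mentioned by the engine, so nothing to verify
        clean = NORMALISERS.get(field, normalise_text)
        results[field] = clean(cited[field]) == clean(truth)
    return results


truth = {"phone": "(03) 5550 0199", "address": "12 Example St, Melbourne", "hours": "Mon-Fri 8am-6pm"}
cited = {"phone": "03 5550 0199", "address": "12 Example St,  melbourne", "hours": "Mon-Sat 8am-6pm"}
print(check_facts(cited, truth))
```

Here the phone and address match once formatting is ignored, while the hours are flagged as wrong. A citation with any failed field should not be counted as a clean visibility event.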

Establish a baseline and measure the trend

A single measurement, however carefully designed, is a snapshot of a nondeterministic system. Its value lies in comparison over time. Run the full prompt basket and per-engine protocol monthly, record the citation rates and confidence intervals, and watch the trend. A sustained upward trend in Gemini citation rate following a website content refresh is evidence of causality — not proof, but meaningful signal. A sustained downward trend in ChatGPT citation following a period without directory updates is a prompt to investigate directory accuracy. The trend is the measurement; the snapshot is the data point.
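To judge whether a month-on-month change is more than sampling noise, one option is a pooled two-proportion z-test on the citation counts. This test is not named in the text; it is offered here as one reasonable choice, and it assumes the runs are independent, which repeated prompts on the same engine may only approximate. The counts in the example are invented.

```python
import math
from statistics import NormalDist


def two_proportion_z(cited_before, runs_before, cited_after, runs_after):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_before = cited_before / runs_before
    p_after = cited_after / runs_after
    pooled = (cited_before + cited_after) / (runs_before + runs_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_before + 1 / runs_after))
    if se == 0:
        return 0.0, 1.0  # identical all-or-nothing results: no detectable change
    z = (p_after - p_before) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical months: cited in 30 of 150 runs before, 54 of 150 after.
z, p = two_proportion_z(30, 150, 54, 150)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the change is unlikely to be a sampling artefact. It does not say what caused the change, which is why the sustained trend across several months matters more than any single comparison.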

IV. Why most "AEO scores" on the market are guesswork dressed as precision

The category of AI visibility tools has grown rapidly since 2025. The market now contains dozens of products that claim to measure a business's AI visibility and return a proprietary score. The majority share a common methodology: they run a small number of prompts — often one or two per query type — once, record a binary yes/no, and aggregate into a score. This methodology has several compounding problems that make the output unreliable for any decision that depends on it.

A single run is not a measurement

The foundational problem: given the nondeterminism of AI systems, a single run of a prompt is a sample of one from a distribution whose variance may be large enough to reverse the apparent finding on the next run. A tool that asks ChatGPT "who are the best dentists in Phoenix?" once and records whether your practice appears is not measuring your AI visibility. It is running a coin flip and reporting the result with a decimal point. The decimal point is false precision.
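The arithmetic behind the coin-flip comparison can be made explicit. The sketch below assumes two businesses with known true citation rates and treats their appearances in a single run as independent, which is a simplification; the rates are hypothetical.

```python
def single_run_outcomes(your_rate, rival_rate):
    """Probabilities of what one run would show, assuming independent appearances."""
    you_only = your_rate * (1 - rival_rate)
    rival_only = (1 - your_rate) * rival_rate
    tie = 1 - you_only - rival_only  # both cited or neither cited
    return {"you_only": you_only, "rival_only": rival_only, "tie": tie}


# Hypothetical true citation rates: you 40%, rival 20%.
outcomes = single_run_outcomes(0.40, 0.20)
for label, prob in outcomes.items():
    print(f"{label}: {prob:.0%}")
```

Even with twice the rival's true citation rate, a single run shows you alone only about a third of the time, and shows the rival alone about one time in eight.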

No per-engine breakdown means no actionable finding

Most tools return a single "AI visibility" or "AEO" score. Given the 11–25% citation overlap across engines, this is not a score — it is an average of three different problems. A business with strong Gemini visibility and weak ChatGPT visibility needs to fix its directory presence. A business with the reverse needs to address its website content and organic rank. A single blended score tells neither business what to do.

No accuracy validation means hallucinations count as wins

With a 32% fact-error rate on ChatGPT and Perplexity, tools that record a citation as a visibility event without checking whether the cited facts are correct are contaminating the measurement with hallucinations.³ A business that "appears" in ChatGPT with a wrong phone number, a closed location, or a competitor's address has not gained visibility. It has gained a source of customer confusion.

Proprietary scores obscure the underlying evidence

A proprietary AEO score — a single number from 0 to 100, or a letter grade, or a colour — contains less information than the raw data it was computed from and introduces vendor-specific weighting decisions that the buyer cannot audit. The score is not wrong because it is proprietary; it is problematic because it substitutes a vendor's judgment for the customer's ability to see what the AI actually said. The credible alternative is transparency: show the customer the verbatim AI response. "We asked ChatGPT these ten questions about dentists in Phoenix. You appeared in two of them. Your nearest competitor appeared in seven. Here are the ten verbatim answers." That is an undeniable measurement. A proprietary score derived from the same data is a filtered version that requires the customer to trust the filter.
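A transparent report of this kind needs very little machinery: keep the verbatim answers and count which names appear in them. The sketch below uses naive case-insensitive substring matching, which is an assumption that will miss abbreviations and misspellings; the business names and answers are placeholders.

```python
def appearance_report(answers, businesses):
    """Count how many verbatim answers mention each business name.

    Matching is a naive case-insensitive substring test, which is an
    assumption: real answers may abbreviate or misspell names.
    """
    counts = {name: 0 for name in businesses}
    for answer in answers:
        lowered = answer.lower()
        for name in businesses:
            if name.lower() in lowered:
                counts[name] += 1
    return counts


# Placeholder answers; in practice, store the engine's full verbatim text.
answers = [
    "Popular options include Your Practice and Rival Practice.",
    "Many patients recommend Rival Practice for routine care.",
    "Consider Rival Practice or a clinic close to your home.",
]
report = appearance_report(answers, ["Your Practice", "Rival Practice"])
for name, count in report.items():
    print(f"{name}: appeared in {count} of {len(answers)} answers")
```

The counts are only a summary. The stored answers remain the evidence, and the customer can read them directly.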

The underlying issue is incentive structure. The vendor that discovers its clients have poor AI visibility has a strong incentive to produce a tool that scores AI visibility, charges for monthly tracking, and sells optimisation services. The tool that produces the most alarming and legible score fastest is the most commercially successful in the short term — regardless of whether the score is meaningful. The market is not selecting for measurement quality; it is selecting for legibility and alarm. The buyer who understands what rigorous measurement actually requires will be harder to mislead.

V. What to do: the actionable implication

The evidence converges on a short list of actions that are both proven to work and within a business's direct control. The sequence matters: fix the measurement before optimising for a score you cannot trust, then execute the proven levers before investing in tactics whose evidence base is thin.

Build the measurement first

Before commissioning an AEO tool or engaging an agency for AI visibility services, run the prompt basket described in Section III yourself. Choose ten to fifteen questions that represent how your customers actually find businesses like yours — weight the basket toward informational and comparison queries, not "near me" transactional queries. Run each prompt ten times on each major engine (ChatGPT, Gemini, Perplexity). Record the citation rate and whether the cited facts are accurate. This takes a few hours and produces more reliable data than any single-run proprietary score. It also establishes your baseline, against which future improvements can be measured honestly.
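Before starting, it helps to lay out the full run plan so the workload is known in advance. The sketch below enumerates every engine, prompt, and run combination; the placeholder prompts and the choice of twelve prompts are assumptions for the example.

```python
from itertools import product

ENGINES = ["ChatGPT", "Gemini", "Perplexity"]
RUNS_PER_PROMPT = 10


def build_run_plan(prompts, engines=ENGINES, runs=RUNS_PER_PROMPT):
    """Every (engine, prompt, run_number) combination, each meant for a fresh session."""
    return list(product(engines, prompts, range(1, runs + 1)))


# Placeholder prompts; substitute the ten to fifteen questions in your basket.
prompts = [f"prompt {i}" for i in range(1, 13)]
plan = build_run_plan(prompts)
print(len(plan))  # 3 engines x 12 prompts x 10 runs
```

Twelve prompts give 360 individual runs, so it is worth deciding up front how results will be logged.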

Fix the proven levers in order

The evidence hierarchy is clear. Execute in this sequence:

1. Organic rank. The Authoritas finding — 53% citation probability at rank #1 versus 37% at rank #10 — means that the returns to traditional local SEO (Google Business Profile completeness, local citation building, on-page relevance, review management) flow directly into AI citation probability.⁵ There is no meaningful trade-off between ranking well in Google and being cited by AI. They share the same upstream causes.

2. Review rating above the engine floors, sustained by review velocity. Reaching and holding a rating at or above the observed floors (4.3 stars for ChatGPT, 4.1 for Perplexity, 3.9 for Gemini) is a precondition for citation, not a differentiator.³ Review recency matters as much as rating: 74% of consumers only trust reviews from the last three months, and 32% only those from the last two weeks.¹ A business with a 4.6 average and its most recent review from eight months ago is operationally behind a business with a 4.3 average and twenty reviews from this quarter.

3. Multi-platform directory presence. ChatGPT draws 49% of citations from directories; Yelp is a primary source.² A claimed, accurate, complete Yelp profile is not optional for ChatGPT visibility. The same applies to the five to ten category-specific directories most relevant to your business vertical. Unclaimed profiles with outdated data are a negative signal, not neutral.

4. Answer-ready content. The Princeton GEO finding (content with citations, statistics, and quotes earns approximately 40% more AI visibility than content making equivalent claims without evidence) points to a concrete writing practice.⁶ Replace adjective-led marketing copy with specific, verifiable claims. Not "award-winning orthodontics in a modern practice" but "Invisalign Diamond Provider, serving the West Loop since 2014, with 340 verified Google reviews." The specificity is the extractable signal.
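Of the four levers above, the review floors are the easiest to check mechanically. The sketch below encodes the floors reported in the text as a lookup table; they are empirical observations rather than published platform rules, and the function names and sample dates are illustrative.

```python
from datetime import date

# Observed floors reported in the text (SOCi 2026); not published platform rules.
RATING_FLOORS = {"ChatGPT": 4.3, "Perplexity": 4.1, "Gemini": 3.9}


def engines_cleared(rating):
    """Engines whose observed rating floor the business meets or exceeds."""
    return [engine for engine, floor in RATING_FLOORS.items() if rating >= floor]


def days_since_last_review(last_review, today):
    """Age of the most recent review in days; `today` is passed in for testability."""
    return (today - last_review).days


print(engines_cleared(4.2))
print(days_since_last_review(date(2026, 1, 10), date(2026, 3, 1)))
```

A 4.2 rating clears the Perplexity and Gemini floors but not the ChatGPT one, which points to where additional recent reviews would matter most.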

Do not invest in unproven tactics while proven ones are incomplete

The only controlled study of schema markup found an effect on AI citation rates indistinguishable from zero, and 97% of llms.txt files are never read by AI bots.⁷ If a vendor is leading with these as AI visibility levers, they are selling the market's enthusiasm rather than the evidence. Implement LocalBusiness schema because it reduces the risk of AI systems misreading your information, not because it will move your citation rate. Skip llms.txt until a platform publishes evidence that it reads and weights the file, which none has done.

Track engine-level divergence as a diagnostic

If monthly re-measurement shows you appearing in Gemini but not in ChatGPT, the likely gap is directory presence — the platforms source from different wells. If you appear in ChatGPT but not in Perplexity, the gap may be community content and forum presence, where Perplexity's crawl heavily indexes Reddit and specialist communities. Per-engine divergence is the most informative diagnostic available, and it is only visible if you are measuring per-engine rather than as a blended score.
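The diagnostic reasoning above can be written down as a lookup from engine to likely gap. The sketch below paraphrases the hypotheses in the text; the 10% threshold for "low" is an illustrative assumption, and each result is a prompt to investigate rather than a finding.

```python
# Likely gaps per engine, paraphrased from the text; treat as hypotheses to check.
LIKELY_GAP = {
    "ChatGPT": "third-party directory presence (for example Yelp)",
    "Perplexity": "community content and forum presence",
    "Gemini": "brand-site content, crawlability, or mismatch with the Google Business Profile",
}


def diagnose(rates, low_threshold=0.10):
    """Map each engine whose citation rate is below the threshold to its likely gap.

    The 10% default is an illustrative assumption, not a figure from any study.
    """
    return {
        engine: LIKELY_GAP.get(engine, "no documented hypothesis")
        for engine, rate in rates.items()
        if rate < low_threshold
    }


findings = diagnose({"Gemini": 0.60, "ChatGPT": 0.05, "Perplexity": 0.25})
for engine, gap in findings.items():
    print(f"{engine}: investigate {gap}")
```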

Ask the AI directly and read the answer

The most underused diagnostic tool available to any local business is free: ask the engines the questions your customers ask, and read what they say about you and your competitors. Not once — ten times, in fresh sessions, across the prompt basket. Read the verbatim answers. Note which competitors appear consistently. Note what the engine says about those competitors' distinguishing features. Note whether what the engine says about you is accurate. This direct inspection yields more actionable intelligence than any proprietary score, costs nothing, and can be done in an afternoon.

Key takeaways

AI is now the third most-used local discovery channel: 45% of consumers used it in 2026, up from 6% in 2025 (BrightLocal 2026). The channel is too large to ignore and too nondeterministic to measure carelessly.
AI visibility has no stable rank to track. The same prompt returns different answers across runs, and engines share only 11–25% citation overlap (Yext/Qwairy/Profound). A blended score is false precision; per-engine rates with confidence intervals are the minimum credible measurement.
The proven levers are: organic rank (#1 = 53% AI citation probability vs. #10 = 37%, Authoritas); review rating above engine floors (ChatGPT 4.3★, Perplexity 4.1★, Gemini 3.9★, SOCi 2026); multi-platform directory presence (ChatGPT draws 49% of citations from directories, Yext); and content with specific claims and evidence (+40% AI visibility, Princeton GEO).
Schema markup moved AI citation rates by −4.6% to +2.2% in the only controlled study (Ahrefs, 1,885 pages). AI bots never read 97% of llms.txt files (Ahrefs, 137,210 domains), and Google explicitly debunked llms.txt at I/O 2026. Invest the tactical energy elsewhere.
ChatGPT and Perplexity get approximately 32% of local business facts wrong; Gemini achieves near-100% accuracy via Google Maps grounding (SOCi/Yext 2026). A citation containing wrong information is not visibility — it is a source of customer confusion. Measurement must include fact-validation, not just citation detection.
The credible alternative to a proprietary AEO score is the verbatim AI answer. Show the actual engine response: which competitor appeared, what was said, how often, on which engine. That is undeniable. A single-run binary score is not a measurement of a nondeterministic system — it is a data point marketed as one.

Notes and sources

¹ BrightLocal Local Consumer Review Survey 2026. Sample: n=1,002 US consumers. Tracks consumer behaviour in local business discovery and trust across search, social, and AI channels. Figures cited: 45% of consumers used AI for local business recommendations in 2026 (up from 6% in 2025); 74% trust only reviews from the last three months; 32% trust only reviews from the last two weeks. brightlocal.com

² Yext citation sourcing analysis (6.8 million citations examined), cross-referenced with Qwairy and Profound multi-million-citation studies. Findings cited: 86% of AI citations from brand-managed or brand-influenced sources; ChatGPT draws ~49% of local citations from third-party directories; Gemini draws ~52% from brand-owned sites; 11–25% citation overlap between engines. yext.com

³ SOCi 2026 Local Visibility Index. Dataset: 350,000+ business locations, 2,751 brands. Findings cited: ChatGPT recommends 1.2% of local business locations; Gemini 11%; Perplexity 7.4%; Google Local 3-Pack 35.9%; 45% overlap between traditional local search winners and AI recommendation winners; AI recommendation rating floors (ChatGPT 4.3★, Perplexity 4.1★, Gemini 3.9★); ChatGPT/Perplexity business profile accuracy ~68%; Gemini accuracy ~100% (grounded in Google Maps). uberall.com/soci

⁴ Ahrefs, "How AI Overviews Affect Organic and Paid Click-Through Rates," February 2026. Dataset: 300,000 keywords. Findings cited: AI Overviews cut top-result organic CTR by 58%; brands cited in AI answers see +35% organic clicks and +91% paid clicks. ahrefs.com

⁵ Authoritas AI Overview citation study. Methodology: measures probability of AI Overview citation by organic search rank position (#1 through #10). Finding cited: #1 organic position correlates with 53% AI-citation probability; #10 with 37%. authoritas.com

⁶ Princeton GEO (Generative Engine Optimisation) study. Finding cited: content containing citations, statistics, and direct quotes earns approximately 40% more AI visibility than equivalent content without cited evidence. This is the primary published finding establishing content format as an AI visibility driver.

⁷ Ahrefs. Two studies cited: (a) Schema markup controlled study — 1,885 pages, difference-in-differences methodology — found that adding LocalBusiness schema moved AI citation rates by −4.6% to +2.2% (range overlaps zero, not statistically significant). (b) llms.txt adoption study — 137,210 domains — found 97% of llms.txt files are never read by AI bots. Google explicitly debunked llms.txt and content-chunking as AI visibility tactics at Google I/O 2026. ahrefs.com

⁸ Whitespark local AI Overview prevalence study, May 2025. Methodology: 540 queries across 3 cities, 6 verticals. Finding cited: 68% of local searches surface a Google AI Overview overall; 15% for transactional "near me" queries; 92–97% for informational and cost-comparison queries. whitespark.ca