Validation results
How Frontguard performs against real open-source frontends — the live harness, the numbers we publish, and the limits we admit.
Frontguard publishes how it performs against real open-source frontends — not staged fixtures. This page is the customer-facing summary of the latest harness run; the raw measurement file in the repo holds the unedited numbers, methodology, and skip notes.
What this page is, and isn't
This is an honest snapshot of what the v0.2 harness measured. The classification-accuracy numbers shipped on the landing page are gated on AI provider keys being configured in CI; when they aren't, we report only the pixel-only false-positive rate and disclose that explicitly.
Methodology in one breath
For every target repo, the harness:
- Clones at `--depth 1`.
- Installs dependencies (auto-detects pnpm / yarn / npm; falls back to non-frozen install when lockfiles are out of date).
- Boots the dev server and waits up to 120 s for it to serve traffic.
- Runs Frontguard with `--update-baselines` against every configured route (the baseline pass).
- Runs Frontguard again against the same unchanged code (the recheck pass).
- Tears the dev server down and removes the temp dir.
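The install step's auto-detection can be pictured with a minimal sketch. It assumes detection is based on which lockfile sits in the repo root; the function names, the precedence order, and the npm fallback are illustrative assumptions, not the harness's actual code.

```typescript
// Hypothetical helpers: names and precedence are assumptions, not Frontguard's code.
type PackageManager = "pnpm" | "yarn" | "npm";

// Conventional lockfile name for each package manager, in the order checked.
const LOCKFILES: ReadonlyArray<[string, PackageManager]> = [
  ["pnpm-lock.yaml", "pnpm"],
  ["yarn.lock", "yarn"],
  ["package-lock.json", "npm"],
];

// Pick a package manager from the file names found in the repo root.
// Falls back to npm when no known lockfile is present (an assumption).
function detectPackageManager(rootFiles: readonly string[]): PackageManager {
  for (const [lockfile, manager] of LOCKFILES) {
    if (rootFiles.indexOf(lockfile) !== -1) return manager;
  }
  return "npm";
}

// Build the install command. `frozen = false` models the fallback used
// when the lockfile is out of date. The yarn flag shown is Yarn Classic's.
function installCommand(manager: PackageManager, frozen: boolean): string[] {
  switch (manager) {
    case "pnpm":
      return frozen
        ? ["pnpm", "install", "--frozen-lockfile"]
        : ["pnpm", "install", "--no-frozen-lockfile"];
    case "yarn":
      return frozen ? ["yarn", "install", "--frozen-lockfile"] : ["yarn", "install"];
    case "npm":
      return frozen ? ["npm", "ci"] : ["npm", "install"];
  }
}
```

A caller would try the frozen command first and retry with `frozen = false` only if it fails, so a stale lockfile does not abort the run.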
Anything non-pass in the recheck pass is, by definition, a pixel-only false positive on unchanged code. That's the metric we treat as a launch gate.
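As a rough illustration of that definition, the rate is the share of recheck results that are anything other than a pass. The result shape, the verdict names, and the sample values below are made up for the example and are not the harness's JSON schema or measured data.

```typescript
// Hypothetical result shape: field names and verdict values are assumptions.
type Verdict = "pass" | "warn" | "fail";

interface RecheckResult {
  route: string;
  viewport: string;
  verdict: Verdict;
}

// Fraction of recheck results that are anything other than "pass".
// Returns null when nothing was measured, so an empty run is not reported as 0%.
function pixelOnlyFalsePositiveRate(results: readonly RecheckResult[]): number | null {
  if (results.length === 0) return null;
  const nonPass = results.filter((r) => r.verdict !== "pass").length;
  return nonPass / results.length;
}

// Illustrative input only, not measured data.
const sampleRecheck: RecheckResult[] = [
  { route: "/", viewport: "desktop", verdict: "pass" },
  { route: "/", viewport: "mobile", verdict: "warn" },
  { route: "/pricing", viewport: "desktop", verdict: "pass" },
  { route: "/pricing", viewport: "mobile", verdict: "pass" },
];

// One of the four sample results is non-pass.
const sampleRate = pixelOnlyFalsePositiveRate(sampleRecheck);
```

Returning `null` for an empty run is a design choice in this sketch: a repo that never booted should show up as unmeasured rather than as a perfect score.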
The target repos
| Name | Repo | Why it's in the harness |
|---|---|---|
| taxonomy | shadcn-ui/taxonomy | App Router marketing + dashboard |
| tailwind-dashboard | shadcn-ui/next-template | Tailwind-heavy components, npm-based |
| chakra-ui-docs | chakra-ui/chakra-ui-docs | Component library docs, large surface |
| medusa-storefront | medusajs/nextjs-starter-medusa | Commerce flows; needs a Medusa backend |
| nextra-docs | shuding/nextra | MDX docs site monorepo |
The selection is intentionally diverse — marketing pages, dashboards, component galleries, commerce flows, and docs. The point isn't to pick easy targets; it's to find the rendering modes that actually generate noise.
How to reproduce the numbers
# 1. Build and link the CLI so the harness can resolve `frontguard`.
npm run build:cli
(cd packages/cli && npm link)
# 2. Run the harness against all 5 repos (or filter to one).
./validation/run-external.sh
./validation/run-external.sh tailwind-dashboard
# 3. Aggregate into the metrics table.
node validation/aggregate-results.mjs

Each per-repo run produces validation/results/<name>.json containing the
full pixel diff + (when AI is enabled) the classifier verdict for every
route × viewport combination. The aggregator script flattens those into the
pixel-only false-positive rate that the landing page surfaces.
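The flattening step might look something like the sketch below. It assumes each per-repo file nests viewports under routes and carries a skip flag; that shape and every name in it are hypothetical, and `validation/aggregate-results.mjs` may be organized quite differently.

```typescript
// Hypothetical per-repo shape; the real validation/results/<name>.json schema may differ.
interface RepoResult {
  name: string;
  skipped: boolean; // e.g. the repo needs a backend the harness does not provision
  routes: { path: string; viewports: { name: string; passed: boolean }[] }[];
}

interface RepoSummary {
  name: string;
  checks: number; // number of route × viewport combinations
  falsePositives: number; // combinations that did not pass on unchanged code
}

// Flatten one repo's nested route × viewport results into two counters.
function summarize(repo: RepoResult): RepoSummary {
  let checks = 0;
  let falsePositives = 0;
  for (const route of repo.routes) {
    for (const viewport of route.viewports) {
      checks += 1;
      if (!viewport.passed) falsePositives += 1;
    }
  }
  return { name: repo.name, checks, falsePositives };
}

// Pool the measured repos into one rate and list skipped repos separately,
// so a skip is never counted as a pass.
function aggregate(repos: readonly RepoResult[]) {
  const perRepo: RepoSummary[] = [];
  const skipped: string[] = [];
  for (const repo of repos) {
    if (repo.skipped) skipped.push(repo.name);
    else perRepo.push(summarize(repo));
  }
  const checks = perRepo.reduce((sum, r) => sum + r.checks, 0);
  const falsePositives = perRepo.reduce((sum, r) => sum + r.falsePositives, 0);
  const overallRate = checks === 0 ? null : falsePositives / checks;
  return { perRepo, skipped, overallRate };
}
```

Pooling checks across repos weights each route × viewport combination equally; averaging per-repo rates instead would weight each repo equally, and the two can differ when repos have very different surface sizes.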
What we can't measure today
Honest limitations
The launch-gate template asks for AI classification accuracy. We can't measure that without a key, and we don't fake it.
- AI classification accuracy requires `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`. The classifier code and the metrics module (`src/diff/validation-metrics.ts`) are implemented and unit tested; only the live measurement is gated on credentials.
- True-positive rate. Without a known-regression PR set, we can't tell how often Frontguard catches a real bug. The recheck pass measures only negatives.
- Anti-flake consensus. The harness renders each route once per pass. Multi-render consensus is exercised by Frontguard's normal pipeline but not isolated as a separate metric here.
- Repos requiring backend services. `medusa-storefront` needs a running Medusa backend and a publishable API key, neither of which the harness provisions.
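The credential gate in the first limitation can be sketched as a small decision function. Only the two environment variable names come from this page; the function names, the metric labels, and the disclosure text are illustrative assumptions rather than the harness's real behavior.

```typescript
// Env var names come from the list above; everything else here is hypothetical.
const AI_KEY_VARS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"] as const;

type Env = Record<string, string | undefined>;

// True when at least one provider key is set to a non-blank value.
function hasAiProviderKey(env: Env): boolean {
  return AI_KEY_VARS.some((name) => (env[name] ?? "").trim() !== "");
}

interface MetricsPlan {
  metrics: string[];
  disclosure: string | null; // explicit note when a metric is left out
}

// Decide which metrics a run may report, and what must be disclosed.
function planMetrics(env: Env): MetricsPlan {
  const metrics = ["pixel-only false-positive rate"];
  if (hasAiProviderKey(env)) {
    metrics.push("AI classification accuracy");
    return { metrics, disclosure: null };
  }
  return {
    metrics,
    disclosure: "AI classification accuracy not measured: no provider key configured.",
  };
}
```

In a Node-based runner this would be called as `planMetrics(process.env)`. Treating a blank value as missing avoids reporting accuracy when CI defines the variable but leaves it empty.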
The current published numbers
The aggregate and per-repo tables live in validation/results-v0.2.md
in the repo. The landing page's Validation section reads from the same JSON
artifacts. If you see a percentage on the marketing site that doesn't trace
back to those artifacts, please open an issue.