Is the LLM API you're paying for actually the model it claims?

modelcheck.org is an independent, continuously-running integrity dashboard for the resellers and middleware ("shadow APIs", "中转站") that sell indirect access to frontier reasoning models. We publish multi-method audit data on named providers, so consumers and researchers can verify what they are actually being served.

Pre-launch: the public dashboard goes live shortly. This page describes what we are building.

The problem

45.83%

of shadow-API endpoints fail model identity verification, per the CISPA Helmholtz Center 2026 audit — they bill for the advertised premium model while serving a cheaper substitute.

−47.21%

observed accuracy collapse on Gemini-2.5-flash routed through shadow APIs vs. the official endpoint, on medical-domain tasks (same study).

Existing audits are frozen-in-time snapshots: they anonymize the providers they tested and are never re-run. The market moves weekly, yet there is no persistent, public dashboard that names providers. modelcheck.org is that dashboard.

How we audit

No single test catches every form of substitution. We combine three complementary methods, each measuring a different surface:

Chain-of-thought length-distribution detection

From our own companion paper. The per-prompt length distribution of a reasoning model's chain of thought is largely a function of its weights. A provider that silently disables thinking mode or substitutes a non-reasoning sibling model produces a measurably different length distribution, even when the median is preserved. This catches the cheapest substitution attack.
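
As a rough illustration of the idea (not the companion paper's actual procedure), a two-sample Kolmogorov-Smirnov statistic with a permutation test can flag a change in distribution shape that a median comparison would miss. The sketch below uses only the Python standard library; the function names, sample sizes, and the synthetic log-normal lengths are all assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(sa) | set(sb))
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in points
    )


def permutation_p_value(a, b, n_perm=500, seed=0):
    """Share of random relabelings whose KS gap is at least the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[: len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic data (an assumption): both samples have a median near exp(6) tokens,
# but the "suspect" endpoint produces a much narrower spread of lengths.
rng = random.Random(42)
reference_lengths = [rng.lognormvariate(6.0, 1.0) for _ in range(200)]
suspect_lengths = [rng.lognormvariate(6.0, 0.2) for _ in range(200)]

print("KS statistic:", round(ks_statistic(reference_lengths, suspect_lengths), 3))
print("p-value:", permutation_p_value(reference_lengths, suspect_lengths))
```

A small p-value here says only that the two length samples are unlikely to share a distribution; it does not say why they differ.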

Model Equality Testing

From Gao et al. (2024). Two-sample MMD with a Hamming kernel on the token distribution. Catches gross substitution and heavy quantisation.
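
To make the statistic concrete, here is a minimal sketch of an unbiased squared-MMD estimate with a Hamming-style kernel. It assumes completions have already been tokenised and padded or truncated to a common length, and it normalises the kernel to the fraction of matching positions; both choices, along with the toy 4-token vocabulary and all function names, are assumptions for illustration and may differ from the setup in Gao et al.

```python
import random
from itertools import combinations


def hamming_kernel(x, y):
    """Fraction of positions where two equal-length token sequences agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def mmd_squared(xs, ys, kernel=hamming_kernel):
    """Unbiased estimate of squared MMD between two samples of sequences."""
    m, n = len(xs), len(ys)
    k_xx = 2 * sum(kernel(a, b) for a, b in combinations(xs, 2)) / (m * (m - 1))
    k_yy = 2 * sum(kernel(a, b) for a, b in combinations(ys, 2)) / (n * (n - 1))
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2 * k_xy


def sample_completions(rng, weights, n=50, length=16):
    """Toy stand-in for API completions: i.i.d. token ids from a 4-token vocabulary."""
    return [tuple(rng.choices(range(4), weights=weights, k=length)) for _ in range(n)]


rng = random.Random(7)
reference = sample_completions(rng, [0.4, 0.3, 0.2, 0.1])
same_model = sample_completions(rng, [0.4, 0.3, 0.2, 0.1])
substitute = sample_completions(rng, [0.1, 0.2, 0.3, 0.4])

print("same model :", round(mmd_squared(reference, same_model), 3))
print("substitute :", round(mmd_squared(reference, substitute), 3))
```

Because the estimator is unbiased, it can come out slightly negative when the two samples really do share a distribution. In practice the value would be compared against a permutation or bootstrap null rather than a fixed cutoff.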

Active fingerprinting (LLMmap-style)

From Cai et al. (2025). GCG-optimised suffixes elicit a coarse first-token-preference signal robust to inference-stack noise. Anchors model identity over a single token.
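
The comparison step can be sketched as follows, leaving suffix optimisation entirely out of scope. The sketch assumes each probe has already been sent several times to both the reference and the candidate endpoint and that only the first token of each response was kept; the probe ids, tokens, and function names are hypothetical.

```python
from collections import Counter


def modal_first_tokens(responses_by_probe):
    """Map each probe id to the most common first token among its responses."""
    return {
        probe: Counter(tokens).most_common(1)[0][0]
        for probe, tokens in responses_by_probe.items()
    }


def fingerprint_agreement(reference, candidate):
    """Share of common probes whose modal first token matches the reference."""
    ref = modal_first_tokens(reference)
    cand = modal_first_tokens(candidate)
    shared = ref.keys() & cand.keys()
    if not shared:
        raise ValueError("no probes in common")
    return sum(ref[p] == cand[p] for p in shared) / len(shared)


# Hypothetical first tokens collected from three calls per probe.
reference_runs = {
    "probe-a": ["Yes", "Yes", "No"],
    "probe-b": ["The", "The", "The"],
    "probe-c": ["1", "1", "2"],
    "probe-d": ["Sure", "Sure", "Sure"],
}
candidate_runs = {
    "probe-a": ["Yes", "Yes", "Yes"],
    "probe-b": ["A", "A", "The"],
    "probe-c": ["1", "2", "1"],
    "probe-d": ["Sure", "Ok", "Sure"],
}

print(fingerprint_agreement(reference_runs, candidate_runs))  # prints 0.75
```

Taking the modal token per probe is one simple way to absorb sampling noise; a real audit would also need a baseline for how often the reference agrees with itself.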

Each provider page will report all three signals against a known-honest first-party reference, plus latency, throughput, and price.

Hard rules

We never publish the word "fraud" about a named provider. Our verdicts are "consistent", "inconclusive", or "signal detected vs. reference distribution"; the operator interprets, we report data.
Methodology and prompt-generation code are open source. Audit seeds rotate weekly so prompt-keyed cache replay does not poison the dashboard.
Every provider has a permanent right of reply. Their response is published alongside the audit data, unedited.
We do not, and cannot, certify that any provider is "honest". A consistent result means only that the methods we run today found no signal; light quantisation and inference-stack drift are known to be undetectable by these methods.
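
One way weekly seed rotation could work is sketched below: the seed is derived from the ISO year and week number, so it is stable within a week and changes across weeks. Keying the derivation with a private secret (so a provider cannot precompute next week's prompts from the open-source code) is an assumption of this sketch, not something stated above; `SECRET`, `TEMPLATES`, and both function names are hypothetical.

```python
import hashlib
import hmac
import random
from datetime import date


def weekly_seed(secret: bytes, day: date) -> int:
    """Derive a 64-bit seed that is stable within an ISO week."""
    iso_year, iso_week = day.isocalendar()[:2]
    message = f"{iso_year}-W{iso_week:02d}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def weekly_prompts(templates, secret: bytes, day: date, k: int):
    """Pick k prompt templates for the week, reproducibly given the secret."""
    rng = random.Random(weekly_seed(secret, day))
    return rng.sample(templates, k)


SECRET = b"example-only-secret"  # hypothetical; a real key would be kept private
TEMPLATES = [f"template-{i}" for i in range(20)]  # hypothetical prompt templates

# 2026-01-05 (Monday) and 2026-01-11 (Sunday) fall in the same ISO week.
print(weekly_prompts(TEMPLATES, SECRET, date(2026, 1, 5), k=3))
print(weekly_prompts(TEMPLATES, SECRET, date(2026, 1, 11), k=3))
```

Both calls print the same selection, while any date from the following Monday onward derives a different seed.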

Status & roadmap

Phase	Scope	Status
v0.1	Length-distribution audit; 5 reasoning models; ~10 named resellers	in build
v0.2	+ Model Equality Testing; user-submitted local-audit uploads	planned
v0.3	+ LLMmap fingerprinting; Chrome extension; provider right-of-reply	planned
