modelcheck.org

Is the LLM API you're paying for actually the model it claims?

modelcheck.org is an independent, continuously-running integrity dashboard for the resellers and middleware ("shadow APIs", "中转站") that sell indirect access to frontier reasoning models. We publish multi-method audit data on named providers, so consumers and researchers can verify what they are actually being served.

Pre-launch: the public dashboard goes live shortly. This page describes what we are building.

The problem

45.83% of shadow-API endpoints fail model identity verification, per the CISPA Helmholtz Center 2026 audit: they bill for the advertised premium model while serving a cheaper substitute.

−47.21%: the observed accuracy collapse on Gemini-2.5-flash routed through shadow APIs vs. the official endpoint, on medical-domain tasks (same study).

Existing audits are frozen-in-time snapshots: they anonymize the providers they tested and are never re-run. The market moves weekly. There is no persistent, public, named dashboard. modelcheck.org is that dashboard.

How we audit

No single test catches every form of substitution. We combine three complementary methods, each measuring a different surface:

  1. Chain-of-thought length-distribution detection

    From our own companion paper. The per-prompt length distribution of a reasoning model's chain of thought is largely a function of its weights. A provider that silently disables thinking mode or substitutes a non-reasoning sibling model produces a measurably different length distribution, even when the median is preserved. Catches the cheapest substitution attack.

  2. Model Equality Testing

    From Gao et al. (2024). Two-sample MMD with a Hamming kernel on the token distribution. Catches gross substitution and heavy quantisation.

  3. Active fingerprinting (LLMmap-style)

    From Cai et al. (2025). GCG-optimised suffixes elicit a coarse first-token-preference signal robust to inference-stack noise. Anchors model identity over a single token.
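
To make method 1 concrete, here is a minimal sketch of comparing two length distributions that share a median. It uses a two-sample Kolmogorov-Smirnov statistic as a stand-in distance; that choice, the function name `ks_statistic`, and the token counts are illustrative assumptions, not the procedure from the companion paper.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(reference, candidate):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    ref, cand = sorted(reference), sorted(candidate)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    for x in set(ref) | set(cand):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cand = bisect_right(cand, x) / len(cand)
        gap = max(gap, abs(cdf_ref - cdf_cand))
    return gap


# Hypothetical chain-of-thought token counts for one prompt, sampled
# repeatedly from the first-party endpoint and from a reseller.
reference_lengths = [120, 340, 560, 800, 1000, 1200, 1900, 2600, 4100]
suspect_lengths = [960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040]

same_median = median(reference_lengths) == median(suspect_lengths)
distance = ks_statistic(reference_lengths, suspect_lengths)
print(f"medians equal: {same_median}, KS distance: {distance:.3f}")
```

Both samples have the same median, yet the KS distance is well above zero, which is the situation described in method 1: a summary statistic can be preserved while the shape of the distribution gives the substitution away.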

Each provider page will report all three signals against a known-honest first-party reference, plus latency, throughput, and price.
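
As an illustration of how a signal can be scored against a first-party reference, the following sketch computes a two-sample MMD with a Hamming-style kernel over token sequences, in the spirit of method 2. The padding token, the normalisation by sequence length, the biased estimator, the permutation test, and all names and token ids are assumptions made for this example, not details taken from Gao et al. or from our pipeline.

```python
import random

PAD = -1  # hypothetical padding token id


def hamming_kernel(a, b, length):
    """Fraction of positions at which two padded token sequences agree."""
    a = (list(a) + [PAD] * length)[:length]
    b = (list(b) + [PAD] * length)[:length]
    return sum(x == y for x, y in zip(a, b)) / length


def mmd_squared(xs, ys, length):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(p, q):
        total = sum(hamming_kernel(a, b, length) for a in p for b in q)
        return total / (len(p) * len(q))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


def permutation_p_value(xs, ys, length, rounds=200, seed=0):
    """Estimate how often a random relabelling of the pooled samples gives
    an MMD at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, length)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if mmd_squared(pooled[:len(xs)], pooled[len(xs):], length) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


# Hypothetical token-id completions for the same prompt.
reference = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 4],
             [1, 2, 3, 4], [1, 7, 3, 4], [1, 2, 3, 4]]
suspect = [[9, 8, 3, 4], [9, 8, 7, 4], [9, 8, 3, 6],
           [9, 8, 3, 4], [9, 2, 3, 4], [9, 8, 3, 4]]

print("MMD^2:", mmd_squared(reference, suspect, length=4))
print("p-value:", permutation_p_value(reference, suspect, length=4))
```

A small p-value suggests the reseller's completions are not drawn from the same distribution as the reference. A real audit would need far more samples and prompts than this toy example.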

Hard rules

Status & roadmap

Phase   Scope                                                                 Status
v0.1    Length-distribution audit; 5 reasoning models; ~10 named resellers    in build
v0.2    + Model Equality Testing; user-submitted local-audit uploads          planned
v0.3    + LLMmap fingerprinting; Chrome extension; provider right-of-reply    planned