# infer-check
Cross-backend differential testing CLI for LLM inference correctness — quantization sweeps, determinism verification, and serving-layer faithfulness across mlx-lm, vllm-mlx, and llama.cpp.
## The problem
Every benchmark for LLM inference engines measures tokens per second. Nobody measures whether the tokens are correct. KV cache corruption in vLLM-Ascend, FP8 garbage output in vLLM, 32.5% element mismatches in SGLang's DeepGEMM kernels — these are engine correctness failures that throughput benchmarks completely miss.
No general-purpose tool existed for systematic cross-backend differential testing of LLM inference outputs. infer-check fills that gap.
## What it does
`pip install infer-check` gives you a CLI with five commands: `sweep`, `diff`, `stress`, `determinism`, and `report`. It ships with 189 curated prompts across five suites targeting known quantization failure modes.
The design principle is differential testing. Compare a baseline (bf16) against a test (4-bit), or one backend (mlx-lm) against another (vllm-mlx), or the same config run multiple times (determinism). infer-check computes text similarity, classifies divergence into severity tiers, and generates HTML reports.
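The core comparison can be sketched with Python's standard-library `difflib`; the similarity metric and severity thresholds below are illustrative assumptions, not infer-check's actual internals.

```python
from difflib import SequenceMatcher

# Illustrative severity tiers keyed on similarity score -- these thresholds
# are assumptions for the sketch, not infer-check's real classification.
TIERS = [(0.95, "match"), (0.80, "minor"), (0.50, "moderate")]

def classify(baseline: str, test: str) -> tuple[float, str]:
    """Score the test output against the baseline output and assign a tier."""
    score = SequenceMatcher(None, baseline, test).ratio()
    for threshold, tier in TIERS:
        if score >= threshold:
            return score, tier
    return score, "severe"

# Identical outputs score 1.0 and land in the top tier.
score, tier = classify("The answer is 42.", "The answer is 42.")
```

The same `classify` call works for any baseline/test pair — bf16 vs. 4-bit, mlx-lm vs. vllm-mlx, or run N vs. run N+1 for determinism checks.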
Key findings
- First published MLX quantization correctness data: 4-bit quantization degrades task-dependently — 77% severe divergence on adversarial numerics, 70% on reasoning, 61% on code
- First Qwen3.5-4B quantization study: MoE + Gated DeltaNet degrades at the same rate as dense Llama — expert redundancy didn't help under uniform quantization
- First vllm-mlx faithfulness verification: 80/80 identical outputs vs mlx-lm at `temperature=0`
- Thinking token divergence detection: Automatically catches structural divergence from reasoning token stripping in chat APIs
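The last check above can be sketched in a few lines, assuming reasoning tokens are delimited by `<think>...</think>`; the delimiter, function names, and detection rule here are illustrative assumptions, not infer-check's actual implementation.

```python
import re

# Assumed reasoning-token delimiter -- real models/APIs may use other markers.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove delimited reasoning blocks, as a chat API might before serving."""
    return THINK_RE.sub("", text).strip()

def thinking_divergence(raw: str, served: str) -> bool:
    """Flag structural divergence: the served text differs from the raw text
    but matches it once reasoning tokens are stripped -- i.e. the serving
    layer dropped the thinking block rather than producing different tokens."""
    return served != raw and served == strip_thinking(raw)

raw = "<think>Let me add 2 and 2.</think>The answer is 4."
served = "The answer is 4."
# thinking_divergence(raw, served) -> True: only the reasoning block differs.
```

Separating this case from genuine token-level divergence matters: stripped reasoning tokens are a serving-layer artifact, not an engine correctness failure.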