# infer-check
Cross-backend differential testing CLI for LLM inference correctness — quantization sweeps, determinism verification, and serving-layer faithfulness across mlx-lm, vllm-mlx, and llama.cpp.
## The problem
Every benchmark for LLM inference engines measures tokens per second. Nobody measures whether the tokens are correct. KV cache corruption in vLLM-Ascend, FP8 garbage output in vLLM, 32.5% element mismatches in SGLang's DeepGEMM kernels — these are engine correctness failures that throughput benchmarks completely miss.
No general-purpose tool existed for systematic cross-backend differential testing of LLM inference outputs. infer-check fills that gap.
## What it does
`pip install infer-check` gives you a CLI with five commands: `sweep`, `diff`, `stress`, `determinism`, and `report`. It ships with 189 curated prompts across five suites targeting known quantization failure modes.
The design principle is differential testing. Compare a baseline (bf16) against a test (4-bit), or one backend (mlx-lm) against another (vllm-mlx), or the same config run multiple times (determinism). infer-check computes text similarity, classifies divergence into severity tiers, and generates HTML reports.
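The core comparison can be sketched with Python's standard-library `difflib`; the similarity metric and severity thresholds below are illustrative assumptions, not infer-check's actual internals.

```python
from difflib import SequenceMatcher

# Illustrative severity tiers keyed on similarity score -- these thresholds
# are assumptions for the sketch, not infer-check's real classification.
TIERS = [(0.95, "match"), (0.80, "minor"), (0.50, "moderate")]

def classify(baseline: str, test: str) -> tuple[float, str]:
    """Score the test output against the baseline output and assign a tier."""
    score = SequenceMatcher(None, baseline, test).ratio()
    for threshold, tier in TIERS:
        if score >= threshold:
            return score, tier
    return score, "severe"

# Identical outputs score 1.0 and land in the top tier.
score, tier = classify("The answer is 42.", "The answer is 42.")
```

The same `classify` call works for any baseline/test pair — bf16 vs. 4-bit, mlx-lm vs. vllm-mlx, or run N vs. run N+1 for determinism checks.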
Key findings
- First published MLX quantization correctness data: 4-bit quantization degrades task-dependently — 77% severe divergence on adversarial numerics, 70% on reasoning, 61% on code
- First Qwen3.5-4B quantization study: MoE + Gated DeltaNet degrades at the same rate as dense Llama — expert redundancy didn't help under uniform quantization
- First vllm-mlx faithfulness verification: 80/80 identical outputs vs mlx-lm at `temperature=0`
- Thinking token divergence detection: Automatically catches structural divergence from reasoning token stripping in chat APIs
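The last check above can be sketched in a few lines, assuming reasoning tokens are delimited by `<think>...</think>`; the delimiter, function names, and detection rule here are illustrative assumptions, not infer-check's actual implementation.

```python
import re

# Assumed reasoning-token delimiter -- real models/APIs may use other markers.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove delimited reasoning blocks, as a chat API might before serving."""
    return THINK_RE.sub("", text).strip()

def thinking_divergence(raw: str, served: str) -> bool:
    """Flag structural divergence: the served text differs from the raw text
    but matches it once reasoning tokens are stripped -- i.e. the serving
    layer dropped the thinking block rather than producing different tokens."""
    return served != raw and served == strip_thinking(raw)

raw = "<think>Let me add 2 and 2.</think>The answer is 4."
served = "The answer is 4."
# thinking_divergence(raw, served) -> True: only the reasoning block differs.
```

Separating this case from genuine token-level divergence matters: stripped reasoning tokens are a serving-layer artifact, not an engine correctness failure.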