I Built the LLM Inference Correctness Tool That Should Already Exist
Every benchmark for LLM inference engines measures the same thing: tokens per second. Throughput, latency, time-to-first-token. But nobody measures whether the tokens are correct.
This isn't a theoretical concern. In the past year alone: KV cache NaN pollution in vLLM-Ascend permanently corrupted all subsequent requests after a single bad input. FP8 KV quantization in vLLM caused models to output repeated garbage. SGLang's FP8 DeepGEMM kernels on Blackwell GPUs showed 32.5% element mismatches. These aren't model quality problems — they're engine correctness failures that benchmarks completely miss.
I built infer-check to test for exactly these failures. It's an open-source CLI that runs quantization sweeps, cross-backend differential tests, stress tests, and determinism checks across MLX-based inference engines. I ran it across Llama-3.1-8B and Qwen3.5-4B on Apple Silicon and got some early results worth sharing. The sample sizes are small (50–80 prompts per test), so treat these as directional signals rather than definitive benchmarks — I'm sharing because the data didn't exist at all, not because I think n=80 is conclusive.
The gap: every cross-backend study ignores correctness
I surveyed the published cross-backend comparisons for LLM inference and found a consistent blind spot. BentoML's 2024 benchmark of five serving frameworks? Throughput only. Red Hat's vLLM-vs-llama.cpp comparison on H200 GPUs? Throughput only. The vllm-mlx arXiv paper comparing against mlx-lm and llama.cpp on Apple Silicon? Throughput only. The November 2025 comparative study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS? Throughput only.
Not one of these studies checks whether two backends produce the same output for the same input.
The tooling that does exist is fragmented. vLLM has internal model tests that compare against HuggingFace Transformers, but only for that specific pair. A community tool called hf-vllm-consistency-test does similarity comparisons, but again only HF-vs-vLLM. Microsoft's LLM-42 addresses determinism in SGLang but doesn't do cross-engine comparison. LLMC (EMNLP 2024) can export quantized models to multiple backends but never compares their outputs.
No general-purpose tool exists for this. That's what infer-check is.
How infer-check works
pip install infer-check gives you a CLI with six commands: sweep, compare, diff, stress, determinism, and report. It ships with 209 curated prompts across six suites targeting known quantization failure modes — reasoning, code generation, adversarial numerics, long context, quant-sensitive edge cases, and determinism testing.
The design principle is differential testing. You need a baseline and a test. The baseline can be a higher-precision model (bf16 vs 4-bit), a different quantization of the same model (Bartowski GGUF vs Unsloth GGUF vs MLX native), a different backend (mlx-lm vs vllm-mlx), or the same configuration run multiple times (determinism). infer-check compares outputs using KL divergence and flip rate — not just text similarity — classifies divergence into severity tiers (identical, minor, moderate, severe), and generates self-contained HTML reports you can share directly.
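The metric pipeline can be sketched in a few lines. This is a toy illustration of the idea only — the `kl_divergence` and `classify` functions and the tier thresholds below are my own invention, not infer-check's actual implementation, and a real run aggregates these per token across whole generations:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between baseline and test next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def classify(mean_kl, flip_rate):
    """Map metrics to a severity tier (thresholds are illustrative)."""
    if mean_kl == 0 and flip_rate == 0:
        return "identical"
    if flip_rate == 0 and mean_kl < 0.05:
        return "minor"
    if flip_rate < 0.1:
        return "moderate"
    return "severe"

# Baseline vs. test distributions over a 4-token vocabulary at one step.
base = [0.70, 0.20, 0.05, 0.05]
test = [0.40, 0.45, 0.10, 0.05]  # argmax flipped from token 0 to token 1

kl = kl_divergence(base, test)
flipped = max(range(4), key=base.__getitem__) != max(range(4), key=test.__getitem__)
print(round(kl, 3), flipped, classify(kl, 1.0 if flipped else 0.0))
# → 0.195 True severe
```

The point of combining both metrics: KL divergence measures how far the distributions drifted, while flip rate measures whether that drift actually changed what the model said.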
The compare command is the simplest entry point. Pass two model identifiers — HuggingFace repos, Ollama tags, local GGUF paths, or explicit backend prefixes — and infer-check auto-detects the backend, runs the prompt suite, and reports per-category KL divergence, flip rates, and a detailed breakdown of every prompt where the answer changed:
```shell
# Two MLX quants
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# MLX native vs Ollama GGUF
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M
```
Why KL divergence and flip rate instead of perplexity? Perplexity is misleading for quantization quality — there's a longstanding llama.cpp discussion about this, and the NeurIPS 2024 paper "Accuracy is Not All You Need" showed 0.96–0.97 Spearman correlation between KLD and answer flips. Perplexity can look fine while 13.6% of answers silently change.
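A two-token toy example makes the failure mode concrete: perplexity barely moves while greedy decoding flips the answer. The numbers here are invented for illustration, not taken from any real model:

```python
import math

# Toy next-token distributions over a 2-token vocab at one decoding step.
# The reference continuation is token 0.
base = [0.51, 0.49]   # full-precision model
quant = [0.49, 0.51]  # quantized model: probability mass barely moved

# Perplexity of the reference token shifts by only ~4%...
ppl_base = math.exp(-math.log(base[0]))    # 1 / 0.51
ppl_quant = math.exp(-math.log(quant[0]))  # 1 / 0.49
print(f"perplexity: {ppl_base:.3f} -> {ppl_quant:.3f}")

# ...but greedy decoding now picks a different token: the answer flips.
print("answer flipped:", base.index(max(base)) != quant.index(max(quant)))
```

A perplexity-only evaluation would call these two models nearly identical; a flip-rate evaluation correctly reports a 100% answer change at this step.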
One feature that paid for itself immediately: every sweep automatically runs the baseline model twice as a self-check. If the baseline isn't 100% identical against itself, your comparison data is unreliable. This caught a chat template bug on my first run that would have contaminated every experiment.
Finding 1: 4-bit quantization degrades task-dependently, and numerics break worst
This finding is well-established in the literature — Marchisio et al. (EMNLP 2024), the "Quantization Meets Reasoning" paper (arXiv 2025), Zhou et al.'s Task-Stratified Knowledge Scaling Laws, and Lee et al. (IJCAI 2025) have all shown that mathematical reasoning degrades faster than other tasks under quantization. I'm including it not as a novel claim but because, while aggregate MLX quantization benchmarks exist (JANG, Aider), I couldn't find a task-specific degradation analysis under MLX's native quantization — the kind that shows which tasks break first and how severely.
Here's what Llama-3.1-8B-Instruct looks like at 4-bit on MLX, compared against its own bf16 baseline:
| Prompt Suite | Identical | Severe | Mean Similarity |
|---|---|---|---|
| adversarial-numerics | 0/30 | 23/30 | 0.311 |
| reasoning | 1/50 | 35/50 | 0.384 |
| code | 0/49 | 30/49 | 0.452 |
The gradient is clear: adversarial numerics (IEEE 754 edge cases, large number arithmetic, precision traps) suffers 77% severe degradation. Multi-step reasoning is at 70%. Code generation is at 61%.
At 8-bit the picture is much better — 40% of reasoning prompts produced identical output, and mean similarity was 0.81. The degradation cliff between 8-bit and 4-bit is steep.
Finding 2: MoE degrades at the same rate as dense — on an architecture nobody's tested on MLX
Qwen3.5-4B was released on March 2, 2026 — less than two weeks before I ran these tests. It uses Gated Delta Networks (a 3:1 hybrid of linear attention and full attention) combined with sparse Mixture-of-Experts. Quantization benchmarks for Qwen3.5 now exist on GGUF (Unsloth) and NVFP4 (Kaitchup), but none test MLX's native quantization on this architecture.
What I found: Qwen3.5-4B degrades at essentially the same rate as dense Llama-3.1-8B.
| Model | Identical | Severe | Mean Similarity |
|---|---|---|---|
| Llama-3.1-8B (dense) | 1/50 | 35/50 | 0.384 |
| Qwen3.5-4B (MoE+DeltaNet) | 0/50 | 35/50 | 0.380 |
35/50 severe for both. Mean similarity within 0.004. The expert redundancy that MoQE predicts should help didn't help here — at least not under MLX's native quantization applied uniformly. As far as I can tell, nobody's published MLX-native quantization correctness data on this architecture yet — happy to be corrected.
Finding 3: vllm-mlx is perfectly faithful to mlx-lm
I ran 80 prompts through both backends at temperature=0 on Llama-3.1-8B-Instruct-4bit: 50 reasoning prompts and 30 adversarial numerics prompts.
Result: 80/80 identical. Perfect faithfulness.
| Prompt Suite | Identical | Mean Similarity |
|---|---|---|
| reasoning (n=50) | 50/50 | 1.000 |
| adversarial-numerics (n=30) | 30/30 | 1.000 |
I couldn't find any existing correctness comparison between these two backends — every published comparison I surveyed is throughput-only. n=80 is small, but this data simply didn't exist before. The practical takeaway: if you're choosing between mlx-lm for development and vllm-mlx for serving, the serving layer shouldn't change your outputs.
Finding 4: Reasoning models break cross-backend testing
When I ran the same diff test on Qwen3.5-4B, the result was 50/50 failure — 100% divergence, mean similarity 0.054. This looked catastrophic until I inspected the outputs.
mlx-lm returned the full generation including Qwen3.5's thinking chain. vllm-mlx's chat endpoint returned only the final answer. Both answers were correct. The divergence was structural, not semantic.
This maps to a known issue — Qwen's official documentation warns that vLLM drops reasoning_content fields during API preprocessing. But the fact that automated differential testing flags this as 100% failure underscores an important point: any cross-backend correctness tool must distinguish between semantic divergence and format divergence for reasoning models.
This is something I had to fix in infer-check itself — I added a --chat/--no-chat flag so users explicitly control whether the tool hits /v1/completions or /v1/chat/completions.
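One way to separate format divergence from semantic divergence is to strip the reasoning chain before comparing, assuming it's delimited by `<think>…</think>` tags as in Qwen's chat template. This is a sketch of the idea, not infer-check's implementation:

```python
import re

# Non-greedy match over the whole reasoning chain, including newlines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def semantic_view(text: str) -> str:
    """Drop the reasoning chain so only the final answer is compared."""
    return THINK_RE.sub("", text).strip()

full = "<think>340 - 40 = 300, then 300 / 4 = 75.</think>The answer is 75."
answer_only = "The answer is 75."

print(semantic_view(full) == semantic_view(answer_only))  # → True
```

With this normalization the Qwen3.5-4B diff collapses from "100% failure" to a meaningful semantic comparison — which is exactly why the raw 50/50 result needed manual inspection.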
Finding 5: mlx-lm is deterministic, and vllm-mlx holds under load
Determinism. I ran 50 prompts 20 times each at temperature=0 on both models via mlx-lm. Both achieved 50/50 perfect determinism — every prompt produced bit-identical output across all 20 runs.
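The check itself is simple to sketch: re-run each prompt N times and count distinct outputs. The `stub` below is a hypothetical stand-in for a real backend call, not mlx-lm's API:

```python
from collections import Counter

def determinism_check(generate, prompts, runs=20):
    """Re-run each prompt `runs` times; a prompt passes only if every
    run is bit-identical. Returns {prompt: distinct_output_count}."""
    return {p: len(Counter(generate(p) for _ in range(runs))) for p in prompts}

# Hypothetical deterministic backend stub standing in for temp=0 generation.
stub = {"1+1?": "2", "sky color?": "blue"}.get
result = determinism_check(stub, ["1+1?", "sky color?"])
print(all(n == 1 for n in result.values()))  # → True
```

"50/50 perfect determinism" in the results above means every prompt's distinct-output count was exactly 1 across all 20 runs.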
This sounds obvious, but it isn't. A systematic August 2024 study found that even with temperature=0 and fixed seeds, LLM outputs show considerable variation in production serving engines — Mixtral-8x7b showed a 72 percentage-point accuracy range across just 10 runs. I haven't seen mlx-lm's determinism verified elsewhere, though the sample is small enough that this should be treated as a positive signal rather than a proof.
Stress testing. I hit vllm-mlx with concurrent requests at levels 1, 2, 4, and 8. Zero errors and 100% output consistency at all concurrency levels.
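A minimal version of that stress loop, with a stub standing in for the HTTP endpoint — `send_request` and `server` here are hypothetical, not infer-check's API:

```python
from concurrent.futures import ThreadPoolExecutor

def stress(send_request, prompt, concurrency, reference):
    """Fire `concurrency` identical requests at once and check every
    response against the single-request reference output."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_request, [prompt] * concurrency))
    errors = sum(r is None for r in results)
    consistent = sum(r == reference for r in results)
    return errors, consistent

# Stub standing in for a vllm-mlx endpoint that always answers "42".
server = lambda p: "42"
for level in (1, 2, 4, 8):
    errs, ok = stress(server, "meaning of life?", level, "42")
    print(level, errs, ok)  # zero errors, all responses match the reference
```

"100% output consistency" in the result above means `consistent == concurrency` at every level, with `errors == 0`.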
What this means for deployment
Don't assume MoE is more robust. The theoretical argument for MoE quantization robustness didn't hold for Qwen3.5-4B under MLX's native quantization. Test it — don't rely on architecture alone.
8-bit is the safe line. The degradation cliff between 8-bit and 4-bit on MLX is steep. If your use case involves numerical precision, 4-bit is not safe without task-specific validation.
vllm-mlx is safe for serving. If you develop with mlx-lm locally and serve with vllm-mlx in production, every output I tested matched exactly — correctness parity, not just throughput parity. The one caveat is reasoning models, where the chat endpoint strips thinking content (Finding 4).
The tool
infer-check is on PyPI and GitHub. It ships with all 209 prompts across six suites used in this post.
```shell
# Head-to-head quant comparison with KL divergence + flip rate
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit

# Cross-provider comparison (auto-detects backends)
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  ollama:llama3.1:8b-instruct-q4_K_M

# Quantization sweep across multiple bit-widths
infer-check sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --backend mlx-lm --prompts reasoning

# Cross-backend diff
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,openai-compat" \
  --base-urls ",http://localhost:8000" --prompts reasoning

# Generate a shareable HTML report from any results directory
infer-check report ./results/ --format html
```
What's next
v0.2.0 shipped the compare command with KL divergence, flip rates, cross-provider model resolution, and HTML report generation. infer-check is now in Beta — the CLI surface is stable.
The next phase is CUDA backend support. vLLM and SGLang both expose OpenAI-compatible APIs, so the integration pattern is the same as the existing Ollama support. Once those land, infer-check becomes the only tool that spans both Apple Silicon and CUDA ecosystems for cross-engine correctness testing. The immediate targets: cross-GPU correctness testing (A10 vs A100 vs H100), attention backend comparison (FlashAttention vs Triton MLA), and version upgrade regression testing. The vllm-mlx team presents at EuroMLSys on April 27 — as the MLX serving ecosystem matures, correctness testing needs to keep pace.
The field is moving from "does quantization hurt?" toward "how do we test and mitigate it in production serving?" Every cross-backend study I found measures throughput. None of them check whether the outputs match. That gap is what infer-check exists to close — and if you're deploying quantized models across multiple engines, it's a gap worth closing before your users find it for you.