I Built the LLM Inference Correctness Tool That Should Already Exist
Every benchmark measures tokens per second. Nobody measures whether the tokens are correct. I built infer-check to fix that — and ran it across Llama-3.1-8B and Qwen3.5-4B on Apple Silicon.