
CALID

Confidence-based filter decoding system that routes LLM inference requests between small and large models to reduce compute costs while maintaining output quality. BayLearn 2024 poster at Apple.

The problem

LLM inference is expensive — processing a single request can cost up to 10x more than a traditional keyword search. Speculative decoding helps by using small draft models to produce initial token sequences that a large model verifies in parallel, but it still wastes compute generating unnecessary drafts that the large model rejects.

The approach

CALID (Collaborative Accelerate LLM Inference with Draft Model) introduces filter decoding: instead of always sending draft outputs to the large model for verification, we compute a confidence score based on the negative log-likelihood (NLL) of the draft model's top prediction. High-confidence drafts are accepted directly, without touching the large model at all.

The system architecture has three components:

  1. SLM (Small Language Model) generates draft tokens with confidence scores
  2. Inference Gateway batches requests and applies the confidence threshold — high-confidence drafts are returned immediately, low-confidence ones are sent to the LLM
  3. LLM refines only the uncertain outputs, processing them as batched requests
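The gateway's routing step can be sketched as follows. This is a minimal illustration, not the actual CALID implementation: the names (`DraftResult`, `route`, `refine_with_llm`) and the single-threshold batching logic are assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class DraftResult:
    tokens: list        # draft tokens produced by the SLM
    confidence: float   # NLL-based score; lower means more confident

def route(drafts, threshold, refine_with_llm):
    """Split drafts by confidence: confident ones pass through
    directly, uncertain ones are batched for LLM refinement."""
    accepted, uncertain = [], []
    for d in drafts:
        if d.confidence <= threshold:
            accepted.append(d)
        else:
            uncertain.append(d)
    # The large model sees only the low-confidence batch
    refined = refine_with_llm([d.tokens for d in uncertain]) if uncertain else []
    return [d.tokens for d in accepted] + refined
```

The design point this illustrates: the LLM call happens once per batch of uncertain drafts, so the compute savings scale directly with the fraction of drafts that clear the threshold.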

The confidence score is simple but effective: for each token position, take the NLL of the most likely token in the draft model's distribution. High NLL means the model is uncertain; low NLL means the prediction is reliable.
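A sketch of that score, assuming the per-position NLLs are averaged into a single per-draft value (the aggregation choice is my assumption; the poster may threshold per token instead):

```python
import math

def nll_confidence(prob_dists):
    """Average NLL of the top prediction across token positions.

    prob_dists: one probability distribution (list of floats summing
    to 1) per generated token position. Low return value means the
    draft model was consistently confident; high means uncertain.
    """
    nlls = [-math.log(max(dist)) for dist in prob_dists]
    return sum(nlls) / len(nlls)
```

For example, a draft whose top token has probability 0.9 at every position scores about 0.105, while one facing a uniform 4-way choice scores about 1.386, so a threshold between the two separates the cases cleanly.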

Results

Using Llama-2 (7B as draft, 70B as target), CALID diverted 36–48% of requests away from the large model with minimal degradation in output quality, as measured by perplexity. The key insight: many tokens in natural language are predictable enough that a small model's output is indistinguishable from the large model's — you only need the expensive model for genuinely difficult predictions.

My contribution

This work was done during a summer 2024 research internship at UC Santa Cruz in Prof. Chen Qian's lab. I contributed to the system design and experimental evaluation. The paper was presented as a poster at BayLearn 2024 hosted at Apple's campus.

Connection to current work

CALID directly informed my later work on infer-check. The core insight — that inference correctness varies by difficulty and you need task-specific testing to catch failures — is the same principle that drives infer-check's differential testing approach. CALID asks "when is the small model good enough?"; infer-check asks "when does quantization break things?" Both are about understanding where inference quality degrades.