Zero-Overhead Introspection
for Adaptive Test-Time Compute

UC Berkeley · MIT CSAIL · Liquid AI
*Corresponding author: rohinm@berkeley.edu
ZIP-RC overview figure

ZIP-RC provides real-time reward–cost introspection and uses it to allocate test-time compute adaptively.

Abstract

Large language models excel at reasoning but lack key aspects of introspection, including the ability to anticipate their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this ability, LLMs struggle to make intelligent meta-cognitive decisions. Test-time scaling methods such as Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but they do not enable adaptive inference and they add substantial inference cost by requiring extra models or forward passes. We present ZIP-RC, which equips models with zero-overhead introspective predictions of reward and cost. At every token during generation, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length, with no extra models, architecture changes, or inference overhead. This full joint distribution is used to compute a sampling utility: a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and it traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward–cost introspection, ZIP-RC allows models to reason adaptively and more efficiently.
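One way to read the sampling utility described above (our notation and decomposition, not necessarily the paper's exact formulation): for a set S of samples with final rewards R_i and remaining lengths C_i, take

\[
U(S) \;=\; \mathbb{E}\Big[\max_{i \in S} R_i\Big]
\;-\; \lambda_{\text{compute}}\, \mathbb{E}\Big[\textstyle\sum_{i \in S} C_i\Big]
\;-\; \lambda_{\text{latency}}\, \mathbb{E}\Big[\max_{i \in S} C_i\Big],
\]

where the expectations are estimated from ZIP-RC's predicted joint distributions and the hypothetical weights \(\lambda_{\text{compute}}\) and \(\lambda_{\text{latency}}\) set the desired quality–compute–latency tradeoff.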

Method

ZIP-RC trains an LLM to expose real-time introspection signals without any additional inference-time compute.

  • Zero overhead: reserve a small contiguous slice of the vocabulary and interpret its logits as auxiliary predictions; mask those tokens so they are never sampled (see the first sketch after this list).
  • Joint reward–cost prediction: at each token, predict a joint distribution over final reward (e.g., correctness) and remaining generation length.
  • Adaptive parallel decoding: use the joint distribution to compute a sampling utility that trades off expected best reward, compute, and latency, then dynamically continue/spawn/pause samples (see the second sketch after this list).
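
A minimal sketch of how the first two bullets could fit together, assuming a PyTorch model whose output head has a reserved slice appended after the ordinary vocabulary; the sizes, the reward-by-length grid, and the function name below are illustrative assumptions rather than the paper's exact configuration:

import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not the paper's configuration).
VOCAB_SIZE = 32_000           # ordinary token vocabulary
NUM_REWARD_BINS = 2           # e.g. incorrect / correct
NUM_LENGTH_BINS = 16          # discretized remaining-length buckets
NUM_RESERVED = NUM_REWARD_BINS * NUM_LENGTH_BINS

def split_logits(logits: torch.Tensor):
    """Split a single forward pass's logits into (a) next-token logits
    restricted to the ordinary vocabulary, so reserved ids are never
    sampled, and (b) a joint distribution over final reward and remaining
    length read off the reserved slice."""
    token_logits = logits[..., :VOCAB_SIZE]
    reserved = logits[..., VOCAB_SIZE:VOCAB_SIZE + NUM_RESERVED]

    # Softmax over the reserved slice gives p(reward bin, length bin | prefix).
    joint = F.softmax(reserved, dim=-1).view(
        *reserved.shape[:-1], NUM_REWARD_BINS, NUM_LENGTH_BINS
    )
    return token_logits, joint

# Example: a fake logit vector for one decoding step.
logits = torch.randn(VOCAB_SIZE + NUM_RESERVED)
token_logits, joint = split_logits(logits)
print(joint.shape)          # torch.Size([2, 16])
print(joint.sum().item())   # ~1.0: a proper joint distribution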
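
And a minimal sketch of the third bullet, assuming each in-flight sample is summarized by its predicted probability of correctness and its expected remaining length taken from the joint above; the helper names, lambda weights, and independence simplifications are ours for illustration:

import numpy as np

def sampling_utility(samples, lam_compute=1e-4, lam_latency=1e-4):
    """Utility of a set of samples: expected best reward minus weighted
    compute and latency, estimated from per-sample ZIP-RC predictions."""
    p_correct = np.array([s["p_correct"] for s in samples])
    exp_len = np.array([s["exp_len"] for s in samples])

    # Expected best reward, treating samples as independent Bernoulli outcomes.
    exp_max_reward = 1.0 - np.prod(1.0 - p_correct)
    # Total compute ~ sum of expected remaining lengths;
    # latency ~ longest expected remaining length (a simplification).
    total_compute = exp_len.sum()
    latency = exp_len.max()
    return exp_max_reward - lam_compute * total_compute - lam_latency * latency

def should_spawn(samples, prefix_prediction, **weights):
    """Meta-action check: spawn a new sample from a prefix only if doing so
    raises the overall sampling utility."""
    return (
        sampling_utility(samples + [prefix_prediction], **weights)
        > sampling_utility(samples, **weights)
    )

samples = [{"p_correct": 0.55, "exp_len": 400.0}]
new = {"p_correct": 0.55, "exp_len": 600.0}
print(should_spawn(samples, new))  # True if the extra attempt pays for itself

Under this reading, a spawn meta-action pays off exactly when the extra chance of obtaining a correct answer outweighs the additional compute and latency it adds.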

Results

ZIP-RC’s joint predictions are calibrated enough to guide inference, enabling controllable tradeoffs between accuracy, compute, and latency.

ZIP-RC joint distribution predictions vs. ground truth

Joint reward–cost predictions (ZIP-RC) compared to ground-truth estimates from rollouts.

ZIP-RC sampling Pareto frontiers across models and benchmarks

ZIP-RC sampling traces Pareto frontiers and improves accuracy at matched or lower generation cost.

BibTeX

@misc{manvi2025zerooverheadintrospectionadaptivetesttime,
  title={Zero-Overhead Introspection for Adaptive Test-Time Compute},
  author={Rohin Manvi and Joey Hong and Tim Seyde and Maxime Labonne and Mathias Lechner and Sergey Levine},
  year={2025},
  eprint={2512.01457},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2512.01457},
}