Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Problem Statement
Deploying large LLMs incurs high computational and financial costs, yet smaller models often lack the capability to handle complex tasks reliably. Existing approaches either use a single fixed model (wasteful or insufficient) or require complex ensemble/distillation pipelines, leaving a gap for lightweight, dynamic multi-scale routing that works across both local deployments and commercial APIs.
Key Novelty
- Dual-signal confidence estimation that jointly evaluates a model's likelihood of knowing the correct answer AND the probability that its generated response is accurate
- Multi-scale cascading inference pipeline that dynamically delegates uncertain or complex queries to progressively larger models, avoiding unnecessary large-model calls
- Demonstrated applicability to both open-weight multi-scale local inference and commercial API settings (GPT-4o), enabling broad real-world deployment scenarios
Evaluation Highlights
- On the MMLU benchmark, achieves accuracy comparable to the largest model while reducing computational cost by 20%–40% relative to always using the largest model
- When applied to GPT-4o API calls, reduces token usage by approximately 60%, translating directly into cost savings for commercial API deployments
Methodology
- Step 1 – Confidence Assessment: For a given input query, run inference on the smallest candidate model and compute two confidence signals: (a) the model's estimated probability of possessing correct knowledge for the task, and (b) the probability that the specific generated response is accurate.
- Step 2 – Routing Decision: Compare the combined confidence estimate against a threshold; if confidence is sufficiently high, accept the small model's output; otherwise, escalate the query to the next larger model in the cascade.
- Step 3 – Iterative Delegation: Repeat the confidence assessment and routing process at each scale level until either confidence exceeds the threshold or the largest available model is reached, ensuring reliability for hard cases while minimizing average compute.
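The three steps above can be sketched as a single routing loop. This is a minimal illustration, not the paper's implementation: the model objects, the `score_confidence` helper, and the threshold value are all hypothetical placeholders.

```python
# Sketch of the cascading inference pipeline (Steps 1-3).
# Assumes each model exposes generate(), and that score_confidence()
# fuses the two confidence signals into one value in [0, 1].

def cascade_infer(query, models, score_confidence, threshold=0.8):
    """Try models from smallest to largest; accept the first answer
    whose combined confidence clears the threshold. The largest model's
    answer is always accepted, since no further escalation is possible."""
    answer, conf = None, 0.0
    for i, model in enumerate(models):
        answer = model.generate(query)
        # Step 1: dual-signal confidence for this model's output
        conf = score_confidence(model, query, answer)
        # Step 2: accept if confident enough; Step 3: otherwise escalate
        if conf >= threshold or i == len(models) - 1:
            return answer, model.name, conf
    return answer, models[-1].name, conf
```

In practice, `models` would hold clients for progressively larger checkpoints (or API tiers), ordered so that cheap inference is always attempted first.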
System Components
- Knowledge confidence signal (pre-generation): Evaluates whether the model is likely to possess the factual or reasoning knowledge required to answer the query correctly, acting as a prior confidence signal before generation.
- Response confidence signal (post-generation): Assesses the probability that the model's actual generated response is correct, acting as a verification signal to catch overconfident but wrong outputs.
- Model cascade: An ordered sequence of models from smallest to largest; queries are passed up the chain only when confidence thresholds are not met, minimizing average inference cost.
- Confidence threshold: A tunable decision boundary that balances the accuracy-cost trade-off by controlling how aggressively queries are escalated to larger models.
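How the two signals are fused into a single routing score is not specified here; a simple product, treating the pre- and post-generation estimates as roughly independent probabilities, is one plausible choice, shown purely for illustration.

```python
# Hypothetical fusion of the two confidence signals; the product rule
# below is an assumption for illustration, not the paper's method.

def combined_confidence(p_knowledge: float, p_response: float) -> float:
    """Fuse the pre-generation knowledge prior and the post-generation
    response-accuracy estimate into one score in [0, 1]."""
    if not (0.0 <= p_knowledge <= 1.0 and 0.0 <= p_response <= 1.0):
        raise ValueError("signals must be probabilities in [0, 1]")
    return p_knowledge * p_response

def should_escalate(p_knowledge: float, p_response: float,
                    threshold: float = 0.7) -> bool:
    # Escalate to the next larger model when the fused score is low
    return combined_confidence(p_knowledge, p_response) < threshold
```

Raising the threshold escalates more queries (higher accuracy, higher cost); lowering it keeps more queries on the small model, which is exactly the accuracy-cost dial described above.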
Results
| Metric/Benchmark | Baseline (Largest Model Always) | This Paper (Cascade) | Delta |
|---|---|---|---|
| MMLU Accuracy | Largest-model accuracy (reference) | Comparable to largest model | Negligible accuracy loss |
| Computational Cost (Local) | 100% (full large-model inference) | 60%–80% of baseline | -20% to -40% |
| GPT-4o Token Usage (API) | 100% (all queries to GPT-4o) | ~40% of baseline | -60% token reduction |
Key Takeaways
- Practitioners can integrate confidence-based cascading into existing multi-model pipelines with minimal overhead, achieving large-model quality at substantially lower cost—especially valuable for high-volume production workloads.
- The ~60% token reduction on GPT-4o API calls makes this approach highly attractive for cost-sensitive commercial applications, where API billing is a primary operational expense.
- The dual-signal confidence design (knowledge likelihood + response accuracy) is more robust than single-signal approaches; ML engineers should consider both pre- and post-generation confidence when designing routing systems for edge or resource-constrained deployments.
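A back-of-envelope calculation shows what the reported ~60% token reduction means for an API bill. The monthly volume and per-token price below are made-up inputs for illustration only, not figures from the paper.

```python
# Illustrative cost arithmetic for a ~60% token reduction on API calls.
# Volume and price are hypothetical; only the 60% figure comes from the text.

def monthly_api_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Simple linear billing model: cost scales with tokens consumed."""
    return tokens_per_month / 1000 * price_per_1k_tokens

baseline_tokens = 100_000_000                 # hypothetical monthly volume
reduced_tokens = int(baseline_tokens * 0.4)   # ~60% reduction keeps 40%
price = 0.005                                 # hypothetical $ per 1K tokens

baseline_cost = monthly_api_cost(baseline_tokens, price)   # $500
cascade_cost = monthly_api_cost(reduced_tokens, price)     # $200
saving = baseline_cost - cascade_cost                      # $300/month
```

Because API billing is linear in tokens, a 60% token reduction translates directly into a 60% cost reduction at any volume or price point.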
Abstract
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model's confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.