Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Problem Statement
Deploying large LLMs incurs high computational and financial costs, yet smaller models often lack the capability to handle complex tasks reliably. Existing approaches either use a single fixed model (wasteful or insufficient) or require complex ensemble/distillation pipelines, leaving a gap for lightweight, dynamic multi-scale routing that works across both local deployments and commercial APIs.
Key Novelty
- Dual-signal confidence estimation that jointly evaluates a model's likelihood of knowing the correct answer AND the probability that its generated response is accurate
- Multi-scale cascading inference pipeline that dynamically delegates uncertain or complex queries to progressively larger models, avoiding unnecessary large-model calls
- Demonstrated applicability to both open-weight multi-scale local inference and commercial API settings (GPT-4o), enabling broad real-world deployment scenarios
Evaluation Highlights
- On the MMLU benchmark, achieves accuracy comparable to the largest model while reducing computational cost by 20%–40% relative to always using the largest model
- When applied to GPT-4o API calls, reduces token usage by approximately 60%, translating directly into cost savings for commercial API deployments
Methodology
- Step 1 – Confidence Assessment: For a given input query, run inference on the smallest candidate model and compute two confidence signals: (a) the model's estimated probability of possessing correct knowledge for the task, and (b) the probability that the specific generated response is accurate.
- Step 2 – Routing Decision: Compare the combined confidence estimate against a threshold; if confidence is sufficiently high, accept the small model's output; otherwise, escalate the query to the next larger model in the cascade.
- Step 3 – Iterative Delegation: Repeat the confidence assessment and routing process at each scale level until either confidence exceeds the threshold or the largest available model is reached, ensuring reliability for hard cases while minimizing average compute.
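The three steps above can be sketched as a single routing loop. This is a minimal illustration, not the paper's implementation: the model objects, the `score_confidence` helper, and the threshold value are all hypothetical placeholders.

```python
# Sketch of the cascading inference pipeline (Steps 1-3).
# Assumes each model exposes generate(), and that score_confidence()
# fuses the two confidence signals into one value in [0, 1].

def cascade_infer(query, models, score_confidence, threshold=0.8):
    """Try models from smallest to largest; accept the first answer
    whose combined confidence clears the threshold. The largest model's
    answer is always accepted, since no further escalation is possible."""
    answer, conf = None, 0.0
    for i, model in enumerate(models):
        answer = model.generate(query)
        # Step 1: dual-signal confidence for this model's output
        conf = score_confidence(model, query, answer)
        # Step 2: accept if confident enough; Step 3: otherwise escalate
        if conf >= threshold or i == len(models) - 1:
            return answer, model.name, conf
    return answer, models[-1].name, conf
```

In practice, `models` would hold clients for progressively larger checkpoints (or API tiers), ordered so that cheap inference is always attempted first.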
System Components
- Knowledge confidence signal (pre-generation): Evaluates whether the model is likely to possess the factual or reasoning knowledge required to answer the query correctly, acting as a prior confidence signal before generation.
- Response confidence signal (post-generation): Assesses the probability that the model's actual generated response is correct, acting as a verification signal to catch overconfident but wrong outputs.
- Model cascade: An ordered sequence of models from smallest to largest; queries are passed up the chain only when confidence thresholds are not met, minimizing average inference cost.
- Confidence threshold: A tunable decision boundary that balances the accuracy-cost trade-off by controlling how aggressively queries are escalated to larger models.
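How the two signals are fused into a single routing score is not specified here; a simple product, treating the pre- and post-generation estimates as roughly independent probabilities, is one plausible choice, shown purely for illustration.

```python
# Hypothetical fusion of the two confidence signals; the product rule
# below is an assumption for illustration, not the paper's method.

def combined_confidence(p_knowledge: float, p_response: float) -> float:
    """Fuse the pre-generation knowledge prior and the post-generation
    response-accuracy estimate into one score in [0, 1]."""
    if not (0.0 <= p_knowledge <= 1.0 and 0.0 <= p_response <= 1.0):
        raise ValueError("signals must be probabilities in [0, 1]")
    return p_knowledge * p_response

def should_escalate(p_knowledge: float, p_response: float,
                    threshold: float = 0.7) -> bool:
    # Escalate to the next larger model when the fused score is low
    return combined_confidence(p_knowledge, p_response) < threshold
```

Raising the threshold escalates more queries (higher accuracy, higher cost); lowering it keeps more queries on the small model, which is exactly the accuracy-cost dial described above.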
Results
| Metric/Benchmark | Baseline (Largest Model Always) | This Paper (Cascade) | Delta |
|---|---|---|---|
| MMLU Accuracy | Largest-model accuracy (reference) | Comparable to largest model | Negligible accuracy loss |
| Computational Cost (Local) | 100% (full large-model inference) | 60%–80% of baseline | -20% to -40% |
| GPT-4o Token Usage (API) | 100% (all queries to GPT-4o) | ~40% of baseline | -60% token reduction |
Key Takeaways
- Practitioners can integrate confidence-based cascading into existing multi-model pipelines with minimal overhead, achieving large-model quality at substantially lower cost—especially valuable for high-volume production workloads.
- The ~60% token reduction on GPT-4o API calls makes this approach highly attractive for cost-sensitive commercial applications, where API billing is a primary operational expense.
- The dual-signal confidence design (knowledge likelihood + response accuracy) is more robust than single-signal approaches; ML engineers should consider both pre- and post-generation confidence when designing routing systems for edge or resource-constrained deployments.
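A back-of-envelope calculation shows what the reported ~60% token reduction means for an API bill. The monthly volume and per-token price below are made-up inputs for illustration only, not figures from the paper.

```python
# Illustrative cost arithmetic for a ~60% token reduction on API calls.
# Volume and price are hypothetical; only the 60% figure comes from the text.

def monthly_api_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Simple linear billing model: cost scales with tokens consumed."""
    return tokens_per_month / 1000 * price_per_1k_tokens

baseline_tokens = 100_000_000                 # hypothetical monthly volume
reduced_tokens = int(baseline_tokens * 0.4)   # ~60% reduction keeps 40%
price = 0.005                                 # hypothetical $ per 1K tokens

baseline_cost = monthly_api_cost(baseline_tokens, price)   # $500
cascade_cost = monthly_api_cost(reduced_tokens, price)     # $200
saving = baseline_cost - cascade_cost                      # $300/month
```

Because API billing is linear in tokens, a 60% token reduction translates directly into a 60% cost reduction at any volume or price point.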
Abstract
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model's confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.