DAJ: Data-Reweighted LLM Judge for Test-Time Scaling in Code Generation
Problem Statement
Test-time scaling via Best-of-N selection depends on reliable LLM judges, but training such judges is hampered by severe distribution shifts: imbalances between easy and hard problems, mismatches between training tasks and target benchmarks, and trajectory mismatch from generating training data with cheaper models. Existing approaches address these shifts with hand-crafted heuristics that are brittle and fail to generalize across benchmarks. A principled, automatic method for handling these distribution shifts is needed to unlock the full potential of LLM judges in test-time compute scaling.
Key Novelty
- First application of data reweighting to LLM-as-a-Judge training for test-time scaling, learning domain-level or instance-level importance weights automatically via a bi-level optimization framework
- Uses a meta set aligned with target benchmarks as the held-out validation signal driving the bi-level learning, enabling generalization-focused judge training without hand-crafted heuristics
- Integration of verifiable rewards into reasoning-based LLM judge training, combining reinforcement-style learning with data reweighting to handle hard problems, in-distribution samples, and trajectory-aligned data jointly
Evaluation Highlights
- DAJ achieves state-of-the-art performance on LiveCodeBench, outperforming strong test-time scaling baselines and leading proprietary models
- DAJ achieves state-of-the-art performance on BigCodeBench, demonstrating generalization across multiple competitive code generation benchmarks
Methodology
- Sample multiple candidate code solutions from a base model for each problem, forming the pool for Best-of-N selection at test time
- Train a reasoning-based LLM judge using verifiable rewards (e.g., execution-based feedback) within a bi-level optimization framework that simultaneously learns domain-level or instance-level data-importance weights by optimizing generalization on a held-out meta set aligned with target benchmarks
- At inference time, apply the trained DAJ judge to score and select the best candidate solution from the sampled pool, effectively scaling test-time compute with a distribution-aware judge
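The bi-level reweighting loop above can be sketched on a toy problem. Everything here (the two synthetic "domains", the one-parameter logistic stand-in for the judge, and the sign-based outer update as a cheap first-order proxy for the meta-gradient) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(slope, n=200):
    """Synthetic 1-D verdict data: label = 1 when slope*x (plus noise) > 0."""
    x = rng.normal(size=n)
    y = (slope * x + 0.1 * rng.normal(size=n) > 0).astype(float)
    return x, y

# Domain 0 is misaligned with the target; domain 1 matches the meta set.
train_domains = [make_domain(-1.0), make_domain(1.0)]
meta_x, meta_y = make_domain(1.0, n=100)

def grad(w, x, y):
    """Mean logistic-loss gradient for the one-parameter toy judge."""
    p = 1.0 / (1.0 + np.exp(-np.clip(w * x, -30, 30)))
    return np.mean((p - y) * x)

log_alpha = np.zeros(2)  # unnormalised domain-importance weights
w = 0.0                  # judge parameter
for _ in range(200):
    alpha = np.exp(log_alpha - log_alpha.max())
    alpha /= alpha.sum()                      # softmax over domains
    # Inner step: update the judge on the reweighted training gradient.
    for d, (x, y) in enumerate(train_domains):
        w -= 0.5 * alpha[d] * grad(w, x, y)
    # Outer step: raise the weight of domains whose gradient agrees in sign
    # with the meta-set gradient (a first-order meta-gradient proxy).
    g_meta = grad(w, meta_x, meta_y)
    for d, (x, y) in enumerate(train_domains):
        log_alpha[d] += 0.5 * np.sign(grad(w, x, y) * g_meta)

alpha = np.exp(log_alpha - log_alpha.max())
alpha /= alpha.sum()
print(alpha.round(3))  # weight concentrates on the aligned domain (index 1)
```

The outer update drives nearly all importance weight onto the domain whose training signal generalizes to the meta set, which is the qualitative behavior the framework is designed to produce.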
System Components
- An outer-inner optimization loop: the inner level updates judge parameters given the current data weights, while the outer level updates data-importance weights (domain- or instance-level) to maximize performance on a held-out meta set
- A held-out validation set aligned with target benchmarks (LiveCodeBench, BigCodeBench) that serves as the generalization signal for learning data-importance weights
- Execution-based or rule-based rewards that provide reliable, ground-truth feedback for training the reasoning LLM judge without relying on noisy human annotations
- A judge model that produces chain-of-thought style reasoning before scoring candidate solutions, enabling more nuanced evaluation of code correctness and quality
- The test-time scaling mechanism that uses the trained DAJ judge to rank the N sampled solutions and return the highest-scoring one
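The Best-of-N selection step can be sketched as follows. The `judge_score` interface and the toy execution-based judge are hypothetical stand-ins for the trained DAJ judge, used only to show the rank-and-select mechanics.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              judge_score: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest."""
    return max(candidates, key=judge_score)

# Toy judge: run the candidate and check one behavior; a real judge would
# instead produce a reasoning-based score.
def toy_judge(code: str) -> float:
    env: dict = {}
    try:
        exec(code, env)                      # defines `add` from the candidate
        return 1.0 if env["add"](2, 3) == 5 else 0.0
    except Exception:
        return -1.0                          # penalise code that fails to run

pool = [
    "def add(a, b):\n    return a - b",      # wrong
    "def add(a, b):\n    return a + b",      # correct
    "def add(a, b):\n    return a * b",      # wrong
]
print(best_of_n(pool, toy_judge))            # selects the correct candidate
```

Because selection only needs a total ordering over candidates, any scoring judge plugs into `best_of_n` unchanged.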
Results
| Benchmark | Best Baseline (proprietary/open) | DAJ | Delta |
|---|---|---|---|
| LiveCodeBench | Strong proprietary model / test-time scaling baseline | State-of-the-art (SOTA) | Outperforms all baselines |
| BigCodeBench | Strong proprietary model / test-time scaling baseline | State-of-the-art (SOTA) | Outperforms all baselines |
Key Takeaways
- Data reweighting is a practical and principled lever for improving LLM judge quality in test-time scaling pipelines — aligning training data distribution with target benchmarks matters more than raw data quantity
- Bi-level optimization with a benchmark-aligned meta set can replace brittle hand-crafted heuristics for handling hard problems and trajectory mismatch in judge training, and is worth adopting when distribution shift is a concern
- Combining verifiable (execution-based) rewards with reasoning-based judge architectures is a strong recipe for reliable code evaluation, and practitioners should prefer this over preference-based or model-graded approaches where ground truth execution feedback is available
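As a concrete illustration of the verifiable-reward signal in the last point, a minimal execution-based checker might look like the sketch below. The `solve` entry-point convention and the (args, expected) test format are assumptions for illustration, not the paper's setup.

```python
def verifiable_reward(candidate_code: str, tests: list) -> float:
    """Reward 1.0 only if the candidate passes every unit test, else 0.0."""
    env: dict = {}
    try:
        exec(candidate_code, env)            # define `solve` from the candidate
        solve = env["solve"]
        return 1.0 if all(solve(*args) == expected
                          for args, expected in tests) else 0.0
    except Exception:
        return 0.0                           # crashes and missing entry points score 0

tests = [((2,), 4), ((5,), 25)]
print(verifiable_reward("def solve(x):\n    return x * x", tests))  # 1.0
print(verifiable_reward("def solve(x):\n    return x + x", tests))  # 0.0 (fails on x=5)
```

Note that the second candidate passes the first test (2 + 2 == 4) but fails the second, which is exactly why an all-tests-pass reward is harder to game than a single spot check.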
Abstract
Test-time scaling for code generation commonly relies on Best-of-N selection, in which multiple candidate solutions are sampled from a base model, and the best one is selected by an LLM judge. However, training reliable LLM judges is challenging due to severe distribution shifts, including imbalances between easy and hard problems, mismatches between training tasks and evaluation benchmarks, and trajectory mismatch arising from training data generated by cheaper models whose behavior differs from that of inference-time models. We propose DAJ, a reasoning-based LLM judge trained with verifiable rewards under a bi-level data-reweighted learning framework. The proposed framework learns data-importance weights (either domain-level or instance-level) to optimize generalization performance on a held-out meta set aligned with target benchmarks. To the best of our knowledge, this is the first application of data reweighting to LLM-as-a-Judge training for test-time scaling. Our approach automatically emphasizes hard problems, in-distribution samples, and trajectory-aligned data, without relying on hand-crafted heuristics. Empirically, DAJ achieves state-of-the-art performance on LiveCodeBench and BigCodeBench, outperforming strong test-time scaling baselines as well as leading proprietary models.