DAJ: Data-Reweighted LLM Judge for Test-Time Scaling in Code Generation
Problem Statement
Test-time scaling via Best-of-N selection depends on reliable LLM judges, but training such judges is hampered by severe distribution shifts: imbalances between easy and hard problems, mismatches between training tasks and target benchmarks, and trajectory mismatch from generating training data with cheaper models. Existing approaches address these shifts with hand-crafted heuristics that are brittle and fail to generalize across benchmarks. A principled, automatic method for handling these distribution shifts is needed to unlock the full potential of LLM judges in test-time compute scaling.
Key Novelty
- First application of data reweighting to LLM-as-a-Judge training for test-time scaling, learning domain-level or instance-level importance weights automatically via a bi-level optimization framework
- Uses a meta set aligned with target benchmarks as the held-out validation signal driving the bi-level learning, enabling generalization-focused judge training without hand-crafted heuristics
- Integration of verifiable rewards into reasoning-based LLM judge training, combining reinforcement-style learning with data reweighting to handle hard problems, in-distribution samples, and trajectory-aligned data jointly
Evaluation Highlights
- DAJ achieves state-of-the-art performance on LiveCodeBench, outperforming strong test-time scaling baselines and leading proprietary models
- DAJ achieves state-of-the-art performance on BigCodeBench, demonstrating generalization across multiple competitive code generation benchmarks
Methodology
- Sample multiple candidate code solutions from a base model for each problem, forming the pool for Best-of-N selection at test time
- Train a reasoning-based LLM judge using verifiable rewards (e.g., execution-based feedback) within a bi-level optimization framework that simultaneously learns domain-level or instance-level data-importance weights by optimizing generalization on a held-out meta set aligned with target benchmarks
- At inference time, apply the trained DAJ judge to score and select the best candidate solution from the sampled pool, effectively scaling test-time compute with a distribution-aware judge
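The bi-level reweighting loop above can be sketched on a toy problem. Everything here (the two synthetic "domains", the one-parameter logistic stand-in for the judge, and the sign-based outer update as a cheap first-order proxy for the meta-gradient) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(slope, n=200):
    """Synthetic 1-D verdict data: label = 1 when slope*x (plus noise) > 0."""
    x = rng.normal(size=n)
    y = (slope * x + 0.1 * rng.normal(size=n) > 0).astype(float)
    return x, y

# Domain 0 is misaligned with the target; domain 1 matches the meta set.
train_domains = [make_domain(-1.0), make_domain(1.0)]
meta_x, meta_y = make_domain(1.0, n=100)

def grad(w, x, y):
    """Mean logistic-loss gradient for the one-parameter toy judge."""
    p = 1.0 / (1.0 + np.exp(-np.clip(w * x, -30, 30)))
    return np.mean((p - y) * x)

log_alpha = np.zeros(2)  # unnormalised domain-importance weights
w = 0.0                  # judge parameter
for _ in range(200):
    alpha = np.exp(log_alpha - log_alpha.max())
    alpha /= alpha.sum()                      # softmax over domains
    # Inner step: update the judge on the reweighted training gradient.
    for d, (x, y) in enumerate(train_domains):
        w -= 0.5 * alpha[d] * grad(w, x, y)
    # Outer step: raise the weight of domains whose gradient agrees in sign
    # with the meta-set gradient (a first-order meta-gradient proxy).
    g_meta = grad(w, meta_x, meta_y)
    for d, (x, y) in enumerate(train_domains):
        log_alpha[d] += 0.5 * np.sign(grad(w, x, y) * g_meta)

alpha = np.exp(log_alpha - log_alpha.max())
alpha /= alpha.sum()
print(alpha.round(3))  # weight concentrates on the aligned domain (index 1)
```

The outer update drives nearly all importance weight onto the domain whose training signal generalizes to the meta set, which is the qualitative behavior the framework is designed to produce.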
System Components
- An outer-inner optimization loop: the inner level updates judge parameters given the current data weights, while the outer level updates data-importance weights (domain- or instance-level) to maximize performance on a held-out meta set
- A held-out validation set aligned with target benchmarks (LiveCodeBench, BigCodeBench) that serves as the generalization signal for learning data-importance weights
- Execution-based or rule-based rewards that provide reliable, ground-truth feedback for training the reasoning LLM judge without relying on noisy human annotations
- A judge model that produces chain-of-thought style reasoning before scoring candidate solutions, enabling more nuanced evaluation of code correctness and quality
- The test-time scaling mechanism that uses the trained DAJ judge to rank the N sampled solutions and return the highest-scoring one
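The Best-of-N selection step can be sketched as follows. The `judge_score` interface and the toy execution-based judge are hypothetical stand-ins for the trained DAJ judge, used only to show the rank-and-select mechanics.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              judge_score: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest."""
    return max(candidates, key=judge_score)

# Toy judge: run the candidate and check one behavior; a real judge would
# instead produce a reasoning-based score.
def toy_judge(code: str) -> float:
    env: dict = {}
    try:
        exec(code, env)                      # defines `add` from the candidate
        return 1.0 if env["add"](2, 3) == 5 else 0.0
    except Exception:
        return -1.0                          # penalise code that fails to run

pool = [
    "def add(a, b):\n    return a - b",      # wrong
    "def add(a, b):\n    return a + b",      # correct
    "def add(a, b):\n    return a * b",      # wrong
]
print(best_of_n(pool, toy_judge))            # selects the correct candidate
```

Because selection only needs a total ordering over candidates, any scoring judge plugs into `best_of_n` unchanged.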
Results
| Benchmark | Best Baseline (proprietary/open) | DAJ | Delta |
|---|---|---|---|
| LiveCodeBench | Strong proprietary model / test-time scaling baseline | State-of-the-art (SOTA) | Outperforms all baselines |
| BigCodeBench | Strong proprietary model / test-time scaling baseline | State-of-the-art (SOTA) | Outperforms all baselines |
Key Takeaways
- Data reweighting is a practical and principled lever for improving LLM judge quality in test-time scaling pipelines — aligning training data distribution with target benchmarks matters more than raw data quantity
- Bi-level optimization with a benchmark-aligned meta set can replace brittle hand-crafted heuristics for handling hard problems and trajectory mismatch in judge training, and is worth adopting when distribution shift is a concern
- Combining verifiable (execution-based) rewards with reasoning-based judge architectures is a strong recipe for reliable code evaluation, and practitioners should prefer this over preference-based or model-graded approaches where ground truth execution feedback is available
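As a concrete illustration of the verifiable-reward signal in the last point, a minimal execution-based checker might look like the sketch below. The `solve` entry-point convention and the (args, expected) test format are assumptions for illustration, not the paper's setup.

```python
def verifiable_reward(candidate_code: str, tests: list) -> float:
    """Reward 1.0 only if the candidate passes every unit test, else 0.0."""
    env: dict = {}
    try:
        exec(candidate_code, env)            # define `solve` from the candidate
        solve = env["solve"]
        return 1.0 if all(solve(*args) == expected
                          for args, expected in tests) else 0.0
    except Exception:
        return 0.0                           # crashes and missing entry points score 0

tests = [((2,), 4), ((5,), 25)]
print(verifiable_reward("def solve(x):\n    return x * x", tests))  # 1.0
print(verifiable_reward("def solve(x):\n    return x + x", tests))  # 0.0 (fails on x=5)
```

Note that the second candidate passes the first test (2 + 2 == 4) but fails the second, which is exactly why an all-tests-pass reward is harder to game than a single spot check.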
Abstract
Test-time scaling for code generation commonly relies on Best-of-N selection, in which multiple candidate solutions are sampled from a base model, and the best one is selected by an LLM judge. However, training reliable LLM judges is challenging due to severe distribution shifts, including imbalances between easy and hard problems, mismatches between training tasks and evaluation benchmarks, and trajectory mismatch arising from training data generated by cheaper models whose behavior differs from that of inference-time models. We propose DAJ, a reasoning-based LLM judge trained with verifiable rewards under a bi-level data-reweighted learning framework. The proposed framework learns data-importance weights (either domain-level or instance-level) to optimize generalization performance on a held-out meta set aligned with target benchmarks. To the best of our knowledge, this is the first application of data reweighting to LLM-as-a-Judge training for test-time scaling. Our approach automatically emphasizes hard problems, in-distribution samples, and trajectory-aligned data, without relying on hand-crafted heuristics. Empirically, DAJ achieves state-of-the-art performance on LiveCodeBench and BigCodeBench, outperforming strong test-time scaling baselines as well as leading proprietary models.