Luna-2: Scalable Single-Token Evaluation with Small Language Models

Vatsal Goel, Rishona Dsouza, Nikhil Ega, Amey Ramesh Rambatla, R. Friel, Shuai Shao, Yash Sheth
2026
Luna-2 is a scalable evaluation architecture that converts LLM-as-a-judge metrics into single-token, deterministic outputs using small language models with lightweight LoRA/PEFT heads, enabling real-time guardrails at a fraction of the cost and latency of frontier LLM evaluators.

Problem Statement

LLM-as-a-judge (LLMAJ) systems are increasingly used for AI safety and quality evaluation but are prohibitively slow, expensive, and non-deterministic due to multi-token generation, making them unsuitable for real-time guardrail applications. Existing alternatives either sacrifice accuracy or lack the flexibility to handle complex, task-specific metrics like hallucination detection, toxicity scoring, and tool selection quality. There is a critical need for evaluation infrastructure that is simultaneously accurate, cost-efficient, low-latency, and privacy-preserving.

Key Novelty

  • Single-token deterministic evaluation: converts complex LLMAJ metrics into a single-token classification output using a decoder-only SLM, eliminating non-determinism and multi-token generation overhead
  • Multi-head LoRA/PEFT architecture: each metric is implemented as a lightweight adapter head on a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU with minimal resource overhead
  • Production-scale deployment paradigm: Luna-2 is designed to run locally alongside AI systems in a privacy-preserving manner, achieving 80x cost reduction and 20x latency reduction while protecting 100M+ AI sessions in real-world deployment
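The single-token idea above can be sketched in a few lines: instead of generating free text, the evaluator reads out the logits of a small fixed set of label tokens and takes an argmax, which is deterministic by construction. The label vocabulary and function names below are illustrative, not taken from the paper.

```python
# Minimal sketch (illustrative names): evaluation as a deterministic
# argmax over a restricted set of single-token labels.

LABEL_TOKENS = ("pass", "fail")  # hypothetical metric-specific label vocabulary

def judge(label_logits: dict) -> str:
    """Return the label whose single-token logit is highest.

    Unlike sampled multi-token generation, argmax over a fixed label set
    always maps the same logits to the same verdict.
    """
    return max(LABEL_TOKENS, key=lambda tok: label_logits[tok])

verdict = judge({"pass": 2.1, "fail": -0.7})  # deterministic: always "pass"
```

Because the output space is a handful of tokens rather than open-ended text, there is nothing to sample and nothing to parse, which is what removes both the non-determinism and most of the generation latency.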

Evaluation Highlights

  • Matches or exceeds state-of-the-art LLM-based evaluators on content safety and hallucination benchmarks while reducing inference cost by over 80x and latency by over 20x
  • In production, processes over 100B tokens per month across 100M+ AI sessions, delivering over $30M in annual cost savings for customers
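As back-of-envelope arithmetic, the headline ratios correspond to the per-request fractions of the frontier-LLMAJ baseline shown in the Results table:

```python
# "Over 80x" cost reduction and "over 20x" latency reduction,
# expressed as fractions of the frontier-LLMAJ baseline.
cost_fraction = 1 / 80       # ~1.25% of baseline cost
latency_fraction = 1 / 20    # ~5% of baseline latency
```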

Breakthrough Assessment

7/10. Luna-2 represents a significant engineering and architectural advance by making real-time, accurate AI guardrails practically viable at scale; the multi-head LoRA approach for concurrent metric evaluation on a shared backbone is elegant and highly practical. However, the core ideas (SLM fine-tuning, LoRA adapters, classification heads) are extensions of established techniques rather than fundamentally new ML concepts.

Methodology

  1. Train a shared decoder-only small language model (SLM) backbone on diverse evaluation tasks, establishing a strong foundational representation for judging AI outputs across multiple safety and quality dimensions
  2. Attach lightweight LoRA/PEFT adapter heads per metric (toxicity, hallucination, tool selection quality, etc.) to the shared backbone, fine-tuning each head on metric-specific labeled data to produce single-token classification outputs
  3. Deploy the multi-head model on a single GPU co-located with the AI system being monitored, routing inference requests to the appropriate metric head for deterministic, low-latency, real-time evaluation without external API calls
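The three steps above can be sketched as a shared feature extractor with per-metric heads. Everything here is a stand-in for illustration: the hash-based `backbone` replaces the real SLM forward pass, and the head weights are placeholders, not the paper's trained adapters.

```python
import hashlib

def backbone(text: str) -> list:
    """Stand-in for the shared decoder-only SLM: maps text to features.
    (A hash keeps the sketch dependency-free; the real backbone is a
    transformer whose forward pass is shared across all metric heads.)"""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:4]]

# One lightweight head per metric, analogous to a LoRA/PEFT adapter.
# Weights are illustrative placeholders.
HEADS = {
    "toxicity":      [0.9, -0.2, 0.1, 0.3],
    "hallucination": [-0.4, 0.8, 0.2, -0.1],
}

def evaluate(text: str, metric: str) -> float:
    features = backbone(text)   # shared computation across all metrics
    weights = HEADS[metric]     # swap in the metric-specific adapter
    return sum(f * w for f, w in zip(features, weights))
```

The design point this illustrates is that adding a new metric costs one small head, not a new model: the expensive backbone computation is shared, which is what lets hundreds of evaluators coexist on one GPU.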

System Components

Shared SLM Backbone

A decoder-only small language model that serves as the shared feature extractor across all evaluation metrics, providing rich contextual representations at low computational cost

LoRA/PEFT Metric Heads

Lightweight task-specific adapter modules attached to the backbone, one per evaluation metric (e.g., toxicity, hallucination, tool selection quality), enabling hundreds of specialized evaluators to coexist on a single GPU

Single-Token Output Layer

Converts the evaluation task into a deterministic single-token classification decision, eliminating multi-token generation variability and dramatically reducing inference latency
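One common way to realize such a layer (an assumption on our part; the paper may implement it differently) is to mask the logits of every token outside the label set before normalizing, so probability mass falls only on valid single-token answers:

```python
import math

def constrained_probs(logits, allowed_ids):
    """Softmax restricted to an allowed token-id set: every other token
    is masked to -inf and therefore receives exactly zero probability."""
    masked = [x if i in allowed_ids else float("-inf")
              for i, x in enumerate(logits)]
    peak = max(x for x in masked if x != float("-inf"))
    exps = [math.exp(x - peak) if x != float("-inf") else 0.0
            for x in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Taking the argmax of the constrained distribution then yields the single-token verdict in one forward pass.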

Local Deployment Runtime

Inference infrastructure designed to run on a single GPU co-located with the production AI system, ensuring privacy preservation, minimized network latency, and cost efficiency
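As a usage sketch, co-location means the guardrail check becomes an in-process function call on the generation path rather than a network hop. `generate` and `score` below are hypothetical stubs standing in for the monitored AI system and a Luna-2 metric head.

```python
def guarded_reply(prompt, generate, score, threshold=0.5):
    """Inline guardrail sketch: score the draft locally before returning.
    Because the evaluator runs on the same host, the only added latency
    is one single-token forward pass, with no external API round-trip."""
    draft = generate(prompt)
    if score(draft) >= threshold:
        return "[response withheld by guardrail]"
    return draft
```

Example: `guarded_reply("hi", lambda p: "hello!", lambda t: 0.1)` returns the draft unchanged, while a score above the threshold blocks it.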

Results

| Metric/Benchmark | Baseline (Frontier LLMAJ) | Luna-2 | Delta |
| --- | --- | --- | --- |
| Inference Cost | Baseline (100%) | ~1.25% of baseline | ~80x reduction |
| Inference Latency | Baseline (100%) | ~5% of baseline | ~20x reduction |
| Content Safety Accuracy | SOTA LLMAJ | Matches or exceeds | Parity or better |
| Hallucination Detection Accuracy | SOTA LLMAJ | Matches or exceeds | Parity or better |
| Annual Cost Savings (Production) | $0 (baseline) | $30M+ saved | +$30M/year |
| Production Scale | N/A | 100M+ sessions, 100B+ tokens/month | Full production deployment |

Key Takeaways

  • Multi-head LoRA on a shared SLM backbone is a highly practical pattern for productionizing many evaluation metrics simultaneously — ML teams building guardrail systems should consider this architecture over separate fine-tuned models or API-based LLMAJ calls
  • Framing evaluation as single-token classification rather than free-text generation is a key design principle that simultaneously solves non-determinism, latency, and cost — applicable broadly to any LLM-based scoring task where the output space is discrete
  • Co-locating evaluation models with the AI system being monitored (rather than routing to external APIs) is both a latency optimization and a privacy advantage — teams operating in regulated or latency-sensitive domains should prioritize local deployment architectures for guardrails

Abstract

Real-time guardrails require evaluation that is accurate, cheap, and fast; yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into deterministic evaluation models that reliably compute complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, tool selection quality) with accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU and to be deployed locally alongside AI systems in a privacy-preserving, latency-optimized manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 protects 100M+ AI sessions and processes over 100B tokens per month for our customers, with evaluation cost savings of over $30M annually.

Generated on 2026-03-02 using Claude