Learning to Reason in 13 Parameters
Problem Statement
Standard LoRA cannot shrink below rank=1, where the parameter count is still tied to the model dimension, leaving a gap in our understanding of the minimum parameter budget needed to instill reasoning. This matters both for understanding the nature of reasoning in LLMs and for extremely resource-constrained fine-tuning scenarios. Existing work assumes that at least rank-1 LoRA is necessary, but this assumption has not been rigorously challenged.
Key Novelty
- TinyLoRA: a novel parameterization that scales low-rank adapters to arbitrarily small sizes, including a single parameter, breaking the conventional LoRA minimum-rank barrier
- Empirical demonstration that reasoning (not just surface-level knowledge) can be elicited with only 13 trained bf16 parameters in an 8B model, achieving 91% on GSM8K
- Systematic finding that RL-based training is qualitatively superior to SFT in extreme low-parameter regimes: SFT requires 100-1000x larger updates to reach the performance RL attains
Evaluation Highlights
- Qwen2.5-8B trained with TinyLoRA + RL achieves 91% on GSM8K with only 13 parameters (26 bytes), recovering ~90% of full fine-tuning performance gains
- Across the AIME, AMC, and MATH500 benchmarks, TinyLoRA recovers ~90% of performance improvements while training 1000x fewer parameters than standard approaches; at the same parameter count, SFT requires 100-1000x larger updates to match RL performance
Methodology
- Design TinyLoRA by decomposing weight updates into a parameterization that can represent rank-1 and sub-rank-1 updates, allowing the number of trainable parameters to be set independently of model dimension
- Apply TinyLoRA to Qwen2.5-8B and train using reinforcement learning (likely GRPO or similar outcome-based RL) on math reasoning benchmarks, optimizing only the TinyLoRA parameters
- Evaluate on GSM8K, AIME, AMC, and MATH500, comparing TinyLoRA+RL against TinyLoRA+SFT and standard LoRA baselines across varying parameter budgets to establish scaling trends
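One way to realize the sub-rank-1 idea above is to freeze random rank-1 directions and train only k mixing scalars, so the trainable count is set by k alone, independent of model dimension. This is an illustrative construction under that assumption, not necessarily the paper's exact parameterization:

```python
import numpy as np

def tiny_lora_delta(theta, U, V):
    """Build a weight update from k trainable scalars.

    theta : (k,) trainable coefficients -- the ONLY learned parameters.
    U     : (k, d_out) fixed random output directions (frozen).
    V     : (k, d_in)  fixed random input directions (frozen).
    Returns a (d_out, d_in) update: sum_i theta_i * outer(U_i, V_i).
    """
    return np.einsum("k,ko,ki->oi", theta, U, V)

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 32, 13                            # 13 trainable params, as in the paper
U = rng.standard_normal((k, d_out)) / np.sqrt(d_out)   # frozen
V = rng.standard_normal((k, d_in)) / np.sqrt(d_in)     # frozen
theta = np.zeros(k)                                    # trainable; zero init => no initial change

delta = tiny_lora_delta(theta, U, V)
assert delta.shape == (d_out, d_in)
assert np.allclose(delta, 0.0)                         # identity at init, like standard LoRA
```

Setting k = 1 gives the single-parameter extreme; since the frozen directions can be regenerated from a random seed, only the k scalars ever need to be stored.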
System Components
- A generalized low-rank adapter parameterization that decouples the number of trainable parameters from model dimension, enabling updates as small as a single scalar parameter
- Reinforcement learning objective (outcome-based reward) used to train TinyLoRA parameters, shown to be far more sample- and parameter-efficient than supervised fine-tuning for reasoning tasks
- Supervised fine-tuning comparison used to isolate the contribution of RL vs. standard next-token prediction training under the same TinyLoRA parameterization
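The summary only says the RL algorithm is "likely GRPO or similar"; assuming a GRPO-style outcome reward, the core step is a group-relative advantage: several answers are sampled per problem, and each answer's reward is normalized against its group before weighting the policy gradient.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: normalize each sampled
    answer's outcome reward against the group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 8 sampled answers to one math problem; reward 1 = correct final answer
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0])
assert abs(adv.sum()) < 1e-6      # advantages are zero-mean within the group
assert adv[0] > 0 > adv[1]        # correct answers pushed up, wrong pushed down
```

Under the TinyLoRA setup, only the handful of adapter scalars would be updated along this advantage-weighted gradient; the 8B base model stays frozen.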
Results
| Benchmark | Baseline (full-parameter training) | TinyLoRA (13 params, RL) | Gap |
|---|---|---|---|
| GSM8K | ~100% of gains | 91% accuracy (≈90% of gains) | ~10% of gains at 1000x fewer params |
| AIME / AMC / MATH500 | Full improvement | ~90% of improvement recovered | ~10% at 1000x compression |
| SFT vs. RL (same param budget) | SFT baseline | RL matches with 100-1000x smaller updates | 100-1000x efficiency advantage for RL |
Key Takeaways
- Reasoning in LLMs is surprisingly compressible: a handful of scalar parameters trained with RL can unlock most of the reasoning improvement, suggesting reasoning is more about activating latent capabilities than learning new knowledge
- RL is not just quantitatively better than SFT in low-parameter regimes — it is qualitatively different, requiring orders of magnitude fewer parameter updates, making it the preferred training paradigm for extreme parameter efficiency
- TinyLoRA is a practical tool for researchers studying the mechanistic basis of fine-tuning and reasoning: it allows controlled ablations at unprecedented parameter granularity and could enable on-device or privacy-preserving personalization with near-zero storage overhead
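The storage claim is simple arithmetic. The rank-1 LoRA comparison below assumes a hypothetical hidden size of 4096 for a single weight matrix, purely for illustration:

```python
# Storage cost of a TinyLoRA checkpoint vs. one rank-1 LoRA adapter.
BYTES_PER_BF16 = 2

tiny_params = 13
tiny_bytes = tiny_params * BYTES_PER_BF16
assert tiny_bytes == 26                  # matches the "26 bytes" claim

d_model = 4096                           # assumed hidden size, for illustration only
rank1_lora_params = 2 * d_model          # A: (d, 1) and B: (1, d)
ratio = rank1_lora_params // tiny_params
print(ratio)                             # -> 630: even one rank-1 adapter is ~630x larger
```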
Abstract
Recent research has shown that language models can learn to \textit{reason}, often via reinforcement learning. Some work even trains low-rank parameterizations for reasoning, but conventional LoRA cannot scale below the model dimension. We question whether even rank=1 LoRA is necessary for learning to reason and propose TinyLoRA, a method for scaling low-rank adapters to sizes as small as one parameter. Within our new parameterization, we are able to train the 8B parameter size of Qwen2.5 to 91\% accuracy on GSM8K with only 13 trained parameters in bf16 (26 total bytes). We find this trend holds in general: we are able to recover 90\% of performance improvements while training $1000\times$ fewer parameters across a suite of more difficult learning-to-reason benchmarks such as AIME, AMC, and MATH500. Notably, we are only able to achieve such strong performance with RL: models trained using SFT require $100$--$1000\times$ larger updates to reach the same performance.