Learning to Reason in 13 Parameters
Problem Statement
Standard LoRA cannot shrink below rank=1, where the parameter count is still tied to the model dimension, leaving a gap in our understanding of the minimum parameter budget needed to instill reasoning. This matters both for understanding the nature of reasoning in LLMs and for extremely resource-constrained fine-tuning scenarios. Existing work assumes that at least rank-1 LoRA is necessary, but this assumption has not been rigorously challenged.
Key Novelty
- TinyLoRA: a novel parameterization that scales low-rank adapters to arbitrarily small sizes, including a single parameter, breaking the conventional LoRA minimum-rank barrier
- Empirical demonstration that reasoning (not just surface-level knowledge) can be elicited with only 13 trained bf16 parameters in an 8B model, achieving 91% on GSM8K
- Systematic finding that RL-based training is qualitatively superior to SFT in extreme low-parameter regimes: SFT requires 100-1000x larger updates to reach the performance RL attains
Evaluation Highlights
- Qwen2.5-8B trained with TinyLoRA + RL achieves 91% on GSM8K with only 13 parameters (26 bytes), recovering ~90% of full fine-tuning performance gains
- Across the AIME, AMC, and MATH500 benchmarks, TinyLoRA recovers ~90% of performance improvements while training 1000x fewer parameters than standard approaches; at the same parameter count, SFT requires 100-1000x larger updates to match RL performance
Methodology
- Design TinyLoRA by decomposing weight updates into a parameterization that can represent rank-1 and sub-rank-1 updates, allowing the number of trainable parameters to be set independently of model dimension
- Apply TinyLoRA to Qwen2.5-8B and train using reinforcement learning (likely GRPO or similar outcome-based RL) on math reasoning benchmarks, optimizing only the TinyLoRA parameters
- Evaluate on GSM8K, AIME, AMC, and MATH500, comparing TinyLoRA+RL against TinyLoRA+SFT and standard LoRA baselines across varying parameter budgets to establish scaling trends
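One way to realize the sub-rank-1 idea above is to freeze random rank-1 directions and train only k mixing scalars, so the trainable count is set by k alone, independent of model dimension. This is an illustrative construction under that assumption, not necessarily the paper's exact parameterization:

```python
import numpy as np

def tiny_lora_delta(theta, U, V):
    """Build a weight update from k trainable scalars.

    theta : (k,) trainable coefficients -- the ONLY learned parameters.
    U     : (k, d_out) fixed random output directions (frozen).
    V     : (k, d_in)  fixed random input directions (frozen).
    Returns a (d_out, d_in) update: sum_i theta_i * outer(U_i, V_i).
    """
    return np.einsum("k,ko,ki->oi", theta, U, V)

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 32, 13                            # 13 trainable params, as in the paper
U = rng.standard_normal((k, d_out)) / np.sqrt(d_out)   # frozen
V = rng.standard_normal((k, d_in)) / np.sqrt(d_in)     # frozen
theta = np.zeros(k)                                    # trainable; zero init => no initial change

delta = tiny_lora_delta(theta, U, V)
assert delta.shape == (d_out, d_in)
assert np.allclose(delta, 0.0)                         # identity at init, like standard LoRA
```

Setting k = 1 gives the single-parameter extreme; since the frozen directions can be regenerated from a random seed, only the k scalars ever need to be stored.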
System Components
- A generalized low-rank adapter parameterization that decouples the number of trainable parameters from model dimension, enabling updates as small as a single scalar parameter
- Reinforcement learning objective (outcome-based reward) used to train TinyLoRA parameters, shown to be far more sample- and parameter-efficient than supervised fine-tuning for reasoning tasks
- Supervised fine-tuning comparison used to isolate the contribution of RL vs. standard next-token prediction training under the same TinyLoRA parameterization
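The summary only says the RL algorithm is "likely GRPO or similar"; assuming a GRPO-style outcome reward, the core step is a group-relative advantage: several answers are sampled per problem, and each answer's reward is normalized against its group before weighting the policy gradient.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: normalize each sampled
    answer's outcome reward against the group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 8 sampled answers to one math problem; reward 1 = correct final answer
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0])
assert abs(adv.sum()) < 1e-6      # advantages are zero-mean within the group
assert adv[0] > 0 > adv[1]        # correct answers pushed up, wrong pushed down
```

Under the TinyLoRA setup, only the handful of adapter scalars would be updated along this advantage-weighted gradient; the 8B base model stays frozen.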
Results
| Benchmark | Baseline (full-parameter training) | TinyLoRA (13 params, RL) | Gap |
|---|---|---|---|
| GSM8K | ~100% of gains | 91% accuracy (≈90% of gains) | ~10% of gains at 1000x fewer params |
| AIME / AMC / MATH500 | Full improvement | ~90% of improvement recovered | ~10% at 1000x compression |
| SFT vs. RL (same param budget) | SFT baseline | RL matches with 100-1000x smaller updates | 100-1000x efficiency advantage for RL |
Key Takeaways
- Reasoning in LLMs is surprisingly compressible: a handful of scalar parameters trained with RL can unlock most of the reasoning improvement, suggesting reasoning is more about activating latent capabilities than learning new knowledge
- RL is not just quantitatively better than SFT in low-parameter regimes — it is qualitatively different, requiring orders of magnitude fewer parameter updates, making it the preferred training paradigm for extreme parameter efficiency
- TinyLoRA is a practical tool for researchers studying the mechanistic basis of fine-tuning and reasoning: it allows controlled ablations at unprecedented parameter granularity and could enable on-device or privacy-preserving personalization with near-zero storage overhead
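The storage claim is simple arithmetic. The rank-1 LoRA comparison below assumes a hypothetical hidden size of 4096 for a single weight matrix, purely for illustration:

```python
# Storage cost of a TinyLoRA checkpoint vs. one rank-1 LoRA adapter.
BYTES_PER_BF16 = 2

tiny_params = 13
tiny_bytes = tiny_params * BYTES_PER_BF16
assert tiny_bytes == 26                  # matches the "26 bytes" claim

d_model = 4096                           # assumed hidden size, for illustration only
rank1_lora_params = 2 * d_model          # A: (d, 1) and B: (1, d)
ratio = rank1_lora_params // tiny_params
print(ratio)                             # -> 630: even one rank-1 adapter is ~630x larger
```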
Abstract
Recent research has shown that language models can learn to \textit{reason}, often via reinforcement learning. Some work even trains low-rank parameterizations for reasoning, but conventional LoRA cannot scale below the model dimension. We question whether even rank=1 LoRA is necessary for learning to reason and propose TinyLoRA, a method for scaling low-rank adapters to sizes as small as one parameter. Within our new parameterization, we are able to train the 8B parameter size of Qwen2.5 to 91\% accuracy on GSM8K with only 13 trained parameters in bf16 (26 total bytes). We find this trend holds in general: we are able to recover 90\% of performance improvements while training $1000\times$ fewer parameters across a suite of more difficult learning-to-reason benchmarks such as AIME, AMC, and MATH500. Notably, we are only able to achieve such strong performance with RL: models trained using SFT require $100$--$1000\times$ larger updates to reach the same performance.