Parameter-efficient fine-tuning of small language models for code generation: a comparative study of Gemma, Qwen 2.5 and Llama 3.2
Problem Statement
Large language models (LLMs) for code generation impose prohibitive computational costs, raise privacy concerns, and are impractical for edge or on-premise deployment. Domain-specific software engineering tasks require capable yet resource-efficient models. Existing small language models (SLMs) without fine-tuning underperform, creating a gap between accessibility and capability.
Key Novelty
- Systematic comparative study of QLoRA fine-tuning across five SLMs (<3B params) from three major AI providers (Google Gemma 2B, Meta Llama 3.2 1B/3B, Alibaba Qwen2.5 1.5B/3B)
- Demonstration that QLoRA-tuned SLMs can surpass larger baseline models (e.g., Phi-3 Mini 4K base) on ROUGE-L for code generation
- Quantified performance gains showing 54% and 55% ROUGE-L improvements over untuned counterparts for Llama 3.2 3B and Qwen2.5 3B, respectively
Evaluation Highlights
- ROUGE-L scores for QLoRA fine-tuned Llama 3.2 3B and Qwen2.5 3B improved by ~54% and ~55%, respectively, over their untuned baselines
- Fine-tuned SLMs outperform the larger Phi-3 Mini 4K base model on ROUGE-L, demonstrating that PEFT can close the size-performance gap
Methodology
- Select five SLMs (<3B parameters) from Google, Meta, and Alibaba; establish untuned baselines on CodeAlpaca-20k
- Apply QLoRA (LoRA adapters + 4-bit quantization) to each model for fine-tuning on CodeAlpaca-20k, reducing memory and compute requirements
- Evaluate all fine-tuned and baseline models using ROUGE-L metric, comparing against untuned SLMs and larger reference models like Phi-3 Mini 4K base
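The core of the recipe above, a trainable low-rank update over a frozen base weight, can be illustrated with a minimal pure-Python sketch. The matrix shapes and values here are hypothetical; real QLoRA training uses libraries such as PEFT and bitsandbytes rather than hand-rolled matrices:

```python
# Illustrative sketch of the LoRA idea behind QLoRA: the frozen base weight W
# is never updated; only the low-rank factors A and B are trained. Symbol
# names (W, A, B, r, alpha) follow the LoRA paper; sizes are toy values.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 8, 2, 4                   # hidden size, LoRA rank, scaling
W = [[0.0] * d for _ in range(d)]       # frozen base weight (4-bit in QLoRA)
A = [[0.01] * d for _ in range(r)]      # trainable down-projection, r x d
B = [[0.0] * r for _ in range(d)]       # trainable up-projection, d x r (zero-init)

def effective_weight():
    # W + (alpha / r) * B @ A — the weight the forward pass actually uses.
    scale = alpha / r
    delta = matmul(B, A)                # d x d low-rank update
    return [[W[i][j] + scale * delta[i][j] for j in range(d)] for i in range(d)]

full_params = d * d                     # parameters if W were trained directly
lora_params = r * d + d * r             # parameters actually trained
```

Because B is zero-initialized, the effective weight equals W before any training, so fine-tuning starts exactly from the pretrained model; the trainable-parameter count scales with `2 * r * d` instead of `d * d`, which is where the efficiency gain comes from at realistic hidden sizes.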
System Components
- LoRA adapters: injects trainable low-rank matrices into transformer layers, drastically reducing the number of trainable parameters while preserving model capacity
- 4-bit quantization: quantizes the frozen base model weights to 4-bit precision, enabling fine-tuning of larger models on consumer-grade or edge hardware
- CodeAlpaca-20k: a 20,000-sample instruction-following dataset for code generation, used for both fine-tuning and evaluation
- ROUGE-L: a longest-common-subsequence-based metric that measures lexical overlap between generated and reference code, serving as a proxy for output similarity
- SLM suite: the five small language models evaluated (Gemma 2B, Llama 3.2 1B/3B, Qwen2.5 1.5B/3B), representing diverse architectures and training regimes from major AI providers
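To make the evaluation metric concrete, ROUGE-L can be sketched as an LCS-based F-measure over whitespace tokens. This is a hand-rolled illustration only; actual evaluations typically use a library implementation such as the `rouge-score` package:

```python
# Minimal sketch of ROUGE-L: F-measure over the longest common
# subsequence of tokens between a candidate and a reference string.

def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)   # F1 variant (beta = 1)

score = rouge_l("def add(a, b): return a + b",
                "def add(x, y): return x + y")
```

Note the limitation this exposes: two functionally identical snippets with different identifiers score well below 1.0, which is why ROUGE-L is a lexical-overlap proxy rather than a direct measure of code correctness.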
Results
| Model | Untuned ROUGE-L | QLoRA Fine-tuned ROUGE-L | Improvement |
|---|---|---|---|
| Llama 3.2 3B | Baseline | +54% over baseline | ~54% |
| Qwen2.5 3B | Baseline | +55% over baseline | ~55% |
| Gemma 2B | Baseline | Improved (magnitude not specified) | N/A |
| Llama 3.2 1B / Qwen2.5 1.5B | Baseline | Improved (magnitude not specified) | N/A |

Additionally, the fine-tuned SLMs exceed the larger Phi-3 Mini 4K base model on ROUGE-L.
Key Takeaways
- QLoRA is a highly practical recipe for adapting sub-3B models to domain-specific code generation tasks with minimal hardware requirements, making it suitable for edge deployment and privacy-sensitive environments
- Model size is not the primary determinant of code generation quality after fine-tuning — a well-tuned 3B model can outperform a larger untuned baseline, suggesting practitioners should prioritize fine-tuning over scaling
- Qwen2.5 3B and Llama 3.2 3B show the strongest response to QLoRA fine-tuning (~54-55% ROUGE-L gain), making them the recommended SLM choices for code generation tasks under resource constraints
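The memory savings underpinning the first takeaway come from quantizing the frozen base weights to 4 bits. A hedged sketch of simple absmax 4-bit quantization follows; QLoRA itself uses the NF4 data type with double quantization, which this linear version only approximates:

```python
# Illustrative absmax 4-bit quantization: map floats onto 16 signed
# integer levels [-8, 7], storing one float scale per weight group.
# Values in `w` are made-up example weights, not real model parameters.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction used in the forward pass.
    return [v * scale for v in q]

w = [0.21, -0.53, 0.07, 0.98, -0.14]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale / 2
```

Each weight shrinks from 32 (or 16) bits to 4, which is what lets sub-3B models be fine-tuned on consumer-grade hardware; the rounding error is bounded by half the quantization step, and LoRA's trained adapters remain in higher precision to compensate.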
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in code generation; however, their high computational demands, privacy limitations, and challenges in edge deployment restrict their practical use in domain-specific applications. This study explores the effectiveness of parameter-efficient fine-tuning for small language models (SLMs) with fewer than 3 billion parameters. We adopt a hybrid approach that combines low-rank adaptation (LoRA) and 4-bit quantization (QLoRA) to reduce fine-tuning costs while preserving semantic consistency. Experiments on the CodeAlpaca-20k dataset reveal that SLMs fine-tuned with this method outperform larger baseline models, including Phi-3 Mini 4K base, on ROUGE-L. Notably, applying our approach to the Llama 3.2 3B and Qwen2.5 3B models yielded performance improvements of 54% and 55%, respectively, over untuned counterparts. We evaluate models developed by major artificial intelligence (AI) providers, namely Google (Gemma 2B), Meta (Llama 3.2 1B/3B), and Alibaba (Qwen2.5 1.5B/3B), and show that parameter-efficient fine-tuning enables them to serve as cost-effective, high-performing alternatives to larger LLMs. These findings highlight the potential of SLMs as scalable solutions for domain-specific software engineering tasks, supporting broader adoption and democratization of neural code synthesis.