Parameter-efficient fine-tuning of small language models for code generation: a comparative study of Gemma, Qwen 2.5 and Llama 3.2
Problem Statement
Large language models (LLMs) for code generation impose prohibitive computational costs, raise privacy concerns, and are impractical for edge or on-premise deployment. Domain-specific software engineering tasks require capable yet resource-efficient models. Existing small language models (SLMs) without fine-tuning underperform, creating a gap between accessibility and capability.
Key Novelty
- Systematic comparative study of QLoRA fine-tuning across five SLMs (<3B params) from three major AI providers (Google Gemma 2B, Meta Llama 3.2 1B/3B, Alibaba Qwen2.5 1.5B/3B)
- Demonstration that QLoRA-tuned SLMs can surpass larger baseline models (e.g., Phi-3 Mini 4K base) on ROUGE-L for code generation
- Quantified performance gains showing 54% and 55% ROUGE-L improvements over untuned counterparts for Llama 3.2 3B and Qwen2.5 3B, respectively
Evaluation Highlights
- ROUGE-L scores for QLoRA fine-tuned Llama 3.2 3B and Qwen2.5 3B improved by ~54% and ~55%, respectively, over their untuned baselines
- Fine-tuned SLMs outperform the larger Phi-3 Mini 4K base model on ROUGE-L, demonstrating that PEFT can close the size-performance gap
Methodology
- Select five SLMs (<3B parameters) from Google, Meta, and Alibaba; establish untuned baselines on CodeAlpaca-20k
- Apply QLoRA (LoRA adapters + 4-bit quantization) to each model for fine-tuning on CodeAlpaca-20k, reducing memory and compute requirements
- Evaluate all fine-tuned and baseline models using ROUGE-L metric, comparing against untuned SLMs and larger reference models like Phi-3 Mini 4K base
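The core of the recipe above, a trainable low-rank update over a frozen base weight, can be illustrated with a minimal pure-Python sketch. The matrix shapes and values here are hypothetical; real QLoRA training uses libraries such as PEFT and bitsandbytes rather than hand-rolled matrices:

```python
# Illustrative sketch of the LoRA idea behind QLoRA: the frozen base weight W
# is never updated; only the low-rank factors A and B are trained. Symbol
# names (W, A, B, r, alpha) follow the LoRA paper; sizes are toy values.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 8, 2, 4                   # hidden size, LoRA rank, scaling
W = [[0.0] * d for _ in range(d)]       # frozen base weight (4-bit in QLoRA)
A = [[0.01] * d for _ in range(r)]      # trainable down-projection, r x d
B = [[0.0] * r for _ in range(d)]       # trainable up-projection, d x r (zero-init)

def effective_weight():
    # W + (alpha / r) * B @ A — the weight the forward pass actually uses.
    scale = alpha / r
    delta = matmul(B, A)                # d x d low-rank update
    return [[W[i][j] + scale * delta[i][j] for j in range(d)] for i in range(d)]

full_params = d * d                     # parameters if W were trained directly
lora_params = r * d + d * r             # parameters actually trained
```

Because B is zero-initialized, the effective weight equals W before any training, so fine-tuning starts exactly from the pretrained model; the trainable-parameter count scales with `2 * r * d` instead of `d * d`, which is where the efficiency gain comes from at realistic hidden sizes.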
System Components
- LoRA adapters: injects trainable low-rank matrices into transformer layers, drastically reducing the number of trainable parameters while preserving model capacity
- 4-bit quantization: quantizes the frozen base model weights to 4-bit precision, enabling fine-tuning of larger models on consumer-grade or edge hardware
- CodeAlpaca-20k: a 20,000-sample instruction-following dataset for code generation, used for both fine-tuning and evaluation
- ROUGE-L: a longest-common-subsequence-based metric that measures lexical overlap between generated and reference code, serving as a proxy for output similarity
- SLM suite: the five small language models evaluated (Gemma 2B, Llama 3.2 1B/3B, Qwen2.5 1.5B/3B), representing diverse architectures and training regimes from major AI providers
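To make the evaluation metric concrete, ROUGE-L can be sketched as an LCS-based F-measure over whitespace tokens. This is a hand-rolled illustration only; actual evaluations typically use a library implementation such as the `rouge-score` package:

```python
# Minimal sketch of ROUGE-L: F-measure over the longest common
# subsequence of tokens between a candidate and a reference string.

def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)   # F1 variant (beta = 1)

score = rouge_l("def add(a, b): return a + b",
                "def add(x, y): return x + y")
```

Note the limitation this exposes: two functionally identical snippets with different identifiers score well below 1.0, which is why ROUGE-L is a lexical-overlap proxy rather than a direct measure of code correctness.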
Results
| Model | Untuned ROUGE-L | QLoRA Fine-tuned ROUGE-L | Improvement |
|---|---|---|---|
| Llama 3.2 3B | Baseline | +54% over baseline | ~54% |
| Qwen2.5 3B | Baseline | +55% over baseline | ~55% |
| Gemma 2B | Baseline | Improved (magnitude not specified) | N/A |
| Llama 3.2 1B / Qwen2.5 1.5B | Baseline | Improved (magnitude not specified) | N/A |

Additionally, the fine-tuned SLMs exceed the larger Phi-3 Mini 4K base model on ROUGE-L.
Key Takeaways
- QLoRA is a highly practical recipe for adapting sub-3B models to domain-specific code generation tasks with minimal hardware requirements, making it suitable for edge deployment and privacy-sensitive environments
- Model size is not the primary determinant of code generation quality after fine-tuning — a well-tuned 3B model can outperform a larger untuned baseline, suggesting practitioners should prioritize fine-tuning over scaling
- Qwen2.5 3B and Llama 3.2 3B show the strongest response to QLoRA fine-tuning (~54-55% ROUGE-L gain), making them the recommended SLM choices for code generation tasks under resource constraints
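The memory savings underpinning the first takeaway come from quantizing the frozen base weights to 4 bits. A hedged sketch of simple absmax 4-bit quantization follows; QLoRA itself uses the NF4 data type with double quantization, which this linear version only approximates:

```python
# Illustrative absmax 4-bit quantization: map floats onto 16 signed
# integer levels [-8, 7], storing one float scale per weight group.
# Values in `w` are made-up example weights, not real model parameters.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction used in the forward pass.
    return [v * scale for v in q]

w = [0.21, -0.53, 0.07, 0.98, -0.14]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale / 2
```

Each weight shrinks from 32 (or 16) bits to 4, which is what lets sub-3B models be fine-tuned on consumer-grade hardware; the rounding error is bounded by half the quantization step, and LoRA's trained adapters remain in higher precision to compensate.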
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in code generation; however, their high computational demands, privacy limitations, and challenges in edge deployment restrict their practical use in domain-specific applications. This study explores the effectiveness of parameter-efficient fine-tuning for small language models (SLMs) with fewer than 3 billion parameters. We adopt a hybrid approach that combines low-rank adaptation (LoRA) and 4-bit quantization (QLoRA) to reduce fine-tuning costs while preserving semantic consistency. Experiments on the CodeAlpaca-20k dataset reveal that SLMs fine-tuned with this method outperform larger baseline models, including Phi-3 Mini 4K base, on ROUGE-L. Notably, applying our approach to the Llama 3.2 3B and Qwen2.5 3B models yielded performance improvements of 54% and 55%, respectively, over untuned counterparts. We evaluate models developed by major artificial intelligence (AI) providers, namely Google (Gemma 2B), Meta (Llama 3.2 1B/3B), and Alibaba (Qwen2.5 1.5B/3B), and show that parameter-efficient fine-tuning enables them to serve as cost-effective, high-performing alternatives to larger LLMs. These findings highlight the potential of SLMs as scalable solutions for domain-specific software engineering tasks, supporting broader adoption and democratization of neural code synthesis.