
Parameter-efficient fine-tuning of small language models for code generation: a comparative study of Gemma, Qwen 2.5 and Llama 3.2

Van-Viet Nguyen, The-Vinh Nguyen, Huu-Khanh Nguyen, Duc-Quang Vu
International Journal of Electrical and Computer Engineering (IJECE) | 2026
Parameter-efficient fine-tuning via QLoRA (LoRA + 4-bit quantization) enables small language models (<3B parameters) to achieve competitive or superior code generation performance compared to larger baseline models. This study benchmarks this approach across models from Google, Meta, and Alibaba on the CodeAlpaca-20k dataset.

Problem Statement

Large LLMs for code generation impose prohibitive computational costs, raise privacy concerns, and are impractical for edge or on-premise deployment. Domain-specific software engineering tasks require capable yet resource-efficient models. Existing SLMs without fine-tuning underperform, creating a gap between accessibility and capability.

Key Novelty

  • Systematic comparative study of QLoRA fine-tuning across five SLMs (<3B params) from three major AI providers (Google Gemma 2B, Meta Llama 3.2 1B/3B, Alibaba Qwen2.5 1.5B/3B)
  • Demonstration that QLoRA-tuned SLMs can surpass larger baseline models (e.g., Phi-3 Mini 4K base) on ROUGE-L for code generation
  • Quantified gains of ~54-55% ROUGE-L improvement over untuned counterparts for Llama 3.2 3B and Qwen2.5 3B

Evaluation Highlights

  • ROUGE-L scores for QLoRA fine-tuned Llama 3.2 3B and Qwen2.5 3B improved by ~54% and ~55%, respectively, over their untuned baselines
  • Fine-tuned SLMs outperform the larger Phi-3 Mini 4K base model on ROUGE-L, demonstrating that PEFT can close the size-performance gap

Breakthrough Assessment

4/10 The paper is a solid empirical contribution and comparative benchmark for practitioners, but applies well-established QLoRA techniques to existing models without introducing new algorithms or architectures. Its value is primarily practical validation and model selection guidance rather than methodological novelty.

Methodology

  1. Select five SLMs (<3B parameters) from Google, Meta, and Alibaba; establish untuned baselines on CodeAlpaca-20k
  2. Apply QLoRA (LoRA adapters + 4-bit quantization) to each model for fine-tuning on CodeAlpaca-20k, reducing memory and compute requirements
  3. Evaluate all fine-tuned and baseline models using ROUGE-L metric, comparing against untuned SLMs and larger reference models like Phi-3 Mini 4K base
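The QLoRA setup described in steps 1-2 can be sketched with Hugging Face transformers, peft, and bitsandbytes. This is an illustrative configuration, not the paper's reported code: the model name, LoRA rank, and target modules are assumptions chosen for the sketch.

```python
# Hedged QLoRA configuration sketch (assumed hyperparameters, not the paper's).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type used by QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",                      # any of the five SLMs could go here
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # shows the tiny trainable fraction
```

Training itself would then proceed with a standard causal-LM trainer over the CodeAlpaca-20k instruction/response pairs.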

System Components

LoRA (Low-Rank Adaptation)

Injects trainable low-rank matrices into transformer layers, drastically reducing the number of trainable parameters while preserving model capacity
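The low-rank idea can be made concrete with a minimal numpy sketch (illustrative, not the paper's code): the frozen weight W is left untouched, and only two small factors B and A are trained, with their scaled product added to the output. The layer size and rank below are hypothetical.

```python
import numpy as np

d_out, d_in, r = 2048, 2048, 16   # hypothetical layer size and LoRA rank
alpha = 32                        # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init
                                           # so delta W = B @ A starts at 0

def lora_forward(x):
    """Frozen path plus scaled low-rank update: (W + (alpha/r) B A) x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params:,} vs full {full_params:,} "
      f"({100 * lora_params / full_params:.1f}%)")
```

With rank 16 on a 2048x2048 layer, the trainable factors hold about 1.6% of the full matrix's parameters, which is why LoRA cuts fine-tuning memory so sharply.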

4-bit Quantization (QLoRA)

Quantizes the frozen base model weights to 4-bit precision, enabling fine-tuning of larger models on consumer-grade or edge hardware
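A toy example of the underlying idea, assuming simple symmetric absmax quantization (QLoRA actually uses the more involved block-wise NF4 data type, so this is only a conceptual sketch):

```python
import numpy as np

def quantize_4bit(w):
    """Map floats to signed 4-bit integers in [-7, 7] via absmax scaling."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate floats from the 4-bit codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.3f} (scale={scale:.3f})")
```

Each weight now needs 4 bits instead of 32, an 8x memory reduction on the frozen base model, at the cost of a bounded rounding error per weight.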

CodeAlpaca-20k Dataset

A 20,000-sample instruction-following dataset for code generation used for fine-tuning and evaluation

ROUGE-L Metric

Longest-common-subsequence-based evaluation metric used to measure surface-level (lexical) similarity between generated and reference code; it rewards matching token sequences rather than guaranteeing semantic equivalence
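A simplified pure-Python sketch of the metric over whitespace tokens (real implementations typically tokenize more carefully and aggregate over a corpus; the beta value below follows the common ROUGE convention):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score over whitespace tokens (simplified sketch)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)

print(rouge_l("def add(a, b): return a + b", "def add(a, b): return a + b"))  # 1.0
```

Because the score is driven by the longest shared token subsequence, an exact match scores 1.0 while partially overlapping code scores proportionally lower.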

SLM Zoo (Gemma 2B, Llama 3.2 1B/3B, Qwen2.5 1.5B/3B)

The five small language models evaluated, representing diverse architectures and training regimes from major AI providers

Results

| Model | Untuned ROUGE-L | QLoRA fine-tuned ROUGE-L | Improvement |
| --- | --- | --- | --- |
| Llama 3.2 3B | Baseline | +54% over baseline | ~54% |
| Qwen2.5 3B | Baseline | +55% over baseline | ~55% |
| Fine-tuned SLMs vs. Phi-3 Mini 4K base | Phi-3 Mini 4K base (larger) | Fine-tuned SLMs exceed it | Positive |
| Gemma 2B | Baseline | Improved (magnitude not specified) | N/A |
| Llama 3.2 1B / Qwen2.5 1.5B | Baseline | Improved (magnitude not specified) | N/A |

Key Takeaways

  • QLoRA is a highly practical recipe for adapting sub-3B models to domain-specific code generation tasks with minimal hardware requirements, making it suitable for edge deployment and privacy-sensitive environments
  • Model size is not the primary determinant of code generation quality after fine-tuning — a well-tuned 3B model can outperform a larger untuned baseline, suggesting practitioners should prioritize fine-tuning over scaling
  • Qwen2.5 3B and Llama 3.2 3B show the strongest response to QLoRA fine-tuning (~54-55% ROUGE-L gain), making them the recommended SLM choices for code generation tasks under resource constraints

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in code generation; however, their high computational demands, privacy limitations, and challenges in edge deployment restrict their practical use in domain-specific applications. This study explores the effectiveness of parameter-efficient fine-tuning for small language models (SLMs) with fewer than 3 billion parameters. We adopt a hybrid approach that combines low-rank adaptation (LoRA) and 4-bit quantization (QLoRA) to reduce fine-tuning costs while preserving semantic consistency. Experiments on the CodeAlpaca-20k dataset reveal that SLMs fine-tuned with this method outperform larger baseline models, including Phi-3 Mini 4K base, in ROUGE-L. Notably, applying our approach to the LLaMA 3 3B and Qwen2.5 3B models yielded performance improvements of 54% and 55%, respectively, over untuned counterparts. We evaluate models developed by major artificial intelligence (AI) providers: Google (Gemma 2B), Meta (LLaMA 3 1B/3B), and Alibaba (Qwen2.5 1.5B/3B), and show that parameter-efficient fine-tuning enables them to serve as cost-effective, high-performing alternatives to larger LLMs. These findings highlight the potential of SLMs as scalable solutions for domain-specific software engineering tasks, supporting broader adoption and democratization of neural code synthesis.

Generated on 2026-03-02 using Claude