Residual Feature Enhancement for Large Language Models: Methodology and Applications
Problem Statement
Large language models still struggle with complex logical reasoning despite their general NLP capabilities. Existing remedies—prompting, RAG, and PEFT adapters—each carry trade-offs: prompt sensitivity, lossy semantic compression, or non-trivial computational overhead. There is a need for a lightweight, architecture-agnostic enhancement that specifically strengthens reasoning fidelity without disrupting pre-trained representations.
Key Novelty
- Introduction of the RFE module: a post-attention residual block that combines a dimension-preserving linear projection, a SwiGLU gated nonlinearity, and a skip connection to enrich hidden representations without modifying the backbone
- Demonstration that a compact 9B-parameter model (ChatGLM4-9B+RFE) can surpass both adapter-based fine-tuning and larger open-source baselines (Llama3.1-8B, Qwen1.5-MoE-A2.7B, DeepSeek distilled models) across six heterogeneous reasoning tasks
- Comprehensive ablation and convergence analysis showing that removing RFE degrades task performance by up to 3.84 percentage points, and that RFE improves training stability and convergence speed relative to the base model and adapter alternatives
Evaluation Highlights
- ChatGLM4-9B+RFE achieves 95.68% on GSM8K (math reasoning), 82.00% on ReClor, and 79.74% on LogiQA2.0, consistently outperforming the Adapter baseline by 0.52–5.57 pp across all six benchmarks
- Ablation studies confirm removing RFE degrades performance by up to 3.84 percentage points, and convergence curves show faster, more stable training loss reduction compared to adapter-only fine-tuning
Methodology
- Insert RFE modules after the multi-head attention output in each transformer layer (or a selected subset of layers) of a frozen or lightly fine-tuned LLM backbone, preserving the original hidden dimension throughout
- Within each RFE block, apply a dimension-preserving linear transformation followed by SwiGLU gated activation to introduce expressive nonlinear feature mixing, then add the result back via a residual connection to the attention output
- Fine-tune the RFE parameters (and optionally a small set of backbone parameters) on task-specific data across reasoning benchmarks; evaluate against adapter baselines and larger open-source models, with ablations toggling RFE on/off to isolate contribution
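The block described above can be sketched as follows. This is an illustrative NumPy reconstruction based only on the stated design (dimension-preserving projection, SwiGLU gate, residual add), not the authors' code; the weight names, initialization scale, and shapes are assumptions.

```python
import numpy as np

def silu(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class RFEBlock:
    """Sketch of a Residual Feature Enhancement block (illustrative).

    All three weight matrices are square (d_model x d_model), so the
    hidden dimension is preserved end to end.
    """
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.02  # assumed small init scale
        self.W_proj = rng.normal(0, s, (d_model, d_model))  # Swish branch
        self.W_gate = rng.normal(0, s, (d_model, d_model))  # linear gate branch
        self.W_out = rng.normal(0, s, (d_model, d_model))   # output mixing

    def __call__(self, attn_out):
        # SwiGLU: Swish(x W_proj) elementwise-multiplied by (x W_gate)
        h = silu(attn_out @ self.W_proj) * (attn_out @ self.W_gate)
        # Residual add keeps the pre-trained representation intact
        return attn_out + h @ self.W_out

x = np.random.default_rng(1).normal(size=(2, 8, 64))  # (batch, seq, hidden)
rfe = RFEBlock(64)
y = rfe(x)
assert y.shape == x.shape  # hidden dimension is preserved
```

Because the output path ends in a residual add, zero-initializing `W_out` would make the block start as an identity function, which is consistent with the paper's goal of not disrupting pre-trained representations at the start of fine-tuning.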
System Components
- Dimension-preserving linear projection: a weight matrix that projects the attention output into an equally sized latent space, enabling feature re-weighting without changing tensor shapes or requiring dimensional bottlenecks
- SwiGLU gated activation: a gated activation function (Swish × linear gate) that selectively amplifies relevant features and suppresses noise, providing richer nonlinearity than standard ReLU or GELU activations
- Residual connection: adds the RFE transformation output directly back to the original attention output, preserving pre-trained knowledge and enabling stable gradient flow during fine-tuning
- LLM backbone: the base LLM onto which RFE modules are attached; its weights serve as the pre-trained foundation, with RFE providing targeted reasoning enhancement at minimal parameter cost
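How these components compose within a single transformer layer can be sketched as below. The `attention` and `ffn` callables stand in for frozen backbone sublayers and are hypothetical stubs; the sublayer ordering is an assumption based on the text's "post-attention" placement.

```python
import numpy as np

def layer_forward(x, attention, rfe, ffn):
    """Illustrative placement of RFE inside one transformer layer."""
    attn_out = x + attention(x)      # frozen multi-head attention sublayer
    enhanced = rfe(attn_out)         # RFE enriches the attention output
    return enhanced + ffn(enhanced)  # frozen feed-forward sublayer

# Identity/zero stubs just to show that shapes flow through unchanged.
x = np.ones((2, 4, 16))
y = layer_forward(
    x,
    attention=lambda t: np.zeros_like(t),  # stub attention sublayer
    rfe=lambda t: t,                       # stub RFE (identity)
    ffn=lambda t: np.zeros_like(t),        # stub feed-forward sublayer
)
assert y.shape == x.shape
```

During fine-tuning, only the RFE parameters (and optionally a small set of backbone parameters, per the methodology above) would receive gradient updates.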
Results
| Benchmark | Adapter Baseline | ChatGLM4-9B+RFE | Delta |
|---|---|---|---|
| LogiQA | 67.68% | 68.20% | +0.52 pp |
| ReClor | 81.15% | 82.00% | +0.85 pp |
| LogiQA2.0 | 78.52% | 79.74% | +1.22 pp |
| GSM8K | 94.47% | 95.68% | +1.21 pp |
| HellaSwag | 66.85% | 72.42% | +5.57 pp |
| MBPP | 55.02% | 56.82% | +1.80 pp |
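As a quick sanity check, the per-benchmark deltas can be recomputed directly from the scores in the table above (values copied verbatim; all are accuracies in percent):

```python
# Adapter baseline vs. ChatGLM4-9B+RFE scores from the results table.
adapter = {"LogiQA": 67.68, "ReClor": 81.15, "LogiQA2.0": 78.52,
           "GSM8K": 94.47, "HellaSwag": 66.85, "MBPP": 55.02}
with_rfe = {"LogiQA": 68.20, "ReClor": 82.00, "LogiQA2.0": 79.74,
            "GSM8K": 95.68, "HellaSwag": 72.42, "MBPP": 56.82}

# Percentage-point improvement per benchmark
deltas = {k: round(with_rfe[k] - adapter[k], 2) for k in adapter}
avg_gain = round(sum(deltas.values()) / len(deltas), 2)
print(deltas)    # HellaSwag shows the largest gain (+5.57 pp)
print(avg_gain)  # mean gain across the six benchmarks: 1.86 pp
```

The mean gain of roughly 1.86 pp is pulled up substantially by HellaSwag; the remaining five benchmarks improve by 0.52 to 1.80 pp each.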
Key Takeaways
- RFE offers a practical plug-in strategy for practitioners who want to boost reasoning on smaller models without scaling up: attaching a lightweight residual gated-MLP block after attention layers can close part of the gap to larger models at low computational cost
- The HellaSwag result (+5.57 pp over adapter) suggests that commonsense/contextual inference tasks benefit disproportionately from richer attention-level feature mixing, making RFE especially worth trying for commonsense or multi-step reasoning applications
- Convergence analysis showing faster, more stable training implies RFE may reduce fine-tuning compute budgets in practice—a useful property when iterating quickly on new domains or tasks with limited data
Abstract
Large language models (LLMs) have achieved remarkable progress in natural language processing, yet their ability to perform complex logical reasoning remains limited. Existing approaches such as prompting, retrieval-augmented generation, and parameter-efficient fine-tuning (PEFT) provide partial improvements but often suffer from prompt sensitivity, semantic compression, or additional computational cost. In this work, we propose a Residual Feature Enhancement (RFE) module, a lightweight architectural component designed to strengthen reasoning ability while maintaining computational efficiency. RFE integrates a dimension-preserving linear transformation, SwiGLU nonlinear activation, and residual connections to enrich attention outputs without altering the backbone structure. We conducted comprehensive experiments across six reasoning and comprehension benchmarks—LogiQA, ReClor, LogiQA2.0, GSM8K, HellaSwag, and MBPP—covering deductive reasoning, standardized test comprehension, commonsense inference, and program synthesis. Results demonstrate that ChatGLM4-9B augmented with RFE consistently achieves superior performance compared with both adapter-based methods and larger-scale baselines. Specifically, ChatGLM4-9B+RFE attains 68.20% on LogiQA, 82.00% on ReClor, 79.74% on LogiQA2.0, 95.68% on GSM8K, 72.42% on HellaSwag, and 56.82% on MBPP, all of which surpass the Adapter mechanism (67.68%, 81.15%, 78.52%, 94.47%, 66.85%, 55.02%) and show clear advantages over open-source baselines such as Qwen1.5-MoE-A2.7B, Llama3.1-8B, and DeepSeek distilled models. Ablation studies further confirm that removing RFE leads to performance degradation of up to 3.84 percentage points, and convergence analysis shows improved stability and faster training.