Residual Feature Enhancement for Large Language Models: Methodology and Applications
Problem Statement
Large language models still struggle with complex logical reasoning despite their general NLP capabilities. Existing remedies—prompting, RAG, and PEFT adapters—each carry trade-offs: prompt sensitivity, lossy semantic compression, or non-trivial computational overhead. There is a need for a lightweight, architecture-agnostic enhancement that specifically strengthens reasoning fidelity without disrupting pre-trained representations.
Key Novelty
- Introduction of the RFE module: a post-attention residual block that combines a dimension-preserving linear projection, a SwiGLU gated nonlinearity, and a skip connection to enrich hidden representations without modifying the backbone
- Demonstration that a compact 9B-parameter model (ChatGLM4-9B+RFE) can surpass both adapter-based fine-tuning and larger open-source baselines (Llama3.1-8B, Qwen1.5-MoE-A2.7B, DeepSeek distilled models) across six heterogeneous reasoning tasks
- Comprehensive ablation and convergence analysis showing that removing RFE degrades task performance by up to 3.84 percentage points, and that RFE improves training stability and convergence speed relative to the base model and adapter alternatives
Evaluation Highlights
- ChatGLM4-9B+RFE achieves 95.68% on GSM8K (math reasoning), 82.00% on ReClor, and 79.74% on LogiQA2.0, consistently outperforming the Adapter baseline by 0.52–5.57 pp across all six benchmarks
- Ablation studies confirm removing RFE degrades performance by up to 3.84 percentage points, and convergence curves show faster, more stable training loss reduction compared to adapter-only fine-tuning
Methodology
- Insert RFE modules after the multi-head attention output in each transformer layer (or a selected subset of layers) of a frozen or lightly fine-tuned LLM backbone, preserving the original hidden dimension throughout
- Within each RFE block, apply a dimension-preserving linear transformation followed by SwiGLU gated activation to introduce expressive nonlinear feature mixing, then add the result back via a residual connection to the attention output
- Fine-tune the RFE parameters (and optionally a small set of backbone parameters) on task-specific data across reasoning benchmarks; evaluate against adapter baselines and larger open-source models, with ablations toggling RFE on/off to isolate contribution
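The block described above can be sketched as follows. This is an illustrative NumPy reconstruction based only on the stated design (dimension-preserving projection, SwiGLU gate, residual add), not the authors' code; the weight names, initialization scale, and shapes are assumptions.

```python
import numpy as np

def silu(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class RFEBlock:
    """Sketch of a Residual Feature Enhancement block (illustrative).

    All three weight matrices are square (d_model x d_model), so the
    hidden dimension is preserved end to end.
    """
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.02  # assumed small init scale
        self.W_proj = rng.normal(0, s, (d_model, d_model))  # Swish branch
        self.W_gate = rng.normal(0, s, (d_model, d_model))  # linear gate branch
        self.W_out = rng.normal(0, s, (d_model, d_model))   # output mixing

    def __call__(self, attn_out):
        # SwiGLU: Swish(x W_proj) elementwise-multiplied by (x W_gate)
        h = silu(attn_out @ self.W_proj) * (attn_out @ self.W_gate)
        # Residual add keeps the pre-trained representation intact
        return attn_out + h @ self.W_out

x = np.random.default_rng(1).normal(size=(2, 8, 64))  # (batch, seq, hidden)
rfe = RFEBlock(64)
y = rfe(x)
assert y.shape == x.shape  # hidden dimension is preserved
```

Because the output path ends in a residual add, zero-initializing `W_out` would make the block start as an identity function, which is consistent with the paper's goal of not disrupting pre-trained representations at the start of fine-tuning.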
System Components
- Dimension-preserving linear projection: a weight matrix that projects the attention output into an equally sized latent space, enabling feature re-weighting without changing tensor shapes or requiring dimensional bottlenecks
- SwiGLU gated activation: a gated activation function (Swish × linear gate) that selectively amplifies relevant features and suppresses noise, providing richer nonlinearity than standard ReLU or GELU activations
- Residual connection: adds the RFE transformation output directly back to the original attention output, preserving pre-trained knowledge and enabling stable gradient flow during fine-tuning
- LLM backbone: the base LLM onto which RFE modules are attached; its weights serve as the pre-trained foundation, with RFE providing targeted reasoning enhancement at minimal parameter cost
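How these components compose within a single transformer layer can be sketched as below. The `attention` and `ffn` callables stand in for frozen backbone sublayers and are hypothetical stubs; the sublayer ordering is an assumption based on the text's "post-attention" placement.

```python
import numpy as np

def layer_forward(x, attention, rfe, ffn):
    """Illustrative placement of RFE inside one transformer layer."""
    attn_out = x + attention(x)      # frozen multi-head attention sublayer
    enhanced = rfe(attn_out)         # RFE enriches the attention output
    return enhanced + ffn(enhanced)  # frozen feed-forward sublayer

# Identity/zero stubs just to show that shapes flow through unchanged.
x = np.ones((2, 4, 16))
y = layer_forward(
    x,
    attention=lambda t: np.zeros_like(t),  # stub attention sublayer
    rfe=lambda t: t,                       # stub RFE (identity)
    ffn=lambda t: np.zeros_like(t),        # stub feed-forward sublayer
)
assert y.shape == x.shape
```

During fine-tuning, only the RFE parameters (and optionally a small set of backbone parameters, per the methodology above) would receive gradient updates.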
Results
| Benchmark | Adapter Baseline | ChatGLM4-9B+RFE | Delta |
|---|---|---|---|
| LogiQA | 67.68% | 68.20% | +0.52 pp |
| ReClor | 81.15% | 82.00% | +0.85 pp |
| LogiQA2.0 | 78.52% | 79.74% | +1.22 pp |
| GSM8K | 94.47% | 95.68% | +1.21 pp |
| HellaSwag | 66.85% | 72.42% | +5.57 pp |
| MBPP | 55.02% | 56.82% | +1.80 pp |
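As a quick sanity check, the per-benchmark deltas can be recomputed directly from the scores in the table above (values copied verbatim; all are accuracies in percent):

```python
# Adapter baseline vs. ChatGLM4-9B+RFE scores from the results table.
adapter = {"LogiQA": 67.68, "ReClor": 81.15, "LogiQA2.0": 78.52,
           "GSM8K": 94.47, "HellaSwag": 66.85, "MBPP": 55.02}
with_rfe = {"LogiQA": 68.20, "ReClor": 82.00, "LogiQA2.0": 79.74,
            "GSM8K": 95.68, "HellaSwag": 72.42, "MBPP": 56.82}

# Percentage-point improvement per benchmark
deltas = {k: round(with_rfe[k] - adapter[k], 2) for k in adapter}
avg_gain = round(sum(deltas.values()) / len(deltas), 2)
print(deltas)    # HellaSwag shows the largest gain (+5.57 pp)
print(avg_gain)  # mean gain across the six benchmarks: 1.86 pp
```

The mean gain of roughly 1.86 pp is pulled up substantially by HellaSwag; the remaining five benchmarks improve by 0.52 to 1.80 pp each.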
Key Takeaways
- RFE offers a practical plug-in strategy for practitioners who want to boost reasoning on smaller models without scaling up: attaching a lightweight residual gated-MLP block after attention layers can close part of the gap to larger models at low computational cost
- The HellaSwag result (+5.57 pp over adapter) suggests that commonsense/contextual inference tasks benefit disproportionately from richer attention-level feature mixing, making RFE especially worth trying for commonsense or multi-step reasoning applications
- Convergence analysis showing faster, more stable training implies RFE may reduce fine-tuning compute budgets in practice—a useful property when iterating quickly on new domains or tasks with limited data
Abstract
Large language models (LLMs) have achieved remarkable progress in natural language processing, yet their ability to perform complex logical reasoning remains limited. Existing approaches such as prompting, retrieval-augmented generation, and parameter-efficient fine-tuning (PEFT) provide partial improvements but often suffer from prompt sensitivity, semantic compression, or additional computational cost. In this work, we propose a Residual Feature Enhancement (RFE) module, a lightweight architectural component designed to strengthen reasoning ability while maintaining computational efficiency. RFE integrates a dimension-preserving linear transformation, SwiGLU nonlinear activation, and residual connections to enrich attention outputs without altering the backbone structure. We conducted comprehensive experiments across six reasoning and comprehension benchmarks—LogiQA, ReClor, LogiQA2.0, GSM8K, HellaSwag, and MBPP—covering deductive reasoning, standardized test comprehension, commonsense inference, and program synthesis. Results demonstrate that ChatGLM4-9B augmented with RFE consistently achieves superior performance compared with both adapter-based methods and larger-scale baselines. Specifically, ChatGLM4-9B+RFE attains 68.20% on LogiQA, 82.00% on ReClor, 79.74% on LogiQA2.0, 95.68% on GSM8K, 72.42% on HellaSwag, and 56.82% on MBPP, all of which surpass the Adapter mechanism (67.68%, 81.15%, 78.52%, 94.47%, 66.85%, 55.02%) and show clear advantages over open-source baselines such as Qwen1.5-MoE-A2.7B, Llama3.1-8B, and DeepSeek distilled models. Ablation studies further confirm that removing RFE leads to performance degradation of up to 3.84 percentage points, and convergence analysis shows improved stability and faster training.