Synthesizing Question-Answering Data from Financial Documents: An End-to-End Multi-Agent Approach
Problem Statement
Deploying LLMs for financial numerical reasoning is expensive and slow at enterprise scale, while SLMs require high-quality domain-specific fine-tuning data that traditionally demands costly manual expert annotation. This bottleneck limits the practical adoption of smaller, more efficient models in financial NLP applications. The scattered and heterogeneous nature of financial documents further complicates automated data extraction and QA generation.
Key Novelty
- Modular, scalable agentic pipeline that autonomously extracts, selects, and structures relevant content from unstructured financial documents end-to-end
- Automated synthetic QA data generation tailored for numerical reasoning over financial documents, replacing the manual annotation bottleneck
- Demonstrated that SLMs fine-tuned on pipeline-generated synthetic data achieve competitive in-distribution performance and superior out-of-distribution generalization compared to models trained on manually curated data
Evaluation Highlights
- One SLM trained on synthetic data achieved competitive in-distribution performance relative to models trained on prior manually generated datasets
- All tested SLMs fine-tuned on synthetic data demonstrated superior generalization (out-of-distribution performance) compared to counterparts trained on manual data
Methodology
- Step 1 – Content Extraction & Selection: Agents parse unstructured financial documents, identify numerically relevant sections (tables, narratives), and select content pertinent to complex reasoning queries
- Step 2 – QA Pair Generation: A generation agent synthesizes diverse, high-quality question-answer pairs requiring numerical reasoning from the selected content, mimicking expert annotation patterns
- Step 3 – SLM Fine-tuning & Evaluation: Small language models are fine-tuned on the synthetic QA dataset and evaluated on both in-distribution and out-of-distribution benchmarks against models trained on manually annotated data
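The three steps above can be sketched as a minimal, illustrative pipeline. This is not the paper's implementation: the function names, the paragraph-based splitter, and the digit-based relevance filter are all simplifying assumptions standing in for the extraction, selection, and generation agents (which in the real system would call an LLM).

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    context: str

def extract_content(document: str) -> list[str]:
    # Step 1a (hypothetical): split a raw filing into candidate sections.
    # A real extraction agent would parse tables and narratives separately.
    return [s.strip() for s in document.split("\n\n") if s.strip()]

def select_relevant(sections: list[str]) -> list[str]:
    # Step 1b (hypothetical): keep sections containing digits, a crude
    # stand-in for an agent scoring numerical relevance.
    return [s for s in sections if any(ch.isdigit() for ch in s)]

def generate_qa(section: str) -> QAPair:
    # Step 2 (placeholder): the real generation agent would prompt an LLM
    # to produce a multi-step numerical reasoning question and answer.
    return QAPair(
        question=f"What figure is reported in: {section[:40]}...?",
        answer="<model-generated>",
        context=section,
    )

def run_pipeline(document: str) -> list[QAPair]:
    # End-to-end: extract, select, then generate one QA pair per section.
    return [generate_qa(s) for s in select_relevant(extract_content(document))]
```

The resulting `QAPair` list would feed Step 3, the supervised fine-tuning of an SLM.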
System Components
- Content Extraction: Extracts and structures raw content (text, tables, figures) from unstructured financial documents such as annual reports and earnings filings
- Content Selection: Identifies and filters the most relevant numerical and contextual information needed for complex financial reasoning questions
- QA Generation: Generates diverse question-answer pairs requiring multi-step numerical reasoning from the selected financial content
- Quality Control: Ensures generated QA pairs meet domain-specific quality standards before inclusion in the fine-tuning dataset
- SLM Fine-tuning: Orchestrates supervised fine-tuning of small language models using the synthetic QA dataset to enable cost-effective deployment
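The quality-control and fine-tuning components can be illustrated with a small sketch: a rule-based filter standing in for the quality agent, and serialization of accepted pairs into a common prompt/completion SFT format. Both the checks and the JSONL schema are assumptions for illustration, not the paper's actual criteria or format.

```python
import json
import re

def passes_quality_checks(qa: dict) -> bool:
    # Hypothetical rule-based checks standing in for the quality agent:
    # the answer must contain a number (numerical reasoning task), and
    # the supporting context must be non-empty.
    has_numeric_answer = bool(re.search(r"\d", qa["answer"]))
    has_context = bool(qa.get("context", "").strip())
    return has_numeric_answer and has_context

def to_finetune_jsonl(pairs: list[dict]) -> str:
    # Serialize only the accepted pairs, one JSON object per line, in a
    # generic prompt/completion layout commonly used for SFT.
    lines = [
        json.dumps({"prompt": p["question"], "completion": p["answer"]})
        for p in pairs
        if passes_quality_checks(p)
    ]
    return "\n".join(lines)
```

In practice the quality agent would apply richer, domain-specific validation (e.g. answer consistency with the source table), but the gating pattern is the same: only pairs that pass all checks enter the fine-tuning dataset.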
Results
| Metric/Benchmark | Baseline (Manual Data) | This Paper (Synthetic Data) | Delta |
|---|---|---|---|
| In-distribution performance (best SLM) | Competitive reference | Competitive (matches baseline) | ~Neutral |
| Out-of-distribution generalization (all SLMs) | Lower generalization | Superior generalization | Positive improvement |
| Manual annotation effort | High (expert required) | Eliminated (automated) | Significant reduction |
Key Takeaways
- Multi-agent pipelines can effectively replace costly manual annotation for domain-specific QA datasets, making SLM fine-tuning more accessible for enterprise financial NLP
- Synthetic data generated by agentic pipelines can yield better generalization than manually curated data, suggesting that diversity and scale from automation may outweigh the precision of human annotation
- Practitioners should consider modular, agent-based data synthesis as a first-pass strategy when entering new financial sub-domains where labeled data is scarce or expensive to acquire
Abstract
Answering complex questions that require numerical reasoning over financial documents is challenging due to the diverse and scattered nature of relevant information. While large language models (LLMs) excel at financial reasoning, their enterprise deployment is often limited by cost and latency. Small language models (SLMs) present a cost-effective alternative but need to be fine-tuned with high-quality, domain-specific question-answer (QA) data. Acquiring such data requires manual expert annotation, presenting a bottleneck to the wider application of SLMs. This work introduces a modular, scalable end-to-end agentic pipeline that extracts and selects relevant content from unstructured financial documents and then generates QA pairs from the selected content for SLM fine-tuning. Compared to the same models trained on previous manually generated data for the task, one of the models trained on our pipeline-produced synthetic data achieved competitive in-distribution performance, and all tested models demonstrated superior generalization. The framework thus demonstrates considerable potential to accelerate the deployment of smaller, cost-effective models by reducing manual data creation efforts.