Synthesizing Question-Answering Data from Financial Documents: An End-to-End Multi-Agent Approach
Problem Statement
Deploying LLMs for financial numerical reasoning is expensive and slow at enterprise scale, while SLMs require high-quality domain-specific fine-tuning data that traditionally demands costly manual expert annotation. This bottleneck limits the practical adoption of smaller, more efficient models in financial NLP applications. The scattered and heterogeneous nature of financial documents further complicates automated data extraction and QA generation.
Key Novelty
- Modular, scalable agentic pipeline that autonomously extracts, selects, and structures relevant content from unstructured financial documents end-to-end
- Automated synthetic QA data generation tailored for numerical reasoning over financial documents, replacing the manual annotation bottleneck
- Demonstrated that SLMs fine-tuned on pipeline-generated synthetic data achieve competitive in-distribution performance and superior out-of-distribution generalization compared to models trained on manually curated data
Evaluation Highlights
- One SLM trained on synthetic data achieved competitive in-distribution performance relative to models trained on prior manually generated datasets
- All tested SLMs fine-tuned on synthetic data demonstrated superior generalization (out-of-distribution performance) compared to counterparts trained on manual data
Methodology
- Step 1 – Content Extraction & Selection: Agents parse unstructured financial documents, identify numerically relevant sections (tables, narratives), and select content pertinent to complex reasoning queries
- Step 2 – QA Pair Generation: A generation agent synthesizes diverse, high-quality question-answer pairs requiring numerical reasoning from the selected content, mimicking expert annotation patterns
- Step 3 – SLM Fine-tuning & Evaluation: Small language models are fine-tuned on the synthetic QA dataset and evaluated on both in-distribution and out-of-distribution benchmarks against models trained on manually annotated data
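The three steps above can be sketched as a minimal, illustrative pipeline. This is not the paper's implementation: the function names, the paragraph-based splitter, and the digit-based relevance filter are all simplifying assumptions standing in for the extraction, selection, and generation agents (which in the real system would call an LLM).

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    context: str

def extract_content(document: str) -> list[str]:
    # Step 1a (hypothetical): split a raw filing into candidate sections.
    # A real extraction agent would parse tables and narratives separately.
    return [s.strip() for s in document.split("\n\n") if s.strip()]

def select_relevant(sections: list[str]) -> list[str]:
    # Step 1b (hypothetical): keep sections containing digits, a crude
    # stand-in for an agent scoring numerical relevance.
    return [s for s in sections if any(ch.isdigit() for ch in s)]

def generate_qa(section: str) -> QAPair:
    # Step 2 (placeholder): the real generation agent would prompt an LLM
    # to produce a multi-step numerical reasoning question and answer.
    return QAPair(
        question=f"What figure is reported in: {section[:40]}...?",
        answer="<model-generated>",
        context=section,
    )

def run_pipeline(document: str) -> list[QAPair]:
    # End-to-end: extract, select, then generate one QA pair per section.
    return [generate_qa(s) for s in select_relevant(extract_content(document))]
```

The resulting `QAPair` list would feed Step 3, the supervised fine-tuning of an SLM.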
System Components
- Content Extraction: Extracts and structures raw content (text, tables, figures) from unstructured financial documents such as annual reports and earnings filings
- Content Selection: Identifies and filters the most relevant numerical and contextual information needed for complex financial reasoning questions
- QA Generation: Generates diverse question-answer pairs requiring multi-step numerical reasoning from the selected financial content
- Quality Control: Ensures generated QA pairs meet domain-specific quality standards before inclusion in the fine-tuning dataset
- SLM Fine-tuning: Orchestrates supervised fine-tuning of small language models using the synthetic QA dataset to enable cost-effective deployment
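The quality-control and fine-tuning components can be illustrated with a small sketch: a rule-based filter standing in for the quality agent, and serialization of accepted pairs into a common prompt/completion SFT format. Both the checks and the JSONL schema are assumptions for illustration, not the paper's actual criteria or format.

```python
import json
import re

def passes_quality_checks(qa: dict) -> bool:
    # Hypothetical rule-based checks standing in for the quality agent:
    # the answer must contain a number (numerical reasoning task), and
    # the supporting context must be non-empty.
    has_numeric_answer = bool(re.search(r"\d", qa["answer"]))
    has_context = bool(qa.get("context", "").strip())
    return has_numeric_answer and has_context

def to_finetune_jsonl(pairs: list[dict]) -> str:
    # Serialize only the accepted pairs, one JSON object per line, in a
    # generic prompt/completion layout commonly used for SFT.
    lines = [
        json.dumps({"prompt": p["question"], "completion": p["answer"]})
        for p in pairs
        if passes_quality_checks(p)
    ]
    return "\n".join(lines)
```

In practice the quality agent would apply richer, domain-specific validation (e.g. answer consistency with the source table), but the gating pattern is the same: only pairs that pass all checks enter the fine-tuning dataset.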
Results
| Metric/Benchmark | Baseline (Manual Data) | This Paper (Synthetic Data) | Delta |
|---|---|---|---|
| In-distribution performance (best SLM) | Competitive reference | Competitive (matches baseline) | ~Neutral |
| Out-of-distribution generalization (all SLMs) | Lower generalization | Superior generalization | Positive improvement |
| Manual annotation effort | High (expert required) | Eliminated (automated) | Significant reduction |
Key Takeaways
- Multi-agent pipelines can effectively replace costly manual annotation for domain-specific QA datasets, making SLM fine-tuning more accessible for enterprise financial NLP
- Synthetic data generated by agentic pipelines can yield better generalization than manually curated data, suggesting that diversity and scale from automation may outweigh the precision of human annotation
- Practitioners should consider modular, agent-based data synthesis as a first-pass strategy when entering new financial sub-domains where labeled data is scarce or expensive to acquire
Abstract
Answering complex questions that require numerical reasoning over financial documents is challenging due to the diverse and scattered nature of relevant information. While large language models (LLMs) excel at financial reasoning, their enterprise deployment is often limited by cost and latency. Small language models (SLMs) present a cost-effective alternative but need to be fine-tuned with high-quality, domain-specific question-answer (QA) data. Acquiring such data requires manual expert annotation, presenting a bottleneck to the wider application of SLMs. This work introduces a modular, scalable end-to-end agentic pipeline that extracts and selects relevant content from unstructured financial documents and then generates QA pairs from the selected content for SLM fine-tuning. Compared to the same models trained on previous manually generated data for the task, one of the models trained on our pipeline-produced synthetic data achieved competitive in-distribution performance, and all tested models demonstrated superior generalization. The framework thus demonstrates considerable potential to accelerate the deployment of smaller, cost-effective models by reducing manual data creation efforts.