CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
Problem Statement
LLMs increasingly generate fabricated references that appear plausible but correspond to no real publications, with such hallucinated citations already appearing in accepted papers at major ML venues. Manual verification is impractical given rapidly growing reference lists, and existing automated tools are fragile to noisy citation formats and lack standardized evaluation benchmarks. This creates a systemic threat to scientific integrity that has no robust, scalable solution.
Key Novelty
- First comprehensive, large-scale human-validated benchmark dataset for hallucinated citation detection spanning multiple scientific domains with unified evaluation metrics for citation faithfulness and evidence alignment
- Multi-agent verification pipeline that decomposes citation checking into modular stages: claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment
- Standardized evaluation framework with unified metrics that enables systematic comparison of citation verification methods across heterogeneous and noisy citation formats
Evaluation Highlights
- The framework significantly outperforms prior citation verification methods in both detection accuracy and interpretability when evaluated with state-of-the-art LLMs
- Experiments reveal substantial citation errors across tested LLMs, quantifying the real-world prevalence of hallucinated or unsupported references in scientific writing
Methodology
- Construct a large-scale human-validated dataset of scientific citations with ground-truth labels for hallucination, faithfulness, and evidence alignment across multiple domains
- Deploy a multi-agent pipeline that sequentially performs claim extraction from the citing context, evidence retrieval from cited sources, passage matching, chain-of-thought reasoning, and calibrated verdict generation
- Evaluate state-of-the-art LLMs on the benchmark using unified metrics, comparing the multi-agent framework against prior automated citation verification baselines in accuracy and interpretability
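The sequential decomposition described above (claim extraction → evidence retrieval → passage matching → reasoning → calibrated judgment) can be sketched as a minimal pipeline. All function names and heuristics here are hypothetical stand-ins: the paper's agents use LLM-based extraction, retrieval, and reasoning, not the toy keyword overlap shown below.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # "faithful", "hallucinated", or "unsupported"
    confidence: float  # calibrated confidence in [0, 1]
    rationale: str     # explanation attached for interpretability

def extract_claim(citing_context: str) -> str:
    """Stage 1 (stub): isolate the claim attributed to the cited source."""
    return citing_context.strip()

def retrieve_evidence(citation_key: str, corpus: dict) -> list:
    """Stage 2 (stub): fetch passages from the cited work; an empty result
    means the reference resolves to no known publication."""
    return corpus.get(citation_key, [])

def match_passages(claim: str, passages: list) -> list:
    """Stage 3 (toy heuristic): keep passages sharing words with the claim."""
    claim_terms = set(claim.lower().split())
    return [p for p in passages if claim_terms & set(p.lower().split())]

def judge(matched: list, retrieved: list) -> Verdict:
    """Stages 4-5 (toy): reason over the claim-evidence pair, emit a verdict."""
    if not retrieved:
        return Verdict("hallucinated", 0.9, "cited work could not be resolved")
    if matched:
        return Verdict("faithful", 0.8, f"{len(matched)} aligned passage(s)")
    return Verdict("unsupported", 0.7, "source exists but no aligned evidence")

def audit_citation(citing_context: str, citation_key: str, corpus: dict) -> Verdict:
    claim = extract_claim(citing_context)
    passages = retrieve_evidence(citation_key, corpus)
    return judge(match_passages(claim, passages), passages)
```

Note how the three-way verdict falls out of the stage structure: a failed retrieval signals a hallucinated reference, a real source without aligned evidence signals an unsupported claim, and only an aligned claim-evidence pair is judged faithful.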
System Components
- Claim Extraction: parses the citing sentence/paragraph to identify the specific scientific claim being attributed to the referenced source
- Evidence Retrieval: locates and fetches relevant content from the cited publication (abstract, sections) to serve as grounding for verification
- Passage Matching: aligns extracted claims with retrieved passages to identify whether supporting evidence exists in the cited document
- Reasoning: performs chain-of-thought reasoning over the claim-evidence pair to assess semantic support, contradiction, or irrelevance
- Calibrated Judgment: produces a final verdict, with confidence calibration, on whether the citation is faithful, hallucinated, or unsupported
- Benchmark Dataset: human-validated, multi-domain dataset with standardized labels and metrics for citation faithfulness and evidence alignment
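A record in a benchmark with this label scheme might look like the following minimal sketch. The field names and `BenchmarkRecord` class are assumptions for illustration, not the dataset's actual schema; the three-way label space comes from the verdict categories above.

```python
from dataclasses import dataclass
from typing import Optional

# The three verdict categories: faithful (source supports the claim),
# hallucinated (source does not exist), unsupported (source exists but
# does not support the claim).
LABELS = ("faithful", "hallucinated", "unsupported")

@dataclass
class BenchmarkRecord:
    citing_context: str           # sentence/paragraph containing the citation
    citation: str                 # raw reference string, possibly noisy
    domain: str                   # scientific field, e.g. "machine learning"
    label: str                    # human-validated ground-truth verdict
    evidence_span: Optional[str]  # aligned passage in the cited work, if any

    def __post_init__(self):
        # Enforce the standardized label space at construction time.
        if self.label not in LABELS:
            raise ValueError(f"label must be one of {LABELS}")
```

Keeping the raw (possibly malformed) citation string alongside the human verdict is what lets a benchmark like this measure robustness to noisy citation formats, not just detection accuracy on clean references.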
Results
| Evaluation Dimension | Prior Methods (Baseline) | CiteAudit Framework | Delta |
|---|---|---|---|
| Citation hallucination detection accuracy | Fragile / lower accuracy | Significantly higher accuracy | Substantial improvement |
| Interpretability of verification output | Limited explanations | Full chain-of-thought reasoning | Qualitatively superior |
| Robustness to noisy citation formats | Fragile | Robust (multi-agent decomposition) | Meaningful improvement |
| Evaluation standardization | No unified benchmark | Unified metrics + human-validated dataset | First of its kind |
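The accuracy row in the table above can be operationalized with simple unified metrics over the three-way label space. This is an illustrative harness under the assumption that predictions and gold labels are parallel lists; it is not the paper's exact metric definitions.

```python
from collections import Counter

def detection_accuracy(preds, golds):
    """Fraction of citations whose predicted label matches the human label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def per_label_recall(preds, golds):
    """Recall per ground-truth label. This matters because hallucinated
    citations are typically a small minority of a reference list, so
    overall accuracy alone can hide poor hallucination detection."""
    totals = Counter(golds)
    hits = Counter(g for p, g in zip(preds, golds) if p == g)
    return {label: hits[label] / n for label, n in totals.items()}
```

For example, a system that labels everything "faithful" scores high overall accuracy on a mostly clean reference list while achieving zero recall on the "hallucinated" class, which is why a per-label breakdown belongs in any standardized evaluation here.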
Key Takeaways
- ML practitioners and researchers should treat LLM-assisted writing with skepticism regarding citations — the benchmark confirms substantial hallucination rates even in plausible-looking references at top venues, making automated auditing essential before submission
- The multi-agent decomposition approach (claim → retrieval → matching → reasoning → judgment) is a reusable design pattern for any fact-verification or grounding task in scientific or enterprise NLP pipelines
- CiteAudit's benchmark and metrics provide a ready-to-use evaluation harness for teams building citation verification, RAG grounding checks, or scientific integrity tools, lowering the barrier to rigorous evaluation in this space
Abstract
Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.