CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
Problem Statement
LLMs increasingly generate fabricated references that appear plausible but correspond to no real publications, with such hallucinated citations already appearing in accepted papers at major ML venues. Manual verification is impractical given rapidly growing reference lists, and existing automated tools are fragile to noisy citation formats and lack standardized evaluation benchmarks. This creates a systemic threat to scientific integrity that has no robust, scalable solution.
Key Novelty
- First comprehensive, large-scale human-validated benchmark dataset for hallucinated citation detection spanning multiple scientific domains with unified evaluation metrics for citation faithfulness and evidence alignment
- Multi-agent verification pipeline that decomposes citation checking into modular stages: claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment
- Standardized evaluation framework with unified metrics that enables systematic comparison of citation verification methods across heterogeneous and noisy citation formats
Evaluation Highlights
- The framework significantly outperforms prior citation verification methods in both detection accuracy and interpretability when evaluated with state-of-the-art LLMs
- Experiments reveal substantial citation errors across tested LLMs, quantifying the real-world prevalence of hallucinated or unsupported references in scientific writing
Methodology
- Construct a large-scale human-validated dataset of scientific citations with ground-truth labels for hallucination, faithfulness, and evidence alignment across multiple domains
- Deploy a multi-agent pipeline that sequentially performs claim extraction from the citing context, evidence retrieval from cited sources, passage matching, chain-of-thought reasoning, and calibrated verdict generation
- Evaluate state-of-the-art LLMs on the benchmark using unified metrics, comparing the multi-agent framework against prior automated citation verification baselines in accuracy and interpretability
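The sequential decomposition described above (claim extraction → evidence retrieval → passage matching → reasoning → calibrated judgment) can be sketched as a minimal pipeline. All function names and heuristics here are hypothetical stand-ins: the paper's agents use LLM-based extraction, retrieval, and reasoning, not the toy keyword overlap shown below.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # "faithful", "hallucinated", or "unsupported"
    confidence: float  # calibrated confidence in [0, 1]
    rationale: str     # explanation attached for interpretability

def extract_claim(citing_context: str) -> str:
    """Stage 1 (stub): isolate the claim attributed to the cited source."""
    return citing_context.strip()

def retrieve_evidence(citation_key: str, corpus: dict) -> list:
    """Stage 2 (stub): fetch passages from the cited work; an empty result
    means the reference resolves to no known publication."""
    return corpus.get(citation_key, [])

def match_passages(claim: str, passages: list) -> list:
    """Stage 3 (toy heuristic): keep passages sharing words with the claim."""
    claim_terms = set(claim.lower().split())
    return [p for p in passages if claim_terms & set(p.lower().split())]

def judge(matched: list, retrieved: list) -> Verdict:
    """Stages 4-5 (toy): reason over the claim-evidence pair, emit a verdict."""
    if not retrieved:
        return Verdict("hallucinated", 0.9, "cited work could not be resolved")
    if matched:
        return Verdict("faithful", 0.8, f"{len(matched)} aligned passage(s)")
    return Verdict("unsupported", 0.7, "source exists but no aligned evidence")

def audit_citation(citing_context: str, citation_key: str, corpus: dict) -> Verdict:
    claim = extract_claim(citing_context)
    passages = retrieve_evidence(citation_key, corpus)
    return judge(match_passages(claim, passages), passages)
```

Note how the three-way verdict falls out of the stage structure: a failed retrieval signals a hallucinated reference, a real source without aligned evidence signals an unsupported claim, and only an aligned claim-evidence pair is judged faithful.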
System Components
- Claim Extraction: parses the citing sentence/paragraph to identify the specific scientific claim being attributed to the referenced source
- Evidence Retrieval: locates and fetches relevant content from the cited publication (abstract, sections) to serve as grounding for verification
- Passage Matching: aligns extracted claims with retrieved passages to identify whether supporting evidence exists in the cited document
- Reasoning: performs chain-of-thought reasoning over the claim-evidence pair to assess semantic support, contradiction, or irrelevance
- Calibrated Judgment: produces a final verdict, with confidence calibration, on whether the citation is faithful, hallucinated, or unsupported
- Benchmark Dataset: human-validated, multi-domain dataset with standardized labels and metrics for citation faithfulness and evidence alignment
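A record in a benchmark with this label scheme might look like the following minimal sketch. The field names and `BenchmarkRecord` class are assumptions for illustration, not the dataset's actual schema; the three-way label space comes from the verdict categories above.

```python
from dataclasses import dataclass
from typing import Optional

# The three verdict categories: faithful (source supports the claim),
# hallucinated (source does not exist), unsupported (source exists but
# does not support the claim).
LABELS = ("faithful", "hallucinated", "unsupported")

@dataclass
class BenchmarkRecord:
    citing_context: str           # sentence/paragraph containing the citation
    citation: str                 # raw reference string, possibly noisy
    domain: str                   # scientific field, e.g. "machine learning"
    label: str                    # human-validated ground-truth verdict
    evidence_span: Optional[str]  # aligned passage in the cited work, if any

    def __post_init__(self):
        # Enforce the standardized label space at construction time.
        if self.label not in LABELS:
            raise ValueError(f"label must be one of {LABELS}")
```

Keeping the raw (possibly malformed) citation string alongside the human verdict is what lets a benchmark like this measure robustness to noisy citation formats, not just detection accuracy on clean references.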
Results
| Evaluation Dimension | Prior Methods (Baseline) | CiteAudit Framework | Delta |
|---|---|---|---|
| Citation hallucination detection accuracy | Fragile / lower accuracy | Significantly higher accuracy | Substantial improvement |
| Interpretability of verification output | Limited explanations | Full chain-of-thought reasoning | Qualitatively superior |
| Robustness to noisy citation formats | Fragile | Robust (multi-agent decomposition) | Meaningful improvement |
| Evaluation standardization | No unified benchmark | Unified metrics + human-validated dataset | First of its kind |
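The accuracy row in the table above can be operationalized with simple unified metrics over the three-way label space. This is an illustrative harness under the assumption that predictions and gold labels are parallel lists; it is not the paper's exact metric definitions.

```python
from collections import Counter

def detection_accuracy(preds, golds):
    """Fraction of citations whose predicted label matches the human label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def per_label_recall(preds, golds):
    """Recall per ground-truth label. This matters because hallucinated
    citations are typically a small minority of a reference list, so
    overall accuracy alone can hide poor hallucination detection."""
    totals = Counter(golds)
    hits = Counter(g for p, g in zip(preds, golds) if p == g)
    return {label: hits[label] / n for label, n in totals.items()}
```

For example, a system that labels everything "faithful" scores high overall accuracy on a mostly clean reference list while achieving zero recall on the "hallucinated" class, which is why a per-label breakdown belongs in any standardized evaluation here.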
Key Takeaways
- ML practitioners and researchers should treat LLM-assisted writing with skepticism regarding citations — the benchmark confirms substantial hallucination rates even in plausible-looking references at top venues, making automated auditing essential before submission
- The multi-agent decomposition approach (claim → retrieval → matching → reasoning → judgment) is a reusable design pattern for any fact-verification or grounding task in scientific or enterprise NLP pipelines
- CiteAudit's benchmark and metrics provide a ready-to-use evaluation harness for teams building citation verification, RAG grounding checks, or scientific integrity tools, lowering the barrier to rigorous evaluation in this space
Abstract
Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.