Chaincite: A Clinical RAG Benchmark for Retrieval Quality and Citation Faithfulness
Two questions vanilla RAG metrics conflate: does the retrieved passage support the claim, and did the model actually use it? Built on the correctness-vs-faithfulness distinction.
Abstract. Clinical RAG systems routinely produce answers that look cited but are not actually grounded in the cited passage. ALCE[1] reported that even the best LLMs on ELI5 lack complete citation support 50% of the time. SourceCheckup[11] (Wu et al., Nature Communications 2025) found that 50–90% of LLM medical responses are not fully supported by their cited sources, with ~30% of GPT-4o-with-Search statements entirely unsupported. Wallat et al.'s ICTIR 2025 best-paper work[3] formalised the deeper problem: correctness is necessary but insufficient for faithfulness — models can post-rationalise citations they didn't actually use. Chaincite is a clinical RAG benchmark that measures both axes. It builds on the AIS attribution framework[2], the RAGAS[4] reference-free evaluation triad, and the TruLens RAG triad[5]; uses RAGTruth[6] and NoMIRACL[7] as methodological precedents; and benchmarks clinical results against MedRAG / MIRAGE[9] and Almanac[10]. Pass criterion: 500 clinical questions with attribution-graded responses across at least 5 RAG configurations, and a documented gap between correctness and faithfulness.
§ 1 Introduction
A clinical RAG response with a citation is not the same as a clinical RAG response grounded in its citation. Three findings from the last 18 months establish how serious the gap is. ALCE[1] measured citation support on ELI5: even GPT-4 with the best retrieval setup lacked complete citation support 50% of the time. SourceCheckup[11] extended this to medical questions and found 50–90% of LLM medical responses are not fully supported by their cited sources; ~30% of GPT-4o-with-Search statements are entirely unsupported. Wallat et al.[3] made the conceptual distinction precise — citation correctness (does this passage support this claim?) is operationally different from citation faithfulness (did the model actually use this passage, or write the claim from its parametric knowledge and tack a citation on?).
Clinical RAG benchmarks today conflate the two. MedRAG[9] measures answer accuracy; Almanac[10], evaluated by 8 board-certified clinicians on 314 questions, reports gains in factuality and adversarial safety. Neither separately measures whether the model relied on what it cited. Chaincite fills that gap.
1.1 Contributions
- A 500-question clinical RAG benchmark covering retrieval-anchored, multi-hop, and global-style queries.
- A scoring protocol that evaluates both AIS[2] citation correctness and Wallat-style[3] citation faithfulness — the first benchmark to do both for clinical RAG.
- Reference RAG configurations spanning vanilla MedRAG[9], Almanac[10], GraphCore (Project 17), and a long-context-only baseline.
§ 2 Background and Related Work
2.1 The Correctness/Faithfulness Distinction
Wallat, Heuss, de Rijke & Anand's ICTIR 2025 paper[3] (Best Paper Honorable Mention) is the load-bearing conceptual contribution. They show empirically that LLMs can produce citations whose passage does support the claim while the model did not causally rely on that passage. Standard correctness metrics — including ALCE's[1] citation precision/recall and the AIS framework[2] — cannot detect this. Faithfulness measurement requires either attention-based proxies (Lookback Lens[12]: attention-ratio classifiers cut XSum hallucinations by 9.6%) or counterfactual retrieval (does the model produce the same claim if the passage is replaced?). Chaincite uses both.
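A minimal sketch of the counterfactual-retrieval probe's logic (not Wallat et al.'s implementation), assuming two hypothetical helpers: a `generate(question, passages)` wrapper around the system under test and a `same_claim` judge, which could be an NLI model or an LLM:

```python
# Counterfactual-stability probe (illustrative sketch, not Wallat et al.'s
# implementation). Hypothetical helpers:
#   generate(question, passages) -> answer text
#   same_claim(a, b) -> bool, e.g. an NLI model or LLM judge
import random

def citation_post_rationalised(question, cited_passage, distractors,
                               generate, same_claim, n_trials=3):
    """True if the model reproduces the same claim even when the cited
    passage is swapped for unrelated distractors, i.e. evidence the
    citation was attached after the fact rather than causally used."""
    with_passage = generate(question, [cited_passage])
    for _ in range(n_trials):
        without = generate(question, [random.choice(distractors)])
        if not same_claim(with_passage, without):
            return False  # the claim changed: the passage plausibly mattered
    return True  # the claim survived every swap: faithfulness violation suspected
```

Note that this probe needs retrieval-time inputs but not model weights, which is why (per § 6) it extends to some closed systems where attention-based probes do not.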
2.2 The AIS Lineage
Rashkin et al.'s AIS framework[2] (Computational Linguistics 2023) defines "Attributable to Identified Sources": a claim is AIS-supported if a competent reader, given the cited source, would judge it supported. ALCE[1] operationalises AIS for long-form QA with citation-precision and citation-recall metrics. Chaincite inherits AIS as the correctness axis and extends it with Wallat-style[3] faithfulness measurement.
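Both ALCE metrics reduce to per-(statement, citation) entailment checks. A simplified sketch of that reduction, where `entails(premise, hypothesis)` is a hypothetical NLI wrapper and ALCE's credit for citations that support a statement only jointly is omitted:

```python
# ALCE-style citation precision / recall (simplified sketch; the full
# definition also credits citations that entail a statement only jointly).
# Hypothetical helper: entails(premise, hypothesis) -> bool, an NLI wrapper.

def citation_metrics(statements, entails):
    """statements: [{"text": str, "citations": [passage_text, ...]}, ...]"""
    supported, citation_pairs, relevant_pairs = 0, 0, 0
    for s in statements:
        if s["citations"] and entails(" ".join(s["citations"]), s["text"]):
            supported += 1  # recall: statement fully backed by its citations
        for passage in s["citations"]:
            citation_pairs += 1
            relevant_pairs += int(entails(passage, s["text"]))  # precision: this citation pulls its weight
    recall = supported / len(statements) if statements else 0.0
    precision = relevant_pairs / citation_pairs if citation_pairs else 0.0
    return precision, recall
```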
2.3 Reference-Free RAG Evaluation
RAGAS (Es et al., EACL 2024 Demo)[4] defines three reference-free metrics: context relevance, faithfulness, answer relevance. TruLens[5] codified the canonical RAG triad of context relevance, groundedness, and answer relevance with LLM-as-judge scoring. Chaincite uses RAGAS as a reference-free baseline scorer alongside the more rigorous AIS-plus-faithfulness protocol.
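The RAGAS faithfulness score reduces to a two-step chain: decompose the answer into atomic statements, then verify each against the retrieved context. A sketch of that logic with hypothetical LLM-backed helpers (illustrative, not the ragas package API):

```python
# RAGAS-style reference-free faithfulness (sketch of the metric's logic,
# not the ragas package API). Hypothetical LLM-backed helpers:
#   extract_statements(answer) -> list[str]   (decomposition prompt)
#   supported_by(statement, context) -> bool  (verification prompt)

def ragas_faithfulness(answer, context, extract_statements, supported_by):
    """Fraction of the answer's atomic statements that the retrieved
    context supports; 1.0 means every claim is grounded."""
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    return sum(supported_by(s, context) for s in statements) / len(statements)
```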
2.4 Hallucination Benchmarks
RAGTruth (Niu et al., ACL 2024)[6] released ~18,000 word-level hallucination annotations across multiple LLMs — the largest open RAG-hallucination corpus. NoMIRACL (Thakur et al., EMNLP Findings 2024)[7] documents that LLaMA-2 and Orca-2 hallucinate at >88% on non-relevant subsets across 18 languages; GPT-4 had the best tradeoff. FActScore (Min et al., EMNLP 2023)[8] reported ChatGPT's biography-FActScore at 58% with an automated estimator showing <2% error vs human. Chaincite uses RAGTruth's annotation methodology and FActScore's automated estimator as cross-checks.
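FActScore's estimator has the same decompose-and-verify shape, but checks each atomic fact against passages retrieved from a trusted corpus rather than against the generation-time context. A sketch with hypothetical `atomic_facts`, `retrieve`, and `entails` helpers:

```python
# FActScore-style automated estimator (sketch). Same decompose-and-verify
# shape as the RAGAS sketch above, but each atomic fact is verified against
# passages retrieved from a trusted corpus, not the generation-time context.
# Hypothetical helpers: atomic_facts(text) -> list[str],
# retrieve(query, k) -> list[str], entails(premise, hypothesis) -> bool.

def factscore(answer, atomic_facts, retrieve, entails, k=5):
    """Fraction of atomic facts supported by the knowledge source."""
    facts = atomic_facts(answer)
    if not facts:
        return 0.0
    supported = sum(int(entails(" ".join(retrieve(f, k)), f)) for f in facts)
    return supported / len(facts)
```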
2.5 Clinical RAG State of the Art
MIRAGE / MedRAG[9] covers 7,663 questions across 5 medical QA datasets, with MedRAG improving accuracy by up to 18% over CoT. Almanac[10] (Zakka et al., NEJM AI 2024) was clinician-evaluated on 314 questions across 9 specialties and showed factuality and adversarial-safety gains over GPT-4 / Bing / Bard — but the published evaluation does not separately measure citation faithfulness. SourceCheckup[11] is the closest existing faithfulness audit; Chaincite extends it into a reproducible benchmark.
§ 3 Proposed Approach
3.1 Question Set
Five hundred questions in three strata: 200 focused-fact (single-passage answer), 200 multi-hop (requires synthesising 2+ passages), 100 global-style (requires synthesis across themes — these align with the GraphCore questions in Project 17). Source: physician-curated questions seeded from MedRAG's MIRAGE[9] distribution, with explicit physician-validated gold passages for the focused-fact subset.
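For concreteness, one possible record schema for a benchmark item follows; the field names are illustrative assumptions, not a published format:

```python
# Illustrative item schema (field names are assumptions, not a published format).
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ChainciteItem:
    qid: str
    question: str
    stratum: Literal["focused_fact", "multi_hop", "global"]
    gold_passage_ids: list[str] = field(default_factory=list)  # physician-validated; focused-fact stratum only
    source: str = "MIRAGE"  # seeded from the MedRAG / MIRAGE distribution [9]
```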
3.2 RAG Systems Under Test
Six baseline configurations are evaluated head-to-head: (1) GPT-5 with vanilla vector RAG over the MIRAGE corpora; (2) Claude with the same retrieval stack; (3) the MedRAG reference implementation[9]; (4) the Almanac architecture[10]; (5) GraphCore (Project 17 of this dossier); (6) a long-context-only (no retrieval) baseline on Claude with a 1M-token context. Results are reported per system on both axes.
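A sketch of the head-to-head harness, assuming each configuration is wrapped as a callable returning an answer with its citations, and that the § 4 metrics are supplied as scoring callables; it also computes the headline per-system correctness–faithfulness gap:

```python
# Head-to-head harness (sketch). `systems` maps a label to a hypothetical
# answer(question) -> (text, citations) callable wrapping one of the six
# configurations; the scoring callables are the section-4 metrics.

def run_benchmark(items, systems, score_correctness, score_faithfulness):
    results = {}
    for label, answer in systems.items():
        corr, faith = [], []
        for item in items:
            text, citations = answer(item.question)
            corr.append(score_correctness(text, citations))
            faith.append(score_faithfulness(item.question, text, citations))
        c, f = sum(corr) / len(corr), sum(faith) / len(faith)
        results[label] = {"correctness": c, "faithfulness": f, "gap": c - f}
    return results
```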
§ 4 Evaluation Protocol
| Metric | Source | Reports |
|---|---|---|
| AIS attribution rate | Rashkin et al. 2023[2] | Binary per claim; aggregated |
| ALCE citation precision / recall | Gao et al. 2023[1] | Standard ALCE |
| RAGAS faithfulness | Es et al. 2024[4] | Reference-free LLM-as-judge |
| Counterfactual stability | Wallat-inspired[3] | Does claim change when passage swapped? |
| Attention-ratio faithfulness | Chuang et al. 2024[12] | Lookback Lens score (where applicable; sketched below) |
| Correctness–faithfulness gap | Chaincite headline | Per-system correctness score minus faithfulness score |
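Of these metrics, the attention-ratio probe is the least standard, so a minimal sketch follows. It assumes an open-weights Hugging Face causal LM (per § 6, closed models do not expose attention); in Chuang et al.[12] the per-head ratios feed a simple linear classifier:

```python
# Lookback-Lens-style attention-ratio features, following Chuang et al.[12]:
# for each head, the share of a newly generated token's attention mass that
# lands on the provided context versus on tokens generated so far.
# Assumes an open-weights Hugging Face causal LM; the model may need to be
# loaded with attn_implementation="eager" so attentions are returned.
import torch

@torch.no_grad()
def lookback_ratios(model, tokenizer, prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    ctx_len = inputs["input_ids"].shape[1]
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         output_attentions=True, return_dict_in_generate=True)
    feats = []
    for step in out.attentions:               # one tuple of layers per new token
        per_head = []
        for layer in step:                    # (batch, heads, q_len, keys_so_far)
            attn = layer[0, :, -1, :]         # the new token's attention row, per head
            on_ctx = attn[:, :ctx_len].sum(-1)   # mass on the provided context
            on_new = attn[:, ctx_len:].sum(-1)   # mass on generated-so-far tokens
            per_head.append(on_ctx / (on_ctx + on_new + 1e-9))
        feats.append(torch.cat(per_head))     # (layers * heads,) vector per token
    return torch.stack(feats)                 # first row is trivially ~1.0: the
                                              # first new token sees only context
```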
§ 5 Expected Contributions
- Benchmark. A 500-question clinical RAG benchmark scored on both correctness and faithfulness, the first to do so for clinical RAG.
- Empirical finding. Per-system correctness-vs-faithfulness gaps for the six configurations of § 3.2 (five RAG systems plus a long-context baseline).
- Methodology. A reproducible faithfulness-measurement protocol combining counterfactual retrieval and Lookback Lens-style attention analysis.
§ 6 Limitations and Risks
Faithfulness measurement is harder than correctness. Counterfactual-retrieval probes require access to retrieval-time inputs, which closed APIs do not always provide; attention-based probes require weights, which closed models do not expose. Chaincite reports faithfulness scores for open systems unconditionally and for closed systems where input-only counterfactual probing is feasible. The 500-question scale is intentionally modest — physician validation does not scale linearly, and we prefer a high-quality smaller set to a noisy larger one.
§ 7 Conclusion
Chaincite tests whether clinical RAG citations are real or theatrical. The conceptual basis is Wallat et al.[3]; the operationalisation is novel for the clinical domain. The expected output is the first quantitative measurement of the gap between citation correctness and citation faithfulness in frontier clinical RAG — exactly the number a deploying hospital should be asking for and currently cannot get.
References
- Gao T, Yen H, Yu J, Chen D. Enabling Large Language Models to Generate Text with Citations (ALCE). EMNLP, 2023. Best LLMs lack complete citation support 50% of the time on ELI5. arxiv.org/abs/2305.14627
- Rashkin H, Nikolaev V, Lamm M, et al. Measuring Attribution in Natural Language Generation Models (AIS). Computational Linguistics 49(4):777–840, 2023. aclanthology.org/2023.cl-4.2
- Wallat J, Heuss M, de Rijke M, Anand A. Correctness is not Faithfulness in RAG Attributions. ICTIR 2025 (SIGIR), Best Paper Honorable Mention. arxiv.org/abs/2412.18004
- Es S, James J, Espinosa-Anke L, Schockaert S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demo. Three reference-free dimensions: context relevance, faithfulness, answer relevance. arxiv.org/abs/2309.15217
- TruEra / TruLens contributors. The RAG Triad. TruLens framework documentation. Canonical three-metric triad: context relevance, groundedness, answer relevance. trulens.org/getting_started/core_concepts/rag_triad
- Niu C, Wu Y, Zhu J, et al. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. ACL 2024. ~18,000 word-level annotated responses. arxiv.org/abs/2401.00396
- Thakur N et al. "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust RAG (NoMIRACL). EMNLP Findings 2024. LLaMA-2 and Orca-2 hallucinate at >88% on non-relevant subsets across 18 languages. arxiv.org/abs/2312.11361
- Min S, Krishna K, et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. ChatGPT biography FActScore 58%; automated estimator <2% error vs human. arxiv.org/abs/2305.14251
- Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings 2024. 7,663 questions; up to +18% over CoT. arxiv.org/abs/2402.13178
- Zakka C, Shad R, Chaurasia A, et al. Almanac — Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI, 2024. 8 board-certified clinicians; 314 questions; 9 specialties. ai.nejm.org/doi/abs/10.1056/AIoa2300068
- Wu K et al. An automated framework for assessing how well LLMs cite relevant medical references (SourceCheckup). Nature Communications, 2025. 50–90% of LLM medical responses are not fully supported; ~30% of GPT-4o-with-Search statements unsupported. nature.com/articles/s41467-025-58551-6
- Chuang YS et al. Lookback Lens: Detecting and Mitigating Contextual Hallucinations Using Only Attention Maps. EMNLP 2024. Attention-ratio classifier reduces XSum hallucinations by 9.6%; transfers across model sizes. arxiv.org/abs/2407.07071