Chaincite: A Clinical RAG Benchmark for Retrieval Quality and Citation Faithfulness
Two questions vanilla RAG metrics conflate: does the retrieved passage support the claim, and did the model actually use it? Built on the correctness-vs-faithfulness distinction.
Abstract. Clinical RAG systems routinely produce answers that look cited but are not actually grounded in the cited passage. ALCE[1] reported that even the best LLMs on ELI5 lack complete citation support 50% of the time. SourceCheckup[11] (Wu et al., Nature Communications 2025) found that 50–90% of LLM medical responses are not fully supported by their cited sources, with ~30% of GPT-4o-with-Search statements entirely unsupported. Wallat et al.'s ICTIR 2025 best-paper work[3] formalised the deeper problem: correctness is necessary but insufficient for faithfulness — models can post-rationalise citations they didn't actually use. Chaincite is a clinical RAG benchmark that measures both axes. It builds on the AIS attribution framework[2], the RAGAS[4] reference-free evaluation triad, and the TruLens RAG triad[5]; uses RAGTruth[6] and NoMIRACL[7] as methodological precedents; and benchmarks clinical results against MedRAG / MIRAGE[9] and Almanac[10]. Pass criterion: 500 clinical questions with attribution-graded responses across at least 5 RAG configurations, and a documented gap between correctness and faithfulness.
§ 1 Introduction
A clinical RAG response with a citation is not the same as a clinical RAG response grounded in its citation. Three findings from the last 18 months establish how serious the gap is. ALCE[1] measured citation support on ELI5: even GPT-4 with the best retrieval setup lacked complete citation support 50% of the time. SourceCheckup[11] extended this to medical questions and found 50–90% of LLM medical responses are not fully supported by their cited sources; ~30% of GPT-4o-with-Search statements are entirely unsupported. Wallat et al.[3] made the conceptual distinction precise — citation correctness (does this passage support this claim?) is operationally different from citation faithfulness (did the model actually use this passage, or write the claim from its parametric knowledge and tack a citation on?).
Clinical RAG benchmarks today conflate the two. MedRAG[9] measures answer accuracy; Almanac[10], evaluated by 8 board-certified clinicians on 314 questions, reports gains in factuality and adversarial safety. Neither separately measures whether the model relied on what it cited. Chaincite fills that gap.
1.1 Contributions
- A 500-question clinical RAG benchmark covering retrieval-anchored, multi-hop, and global-style queries.
- A scoring protocol that evaluates both AIS[2] citation correctness and Wallat-style[3] citation faithfulness — the first benchmark to do both for clinical RAG.
- Reference RAG configurations spanning vanilla MedRAG[9], Almanac[10], GraphCore (Project 17), and a long-context-only baseline.
§ 2 Background and Related Work
2.1 The Correctness/Faithfulness Distinction
Wallat, Heuss, de Rijke & Anand's ICTIR 2025 paper[3] (Best Paper Honorable Mention) is the load-bearing conceptual contribution. They show empirically that LLMs can produce citations whose passage does support the claim while the model did not causally rely on that passage. Standard correctness metrics — including ALCE's[1] citation precision/recall and the AIS framework[2] — cannot detect this. Faithfulness measurement requires either attention-based proxies (Lookback Lens[12]: attention-ratio classifiers cut XSum hallucinations by 9.6%) or counterfactual retrieval (does the model produce the same claim if the passage is replaced?). Chaincite uses both.
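A minimal sketch of the counterfactual-retrieval probe's logic (not Wallat et al.'s implementation), assuming two hypothetical helpers: a `generate(question, passages)` wrapper around the system under test and a `same_claim` judge, which could be an NLI model or an LLM:

```python
# Counterfactual-stability probe (illustrative sketch, not Wallat et al.'s
# implementation). Hypothetical helpers:
#   generate(question, passages) -> answer text
#   same_claim(a, b) -> bool, e.g. an NLI model or LLM judge
import random

def citation_post_rationalised(question, cited_passage, distractors,
                               generate, same_claim, n_trials=3):
    """True if the model reproduces the same claim even when the cited
    passage is swapped for unrelated distractors, i.e. evidence the
    citation was attached after the fact rather than causally used."""
    with_passage = generate(question, [cited_passage])
    for _ in range(n_trials):
        without = generate(question, [random.choice(distractors)])
        if not same_claim(with_passage, without):
            return False  # the claim changed: the passage plausibly mattered
    return True  # the claim survived every swap: faithfulness violation suspected
```

Note that this probe needs retrieval-time inputs but not model weights, which is why (per § 6) it extends to some closed systems where attention-based probes do not.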
2.2 The AIS Lineage
Rashkin et al.'s AIS framework[2] (Computational Linguistics 2023) defines "Attributable to Identified Sources": a claim is AIS-supported if a competent reader, given the cited source, would judge it supported. ALCE[1] operationalises AIS for long-form QA with citation-precision and citation-recall metrics. Chaincite inherits AIS as the correctness axis and extends it with Wallat-style[3] faithfulness measurement.
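Both ALCE metrics reduce to per-(statement, citation) entailment checks. A simplified sketch of that reduction, where `entails(premise, hypothesis)` is a hypothetical NLI wrapper and ALCE's credit for citations that support a statement only jointly is omitted:

```python
# ALCE-style citation precision / recall (simplified sketch; the full
# definition also credits citations that entail a statement only jointly).
# Hypothetical helper: entails(premise, hypothesis) -> bool, an NLI wrapper.

def citation_metrics(statements, entails):
    """statements: [{"text": str, "citations": [passage_text, ...]}, ...]"""
    supported, citation_pairs, relevant_pairs = 0, 0, 0
    for s in statements:
        if s["citations"] and entails(" ".join(s["citations"]), s["text"]):
            supported += 1  # recall: statement fully backed by its citations
        for passage in s["citations"]:
            citation_pairs += 1
            relevant_pairs += int(entails(passage, s["text"]))  # precision: this citation pulls its weight
    recall = supported / len(statements) if statements else 0.0
    precision = relevant_pairs / citation_pairs if citation_pairs else 0.0
    return precision, recall
```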
2.3 Reference-Free RAG Evaluation
RAGAS (Es et al., EACL 2024 Demo)[4] defines three reference-free metrics: context relevance, faithfulness, answer relevance. TruLens[5] codified the canonical RAG triad of context relevance, groundedness, and answer relevance with LLM-as-judge scoring. Chaincite uses RAGAS as a reference-free baseline scorer alongside the more rigorous AIS-plus-faithfulness protocol.
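The RAGAS faithfulness score reduces to a two-step chain: decompose the answer into atomic statements, then verify each against the retrieved context. A sketch of that logic with hypothetical LLM-backed helpers (illustrative, not the ragas package API):

```python
# RAGAS-style reference-free faithfulness (sketch of the metric's logic,
# not the ragas package API). Hypothetical LLM-backed helpers:
#   extract_statements(answer) -> list[str]   (decomposition prompt)
#   supported_by(statement, context) -> bool  (verification prompt)

def ragas_faithfulness(answer, context, extract_statements, supported_by):
    """Fraction of the answer's atomic statements that the retrieved
    context supports; 1.0 means every claim is grounded."""
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    return sum(supported_by(s, context) for s in statements) / len(statements)
```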
2.4 Hallucination Benchmarks
RAGTruth (Niu et al., ACL 2024)[6] released ~18,000 word-level hallucination annotations across multiple LLMs — the largest open RAG-hallucination corpus. NoMIRACL (Thakur et al., EMNLP Findings 2024)[7] documents that LLaMA-2 and Orca-2 hallucinate at >88% on non-relevant subsets across 18 languages; GPT-4 had the best tradeoff. FActScore (Min et al., EMNLP 2023)[8] reported ChatGPT's biography-FActScore at 58% with an automated estimator showing <2% error vs human. Chaincite uses RAGTruth's annotation methodology and FActScore's automated estimator as cross-checks.
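FActScore's estimator has the same decompose-and-verify shape, but checks each atomic fact against passages retrieved from a trusted corpus rather than against the generation-time context. A sketch with hypothetical `atomic_facts`, `retrieve`, and `entails` helpers:

```python
# FActScore-style automated estimator (sketch). Same decompose-and-verify
# shape as the RAGAS sketch above, but each atomic fact is verified against
# passages retrieved from a trusted corpus, not the generation-time context.
# Hypothetical helpers: atomic_facts(text) -> list[str],
# retrieve(query, k) -> list[str], entails(premise, hypothesis) -> bool.

def factscore(answer, atomic_facts, retrieve, entails, k=5):
    """Fraction of atomic facts supported by the knowledge source."""
    facts = atomic_facts(answer)
    if not facts:
        return 0.0
    supported = sum(int(entails(" ".join(retrieve(f, k)), f)) for f in facts)
    return supported / len(facts)
```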
2.5 Clinical RAG State of the Art
MIRAGE / MedRAG[9] covers 7,663 questions across 5 medical QA datasets, with MedRAG improving accuracy by up to 18% over CoT. Almanac[10] (Zakka et al., NEJM AI 2024) was clinician-evaluated on 314 questions across 9 specialties and showed factuality and adversarial-safety gains over GPT-4 / Bing / Bard — but the published evaluation does not separately measure citation faithfulness. SourceCheckup[11] is the closest existing faithfulness audit; Chaincite extends it into a reproducible benchmark.
§ 3 Proposed Approach
3.1 Question Set
Five hundred questions in three strata: 200 focused-fact (single-passage answer), 200 multi-hop (requires synthesising 2+ passages), 100 global-style (requires synthesis across themes — these align with the GraphCore questions in Project 17). Source: physician-curated questions seeded from MedRAG's MIRAGE[9] distribution, with explicit physician-validated gold passages for the focused-fact subset.
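For concreteness, one possible record schema for a benchmark item follows; the field names are illustrative assumptions, not a published format:

```python
# Illustrative item schema (field names are assumptions, not a published format).
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ChainciteItem:
    qid: str
    question: str
    stratum: Literal["focused_fact", "multi_hop", "global"]
    gold_passage_ids: list[str] = field(default_factory=list)  # physician-validated; focused-fact stratum only
    source: str = "MIRAGE"  # seeded from the MedRAG / MIRAGE distribution [9]
```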
3.2 RAG Systems Under Test
Six baseline configurations are evaluated head-to-head: (1) GPT-5 with vanilla vector RAG over the MIRAGE corpora; (2) Claude with the same retrieval stack; (3) the MedRAG reference implementation[9]; (4) the Almanac architecture[10]; (5) GraphCore (Project 17 of this dossier); (6) a long-context-only (no retrieval) baseline on Claude with a 1M-token context. Results are reported per system on both axes.
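A sketch of the head-to-head harness, assuming each configuration is wrapped as a callable returning an answer with its citations, and that the § 4 metrics are supplied as scoring callables; it also computes the headline per-system correctness–faithfulness gap:

```python
# Head-to-head harness (sketch). `systems` maps a label to a hypothetical
# answer(question) -> (text, citations) callable wrapping one of the six
# configurations; the scoring callables are the section-4 metrics.

def run_benchmark(items, systems, score_correctness, score_faithfulness):
    results = {}
    for label, answer in systems.items():
        corr, faith = [], []
        for item in items:
            text, citations = answer(item.question)
            corr.append(score_correctness(text, citations))
            faith.append(score_faithfulness(item.question, text, citations))
        c, f = sum(corr) / len(corr), sum(faith) / len(faith)
        results[label] = {"correctness": c, "faithfulness": f, "gap": c - f}
    return results
```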
§ 4 Evaluation Protocol
| Metric | Source | Reports |
|---|---|---|
| AIS attribution rate | Rashkin et al. 2023[2] | Binary per claim; aggregated |
| ALCE citation precision / recall | Gao et al. 2023[1] | Standard ALCE |
| RAGAS faithfulness | Es et al. 2024[4] | Reference-free LLM-as-judge |
| Counterfactual stability | Wallat-inspired[3] | Does claim change when passage swapped? |
| Attention-ratio faithfulness | Chuang et al. 2024[12] | Lookback Lens score (where applicable; sketched below) |
| Correctness–faithfulness gap | Chaincite headline | Per-system correctness score minus faithfulness score |
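Of these metrics, the attention-ratio probe is the least standard, so a minimal sketch follows. It assumes an open-weights Hugging Face causal LM (per § 6, closed models do not expose attention); in Chuang et al.[12] the per-head ratios feed a simple linear classifier:

```python
# Lookback-Lens-style attention-ratio features, following Chuang et al.[12]:
# for each head, the share of a newly generated token's attention mass that
# lands on the provided context versus on tokens generated so far.
# Assumes an open-weights Hugging Face causal LM; the model may need to be
# loaded with attn_implementation="eager" so attentions are returned.
import torch

@torch.no_grad()
def lookback_ratios(model, tokenizer, prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    ctx_len = inputs["input_ids"].shape[1]
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         output_attentions=True, return_dict_in_generate=True)
    feats = []
    for step in out.attentions:               # one tuple of layers per new token
        per_head = []
        for layer in step:                    # (batch, heads, q_len, keys_so_far)
            attn = layer[0, :, -1, :]         # the new token's attention row, per head
            on_ctx = attn[:, :ctx_len].sum(-1)   # mass on the provided context
            on_new = attn[:, ctx_len:].sum(-1)   # mass on generated-so-far tokens
            per_head.append(on_ctx / (on_ctx + on_new + 1e-9))
        feats.append(torch.cat(per_head))     # (layers * heads,) vector per token
    return torch.stack(feats)                 # first row is trivially ~1.0: the
                                              # first new token sees only context
```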
§ 5 Expected Contributions
- Benchmark. A 500-question clinical RAG benchmark scored on both correctness and faithfulness, the first to do so for clinical RAG.
- Empirical finding. Per-system correctness-vs-faithfulness gaps for the six configurations of § 3.2 (five RAG systems plus a long-context baseline).
- Methodology. A reproducible faithfulness-measurement protocol combining counterfactual retrieval and Lookback Lens-style attention analysis.
§ 6 Limitations and Risks
Faithfulness measurement is harder than correctness. Counterfactual-retrieval probes require access to retrieval-time inputs, which closed APIs do not always provide; attention-based probes require weights, which closed models do not expose. Chaincite reports faithfulness scores for open systems unconditionally and for closed systems where input-only counterfactual probing is feasible. The 500-question scale is intentionally modest — physician validation does not scale linearly, and we prefer a high-quality smaller set to a noisy larger one.
§ 7 Conclusion
Chaincite tests whether clinical RAG citations are real or theatrical. The conceptual basis is Wallat et al.[3]; the operationalisation is novel for the clinical domain. The expected output is the first quantitative measurement of the gap between citation correctness and citation faithfulness in frontier clinical RAG — exactly the number a deploying hospital should be asking for and currently cannot get.
References
- Gao T, Yen H, Yu J, Chen D. Enabling Large Language Models to Generate Text with Citations (ALCE). EMNLP, 2023. Best LLMs lack complete citation support 50% of the time on ELI5. arxiv.org/abs/2305.14627
- Rashkin H, Nikolaev V, Lamm M, et al. Measuring Attribution in Natural Language Generation Models (AIS). Computational Linguistics 49(4):777–840, 2023. aclanthology.org/2023.cl-4.2
- Wallat J, Heuss M, de Rijke M, Anand A. Correctness is not Faithfulness in RAG Attributions. ICTIR 2025 (SIGIR), Best Paper Honorable Mention. arxiv.org/abs/2412.18004
- Es S, James J, Espinosa-Anke L, Schockaert S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demo. Three reference-free dimensions: context relevance, faithfulness, answer relevance. arxiv.org/abs/2309.15217
- TruEra / TruLens contributors. The RAG Triad. TruLens framework documentation. Canonical three-metric triad: context relevance, groundedness, answer relevance. trulens.org/getting_started/core_concepts/rag_triad
- Niu C, Wu Y, Zhu J, et al. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. ACL 2024. ~18,000 word-level annotated responses. arxiv.org/abs/2401.00396
- Thakur N et al. "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust RAG (NoMIRACL). EMNLP Findings 2024. LLaMA-2 and Orca-2 hallucinate at >88% on non-relevant subsets across 18 languages. arxiv.org/abs/2312.11361
- Min S, Krishna K, et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. ChatGPT biography FActScore 58%; automated estimator <2% error vs human. arxiv.org/abs/2305.14251
- Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings 2024. 7,663 questions; up to +18% over CoT. arxiv.org/abs/2402.13178
- Zakka C, Shad R, Chaurasia A, et al. Almanac — Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI, 2024. 8 board-certified clinicians; 314 questions; 9 specialties. ai.nejm.org/doi/abs/10.1056/AIoa2300068
- Wu K et al. An automated framework for assessing how well LLMs cite relevant medical references (SourceCheckup). Nature Communications, 2025. 50–90% of LLM medical responses are not fully supported; ~30% of GPT-4o-with-Search statements unsupported. nature.com/articles/s41467-025-58551-6
- Chuang YS et al. Lookback Lens: Detecting and Mitigating Contextual Hallucinations Using Only Attention Maps. EMNLP 2024. Attention-ratio classifier reduces XSum hallucinations by 9.6%; transfers across model sizes. arxiv.org/abs/2407.07071