Oracle: An Evidence-Grounded Differential Diagnosis Agent with Citation-per-Claim Attribution
A structured H&P interview, a ranked DDx, and — for every clinical claim — a verifiable citation back to PubMed or MedlinePlus. Audited on the NEJM Case Records.
Abstract
Kanjee et al.[2] showed in JAMA (2023) that GPT-4 achieves 64% top-DDx-inclusive accuracy and 39% exact-top-diagnosis accuracy on 70 NEJM Clinicopathological Cases. Google's AMIE[1] matched or exceeded board-certified PCPs in a 159-scenario OSCE study of conversational diagnostic AI. Both systems share a common limitation: their answers are not attributable. Oracle takes the diagnostic-reasoning capability demonstrated in these systems and adds a citation-per-claim attribution layer, grounded in the Attributable-to-Identified-Sources (AIS) framework of Rashkin et al.[9] and the ALCE citation-evaluation protocol of Gao et al.[8]. The retrieval index reuses the MedRAG / MIRAGE corpora[6]; the evaluation reuses a held-out 30-case slice of the NEJM Case Records[2]. Pass criterion: top-3 DDx accuracy that matches or exceeds GPT-4's 64% top-DDx-inclusive baseline reported in Kanjee et al.[2], with ≥ 95% of clinical claims AIS-attributable.
§ 1 Introduction
A clinician shown a ranked differential diagnosis from a language model cannot trust what they cannot audit. Current systems present DDx as if it were authoritative — a numbered list with no provenance. This is the wrong design. Trust in clinical AI requires that every assertion be traceable to a source the clinician can verify in seconds.
The empirical case for diagnostic LLMs is strong and improving. AMIE[1] outperformed PCPs on diagnostic accuracy, management reasoning, and communication quality in a 20-condition OSCE study. Med-PaLM 2[3] reached 86.5% on MedQA via ensemble refinement. Med-Gemini[5] integrates uncertainty-guided web search into clinical responses. Liévin et al.[10] demonstrated that chain-of-thought plus retrieval substantially improves performance on MedQA, MedMCQA, and PubMedQA. The capability exists. The accountability does not.
1.1 Contributions
- A diagnostic agent that emits, for each ranked DDx entry, the supporting and refuting evidence with explicit citation IDs into a PubMed / MedlinePlus retrieval index.
- A scoring harness that combines top-K DDx accuracy on NEJM Case Records[2] with the AIS attribution rate of Rashkin et al.[9] and the citation-quality metrics of ALCE[8].
- An open implementation reusing MedRAG's MIRAGE corpus[6] so that the retrieval ablations from prior work transfer directly.
§ 2 Background and Related Work
2.1 Diagnostic Reasoning Benchmarks
The NEJM Clinicopathological Cases serve as the most demanding public diagnostic benchmark. Kanjee et al.[2] evaluated GPT-4 on 70 cases under a fixed prompt: the correct diagnosis appeared somewhere in GPT-4's differential in 64% of cases and was its top diagnosis in 39%, against a quoted historical clinician differential-inclusion rate of approximately 89%. We treat their evaluation methodology as canonical and use a held-out 30-case set drawn from later issues, with case-source year recorded for any future contamination audit.
2.2 Conversational Diagnostic Agents
AMIE[1] was trained via self-play in a simulated clinician-patient environment with three role-played LLMs (Patient Agent, Doctor Agent, Critic Agent). The Nature 2025 publication of the OSCE study reported AMIE as superior on 28 of 32 axes rated by specialists and 24 of 26 axes rated by patient-actors across 159 case scenarios. The follow-on Google Research work has gone in two directions. AMIE-Vision[11] (Tu et al., May 2025) extended AMIE to multimodal reasoning on Gemini 2.0 Flash and matched or beat PCPs on 29 of 32 clinical axes and 7 of 9 multimodal-specific criteria in a 105-scenario, 25-patient-actor OSCE. AMIE Longitudinal[12] (March 2025) extended the architecture to multi-visit disease management, matching or exceeding clinicians on investigations, prescriptions, and guideline adherence. Oracle's H&P-interview front-end follows AMIE's structured-elicitation pattern; the diagnostic ranking stage is informed by Med-PaLM 2's[3] ensemble-refinement procedure, in which three diverse chain-of-thought samples are reconciled into a final answer (the procedure behind the roughly 19-point MedQA gain to 86.5%).
Important counter-evidence comes from the same venue. NEJM AI's Script Concordance Test benchmark[13] evaluated 10 frontier LLMs (including GPT-4o, o1, o3, Claude 3.5 Sonnet, Gemini 2.5, DeepSeek R1, and Llama 3.3 70B) on 750 SCT items against 1,070 students, 193 residents, and 300 attendings: the LLMs performed markedly worse than on multiple-choice benchmarks, and CoT prompting hurt SCT scores. The NEJM AI Automation Bias RCT[14] showed that even AI-literate physicians exhibit automation bias under discretionary LLM consultation, a strong empirical argument for Oracle's evidence-grounding architecture over raw model recommendations. Medical hallucination remains substantial: Omar et al.'s adversarial-attack study in Communications Medicine (2025)[15] documented 50–82% hallucination rates across frontier models on adversarial clinical vignettes, with prompt-based mitigation lowering GPT-4o's rate from 53% to 23%.
2.3 Retrieval-Augmented Clinical Reasoning
MedRAG[6] provides the MIRAGE benchmark and a unified RAG toolkit indexing PubMed, StatPearls, and medical textbooks across roughly 60 million chunks. Almanac[7] (NEJM AI, 2024) demonstrated that grounding clinical answers in a curated corpus substantially reduces hallucination compared to ungrounded GPT-4. Oracle reuses the MIRAGE corpus directly so that retrieval-stack ablations from MedRAG transfer.
2.4 Citation-Faithful Generation
The Attributable-to-Identified-Sources framework of Rashkin et al.[9] formalises what it means for a generated statement to be "attributable": a claim is AIS-attributable if a competent human reader, given the cited source, would judge the claim as supported. ALCE[8] operationalises three measurable axes — fluency, correctness, and citation quality (recall and precision of citations). Oracle adopts AIS as the binary attribution metric and ALCE's citation-precision metric as the secondary measure.
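Concretely, and in our notation rather than that of Rashkin et al.[9] or Gao et al.[8], the two headline attribution numbers Oracle reports can be written as:

```latex
% C           : set of clinical claims Oracle emits for a case
% cite(c)     : set of passages Oracle attaches to claim c
% AIS(c, p)   : 1 if a competent reader, shown passage p, judges claim c supported; else 0
\[
  \text{AIS rate} = \frac{1}{|C|} \sum_{c \in C} \mathbf{1}\bigl[\exists\, p \in \mathrm{cite}(c) : \mathrm{AIS}(c, p) = 1\bigr]
\]
\[
  \text{citation precision} = \frac{\sum_{c \in C} \bigl|\{\, p \in \mathrm{cite}(c) : \mathrm{AIS}(c, p) = 1 \,\}\bigr|}{\sum_{c \in C} \bigl|\mathrm{cite}(c)\bigr|}
\]
```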
2.5 The Med-PaLM Lineage
Singhal et al.[4] established MultiMedQA and the per-axis human evaluation protocol (factuality, scientific consensus, possible harm, possible bias) that Oracle's manual audit reuses for a subset of the 30-case NEJM evaluation.
§ 3 Proposed Approach
3.1 Agent Architecture
Oracle is a three-stage tool-using agent. The stages run sequentially; tool calls are interleaved within each stage via the ReAct pattern.
- Stage 1 — Structured H&P Elicitation. The agent conducts a clinician-style intake: chief complaint, HPI, PMH, medications, allergies, social/family history, ROS. Elicitation follows a fixed-template prompt; the schema is mappable to a FHIR Encounter+Condition+Observation bundle so that downstream Oracle outputs are interoperable.
- Stage 2 — Ranked DDx Generation. The agent emits a ranked list of up to 10 candidate diagnoses with a one-line clinical rationale per entry. The ensemble-refinement procedure of Med-PaLM 2[3] is applied: three independent samples are generated and reconciled by a single follow-up pass.
- Stage 3 — Per-Claim Evidence Retrieval. For each rationale claim, the agent issues a retrieval query against MIRAGE[6]. Top-5 retrieved passages are scored for support (Yes/No/Insufficient); the highest-scoring supporting passage is attached as the citation. If no passage supports the claim, the claim is either dropped or rewritten with the available evidence.
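A minimal sketch of the Stage-3 loop follows. It assumes a MIRAGE-backed retriever exposing a search(query, k) method and an LLM judge exposing support_judgement(claim, passage); those interfaces, the evidence-ID scheme, and the dict keys are placeholders for this proposal, not MedRAG's actual API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    evidence_id: str  # e.g. "pubmed:<pmid>" or "medlineplus:<slug>"; the ID scheme is illustrative
    text: str

def ground_claim(claim_text, retriever, judge_llm, k=5):
    """Stage 3 for one rationale claim: retrieve top-k passages from the MIRAGE index,
    judge each for support, and attach the highest-ranked supporting passage as the citation."""
    for passage in retriever.search(claim_text, k=k):  # passages arrive in retrieval-score order
        verdict = judge_llm.support_judgement(claim_text, passage.text)  # "Yes" / "No" / "Insufficient"
        if verdict == "Yes":
            return {"text": claim_text, "evidence_id": passage.evidence_id, "support": "Yes"}
    return None  # no supporting passage found: the claim is dropped or sent back for rewriting

def ground_entry(entry, retriever, judge_llm):
    """Apply per-claim grounding to one ranked DDx entry; claims arrive from Stage 2 as strings
    and leave as cited-claim dicts (mirroring the JSON shape sketched in Section 3.2)."""
    grounded, dropped = [], []
    for claim_text in entry["claims"]:
        result = ground_claim(claim_text, retriever, judge_llm)
        if result is not None:
            grounded.append(result)
        else:
            dropped.append(claim_text)
    entry["claims"], entry["dropped_claims"] = grounded, dropped
    return entry
```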
3.2 Output Format
Each diagnostic response is structured as JSON.
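An illustrative sketch of the schema is shown below. The exact field names, the example diagnosis, and the evidence_id value are assumptions made for this proposal; what matters is the nesting described in § 3.1: ranked entries, a one-line rationale decomposed into claims, and each claim carrying an evidence_id and a support judgement.

```json
{
  "case_id": "nejm-cpc-2024-XX",
  "ddx": [
    {
      "rank": 1,
      "diagnosis": "Granulomatosis with polyangiitis",
      "rationale": "Sinonasal disease with pulmonary nodules and an active urinary sediment.",
      "claims": [
        {
          "text": "The combination of sinonasal disease, pulmonary nodules, and glomerulonephritis is characteristic of GPA.",
          "evidence_id": "pubmed:00000000",
          "support": "Yes"
        }
      ],
      "dropped_claims": []
    }
  ]
}
```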
The format makes attribution machine-checkable: a downstream evaluator can re-query the retrieval index by evidence_id, confirm the passage exists, and run a secondary AIS adjudication.
3.3 Chain-of-Evidence Prompting
Liévin et al.[10] demonstrated that chain-of-thought combined with retrieval yields a measurable accuracy gain on MedQA/MedMCQA/PubMedQA — and that requiring evidence citation in the chain further reduces hallucination. Oracle's Stage-3 prompt template makes this requirement explicit: the model is asked to draft its reasoning trace as a sequence of (claim, retrieved evidence, judgement) triples rather than free text.
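A sketch of what such a Stage-3 prompt could look like; the wording below is ours, not a template taken from Liévin et al.[10]:

```python
# Hypothetical chain-of-evidence prompt: the model must emit (claim, evidence, judgement)
# triples instead of free-text reasoning. {claims} and {passages} are filled per DDx entry.
CHAIN_OF_EVIDENCE_PROMPT = """\
You are grounding a diagnostic rationale. For EACH claim below, output exactly one triple:

CLAIM: <the claim, restated verbatim>
EVIDENCE: <evidence_id plus a short quote from the retrieved passage, or "none found">
JUDGEMENT: Yes | No | Insufficient  (does the quoted evidence support the claim?)

Output nothing outside these triples. Claims whose JUDGEMENT is not "Yes" will be
dropped or rewritten in a later pass.

Claims:
{claims}

Retrieved passages (each prefixed with its evidence_id):
{passages}
"""
```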
§ 4 Evaluation Protocol
4.1 Dataset
A 30-case held-out set is drawn from NEJM Clinicopathological Cases published after the most recent training cutoff of all evaluated frontier models. Each case is reduced to a structured H&P input prior to evaluation (mirroring the methodology of Kanjee et al.[2]) so that the model is not given the discussion section.
4.2 Metrics
| Metric | Definition | Source |
|---|---|---|
| Top-1 accuracy | Final diagnosis matches gold. | Kanjee et al.[2] |
| Top-3 inclusive | Gold diagnosis appears in top 3 of DDx. | Kanjee et al.[2] |
| AIS attribution rate | Fraction of claims judged as supported by their cited evidence. | Rashkin et al.[9] |
| ALCE citation precision | Fraction of cited passages that actually support the claim. | Gao et al.[8] |
| Hallucinated-claim rate | Fraction of claims citing non-existent or unsupported evidence. | Almanac[7] |
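A minimal sketch of how the harness can compute these metrics from the § 3.2 JSON, assuming a gold-diagnosis matcher in the spirit of Kanjee et al.[2], an index lookup that resolves an evidence_id to its passage (or to None), and a binary AIS adjudicator; all names are placeholders:

```python
def score_case(response, gold_diagnosis, matches, index_lookup, ais_judge):
    """Compute the Section 4.2 metrics for one case from Oracle's JSON response.
    matches(dx, gold) -> bool      : diagnosis-equivalence check (exact/synonym, per Kanjee et al.)
    index_lookup(evidence_id)      : returns the cited passage text, or None if the ID does not resolve
    ais_judge(claim_text, passage) : binary AIS judgement (human or adjudicator model)."""
    ddx = response["ddx"]
    top1 = bool(ddx) and matches(ddx[0]["diagnosis"], gold_diagnosis)
    top3 = any(matches(entry["diagnosis"], gold_diagnosis) for entry in ddx[:3])

    claims = [claim for entry in ddx for claim in entry["claims"]]
    supported = hallucinated = 0
    for claim in claims:
        passage = index_lookup(claim["evidence_id"])   # re-query the index by evidence_id
        if passage is None or not ais_judge(claim["text"], passage):
            hallucinated += 1                          # non-existent or unsupported citation
        else:
            supported += 1

    n_claims = max(len(claims), 1)
    return {
        "top1": top1,
        "top3_inclusive": top3,
        "ais_attribution_rate": supported / n_claims,
        "citation_precision": supported / n_claims,    # identical here: one citation per claim
        "hallucinated_claim_rate": hallucinated / n_claims,
    }
```

Because Oracle attaches exactly one citation per claim (§ 3.1, Stage 3), the AIS attribution rate and citation precision coincide in this sketch; they diverge once a claim may carry multiple citations, which is why both appear in the table.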
4.3 Baselines
Oracle is compared against (i) GPT-4 with the Kanjee et al.[2] prompt (their original 64%/39% result), (ii) Claude Opus 4.7 without retrieval, (iii) the Almanac architecture[7] reproduced on MIRAGE, and (iv) Oracle without the per-claim citation requirement (an ablation). The fourth baseline isolates the contribution of structured attribution.
§ 5 Expected Contributions
- System. A public reference implementation of a citation-faithful diagnostic agent, runnable against any NEJM-style case.
- Methodology. The first composite metric suite that combines diagnostic accuracy (Kanjee), attribution (AIS), and citation quality (ALCE) for clinical reasoning.
- Empirical finding. Quantification of the accuracy cost (if any) of imposing a 95% AIS-attribution constraint on diagnostic generation.
§ 6 Limitations and Risks
Oracle's attribution layer is only as good as its retrieval corpus. MIRAGE[6] is comprehensive but not exhaustive — niche conditions and recent literature may be under-represented, biasing Oracle against rare diagnoses. The system also imposes computational overhead: per-claim retrieval and support judgement roughly triples token consumption compared to ungrounded DDx generation, which has operational cost implications a deployment must accept.
A second concern: NEJM Case Records are deliberately rare-and-instructive presentations. Performance on them is informative but does not generalise directly to bread-and-butter ambulatory or ED presentations. A v0.2 extension should add an ambulatory eval set, likely drawn from MedAlign-style clinician-written prompts over MIMIC-IV records.
§ 7 Conclusion
Oracle bets that the next-decade-defining capability in clinical AI is not raw accuracy but accountability. The capability to produce a top-3 DDx is now broadly available; the capability to produce a top-3 DDx where every clinical claim resolves to a verifiable citation is not. Oracle is the simplest end-to-end demonstration of that capability, built on previously validated components — NEJM evaluation[2], MIRAGE retrieval[6], AIS attribution[9], ALCE measurement[8] — combined for the first time in a single open system.
References
[1] Tu T, Palepu A, Schaekermann M, et al. Towards Conversational Diagnostic AI (AMIE). Google, 2024. arxiv.org/abs/2401.05654
[2] Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA, 2023. jamanetwork.com/journals/jama/fullarticle/2806457
[3] Singhal K, Tu T, Gottweis J, et al. Toward Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2). Nature Medicine, 2025. arxiv.org/abs/2305.09617
[4] Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge. Nature, 620:172–180, 2023. arxiv.org/abs/2212.13138
[5] Saab K, Tu T, Weng WH, et al. Capabilities of Gemini Models in Medicine (Med-Gemini). Google, 2024. arxiv.org/abs/2404.18416
[6] Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings, 2024. arxiv.org/abs/2402.13178
[7] Zakka C, Shad R, Chaurasia A, et al. Almanac — Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI, 2024. ai.nejm.org/doi/abs/10.1056/AIoa2300068
[8] Gao T, Yen H, Yu J, Chen D. Enabling Large Language Models to Generate Text with Citations (ALCE). EMNLP, 2023. arxiv.org/abs/2305.14627
[9] Rashkin H, Nikolaev V, Lamm M, et al. Measuring Attribution in Natural Language Generation Models (AIS). Computational Linguistics, 2023. arxiv.org/abs/2112.12870
[10] Liévin V, Hother CE, Motzfeldt AG, Winther O. Can Large Language Models Reason About Medical Questions? Patterns (Cell), 2024. arxiv.org/abs/2207.08143
[11] Tu T, Palepu A, Schaekermann M, et al. Advancing Conversational Diagnostic AI with Multimodal Reasoning (AMIE-Vision). Google DeepMind, May 2025. AMIE on Gemini 2.0 Flash matched or beat PCPs on 29 of 32 axes and 7 of 9 multimodal-specific criteria in a 105-scenario, 25-actor OSCE. arxiv.org/abs/2505.04653
[12] Google Research. From diagnosis to treatment: Advancing AMIE for longitudinal disease management. Research blog post, March 2025. AMIE matched or exceeded clinicians on multi-visit management reasoning (investigations, prescriptions, guideline adherence). research.google/blog/from-diagnosis-to-treatment-advancing-amie-for-longitudinal-disease-management
[13] NEJM AI editorial team. Assessment of Large Language Models in Clinical Reasoning (Script Concordance Test benchmark). NEJM AI, 2025. 10 frontier LLMs on 750 SCT items vs 1,070 students, 193 residents, and 300 attendings; CoT prompting hurt SCT scores. ai.nejm.org/doi/full/10.1056/AIdbp2500120
[14] NEJM AI editorial team. Automation Bias in LLM-Assisted Diagnostic Reasoning: A Randomized Controlled Trial. NEJM AI, 2025. AI-literate physicians exhibited automation bias when LLM consultation was discretionary; a direct argument for evidence-grounded recommendations. ai.nejm.org/doi/full/10.1056/AIoa2501001
[15] Omar M, et al. Multi-model assurance analysis: large language models are highly vulnerable to adversarial hallucination attacks during clinical decision-support. Communications Medicine, 2025. Adversarial-vignette study; hallucination rates of 50–82% across frontier models; prompt mitigation lowered GPT-4o from 53% to 23%. nature.com/articles/s43856-025-01021-3
[16] Fansi Tchango A, Goel R, Wen Z, Martel J, Ghosn J. DDXPlus: A new dataset for medical automatic diagnosis. NeurIPS Datasets and Benchmarks, 2022. ~1.3M synthetic patient cases with ground-truth pathology and full differential; a canonical DDx training/eval benchmark. arxiv.org/abs/2205.09148
[17] Wallat C, et al. Correctness is not Faithfulness in RAG Attributions. SIGIR ICTIR, 2025. Distinguishes citation correctness (does the document support the claim) from faithfulness (did the model rely on the document, or post-rationalise); a successor framing to ALCE. arxiv.org/abs/2412.18004