Paper 05 / 10 · Preliminary Manuscript · v0.1 · May 2026
Dossier №01 · Project 05 · Oracle

Oracle: An Evidence-Grounded Differential Diagnosis Agent with Citation-per-Claim Attribution

A structured H&P interview, a ranked DDx, and — for every clinical claim — a verifiable citation back to PubMed or MedlinePlus. Audited on the NEJM Case Records.

Abstract

Kanjee et al.[2] showed in JAMA (2023) that GPT-4 achieves 64% top-DDx-inclusive accuracy and 39% exact-top-diagnosis accuracy on 70 NEJM Clinicopathological Cases. Google's AMIE[1] matched or exceeded board-certified PCPs in a 159-scenario OSCE study of conversational diagnostic AI. Both systems share a common limitation: their answers are not attributable. Oracle takes the diagnostic-reasoning capability demonstrated in these systems and adds a citation-per-claim attribution layer, grounded in the Attributable-to-Identified-Sources (AIS) framework of Rashkin et al.[9] and the ALCE citation-evaluation protocol of Gao et al.[8] The retrieval index reuses the MedRAG / MIRAGE corpora[6]; the evaluation reuses a held-out 30-case slice of the NEJM Case Records[2]. Pass criterion: top-3 DDx accuracy that matches or exceeds GPT-4's 64% top-DDx-inclusive baseline reported by Kanjee et al.[2], with ≥ 95% of clinical claims AIS-attributable.

§ 1 Introduction

A clinician shown a ranked differential diagnosis from a language model cannot trust what they cannot audit. Current systems present DDx as if it were authoritative — a numbered list with no provenance. This is the wrong design. Trust in clinical AI requires that every assertion be traceable to a source the clinician can verify in seconds.

The empirical case for diagnostic LLMs is strong and improving. AMIE[1] outperformed PCPs on diagnostic accuracy, management reasoning, and communication quality in a 20-condition OSCE study. Med-PaLM 2[3] reached 86.5% on MedQA via ensemble refinement. Med-Gemini[5] integrates uncertainty-guided web search into clinical responses. Liévin et al.[10] demonstrated that chain-of-thought plus retrieval substantially improves performance on MedQA, MedMCQA, and PubMedQA. The capability exists. The accountability does not.

1.1 Contributions

  1. A diagnostic agent that emits, for each ranked DDx entry, the supporting and refuting evidence with explicit citation IDs into a PubMed / MedlinePlus retrieval index.
  2. A scoring harness that combines top-K DDx accuracy on NEJM Case Records[2] with the AIS attribution rate of Rashkin et al.[9] and the citation-quality metrics of ALCE[8].
  3. An open implementation reusing MedRAG's MIRAGE corpus[6] so that the retrieval ablations from prior work transfer directly.

§ 2 Background and Related Work

2.1 Diagnostic Reasoning Benchmarks

The NEJM Clinicopathological Cases serve as the most demanding public diagnostic benchmark. Kanjee et al.[2] evaluated GPT-4 on 70 cases under a fixed prompt: the correct diagnosis appeared in the model's differential in 64% of cases and was its top diagnosis in 39%, against a quoted historical clinician DDx-inclusion rate of approximately 89%. We treat their evaluation methodology as canonical and use a held-out 30-case set drawn from later issues, with case-source year recorded for any future contamination audit.

2.2 Conversational Diagnostic Agents

AMIE[1] was trained via self-play in a simulated clinician-patient environment with three role-played LLMs (Patient Agent, Doctor Agent, Critic Agent). In the Nature 2025 publication of the OSCE study, AMIE was rated superior on 28 of 32 axes by specialist physicians and on 24 of 26 axes by patient-actors across 159 case scenarios. The follow-on Google Research work has gone in two directions. AMIE-Vision[11] (Tu et al., May 2025) extended AMIE to multimodal reasoning on Gemini 2.0 Flash and matched or beat PCPs on 29 of 32 clinical axes and 7 of 9 multimodal-specific criteria in a 105-scenario, 25-patient-actor OSCE. AMIE Longitudinal[12] (March 2025) extended the architecture to multi-visit disease management, matching or exceeding clinicians on investigations, prescriptions, and guideline adherence. Oracle's H&P-interview front-end follows AMIE's structured-elicitation pattern; the diagnostic-ranking stage is informed by Med-PaLM 2's[3] ensemble-refinement procedure (three diverse chain-of-thought samples reconciled into a final answer, the procedure behind the 19-point MedQA gain to 86.5%).

Important counter-evidence from the same venue: NEJM AI's Script Concordance Test benchmark[13] evaluated 10 frontier LLMs (GPT-4o, o1, o3, Claude 3.5 Sonnet, Gemini 2.5, DeepSeek R1, Llama 3.3 70B) on 750 SCT items against 1,070 students, 193 residents, and 300 attendings — LLMs performed markedly worse than on multiple-choice benchmarks, and CoT prompting hurt SCT scores. The NEJM AI Automation Bias RCT[14] showed that even AI-literate physicians exhibit automation bias under discretionary LLM consultation — a strong empirical argument for Oracle's evidence-grounding architecture rather than raw model recommendations. Medical hallucination remains substantial: Omar et al.'s adversarial-attack study in Communications Medicine (2025)[15] documented 50–82% hallucination rates across frontier models on adversarial clinical vignettes, with prompt-based mitigation lowering GPT-4o's rate from 53% to 23%.

2.3 Retrieval-Augmented Clinical Reasoning

MedRAG[6] provides the MIRAGE benchmark and a unified RAG toolkit indexing PubMed, StatPearls, and medical textbooks across roughly 60 million chunks. Almanac[7] (NEJM AI, 2024) demonstrated that grounding clinical answers in a curated corpus substantially reduces hallucination compared to ungrounded GPT-4. Oracle reuses the MIRAGE corpus directly so that retrieval-stack ablations from MedRAG transfer.

2.4 Citation-Faithful Generation

The Attributable-to-Identified-Sources framework of Rashkin et al.[9] formalises what it means for a generated statement to be "attributable": a claim is AIS-attributable if a competent human reader, given the cited source, would judge the claim as supported. ALCE[8] operationalises three measurable axes — fluency, correctness, and citation quality (recall and precision of citations). Oracle adopts AIS as the binary attribution metric and ALCE's citation-precision metric as the secondary measure.

2.5 The Med-PaLM Lineage

Singhal et al.[4] established MultiMedQA and the per-axis human evaluation protocol — factuality, scientific consensus, possible harm, possible bias — that Oracle's manual audit reuses for an audited subset of the 30-case NEJM evaluation.

§ 3 Proposed Approach

3.1 Agent Architecture

Oracle is a three-stage tool-using agent. The stages run sequentially; tool calls are interleaved within each stage via the ReAct pattern.

  1. Stage 1 — Structured H&P Elicitation. The agent conducts a clinician-style intake: chief complaint, HPI, PMH, medications, allergies, social/family history, ROS. Elicitation follows a fixed-template prompt; the schema is mappable to a FHIR Encounter+Condition+Observation bundle so that downstream Oracle outputs are interoperable.
  2. Stage 2 — Ranked DDx Generation. The agent emits a ranked list of up to 10 candidate diagnoses with a one-line clinical rationale per entry. The ensemble-refinement procedure of Med-PaLM 2[3] is applied: three independent samples are generated and reconciled by a single follow-up pass.
  3. Stage 3 — Per-Claim Evidence Retrieval. For each rationale claim, the agent issues a retrieval query against MIRAGE[6]. Top-5 retrieved passages are scored for support (Yes/No/Insufficient); the highest-scoring supporting passage is attached as the citation. If no passage supports the claim, the claim is either dropped or rewritten with the available evidence.
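The Stage-3 loop above can be sketched in a few lines. This is a minimal illustration, not Oracle's implementation: `search_mirage` and `judge_support` are hypothetical stand-ins for the MIRAGE retriever and the LLM support judge, injected as callables so only the control flow is specified.

```python
# Sketch of the per-claim evidence loop: retrieve top-k passages,
# judge each for support, attach the highest-ranked supporting one.
from dataclasses import dataclass

@dataclass
class Passage:
    evidence_id: str   # e.g. "PMID:30315098"
    text: str
    support: str       # "SUPPORTS" / "REFUTES" / "INSUFFICIENT"

def attach_citation(claim, search_mirage, judge_support, k=5):
    """Return the best supporting Passage for `claim`, or None when no
    retrieved passage supports it (the claim is then dropped or rewritten)."""
    judged = [Passage(p["id"], p["text"], judge_support(claim, p["text"]))
              for p in search_mirage(claim, top_k=k)]   # top-5 retrieval
    supporting = [p for p in judged if p.support == "SUPPORTS"]
    return supporting[0] if supporting else None        # highest-ranked hit
```

Because the retriever and judge are injected, the same skeleton works for retrieval-stack ablations without touching the agent logic.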
Figure 1 · Oracle three-stage agent flow
Figure 1. Oracle's three-stage pipeline. (1) Structured H&P intake follows AMIE's[1] elicitation template. (2) Ranked DDx generation uses Med-PaLM 2's[3] ensemble refinement procedure (three diverse chain-of-thought samples reconciled into a final answer) which produced the +19-point gain on MedQA to 86.5%. (3) Each rationale claim issues a retrieval query against the MIRAGE corpus[6] (PubMed 23.9M abstracts, StatPearls, textbooks); retrieved passages are scored Yes/No/Insufficient and the AIS attribution metric[9] determines whether the cited passage supports the claim. Final output is structured JSON so attribution is machine-checkable.

3.2 Output Format

Each diagnostic response is structured as JSON with the schema:

{
  "differential": [
    {
      "rank": 1,
      "diagnosis": "Diabetic ketoacidosis",
      "claims": [
        {
          "text": "The patient's anion-gap metabolic acidosis (AG = 22) is consistent with DKA.",
          "evidence_id": "PMID:30315098",
          "evidence_quote": "...anion gap above 12 mEq/L is the defining laboratory feature...",
          "support": "SUPPORTS"
        },
        ...
      ]
    },
    ...
  ]
}

The format makes attribution machine-checkable: a downstream evaluator can re-query the retrieval index by evidence_id, confirm the passage exists, and run a secondary AIS adjudication.
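That downstream check can be sketched directly against the §3.2 schema. `fetch_passage` is a hypothetical lookup into the retrieval index by evidence_id (returning passage text, or None when the ID does not resolve); only the schema keys come from the format above.

```python
# Walk the structured response, re-resolve each evidence_id, and verify
# the quoted evidence span actually appears in the cited passage.
def audit_attribution(response, fetch_passage):
    """Yield (diagnosis, claim_text, ok) for every claim; ok is True only
    when the cited passage exists and contains the quoted evidence."""
    for entry in response["differential"]:
        for claim in entry["claims"]:
            passage = fetch_passage(claim["evidence_id"])
            quote = claim["evidence_quote"].strip(".")   # drop ellipsis dots
            ok = passage is not None and quote in passage
            yield entry["diagnosis"], claim["text"], ok
```

A claim that cites a non-existent evidence_id fails this check automatically, which is what makes the hallucinated-citation audit of §4.2 mechanical rather than manual.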

3.3 Chain-of-Evidence Prompting

Liévin et al.[10] demonstrated that chain-of-thought combined with retrieval yields a measurable accuracy gain on MedQA/MedMCQA/PubMedQA — and that requiring evidence citation in the chain further reduces hallucination. Oracle's Stage-3 prompt template makes this requirement explicit: the model is asked to draft its reasoning trace as a sequence of (claim, retrieved evidence, judgement) triples rather than free text.
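A triple-structured trace might be rendered as follows; the field labels (CLAIM / EVIDENCE / JUDGEMENT) are illustrative, since only the triple structure is specified above, not the exact prompt wording.

```python
# Render a (claim, evidence, judgement) reasoning trace as the
# structured text the Stage-3 prompt requests instead of free prose.
def render_trace(triples):
    """triples: list of (claim, evidence_id, judgement) tuples."""
    lines = []
    for i, (claim, evidence_id, judgement) in enumerate(triples, 1):
        lines += [f"{i}. CLAIM: {claim}",
                  f"   EVIDENCE: {evidence_id}",
                  f"   JUDGEMENT: {judgement}"]
    return "\n".join(lines)
```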

§ 4 Evaluation Protocol

4.1 Dataset

A 30-case held-out set is drawn from NEJM Clinicopathological Cases published after the most recent training cutoff of all evaluated frontier models. Each case is reduced to a structured H&P input prior to evaluation (mirroring the methodology of Kanjee et al.[2]) so that the model is not given the discussion section.

4.2 Metrics

Table 1. Oracle evaluation metric suite.
  Metric                    Definition                                                        Source
  Top-1 accuracy            Final diagnosis matches gold.                                     Kanjee et al.[2]
  Top-3 inclusive           Gold diagnosis appears in top 3 of DDx.                           Kanjee et al.[2]
  AIS attribution rate      Fraction of claims judged as supported by their cited evidence.   Rashkin et al.[9]
  ALCE citation precision   Fraction of cited passages that actually support the claim.       Gao et al.[8]
  Hallucinated-claim rate   Claims that cite non-existent or unsupported evidence.            Almanac[7]
Pass criterion. Oracle v0.1 succeeds if: (a) top-3 inclusive accuracy matches or exceeds the 64% GPT-4 baseline established by Kanjee et al.[2]; (b) the AIS attribution rate is ≥ 95% across emitted claims; (c) zero hallucinated citations appear in a 100-claim audited subset.
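The automatic part of the suite (criteria a and b; criterion c is a manual audit) reduces to a short scoring function. This is a hedged sketch: the per-case field names `gold_rank` and `claims` are illustrative, not a fixed Oracle schema.

```python
# Compute top-1 / top-3 accuracy and the AIS attribution rate over a
# list of per-case results, then apply the thresholds from §4.2.
def evaluate(cases):
    """cases: dicts with 'gold_rank' (1-based rank of the gold diagnosis
    in the DDx, or None if absent) and 'claims' (per-claim AIS booleans)."""
    n = len(cases)
    top1 = sum(c["gold_rank"] == 1 for c in cases) / n
    top3 = sum(c["gold_rank"] is not None and c["gold_rank"] <= 3
               for c in cases) / n
    flat = [ok for c in cases for ok in c["claims"]]
    ais = sum(flat) / len(flat)
    return {"top1": top1, "top3": top3, "ais": ais,
            "pass": top3 >= 0.64 and ais >= 0.95}
```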

4.3 Baselines

Compared against (i) GPT-4 with the Kanjee et al. prompt (their original 64%/39%), (ii) Claude Opus 4.7 without retrieval, (iii) Almanac architecture[7] reproduced on MIRAGE, and (iv) Oracle without the per-claim citation requirement (ablation). The fourth baseline isolates the contribution of structured attribution.

§ 5 Expected Contributions

  1. System. A public reference implementation of a citation-faithful diagnostic agent, runnable against any NEJM-style case.
  2. Methodology. The first composite metric suite that combines diagnostic accuracy (Kanjee), attribution (AIS), and citation quality (ALCE) for clinical reasoning.
  3. Empirical finding. Quantification of the accuracy cost (if any) of imposing a 95% AIS-attribution constraint on diagnostic generation.

§ 6 Limitations and Risks

Oracle's attribution layer is only as good as its retrieval corpus. MIRAGE[6] is comprehensive but not exhaustive — niche conditions and recent literature may be under-represented, biasing Oracle against rare diagnoses. The system also imposes computational overhead: per-claim retrieval and support judgement roughly triples token consumption compared to ungrounded DDx generation, which has operational cost implications a deployment must accept.

A second concern: NEJM Case Records are deliberately rare-and-instructive presentations. Performance on them is informative but does not generalise directly to bread-and-butter ambulatory or ED presentations. A v0.2 extension should add an ambulatory eval set, likely drawn from MedAlign-style clinician-written prompts over MIMIC-IV records.

§ 7 Conclusion

Oracle bets that the defining capability of the next decade in clinical AI is not raw accuracy but accountability. The capability to produce a top-3 DDx is now broadly available; the capability to produce a top-3 DDx where every clinical claim resolves to a verifiable citation is not. Oracle is the simplest end-to-end demonstration of that capability, built on previously validated components — NEJM evaluation[2], MIRAGE retrieval[6], AIS attribution[9], ALCE measurement[8] — combined for the first time in a single open system.

References

  1. Tu T, Palepu A, Schaekermann M, et al. Towards Conversational Diagnostic AI (AMIE). Google, 2024. arxiv.org/abs/2401.05654
  2. Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA, 2023. jamanetwork.com/journals/jama/fullarticle/2806457
  3. Singhal K, Tu T, Gottweis J, et al. Toward Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2). Nature Medicine, 2025. arxiv.org/abs/2305.09617
  4. Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge. Nature, 620:172–180, 2023. arxiv.org/abs/2212.13138
  5. Saab K, Tu T, Weng WH, et al. Capabilities of Gemini Models in Medicine (Med-Gemini). Google, 2024. arxiv.org/abs/2404.18416
  6. Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings, 2024. arxiv.org/abs/2402.13178
  7. Zakka C, Shad R, Chaurasia A, et al. Almanac — Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI, 2024. ai.nejm.org/doi/abs/10.1056/AIoa2300068
  8. Gao T, Yen H, Yu J, Chen D. Enabling Large Language Models to Generate Text with Citations (ALCE). EMNLP, 2023. arxiv.org/abs/2305.14627
  9. Rashkin H, Nikolaev V, Lamm M, et al. Measuring Attribution in Natural Language Generation Models (AIS). Computational Linguistics, 2023. arxiv.org/abs/2112.12870
  10. Liévin V, Hother CE, Motzfeldt AG, Winther O. Can Large Language Models Reason About Medical Questions? Patterns (Cell), 2024. arxiv.org/abs/2207.08143
  11. Tu T, Palepu A, Schaekermann M, et al. Advancing Conversational Diagnostic AI with Multimodal Reasoning (AMIE-Vision). Google DeepMind, May 2025. arxiv.org/abs/2505.04653
  12. Google Research. From Diagnosis to Treatment: Advancing AMIE for Longitudinal Disease Management. Research blog post, March 2025. research.google/blog/from-diagnosis-to-treatment-advancing-amie-for-longitudinal-disease-management
  13. NEJM AI. Assessment of Large Language Models in Clinical Reasoning (Script Concordance Test benchmark). NEJM AI, 2025. ai.nejm.org/doi/full/10.1056/AIdbp2500120
  14. NEJM AI. Automation Bias in LLM-Assisted Diagnostic Reasoning: A Randomized Controlled Trial. NEJM AI, 2025. ai.nejm.org/doi/full/10.1056/AIoa2501001
  15. Omar M, et al. Multi-Model Assurance Analysis: Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks During Clinical Decision Support. Communications Medicine, 2025. nature.com/articles/s43856-025-01021-3
  16. Fansi Tchango A, Goel R, Wen Z, Martel J, Ghosn J. DDXPlus: A New Dataset for Medical Automatic Diagnosis. NeurIPS Datasets and Benchmarks, 2022. arxiv.org/abs/2205.09148
  17. Wallat C, et al. Correctness Is Not Faithfulness in RAG Attributions. SIGIR ICTIR, 2025. arxiv.org/abs/2412.18004