Graphcore: GraphRAG for Clinical Decision Support over Guideline Corpora
Microsoft's GraphRAG methodology applied to UpToDate-style clinical guidelines + recent literature — testing whether community-aware graph retrieval beats vanilla vector RAG on multi-hop clinical questions.
Abstract
Edge et al.'s GraphRAG (Microsoft, 2024)[1] demonstrated substantial gains over conventional RAG on global sensemaking questions across 1-million-token corpora. The methodology — entity extraction, Leiden community detection[6], hierarchical summarisation, query-time global vs local retrieval — has spawned a family of variants: LightRAG[2], LazyGraphRAG[3] (Microsoft Research, 700× cheaper queries), HippoRAG[4] (NeurIPS 2024, +20% on multi-hop QA, 10–30× cheaper than IRCoT), RAPTOR[5] (ICLR 2024, +20% on QuALITY with GPT-4). Graphcore instantiates this family for clinical decision support over UpToDate-style guidelines and recent literature. Direct prior art exists: Wu et al.'s Medical Graph RAG[8] (ACL 2025) and a 2025 medRxiv CKD-guideline validation[9] showed multi-hop graph walks improved patient-specificity over vector RAG. Pass criterion: ≥ 10-point improvement over MedRAG baseline[13] on global-style clinical questions; head-to-head benchmarking against Han et al.'s systematic GraphRAG-vs-RAG framework[10].
§ 1 Introduction
Vanilla retrieval-augmented generation answers "needle" questions well — pull a passage, ground a response, cite it. It fails on "global" questions that require synthesizing themes across a corpus. Edge et al.[1] diagnosed this precisely: vector retrieval surfaces a small number of locally similar passages, but clinical reasoning often requires understanding what a body of evidence collectively says — exactly the failure mode GraphRAG was designed to address.
Clinical decision support over guideline corpora is a natural application. UpToDate, NICE, AHA, and ASPEN guidelines are structurally interconnected: cross-references between recommendations, citations into the underlying trials, condition-treatment-contraindication relations. The corpus is large but bounded; the questions span themes; the cost of getting it wrong is high. Graphcore tests whether GraphRAG's methodology produces measurable clinical-reasoning gains over vanilla MedRAG[13] in this setting.
1.1 Contributions
- An open implementation of GraphRAG over clinical guideline corpora, with the LightRAG[2] and LazyGraphRAG[3] variants as ablations.
- A head-to-head evaluation against MedRAG on the existing 7,663-question MIRAGE medical QA benchmark[13], plus a new 200-question global-style benchmark.
- The first cost-quality Pareto curve for clinical GraphRAG — informed by LazyGraphRAG's reported 700× query-cost reduction in the general domain.
§ 2 Background and Related Work
2.1 GraphRAG Family
Microsoft's GraphRAG paper[1] introduced the canonical pipeline: LLM-driven entity-and-relation extraction; community detection via the Leiden algorithm[6] (Traag, Waltman & van Eck 2019: empirically up to 25% of Louvain communities are badly connected, up to 16% disconnected; Leiden guarantees well-connectedness via a refinement phase); hierarchical community summarisation; and a query-time choice between local retrieval (entity-anchored) and global retrieval (community-summary-anchored). LightRAG[2] simplified this to dual-level (low/high) retrieval with incremental index updates. LazyGraphRAG[3] moved most of the graph work to query time, matching GraphRAG quality at an indexing cost identical to vector RAG with queries ~700× cheaper.
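To make the indexing stage concrete, the sketch below runs Leiden community detection over a toy entity graph, as the canonical pipeline would after LLM extraction. The triples are illustrative, not drawn from the Graphcore corpus, and the python-igraph/leidenalg pairing is one possible implementation choice, not the reference implementation.

```python
# Minimal sketch: Leiden communities over an extracted entity-relation graph.
import igraph as ig
import leidenalg

# (head, relation, tail) triples as an LLM extractor might emit them (illustrative)
triples = [
    ("dapagliflozin", "treats", "heart failure"),
    ("dapagliflozin", "treats", "chronic kidney disease"),
    ("empagliflozin", "treats", "type 2 diabetes"),
    ("SGLT2 inhibitors", "includes", "dapagliflozin"),
    ("SGLT2 inhibitors", "includes", "empagliflozin"),
    ("SGLT2 inhibitors", "contraindicated_in", "type 1 diabetes"),
]

entities = sorted({e for h, _, t in triples for e in (h, t)})
index = {name: i for i, name in enumerate(entities)}
edges = [(index[h], index[t]) for h, _, t in triples]

g = ig.Graph(n=len(entities), edges=edges, directed=False)
g.vs["name"] = entities

# Leiden's refinement phase guarantees well-connected communities (Traag et al. 2019)
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)

for cid, members in enumerate(partition):
    names = [entities[i] for i in members]
    # In the full pipeline each community's entities and source chunks would be
    # handed to an LLM for hierarchical community summarisation.
    print(f"community {cid}: {names}")
```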
2.2 Parallel Architectures
HippoRAG (Jiménez Gutiérrez et al., NeurIPS 2024)[4] takes a neurobiological framing: KG plus personalised-PageRank retrieval reaches +20% on multi-hop QA at 10–30× lower cost and 6–13× the speed of iterative retrieval methods like IRCoT. RAPTOR (Sarthi et al., ICLR 2024)[5] abandons explicit KGs in favour of recursive cluster-and-summarise trees; combined with GPT-4 it improved QuALITY by 20% absolute. KAPING (Baek, Aji & Saffari, NLRSE 2023)[7] showed that KG-fact prompting beats zero-shot baselines by up to 48% averaged across LLM sizes — the simplest possible KG-augmented prompting.
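For intuition, a minimal sketch of HippoRAG-style retrieval: entities mentioned in the query seed a personalised PageRank over the knowledge graph, and passages are ranked by the PageRank mass of their linked entities. The graph contents, passage identifiers, and passage-entity mapping are illustrative assumptions, not HippoRAG's actual code.

```python
# Personalised-PageRank retrieval sketch (HippoRAG-style), using networkx.
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("SGLT2 inhibitors", "heart failure"),
    ("SGLT2 inhibitors", "chronic kidney disease"),
    ("heart failure", "ejection fraction"),
    ("chronic kidney disease", "eGFR"),
])

# Entities linked to each guideline passage (normally built at index time)
passage_entities = {
    "NICE-HF-1.2": {"heart failure", "ejection fraction"},
    "NICE-CKD-3.1": {"chronic kidney disease", "eGFR"},
}

query_entities = {"SGLT2 inhibitors", "chronic kidney disease"}
personalization = {n: (1.0 if n in query_entities else 0.0) for n in kg.nodes}

# Random walks restart only at the query entities
scores = nx.pagerank(kg, alpha=0.85, personalization=personalization)

# Rank passages by the total PageRank mass of their linked entities
ranked = sorted(passage_entities.items(),
                key=lambda kv: -sum(scores[e] for e in kv[1]))
print(ranked)
```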
2.3 Clinical Applications
Wu et al.'s Medical Graph RAG (ACL 2025)[8] is the closest clinical prior art: a Triple Graph Construction + U-Retrieval architecture linking user documents to credible medical sources and controlled vocabularies. A 2025 medRxiv validation paper on chronic kidney disease[9] reports that GraphRAG over a NICE CKD guideline knowledge graph achieved the highest patient-specificity via multi-hop walks — but scored lower on clarity due to long guideline excerpts. Both findings inform Graphcore's design: clinical GraphRAG works, but verbosity is a real failure mode.
2.4 GraphRAG vs Vanilla RAG
Han et al.'s systematic GraphRAG-vs-RAG comparison[10] (2025) and the GraphRAG-Bench framework[11] establish where graph structure helps and where it hurts: complex multi-hop and global sensemaking favour GraphRAG; focused-fact retrieval often does not. Graphcore adopts this taxonomy directly and reports per-category results rather than aggregate.
§ 3 Proposed Approach
3.1 Pipeline
Graphcore follows the canonical GraphRAG pipeline (§ 2.1) over the clinical corpus: LLM-driven entity-and-relation extraction from guideline chunks; Leiden community detection[6] over the resulting graph; hierarchical community summarisation; and query-time routing between local (entity-anchored) and global (community-summary-anchored) retrieval.
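A minimal sketch of the query-time routing step, under the assumption of an index that exposes entity-anchored chunks and community summaries. The helper names (`GraphIndex`, `classify_scope`, `retrieve`) are hypothetical, and the keyword heuristic stands in for the LLM-based local/global router used in practice.

```python
# Query-time local vs global routing sketch (helper names are assumptions).
from dataclasses import dataclass

@dataclass
class GraphIndex:
    entity_chunks: dict[str, list[str]]   # entity -> supporting guideline chunks
    community_summaries: list[str]        # hierarchical community reports

def classify_scope(question: str) -> str:
    """Crude stand-in for an LLM router: global questions ask about themes
    across guidelines; local questions ask about a specific entity."""
    global_cues = ("consensus", "across", "themes", "overall", "compare")
    return "global" if any(c in question.lower() for c in global_cues) else "local"

def retrieve(index: GraphIndex, question: str, entities: list[str]) -> list[str]:
    if classify_scope(question) == "global":
        # Global: map-reduce over community summaries
        return index.community_summaries
    # Local: chunks anchored on entities mentioned in the question
    return [c for e in entities for c in index.entity_chunks.get(e, [])]
```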
3.2 Corpus and Variants
The base corpus is approximately 200 NICE clinical guidelines plus the StatPearls subset of MedRAG's MIRAGE[13], totalling roughly 30,000 documents and ~250M tokens. Three variants will be benchmarked: vanilla GraphRAG[1] (the headline), LightRAG[2] (simplified dual-level), and LazyGraphRAG[3] (cost-optimised). Each carries its index and query costs into the evaluation.
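For concreteness, an illustrative corpus-and-variant configuration; the directory names are placeholders for however the assembled corpus is laid out, and the document/token counts echo the estimates above rather than measured values.

```python
# Illustrative configuration for the corpus and the three benchmarked variants.
from dataclasses import dataclass, field

@dataclass
class CorpusConfig:
    sources: list[str] = field(default_factory=lambda: [
        "nice_guidelines/",       # ~200 NICE clinical guidelines
        "mirage/statpearls/",     # StatPearls subset of MIRAGE
    ])
    approx_documents: int = 30_000
    approx_tokens: int = 250_000_000

VARIANTS = ["graphrag", "lightrag", "lazygraphrag"]  # each indexed and queried separately
```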
§ 4 Evaluation Protocol
Two question sets:
- MIRAGE 7,663-question benchmark[13] as the focused-fact baseline. Graphcore is not expected to win this; reporting it documents the local/global tradeoff.
- A new 200-question global-style benchmark testing themes that span guidelines (e.g., "What is the evolving consensus on SGLT2 inhibitors across heart failure, CKD, and T2DM guidelines published 2022–2026?").
| Metric | Definition | Target |
|---|---|---|
| MIRAGE accuracy | Standard MIRAGE eval on Graphcore vs MedRAG baseline | ≥ MedRAG baseline |
| Global-question accuracy | Rubric score on 200 global-style clinical questions | ≥ 10 pts over MedRAG |
| Citation faithfulness | AIS attribution rate (handed to ChainCite) | ≥ 0.90 |
| Query cost | Tokens per query, reported as a cost-quality Pareto curve | LazyGraphRAG < vanilla GraphRAG |
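Once each variant yields a (tokens-per-query, rubric-score) pair, the Pareto curve reduces to keeping the non-dominated points. The sketch below shows that bookkeeping; all numbers are placeholders to be replaced by measured values.

```python
# Cost-quality Pareto frontier sketch: lower cost and higher quality are better.
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """points: (variant, cost_tokens_per_query, quality_score).
    Returns the non-dominated points sorted by ascending cost."""
    frontier, best_quality = [], float("-inf")
    for name, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:        # strictly better than anything cheaper
            frontier.append((name, cost, quality))
            best_quality = quality
    return frontier

measurements = [                          # placeholder costs and scores
    ("vector MedRAG",  1_500, 0.0),
    ("LazyGraphRAG",   2_000, 0.0),
    ("LightRAG",       6_000, 0.0),
    ("GraphRAG",      40_000, 0.0),
]
print(pareto_frontier(measurements))
```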
§ 5 Expected Contributions
- System. Open clinical GraphRAG implementation with vanilla, LightRAG, and LazyGraphRAG variants.
- Benchmark. A 200-question global-style clinical benchmark — the missing complement to MIRAGE's focused-fact orientation.
- Empirical finding. The first cost-quality Pareto curve for clinical GraphRAG informed by the LazyGraphRAG 700× cost-reduction claim.
§ 6 Limitations and Risks
GraphRAG's index cost is the primary deployment barrier: Edge et al.[1] report that LLM-driven entity-and-relation extraction over a 1M-token corpus is substantially more expensive than vector indexing. The LazyGraphRAG[3] design alleviates this but introduces query-time cost asymmetry. For clinical guidelines whose update cadence is annual rather than streaming, the index cost is amortised — but a v0.2 effort needs incremental update mechanics (LightRAG's[2] approach is the candidate). The verbosity finding from the CKD validation[9] is a real concern: GraphRAG retrieves community summaries that can be much longer than needed and degrade response clarity. Graphcore mitigates with a post-retrieval summary-compression step.
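One way the summary-compression mitigation could look, assuming an arbitrary LLM completion callable passed in by the caller; the prompt wording, word budget, and helper name are illustrative, not a fixed design.

```python
# Post-retrieval summary compression sketch targeting the verbosity failure mode.
def compress_summary(community_summary: str, question: str, complete, max_words: int = 150) -> str:
    """Ask the LLM (via the caller-supplied `complete` function) to keep only
    the material relevant to the question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Guideline-derived summary:\n{community_summary}\n\n"
        f"Rewrite the summary in at most {max_words} words, keeping only content "
        f"needed to answer the question. Preserve recommendation grades and citations."
    )
    return complete(prompt)

# Usage: compressed = [compress_summary(s, q, complete=my_llm_call) for s in retrieved_summaries]
```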
§ 7 Conclusion
Graphcore tests a specific, falsifiable claim: GraphRAG's local/global retrieval distinction produces measurable clinical-reasoning gains over vanilla MedRAG on global-style questions, at a Pareto-acceptable cost tradeoff via the LazyGraphRAG variant. The infrastructure exists, the methodology is published, the clinical corpora are accessible. What is missing is the open clinical instantiation with cost-aware evaluation. Graphcore provides it.
References
- [1] Edge D, Trinh H, Cheng N, Bradley J, Chao A, Mody A, Truitt S, Metropolitansky D, Ness RO, Larson J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft, arXiv:2404.16130, 2024. arxiv.org/abs/2404.16130
- [2] Guo Z, Xia L, Yu Y, Ao T, Huang C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv:2410.05779; EMNLP 2025. Dual-level retrieval with incremental updates. arxiv.org/abs/2410.05779
- [3] Edge D, Trinh H, Larson J. LazyGraphRAG: Setting a new standard for quality and cost. Microsoft Research Blog, Nov 25 2024. Indexing cost identical to vector RAG; ~0.1% of full GraphRAG; >700× lower query cost at comparable global-query quality. microsoft.com/.../lazygraphrag-setting-a-new-standard-for-quality-and-cost
- [4] Jiménez Gutiérrez B, Shu Y, Gu Y, Yasunaga M, Su Y. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. NeurIPS 2024. +20% on multi-hop QA; 10–30× cheaper and 6–13× faster than IRCoT. arxiv.org/abs/2405.14831
- [5] Sarthi P, Abdullah S, Tuli A, Khanna S, Goldie A, Manning CD. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR 2024. +20% absolute on QuALITY when coupled with GPT-4. arxiv.org/abs/2401.18059
- [6] Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: Guaranteeing Well-Connected Communities. Scientific Reports 9:5233, 2019. Up to 25% of Louvain communities badly connected, up to 16% disconnected; Leiden guarantees connectedness. nature.com/articles/s41598-019-41695-z
- [7] Baek J, Aji AF, Saffari A. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering (KAPING). NLRSE @ ACL, 2023. Zero-shot KG-fact prompting outperforms zero-shot baselines by up to 48% on average. arxiv.org/abs/2306.04136
- [8] Wu J, Zhu J, Qi Y, Chen J, Xu M, Menolascina F, Jin Y, Grau V. Medical Graph RAG: Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation. ACL 2025 (Long Papers, pp. 28443–28467); arXiv:2408.04187. aclanthology.org/2025.acl-long.1381
- [9] Development and validation of Retrieval Augmented Generation (RAG) and GraphRAG for complex clinical cases (CKD). medRxiv preprint, 2025. GraphRAG achieved highest patient-specificity via multi-hop walks across NICE CKD guideline KG. medrxiv.org/content/10.1101/2025.11.25.25341010v1
- [10] Han H, et al. RAG vs. GraphRAG: A Systematic Evaluation and Key Insights. arXiv:2502.11371, 2025. arxiv.org/abs/2502.11371
- [11] When to Use Graphs in RAG: A Comprehensive Analysis (GraphRAG-Bench). arXiv:2506.05690, 2025. arxiv.org/abs/2506.05690
- [12] LlamaIndex contributors. Property Graph Index — KG retriever framework. Documentation. Four composable retrievers (LLMSynonym, VectorContext, TextToCypher, CypherTemplate). developers.llamaindex.ai/.../lpg_index_guide
- [13] Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings 2024. 7,663 questions across 5 medical QA datasets; up to +18% over CoT. arxiv.org/abs/2402.13178