Graphcore: GraphRAG for Clinical Decision Support over Guideline Corpora
Microsoft's GraphRAG methodology applied to UpToDate-style clinical guidelines + recent literature — testing whether community-aware graph retrieval beats vanilla vector RAG on multi-hop clinical questions.
Abstract
Edge et al.'s GraphRAG (Microsoft, 2024)[1] demonstrated substantial gains over conventional RAG on global sensemaking questions across 1-million-token corpora. The methodology — entity extraction, Leiden community detection[6], hierarchical summarisation, query-time global vs local retrieval — has spawned a family of variants: LightRAG[2], LazyGraphRAG[3] (Microsoft Research, 700× cheaper queries), HippoRAG[4] (NeurIPS 2024, +20% on multi-hop QA, 10–30× cheaper than IRCoT), RAPTOR[5] (ICLR 2024, +20% on QuALITY with GPT-4). Graphcore instantiates this family for clinical decision support over UpToDate-style guidelines and recent literature. Direct prior art exists: Wu et al.'s Medical Graph RAG[8] (ACL 2025) and a 2025 medRxiv CKD-guideline validation[9] showed multi-hop graph walks improved patient-specificity over vector RAG. Pass criterion: ≥ 10-point improvement over MedRAG baseline[13] on global-style clinical questions; head-to-head benchmarking against Han et al.'s systematic GraphRAG-vs-RAG framework[10].
§ 1 Introduction
Vanilla retrieval-augmented generation answers "needle" questions well — pull a passage, ground a response, cite it. It fails on "global" questions that require synthesizing themes across a corpus. Edge et al.[1] diagnosed this precisely: vector retrieval surfaces a small number of locally similar passages, but clinical reasoning often requires understanding what a body of evidence collectively says — exactly the failure mode GraphRAG was designed to address.
Clinical decision support over guideline corpora is a natural application. UpToDate, NICE, AHA, and ASPEN guidelines are structurally interconnected: cross-references between recommendations, citations into the underlying trials, condition-treatment-contraindication relations. The corpus is large but bounded; the questions span themes; the cost of getting it wrong is high. Graphcore tests whether GraphRAG's methodology produces measurable clinical-reasoning gains over vanilla MedRAG[13] in this setting.
1.1 Contributions
- An open implementation of GraphRAG over clinical guideline corpora, with the LightRAG[2] and LazyGraphRAG[3] variants as ablations.
- A head-to-head evaluation against MedRAG on the existing 7,663-question MIRAGE medical QA benchmark[13], plus a new 200-question global-style benchmark.
- The first cost-quality Pareto curve for clinical GraphRAG — informed by LazyGraphRAG's reported 700× query-cost reduction in the general domain.
§ 2 Background and Related Work
2.1 GraphRAG Family
Microsoft's GraphRAG paper[1] introduced the canonical pipeline: LLM-driven entity-and-relation extraction; community detection via the Leiden algorithm[6] (Traag, Waltman & van Eck 2019: empirically up to 25% of Louvain communities are badly connected, up to 16% disconnected; Leiden guarantees well-connectedness via a refinement phase); hierarchical community summarisation; and a query-time choice between local retrieval (entity-anchored) and global retrieval (community-summary-anchored). LightRAG[2] simplified this to dual-level (low/high) retrieval with incremental index updates. LazyGraphRAG[3] moved most of the graph work to query time, matching GraphRAG quality at an indexing cost identical to vector RAG with queries ~700× cheaper.
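To make the indexing stage concrete, the sketch below runs Leiden community detection over a toy entity graph, as the canonical pipeline would after LLM extraction. The triples are illustrative, not drawn from the Graphcore corpus, and the python-igraph/leidenalg pairing is one possible implementation choice, not the reference implementation.

```python
# Minimal sketch: Leiden communities over an extracted entity-relation graph.
import igraph as ig
import leidenalg

# (head, relation, tail) triples as an LLM extractor might emit them (illustrative)
triples = [
    ("dapagliflozin", "treats", "heart failure"),
    ("dapagliflozin", "treats", "chronic kidney disease"),
    ("empagliflozin", "treats", "type 2 diabetes"),
    ("SGLT2 inhibitors", "includes", "dapagliflozin"),
    ("SGLT2 inhibitors", "includes", "empagliflozin"),
    ("SGLT2 inhibitors", "contraindicated_in", "type 1 diabetes"),
]

entities = sorted({e for h, _, t in triples for e in (h, t)})
index = {name: i for i, name in enumerate(entities)}
edges = [(index[h], index[t]) for h, _, t in triples]

g = ig.Graph(n=len(entities), edges=edges, directed=False)
g.vs["name"] = entities

# Leiden's refinement phase guarantees well-connected communities (Traag et al. 2019)
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)

for cid, members in enumerate(partition):
    names = [entities[i] for i in members]
    # In the full pipeline each community's entities and source chunks would be
    # handed to an LLM for hierarchical community summarisation.
    print(f"community {cid}: {names}")
```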
2.2 Parallel Architectures
HippoRAG (Jiménez Gutiérrez et al., NeurIPS 2024)[4] takes a neurobiological framing: KG plus personalised-PageRank retrieval reaches +20% on multi-hop QA at 10–30× lower cost and 6–13× the speed of iterative retrieval methods like IRCoT. RAPTOR (Sarthi et al., ICLR 2024)[5] abandons explicit KGs in favour of recursive cluster-and-summarise trees; combined with GPT-4 it improved QuALITY by 20% absolute. KAPING (Baek, Aji & Saffari, NLRSE 2023)[7] showed that KG-fact prompting beats zero-shot baselines by up to 48% averaged across LLM sizes — the simplest possible KG-augmented prompting.
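For intuition, a minimal sketch of HippoRAG-style retrieval: entities mentioned in the query seed a personalised PageRank over the knowledge graph, and passages are ranked by the PageRank mass of their linked entities. The graph contents, passage identifiers, and passage-entity mapping are illustrative assumptions, not HippoRAG's actual code.

```python
# Personalised-PageRank retrieval sketch (HippoRAG-style), using networkx.
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("SGLT2 inhibitors", "heart failure"),
    ("SGLT2 inhibitors", "chronic kidney disease"),
    ("heart failure", "ejection fraction"),
    ("chronic kidney disease", "eGFR"),
])

# Entities linked to each guideline passage (normally built at index time)
passage_entities = {
    "NICE-HF-1.2": {"heart failure", "ejection fraction"},
    "NICE-CKD-3.1": {"chronic kidney disease", "eGFR"},
}

query_entities = {"SGLT2 inhibitors", "chronic kidney disease"}
personalization = {n: (1.0 if n in query_entities else 0.0) for n in kg.nodes}

# Random walks restart only at the query entities
scores = nx.pagerank(kg, alpha=0.85, personalization=personalization)

# Rank passages by the total PageRank mass of their linked entities
ranked = sorted(passage_entities.items(),
                key=lambda kv: -sum(scores[e] for e in kv[1]))
print(ranked)
```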
2.3 Clinical Applications
Wu et al.'s Medical Graph RAG (ACL 2025)[8] is the closest clinical prior art: a Triple Graph Construction + U-Retrieval architecture linking user documents to credible medical sources and controlled vocabularies. A 2025 medRxiv validation paper on chronic kidney disease[9] reports that GraphRAG over a NICE CKD guideline knowledge graph achieved the highest patient-specificity via multi-hop walks — but scored lower on clarity due to long guideline excerpts. Both findings inform Graphcore's design: clinical GraphRAG works, but verbosity is a real failure mode.
2.4 GraphRAG vs Vanilla RAG
Han et al.'s systematic GraphRAG-vs-RAG comparison[10] (2025) and the GraphRAG-Bench framework[11] establish where graph structure helps and where it hurts: complex multi-hop and global sensemaking favour GraphRAG; focused-fact retrieval often does not. Graphcore adopts this taxonomy directly and reports per-category results rather than aggregate.
§ 3 Proposed Approach
3.1 Pipeline
Graphcore follows the canonical GraphRAG pipeline (§ 2.1) over the clinical corpus: LLM-driven entity-and-relation extraction from guideline chunks; Leiden community detection[6] over the resulting graph; hierarchical community summarisation; and query-time routing between local (entity-anchored) and global (community-summary-anchored) retrieval.
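A minimal sketch of the query-time routing step, under the assumption of an index that exposes entity-anchored chunks and community summaries. The helper names (`GraphIndex`, `classify_scope`, `retrieve`) are hypothetical, and the keyword heuristic stands in for the LLM-based local/global router used in practice.

```python
# Query-time local vs global routing sketch (helper names are assumptions).
from dataclasses import dataclass

@dataclass
class GraphIndex:
    entity_chunks: dict[str, list[str]]   # entity -> supporting guideline chunks
    community_summaries: list[str]        # hierarchical community reports

def classify_scope(question: str) -> str:
    """Crude stand-in for an LLM router: global questions ask about themes
    across guidelines; local questions ask about a specific entity."""
    global_cues = ("consensus", "across", "themes", "overall", "compare")
    return "global" if any(c in question.lower() for c in global_cues) else "local"

def retrieve(index: GraphIndex, question: str, entities: list[str]) -> list[str]:
    if classify_scope(question) == "global":
        # Global: map-reduce over community summaries
        return index.community_summaries
    # Local: chunks anchored on entities mentioned in the question
    return [c for e in entities for c in index.entity_chunks.get(e, [])]
```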
3.2 Corpus and Variants
The base corpus is approximately 200 NICE clinical guidelines plus the StatPearls subset of MedRAG's MIRAGE[13], totalling roughly 30,000 documents and ~250M tokens. Three variants will be benchmarked: vanilla GraphRAG[1] (the headline), LightRAG[2] (simplified dual-level), and LazyGraphRAG[3] (cost-optimised). Each carries its index and query costs into the evaluation.
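For concreteness, an illustrative corpus-and-variant configuration; the directory names are placeholders for however the assembled corpus is laid out, and the document/token counts echo the estimates above rather than measured values.

```python
# Illustrative configuration for the corpus and the three benchmarked variants.
from dataclasses import dataclass, field

@dataclass
class CorpusConfig:
    sources: list[str] = field(default_factory=lambda: [
        "nice_guidelines/",       # ~200 NICE clinical guidelines
        "mirage/statpearls/",     # StatPearls subset of MIRAGE
    ])
    approx_documents: int = 30_000
    approx_tokens: int = 250_000_000

VARIANTS = ["graphrag", "lightrag", "lazygraphrag"]  # each indexed and queried separately
```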
§ 4 Evaluation Protocol
Two question sets:
- MIRAGE 7,663-question benchmark[13] as the focused-fact baseline. Graphcore is not expected to win this; reporting it documents the local/global tradeoff.
- A new 200-question global-style benchmark testing themes that span guidelines (e.g., "What is the evolving consensus on SGLT2 inhibitors across heart failure, CKD, and T2DM guidelines published 2022–2026?").
| Metric | Definition | Target |
|---|---|---|
| MIRAGE accuracy | Standard MIRAGE eval on Graphcore vs MedRAG baseline | ≥ MedRAG baseline |
| Global-question accuracy | Rubric score on 200 global-style clinical questions | ≥ 10 pts over MedRAG |
| Citation faithfulness | AIS attribution rate (handed to ChainCite) | ≥ 0.90 |
| Query cost | Tokens per query, reported as a cost-quality Pareto curve | LazyGraphRAG < vanilla GraphRAG |
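Once each variant yields a (tokens-per-query, rubric-score) pair, the Pareto curve reduces to keeping the non-dominated points. The sketch below shows that bookkeeping; all numbers are placeholders to be replaced by measured values.

```python
# Cost-quality Pareto frontier sketch: lower cost and higher quality are better.
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """points: (variant, cost_tokens_per_query, quality_score).
    Returns the non-dominated points sorted by ascending cost."""
    frontier, best_quality = [], float("-inf")
    for name, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:        # strictly better than anything cheaper
            frontier.append((name, cost, quality))
            best_quality = quality
    return frontier

measurements = [                          # placeholder costs and scores
    ("vector MedRAG",  1_500, 0.0),
    ("LazyGraphRAG",   2_000, 0.0),
    ("LightRAG",       6_000, 0.0),
    ("GraphRAG",      40_000, 0.0),
]
print(pareto_frontier(measurements))
```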
§ 5 Expected Contributions
- System. Open clinical GraphRAG implementation with vanilla, LightRAG, and LazyGraphRAG variants.
- Benchmark. A 200-question global-style clinical benchmark — the missing complement to MIRAGE's focused-fact orientation.
- Empirical finding. The first cost-quality Pareto curve for clinical GraphRAG informed by the LazyGraphRAG 700× cost-reduction claim.
§ 6 Limitations and Risks
GraphRAG's index cost is the primary deployment barrier: Edge et al.[1] report that LLM-driven entity-and-relation extraction over a 1M-token corpus is substantially more expensive than vector indexing. The LazyGraphRAG[3] design alleviates this but introduces query-time cost asymmetry. For clinical guidelines whose update cadence is annual rather than streaming, the index cost is amortised — but a v0.2 effort needs incremental update mechanics (LightRAG's[2] approach is the candidate). The verbosity finding from the CKD validation[9] is a real concern: GraphRAG retrieves community summaries that can be much longer than needed and degrade response clarity. Graphcore mitigates with a post-retrieval summary-compression step.
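One way the summary-compression mitigation could look, assuming an arbitrary LLM completion callable passed in by the caller; the prompt wording, word budget, and helper name are illustrative, not a fixed design.

```python
# Post-retrieval summary compression sketch targeting the verbosity failure mode.
def compress_summary(community_summary: str, question: str, complete, max_words: int = 150) -> str:
    """Ask the LLM (via the caller-supplied `complete` function) to keep only
    the material relevant to the question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Guideline-derived summary:\n{community_summary}\n\n"
        f"Rewrite the summary in at most {max_words} words, keeping only content "
        f"needed to answer the question. Preserve recommendation grades and citations."
    )
    return complete(prompt)

# Usage: compressed = [compress_summary(s, q, complete=my_llm_call) for s in retrieved_summaries]
```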
§ 7 Conclusion
Graphcore tests a specific, falsifiable claim: GraphRAG's local/global retrieval distinction produces measurable clinical-reasoning gains over vanilla MedRAG on global-style questions, at a Pareto-acceptable cost tradeoff via the LazyGraphRAG variant. The infrastructure exists, the methodology is published, the clinical corpora are accessible. What is missing is the open clinical instantiation with cost-aware evaluation. Graphcore provides it.
References
- [1] Edge D, Trinh H, Cheng N, Bradley J, Chao A, Mody A, Truitt S, Metropolitansky D, Ness RO, Larson J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft, arXiv:2404.16130, 2024. arxiv.org/abs/2404.16130
- [2] Guo Z, Xia L, Yu Y, Ao T, Huang C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv:2410.05779; EMNLP 2025. Dual-level retrieval with incremental updates. arxiv.org/abs/2410.05779
- [3] Edge D, Trinh H, Larson J. LazyGraphRAG: Setting a new standard for quality and cost. Microsoft Research Blog, Nov 25 2024. Indexing cost identical to vector RAG; ~0.1% of full GraphRAG; >700× lower query cost at comparable global-query quality. microsoft.com/.../lazygraphrag-setting-a-new-standard-for-quality-and-cost
- [4] Jiménez Gutiérrez B, Shu Y, Gu Y, Yasunaga M, Su Y. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. NeurIPS 2024. +20% on multi-hop QA; 10–30× cheaper and 6–13× faster than IRCoT. arxiv.org/abs/2405.14831
- [5] Sarthi P, Abdullah S, Tuli A, Khanna S, Goldie A, Manning CD. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR 2024. +20% absolute on QuALITY when coupled with GPT-4. arxiv.org/abs/2401.18059
- [6] Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: Guaranteeing Well-Connected Communities. Scientific Reports 9:5233, 2019. Up to 25% of Louvain communities badly connected, up to 16% disconnected; Leiden guarantees connectedness. nature.com/articles/s41598-019-41695-z
- [7] Baek J, Aji AF, Saffari A. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering (KAPING). NLRSE @ ACL, 2023. Zero-shot KG-fact prompting outperforms zero-shot baselines by up to 48% on average. arxiv.org/abs/2306.04136
- [8] Wu J, Zhu J, Qi Y, Chen J, Xu M, Menolascina F, Jin Y, Grau V. Medical Graph RAG: Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation. ACL 2025 (Long Papers, pp. 28443–28467); arXiv:2408.04187. aclanthology.org/2025.acl-long.1381
- [9] Development and validation of Retrieval Augmented Generation (RAG) and GraphRAG for complex clinical cases (CKD). medRxiv preprint, 2025. GraphRAG achieved highest patient-specificity via multi-hop walks across NICE CKD guideline KG. medrxiv.org/content/10.1101/2025.11.25.25341010v1
- [10] Han H, et al. RAG vs. GraphRAG: A Systematic Evaluation and Key Insights. arXiv:2502.11371, 2025. arxiv.org/abs/2502.11371
- [11] When to Use Graphs in RAG: A Comprehensive Analysis (GraphRAG-Bench). arXiv:2506.05690, 2025. arxiv.org/abs/2506.05690
- [12] LlamaIndex contributors. Property Graph Index — KG retriever framework. Documentation. Four composable retrievers (LLMSynonym, VectorContext, TextToCypher, CypherTemplate). developers.llamaindex.ai/.../lpg_index_guide
- [13] Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings 2024. 7,663 questions across 5 medical QA datasets; up to +18% over CoT. arxiv.org/abs/2402.13178