Paper 17 / 19 · Preliminary Manuscript · v0.1 · May 2026
Dossier №01 · Project 17 · Graphcore

Graphcore: GraphRAG for Clinical Decision Support over Guideline Corpora

Microsoft's GraphRAG methodology applied to UpToDate-style clinical guidelines + recent literature — testing whether community-aware graph retrieval beats vanilla vector RAG on multi-hop clinical questions.

Abstract. Edge et al.'s GraphRAG (Microsoft, 2024)[1] demonstrated substantial gains over conventional RAG on global sensemaking questions across 1-million-token corpora. The methodology — entity extraction, Leiden community detection[6], hierarchical summarisation, query-time global vs local retrieval — has spawned a family of variants: LightRAG[2], LazyGraphRAG[3] (Microsoft Research, ~700× cheaper queries), HippoRAG[4] (NeurIPS 2024, +20% on multi-hop QA, 10–30× cheaper than IRCoT), RAPTOR[5] (ICLR 2024, +20% on QuALITY with GPT-4). Graphcore instantiates this family for clinical decision support over UpToDate-style guidelines and recent literature. Direct prior art exists: Wu et al.'s Medical Graph RAG[8] (ACL 2025) and a 2025 medRxiv CKD-guideline validation[9] showed that multi-hop graph walks improve patient-specificity over vector RAG. Pass criterion: ≥ 10-point improvement over the MedRAG baseline[13] on global-style clinical questions, with head-to-head benchmarking against Han et al.'s systematic GraphRAG-vs-RAG framework[10].

§ 1 Introduction

Vanilla retrieval-augmented generation answers "needle" questions well — pull a passage, ground a response, cite it. It fails on "global" questions that require synthesizing themes across a corpus. Edge et al.[1] diagnosed this precisely: vector retrieval surfaces a small number of locally-similar passages, but clinical reasoning often requires understanding what a body of evidence collectively says — exactly the failure mode GraphRAG was designed to address.

Clinical decision support over guideline corpora is a natural application. UpToDate, NICE, AHA, and ASPEN guidelines are structurally interconnected: cross-references between recommendations, citations into the underlying trials, condition-treatment-contraindication relations. The corpus is large but bounded; the questions span themes; the cost of getting it wrong is high. Graphcore tests whether GraphRAG's methodology produces measurable clinical-reasoning gains over vanilla MedRAG[13] in this setting.

1.1 Contributions

  1. An open implementation of GraphRAG over clinical guideline corpora, with the LightRAG[2] and LazyGraphRAG[3] variants as ablations.
  2. A head-to-head evaluation against MedRAG / MIRAGE[13] on its existing 7,663-question medical QA benchmark, plus a new 200-question global-style benchmark.
  3. The first cost-quality Pareto curve for clinical GraphRAG — informed by LazyGraphRAG's reported 700× query-cost reduction in the general domain.

§ 2 Background and Related Work

2.1 GraphRAG Family

Microsoft's GraphRAG paper[1] introduced the canonical pipeline: LLM-driven entity-and-relation extraction; community detection via the Leiden algorithm[6] (Traag, Waltman & van Eck 2019: empirically up to 25% of Louvain communities are badly connected and up to 16% disconnected; Leiden guarantees well-connectedness via a refinement phase); hierarchical community summarisation; and a query-time choice between local retrieval (entity-anchored) and global retrieval (community-summary-anchored). LightRAG[2] simplified the pipeline to dual-level (low/high) retrieval with incremental index updates. LazyGraphRAG[3] moved most of the cost to query time, matching GraphRAG quality at an indexing cost identical to vector RAG's, with queries ~700× cheaper.
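The indexing pipeline just described can be sketched end-to-end. This is a minimal, self-contained illustration under loud assumptions, not Edge et al.'s implementation: entity/relation extraction is stubbed with a naive capitalised-word heuristic (a real system prompts an LLM for typed triples), and community detection uses connected components via union-find as a stand-in for Leiden, which additionally refines partitions for well-connectedness.

```python
from collections import defaultdict

def extract_relations(doc: str) -> list[tuple[str, str]]:
    """Stub for LLM-driven relation extraction: co-occurrence edges between
    capitalised tokens. A real pipeline prompts an LLM for typed triples."""
    words = [w.strip(".,") for w in doc.split() if w[0].isupper()]
    return list(zip(words, words[1:]))

def communities(edges: list[tuple[str, str]]) -> list[set[str]]:
    """Stand-in for Leiden: group entities by connected component using
    union-find. Leiden further refines partitions and optimises modularity."""
    parent: dict[str, str] = {}
    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:          # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)      # union the two components
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

# Hypothetical two-document corpus: entities linked across documents land
# in one community, which would then receive a hierarchical LLM summary.
docs = ["Metformin treats Type-2 Diabetes.",
        "Type-2 Diabetes raises Cardiovascular Risk."]
edges = [e for d in docs for e in extract_relations(d)]
comms = communities(edges)
```

The point of the sketch is the data flow: per-document extraction produces edges, edges are clustered, and each cluster becomes a summarisation unit.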

2.2 Parallel Architectures

HippoRAG (Jiménez Gutiérrez et al., NeurIPS 2024)[4] takes a neurobiological framing: KG plus personalised-PageRank retrieval gains up to 20% on multi-hop QA while being 10–30× cheaper and 6–13× faster than iterative retrieval methods like IRCoT. RAPTOR (Sarthi et al., ICLR 2024)[5] abandons explicit KGs in favour of recursive cluster-and-summarise trees; combined with GPT-4 it improved QuALITY accuracy by 20% absolute. KAPING (Baek, Aji & Saffari, NLRSE 2023)[7] showed that KG-fact prompting beats zero-shot baselines by up to 48% averaged across LLM sizes — the simplest possible form of KG-augmented prompting.
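HippoRAG's core retrieval primitive is personalised PageRank seeded at query-linked entities. The following is a minimal power-iteration sketch, not HippoRAG's implementation; the toy knowledge graph and node names are hypothetical.

```python
def personalized_pagerank(graph: dict[str, list[str]], seeds: set[str],
                          alpha: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Power iteration for personalised PageRank.
    graph maps node -> outgoing neighbours; seeds are query-linked entities.
    The teleport vector is concentrated on the seeds, so scores measure
    proximity to the query's entities rather than global centrality."""
    nodes = list(graph)
    teleport = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue  # dangling node: its mass is dropped in this sketch
            share = alpha * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

# Hypothetical mini-KG; a query mentioning metformin seeds the walk there.
kg = {
    "metformin": ["type2_diabetes"],
    "type2_diabetes": ["metformin", "ckd"],
    "ckd": ["type2_diabetes", "nice_ckd_guideline"],
    "nice_ckd_guideline": ["ckd"],
}
scores = personalized_pagerank(kg, seeds={"metformin"})
```

Passages attached to high-scoring nodes would then be returned as retrieval context, which is what makes multi-hop neighbours reachable without iterative re-retrieval.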

2.3 Clinical Applications

Wu et al.'s Medical Graph RAG (ACL 2025)[8] is the closest clinical prior art: a Triple Graph Construction + U-Retrieval architecture linking user documents to credible medical sources and controlled vocabularies. A 2025 medRxiv validation paper on chronic kidney disease[9] reports that GraphRAG over a NICE CKD guideline knowledge graph achieved the highest patient-specificity via multi-hop walks — but scored lower on clarity due to long guideline excerpts. Both findings inform Graphcore's design: clinical GraphRAG works, but verbosity is a real failure mode.

2.4 GraphRAG vs Vanilla RAG

Han et al.'s systematic GraphRAG-vs-RAG comparison[10] (2025) and the GraphRAG-Bench framework[11] establish where graph structure helps and where it hurts: complex multi-hop and global sensemaking favour GraphRAG; focused-fact retrieval often does not. Graphcore adopts this taxonomy directly and reports per-category results rather than aggregate.

§ 3 Proposed Approach

3.1 Pipeline

Figure 1 · Graphcore architecture
[Diagram: indexing path — guideline corpus (UpToDate / NICE / ASPEN / AHA + PubMed deltas) → LLM entity/relation extraction → Leiden community detection → hierarchical summaries per cluster and corpus → Neo4j graph + community-summary store with vector index over summaries, provenance preserved; query path — free-text clinical query → scope-classifier router → local or global retrieval → grounded answer with provenance → ChainCite eval.]
Figure 1. Graphcore indexing-time pipeline (top) follows Edge et al.[1]: LLM entity/relation extraction → Leiden community detection[6] → hierarchical summarisation. Query-time (bottom) a scope router decides between local retrieval (entity-anchored, equivalent to vanilla vector RAG) and global retrieval (community-summary-anchored, the GraphRAG distinctive). Output flows into ChainCite (Project 18) for citation-faithfulness evaluation.
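The router in Figure 1 is a binary scope classifier. A minimal sketch follows: in production the router would prompt an LLM, so the keyword heuristic and cue list here are placeholders, and the retriever callables are hypothetical.

```python
# Cue words suggesting corpus-level synthesis; a placeholder for an
# LLM-based scope classifier, not a production heuristic.
GLOBAL_CUES = ("across", "overall", "themes", "compare", "trends",
               "consensus", "summarise", "summarize")

def route(query: str) -> str:
    """Classify query scope: 'local' (entity-anchored vector retrieval)
    vs 'global' (community-summary retrieval)."""
    q = query.lower()
    return "global" if any(cue in q for cue in GLOBAL_CUES) else "local"

def answer(query: str, local_retrieve, global_retrieve) -> list[str]:
    """Dispatch to the chosen retriever; both return evidence passages
    that are then grounded into an answer with provenance."""
    retriever = global_retrieve if route(query) == "global" else local_retrieve
    return retriever(query)

assert route("What is the first-line dose of metformin?") == "local"
assert route("What is the consensus across guidelines on SGLT2 inhibitors?") == "global"
```

The design choice worth noting is that misrouting is asymmetric: sending a global question to local retrieval reproduces vanilla RAG's failure mode, while the reverse mainly costs tokens.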

3.2 Corpus and Variants

The base corpus is approximately 200 NICE clinical guidelines plus the StatPearls subset of MedRAG's MIRAGE[13], totalling roughly 30,000 documents and ~250M tokens. Three variants will be benchmarked: vanilla GraphRAG[1] (the headline), LightRAG[2] (simplified dual-level), and LazyGraphRAG[3] (cost-optimised). Each carries its index and query costs into the evaluation.

§ 4 Evaluation Protocol

The evaluation uses two question sets: MIRAGE's existing 7,663 focused-fact questions[13] and the new 200-question global-style benchmark. Table 1 summarises the metrics.

Table 1. Graphcore evaluation metrics.
Metric · Definition · Target
MIRAGE accuracy · Standard MIRAGE eval, Graphcore vs MedRAG baseline · ≥ MedRAG baseline
Global-question accuracy · Rubric score on 200 global-style clinical questions · ≥ 10 pts over MedRAG
Citation faithfulness · AIS attribution rate (handed to ChainCite) · ≥ 0.90
Query cost · Tokens per query, reported as a Pareto curve · LazyGraphRAG < vanilla GraphRAG
Pass criterion. Graphcore v0.1 succeeds if it improves on MedRAG by ≥ 10 rubric points on the 200-question global benchmark while remaining within ±3 points of MedRAG on MIRAGE focused-fact accuracy — the local/global Pareto tradeoff made explicit.
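The pass criterion is simple enough to make executable. The scores below are hypothetical, for illustration only:

```python
def passes(global_rubric_new: float, global_rubric_base: float,
           mirage_new: float, mirage_base: float) -> bool:
    """Graphcore v0.1 pass criterion: >= 10 rubric points over the MedRAG
    baseline on the global benchmark, while staying within +/-3 points of
    MedRAG on MIRAGE focused-fact accuracy."""
    global_gain = global_rubric_new - global_rubric_base
    mirage_drift = abs(mirage_new - mirage_base)
    return global_gain >= 10 and mirage_drift <= 3

# Hypothetical scores for illustration only.
assert passes(72.0, 58.0, 66.5, 68.0) is True   # +14 global, -1.5 MIRAGE
assert passes(72.0, 58.0, 60.0, 68.0) is False  # MIRAGE regressed by 8
```

Encoding the criterion this way keeps the two-sided nature explicit: a large global-question gain does not pass if focused-fact accuracy regresses beyond the tolerance.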

§ 5 Expected Contributions

  1. System. Open clinical GraphRAG implementation with vanilla, LightRAG, and LazyGraphRAG variants.
  2. Benchmark. A 200-question global-style clinical benchmark — the missing complement to MIRAGE's focused-fact orientation.
  3. Empirical finding. The first cost-quality Pareto curve for clinical GraphRAG informed by the LazyGraphRAG 700× cost-reduction claim.

§ 6 Limitations and Risks

GraphRAG's index cost is the primary deployment barrier: Edge et al.[1] report that LLM-driven entity-and-relation extraction over a 1M-token corpus is substantially more expensive than vector indexing. The LazyGraphRAG[3] design alleviates this but introduces query-time cost asymmetry. For clinical guidelines whose update cadence is annual rather than streaming, the index cost is amortised — but a v0.2 effort needs incremental update mechanics (LightRAG's[2] approach is the candidate). The verbosity finding from the CKD validation[9] is a real concern: GraphRAG retrieves community summaries that can be much longer than needed and degrade response clarity. Graphcore mitigates with a post-retrieval summary-compression step.
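The post-retrieval compression step mentioned above could take several forms; a minimal extractive sketch is shown here under stated assumptions — a production system would prompt an LLM for query-conditioned compression, whereas this stand-in scores sentences by lexical overlap with the query.

```python
def compress_summary(summary: str, query: str, max_sentences: int = 3) -> str:
    """Query-conditioned extractive compression: keep the sentences sharing
    the most content words with the query. Lexical overlap is a stand-in
    for an LLM compression prompt."""
    stop = {"the", "a", "of", "in", "for", "and", "to", "with", "is", "on"}
    q_terms = {w for w in query.lower().split() if w not in stop}
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: -len(q_terms & set(s.lower().split())))
    kept = scored[:max_sentences]
    kept.sort(key=sentences.index)  # preserve original order for readability
    return ". ".join(kept) + "."

# Hypothetical community summary and query, for illustration only.
summary = ("Metformin is first-line therapy. Sulfonylureas were studied in "
           "older trials. Dose reduction is required in CKD stages 4 and 5.")
compressed = compress_summary(summary, "metformin dose in CKD", max_sentences=2)
```

Even this crude filter addresses the CKD validation's clarity complaint directly: the guideline excerpt shrinks to the sentences that bear on the query.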

§ 7 Conclusion

Graphcore tests a specific, falsifiable claim: GraphRAG's local/global retrieval distinction produces measurable clinical-reasoning gains over vanilla MedRAG on global-style questions, at a Pareto-acceptable cost tradeoff via the LazyGraphRAG variant. The infrastructure exists, the methodology is published, the clinical corpora are accessible. What is missing is the open clinical instantiation with cost-aware evaluation. Graphcore provides it.

References

  1. Edge D, Trinh H, Cheng N, Bradley J, Chao A, Mody A, Truitt S, Metropolitansky D, Ness RO, Larson J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft, arXiv:2404.16130, 2024. arxiv.org/abs/2404.16130
  2. Guo Z, Xia L, Yu Y, Ao T, Huang C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv:2410.05779; EMNLP 2025. Dual-level retrieval with incremental updates. arxiv.org/abs/2410.05779
  3. Edge D, Trinh H, Larson J. LazyGraphRAG: Setting a new standard for quality and cost. Microsoft Research Blog, Nov 25 2024. Indexing cost identical to vector RAG; ~0.1% of full GraphRAG; >700× lower query cost at comparable global-query quality. microsoft.com/.../lazygraphrag-setting-a-new-standard-for-quality-and-cost
  4. Jiménez Gutiérrez B, Shu Y, Gu Y, Yasunaga M, Su Y. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. NeurIPS 2024. +20% on multi-hop QA; 10–30× cheaper and 6–13× faster than IRCoT. arxiv.org/abs/2405.14831
  5. Sarthi P, Abdullah S, Tuli A, Khanna S, Goldie A, Manning CD. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR 2024. +20% absolute on QuALITY when coupled with GPT-4. arxiv.org/abs/2401.18059
  6. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: Guaranteeing Well-Connected Communities. Scientific Reports 9:5233, 2019. Up to 25% of Louvain communities badly connected, up to 16% disconnected; Leiden guarantees connectedness. nature.com/articles/s41598-019-41695-z
  7. Baek J, Aji AF, Saffari A. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering (KAPING). NLRSE @ ACL, 2023. Zero-shot KG-fact prompting outperforms zero-shot baselines by up to 48% on average. arxiv.org/abs/2306.04136
  8. Wu J, Zhu J, Qi Y, Chen J, Xu M, Menolascina F, Jin Y, Grau V. Medical Graph RAG: Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation. ACL 2025 (Long Papers, pp. 28443–28467); arXiv:2408.04187. aclanthology.org/2025.acl-long.1381
  9. Development and validation of Retrieval Augmented Generation (RAG) and GraphRAG for complex clinical cases (CKD). medRxiv preprint, 2025. GraphRAG achieved highest patient-specificity via multi-hop walks across NICE CKD guideline KG. medrxiv.org/content/10.1101/2025.11.25.25341010v1
  10. Han H, et al. RAG vs. GraphRAG: A Systematic Evaluation and Key Insights. arXiv:2502.11371, 2025. arxiv.org/abs/2502.11371
  11. When to Use Graphs in RAG: A Comprehensive Analysis (GraphRAG-Bench). arXiv:2506.05690, 2025. arxiv.org/abs/2506.05690
  12. LlamaIndex contributors. Property Graph Index — KG retriever framework. Documentation. Four composable retrievers (LLMSynonym, VectorContext, TextToCypher, CypherTemplate). developers.llamaindex.ai/.../lpg_index_guide
  13. Xiong G, Jin Q, Lu Z, Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine (MedRAG / MIRAGE). ACL Findings 2024. 7,663 questions across 5 medical QA datasets; up to +18% over CoT. arxiv.org/abs/2402.13178