Medigraph: A Patient-Centric Clinical Knowledge Graph for Multi-Hop Reasoning over FHIR
Per-patient knowledge graphs built from FHIR resources, clinical notes, and standard ontologies — enabling the temporal, causal, multi-hop queries that vanilla RAG and long-context LLMs both miss.
Abstract
General-purpose clinical knowledge graphs are mature: PrimeKG[1] integrates 20 resources into 17,080 diseases with 4,050,249 relationships; Hetionet v1.0[2] contains 47,031 nodes and 2,250,197 relationships; UMLS[3] reconciles ~900,000 concepts from over 60 vocabularies. What does not exist is a reusable patient-centric KG construction pipeline that takes a FHIR bundle, links its references to standard ontologies (SNOMED CT / RxNorm[5] / LOINC[4]), augments structured resources with extractions from free-text notes (cTAKES[6] / MedCAT[7]), and exposes the result as a multi-hop queryable graph. Medigraph fills that gap. It composes with Atrium (Project 01) as the FHIR substrate and serves as the upstream KG for GraphCore (Project 17). The architectural inspiration is GraphCare[10] — personalized concept KGs with a Bi-attention GNN demonstrated on MIMIC-III/IV — and Rotmensch & Sontag's 2017 disease-symptom KG learned from 273,174 patient records[13]. Pass criterion: graph completeness ≥ 95% against gold-annotated test patients and multi-hop query F1 ≥ 0.75 on a 100-question evaluation suite.
§ 1 Introduction
A patient's clinical history is a graph. Conditions cause investigations; investigations yield observations; observations modify medications; medications interact with each other and with new conditions. FHIR R4 represents this faithfully at the resource level — each Observation references its subject, encounter, and code — but the resulting object graph is not directly queryable for the kinds of multi-hop questions clinical reasoning actually requires. "Has anything started after the 2022 stroke caused or worsened the current creatinine trend?" is unanswerable with either vanilla RAG (no graph awareness) or long-context LLMs (Longitude, Project 04, shows where they fail).
The literature on clinical knowledge graphs has matured along three tracks: domain-wide KGs (PrimeKG[1], Hetionet[2]) that capture biomedical knowledge in the abstract; ontology backbones (UMLS[3], LOINC[4], RxNorm[5]) that supply the node-typing vocabulary; and EHR-derived KGs[11][13] that learn structure from real clinical records. Medigraph is the integrator — a pipeline that produces a single patient's KG from these established components.
1.1 Contributions
- An open pipeline that constructs a patient-centric knowledge graph from a FHIR bundle plus optional clinical notes, with node typing via UMLS-anchored ontologies.
- A reusable graph schema with explicit temporal and causal edge types, designed for multi-hop traversal by both LLM agents and graph queries.
- A 100-question evaluation suite measuring graph completeness, ontology binding accuracy, and multi-hop query F1.
§ 2 Background and Related Work
2.1 Clinical Knowledge Graph Landscape
PrimeKG (Chandak, Huang & Zitnik, Scientific Data 2023)[1] integrates 20 resources covering 17,080 diseases and 4,050,249 relationships across ten biological scales — the strongest published domain-wide medical KG. Hetionet (Himmelstein, Baranzini et al., eLife 2017)[2] precedes it: 47,031 nodes of 11 types, 2,250,197 relationships of 24 types, used canonically for drug repurposing. Both are population-level resources; neither is a patient-specific construct. Medigraph treats them as background corpora to which a per-patient subgraph can be linked.
2.2 Terminology Backbones
Bodenreider's UMLS paper[3] (Nucleic Acids Research 2004) remains the definitive description of UMLS: over 2 million names for approximately 900,000 concepts from 60+ biomedical vocabularies, with 12 million inter-concept relations. LOINC[4] provides universal codes for laboratory and clinical observations transmitted in HL7. RxNorm[5] normalises clinical drug names using an ingredient-strength-dose-form pattern (e.g., "Naproxen 250 MG Oral Tablet"). Medigraph's nodes are typed by these three backbones, with UMLS CUIs as the canonical concept identifier.
2.3 Clinical NLP for Graph Construction
cTAKES (Savova et al., JAMIA 2010)[6] reports NER F-scores of 0.715 on exact spans and 0.824 on overlapping spans for clinical free-text, making it the conservative baseline. MedCAT (Kraljevic et al., AI in Medicine 2021)[7], trained with self-supervision on ~8.8B words from ~17M clinical records across three London hospitals, achieves UMLS concept extraction F1 between 0.448 and 0.738 depending on concept type. Medigraph uses MedCAT as the default extractor with cTAKES as a sanity-check baseline.
2.4 Graph Neural Network Foundations
GraphSAGE (Hamilton, Ying & Leskovec, NeurIPS 2017)[8] introduced inductive node embedding learning by sampling and aggregating from a node's local neighbourhood — directly applicable to patient graphs where every patient introduces unseen nodes. GAT (Veličković et al., ICLR 2018)[9] added masked self-attention so different neighbours can carry different weights, important for clinical reasoning where one out-of-range lab matters more than ten in-range ones.
2.5 Patient-Centric KG Prior Art
GraphCare (Jiang, Xiao, Cross & Sun, ICLR 2024)[10] is the closest existing work: LLM-extracted concept KGs combined with a Bi-attention Augmented GNN; outperforms baselines on MIMIC-III and MIMIC-IV mortality, readmission, length-of-stay, and drug-recommendation tasks. Rotmensch et al. (Scientific Reports 2017)[13] learned a 157-disease / 491-symptom KG from 273,174 de-identified patient records using noisy-OR Bayesian networks. Murali et al.[11] (JBI 2023) review the broader EHR-KG landscape. TRANS (Chen et al., IJCAI 2024)[12] introduces temporal heterogeneous graphs for EHR prediction with visit-level precision@10 of 65.68%.
§ 3 Proposed Approach
3.1 Construction Pipeline
The pipeline runs in four stages: (1) ingest the FHIR bundle and materialise one typed node per clinical resource; (2) bind each node to a canonical UMLS CUI via SNOMED CT, RxNorm[5], or LOINC[4]; (3) extract additional concepts from free-text notes with MedCAT[7] (cTAKES[6] as baseline) and merge them into the graph; (4) derive temporal, causal, and cross-reference edges from resource references, timestamps, and note spans, each stamped with provenance.
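The ingestion stage can be sketched as a walk over a dict-shaped FHIR R4 bundle that emits typed nodes and reference-derived edges. This is a minimal illustration, not Medigraph's actual API: the function and mapping names are hypothetical, and only a subset of resource types is handled.

```python
from typing import Any

# Map FHIR resource types onto the schema's node types (subset; illustrative).
NODE_TYPE = {
    "Patient": "Patient",
    "Condition": "Disease",
    "MedicationRequest": "Drug",
    "Observation": "Observation-Code",
    "Encounter": "Encounter",
}

def ingest_bundle(bundle: dict[str, Any]):
    """Return (nodes, edges) extracted from a FHIR R4 bundle dict."""
    nodes, edges = {}, []
    for entry in bundle.get("entry", []):
        res = entry["resource"]
        rid = f'{res["resourceType"]}/{res["id"]}'
        ntype = NODE_TYPE.get(res["resourceType"])
        if ntype is None:
            continue  # unsupported resource types are skipped in this sketch
        nodes[rid] = {"type": ntype, "provenance": rid}
        # Each Reference field becomes a cross_refs edge carrying provenance.
        for field in ("subject", "encounter"):
            ref = res.get(field, {}).get("reference")
            if ref:
                edges.append((rid, "cross_refs", ref, {"provenance": rid}))
    return nodes, edges

bundle = {
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Condition", "id": "c1",
                      "subject": {"reference": "Patient/p1"}}},
    ]
}
nodes, edges = ingest_bundle(bundle)
print(len(nodes), edges[0][1])  # 2 cross_refs
```

Later stages would replace the raw FHIR ids with UMLS CUIs after ontology binding; the provenance attribute survives that rewrite.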
3.2 Graph Schema
Nodes are typed by UMLS CUI category (Disease, Drug, Procedure, Observation-Code, Encounter, Patient, Note, Provider). Edges fall into eight types: has_finding, prescribed, contraindicates, causes, precedes, follows, mentioned_in, cross_refs. Every edge carries a provenance attribute pointing back to the FHIR resource or note span that justified it — non-negotiable for downstream evaluation.
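The schema above can be pinned down as types. A sketch (class and field names are assumptions, not Medigraph's code) that makes the non-negotiable provenance rule a constructor-level invariant:

```python
from dataclasses import dataclass
from enum import Enum

class NodeType(Enum):
    DISEASE = "Disease"
    DRUG = "Drug"
    PROCEDURE = "Procedure"
    OBSERVATION_CODE = "Observation-Code"
    ENCOUNTER = "Encounter"
    PATIENT = "Patient"
    NOTE = "Note"
    PROVIDER = "Provider"

class EdgeType(Enum):
    HAS_FINDING = "has_finding"
    PRESCRIBED = "prescribed"
    CONTRAINDICATES = "contraindicates"
    CAUSES = "causes"
    PRECEDES = "precedes"
    FOLLOWS = "follows"
    MENTIONED_IN = "mentioned_in"
    CROSS_REFS = "cross_refs"

@dataclass(frozen=True)
class Node:
    id: str          # canonical UMLS CUI, e.g. "C0038454" for stroke
    type: NodeType
    label: str

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    type: EdgeType
    provenance: str  # FHIR resource id or note span; never optional

    def __post_init__(self):
        # Enforce the schema rule: an edge without provenance is rejected.
        if not self.provenance:
            raise ValueError("every edge must carry provenance")
```

Encoding the vocabularies as enums means a typo in an edge type fails at construction time rather than surfacing later as a silent query miss.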
3.3 Query Interface
Two query surfaces. The first is direct Cypher (or SPARQL for RDF deployments) for deterministic graph traversal. The second is an LLM-translation layer that turns natural-language clinical questions into graph queries — the obvious failure-mode here is Cypher hallucination, and Medigraph's evaluation explicitly measures it.
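One cheap guard against Cypher hallucination is a static validation pass over the generated query: pull out the node labels and relationship types it references and reject anything outside the § 3.2 schema. A regex-based sketch (a real deployment would use a proper Cypher parser; "Observation-Code" is rendered as ObservationCode because hyphens are not legal in unquoted Cypher labels):

```python
import re

# Schema vocabularies from § 3.2 (ObservationCode is the hyphen-free rendering).
NODE_LABELS = {"Disease", "Drug", "Procedure", "ObservationCode",
               "Encounter", "Patient", "Note", "Provider"}
EDGE_TYPES = {"has_finding", "prescribed", "contraindicates", "causes",
              "precedes", "follows", "mentioned_in", "cross_refs"}

def hallucinated_terms(cypher: str) -> set[str]:
    """Return schema terms referenced by the query that do not exist."""
    every_term = set(re.findall(r":(\w+)", cypher))
    # Relationship types appear as [:type]; everything else after ':' is a label.
    rels = set(re.findall(r"\[\s*\w*\s*:(\w+)", cypher))
    node_labels = every_term - rels
    return (node_labels - NODE_LABELS) | (rels - EDGE_TYPES)

good = "MATCH (p:Patient)-[:has_finding]->(d:Disease) RETURN d"
bad = "MATCH (p:Patient)-[:diagnosed_with]->(d:Tumor) RETURN d"
print(hallucinated_terms(good))          # set()
print(sorted(hallucinated_terms(bad)))   # ['Tumor', 'diagnosed_with']
```

Queries flagged by this check count directly toward the Cypher-hallucination-rate metric in § 4, before they ever touch the graph store.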
§ 4 Evaluation Protocol
| Metric | Definition | Target |
|---|---|---|
| Graph completeness | Fraction of gold-annotated facts captured as graph nodes/edges | ≥ 0.95 |
| Ontology-binding accuracy | Fraction of nodes correctly linked to UMLS CUI | ≥ 0.90 |
| Multi-hop F1 | F1 on a 100-question multi-hop query suite (≥ 3 graph hops) | ≥ 0.75 |
| Cypher hallucination rate | Fraction of LLM-translated queries that reference non-existent nodes or relations | ≤ 0.05 |
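The first and third metrics reduce to set overlaps against gold annotations. A sketch of the scoring arithmetic, with facts represented as (source, edge_type, target) triples (the triple encoding and the assumption that the suite macro-averages per-question F1 are mine, not stated in the protocol):

```python
def completeness(gold_facts: set, graph_facts: set) -> float:
    """Fraction of gold-annotated facts captured in the built graph."""
    return len(gold_facts & graph_facts) / len(gold_facts)

def query_f1(predicted: set, gold: set) -> float:
    """Set-level F1 for a single multi-hop query's answer set."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("stroke", "precedes", "lisinopril"),
        ("lisinopril", "causes", "creatinine-rise")}
built = {("stroke", "precedes", "lisinopril")}
print(completeness(gold, built))  # 0.5
```

The ≥ 0.95 completeness target is then a straightforward threshold on this ratio, computed per gold-annotated test patient.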
§ 5 Expected Contributions
- Pipeline. An open, reproducible patient-KG construction pipeline that takes FHIR bundles to typed graphs.
- Schema. A reusable graph schema with explicit provenance per edge — the property the clinical setting requires.
- Benchmark. A 100-task multi-hop clinical query evaluation suite.
§ 6 Limitations and Risks
Patient KGs are only as good as their inputs. FHIR completeness varies; clinical notes are noisy; UMLS coverage of niche specialties is uneven. MedCAT's reported F1 range of 0.448–0.738[7] is the practical ceiling for note-derived concept extraction. The causes edge type is inherently soft; Medigraph emits it conservatively, always with provenance and never as ground truth.
§ 7 Conclusion
Medigraph is the missing patient-centric layer between domain-wide medical KGs[1][2] and the per-patient FHIR substrate. Every component exists in the literature; the contribution is the integration, the schema, and the evaluation harness that GraphCore (Project 17) and downstream agents will build on.
References
- [1] Chandak P, Huang K, Zitnik M. Building a knowledge graph to enable precision medicine (PrimeKG). Scientific Data 10:67, 2023. 20 resources, 17,080 diseases, 4,050,249 relationships. nature.com/articles/s41597-023-01960-3
- [2] Himmelstein DS, Baranzini SE, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing (Hetionet). eLife 6:e26726, 2017. 47,031 nodes / 11 types, 2,250,197 relationships / 24 types. elifesciences.org/articles/26726
- [3] Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl_1):D267–D270, 2004. academic.oup.com/nar/article/32/suppl_1/D267/2505235
- [4] McDonald CJ, Huff SM, et al. LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clinical Chemistry 49(4):624–633, 2003. academic.oup.com/clinchem/article/49/4/624
- [5] Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R. Normalized names for clinical drugs: RxNorm at 6 years. JAMIA 18(4):441–448, 2011. academic.oup.com/jamia/article/18/4/441
- [6] Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES). JAMIA 17(5):507–513, 2010. NER F-score 0.715 exact / 0.824 overlapping. academic.oup.com/jamia/article/17/5/507
- [7] Kraljevic Z, Searle T, et al. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artificial Intelligence in Medicine 117:102083, 2021. pubmed.ncbi.nlm.nih.gov/34127232
- [8] Hamilton WL, Ying R, Leskovec J. Inductive Representation Learning on Large Graphs (GraphSAGE). NeurIPS, 2017. arxiv.org/abs/1706.02216
- [9] Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks. ICLR, 2018. arxiv.org/abs/1710.10903
- [10] Jiang P, Xiao C, Cross A, Sun J. GraphCare: Enhancing Healthcare Predictions with Personalized Knowledge Graphs. ICLR, 2024. openreview.net/forum?id=tVTN7Zs0ml
- [11] Murali L, Gopakumar G, Viswanathan DM, Nedungadi P. Towards electronic health record-based medical knowledge graph construction, completion, and applications: A literature study. Journal of Biomedical Informatics 143:104403, 2023. sciencedirect.com/science/article/pii/S1532046423001247
- [12] Chen J, Yin C, Wang Y, Zhang P. Predictive Modeling with Temporal Graphical Representation on Electronic Health Records (TRANS). IJCAI, 2024. Visit-level precision@10 65.68% on MIMIC-IV. ijcai.org/proceedings/2024/637
- [13] Rotmensch M, Halpern Y, Tlimat A, Horng S, Sontag D. Learning a Health Knowledge Graph from Electronic Medical Records. Scientific Reports 7:5994, 2017. 157 diseases / 491 symptoms from 273,174 records. nature.com/articles/s41598-017-05778-z
© Takeoff AI