Medigraph: A Patient-Centric Clinical Knowledge Graph for Multi-Hop Reasoning over FHIR
Per-patient knowledge graphs built from FHIR resources, clinical notes, and standard ontologies — enabling the temporal, causal, multi-hop queries that vanilla RAG and long-context LLMs both miss.
Abstract
General-purpose clinical knowledge graphs are mature: PrimeKG[1] integrates 20 resources into 17,080 diseases with 4,050,249 relationships; Hetionet v1.0[2] contains 47,031 nodes and 2,250,197 relationships; UMLS[3] reconciles ~900,000 concepts from over 60 vocabularies. What does not exist is a reusable patient-centric KG construction pipeline that takes a FHIR bundle, links its references to standard ontologies (SNOMED CT / RxNorm[5] / LOINC[4]), augments structured resources with extractions from free-text notes (cTAKES[6] / MedCAT[7]), and exposes the result as a multi-hop queryable graph. Medigraph fills that gap. It composes with Atrium (Project 01) as the FHIR substrate and serves as the upstream KG for GraphCore (Project 17). The architectural inspiration is GraphCare[10] — personalized concept KGs with a Bi-attention GNN demonstrated on MIMIC-III/IV — and Rotmensch & Sontag's 2017 disease-symptom KG learned from 273,174 patient records[13]. Pass criterion: graph completeness ≥ 95% against gold-annotated test patients and multi-hop query F1 ≥ 0.75 on a 100-question evaluation suite.
§ 1 Introduction
A patient's clinical history is a graph. Conditions cause investigations; investigations yield observations; observations modify medications; medications interact with each other and with new conditions. FHIR R4 represents this faithfully at the resource level — each Observation references its subject, encounter, and code — but the resulting object graph is not directly queryable for the kinds of multi-hop questions clinical reasoning actually requires. "Has anything started after the 2022 stroke caused or worsened the current creatinine trend?" is unanswerable with either vanilla RAG (no graph awareness) or long-context LLMs (Longitude, Project 04, shows where they fail).
The literature on clinical knowledge graphs has matured along three tracks: domain-wide KGs (PrimeKG[1], Hetionet[2]) that capture biomedical knowledge in the abstract; ontology backbones (UMLS[3], LOINC[4], RxNorm[5]) that supply the node-typing vocabulary; and EHR-derived KGs[11][13] that learn structure from real clinical records. Medigraph is the integrator — a pipeline that produces a single patient's KG from these established components.
1.1 Contributions
- An open pipeline that constructs a patient-centric knowledge graph from a FHIR bundle plus optional clinical notes, with node typing via UMLS-anchored ontologies.
- A reusable graph schema with explicit temporal and causal edge types, designed for multi-hop traversal by both LLM agents and graph queries.
- A 100-question evaluation suite measuring graph completeness, ontology binding accuracy, and multi-hop query F1.
§ 2 Background and Related Work
2.1 Clinical Knowledge Graph Landscape
PrimeKG (Chandak, Huang & Zitnik, Scientific Data 2023)[1] integrates 20 resources covering 17,080 diseases and 4,050,249 relationships across ten biological scales — the strongest published domain-wide medical KG. Hetionet (Himmelstein, Baranzini et al., eLife 2017)[2] precedes it: 47,031 nodes of 11 types, 2,250,197 relationships of 24 types, used canonically for drug repurposing. Both are population-level resources; neither is a patient-specific construct. Medigraph treats them as background corpora to which a per-patient subgraph can be linked.
2.2 Terminology Backbones
Bodenreider's UMLS paper[3] (Nucleic Acids Research 2004) remains the definitive description of UMLS: over 2 million names for approximately 900,000 concepts from 60+ biomedical vocabularies, with 12 million inter-concept relations. LOINC[4] provides universal codes for laboratory and clinical observations transmitted in HL7. RxNorm[5] normalises clinical drug names using an ingredient-strength-dose-form pattern (e.g., "Naproxen 250 MG Oral Tablet"). Medigraph's nodes are typed by these three backbones, with UMLS CUIs as the canonical concept identifier.
2.3 Clinical NLP for Graph Construction
cTAKES (Savova et al., JAMIA 2010)[6] reports NER F-scores of 0.715 on exact spans and 0.824 on overlapping spans for clinical free-text, making it the conservative baseline. MedCAT (Kraljevic et al., AI in Medicine 2021)[7], trained with self-supervision on ~8.8B words from ~17M clinical records across three London hospitals, achieves UMLS concept extraction F1 between 0.448 and 0.738 depending on concept type. Medigraph uses MedCAT as the default extractor with cTAKES as a sanity-check baseline.
2.4 Graph Neural Network Foundations
GraphSAGE (Hamilton, Ying & Leskovec, NeurIPS 2017)[8] introduced inductive node embedding learning by sampling and aggregating from a node's local neighbourhood — directly applicable to patient graphs where every patient introduces unseen nodes. GAT (Veličković et al., ICLR 2018)[9] added masked self-attention so different neighbours can carry different weights, important for clinical reasoning where one out-of-range lab matters more than ten in-range ones.
2.5 Patient-Centric KG Prior Art
GraphCare (Jiang, Xiao, Cross & Sun, ICLR 2024)[10] is the closest existing work: LLM-extracted concept KGs combined with a Bi-attention Augmented GNN; outperforms baselines on MIMIC-III and MIMIC-IV mortality, readmission, length-of-stay, and drug-recommendation tasks. Rotmensch et al. (Scientific Reports 2017)[13] learned a 157-disease / 491-symptom KG from 273,174 de-identified patient records using noisy-OR Bayesian networks. Murali et al.[11] (JBI 2023) review the broader EHR-KG landscape. TRANS (Chen et al., IJCAI 2024)[12] introduces temporal heterogeneous graphs for EHR prediction with visit-level precision@10 of 65.68%.
§ 3 Proposed Approach
3.1 Construction Pipeline
The pipeline runs in four stages: (1) ingest the FHIR bundle and materialise one typed node per clinical resource; (2) bind each node to a canonical UMLS CUI via SNOMED CT, RxNorm[5], or LOINC[4]; (3) extract additional concepts from free-text notes with MedCAT[7] (cTAKES[6] as baseline) and merge them into the graph; (4) derive temporal, causal, and cross-reference edges from resource references, timestamps, and note spans, each stamped with provenance.
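The ingestion stage can be sketched as a walk over a dict-shaped FHIR R4 bundle that emits typed nodes and reference-derived edges. This is a minimal illustration, not Medigraph's actual API: the function and mapping names are hypothetical, and only a subset of resource types is handled.

```python
from typing import Any

# Map FHIR resource types onto the schema's node types (subset; illustrative).
NODE_TYPE = {
    "Patient": "Patient",
    "Condition": "Disease",
    "MedicationRequest": "Drug",
    "Observation": "Observation-Code",
    "Encounter": "Encounter",
}

def ingest_bundle(bundle: dict[str, Any]):
    """Return (nodes, edges) extracted from a FHIR R4 bundle dict."""
    nodes, edges = {}, []
    for entry in bundle.get("entry", []):
        res = entry["resource"]
        rid = f'{res["resourceType"]}/{res["id"]}'
        ntype = NODE_TYPE.get(res["resourceType"])
        if ntype is None:
            continue  # unsupported resource types are skipped in this sketch
        nodes[rid] = {"type": ntype, "provenance": rid}
        # Each Reference field becomes a cross_refs edge carrying provenance.
        for field in ("subject", "encounter"):
            ref = res.get(field, {}).get("reference")
            if ref:
                edges.append((rid, "cross_refs", ref, {"provenance": rid}))
    return nodes, edges

bundle = {
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Condition", "id": "c1",
                      "subject": {"reference": "Patient/p1"}}},
    ]
}
nodes, edges = ingest_bundle(bundle)
print(len(nodes), edges[0][1])  # 2 cross_refs
```

Later stages would replace the raw FHIR ids with UMLS CUIs after ontology binding; the provenance attribute survives that rewrite.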
3.2 Graph Schema
Nodes are typed by UMLS CUI category (Disease, Drug, Procedure, Observation-Code, Encounter, Patient, Note, Provider). Edges fall into eight types: has_finding, prescribed, contraindicates, causes, precedes, follows, mentioned_in, cross_refs. Every edge carries a provenance attribute pointing back to the FHIR resource or note span that justified it — non-negotiable for downstream evaluation.
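The schema above can be pinned down as types. A sketch (class and field names are assumptions, not Medigraph's code) that makes the non-negotiable provenance rule a constructor-level invariant:

```python
from dataclasses import dataclass
from enum import Enum

class NodeType(Enum):
    DISEASE = "Disease"
    DRUG = "Drug"
    PROCEDURE = "Procedure"
    OBSERVATION_CODE = "Observation-Code"
    ENCOUNTER = "Encounter"
    PATIENT = "Patient"
    NOTE = "Note"
    PROVIDER = "Provider"

class EdgeType(Enum):
    HAS_FINDING = "has_finding"
    PRESCRIBED = "prescribed"
    CONTRAINDICATES = "contraindicates"
    CAUSES = "causes"
    PRECEDES = "precedes"
    FOLLOWS = "follows"
    MENTIONED_IN = "mentioned_in"
    CROSS_REFS = "cross_refs"

@dataclass(frozen=True)
class Node:
    id: str          # canonical UMLS CUI, e.g. "C0038454" for stroke
    type: NodeType
    label: str

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    type: EdgeType
    provenance: str  # FHIR resource id or note span; never optional

    def __post_init__(self):
        # Enforce the schema rule: an edge without provenance is rejected.
        if not self.provenance:
            raise ValueError("every edge must carry provenance")
```

Encoding the vocabularies as enums means a typo in an edge type fails at construction time rather than surfacing later as a silent query miss.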
3.3 Query Interface
Two query surfaces. The first is direct Cypher (or SPARQL for RDF deployments) for deterministic graph traversal. The second is an LLM-translation layer that turns natural-language clinical questions into graph queries — the obvious failure-mode here is Cypher hallucination, and Medigraph's evaluation explicitly measures it.
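One cheap guard against Cypher hallucination is a static validation pass over the generated query: pull out the node labels and relationship types it references and reject anything outside the § 3.2 schema. A regex-based sketch (a real deployment would use a proper Cypher parser; "Observation-Code" is rendered as ObservationCode because hyphens are not legal in unquoted Cypher labels):

```python
import re

# Schema vocabularies from § 3.2 (ObservationCode is the hyphen-free rendering).
NODE_LABELS = {"Disease", "Drug", "Procedure", "ObservationCode",
               "Encounter", "Patient", "Note", "Provider"}
EDGE_TYPES = {"has_finding", "prescribed", "contraindicates", "causes",
              "precedes", "follows", "mentioned_in", "cross_refs"}

def hallucinated_terms(cypher: str) -> set[str]:
    """Return schema terms referenced by the query that do not exist."""
    every_term = set(re.findall(r":(\w+)", cypher))
    # Relationship types appear as [:type]; everything else after ':' is a label.
    rels = set(re.findall(r"\[\s*\w*\s*:(\w+)", cypher))
    node_labels = every_term - rels
    return (node_labels - NODE_LABELS) | (rels - EDGE_TYPES)

good = "MATCH (p:Patient)-[:has_finding]->(d:Disease) RETURN d"
bad = "MATCH (p:Patient)-[:diagnosed_with]->(d:Tumor) RETURN d"
print(hallucinated_terms(good))          # set()
print(sorted(hallucinated_terms(bad)))   # ['Tumor', 'diagnosed_with']
```

Queries flagged by this check count directly toward the Cypher-hallucination-rate metric in § 4, before they ever touch the graph store.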
§ 4 Evaluation Protocol
| Metric | Definition | Target |
|---|---|---|
| Graph completeness | Fraction of gold-annotated facts captured as graph nodes/edges | ≥ 0.95 |
| Ontology-binding accuracy | Fraction of nodes correctly linked to UMLS CUI | ≥ 0.90 |
| Multi-hop F1 | F1 on a 100-question multi-hop query suite (≥ 3 graph hops) | ≥ 0.75 |
| Cypher hallucination rate | Fraction of LLM-translated queries that reference non-existent nodes or relations | ≤ 0.05 |
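The first and third metrics reduce to set overlaps against gold annotations. A sketch of the scoring arithmetic, with facts represented as (source, edge_type, target) triples (the triple encoding and the assumption that the suite macro-averages per-question F1 are mine, not stated in the protocol):

```python
def completeness(gold_facts: set, graph_facts: set) -> float:
    """Fraction of gold-annotated facts captured in the built graph."""
    return len(gold_facts & graph_facts) / len(gold_facts)

def query_f1(predicted: set, gold: set) -> float:
    """Set-level F1 for a single multi-hop query's answer set."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("stroke", "precedes", "lisinopril"),
        ("lisinopril", "causes", "creatinine-rise")}
built = {("stroke", "precedes", "lisinopril")}
print(completeness(gold, built))  # 0.5
```

The ≥ 0.95 completeness target is then a straightforward threshold on this ratio, computed per gold-annotated test patient.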
§ 5 Expected Contributions
- Pipeline. An open, reproducible patient-KG construction pipeline that takes FHIR bundles to typed graphs.
- Schema. A reusable graph schema with explicit provenance per edge — the property the clinical setting requires.
- Benchmark. A 100-task multi-hop clinical query evaluation suite.
§ 6 Limitations and Risks
Patient KGs are only as good as their inputs. FHIR completeness varies; clinical notes are noisy; UMLS coverage of niche specialties is uneven. MedCAT's reported F1 range of 0.448–0.738[7] is the practical ceiling for note-derived concept extraction. The causes edge type is inherently soft; Medigraph emits it conservatively, always with provenance and never as ground truth.
§ 7 Conclusion
Medigraph is the missing patient-centric layer between domain-wide medical KGs[1][2] and the per-patient FHIR substrate. Every component exists in the literature; the contribution is the integration, the schema, and the evaluation harness that GraphCore (Project 17) and downstream agents will build on.
References
- [1] Chandak P, Huang K, Zitnik M. Building a knowledge graph to enable precision medicine (PrimeKG). Scientific Data 10:67, 2023. 20 resources, 17,080 diseases, 4,050,249 relationships. nature.com/articles/s41597-023-01960-3
- [2] Himmelstein DS, Baranzini SE, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing (Hetionet). eLife 6:e26726, 2017. 47,031 nodes / 11 types, 2,250,197 relationships / 24 types. elifesciences.org/articles/26726
- [3] Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl_1):D267–D270, 2004. academic.oup.com/nar/article/32/suppl_1/D267/2505235
- [4] McDonald CJ, Huff SM, et al. LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clinical Chemistry 49(4):624–633, 2003. academic.oup.com/clinchem/article/49/4/624
- [5] Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R. Normalized names for clinical drugs: RxNorm at 6 years. JAMIA 18(4):441–448, 2011. academic.oup.com/jamia/article/18/4/441
- [6] Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES). JAMIA 17(5):507–513, 2010. NER F-score 0.715 exact / 0.824 overlapping. academic.oup.com/jamia/article/17/5/507
- [7] Kraljevic Z, Searle T, et al. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artificial Intelligence in Medicine 117:102083, 2021. pubmed.ncbi.nlm.nih.gov/34127232
- [8] Hamilton WL, Ying R, Leskovec J. Inductive Representation Learning on Large Graphs (GraphSAGE). NeurIPS, 2017. arxiv.org/abs/1706.02216
- [9] Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks. ICLR, 2018. arxiv.org/abs/1710.10903
- [10] Jiang P, Xiao C, Cross A, Sun J. GraphCare: Enhancing Healthcare Predictions with Personalized Knowledge Graphs. ICLR, 2024. openreview.net/forum?id=tVTN7Zs0ml
- [11] Murali L, Gopakumar G, Viswanathan DM, Nedungadi P. Towards electronic health record-based medical knowledge graph construction, completion, and applications: A literature study. Journal of Biomedical Informatics 143:104403, 2023. sciencedirect.com/science/article/pii/S1532046423001247
- [12] Chen J, Yin C, Wang Y, Zhang P. Predictive Modeling with Temporal Graphical Representation on Electronic Health Records (TRANS). IJCAI, 2024. Visit-level precision@10 65.68% on MIMIC-IV. ijcai.org/proceedings/2024/637
- [13] Rotmensch M, Halpern Y, Tlimat A, Horng S, Sontag D. Learning a Health Knowledge Graph from Electronic Medical Records. Scientific Reports 7:5994, 2017. 157 diseases / 491 symptoms from 273,174 records. nature.com/articles/s41598-017-05778-z
© Takeoff AI