Paper 01 / 10 Preliminary Manuscript · v0.1 May 2026

Atrium: A Reference Model Context Protocol Server for FHIR Clinical Data with SMART-on-FHIR Launch

A SMART app for the agent era — production-grade infrastructure for grounding frontier language models in a patient's actual electronic health record.

Abstract

Frontier language models lack a standardised mechanism for grounding their reasoning in clinical data hosted in healthcare's dominant interchange format, FHIR. We propose Atrium, an open-source TypeScript reference implementation of the Model Context Protocol[1] that exposes a FHIR R4 endpoint as MCP tools and resources, with authorization handled by the SMART-on-FHIR launch framework[2]. In SMART terminology, Atrium is a back-end SMART app: an OAuth-scoped client of a FHIR server whose tools are surfaced to a language model rather than to a clinician's browser. The server covers seven FHIR R4[3] resource types with paginated queries, code-system search, longitudinal slicing, OAuth-flow authentication, and append-only audit logging. A synthea-seed companion produces a 10,000-patient synthetic cohort[4] for evaluation. We define a fifty-question clinical question-answering harness whose pass criterion is 47/50 correctness with zero invented patient facts — a grounding-fidelity bar identified as the dominant failure mode in recent healthcare LLM systematic reviews[10]. Atrium is designed to be load-bearing infrastructure for downstream agentic clinical applications and a reproducible target for grounding evaluation that the closest prior art[9] does not provide.

§ 1 Introduction

The Model Context Protocol[1], released as an open specification in late 2024, defines a JSON-RPC interface between a language-model host and external systems through tools (callable functions) and resources (addressable data). In a clinical context, an MCP-equipped model can query a patient's medical history during reasoning rather than relying on context-window prefills or implicit retrieval. The protocol turns the question of "how does the model access clinical data" from an application-level concern into an infrastructure one.
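As a concrete illustration, a tool invocation travels as an ordinary JSON-RPC 2.0 message. The sketch below shows the wire shape; the tool name and arguments are borrowed from Atrium's Observation search surface, and the exact argument names are illustrative rather than normative:

```typescript
// Shape of an MCP tools/call request as it appears on the wire (JSON-RPC 2.0).
// "tools/call" is the method name defined by the MCP specification; the tool
// name and arguments below are illustrative.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

const callTool: JsonRpcRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "search_observations",
    // 4548-4 is the LOINC code for Hemoglobin A1c.
    arguments: { patient_id: "P1234", code: "4548-4", limit: 10 },
  },
};
```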

Yet despite the entrenchment of HL7 FHIR[3] as the de facto clinical interchange standard — codified in the United States by the 21st Century Cures Act and the ONC's TEFCA framework — no production-grade reference MCP server for FHIR data exists in the public domain. The closest prior work[9] demonstrates feasibility on a sandbox endpoint but offers neither audit logging, multi-resource search, nor a reproducible evaluation harness.

This gap is consequential. Each clinical AI team currently rebuilds the same plumbing: SMART OAuth flows, resource pagination, terminology binding, audit trails. The duplication is wasteful, and the resulting non-interoperable stacks produce agents that cannot port between sites. A canonical reference server with a published evaluation harness changes the equilibrium.

1.1 Contributions

  1. A reference MCP server covering seven FHIR R4 resource types (Patient, Observation, Condition, MedicationRequest, Encounter, DiagnosticReport, AllergyIntolerance) with SMART OAuth authentication and FHIR-format AuditEvent logging.
  2. A synthea-seed companion utility that bundles a 10,000-patient cohort generated by Synthea[4] and produces a paired evaluation manifest.
  3. A fifty-task clinical QA evaluation suite whose pass criterion separates retrieval accuracy from fact-invention rate, addressing the principal grounding failure mode identified by Bedi et al.[10]

§ 2 Background and Related Work

2.1 FHIR Data and the SMART-on-FHIR Launch Framework

It is important to distinguish two separate layers that are routinely conflated. FHIR[3] is the data layer — a RESTful resource model that reformulates healthcare interchange around approximately 145 resource types (Patient, Observation, Condition, MedicationRequest, etc.), each a strongly-typed JSON document with terminology binding to LOINC, SNOMED CT, and RxNorm. FHIR R4 is the current production version. SMART on FHIR[2] is an entirely separate application-launch and authorization framework that sits on top of FHIR. SMART defines how a third-party application — a SMART app — discovers a FHIR endpoint, performs an OAuth 2.0 authorization-code-with-PKCE flow, requests scoped permissions (patient/Observation.read, user/Practitioner.read, etc.), and receives a time-bounded access token. SMART does not specify data formats; FHIR does. Atrium therefore reads FHIR data via a SMART-on-FHIR launch — the two are complementary, not synonymous.
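A minimal sketch of the SMART authorization request in TypeScript, assuming the authorize endpoint and client registration have already been discovered from the server's .well-known/smart-configuration document (the endpoint URL, client id, and redirect URI here are placeholders, not part of any specification):

```typescript
import { createHash, randomBytes } from "node:crypto";

// RFC 7636 base64url encoding: standard base64 with URL-safe characters and
// padding stripped.
function base64url(buf: Buffer): string {
  return buf.toString("base64").replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Build the authorization-code-with-PKCE request URL. The verifier must be
// retained and presented later during the token exchange.
function buildAuthorizeUrl(authorizeEndpoint: string, clientId: string, redirectUri: string) {
  const verifier = base64url(randomBytes(32));
  const challenge = base64url(createHash("sha256").update(verifier).digest());
  const url = new URL(authorizeEndpoint);
  url.search = new URLSearchParams({
    response_type: "code",
    client_id: clientId,
    redirect_uri: redirectUri,
    // v2.2.0-style scopes: read + search on Observation and Condition.
    scope: "launch/patient patient/Observation.rs patient/Condition.rs",
    code_challenge: challenge,
    code_challenge_method: "S256",
    state: base64url(randomBytes(16)),
  }).toString();
  return { url: url.toString(), verifier };
}
```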

2.2 Synthetic Patient Generation

Synthea[4] generates clinically realistic synthetic patient records covering longitudinal events from birth through death across ten common chronic disease modules. Synthea bundles validate against FHIR R4 profiles and have become the canonical evaluation dataset for FHIR tooling that cannot lawfully redistribute PHI from MIMIC-IV or comparable sources. Atrium seeds a 10,000-patient cohort generated by Synthea with Massachusetts demographics; Synthea's reference SyntheticMass dataset is a separate ~1M-patient release that can be substituted at scale.

2.3 Tool-Augmented Language Models

Modern LLM agents alternate in-context reasoning with external tool calls. Toolformer[5] demonstrated that LLMs can self-supervise their tool use; ReAct[6] established the now-canonical interleaved reason-act-observe loop. Gorilla[7] surfaced a scaling concern: as tool surfaces grow, models hallucinate API signatures unless equipped with explicit retrieval. MCP addresses this at the protocol layer by exposing tools with structured schemas a client can discover at runtime — which makes the protocol particularly well-matched to FHIR's catalogue of well-typed resources.

2.4 LLMs in Clinical Reasoning

Med-PaLM[8] established that frontier LLMs reach expert-level performance on MultiMedQA when given grounded clinical context. Subsequent literature has converged on a consistent observation: knowledge is rarely the bottleneck; grounding is. Models invent specifics — lab values, medication names, dates — that are not present in the source record. Bedi et al.'s systematic review[10] identifies real-patient-data grounding and fairness analysis as the principal evaluation gaps in healthcare LLM tooling. Atrium's evaluation harness is designed against this exact failure mode.

2.5 Prior MCP Implementations for Healthcare

Ehtesham et al.[9] describe an MCP-FHIR bridge evaluated on the SMART Health IT sandbox — the closest existing prior art. We position Atrium as an evolution rather than a replacement: we adopt their architectural pattern, extend resource coverage to seven types with longitudinal slicing, add FHIR-native audit logging required for HIPAA defensibility, and contribute a quantitative grounding-fidelity evaluation harness that the existing work lacks.

§ 3 Proposed Approach

3.1 Architecture

Atrium is implemented as a stateless Node.js (v22) MCP server in TypeScript, using the Anthropic MCP SDK and Hono for HTTP transport. The server connects to a HAPI FHIR R4 endpoint[12] via the canonical FHIR client library. The MCP specification (revision 2025-03-26) defines exactly two transports — stdio for in-process subprocess invocation and Streamable HTTP for networked deployments[11]; the older HTTP+SSE transport from the 2024-11-05 revision is deprecated. Atrium supports both current transports. Streamable HTTP additionally requires Origin-header validation and SHOULD bind to 127.0.0.1 for local deployments. This is a non-trivial security concern: Hasan et al.[13] recently audited 1,899 open-source MCP servers and found that 5.5% exhibit MCP-specific tool-poisoning vulnerabilities, and Radosevich & Halloran[14] demonstrated that frontier LLMs can be coerced via MCP tools into malicious code execution and credential theft. Atrium therefore validates the Origin header on every Streamable HTTP request and runs its audit log through a tamper-evident hash chain.
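A framework-agnostic sketch of these two hardening measures, an Origin allow-list check and a hash chain over audit-log entries (the allow-list contents are deployment-specific assumptions, not a prescribed default):

```typescript
import { createHash } from "node:crypto";

// Illustrative allow-list for a local deployment; real deployments supply this
// via configuration.
const ALLOWED_ORIGINS = new Set(["http://127.0.0.1", "http://localhost"]);

// Reject any Streamable HTTP request whose Origin is absent, malformed, or not
// on the allow-list.
function originAllowed(originHeader: string | undefined): boolean {
  if (!originHeader) return false;
  try {
    const { protocol, hostname } = new URL(originHeader);
    return ALLOWED_ORIGINS.has(`${protocol}//${hostname}`);
  } catch {
    return false;
  }
}

// Each audit entry's hash covers the previous entry's hash, so rewriting any
// past entry invalidates every hash that follows it (tamper-evidence).
function chainHash(prevHash: string, entryJson: string): string {
  return createHash("sha256").update(prevHash + entryJson).digest("hex");
}
```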

Figure 1 · System architecture
Figure 1. (1) A Claude MCP host connects over one of the two transports defined in the MCP specification revision 2025-03-26[11]. (2) Atrium runs as a stateless Node.js MCP server with three layered concerns — SMART OAuth scope enforcement using the v2.2.0 scope syntax[15] (e.g. patient/Observation.rs), the FHIR R4 tool surface, and append-only audit logging emitted as FHIR AuditEvent[16] resources satisfying HIPAA Security Rule §164.312(b)[17]. (3) A HAPI FHIR R4 endpoint[12] provides the underlying resource store, seeded for evaluation by Synthea[4].
SMART v2.2.0 scope syntax — the permission alphabet

v2.2.0 introduced a five-letter permission alphabet: c create, r read, u update, d delete, s search. A scope of patient/Observation.rs grants read-plus-search but not create/update/delete on Observation for the launched patient context. The older .read shorthand maps to .rs; .write to .cud; .* to .cruds[15].
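The v1 → v2 mapping is mechanical; a sketch (the helper name is illustrative):

```typescript
// v1 → v2 scope-suffix mapping from the SMART App Launch v2.2.0 spec[15].
const V1_TO_V2: Record<string, string> = { read: "rs", write: "cud", "*": "cruds" };

// Upgrade a v1-style scope such as "patient/Observation.read" to v2 syntax.
// Scopes already using the c/r/u/d/s alphabet pass through unchanged.
function upgradeScope(scope: string): string {
  const m = scope.match(/^(patient|user|system)\/([A-Za-z*]+)\.(read|write|\*)$/);
  if (!m) return scope;
  return `${m[1]}/${m[2]}.${V1_TO_V2[m[3]]}`;
}
```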

Each FHIR resource type is exposed as a paired construct — a read-only resource for direct addressing and a parameterized tool for search. For Observation:

resource: fhir://patients/{patient_id}/observations?category={c}&date={start..end}
tool: search_observations(patient_id, code, date_range, limit)
tool: get_observation_trend(patient_id, code, window)

The third form — get_observation_trend — is a derived tool that returns time-ordered values for a specified LOINC code (e.g., HbA1c trends over a five-year window), avoiding a common failure pattern in which the LLM stitches together a longitudinal trend from individually retrieved Observations and miscomputes the chronology.
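A sketch of how such a derived tool might be implemented, assuming a fhirSearch helper that pages through a FHIR search and returns decoded Observation resources (the helper name and signature are illustrative, not Atrium's actual internals):

```typescript
// Minimal slice of the FHIR R4 Observation resource relevant to trends.
interface Observation {
  effectiveDateTime?: string;
  valueQuantity?: { value: number; unit?: string };
}

// Return a time-ordered series of values for one LOINC code over a window of
// N years, so the model never has to reconstruct chronology itself.
async function getObservationTrend(
  fhirSearch: (path: string) => Promise<Observation[]>,
  patientId: string,
  loincCode: string,
  windowYears: number,
): Promise<Array<{ date: string; value: number; unit?: string }>> {
  const since = new Date();
  since.setFullYear(since.getFullYear() - windowYears);
  const obs = await fhirSearch(
    `Observation?patient=${patientId}&code=http://loinc.org|${loincCode}` +
      `&date=ge${since.toISOString().slice(0, 10)}&_sort=date`,
  );
  return obs
    .filter((o) => o.effectiveDateTime && o.valueQuantity) // drop undated/unvalued entries
    .map((o) => ({ date: o.effectiveDateTime!, value: o.valueQuantity!.value, unit: o.valueQuantity!.unit }))
    .sort((a, b) => a.date.localeCompare(b.date)); // defensive re-sort client-side
}
```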

3.2 Authentication

Atrium implements the standard SMART on FHIR launch flow[2]: authorization-code grant with PKCE, scoped to specific FHIR resource categories (patient/Observation.rs, patient/Condition.rs, etc., in the v2.2.0 scope syntax[15]). Access tokens are short-lived (60 minutes) and refreshable, and the server maintains no persistent session state. Scope enforcement happens at the MCP tool boundary: a request out of scope returns a structured MCP error rather than silently downgrading.
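A sketch of boundary enforcement under these assumptions (the scope-to-tool table and the choice of JSON-RPC error code are illustrative):

```typescript
// Illustrative mapping from tool name to the scope it requires.
const REQUIRED_SCOPE: Record<string, string> = {
  search_observations: "patient/Observation.rs",
  search_conditions: "patient/Condition.rs",
};

// Return a structured error when the granted scopes do not cover the tool;
// null means the call may proceed. No silent downgrade ever occurs.
function checkScope(
  tool: string,
  grantedScopes: string[],
): { code: number; message: string } | null {
  const needed = REQUIRED_SCOPE[tool];
  if (needed && !grantedScopes.includes(needed)) {
    return { code: -32602, message: `missing scope ${needed} for tool ${tool}` };
  }
  return null;
}
```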

3.3 Audit Logging

Every resource read and tool invocation produces an append-only audit record containing: timestamp, MCP client identifier, scope used, patient identifier (one-way hashed with a per-deployment salt), resource type accessed, and the count of resources returned. Records are exportable as FHIR AuditEvent resources, satisfying the HL7 Audit Event guidance for HIPAA-defensible access logging.
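The record construction can be sketched as follows, assuming a per-deployment salt held in configuration (field names are illustrative; the exportable form is the FHIR AuditEvent resource described above):

```typescript
import { createHash } from "node:crypto";

// One append-only audit record. The patient identifier is stored only as a
// salted one-way hash; the raw identifier never reaches the log.
interface AuditRecord {
  timestamp: string;
  clientId: string;
  scope: string;
  patientHash: string;
  resourceType: string;
  resultCount: number;
}

function makeAuditRecord(
  salt: string,
  clientId: string,
  scope: string,
  patientId: string,
  resourceType: string,
  resultCount: number,
): AuditRecord {
  const patientHash = createHash("sha256").update(salt + patientId).digest("hex");
  return {
    timestamp: new Date().toISOString(),
    clientId,
    scope,
    patientHash,
    resourceType,
    resultCount,
  };
}
```

The per-deployment salt means identical patients produce different hashes across deployments, so logs cannot be joined across sites, while remaining linkable within one deployment for audit review.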

3.4 Synthea Seeding

The synthea-seed companion CLI generates 10,000 Synthea patients with Massachusetts-style demographics[4], loads them into the HAPI FHIR sandbox, and produces an evaluation manifest that maps patient identifiers to known-correct clinical facts (e.g., "patient P1234 has type 2 diabetes mellitus with most recent HbA1c of 7.8% on 2024-03-15"). The manifest is the source of truth against which the evaluation harness scores.
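An illustrative shape for one manifest entry (the field names are assumptions for exposition, not the CLI's actual schema):

```typescript
// One gold fact in the evaluation manifest, tied to a synthetic patient and to
// the FHIR resource from which the fact is retrievable.
interface ManifestEntry {
  patientId: string;
  fact: string;           // human-readable gold fact
  resourceType: string;   // where the fact lives in FHIR
  code?: string;          // terminology binding (LOINC, SNOMED CT, RxNorm)
  value?: number;
  unit?: string;
  effectiveDate?: string;
}

// The worked example from the text above, as a manifest entry.
const example: ManifestEntry = {
  patientId: "P1234",
  fact: "type 2 diabetes mellitus, most recent HbA1c 7.8%",
  resourceType: "Observation",
  code: "4548-4", // LOINC: Hemoglobin A1c
  value: 7.8,
  unit: "%",
  effectiveDate: "2024-03-15",
};
```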

§ 4 Evaluation Protocol

We construct a held-out evaluation suite of fifty clinical questions, each tied to a specific synthetic patient and a specific gold answer. Questions cover all seven exposed resource types and span two complexity tiers.

The evaluation procedure for a given LLM client:

  1. Connect the client (default: Claude Opus 4.7) to a fresh Atrium instance with the standard tool surface available and no patient context preloaded.
  2. Submit each question; the model is expected to use Atrium tools to resolve the answer.
  3. Score on two orthogonal axes:
    • Correctness: binary agreement with the gold answer; for numerical values, a tolerance of ±1 unit-in-the-last-place is allowed.
    • Grounding fidelity: a second-pass blind audit by an independent reviewer identifies any clinical fact (lab value, medication name, diagnosis, date) appearing in the response that is not retrievable from Atrium-exposed FHIR data. This is the fact-invention rate.
Pass criterion. Atrium passes its v0.1 evaluation if a target frontier model achieves 47 of 50 correct (94%) with zero fact inventions over the held-out set. The grounding bar is the harder of the two; Bedi et al.[10] note that even high-accuracy clinical LLMs routinely invent specifics, so a zero invention rate across all fifty questions is the genuinely novel contribution.
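The ±1 unit-in-the-last-place tolerance can be made precise against the gold answer's printed precision; a sketch (the function name is illustrative):

```typescript
// Correct iff the model's numeric answer is within one unit in the last place
// of the gold answer as printed: gold "7.8" tolerates 7.7–7.9, gold "120"
// tolerates 119–121.
function withinOneUlp(gold: string, answer: number): boolean {
  const decimals = (gold.split(".")[1] ?? "").length;
  const ulp = Math.pow(10, -decimals);
  // Small epsilon absorbs binary floating-point rounding in the subtraction.
  return Math.abs(parseFloat(gold) - answer) <= ulp + 1e-9;
}
```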

4.1 Baselines and Comparisons

We compare against three baselines: (i) the same model with the entire patient bundle pasted into context (the naive long-context baseline); (ii) a basic FHIR-search REST client wrapped as a function-calling tool (without MCP discovery); and (iii) the Ehtesham et al. MCP-FHIR bridge[9] where compatible. The expected pattern is that bundle-paste yields high correctness but high invention; basic function-calling yields lower correctness; and only Atrium yields high correctness with zero invention.

§ 5 Expected Contributions

  1. Infrastructure. A production-grade open-source reference MCP-FHIR server that downstream healthcare AI projects can adopt without rebuilding the plumbing.
  2. Methodology. A reproducible grounding-fidelity evaluation that other MCP healthcare servers can run against the same Synthea cohort — establishing a comparable benchmark.
  3. Ecosystem. Submission to Anthropic's official MCP integrations registry, lowering the cost of healthcare AI experimentation for subsequent teams and creating direct visibility with the protocol's authors.

§ 6 Limitations and Risks

Atrium is a sandbox-grade tool. It does not satisfy production HIPAA requirements without organisation-level controls (Business Associate Agreement, network segmentation, encryption at rest in the underlying FHIR store, and formal audit-log review processes). The Synthea cohort[4] also lacks the noise patterns of real EHR data — coding inconsistencies, free-text spilling into structured fields, conflicting resources, deprecated codes — that complicate real-world grounding. Future work should extend the evaluation harness to MIMIC-IV-derived FHIR bundles under appropriate Data Use Agreement.

A second concern is benchmark gaming. A 47/50 pass rate on a fifty-question harness is a coarse signal; downstream users should not treat passing as a certification of clinical safety. The evaluation is designed to surface fact-invention rather than to certify, and the harness should grow over time.

§ 7 Conclusion

Atrium argues that the missing piece in healthcare LLM infrastructure is not a smarter model but a canonical, audited, reproducibly-evaluated server for clinical data. By implementing the published MCP specification[1] over a standards-conformant SMART-on-FHIR[2] stack, and pairing it with a grounding-fidelity evaluation that targets the dominant clinical-LLM failure mode[10], we expect Atrium to function both as practical infrastructure and as a reusable evaluation target for the next wave of agentic clinical applications.

References

  1. Anthropic and Model Context Protocol Steering Committee. Model Context Protocol Specification (2025-11-25 revision). Canonical protocol document, 2025. modelcontextprotocol.io/specification/2025-11-25
  2. Mandel JC, Kreda DA, Mandl KD, Kohane IS, Ramoni RB. SMART on FHIR: a standards-based, interoperable apps platform for electronic health records. Journal of the American Medical Informatics Association, 23(5):899–908, 2016. doi.org/10.1093/jamia/ocv189
  3. Bender D, Sartipi K. HL7 FHIR: An Agile and RESTful approach to healthcare information exchange. IEEE International Symposium on Computer-Based Medical Systems (CBMS), 2013. ieeexplore.ieee.org/document/6627810
  4. Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, Duffett C, Dube K, Gallagher T, McLachlan S. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3):230–238, 2018. doi.org/10.1093/jamia/ocx079
  5. Schick T, Dwivedi-Yu J, Dessì R, Raileanu R, Lomeli M, Zettlemoyer L, Cancedda N, Scialom T. Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems (NeurIPS), 2023. arxiv.org/abs/2302.04761
  6. Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, Cao Y. ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR), 2023. arxiv.org/abs/2210.03629
  7. Patil SG, Zhang T, Wang X, Gonzalez JE. Gorilla: Large Language Model Connected with Massive APIs. Advances in Neural Information Processing Systems (NeurIPS), 2024. arxiv.org/abs/2305.15334
  8. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, et al. Large language models encode clinical knowledge. Nature, 620:172–180, 2023. doi.org/10.1038/s41586-023-06291-2
  9. Ehtesham A, et al. Enhancing Clinical Decision Support and EHR Insights through LLMs and the Model Context Protocol: An Open-Source MCP-FHIR Framework. arXiv preprint, 2025. arxiv.org/abs/2506.13800
  10. Bedi S, Liu Y, Orr-Ewing L, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA, 333(4):319–328, 2025. jamanetwork.com/journals/jama/fullarticle/2825147
  11. Anthropic and MCP contributors. Model Context Protocol Specification, Basic — Transports (revision 2025-03-26). Defines stdio and Streamable HTTP as the two standard transports; Streamable HTTP replaces the deprecated HTTP+SSE transport from revision 2024-11-05. modelcontextprotocol.io/specification/2025-03-26/basic/transports
  12. HAPI FHIR contributors. HAPI FHIR JPA Server Architecture documentation. Open-source Java reference FHIR R4 server with REST endpoint, server interceptor chain, resource providers, and JPA storage. hapifhir.io/hapi-fhir/docs/server_jpa/architecture.html
  13. Hasan MM, Li H, Fallahzadeh E, Rajbahadur GK, Adams B, Hassan AE. Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers. arXiv preprint, 2025. Empirical study of 1,899 open-source MCP servers reporting 5.5% exhibit MCP-specific tool-poisoning vulnerabilities. arxiv.org/abs/2506.13538
  14. Radosevich N, Halloran J. MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits. arXiv preprint, 2025. Demonstrates frontier LLMs coerced via MCP tools into malicious code execution, remote access, and credential theft. arxiv.org/abs/2504.03767
  15. HL7 International. SMART App Launch Framework v2.2.0, Scopes and Launch Context. Defines the five-letter permission alphabet (c/r/u/d/s) and the v1 → v2 scope-syntax mapping. hl7.org/fhir/smart-app-launch/scopes-and-launch-context.html
  16. HL7 International. FHIR R4 — AuditEvent resource. Mandatory elements (type, recorded, agent, source); derived from IHE-ATNA and RFC 3881. hl7.org/fhir/R4/auditevent.html
  17. Office for Civil Rights, U.S. Department of Health and Human Services. HIPAA Security Rule §164.312(b) — Audit Controls (45 CFR). "Implement hardware, software, and/or procedural mechanisms that record and examine activity in information systems that contain or use electronic protected health information." ecfr.gov/current/title-45/.../164.312
  18. Asgari E, Montaña-Brown N, Dubois M, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine, 2025. Reports 1.47% hallucination and 3.45% omission across 12,999 clinician-annotated sentences in 450 notes — strongest published grounding-baseline number. nature.com/articles/s41746-025-01670-7
  19. Masayoshi M, Hashimoto T, Yokoyama K, et al. EHR-MCP: Real-world Evaluation of Clinical Information Retrieval by Large Language Models via Model Context Protocol. arXiv preprint, 2025. GPT-class LLM via MCP achieved near-perfect accuracy on 5 of 6 clinical retrieval tasks, failing only on time-dependent calculations. arxiv.org/abs/2509.15957
C. Takeoff AI