Safeguard: A Hybrid Deterministic-Plus-LLM Validator for Total Parenteral Nutrition Order Safety
Pre-dispatch safety review for one of medicine's highest-risk pharmacy workflows. Anchored in ASPEN guidelines and the calcium-phosphate solubility curve.
Abstract
Total parenteral nutrition (TPN) is a high-risk pharmacy workflow whose error pattern has been characterised by the American Society for Parenteral and Enteral Nutrition (ASPEN)[1] and codified in their 2014 ordering and review guidelines[2]. The 1994 FDA precipitation alert[4] documents two deaths from calcium-phosphate precipitation; subsequent literature has quantified the solubility curve[5] and the peripheral-versus-central osmolarity thresholds[6]. Safeguard implements pre-dispatch safety validation as a two-layer system: a deterministic chemistry engine that enforces solubility, osmolarity, and lipid-additive limits, and an LLM reasoning layer that interprets soft ASPEN-guideline checks. The LLM is bounded by the deterministic gate, never the other way round, a design motivated by recent evidence[11] that LLMs perform inconsistently on drug-drug interaction detection in isolation. Pass criterion: ≥ 95% catch of seeded unsafe orders with < 5% false-positive rate.
§ 1 Introduction
TPN sits in a peculiar position in clinical pharmacy. The orders are individually customised (every patient's electrolyte targets, glucose load, lipid emulsion, and trace-element profile are unique), and they are calculated under time pressure, often by clinicians who are not pharmacists. The error opportunities are correspondingly broad: a calcium-phosphate combination that precipitates in the bag, an osmolarity too high for the chosen vascular route, an electrolyte delta that ignores the patient's current potassium. The 1994 FDA precipitation alert[4] remains the sentinel example: two deaths from a precipitate that formed because the order's calcium-phosphate product exceeded the published solubility limit.
Existing clinical decision support reduces medication errors in general[8], but TPN-specific tooling is sparse in the public domain. The empirical error rate is concrete: Sacks et al. (Pharmacotherapy 2009)[12] documented 15.6 medication errors per 1,000 PN prescriptions at a large university teaching hospital, with 1% prescription-stage, 39% transcription-stage, 24% preparation-stage, and 35% administration-stage; 8% caused temporary patient harm. The ASPEN Adverse Event and Error Reporting Program[13] documents that nearly two-thirds of survey respondents observe 1–5 PN-related errors per month, with 71% involving PN electrolytes. The ASPEN safety consensus[1] defines the error categories; the 2014 ordering guideline[2] defines the workflow; the 2022 compatibility review[3] documents the trace-element incompatibilities; the MacKay 2011 study[14] provides practice-based validation of the equation approach later codified by Anderson 2022. What is missing is an open implementation that operationalises these into a pre-dispatch validator with documented LLM augmentation.
Recent empirical evidence supports the augmentation case directly. A 2025 Cell Reports Medicine study[15] evaluated an LLM-augmented clinical decision support system across 16 clinical specialties and found that co-pilot mode (pharmacist + LLM-CDSS) increased mean drug-related problem identification accuracy by 32.6% over pharmacist alone, with a 1.5× accuracy improvement for serious-harm errors specifically. The Rx-LLM benchmark suite[16] (medRxiv 2025) provides a directly applicable validation framework — drug formulation matching, order generation, route matching, DDI detection, renal dosing, indication matching — that Safeguard adopts for its evaluation harness.
1.1 Contributions
- An open-source TPN safety validator combining a deterministic chemistry engine with an LLM reasoning layer over the ASPEN ordering and compatibility guidelines[2][3].
- A reference safety-check rule set covering: (i) calcium-phosphate solubility via the Anderson et al.[5] parameterised curve, (ii) osmolarity limits by route per Dugan et al.[6] and ASPEN, (iii) trace-element compatibility per Boullata et al.[3], (iv) neonatal-specific checks per Robinson et al.[7].
- An evaluation corpus of synthetic orders seeded with the documented adverse-event patterns from the literature, plus a control set of safe orders.
§ 2 Background and Related Work
2.1 The ASPEN Safety Framework
Ayers et al.[1] describe TPN safety as a four-stage error cascade: prescribing → order review → compounding → administration. Each stage has documented failure modes; pre-dispatch order review is the most leveraged intervention point because it intercepts errors before compounding commits resources. Boullata et al.[2] formalise the standardised order-review workflow, including the canonical peripheral-PN osmolarity ceiling of 900 mOsm/L (above which phlebitis risk is unacceptable) and structured electrolyte-compatibility checks.
2.2 Calcium-Phosphate Precipitation
The FDA precipitation alert[4] is the foundational case. Two patients died after calcium-phosphate precipitates formed in their TPN bags. Subsequent quantitative work by Anderson et al.[5] produced a parameterised solubility equation as a function of pH, amino-acid concentration, and temperature. Safeguard's deterministic chemistry engine evaluates the Anderson curve directly against every order's Ca/PO4 load, dextrose concentration, and amino-acid percentage — a hard gate that does not depend on the LLM.
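The hard-gate logic can be sketched independently of the curve itself. The following is a minimal sketch, not Safeguard's implementation: the `SolubilityCurve` signature, the `check_ca_p` name, and the units are assumptions, and `placeholder_curve` is an invented, clinically meaningless stand-in that must not be read as the Anderson equation[5]. The only value taken from the paper is the 10% safety margin stated in Section 3.1.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed signature: (calcium mEq/L, amino-acid %) -> max phosphate mmol/L.
# The published curve takes other inputs as well; this is a simplification.
SolubilityCurve = Callable[[float, float], float]

SAFETY_MARGIN = 0.10  # the 10% margin stated in Section 3.1


@dataclass(frozen=True)
class CaPResult:
    passed: bool
    limit_mmol_l: float    # curve limit after the margin is applied
    ordered_mmol_l: float  # phosphate requested in the order


def check_ca_p(
    calcium_meq_l: float,
    phosphate_mmol_l: float,
    amino_acid_pct: float,
    curve: SolubilityCurve,
    margin: float = SAFETY_MARGIN,
) -> CaPResult:
    """Pass only if the ordered phosphate stays below the margin-reduced limit."""
    raw_limit = curve(calcium_meq_l, amino_acid_pct)
    limit = raw_limit * (1.0 - margin)
    return CaPResult(
        passed=phosphate_mmol_l <= limit,
        limit_mmol_l=limit,
        ordered_mmol_l=phosphate_mmol_l,
    )


def placeholder_curve(calcium_meq_l: float, amino_acid_pct: float) -> float:
    """NOT the Anderson equation: an invented line used only to run the demo."""
    return max(0.0, 40.0 - 2.0 * calcium_meq_l + 1.0 * amino_acid_pct)


if __name__ == "__main__":
    # Ordered phosphate is under the raw placeholder limit but inside the margin.
    print(check_ca_p(10.0, 22.0, 4.0, placeholder_curve))
```

Passing the curve in as a callable keeps the gate testable and lets an institution substitute a locally calibrated parameterisation without touching the pass/fail logic.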
2.3 Trace-Element Compatibility
Boullata et al.'s 2022 review[3] catalogues trace-element incompatibilities — iron phosphate precipitation, copper-cysteinate complexing, selenite reduction by reducing sugars, lipid-emulsion destabilisation under high-electrolyte loads. These are codifiable rules; Safeguard encodes them as deterministic constraints checked alongside the solubility curve.
2.4 Neonatal Considerations
Preterm infants have substantially different TPN requirements: higher Ca/PO4 needs in small fluid volumes, narrower osmolarity tolerance, specific lipid-emulsion selection criteria. Robinson et al.[7] provide the current ASPEN neonatal guideline; Safeguard's neonatal mode swaps the relevant thresholds rather than applying adult limits.
2.5 Why LLMs Alone Are Insufficient
Sicard et al.[11] (2025) benchmarked frontier LLMs on drug-drug interaction detection from prescription pairs and reported inconsistent performance: some interactions caught reliably, others missed, with the failure pattern unrelated to clinical severity. For a workflow with documented fatalities, this inconsistency is disqualifying as a sole defence. Safeguard's architecture is deliberately constructed so that the LLM cannot release an order the deterministic engine has flagged. The LLM can raise additional concerns (soft ASPEN guideline checks, narrative review) but cannot override hard gates. Singhal et al.[10] establish that LLMs are competent on medical reasoning when bounded by structure, which is exactly the regime we use.
§ 3 Proposed Approach
3.1 Two-Layer Architecture
A TPN order arrives as a structured input: macronutrients (dextrose %, amino-acid %, lipid g/kg/day), electrolytes (mEq/L for Na, K, Ca, Mg, PO4), trace elements, vitamins, total volume, vascular route, patient demographics, and the patient's current relevant labs. The order passes through two sequential layers:
- Layer 1: Deterministic Chemistry Gate. Five hard checks run in parallel:
  - Ca-P solubility via the Anderson curve[5], with a 10% safety margin.
  - Osmolarity against route limits: peripheral PN must not exceed the ASPEN-codified 900 mOsm/L threshold[2]; Dugan et al.'s pediatric study[6] separately recommends staying ≤ 1,000 mOsm/L. There is no upper limit for central administration (warning only above 1,800 mOsm/L).
  - Electrolyte deltas against the patient's current labs (e.g., refusing a potassium bag concentration > 60 mEq/L if serum K is already 5.5 mEq/L).
  - Trace-element compatibility per the Boullata catalogue[3].
  - Neonatal envelope (if patient age < 1 year) per Robinson et al.[7].

  Any failed check produces a BLOCK verdict with structured reason codes, and the order does not proceed to Layer 2.
- Layer 2: LLM Soft Review. If Layer 1 passes, an LLM (Claude Opus 4.7) reviews the order against the broader ASPEN ordering guideline[2] for issues that are guideline-based rather than chemistry-based: order completeness, labelling, monitoring plan, route-of-administration justification. The LLM emits WARN-level findings; it cannot upgrade them to BLOCK.

The architecture directly addresses the failure mode documented by Sicard et al.[11], in which LLMs show inconsistent DDI performance under direct prompting.
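The gating order described above can be illustrated with a short sketch. It is a simplified illustration under stated assumptions rather than Safeguard's actual code: the `Order` and `Finding` types, the reason codes, and the `llm_review` callable are hypothetical, osmolarity is assumed to be computed upstream, and only two of the five Layer 1 checks are shown. The thresholds (900 and 1,800 mOsm/L, potassium above 60 mEq/L with serum K at 5.5) are the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

PERIPHERAL_MAX_MOSM_L = 900.0  # ASPEN peripheral ceiling cited in the text
CENTRAL_WARN_MOSM_L = 1800.0   # central route: warning only, per the text


@dataclass(frozen=True)
class Finding:
    severity: str  # "BLOCK" or "WARN"
    code: str      # hypothetical reason code
    message: str


@dataclass
class Order:  # hypothetical, heavily reduced order model
    route: str                # "peripheral" or "central"
    osmolarity_mosm_l: float  # assumed to be computed upstream
    potassium_meq_l: float    # bag concentration
    serum_k_meq_l: float      # patient's current lab value


def check_osmolarity(order: Order) -> List[Finding]:
    if order.route == "peripheral" and order.osmolarity_mosm_l > PERIPHERAL_MAX_MOSM_L:
        return [Finding("BLOCK", "OSM_PERIPHERAL", "Osmolarity exceeds 900 mOsm/L")]
    if order.route == "central" and order.osmolarity_mosm_l > CENTRAL_WARN_MOSM_L:
        return [Finding("WARN", "OSM_CENTRAL_HIGH", "Osmolarity above 1,800 mOsm/L")]
    return []


def check_potassium(order: Order) -> List[Finding]:
    # Mirrors the single example in the text; not a complete electrolyte rule.
    if order.potassium_meq_l > 60.0 and order.serum_k_meq_l >= 5.5:
        return [Finding("BLOCK", "K_DELTA", "High bag potassium with elevated serum K")]
    return []


LAYER1_CHECKS = [check_osmolarity, check_potassium]


def review(
    order: Order, llm_review: Callable[[Order], List[Finding]]
) -> Tuple[str, List[Finding]]:
    findings = [f for check in LAYER1_CHECKS for f in check(order)]
    if any(f.severity == "BLOCK" for f in findings):
        return "BLOCK", findings  # the LLM is never invoked for a blocked order
    # Layer 2: whatever severity the LLM reports is clamped to WARN.
    findings += [Finding("WARN", f.code, f.message) for f in llm_review(order)]
    return ("WARN" if findings else "OK"), findings


if __name__ == "__main__":
    no_findings = lambda order: []
    print(review(Order("peripheral", 1100.0, 40.0, 4.0), no_findings))
```

Clamping severity in `review` rather than trusting the LLM's own label means the "cannot upgrade to BLOCK" property holds by construction, regardless of how the model is prompted.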
3.2 Medication Normalisation
The patient's active medication list is normalised to RxNorm[9] before interaction checking, so that drug names supplied by the EHR (brand names, salts, dose forms) resolve to canonical clinical drugs. This is a prerequisite for the additive-compatibility checks documented in Boullata's 2022 review[3].
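As a rough illustration of the normalisation step, the sketch below resolves free-text drug strings to a canonical key using an in-memory table. Everything in it is an assumption made for the example: a real deployment would resolve names against RxNorm[9] itself, the `demo:` identifiers are placeholders and not RxCUIs, and the token-stripping rules are deliberately crude.

```python
import re
from typing import Dict, Optional

# Placeholder concept identifiers for the demo. These are NOT real RxCUIs.
SYNONYMS: Dict[str, str] = {
    "potassium chloride": "demo:potassium-chloride",
    "kcl": "demo:potassium-chloride",
    "calcium gluconate": "demo:calcium-gluconate",
}

# Tokens treated as dose forms or units rather than part of the drug name.
STOP_WORDS = {"injection", "solution", "vial", "inj", "iv",
              "mg", "g", "meq", "mmol", "ml", "l"}


def clean_name(raw: str) -> str:
    """Lowercase, then drop strengths, units, and dose-form words."""
    tokens = re.split(r"[^a-z0-9.%]+", raw.lower())
    kept = [
        t for t in tokens
        if t and not any(ch.isdigit() for ch in t) and t not in STOP_WORDS
    ]
    return " ".join(kept)


def normalise(raw: str) -> Optional[str]:
    """Return the canonical key, or None when the name does not resolve."""
    return SYNONYMS.get(clean_name(raw))


if __name__ == "__main__":
    for name in ["KCl 20 mEq/10 mL injection", "Calcium Gluconate 10% vial", "Unknownol"]:
        print(name, "->", normalise(name))
```

Returning `None` for unresolved names, instead of silently dropping them, lets the caller surface an explicit finding; whether Safeguard does this is not stated in the text.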
3.3 Output
Safeguard emits a structured review object — verdict (BLOCK | WARN | OK), list of findings (each with a reason code, severity, citation to the relevant guideline, and recommended remediation), and audit metadata. The output is mappable to a FHIR DetectedIssue resource bundle for downstream EHR integration. Concretely, Safeguard exposes itself as a CDS Hooks 2.0 service[17] on the order-select and order-sign hooks; the JSON-over-HTTPS response returns recommendations as CDS "cards" with severity, suggestion, and override-reasoning fields. This puts Safeguard inside the standard ordering workflow rather than as a separate review step — operationally important per Kawamoto et al.'s CDSS systematic review[19] which reported 68% overall success across CDS trials and 94% success for systems with all four critical features (automatic delivery within workflow, recommendations rather than assessments, point-of-care, and computer-based). The OpenCDS reference implementation[18] (Apache 2-licensed, deployed across all 50 US states at 40,000+ facilities) provides the FHIR-and-CDS-Hooks plumbing as a drop-in.
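To make the output mapping concrete, the sketch below converts findings into a CDS Hooks response body. The card fields used (`summary`, `indicator`, `detail`, `source.label`) and the `info`/`warning`/`critical` indicator values come from the CDS Hooks specification; the shape of the finding dictionary, the severity-to-indicator mapping, and the source label are assumptions made for this example.

```python
import json
from typing import Dict, List

# Assumed mapping from Safeguard severities to CDS Hooks indicators.
INDICATOR = {"BLOCK": "critical", "WARN": "warning"}


def to_card(finding: Dict[str, str]) -> Dict[str, object]:
    """Build one CDS Hooks card from a hypothetical finding dictionary."""
    return {
        # The specification asks for a summary shorter than 140 characters.
        "summary": finding["message"][:139],
        "indicator": INDICATOR[finding["severity"]],
        "detail": (
            f"Reason code: {finding['code']}. "
            f"Guideline: {finding['citation']}. "
            f"Remediation: {finding['remediation']}"
        ),
        "source": {"label": "Safeguard TPN validator"},  # hypothetical label
    }


def to_response(findings: List[Dict[str, str]]) -> Dict[str, object]:
    """An order with no findings yields an empty card list."""
    return {"cards": [to_card(f) for f in findings]}


if __name__ == "__main__":
    example = {
        "severity": "BLOCK",
        "code": "OSM_PERIPHERAL",
        "message": "Osmolarity exceeds the peripheral limit",
        "citation": "ASPEN ordering guideline [2]",
        "remediation": "Reduce osmolarity or switch to a central route",
    }
    print(json.dumps(to_response([example]), indent=2))
```

The override-reasoning fields mentioned above are omitted here for brevity.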
§ 4 Evaluation Protocol
4.1 Evaluation Corpus
A synthetic corpus of 200 orders is constructed in two strata:
- Unsafe stratum (100 orders) seeded with documented adverse-event patterns: 30 orders exceeding the Anderson[5] Ca-P solubility curve; 20 with osmolarity violating the ASPEN-codified 900 mOsm/L peripheral limit[2]; 20 with electrolyte deltas inappropriate for the simulated patient labs; 15 with trace-element incompatibilities[3]; 15 neonatal orders outside the Robinson[7] envelope.
- Safe stratum (100 orders) drawn from realistic clinical scenarios: ICU adult, ambulatory home-PN, pediatric, neonatal, cardiac post-op, oncology.
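The corpus composition can be written down as a manifest so the stratum sizes are checked rather than assumed. This is a sketch only: the failure-mode labels and the identifier scheme are hypothetical, and it generates bookkeeping records, not the synthetic orders themselves. The counts are those listed above.

```python
from collections import Counter
from typing import Dict, List, Optional

# Seeded failure modes and their counts, as listed in Section 4.1.
UNSAFE_STRATUM: Dict[str, int] = {
    "CA_P_SOLUBILITY": 30,
    "OSMOLARITY_PERIPHERAL": 20,
    "ELECTROLYTE_DELTA": 20,
    "TRACE_ELEMENT": 15,
    "NEONATAL_ENVELOPE": 15,
}
SAFE_STRATUM_SIZE = 100


def build_manifest() -> List[Dict[str, Optional[str]]]:
    """One record per planned order: id, stratum, and seeded failure mode."""
    manifest: List[Dict[str, Optional[str]]] = []
    for mode, count in UNSAFE_STRATUM.items():
        for i in range(count):
            manifest.append({
                "id": f"unsafe-{mode.lower()}-{i:03d}",
                "stratum": "unsafe",
                "seeded_mode": mode,
            })
    for i in range(SAFE_STRATUM_SIZE):
        manifest.append({"id": f"safe-{i:03d}", "stratum": "safe", "seeded_mode": None})
    return manifest


if __name__ == "__main__":
    manifest = build_manifest()
    print(len(manifest), Counter(m["stratum"] for m in manifest))
```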
4.2 Metrics
| Metric | Definition | Target |
|---|---|---|
| Catch rate | Fraction of unsafe-stratum orders that produce BLOCK. | ≥ 0.95 |
| False-positive rate | Fraction of safe-stratum orders that produce BLOCK. | < 0.05 |
| Reason-code accuracy | Fraction of catches whose flagged reason matches the seeded failure mode. | ≥ 0.90 |
| Layer-2 WARN precision | Fraction of WARN findings on safe orders judged clinically meaningful by an audit reviewer. | ≥ 0.70 |
§ 5 Expected Contributions
- Open implementation. A working, MIT-licensed TPN safety validator — to our knowledge the first such tool in the public domain at production quality.
- Rule set. Codified ASPEN ordering[2], compatibility[3], and neonatal[7] guidelines as executable safety checks, machine-readable and reusable.
- Architecture demonstration. A concrete example of the bounded-LLM pattern: hard gates deterministic, soft reasoning LLM-driven. Generalisable to other high-stakes clinical workflows.
§ 6 Limitations and Risks
Safeguard validates orders against published guidelines; it does not validate the guidelines themselves. The Anderson curve[5] has documented sensitivity to amino-acid product brand and dextrose concentration, and our implementation uses the published parameterisation — institution-specific deviations require local calibration. The synthetic evaluation corpus, even when seeded from documented adverse events, cannot capture the full distribution of real ordering errors. A v0.2 effort should extend evaluation to a retrospective audit of historical orders at a partnering institution under IRB.
A separate concern: Safeguard is a prototype, not a regulated medical device. Any clinical deployment requires institution-specific QA, integration with the EHR's signed-order workflow, and regulatory pathway analysis. The MIT-licensed release explicitly states this in the README.
§ 7 Conclusion
Safeguard fills a niche that has been technically tractable since the FDA's 1994 precipitation alert[4] but has not been built in the open. The codified ASPEN rule set, the parameterised solubility curve, the RxNorm-normalised drug-interaction layer — every piece exists in the literature; only the integration is new. The bounded-LLM architecture, which keeps hard gates deterministic and confines the LLM to soft guideline review, is the design pattern healthcare AI needs more broadly: structure where structure exists, language where it does not.
References
- Ayers P, Adams S, Boullata J, et al. A.S.P.E.N. Parenteral Nutrition Safety Consensus Recommendations. JPEN J Parenter Enteral Nutr, 38(3):296–333, 2014. pubmed.ncbi.nlm.nih.gov/24280129
- Boullata JI, Gilbert K, Sacks G, et al. A.S.P.E.N. Clinical Guidelines: Parenteral Nutrition Ordering, Order Review, Compounding, Labeling, and Dispensing. JPEN, 38(3):334–377, 2014. doi/10.1177/0148607114521833
- Boullata JI, et al. Parenteral Nutrition Compatibility and Stability: A Comprehensive Review. JPEN, 46(2):273–299, 2022. doi/abs/10.1002/jpen.2306
- McKinnon BT (republishing the 1994 FDA alert). FDA Safety Alert: Hazards of Precipitation Associated With Parenteral Nutrition. Nutr Clin Pract, 11(2):59–65, 1996. pubmed.ncbi.nlm.nih.gov/8788339
- Anderson C, et al. Calcium and Phosphate Solubility Curve Equation for Determining Precipitation Limits in Compounding Parenteral Nutrition. Hosp Pharm, 57(6):698–705, 2022. pmc.ncbi.nlm.nih.gov/articles/PMC9631008
- Dugan S, et al. Maximum Tolerated Osmolarity for Peripheral Administration of Parenteral Nutrition in Pediatric Patients. JPEN, 38(7):847–851, 2014. doi/10.1177/0148607113495569
- Robinson DT, et al. Guidelines for Parenteral Nutrition in Preterm Infants: The American Society for Parenteral and Enteral Nutrition. JPEN, 47(7):830–858, 2023. pubmed.ncbi.nlm.nih.gov/37610837
- Kaushal R, Shojania KG, Bates DW. Effects of Computerized Physician Order Entry and Clinical Decision Support Systems on Medication Safety: A Systematic Review. Arch Intern Med, 163(12):1409–1416, 2003. jamanetwork.com/.../fullarticle/215756
- Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R. Normalized Names for Clinical Drugs: RxNorm at 6 Years. J Am Med Inform Assoc, 18(4):441–448, 2011. academic.oup.com/jamia/article/18/4/441/734170
- Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge. Nature, 620:172–180, 2023. nature.com/articles/s41586-023-06291-2
- Sicard J, et al. Can Large Language Models Detect Drug-Drug Interactions Leading to Adverse Drug Reactions? Ther Adv Drug Saf, 16, 2025. pmc.ncbi.nlm.nih.gov/articles/PMC12084699
- Sacks GS, Rough S, Kudsk KA. Frequency and severity of harm of medication errors related to the parenteral nutrition process in a large university teaching hospital. Pharmacotherapy 29(8):966–974, 2009. 15.6 medication errors per 1,000 PN prescriptions; 1% prescription, 39% transcription, 24% preparation, 35% administration; 8% caused temporary patient harm. pubmed.ncbi.nlm.nih.gov/19637950
- ASPEN. Parenteral Nutrition Adverse Event and Error Reporting Program. ASPEN/ISMP collaboration. Nearly two-thirds of respondents observe 1–5 PN-related errors per month; 71% involve PN electrolytes. nutritioncare.org/.../Parenteral_Nutrition_Adverse_Event_and_Error_Reporting_Program
- MacKay M, Anderson C, Boehme S, et al. Practice-Based Validation of Calcium and Phosphorus Solubility Limits for Pediatric Parenteral Nutrition Solutions. Nutr Clin Pract 26(6):708–713, 2011. aspenjournals.onlinelibrary.wiley.com/doi/10.1177/0884533611426435
- Large language model as clinical decision support system augments medication safety in 16 clinical specialties. Cell Reports Medicine, 2025. Co-pilot mode (pharmacist + LLM-CDSS) increased mean drug-related problem identification accuracy by 32.6% over pharmacist alone; 1.5× accuracy improvement for serious-harm errors. pmc.ncbi.nlm.nih.gov/articles/PMC12629785
- Rx-LLM Benchmark Suite. medRxiv preprint, 2025. Benchmarks LLMs on drug formulation matching, order generation, route matching, DDI detection, renal dosing, indication matching. medrxiv.org/content/10.64898/2025.12.01.25341004v2.full
- HL7. CDS Hooks 2.0 Specification. Defines seven hooks; order-select and order-sign are the integration points for a TPN safety service. JSON over HTTPS with JWT auth. cds-hooks.hl7.org/2.0
- OpenCDS Consortium (University of Utah, et al.). OpenCDS architecture and deployment. Apache-2-licensed open-source CDS engine deployed across all 50 US states at 40,000+ healthcare facilities; supports CDS Hooks via FHIR as a drop-in deployment path. opencds.org
- Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 330:765, 2005. CDS systems with all four critical features (automatic delivery, recommendations, point-of-care, computer-based) succeeded in 94% of trials; 68% overall. pubmed.ncbi.nlm.nih.gov/15767266
© Takeoff AI