Paper 11 / 15 · Preliminary Manuscript · v0.1 · May 2026
Dossier №01 · Project 11 · Triagemind

Triagemind: A Multi-Agent Emergency Department Triage System with Calibrated Uncertainty and Red-Flag Surfacing

A perception–reasoning–handoff agent triple that produces an ESI level, an explicit confidence score, and a structured handoff — pinned against the published 32.2% ED mistriage rate.

Abstract

Sax et al. (JAMA Network Open, 2023) audited 5.3 million U.S. ED encounters under the Emergency Severity Index v4 and found a 32.2% mistriage rate — 3.3% under-triage and 28.9% over-triage; ESI sensitivity for high-acuity illness was only 65.9%[2]. Machine-learning approaches reduce error: Levin et al. reported AUC 0.73–0.92 across outcomes versus ESI on 172,726 visits[3], and Hong et al. reached AUC 0.87 for hospital-admission prediction at triage[4]. Triagemind applies the multi-agent pattern formalised in MDAgents[9] to the ED-triage problem: a perception agent extracts structured features from the patient's free-text chief complaint and vitals; a reasoning agent predicts ESI level with temperature-scaled[7] calibrated probabilities; a red-flag agent runs deterministic screens (qSOFA[11], BE-FAST[12], atypical-MI[13]) in parallel; a handoff agent produces the structured note the receiving clinician reads. Selective prediction is non-negotiable — Savage et al.[10] demonstrate that LLM uncertainty proxies show modest discrimination (AUC 0.68–0.79) with persistent miscalibration, so abstention is the safety primitive. Pass criterion: under-triage rate ≤ 1.0% (vs ESI's 3.3% baseline) at expected calibration error ≤ 0.05.

§ 1 Introduction

The ED triage decision is among the highest-leverage acts in modern medicine. The Emergency Severity Index v4[1] is the dominant five-level acuity-and-resource framework in U.S. EDs, but its real-world performance is substantially worse than commonly acknowledged. Sax et al.'s 5.3-million-encounter audit[2] establishes the operational baseline: only two of three high-acuity patients are correctly triaged at the front door. The asymmetric cost — an under-triaged sepsis patient versus an over-triaged ankle sprain — makes the under-triage rate the headline metric.

Triagemind is built on three load-bearing observations. First, machine-learning triage has matured: Levin's e-triage at Johns Hopkins[3] and Hong's Yale work[4] demonstrated that models trained on ED arrival data substantially outperform ESI on hard clinical outcomes. Second, calibration is the prerequisite for clinical deployment: Guo et al.[7] showed that modern neural networks are systematically over-confident, and uncalibrated probabilities make selective prediction impossible. Third, the most reliable ED-safety primitives are still deterministic — qSOFA[11], BE-FAST[12], and atypical-MI flags[13] are well-validated rules that cannot be allowed to fall through model uncertainty.

1.1 Contributions

  1. A four-agent ED-triage architecture (perception, reasoning, red-flag, handoff) following the MDAgents[9] adaptive-collaboration pattern.
  2. A calibrated reasoning model with temperature scaling[7] and selective prediction, with explicit abstention thresholds for clinical safety.
  3. An open evaluation harness measuring under-triage, over-triage, expected calibration error (ECE), and red-flag recall — pinned against Sax et al.'s 32.2% mistriage baseline.

§ 2 Background and Related Work

2.1 ESI v4 and Mistriage Evidence

The ESI[1] stratifies on threats to life or limb (levels 1–2) and predicted resource intensity (levels 3–5). Sax et al.[2] ran a retrospective deterministic-rules-against-outcomes audit on 5,315,176 encounters across 21 EDs and found 32.2% mistriage (3.3% under, 28.9% over), with disproportionate under-triage in older adults and Black patients — an equity dimension Triagemind's evaluation must measure. Comparable systems fare similarly: a 2025 Manchester-vs-ESI comparison[5] showed only moderate agreement between systems (kappa 0.51) and substantial distributional differences in how each system stratifies presentations. The Canadian CTAS[6] reports inter-rater kappa from 0.46 (unweighted) to 0.77 (weighted).

2.2 ML for ED Triage

Levin et al.[3] trained e-triage on 172,726 visits using gradient boosting; AUC ranged 0.73–0.92 for critical-care, hospitalisation, and procedure-need outcomes, and the model re-stratified the >65% of patients ESI lumps into level 3 — the dominant clinical pain point. Hong et al.[4] demonstrated that hospital-admission prediction at triage reaches AUC 0.87 with XGBoost on 972 variables, and that adding patient history significantly outperforms triage-time data alone. Both findings inform Triagemind's reasoning-agent feature set.

2.3 Calibration and Selective Prediction

Guo et al.[7] introduced temperature scaling as a single-parameter post-hoc calibration that restores near-perfect calibration on most benchmarks in milliseconds; Platt scaling[8] is its two-parameter sigmoid antecedent. For LLM-based clinical reasoning, Savage et al.[10] demonstrate that calibrated uncertainty is non-optional — common LLM uncertainty proxies show only modest discrimination (AUC 0.68–0.79) with consistent over-confidence on verbalised confidence estimation. Triagemind operationalises selective prediction by gating the agent's ESI output on the calibrated probability of the predicted class.
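Temperature scaling is simple enough to sketch end-to-end. The fragment below is a minimal pure-Python illustration, not Triagemind's implementation: a grid search over T stands in for the gradient-based NLL minimisation Guo et al. describe, and all function names are ours. Note that dividing logits by any T > 0 leaves the argmax — and hence every prediction — unchanged; only the confidence moves.

```python
import math

def softmax(logits, T=1.0):
    """Softmax at temperature T; T > 1 flattens an over-confident distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logits_batch, labels, T):
    """Average negative log-likelihood of the true class at temperature T."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total += -math.log(softmax(logits, T)[y])
    return total / len(labels)

def fit_temperature(logits_batch, labels, grid=None):
    """Single-parameter post-hoc calibration in the spirit of Guo et al.:
    choose the T that minimises held-out NLL (grid search as a stand-in
    for the usual gradient fit)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]   # T in [0.5, 5.0]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))
```

For an over-confident model — large logit margins but imperfect accuracy — the fitted temperature comes out above 1, softening every probability the selective-prediction gate later consumes.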

2.4 Multi-Agent Clinical Reasoning

MDAgents[9] (NeurIPS 2024) demonstrates that adaptive collaboration between specialist LLMs produces up to 4.2% accuracy improvement over single-agent baselines on 7 of 10 medical benchmarks. Triagemind inherits the architectural pattern with a fixed four-agent topology rather than adaptive selection — appropriate for the time-bounded triage decision.

2.5 Validated Red-Flag Screens

Three published screens anchor Triagemind's deterministic safety layer: qSOFA[11] (altered mentation + SBP ≤ 100 + RR ≥ 22; ~45–60% pooled sensitivity for in-hospital mortality in ED meta-analyses), BE-FAST[12] (Balance + Eyes + Face + Arm + Speech + Time; raises stroke detection sensitivity from 85.9% to 95.6%), and the Canto et al. atypical-MI evidence[13] showing MI without chest pain disproportionately affects women and older patients with higher in-hospital mortality.
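Because these screens are deterministic, they can be stated exactly in a few lines. The sketch below encodes the published qSOFA and BE-FAST criteria; the observation schema (`obs` keys) is illustrative, not Triagemind's actual feature contract.

```python
def qsofa_score(altered_mentation, sbp, rr):
    """qSOFA (Sepsis-3): one point each for altered mentation,
    systolic BP <= 100 mmHg, and respiratory rate >= 22/min."""
    return int(altered_mentation) + int(sbp <= 100) + int(rr >= 22)

def befast_positive(balance, eyes, face, arm, speech):
    """BE-FAST: any single positive element flags a possible stroke."""
    return any([balance, eyes, face, arm, speech])

def red_flags(obs):
    """Run the deterministic screens over one structured observation.
    Keys are illustrative, not Triagemind's schema; the atypical-MI
    rule set is omitted here for brevity."""
    flags = []
    if qsofa_score(obs["altered_mentation"], obs["sbp"], obs["rr"]) >= 2:
        flags.append("qSOFA+")
    if befast_positive(obs["balance"], obs["eyes"], obs["face"],
                       obs["arm"], obs["speech"]):
        flags.append("BE-FAST+")
    return flags
```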

§ 3 Proposed Approach

3.1 Four-Agent Architecture

Figure 1 · Triagemind architecture
[Architecture diagram. Patient input (chief complaint, vitals, age; free text) → Agent 1 · Perception (feature extraction: CC → SNOMED, structured vitals, history retrieval) → Agent 2 · Reasoning (ESI prediction: XGBoost + temperature scaling, calibrated P(ESI = k)), in parallel with Agent 3 · Red-flag (deterministic screens: qSOFA, BE-FAST, atypical-MI, NEWS2) → Agent 4 · Handoff (e.g. "ESI: 2 (P = 0.82) · RED FLAG: qSOFA+ · Why: tachypnea + hypotension · CONFIDENCE: HIGH") → clinician accept/override with feedback; if P < threshold, abstain → human triage.]
Figure 1. Triagemind's four-agent ED triage architecture. Agent 1 (Perception) structures the free-text chief complaint and vitals; Agent 2 (Reasoning) predicts the ESI level with temperature-scaled[7] calibrated probabilities trained on the Levin/Hong[3][4] feature pattern; Agent 3 (Red-flag) runs deterministic screens — qSOFA[11], BE-FAST[12], atypical-MI[13] — in parallel; Agent 4 (Handoff) produces the structured note the receiving clinician reads. The selective-prediction gate sends low-confidence cases to a human triage nurse rather than committing an ESI level.

3.2 Selective Prediction Gate

The reasoning agent emits calibrated probabilities for ESI ∈ {1,2,3,4,5}. The selective gate compares the predicted-class probability pmax to a threshold τ chosen on a held-out validation set to enforce the under-triage budget. If pmax < τ, Triagemind abstains and the case is routed to a human triage nurse with the perception output pre-populated.
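The gate and the threshold-selection step can be sketched as follows. This is a minimal illustration: `triage_decision` and `choose_tau` are hypothetical names, the under-triage label is simplified to a boolean high-acuity outcome, and a production threshold search would also bound the abstention rate from Table 1.

```python
def triage_decision(probs, tau):
    """Selective-prediction gate: commit to the argmax ESI level only if
    its calibrated probability clears tau; otherwise abstain and route
    the case to a human triage nurse."""
    esi_levels = [1, 2, 3, 4, 5]
    p_max = max(probs)
    if p_max < tau:
        return ("ABSTAIN", p_max)
    return (esi_levels[probs.index(p_max)], p_max)

def choose_tau(val_probs, val_high_acuity, budget=0.01, grid=None):
    """Pick the smallest tau whose under-triage rate on accepted
    validation cases (ESI >= 3 committed for a high-acuity case)
    stays within budget. Sketch only."""
    grid = grid or [i / 100 for i in range(50, 100)]
    for tau in grid:
        under = accepted = 0
        for probs, high in zip(val_probs, val_high_acuity):
            level, _ = triage_decision(probs, tau)
            if level == "ABSTAIN":
                continue
            accepted += 1
            if high and level >= 3:
                under += 1
        if accepted and under / accepted <= budget:
            return tau
    return grid[-1]
```

The design choice worth noting: tau is fitted once on held-out data to enforce the under-triage budget, then frozen — it is not tuned per shift or per site without recalibration.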

3.3 Red-Flag Override

A positive deterministic screen — qSOFA ≥ 2, BE-FAST positive, or atypical-MI rule fired — forces the ESI prediction to ≤ 2 regardless of the reasoning agent's output. This is the asymmetric-cost design: deterministic safety rules cannot be overruled by model uncertainty.
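The override itself is a one-liner; the sketch below uses a hypothetical helper name, not Triagemind's actual code, and assumes the red-flag agent reports its fired screens as a list.

```python
def final_esi(model_esi, flags):
    """Asymmetric-cost override: any positive deterministic screen
    (e.g. ["qSOFA+"] or ["BE-FAST+"]) forces acuity to ESI <= 2,
    regardless of the reasoning agent's output. A model-assigned
    ESI 1 is never downgraded."""
    if flags:
        return min(model_esi, 2)
    return model_esi
```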

§ 4 Evaluation Protocol

Table 1. Triagemind evaluation metrics.
| Metric | Definition | Target |
|---|---|---|
| Under-triage rate | ESI assigned ≥ 3 when actual outcome was high-acuity | ≤ 1.0% (vs ESI 3.3%[2]) |
| Over-triage rate | ESI assigned ≤ 2 when actual outcome was low-acuity | ≤ 15% (vs ESI 28.9%) |
| ECE | Expected calibration error of predicted ESI distribution | ≤ 0.05 |
| Red-flag recall | Sensitivity for any of qSOFA+, BE-FAST+, atypical-MI | ≥ 0.95 |
| Abstention rate | Fraction of cases below selective-prediction threshold | ≤ 15% |
| Equity audit | Under-triage rate stratified by race, age, language | Δ ≤ 1.0 pp |
Pass criterion. Triagemind v0.1 succeeds if under-triage rate is ≤ 1.0% (a meaningful improvement on Sax et al.'s 3.3% ESI baseline[2]) at ECE ≤ 0.05 and red-flag recall ≥ 0.95 on a 10,000-encounter held-out test set. The equity audit is a hard requirement, not a nice-to-have — Sax found disproportionate under-triage in Black patients, and Triagemind cannot ship if it preserves that gap.
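The ECE target is mechanical to evaluate: bin predictions by confidence and average the absolute gap between mean confidence and empirical accuracy, weighted by bin mass. A minimal sketch with the standard 10-bin scheme (function name ours):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over top-class confidences: sum over bins of
    (bin size / n) * |mean confidence - empirical accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # clamp c == 1.0 into last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece
```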

§ 5 Expected Contributions

  1. System. An open multi-agent ED-triage architecture with selective prediction, calibrated probabilities, and deterministic red-flag overrides.
  2. Methodology. A reproducible evaluation harness pinned against the published 32.2% ESI mistriage baseline[2].
  3. Equity finding. A documented under-triage stratification across race, age, and language — the first such audit for an open ED-triage agent.
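The equity audit reduces to a stratified under-triage computation reported as a maximum between-group gap in percentage points. The sketch below uses an illustrative record schema (field names are ours, not Triagemind's) and, for simplicity, takes all encounters in a group as the denominator, as in the headline Sax et al. rates.

```python
from collections import defaultdict

def under_triage_by_group(records):
    """Per-group under-triage rate (ESI >= 3 assigned to a high-acuity
    case) and the max-minus-min gap in percentage points."""
    totals = defaultdict(lambda: [0, 0])      # group -> [under-triaged, n]
    for r in records:
        g = r["group"]
        totals[g][1] += 1
        if r["high_acuity"] and r["assigned_esi"] >= 3:
            totals[g][0] += 1
    rates = {g: under / n for g, (under, n) in totals.items()}
    gap_pp = (max(rates.values()) - min(rates.values())) * 100
    return rates, gap_pp
```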

§ 6 Limitations and Risks

Triage models are subject to specification gaming: the gold label (admit/critical-care/death) is a downstream proxy for the unobservable "true acuity at front door," and the model can learn shortcuts that correlate with disposition rather than illness severity. Equity audits address this only partially. A v0.2 effort should include prospective evaluation with clinician-confirmed acuity adjudication rather than purely retrospective outcome labels.

A separate risk: a calibrated ML triage system that performs well on aggregate metrics can still under-triage individual rare presentations. The deterministic red-flag layer is the bulwark against this — but the red-flag screens themselves have published sensitivities well under 100% (BE-FAST 95.6% for ischemic stroke detection, qSOFA approximately 45–60% pooled sensitivity for in-hospital mortality). Triagemind is decision support, not autonomous triage.

§ 7 Conclusion

Triagemind targets a clinical problem with a hard, public baseline (32.2% mistriage[2]) using a multi-agent architecture[9] with calibrated uncertainty[7] and deterministic safety overrides. The combination is buildable today on published evidence, and the success criterion — cutting under-triage to less than a third of the ESI baseline while maintaining equity — is the kind of result the field can act on.

References

  1. Gilboy N, Tanabe P, Travers D, Rosenau AM. Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care, Version 4 — Implementation Handbook. AHRQ Publication, 2011/2012.
  2. Sax DR, Warton EM, et al. Evaluation of Version 4 of the Emergency Severity Index in US Emergency Departments for the Rate of Mistriage. JAMA Network Open, 2023. pmc.ncbi.nlm.nih.gov/articles/PMC10024207
  3. Levin S, Toerper M, Hamrock E, et al. Machine-Learning-Based Electronic Triage More Accurately Differentiates Patients With Respect to Clinical Outcomes Compared With the Emergency Severity Index. Annals of Emergency Medicine, 2018. pubmed.ncbi.nlm.nih.gov/28888332
  4. Hong WS, Haimovich AD, Taylor RA. Predicting hospital admission at emergency department triage using machine learning. PLOS ONE, 2018. journals.plos.org/plosone/article?id=10.1371/journal.pone.0201016
  5. Comparative evaluation of the Manchester Triage System and Emergency Severity Index in predicting critical events in the ED. BMC Emergency Medicine, 2025. link.springer.com/article/10.1186/s12873-025-01420-8
  6. Bullard MJ, Musgrave E, Warren D, et al. Revisions to the Canadian Emergency Department Triage and Acuity Scale (CTAS) Guidelines 2016. CJEM, 2017. cambridge.org/core/...E2CB3E2063C54E11259313FA4FEAE495
  7. Guo C, Pleiss G, Sun Y, Weinberger KQ. On Calibration of Modern Neural Networks. ICML, 2017. arxiv.org/abs/1706.04599
  8. Platt J. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, 1999. en.wikipedia.org/wiki/Platt_scaling
  9. Kim Y, et al. MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making. NeurIPS, 2024. neurips.cc/virtual/2024/poster/96041
  10. Savage T, Wang J, et al. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. JAMIA, 2025. pmc.ncbi.nlm.nih.gov/articles/PMC11648734
  11. Seymour CW, Liu VX, et al. Assessment of Clinical Criteria for Sepsis (qSOFA) — Third International Consensus Definitions (Sepsis-3). JAMA, 2016. pubmed.ncbi.nlm.nih.gov/26903335
  12. Aroor S, Singh R, Goldstein LB. BE-FAST (Balance, Eyes, Face, Arm, Speech, Time): Reducing the Proportion of Strokes Missed Using the FAST Mnemonic. Stroke 48(2):479–481, 2017. ahajournals.org/doi/10.1161/STROKEAHA.116.015169
  13. Canto JG et al. Symptom Presentation in Women and Men With Acute Coronary Syndromes — systematic review. Journal of the American Heart Association, 2020. ahajournals.org/doi/10.1161/JAHA.119.015539