Calline: A Voice-First After-Hours Nurse Triage Agent with Uncertainty-Gated Escalation
Real-time streaming ASR, a gpt-realtime conversational core, and sub-second TTS: deflecting safe cases, routing emergencies, and handing off to nurses on uncertainty.
Abstract
Nurse-led telephone triage is safe in approximately 97% of routine after-hours contacts but only 89% of high-urgency contacts[2]; in expert review of missed acute coronary syndrome calls, 73.3% of cases were rated unsafe[3]. Symptom-checker apps fare worse: the BMJ 2015 audit of 23 symptom checkers across 45 vignettes found the correct diagnosis listed first in only 34% of cases and appropriate triage in 57%[5]. Calline is a voice-first after-hours triage agent whose explicit design constraint is to inherit the nurse-line safety profile, not the symptom-checker one. The architecture composes streaming Whisper-based ASR with local-agreement streaming (3.3 s latency[7]), Silero VAD for endpointing[8], OpenAI's gpt-realtime as the conversational core (82.8% on Big Bench Audio[9]), and ElevenLabs Flash v2.5 TTS (~75 ms time-to-first-byte[10]). The agent has three terminal dispositions: home self-care, route-to-nurse, and route-to-911. Pass criterion: 60% safe deflection with zero ED revisits within 24 hours on a held-out call set.
§ 1 Introduction
After-hours nurse triage lines are an established mode of primary-care delivery. The Lattimer et al. BMJ 1998 RCT of 14,492 calls established that nurse triage produced equivalent rates of death and emergency admission versus GP-only management while reducing GP workload by approximately 50%[1]. The subsequent Huibers systematic review[2] sharpened the picture: triage is safe in 97% of routine contacts but only 89% of high-urgency contacts, and only 46% in high-risk simulated patient studies. The failure mode is asymmetric: in expert review, 73.3% of missed acute coronary syndrome calls were rated unsafe, versus 22.5% of matched controls[3].
This is the design constraint Calline accepts. The goal is not to replace nurses but to deflect routine calls safely while routing every high-risk presentation to a human nurse or to 911, using the uncertainty-gating logic that nurse-line safety depends on. The opposite design, a symptom checker that triages everyone, is what produces the BMJ 2015 numbers[5]: 34% top diagnosis and 57% appropriate triage, with triage accuracy ranging from 80% on emergent vignettes down to 33% on self-care vignettes.
1.1 Contributions
- A reproducible voice-first triage architecture composed of named open or commercial components, each with published latency and accuracy characteristics.
- An explicit three-disposition gate (deflect, escalate to nurse, route to 911) with uncertainty-based routing.
- An evaluation harness that measures deflection rate, escalation accuracy, and ED-revisit-within-24h — the safety metric the literature treats as the binding constraint.
§ 2 Background and Related Work
2.1 Nurse Triage Line Safety
The empirical case for nurse triage is well-established but bounded. Lattimer et al.[1] showed equivalent mortality and emergency admission rates vs GP-only at half the GP workload. Wheeler et al.[4] reviewed three decades of triage-line evidence and reported ~95% appropriate disposition rates with decision-support tools — but Huibers[2] documents the 89% / 46% safety drop on urgent and simulated high-risk cases respectively. Erkelens et al.'s ACS case-control study[3] is the load-bearing evidence that the failure mode is concentrated in time-critical presentations the model must escalate, not deflect.
2.2 Symptom Checkers and Their Limits
Semigran et al.'s BMJ 2015 audit[5] is the canonical evidence that pure rule-based symptom checkers under-perform clinicians: 34% top diagnosis, 57% triage appropriateness. The more recent Hammoud et al. JMIR AI 2024 study[6] shows next-generation symptom checkers (Avey) can match primary-care physicians on diagnostic accuracy across 400 peer-reviewed vignettes — a substantial improvement but still operating in a vignette regime, not telephone audio.
2.3 Voice Agent Technical Stack
Calline's voice stack is built on named components with documented characteristics. Macháček et al.'s "Whisper Streaming"[7] achieves 3.3 s end-to-end latency on long-form audio using a local-agreement streaming policy. Silero VAD[8] reaches 87.7% true-positive rate at 5% false-positive rate with sub-millisecond per-chunk inference. OpenAI's gpt-realtime[9] scores 82.8% on Big Bench Audio (a +17.2-point gain over the prior 65.6% baseline) and 30.5% on MultiChallenge. ElevenLabs Flash v2.5[10] reports ~75 ms time-to-first-byte for streaming TTS.
2.4 Out-of-Scope Detection and Escalation
Castillo-López et al.[11] demonstrate that a hybrid approach, in which BERT classifier confidence gates routing to an LLM, improves out-of-scope F1 by approximately 5 points over fine-tuned BERT baselines on multi-party dialogue. Calline uses the same hybrid pattern: a lightweight classifier on the partial transcript triggers escalation when the call topic deviates from triage scope (e.g., suicidal ideation, mental health crisis, requests for prescription refills).
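The confidence-gated pattern described above can be sketched as follows. This is a minimal illustration, not the method of [11] or Calline's implementation: `route_scope`, the 0.9 threshold, and the two stub models (`toy_classifier`, `toy_llm`) are all hypothetical stand-ins for a fine-tuned classifier and an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ScopeDecision:
    label: str   # "in_scope" or "out_of_scope"
    source: str  # which stage decided: "classifier" or "llm"


def route_scope(
    transcript: str,
    classify: Callable[[str], Tuple[str, float]],
    ask_llm: Callable[[str], str],
    confidence_threshold: float = 0.9,  # assumed value, to be tuned on held-out calls
) -> ScopeDecision:
    """Trust the cheap classifier when it is confident; otherwise defer to the LLM."""
    label, confidence = classify(transcript)
    if confidence >= confidence_threshold:
        return ScopeDecision(label, "classifier")
    return ScopeDecision(ask_llm(transcript), "llm")


# Hypothetical stubs standing in for real models.
def toy_classifier(text: str) -> Tuple[str, float]:
    lowered = text.lower()
    if "refill" in lowered:
        return "out_of_scope", 0.97
    if "fever" in lowered:
        return "in_scope", 0.95
    return "in_scope", 0.55  # unsure: below the threshold, so the LLM decides


def toy_llm(text: str) -> str:
    return "out_of_scope" if "hopeless" in text.lower() else "in_scope"


refill = route_scope("I need a refill of my inhaler", toy_classifier, toy_llm)
crisis = route_scope("I feel hopeless tonight", toy_classifier, toy_llm)
fever = route_scope("My son has a fever", toy_classifier, toy_llm)
```

In this sketch the LLM is consulted only for the low-confidence utterance, which keeps the expensive call off the common path.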
2.5 Regulatory and Billing Context
HHS OCR[12] clarifies that the HIPAA Security Rule applies to VoIP / cellular / SIP audio (Twilio-class infrastructure) but not to traditional analog landlines. Because Calline runs on this class of infrastructure, its architecture requires a business associate agreement (BAA) and encryption. The CMS billing landscape changed materially in January 2025: CPT 99441–99443 were deleted[13] and replaced with new audio-only E/M codes (98008–98015), which Medicare does not currently adopt; current billing uses 99202–99215 with modifier 93 or FQ. The NCSBN position paper[14] establishes that telephone triage is the practice of nursing in all 50 U.S. jurisdictions, with licensure required in the state where the patient is located.
§ 3 Proposed Approach
3.1 Voice Pipeline
3.2 Disposition Gate
Three terminal outcomes are evaluated continuously throughout the call:
- Deflect. Home self-care; the agent provides closed-loop instructions and offers a scheduled callback.
- Escalate. A human nurse takes over, with the conversation summary pre-populated.
- 911. A deterministic red flag has fired; the agent instructs the caller to hang up and dial 911 immediately, with an offer to conference the call.
The deflect path requires both high confidence and absence of red flags; escalate is the default on uncertainty; 911 is the default on any red-flag positive.
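The precedence among the three dispositions can be written as a small pure function. This is a sketch under assumptions: the name `choose_disposition`, the scalar `confidence` input, and the 0.9 threshold are illustrative and not specified in the text above.

```python
from enum import Enum


class Disposition(Enum):
    DEFLECT = "deflect"
    ESCALATE = "escalate"
    CALL_911 = "911"


def choose_disposition(
    red_flag: bool,
    confidence: float,
    deflect_threshold: float = 0.9,  # assumed value; the text does not fix one
) -> Disposition:
    """Apply the gate in order: red flag first, then confidence, else escalate."""
    if red_flag:
        # Any red-flag positive routes to 911, regardless of confidence.
        return Disposition.CALL_911
    if confidence >= deflect_threshold:
        # Deflect only with high confidence AND no red flag.
        return Disposition.DEFLECT
    # Default on uncertainty. A NaN confidence also lands here, because
    # the comparison above is False for NaN.
    return Disposition.ESCALATE
```

Writing the deflect test as a positive condition means a missing or malformed confidence value falls through to the nurse rather than to self-care.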
§ 4 Evaluation Protocol
| Metric | Definition | Target |
|---|---|---|
| Deflection rate | Fraction of calls terminated in self-care disposition | ≥ 60% |
| 24h ED-revisit rate | Deflected callers presenting to ED within 24 hours | 0 (hard gate) |
| Red-flag recall | Sensitivity for deterministic red-flag presentations (ACS, stroke, sepsis, anaphylaxis) | ≥ 0.95 |
| Latency | End-to-end turn-around (caller stops → agent starts) | < 2.0 s p95 |
| Out-of-scope catch | Mental health crisis / scope-of-practice cases routed to human | ≥ 0.95 |
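As a rough illustration of how three of the table's metrics might be computed from labeled call records, consider the sketch below. The record schema (`CallRecord`), the `evaluate` function, and the choice to count a red-flag case as caught only when it is routed to 911 are assumptions made for this example; latency and out-of-scope catch are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CallRecord:
    disposition: str           # "deflect", "escalate", or "911"
    is_red_flag: bool          # ground-truth label from clinical review
    ed_visit_within_24h: bool  # follow-up outcome


@dataclass(frozen=True)
class Report:
    deflection_rate: float
    ed_revisits: int
    red_flag_recall: Optional[float]  # None when the set has no red-flag cases
    passed: bool


def evaluate(calls: Iterable[CallRecord]) -> Report:
    records = list(calls)
    if not records:
        raise ValueError("need at least one call")
    deflected = [c for c in records if c.disposition == "deflect"]
    red_flags = [c for c in records if c.is_red_flag]

    deflection_rate = len(deflected) / len(records)
    ed_revisits = sum(c.ed_visit_within_24h for c in deflected)
    recall = (
        sum(c.disposition == "911" for c in red_flags) / len(red_flags)
        if red_flags
        else None
    )
    # Targets from the table; a set with no red-flag cases cannot pass,
    # since it cannot demonstrate recall (an assumption of this sketch).
    passed = (
        deflection_rate >= 0.60
        and ed_revisits == 0
        and recall is not None
        and recall >= 0.95
    )
    return Report(deflection_rate, ed_revisits, recall, passed)


# Synthetic example data, not results.
example_calls = (
    [CallRecord("deflect", False, False)] * 6
    + [CallRecord("escalate", False, False)] * 2
    + [CallRecord("911", True, False)] * 2
)
report = evaluate(example_calls)
with_revisit = evaluate(example_calls + [CallRecord("deflect", False, True)])
```

On the synthetic set, `report.passed` is true; adding a single deflected caller who presents to the ED within 24 hours flips the hard gate and `with_revisit.passed` is false.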
§ 5 Expected Contributions
- System. An open voice-first triage architecture with named components and documented latency.
- Evaluation. A reproducible 500-call evaluation harness pinned to the Huibers safety profile[2].
- Operating envelope. Documentation of the deflection-vs-safety tradeoff curve and the abstention rate required to hit zero ED revisits.
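One piece of arithmetic is useful when reading the zero-revisit gate: observing zero events in a finite sample bounds the true rate but does not show it to be zero. The sketch below computes the exact one-sided upper confidence bound for zero events in n independent trials, from (1 - p)^n ≥ α. It assumes independent calls and, for the example, assumes that a 500-call set meeting the 60% deflection target yields 300 deflected calls; the function name is illustrative.

```python
def zero_event_upper_bound(n_deflected: int, confidence: float = 0.95) -> float:
    """Upper bound on the true event rate after observing 0 events in n trials.

    Solves (1 - p) ** n = alpha for p, where alpha = 1 - confidence.
    """
    if n_deflected <= 0:
        raise ValueError("n_deflected must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_deflected)


# Assumed scenario: 500 calls at 60% deflection gives 300 deflected calls.
bound = zero_event_upper_bound(300)
print(f"95% upper bound on 24h ED-revisit rate: {bound:.4f}")
```

Under these assumptions the bound is just under 1%, close to the familiar "rule of three" approximation 3/n.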
§ 6 Limitations and Risks
Voice-AI triage carries category-specific risks the literature already documents. Asymmetric failure costs (a missed sepsis call vs a routine deflection): Calline addresses this with the deterministic red-flag layer and zero-revisit gate, but the underlying red-flag screens still have published sensitivity under 100%. Equity and accent bias in ASR: Whisper's WER varies substantially by accent and demographic; Calline must measure WER by demographic stratum on a representative call set. Billing volatility: the January 2025 CPT changes[13] demonstrate that the regulatory environment changes faster than the technology, and any deployed Calline instance needs a quarterly billing review.
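The per-stratum WER measurement mentioned above could be computed along these lines. This is a minimal sketch: the stratum labels, the sample transcripts, and the simple lowercase whitespace tokenization are assumptions for illustration, and a production harness would need a proper text normalizer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,                           # deletion
                    cur[j - 1] + 1,                        # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def wer_by_stratum(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """samples: (stratum, reference transcript, ASR hypothesis) triples."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for stratum, reference, hypothesis in samples:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors[stratum] += word_edit_distance(ref_words, hyp_words)
        words[stratum] += len(ref_words)
    # Pool errors over all reference words in a stratum, not per utterance.
    return {s: errors[s] / words[s] for s in words if words[s] > 0}


# Hypothetical strata and transcripts, for illustration only.
samples = [
    ("stratum_a", "my chest hurts", "my chest hurts"),
    ("stratum_b", "my chest hurts", "my chess hurt"),
]
rates = wer_by_stratum(samples)
```

Pooling errors over all reference words in a stratum, rather than averaging per-utterance WER, keeps short utterances from dominating the estimate.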
§ 7 Conclusion
Calline targets the gap between two well-documented modes of after-hours care: nurse triage lines (safe, expensive, capacity-bounded) and symptom-checker apps (cheap, capacity-unbounded, under-performing on triage[5]). The technical components needed to bridge that gap exist today with documented latency and accuracy[7][8][9][10]; the missing piece is the safety-pinned evaluation harness. Calline provides it.
References
[1] Lattimer V, George S, Thompson F, et al. Safety and effectiveness of nurse telephone consultation in out of hours primary care: randomised controlled trial. BMJ 317(7165):1054–9, 1998. pubmed.ncbi.nlm.nih.gov/9774295
[2] Huibers L, Smits M, Renaud V, Giesen P, Wensing M. Safety of telephone triage in out-of-hours care: a systematic review. Scandinavian Journal of Primary Health Care 29(4):198–209, 2011. pmc.ncbi.nlm.nih.gov/articles/PMC3308461
[3] Erkelens DC, Rutten FH, Wouters LT, et al. Missed Acute Coronary Syndrome During Telephone Triage at Out-of-Hours Primary Care: Lessons From A Case-Control Study. J Patient Saf 18(1):40–45, 2022. pmc.ncbi.nlm.nih.gov/articles/PMC8719497
[4] Wheeler SQ, Greenberg ME, Mahlmeister L, Wolfe N. Safety of clinical and non-clinical decision makers in telephone triage: a narrative review. J Telemed Telecare 21(6):305–22, 2015. pubmed.ncbi.nlm.nih.gov/26026188
[5] Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ 351:h3480, 2015. pubmed.ncbi.nlm.nih.gov/26157077
[6] Hammoud M, Douglas S, Darmach M, et al. Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study. JMIR AI 3:e46875, 2024. ai.jmir.org/2024/1/e46875
[7] Macháček D, Dabre R, Bojar O. Turning Whisper into Real-Time Transcription System. IWSLT 2023 / arXiv:2307.14743. arxiv.org/abs/2307.14743
[8] Silero Team. Silero VAD: pre-trained enterprise-grade Voice Activity Detector. Open-source release, 2024. github.com/snakers4/silero-vad
[9] OpenAI. Introducing gpt-realtime and Realtime API updates for production voice agents. August 2025. openai.com/index/introducing-gpt-realtime
[10] ElevenLabs. Models / Understanding latency (Flash v2.5). Documentation. elevenlabs.io/docs/eleven-api/concepts/latency
[11] Castillo-López G, de Chalendar G, Semmar N. Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations. arXiv:2507.22289, 2025. Reports an approximately 5-point F1 improvement on out-of-scope detection over fine-tuned BERT baselines using a hybrid confidence-gated routing approach. arxiv.org/abs/2507.22289
[12] HHS Office for Civil Rights. Guidance on Audio-Only Telehealth under HIPAA. June 2022. hhs.gov/hipaa/.../hipaa-audio-telehealth
[13] AMA / CPT. 2025 Telemedicine Code Set — deletion of 99441–99443 and creation of 98008–98015. Effective January 2025. ama-assn.org/practice-management/cpt/how-ama-meets-need-new-telehealth-cpt-codes
[14] NCSBN. Position Paper on Telehealth Nursing Practice. National Council of State Boards of Nursing. ncsbn.org/public-files/14_Telehealth.pdf
© Takeoff AI