Dossier №01 · Project 12 · Calline

Calline: A Voice-First After-Hours Nurse Triage Agent with Uncertainty-Gated Escalation

Real-time streaming ASR, a gpt-realtime conversational core, and sub-second TTS: deflecting safe cases, routing emergencies, and handing off to nurses on uncertainty.

Chandra Takeoff AI · Healthcare AI Engineering Compiled · May 2026

Abstract. Nurse-led telephone triage is safe in approximately 97% of routine after-hours contacts but only 89% of high-urgency contacts[2]; in expert review, 73.3% of missed acute coronary syndrome calls were rated unsafe[3]. Symptom-checker apps fare worse: the BMJ 2015 audit of 23 symptom checkers across 45 vignettes found the correct diagnosis listed first in only 34% of evaluations and appropriate triage in 57%[5]. Calline is a voice-first after-hours triage agent whose explicit design constraint is to inherit the nurse-line safety profile, not the symptom-checker one. The architecture composes streaming Whisper-based ASR under a local-agreement policy (3.3 s latency[7]), Silero VAD for endpointing[8], OpenAI's gpt-realtime as the conversational core (82.8% on Big Bench Audio[9]), and ElevenLabs Flash v2.5 TTS (~75 ms time-to-first-byte[10]). The agent has three terminal dispositions: home self-care, route-to-nurse, and route-to-911. The pass criterion is at least 60% safe deflection with zero ED revisits within 24 hours among deflected callers on a held-out call set.

§ 1 Introduction

After-hours nurse triage lines are an established mode of primary-care delivery. The Lattimer et al. BMJ 1998 RCT of 14,492 calls established that nurse triage produced equivalent rates of death and emergency admission versus GP-only management while reducing GP workload by approximately 50%[1]. The subsequent Huibers systematic review[2] sharpened the picture: triage is safe in 97% of routine contacts but in only 89% of high-urgency contacts, and in only 46% of contacts in high-risk simulated-patient studies. The failure mode is asymmetric: in expert review, 73.3% of missed acute coronary syndrome calls were rated unsafe, versus 22.5% of matched controls[3].

This is the design constraint Calline accepts. The goal is not to replace nurses but to deflect routine calls safely while routing every high-risk presentation to a human nurse or to 911, using the uncertainty-gating logic that nurse-line safety depends on. The opposite design, a symptom checker that triages everyone, is what produces the BMJ 2015 numbers[5]: 34% top diagnosis and 57% appropriate triage, with triage appropriateness ranging from 80% on emergent vignettes down to 33% on self-care vignettes.

1.1 Contributions

A reproducible voice-first triage architecture composed of named open or commercial components, each with published latency and accuracy characteristics.
An explicit three-disposition gate (deflect, escalate to nurse, route to 911) with uncertainty-based routing.
An evaluation harness that measures deflection rate, escalation accuracy, and ED-revisit-within-24h — the safety metric the literature treats as the binding constraint.

§ 2 Background and Related Work

2.1 Nurse Triage Line Safety

The empirical case for nurse triage is well-established but bounded. Lattimer et al.[1] showed equivalent mortality and emergency admission rates vs GP-only at half the GP workload. Wheeler et al.[4] reviewed three decades of triage-line evidence and reported ~95% appropriate disposition rates with decision-support tools — but Huibers[2] documents the 89% / 46% safety drop on urgent and simulated high-risk cases respectively. Erkelens et al.'s ACS case-control study[3] is the load-bearing evidence that the failure mode is concentrated in time-critical presentations the model must escalate, not deflect.

2.2 Symptom Checkers and Their Limits

Semigran et al.'s BMJ 2015 audit[5] is the canonical evidence that pure rule-based symptom checkers under-perform clinicians: 34% top diagnosis, 57% triage appropriateness. The more recent Hammoud et al. JMIR AI 2024 study[6] shows next-generation symptom checkers (Avey) can match primary-care physicians on diagnostic accuracy across 400 peer-reviewed vignettes — a substantial improvement but still operating in a vignette regime, not telephone audio.

2.3 Voice Agent Technical Stack

Calline's voice stack is built on named components with documented characteristics. Macháček et al.'s "Whisper Streaming"[7] achieves 3.3 s end-to-end latency on long-form audio using a local-agreement streaming policy. Silero VAD[8] reaches 87.7% true-positive rate at 5% false-positive rate with sub-millisecond per-chunk inference. OpenAI's gpt-realtime[9] scores 82.8% on Big Bench Audio (a +17.2-point gain over the prior 65.6% baseline) and 30.5% on MultiChallenge. ElevenLabs Flash v2.5[10] reports ~75 ms time-to-first-byte for streaming TTS.
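The local-agreement policy mentioned above can be illustrated with a simplified sketch: a word is committed downstream only once two consecutive hypotheses over the growing audio buffer agree on it. This is not the implementation from [7], which works on timestamped tokens and trims its audio buffer; the class and function names here (`LocalAgreement`, `common_prefix`) are hypothetical and the hypotheses are plain word lists.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest common prefix of two word lists."""
    prefix: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement:
    """Commit only the words on which two consecutive hypotheses agree."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted downstream
        self.previous: List[str] = []   # last hypothesis for the audio so far

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed a new hypothesis; return the words newly committed by it."""
        agreed = common_prefix(self.previous, hypothesis)
        self.previous = hypothesis
        # Committed words are never retracted; only words past them are new.
        new_words = agreed[len(self.committed):]
        self.committed.extend(new_words)
        return new_words


policy = LocalAgreement()
first = policy.update("i have chest".split())
second = policy.update("i have chest pain".split())
third = policy.update("i have chest pain and sweating".split())
```

In this sketch the first hypothesis commits nothing, because there is no earlier hypothesis to agree with. The trailing words of each hypothesis stay provisional until the next one confirms them, which is where the policy trades latency for transcript stability.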

2.4 Out-of-Scope Detection and Escalation

Castillo-López et al.[11] demonstrate that a hybrid approach, which gates routing to an LLM on BERT classifier confidence, improves out-of-scope F1 by approximately 5 points over fine-tuned BERT baselines on multi-party dialogue. Calline uses the same hybrid pattern: a lightweight classifier on the partial transcript triggers escalation when the call topic deviates from triage scope (e.g., suicidal ideation, mental health crisis, requests for prescription refills).
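A minimal sketch of the confidence-gated routing pattern described above, assuming the lightweight classifier returns a probability per intent label. The label names, the `Route` type, the `route_intent` function, and the 0.85 threshold are all hypothetical illustrations, not values from [11] or from Calline.

```python
from dataclasses import dataclass
from typing import Dict

OUT_OF_SCOPE = "out_of_scope"  # hypothetical label for off-topic content


@dataclass(frozen=True)
class Route:
    target: str        # "triage", "nurse", or "llm_review"
    intent: str        # top-scoring intent label
    confidence: float  # classifier probability for that label


def route_intent(probs: Dict[str, float], accept_threshold: float = 0.85) -> Route:
    """Route one partial transcript from classifier probabilities.

    Confident predictions are acted on directly; uncertain ones are
    deferred to a second-stage LLM check instead of being trusted.
    """
    if not probs:
        raise ValueError("classifier returned no probabilities")
    top_intent = max(probs, key=lambda label: probs[label])
    confidence = probs[top_intent]
    if confidence < accept_threshold:
        return Route("llm_review", top_intent, confidence)
    if top_intent == OUT_OF_SCOPE:
        return Route("nurse", top_intent, confidence)
    return Route("triage", top_intent, confidence)


in_scope = route_intent({"symptom_report": 0.95, OUT_OF_SCOPE: 0.05})
off_topic = route_intent({"symptom_report": 0.08, OUT_OF_SCOPE: 0.92})
unsure = route_intent({"symptom_report": 0.55, OUT_OF_SCOPE: 0.45})
```

Deferring the uncertain middle band to a slower model keeps the fast classifier on the common path, while borderline transcripts get a second opinion before the call continues.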

2.5 Regulatory and Billing Context

HHS OCR[12] clarifies that the HIPAA Security Rule applies to audio carried over VoIP, cellular, or SIP infrastructure (Twilio-class services) but not to traditional analog landlines, so Calline's architecture requires a business associate agreement (BAA) and encryption. The CMS billing landscape changed materially in January 2025: CPT 99441–99443 were deleted[13] and replaced with new audio-only E/M codes (98008–98015), which Medicare has not currently adopted; current billing uses 99202–99215 with modifier 93 or FQ. The NCSBN position paper[14] establishes that telephone triage is the practice of nursing in all 50 U.S. jurisdictions, with licensure required in the state where the patient is located.

§ 3 Proposed Approach

3.1 Voice Pipeline

Figure 1 · Calline real-time voice pipeline

Figure 1. The Calline real-time voice pipeline. (1) Silero VAD[8] handles endpointing at 87.7% TPR / 5% FPR; (2) Whisper streaming[7] with local-agreement policy reaches 3.3 s end-to-end latency; (3) a BERT-confidence out-of-scope gate following Castillo-López et al.[11] routes off-topic content to human nurses immediately; (4) OpenAI gpt-realtime[9] serves as the conversational core (82.8% on Big Bench Audio); (5) ElevenLabs Flash v2.5 TTS[10] emits the spoken reply with ~75 ms time-to-first-byte. The disposition gate enforces the three-way decision pinned against the Huibers safety profile[2].
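Step (1) of the pipeline, endpointing, can be sketched as a silence timer over per-chunk speech probabilities of the kind a VAD emits. This is an illustrative sketch only: the function name `find_end_of_turn`, the 32 ms chunk duration, the 0.5 speech threshold, and the 500 ms silence window are assumptions, not Calline's or Silero's documented settings.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    speech_probs: Iterable[float],
    chunk_ms: int = 32,
    speech_threshold: float = 0.5,
    min_silence_ms: int = 500,
) -> Optional[int]:
    """Return the index of the chunk at which the caller's turn ends.

    The turn ends once speech has been heard and is then followed by at
    least `min_silence_ms` of continuous non-speech. Returns None if
    that never happens in the given stream.
    """
    heard_speech = False
    silence_ms = 0
    for index, prob in enumerate(speech_probs):
        if prob >= speech_threshold:
            heard_speech = True
            silence_ms = 0  # any speech chunk resets the silence timer
        elif heard_speech:
            silence_ms += chunk_ms
            if silence_ms >= min_silence_ms:
                return index
    return None


# 5 chunks of leading silence, 10 chunks of speech, 20 chunks of silence.
stream = [0.1] * 5 + [0.9] * 10 + [0.1] * 20
end_index = find_end_of_turn(stream)
```

The silence window contributes directly to the "caller stops → agent starts" latency: a longer window cuts the caller off less often but delays every reply by the same amount.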

3.2 Disposition Gate

Three terminal outcomes, evaluated continuously throughout the call: deflect (home self-care; agent provides closed-loop instructions and offers a scheduled callback), escalate (human nurse takes over; conversation summary pre-populated), 911 (deterministic red-flag fired; agent instructs caller to hang up and dial 911 immediately, with an offer to conference the call). The deflect path requires both high confidence and absence of red flags; escalate is the default on uncertainty; 911 is the default on any red-flag positive.
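The precedence among the three dispositions can be sketched as a small pure function. The names (`Disposition`, `decide`) and the 0.9 deflection threshold are hypothetical. The `out_of_scope` input is an assumption that carries over the out-of-scope gate from Figure 1, and how the confidence score is produced is left open.

```python
from enum import Enum


class Disposition(Enum):
    DEFLECT = "deflect"    # home self-care
    ESCALATE = "escalate"  # hand off to a human nurse
    CALL_911 = "911"       # deterministic red flag fired


def decide(
    red_flag: bool,
    out_of_scope: bool,
    confidence: float,
    deflect_threshold: float = 0.9,
) -> Disposition:
    """Apply the gate: red flags first, then uncertainty, then deflect."""
    if red_flag:
        return Disposition.CALL_911
    # Written as `not (x >= t)` so that a NaN confidence escalates
    # instead of slipping through to deflection.
    if out_of_scope or not (confidence >= deflect_threshold):
        return Disposition.ESCALATE
    return Disposition.DEFLECT


chest_pain = decide(red_flag=True, out_of_scope=False, confidence=0.99)
vague_call = decide(red_flag=False, out_of_scope=False, confidence=0.50)
mild_cold = decide(red_flag=False, out_of_scope=False, confidence=0.95)
```

Because the gate is evaluated continuously, a function of this shape would be re-run on every turn. Deflection is the only branch that needs every check to pass, so any missing or malformed input falls through to escalation.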

§ 4 Evaluation Protocol

**Table 1.** Calline evaluation metrics.

| Metric | Definition | Target |
| --- | --- | --- |
| Deflection rate | Fraction of calls terminated in self-care disposition | ≥ 60% |
| 24h ED-revisit rate | Deflected callers presenting to ED within 24 hours | 0 (hard gate) |
| Red-flag recall | Sensitivity for deterministic red-flag presentations (ACS, stroke, sepsis, anaphylaxis) | ≥ 0.95 |
| Latency | End-to-end turn-around (caller stops → agent starts) | < 2.0 s p95 |
| Out-of-scope catch | Mental health crisis / scope-of-practice cases routed to human | ≥ 0.95 |

Pass criterion. Calline v0.1 succeeds if it achieves at least 60% deflection with zero 24h ED revisits among deflected callers in a 500-call evaluation set, red-flag recall ≥ 0.95, and end-to-end latency under 2 seconds at p95. The zero-revisit gate is the binding safety constraint, derived from Erkelens et al.'s ACS findings[3], and it dominates the deflection target.
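The pass criterion can be written as a single check over aggregate results, sketched below with hypothetical names (`EvalResult`, `passes`, `zero_event_upper_bound`). The second function is an addition for context rather than part of the stated criterion: it gives the exact one-sided upper confidence bound on an event rate when zero events are observed in n trials, which shows how much a zero-revisit result on a finite call set can establish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    total_calls: int
    deflected: int           # calls ending in the self-care disposition
    ed_revisits_24h: int     # deflected callers seen in an ED within 24 h
    red_flag_recall: float
    latency_p95_s: float


def passes(result: EvalResult) -> bool:
    """Apply the v0.1 pass criterion to aggregate evaluation results."""
    if result.total_calls <= 0:
        return False
    deflection_rate = result.deflected / result.total_calls
    return (
        deflection_rate >= 0.60
        and result.ed_revisits_24h == 0
        and result.red_flag_recall >= 0.95
        and result.latency_p95_s < 2.0
    )


def zero_event_upper_bound(n: int, alpha: float = 0.05) -> float:
    """Upper bound on the true event rate after 0 events in n trials.

    Solves (1 - p) ** n = alpha for p (exact one-sided binomial bound).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - alpha ** (1.0 / n)


good_run = EvalResult(500, 310, 0, 0.97, 1.8)
one_revisit = EvalResult(500, 310, 1, 0.97, 1.8)
bound_300 = zero_event_upper_bound(300)
```

At exactly 60% deflection, a 500-call set yields 300 deflected callers. Zero revisits among 300 is still statistically compatible with a true 24h revisit rate of up to roughly 1% at the 95% level, so reports on the hard gate are more informative when they include the evaluated sample size.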

§ 5 Expected Contributions

System. An open voice-first triage architecture with named components and documented latency.
Evaluation. A reproducible 500-call evaluation harness pinned to the Huibers safety profile[2].
Operating envelope. Documentation of the deflection-vs-safety tradeoff curve and the abstention rate required to hit zero ED revisits.

§ 6 Limitations and Risks

Voice-AI triage carries category-specific risks the literature already documents. Asymmetric failure costs (a missed sepsis call vs a routine deflection): Calline addresses this with the deterministic red-flag layer and zero-revisit gate, but the underlying red-flag screens still have published sensitivity under 100%. Equity and accent bias in ASR: Whisper's WER varies substantially by accent and demographic; Calline must measure WER by demographic stratum on a representative call set. Billing volatility: the January 2025 CPT changes[13] demonstrate that the regulatory environment changes faster than the technology, and any deployed Calline instance needs a quarterly billing review.
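The per-stratum WER measurement called for above can be sketched with a word-level edit distance. The function names and the `(stratum, reference, hypothesis)` sample format are hypothetical, and the stratum labels below are placeholders rather than real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_stratum(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Pooled WER per stratum: total word errors / total reference words."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for stratum, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[stratum] += word_edit_distance(ref, hyp)
        words[stratum] += len(ref)
    return {s: errors[s] / words[s] for s in words if words[s] > 0}


samples = [
    ("stratum_a", "i have chest pain", "i have chest pain"),
    ("stratum_b", "i have chest pain", "i have chess pain"),
    ("stratum_b", "my son has a fever", "my son has fever"),
]
rates = wer_by_stratum(samples)
```

Pooling errors over all reference words in a stratum, rather than averaging per-utterance WER, keeps short utterances from dominating the estimate.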

§ 7 Conclusion

Calline targets the gap between two well-documented modes of after-hours care: nurse triage lines (safe, expensive, capacity-bounded) and symptom-checker apps (cheap, capacity-unbounded, under-performing on triage[5]). The technical components needed to bridge that gap exist today with documented latency and accuracy[7][8][9][10]; the missing piece is the safety-pinned evaluation harness. Calline provides it.

References

[1] Lattimer V, George S, Thompson F, et al. Safety and effectiveness of nurse telephone consultation in out of hours primary care: randomised controlled trial. BMJ 317(7165):1054–9, 1998. pubmed.ncbi.nlm.nih.gov/9774295
[2] Huibers L, Smits M, Renaud V, Giesen P, Wensing M. Safety of telephone triage in out-of-hours care: a systematic review. Scandinavian Journal of Primary Health Care 29(4):198–209, 2011. pmc.ncbi.nlm.nih.gov/articles/PMC3308461
[3] Erkelens DC, Rutten FH, Wouters LT, et al. Missed Acute Coronary Syndrome During Telephone Triage at Out-of-Hours Primary Care: Lessons From A Case-Control Study. J Patient Saf 18(1):40–45, 2022. pmc.ncbi.nlm.nih.gov/articles/PMC8719497
[4] Wheeler SQ, Greenberg ME, Mahlmeister L, Wolfe N. Safety of clinical and non-clinical decision makers in telephone triage: a narrative review. J Telemed Telecare 21(6):305–22, 2015. pubmed.ncbi.nlm.nih.gov/26026188
[5] Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ 351:h3480, 2015. pubmed.ncbi.nlm.nih.gov/26157077
[6] Hammoud M, Douglas S, Darmach M, et al. Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study. JMIR AI 3:e46875, 2024. ai.jmir.org/2024/1/e46875
[7] Macháček D, Dabre R, Bojar O. Turning Whisper into Real-Time Transcription System. IWSLT 2023 / arXiv:2307.14743. arxiv.org/abs/2307.14743
[8] Silero Team. Silero VAD: pre-trained enterprise-grade Voice Activity Detector. Open-source release, 2024. github.com/snakers4/silero-vad
[9] OpenAI. Introducing gpt-realtime and Realtime API updates for production voice agents. August 2025. openai.com/index/introducing-gpt-realtime
[10] ElevenLabs. Models / Understanding latency (Flash v2.5). Documentation. elevenlabs.io/docs/eleven-api/concepts/latency
[11] Castillo-López G, de Chalendar G, Semmar N. Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations. arXiv:2507.22289, 2025. Reports an approximately 5-point F1 improvement on out-of-scope detection over fine-tuned BERT baselines using a hybrid confidence-gated routing approach. arxiv.org/abs/2507.22289
[12] HHS Office for Civil Rights. Guidance on Audio-Only Telehealth under HIPAA. June 2022. hhs.gov/hipaa/.../hipaa-audio-telehealth
[13] AMA / CPT. 2025 Telemedicine Code Set: deletion of 99441–99443 and creation of 98008–98015. Effective January 2025. ama-assn.org/practice-management/cpt/how-ama-meets-need-new-telehealth-cpt-codes
[14] NCSBN. Position Paper on Telehealth Nursing Practice. National Council of State Boards of Nursing. ncsbn.org/public-files/14_Telehealth.pdf

Preliminary manuscript · Calline v0.1 · Dossier №01