Dossier №01 · Project 06 · Pharos

Pharos: A Long-Horizon Voice Agent for Chronic Disease Management with Clinician Oversight

Weekly voice check-ins, persistent patient memory across months, and an escalation loop that puts a human clinician in the driver's seat. Targeted at heart failure and type-2 diabetes.

Chandra Healthcare AI Engineering · Compiled May 2026

Abstract. The national 30-day heart-failure readmission rate is approximately 22–23% and is tracked under CMS's Hospital Readmissions Reduction Program[4]. The two largest published heart-failure remote-monitoring RCTs, Tele-HF[3] and BEAT-HF[2], produced null primary endpoints, and Tele-HF saw adherence fall to ~55% by week 26. That points to a problem of engagement and integration with care, not of monitoring per se. Pharos is a voice-first long-horizon agent for chronic disease management whose three load-bearing design choices respond directly to those null results: (i) voice rather than IVR or app, to reduce friction (a pilot of Sensely's avatar found avatar-based check-ins acceptable[15]); (ii) persistent memory across months via the MemGPT[12] / recursive-summarisation[13] / Mem0[14] family of long-horizon agent architectures; (iii) tight clinician oversight, so that detection triggers a human-led intervention loop, not just a notification. Pass criterion: 80% weekly check-in adherence at 12 weeks (versus Tele-HF's ~55%) and a documented escalation pattern matched against DSMES consensus practice[8].

§ 1 Introduction

Chronic disease management lives in the gap between one clinical visit and the next. Heart failure is the canonical example: HF-ACTION[1] randomised 2,331 patients to exercise training versus usual care and showed a non-significant HR of 0.93 for the primary endpoint of death or hospitalisation. The gap between the intervention's biological efficacy and its real-world effect is a problem of adherence and engagement. Tele-HF[3] and BEAT-HF[2] are the canonical RCTs for remote patient monitoring in HF; both produced null primary endpoints, and Tele-HF documented the adherence decay that drives the result.

Type-2 diabetes shows a different pattern. Continuous glucose monitoring works: the MOBILE trial[5] shows a 1.1-percentage-point HbA1c reduction at 8 months with CGM versus 0.6 with BGM (adjusted Δ −0.4%, p=0.02). Large-scale connected-meter-plus-coaching programs (Livongo[6]) document a 16.4% reduction in hyperglycemia days and an 18.4% reduction in hypoglycemia days versus baseline in a cohort of 4,544 members. The diabetes evidence is the case for the program structure; the HF evidence is the case for why naive remote monitoring fails. Pharos integrates both lessons.

1.1 Contributions

A voice-first chronic-disease management agent for HF and type-2 diabetes, with weekly check-ins, persistent memory, and structured escalation.
A long-horizon memory architecture that operationalises MemGPT-style[12] hierarchical context, recursive summarisation[13], and Mem0-style[14] structured fact extraction in a clinical setting.
An evaluation against the long-term adherence benchmark set by Tele-HF (~55%) and BEAT-HF (mid-50s). The explicit goal is to halve the dropout rate, not to claim a new clinical outcome.

§ 2 Background and Related Work

2.1 Heart Failure: The Evidence Wall

HF-ACTION[1], BEAT-HF[2], and Tele-HF[3] together establish the central HF self-management finding: biologically beneficial interventions fail to reach their expected effect size in real-world cohorts unless the engagement substrate is solid. Tele-HF's specific number, adherence falling to ~55% by week 26 in an IVR-based daily check-in regime, is the baseline that Pharos's voice-first architecture must beat. CMS's HRRP penalises hospitals for excess HF readmissions[4], providing the ROI signal for the intervention.

2.2 Diabetes: The Engagement Pattern That Works

Martens et al.'s MOBILE trial[5] demonstrates a real intervention effect on HbA1c. Downing et al.'s Livongo cohort[6] documents the at-scale outcome: 16.4% / 18.4% reduction in hyper-/hypoglycemia days in 4,544 members, with the engagement model being a connected glucose meter plus a Certified Diabetes Educator coaching layer. Sepah et al.'s Omada three-year follow-up[7] reports sustained weight loss (~3% at year three) and an HbA1c reduction of approximately 0.31%. Powers et al.'s ADA/AADE consensus[8] defines four critical timepoints for DSMES referral — at diagnosis, annually, with new complications, and with care transitions — and reports associated HbA1c reductions in the range of 0.45–0.57% across the underlying evidence base.

2.3 Conversational AI Agents in Chronic Care

Fitzpatrick et al.'s Woebot RCT[9] demonstrates that a fully automated conversational agent can produce a significant PHQ-9 reduction at 2 weeks (F=6.47, p=0.01) in a 70-participant trial. Inkster et al.'s Wysa evaluation[10] reports that high-engagement users (≥2 sessions/week × 2 weeks) saw PHQ-9 reductions of more than 5 points, a larger improvement than low-engagement users. Laranjo et al.'s systematic review[11] covers 17 healthcare conversational-agent studies through 2018: most used finite-state dialogue, acceptability was generally high, and rigorous outcome evidence was limited.

2.4 Long-Horizon Agent Memory

MemGPT[12] (now Letta) introduced hierarchical context management with main and external memory tiers, enabling multi-session conversational agents that retain prior context. Wang et al.'s recursive-summarisation work[13] demonstrates improved dialogue consistency across thousands of turns — directly applicable to a patient-clinician relationship that persists for months. Mem0[14] (2025) demonstrates hierarchical fact extraction across sessions with token-efficient retrieval at production scale. Pharos uses Mem0 as the memory substrate.
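
The recursive-summarisation idea can be illustrated with a short sketch. This is not the Mem0 or MemGPT API and not Pharos's implementation: the class `EpisodicMemory`, the `Summariser` signature, and the window size are hypothetical names chosen for illustration, and `toy_summariser` merely concatenates text in place of an LLM call so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical summariser signature: (previous_summary, evicted_turns) -> new_summary.
Summariser = Callable[[str, List[str]], str]


@dataclass
class EpisodicMemory:
    """Keeps the most recent turns verbatim and folds older ones into a summary."""
    summarise: Summariser
    window: int = 4                      # verbatim turns kept (illustrative value)
    summary: str = ""
    recent: List[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        overflow = len(self.recent) - self.window
        if overflow > 0:
            evicted, self.recent = self.recent[:overflow], self.recent[overflow:]
            # Recursive step: the new summary is built from the old summary
            # plus the turns that just left the verbatim window.
            self.summary = self.summarise(self.summary, evicted)

    def context(self) -> str:
        """Text handed to the conversational core at the start of a call."""
        head = [f"[summary] {self.summary}"] if self.summary else []
        return "\n".join(head + self.recent)


def toy_summariser(previous: str, evicted: List[str]) -> str:
    # Stand-in for an LLM call; it only concatenates so the example runs offline.
    return " | ".join(([previous] if previous else []) + evicted)


memory = EpisodicMemory(summarise=toy_summariser, window=2)
for week in range(1, 5):
    memory.add_turn(f"week {week}: weight stable")
print(memory.context())
```

After four turns with a window of two, weeks 3 and 4 remain verbatim and weeks 1 and 2 live only in the summary, which is the property that keeps context size bounded over months of check-ins.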

2.5 Avatar / Voice Agents for HF

A 2023 pilot integrating Sensely's "Molly" avatar with HF telemedicine[15] rated avatar-based check-ins as usable and satisfactory; vendor reports of reduced 30-day readmissions exist but lack peer-reviewed RCT confirmation. Pharos's voice-first design is the deliberate counterpart: voice avoids the visual-attention requirement avatars impose, which matters for the elderly HF population.

§ 3 Proposed Approach

3.1 Architecture

Figure 1 · Long-horizon agent with clinician loop

Figure 1. Pharos pairs a voice agent with a clinician dashboard. The voice agent maintains three memory tiers — episodic (recent weeks verbatim plus recursive summary per Wang et al.[13]), semantic (Mem0-style[14] structured facts), and rule-based red flags (HF: 3-lb-in-1-day or 5-lb-in-1-week weight gain; DM: glucose <54 mg/dL or >300 mg/dL twice consecutively). The conversational core (OpenAI gpt-realtime or Claude) runs on top of the memory substrate. The clinician dashboard exposes the patient panel, a red-flag inbox, and an accept/override surface — this is the loop that the Tele-HF[3] null result identifies as missing in IVR-only monitoring.
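
The rule-based red-flag tier in the caption above can be sketched as plain deterministic checks. This is a minimal sketch under stated assumptions: the function and flag names are hypothetical, the weekly rule is read as "any reading within the previous 7 days", and "twice consecutively" is read as applying to both glucose thresholds. The caption itself leaves the last point ambiguous, so a deployment would need the clinical team to confirm it.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class WeightReading:
    day: date
    pounds: float


def hf_weight_flags(readings: Sequence[WeightReading]) -> List[str]:
    """Return the HF weight-gain rules fired by the most recent reading."""
    if not readings:
        return []
    ordered = sorted(readings, key=lambda r: r.day)
    latest, prior = ordered[-1], ordered[:-1]
    # Rule 1: gain of 3 lb or more versus the reading taken one day earlier.
    day_gain = any((latest.day - r.day).days == 1
                   and latest.pounds - r.pounds >= 3.0 for r in prior)
    # Rule 2: gain of 5 lb or more versus any reading in the previous 7 days.
    week_gain = any(1 <= (latest.day - r.day).days <= 7
                    and latest.pounds - r.pounds >= 5.0 for r in prior)
    flags = []
    if day_gain:
        flags.append("gain_3lb_1day")
    if week_gain:
        flags.append("gain_5lb_1week")
    return flags


def dm_glucose_flag(values_mg_dl: Sequence[float], consecutive: int = 2) -> bool:
    """True when the last `consecutive` readings are all <54 or all >300 mg/dL."""
    tail = list(values_mg_dl)[-consecutive:]
    if len(tail) < consecutive:
        return False
    return all(v < 54 for v in tail) or all(v > 300 for v in tail)


# Synthetic readings for illustration only (not patient data).
start = date(2026, 5, 1)
weights = [WeightReading(start, 182.0),
           WeightReading(start + timedelta(days=5), 184.0),
           WeightReading(start + timedelta(days=6), 187.5)]
print(hf_weight_flags(weights))
print(dm_glucose_flag([140, 310, 322]))
```

Keeping these checks outside the language model means a red flag fires identically regardless of how the conversation went, which is what makes the tier auditable.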

Safety surface

Clinician-in-the-loop. Every model-initiated escalation (med change, new symptom, missed log) is reviewed by the patient's care team within 24 hours before any action propagates.
Stopping rule. A pre-specified composite of cardiac decompensation, severe hypoglycaemia (BG < 54 mg/dL), or severe hyperglycaemia (BG > 300 mg/dL) triggers an immediate study pause and external clinical review.
Audit log. Every call, prompt, and care-team handoff is written to an append-only log keyed by patient and consent scope.
Data boundary. Voice transcripts and biometric streams stay inside the BAA-covered subsystem; no PHI leaves the boundary at inference time.
Failure metric. 12-week adherence with 95% CI as the primary endpoint; the composite serious-adverse-event rate as the binding safety gate.
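
One way to make the audit log in the list above tamper-evident is to chain entries by hash. The manuscript specifies only "append-only, keyed by patient and consent scope"; the hash chain, the `AuditLog` class, and its field names are assumptions added here for illustration, and a real deployment would persist entries to durable storage instead of a Python list.

```python
import hashlib
import json
from typing import Dict, List


class AuditLog:
    """In-memory append-only log; each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    @staticmethod
    def _digest(body: Dict[str, str]) -> str:
        # Canonical JSON (sorted keys) so the same body always hashes the same way.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, patient_id: str, consent_scope: str, event: str) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {"patient_id": patient_id, "consent_scope": consent_scope,
                "event": event, "prev_hash": prev_hash}
        digest = self._digest(body)
        self._entries.append({**body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every hash and check that each entry points at its predecessor."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(body):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("patient-001", "voice-checkin", "call started")
log.append("patient-001", "voice-checkin", "red flag handed to care team")
print(log.verify())
```

Editing any earlier entry changes its hash and breaks every later link, so `verify()` detects after-the-fact modification without needing a separate integrity store.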

3.2 Weekly Check-In Structure

Each call is approximately 15 minutes, structured around three sections: state update (medications, symptoms, daily weights or glucose), open conversation (the patient's questions and concerns), and plan and education (next-week goals and contextual teaching from clinical guidelines). The state update has fixed structured prompts; the open conversation is unscripted. The guideline-anchored education follows the ADA-recommended DSMES timepoint framework[8] for diabetes and the BEAT-HF[2] patient-education content set for HF.

3.3 Escalation Loop

Three escalation tiers. Routine findings are summarised into the dashboard, which the care manager reviews the next business day. Escalate findings (a deterministic red flag fired, or model confidence in the absence of a red flag fell below threshold) trigger an RN callback within four hours during business hours, or a scheduled clinic visit. 911 findings (acute symptoms matching the deterministic life-threat list: crushing chest pain, severe dyspnea, focal neuro symptoms) trigger an immediate instruction to hang up and dial 911, with an offer to conference the call.
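
The tier logic above can be sketched as a small pure function. The names `Tier`, `triage`, and `p_no_red_flag` are hypothetical, and the 0.90 confidence threshold is a placeholder; the manuscript does not state a value. The life-threat set contains only the three examples the text gives.

```python
from enum import Enum
from typing import Iterable


class Tier(Enum):
    ROUTINE = "routine"        # next-business-day dashboard review
    ESCALATE = "escalate"      # RN callback within four hours, or clinic visit
    EMERGENCY_911 = "911"      # instruct the patient to dial 911

# Deterministic life-threat list; only the examples named in the text.
LIFE_THREAT = frozenset({"crushing chest pain", "severe dyspnea", "focal neuro symptoms"})


def triage(symptoms: Iterable[str], red_flag_fired: bool,
           p_no_red_flag: float, threshold: float = 0.90) -> Tier:
    """Map one check-in to an escalation tier; deterministic checks run first."""
    if LIFE_THREAT & {s.strip().lower() for s in symptoms}:
        return Tier.EMERGENCY_911
    # Escalate on a fired rule OR when the model is not confident nothing is wrong.
    if red_flag_fired or p_no_red_flag < threshold:
        return Tier.ESCALATE
    return Tier.ROUTINE


print(triage(["mild fatigue"], red_flag_fired=False, p_no_red_flag=0.97))
print(triage(["mild fatigue"], red_flag_fired=False, p_no_red_flag=0.62))
print(triage(["Crushing chest pain"], red_flag_fired=False, p_no_red_flag=0.99))
```

Ordering the checks from most to least severe means a high model confidence can never suppress a life-threat match, and low confidence can only raise the tier.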

§ 4 Evaluation Protocol

**Table 1.** Pharos evaluation metrics.

| Metric | Definition | Target |
|---|---|---|
| 12-week adherence | Patients completing 10+ of 12 weekly calls | ≥ 80% (Tele-HF baseline ~55%) |
| Memory recall accuracy | Agent correctly recalls previously discussed facts at week 12 | ≥ 90% |
| Red-flag recall | Sensitivity for seeded HF/DM red flags | ≥ 0.95 |
| Time-to-clinician | Median time from red-flag detection to clinician contact | ≤ 4 h (business hours) |
| 30-day readmission delta (HF subset) | Adjusted readmission rate change vs matched cohort | Trend toward reduction |
| HbA1c change at 6 months (DM subset) | Adjusted HbA1c change vs matched cohort | −0.4% to −0.6% (MOBILE-comparable[5]) |

Pass criterion. Pharos v0.1 succeeds if 12-week adherence reaches 80% (substantially above Tele-HF's 55% IVR baseline[3]) and red-flag recall is ≥ 0.95 in a 200-patient mixed HF/DM pilot. Clinical-outcome targets are secondary in v0.1 and require a v0.2 RCT to test.
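
The adherence metric in Table 1 (patients completing 10 or more of 12 weekly calls) and its 95% interval can be computed as below. The manuscript does not name an interval method; the Wilson score interval is an assumption chosen here because it behaves well for proportions near 0 or 1. The function name and the call counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence, Tuple


def adherence_with_ci(calls_completed: Sequence[int], min_calls: int = 10,
                      confidence: float = 0.95) -> Tuple[float, float, float]:
    """Share of patients with >= min_calls completed calls, plus a Wilson interval."""
    n = len(calls_completed)
    if n == 0:
        raise ValueError("need at least one patient")
    p = sum(c >= min_calls for c in calls_completed) / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half


# Synthetic call counts for five hypothetical patients (not study data).
rate, low, high = adherence_with_ci([12, 10, 9, 11, 4])
print(f"adherence={rate:.2f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only five patients the interval spans most of the unit range, which illustrates why the pass criterion is stated for a 200-patient pilot and not for a handful of early enrolees.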

4.1 Statistical Plan and Safety Gates

Primary endpoint. 12-week adherence (composite of weight logs, BP readings, or glucose checks at the prescribed cadence), reported with its 95% confidence interval.
Powering. Detecting an absolute improvement from the Tele-HF 55% benchmark to a target of 80% at α = 0.05 and power 0.80 requires roughly 50 patients per condition arm; the pilot plan is N = 100 per arm, split across HF and type-2 diabetes.
Safety monitoring. Hard-coded: a pre-specified composite of cardiac decompensation (HF re-hospitalisation), severe hypoglycaemia (BG < 54 mg/dL), or severe hyperglycaemia (BG > 300 mg/dL) is monitored continuously. Any serious adverse event triggers an immediate pause and external clinical review.
Clinician oversight. Every model-initiated escalation is reviewed by the patient's care team within 24 hours; the adherence endpoint is adjudicated by an independent panel blinded to study arm.
Equity audit. Adherence is stratified by race, age, broadband-access status, and primary language, and reported with per-stratum CIs.
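
The powering figure above can be checked with the standard normal-approximation formula for comparing two independent proportions, n = (z_{1−α/2} + z_{power})² · [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)². The sketch assumes a two-sided test and the unpooled-variance form of the formula; the manuscript does not say which variant was used, and `n_per_arm` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control: float, p_target: float, alpha: float = 0.05,
              power: float = 0.80) -> int:
    """Patients per arm for a two-sided test of two independent proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for power = 0.80
    variance = p_control * (1 - p_control) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_target - p_control) ** 2)


print(n_per_arm(0.55, 0.80))  # 52
```

The result of 52 per arm is consistent with the "roughly 50" quoted above, and the planned N = 100 per arm leaves headroom for attrition and for the HF / type-2 diabetes split.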

§ 5 Expected Contributions

System. An open long-horizon voice agent with clinician oversight for HF and type-2 diabetes.
Memory architecture demonstration. A concrete clinical instantiation of the MemGPT[12] / Mem0[14] patterns showing what tier holds what facts at what retention.
Adherence finding. A quantitative comparison of voice-first vs IVR adherence at 12 weeks — the design lever that the null-result HF RCTs identify as the failure point.

§ 6 Limitations and Risks

The HF and DM populations differ enough that a single system has to negotiate two distinct red-flag profiles and education pathways; Pharos handles this with condition-specific configuration, but the empirical case for the unified architecture is unproven. The clinician dashboard creates new alert burden, and the experience of clinical decision support (CDS)[11] is that alert fatigue dominates outcomes unless escalation discipline holds. Voice-only delivery may exclude patients with hearing impairment, which intersects with the elderly HF population; an SMS-fallback mode is required and is a v0.2 task.

The deepest risk is that Pharos succeeds in adherence and still fails to move clinical outcomes — the exact pattern of Tele-HF and BEAT-HF[2][3]. The v0.1 evaluation is therefore explicit that clinical outcome is a v0.2 question.

§ 7 Conclusion

Pharos is built on the recognition that the chronic-disease management RCTs of the prior decade failed not because remote monitoring is biologically inert but because the engagement substrate was wrong. Voice agents, long-horizon memory, and a clinician-in-the-loop dashboard are the three changes that the underlying literature identifies as load-bearing. Whether they suffice is the empirical question Pharos is built to answer.

Model snapshots. Baseline frontier models referenced in this manuscript were accessed in May 2026. Concrete IDs where applicable: Claude Opus 4.7 (claude-opus-4-7), Claude Sonnet 4.6 (claude-sonnet-4-6), GPT-5 (gpt-5-2025-08-07), Gemini 2.5 Pro (gemini-2.5-pro-preview), gpt-realtime (gpt-realtime-2025-08), MedGemma 4B/27B (HuggingFace google/medgemma-4b-it, google/medgemma-27b-text-it), Qwen3-8B (Qwen/Qwen3-8B). Frontier-model identifiers shift between releases; pin to these snapshots when reproducing.

References

[1] O'Connor CM, Whellan DJ, Lee KL, et al. Efficacy and Safety of Exercise Training in Patients With Chronic Heart Failure: HF-ACTION RCT. JAMA 301(14):1439–1450, 2009. jamanetwork.com/journals/jama/fullarticle/183708
[2] Ong MK, Romano PS, Edgington S, et al. Effectiveness of Remote Patient Monitoring After Discharge of Hospitalized Patients With Heart Failure: BEAT-HF RCT. JAMA Internal Medicine 176(3):310–318, 2016. jamanetwork.com/.../fullarticle/2488923
[3] Chaudhry SI, Mattera JA, Curtis JP, et al. Telemonitoring in Patients with Heart Failure (Tele-HF). NEJM 363(24):2301–2309, 2010. nejm.org/doi/full/10.1056/NEJMoa1010029
[4] CMS. Hospital Readmissions Reduction Program (HRRP). Centers for Medicare & Medicaid Services. cms.gov/.../hospital-readmissions-reduction-program-hrrp
[5] Martens T, Beck RW, Bailey R, et al. Effect of Continuous Glucose Monitoring on Glycemic Control in T2D Treated With Basal Insulin (MOBILE). JAMA 325(22):2262–2272, 2021. jamanetwork.com/journals/jama/fullarticle/2780593
[6] Downing J, Bollyky J, Schneider J. Connected Glucose Meter and CDE Coaching to Decrease Abnormal BG Excursions: Livongo for Diabetes Program. JMIR 19(7):e234, 2017. jmir.org/2017/7/e234
[7] Sepah SC, Jiang L, Ellis RJ, McDermott K, Peters AL. Engagement and Outcomes in a Digital Diabetes Prevention Program: 3-year Update (Omada). BMJ Open Diabetes Research & Care 5(1):e000422, 2017. pmc.ncbi.nlm.nih.gov/articles/PMC5595194
[8] Powers MA, Bardsley JK, Cypress M, et al. Diabetes Self-management Education and Support in Adults With Type 2 Diabetes: A Consensus Report. Diabetes Care 43(7):1636–1649, 2020. diabetesjournals.org/care/article/38/7/1372/30767
[9] Fitzpatrick KK, Darcy A, Vierhile M. Delivering CBT to Young Adults With Depression/Anxiety Using a Fully Automated Conversational Agent (Woebot): RCT. JMIR Mental Health 4(2):e19, 2017. mental.jmir.org/2017/2/e19
[10] Inkster B, Sarda S, Subramanian V. An Empathy-Driven, Conversational AI Agent (Wysa) for Digital Mental Well-Being. JMIR mHealth uHealth 6(11):e12106, 2018. mhealth.jmir.org/2018/11/e12106
[11] Laranjo L, Dunn AG, Tong HL, et al. Conversational agents in healthcare: a systematic review. JAMIA 25(9):1248–1258, 2018. academic.oup.com/jamia/article/25/9/1248/5052181
[12] Packer C, Wooders S, Lin K, et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560, 2023. arxiv.org/abs/2310.08560
[13] Wang Q, Fu Y, Cao Y, Wang S, Tian Z, Ding L. Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. arXiv:2308.15022, 2023. arxiv.org/abs/2308.15022
[14] Chhikara P, Khant D, Aryan S, et al. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413, 2025. arxiv.org/pdf/2504.19413
[15] Integrating avatar technology into a telemedicine application in heart failure patients (Sensely "Molly" pilot). 2023. pmc.ncbi.nlm.nih.gov/articles/PMC9894666

Preliminary manuscript · Pharos v0.1 · Dossier №01
C. Chandra Vikram