Paper 14 / 15 · Preliminary Manuscript · v0.1 · May 2026
Dossier №01 · Project 14 · Pharos

Pharos: A Long-Horizon Voice Agent for Chronic Disease Management with Clinician Oversight

Weekly voice check-ins, persistent patient memory across months, and an escalation loop that puts a human clinician in the driver's seat. Targeted at heart failure and type-2 diabetes.

Abstract. The 30-day heart-failure readmission rate is approximately 22–23% nationally and is tracked under CMS's Hospital Readmissions Reduction Program[4]. The two largest published heart-failure remote-monitoring RCTs, Tele-HF[3] and BEAT-HF[2], produced null primary endpoints, with Tele-HF showing adherence dropping to ~55% by week 26. This points to a problem of engagement and integration with care, not of monitoring per se. Pharos is a voice-first long-horizon agent for chronic disease management whose three load-bearing design choices respond directly to those null results: (i) voice rather than IVR/app to reduce friction (Sensely's pilot found avatar-based check-ins acceptable[15]); (ii) persistent memory across months via the MemGPT[12] / recursive-summarisation[13] / Mem0[14] family of long-horizon agent architectures; (iii) tight clinician oversight so that detection triggers a human-led intervention loop, not just a notification. Pass criterion: 80% weekly check-in adherence at 12 weeks (versus Tele-HF's ~55%) and a documented escalation pattern matched against the monitoring practice described in the ADA/AADE consensus report[8].

§ 1 Introduction

Chronic disease management lives in the gap between one clinical visit and the next. Heart failure is the canonical example: HF-ACTION[1] randomised 2,331 patients to exercise training versus usual care and showed a non-significant HR of 0.93 for the primary endpoint of death or hospitalisation. The gap between the intervention's biological efficacy and its real-world effect is a problem of adherence and engagement. Tele-HF[3] and BEAT-HF[2] are the canonical RCTs for remote patient monitoring in HF, and both produced null primary endpoints, with Tele-HF documenting the adherence decay that drives the result.

Type-2 diabetes shows a different pattern: continuous glucose monitoring works (the MOBILE trial[5] shows 1.1% HbA1c reduction at 8 months versus 0.6% with BGM, adjusted Δ −0.4%, p=0.02); and large-scale connected-meter-plus-coaching programs (Livongo[6]) document 16.4% reduction in hyperglycemia days and 18.4% in hypoglycemia days versus baseline at the 4,544-member scale. The diabetes evidence is the case for the program structure; the HF evidence is the case for why naive remote monitoring fails. Pharos integrates both lessons.

1.1 Contributions

  1. A voice-first chronic-disease management agent for HF and type-2 diabetes, with weekly check-ins, persistent memory, and structured escalation.
  2. A long-horizon memory architecture that operationalises MemGPT-style[12] hierarchical context, recursive summarisation[13], and Mem0-style[14] structured fact extraction in a clinical setting.
  3. An evaluation against the Tele-HF / BEAT-HF benchmark of long-term adherence in the mid-50s percent (Tele-HF: ~55%). The explicit goal is to halve the dropout rate, not to claim a new outcome.
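
The dropout arithmetic behind the third contribution can be checked with a short sketch. It assumes dropout is simply one minus adherence, and it compares the ~55% Tele-HF figure with the 80% target stated in the abstract; the function name `dropout_ratio` is illustrative and not part of Pharos.

```python
def dropout_ratio(baseline_adherence: float, target_adherence: float) -> float:
    """Return the target dropout rate as a fraction of the baseline dropout rate.

    Assumption: dropout = 1 - adherence for both arms.
    """
    baseline_dropout = 1.0 - baseline_adherence
    target_dropout = 1.0 - target_adherence
    return target_dropout / baseline_dropout


# ~55% adherence (Tele-HF) versus the 80% Pharos target.
ratio = dropout_ratio(0.55, 0.80)
print(f"Target dropout is {ratio:.0%} of baseline dropout")
```

Under these assumptions the dropout rate falls from 45% to 20%, which is slightly better than halving.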

§ 2 Background and Related Work

2.1 Heart Failure: The Evidence Wall

HF-ACTION[1], BEAT-HF[2], and Tele-HF[3] together establish the central HF self-management finding: biologically beneficial interventions fail to reach effect size in real-world cohorts unless the engagement substrate is solid. Tele-HF's specific number, adherence dropping from initial enrollment to ~55% by week 26 in an IVR-based daily check-in regime, is the baseline Pharos's voice-first architecture must beat. CMS's HRRP penalises HF readmissions at the population level[4], providing the ROI signal for the intervention.

2.2 Diabetes: The Engagement Pattern That Works

Martens et al.'s MOBILE trial[5] demonstrates a real intervention effect on HbA1c. Downing et al.'s Livongo cohort[6] documents the at-scale outcome: 16.4% / 18.4% reduction in hyper-/hypoglycemia days in 4,544 members, with the engagement model being a connected glucose meter plus a Certified Diabetes Educator coaching layer. Sepah et al.'s Omada three-year follow-up[7] reports sustained weight loss (~3% at year three) and an HbA1c reduction of approximately 0.31%. Powers et al.'s ADA/AADE consensus[8] defines four critical timepoints for DSMES referral — at diagnosis, annually, with new complications, and with care transitions — and reports associated HbA1c reductions in the range of 0.45–0.57% across the underlying evidence base.

2.3 Conversational AI Agents in Chronic Care

Fitzpatrick et al.'s Woebot RCT[9] demonstrates that a fully automated conversational agent can produce significant PHQ-9 reduction at 2 weeks (F=6.47, p=0.01) in a 70-participant trial. Inkster et al.'s Wysa evaluation[10] documents that high-engagement users (≥2 sessions/week × 2 weeks) saw larger PHQ-9 reductions (>5 points) than low-engagement users. Laranjo et al.'s systematic review[11] covers 17 healthcare conversational-agent studies through 2018: most used finite-state dialogue, acceptability was generally high, and rigorous outcome evidence was limited.

2.4 Long-Horizon Agent Memory

MemGPT[12] (now Letta) introduced hierarchical context management with main and external memory tiers, enabling multi-session conversational agents that retain prior context. Wang et al.'s recursive-summarisation work[13] demonstrates improved dialogue consistency across thousands of turns — directly applicable to a patient-clinician relationship that persists for months. Mem0[14] (2025) demonstrates hierarchical fact extraction across sessions with token-efficient retrieval at production scale. Pharos uses Mem0 as the memory substrate.
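
As an illustration of the recursive-summarisation idea, the sketch below folds older sessions into a running summary and keeps recent sessions verbatim. It is a minimal sketch under stated assumptions, not the method of Wang et al.[13] nor the Mem0 API: `recursive_summary` and `toy_summarize` are hypothetical names, and `toy_summarize` is a stand-in for the LLM call a real system would make.

```python
from typing import Callable, List, Tuple


def recursive_summary(
    sessions: List[str],
    summarize: Callable[[str, str], str],
    keep_verbatim: int = 12,
) -> Tuple[str, List[str]]:
    """Fold older sessions into one running summary; keep recent ones verbatim.

    `sessions` is ordered oldest first. `summarize(prev_summary, transcript)`
    returns an updated summary (an LLM call in practice).
    """
    if keep_verbatim > 0:
        older, recent = sessions[:-keep_verbatim], sessions[-keep_verbatim:]
    else:
        older, recent = sessions, []
    summary = ""
    for transcript in older:
        # Each step sees only the previous summary plus one new transcript.
        summary = summarize(summary, transcript)
    return summary, recent


def toy_summarize(prev_summary: str, transcript: str) -> str:
    """Hypothetical placeholder: keep the first sentence of each session."""
    first = transcript.split(".")[0].strip()
    return f"{prev_summary} | {first}" if prev_summary else first


sessions = [f"Week {i}: weight stable. No new symptoms." for i in range(1, 15)]
summary, recent = recursive_summary(sessions, toy_summarize, keep_verbatim=12)
print(summary)
print(len(recent))
```

The default of 12 verbatim sessions mirrors the "last 12 weeks verbatim" episodic tier named in Figure 1; everything else here is illustrative.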

2.5 Avatar / Voice Agents for HF

A 2023 pilot integrating Sensely's "Molly" avatar with HF telemedicine[15] rated avatar-based check-ins as usable and satisfactory; vendor reports of reduced 30-day readmissions exist but lack peer-reviewed RCT confirmation. Pharos's voice-first design is the deliberate counterpart: voice avoids the visual-attention requirement avatars impose, which matters for the elderly HF population.

§ 3 Proposed Approach

3.1 Architecture

Figure 1 · Long-horizon agent with clinician loop
[Figure 1 diagram. Patient: weekly call, 15 min. Pharos voice agent (gpt-realtime · Claude): episodic memory (last 12 weeks verbatim, recursive summary); semantic memory (facts: meds, weights, labs, goals); red-flag rules (HF: +3 lb in 1 d / +5 lb in 1 wk; DM: BG <54 / >300 ×2). Clinician dashboard (care manager view): patient list (panel), red-flag inbox, trend visualisations, accept / override, message back. Escalate: RN call · clinic visit · 911 (deterministic red flag). EHR: labs, meds.]
Figure 1. Pharos pairs a voice agent with a clinician dashboard. The voice agent maintains three memory tiers — episodic (recent weeks verbatim plus recursive summary per Wang et al.[13]), semantic (Mem0-style[14] structured facts), and rule-based red flags (HF: 3-lb-in-1-day or 5-lb-in-1-week weight gain; DM: glucose <54 mg/dL or >300 mg/dL twice consecutively). The conversational core (OpenAI gpt-realtime or Claude) runs on top of the memory substrate. The clinician dashboard exposes the patient panel, a red-flag inbox, and an accept/override surface — this is the loop that the Tele-HF[3] null result identifies as missing in IVR-only monitoring.
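
The deterministic red-flag rules in the caption could be expressed roughly as follows. This is a sketch, not the Pharos implementation. It assumes one weight reading per day, inclusive thresholds, and that the "twice consecutively" qualifier applies only to the >300 mg/dL rule; the caption is ambiguous on that last point, so the repeat counts are exposed as constants. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Thresholds taken from the Figure 1 caption.
HF_GAIN_1D_LB = 3.0
HF_GAIN_1WK_LB = 5.0
BG_LOW_MG_DL = 54.0
BG_HIGH_MG_DL = 300.0
# Assumption: consecutive readings required to breach each glucose threshold.
LOW_REPEATS = 1
HIGH_REPEATS = 2


@dataclass(frozen=True)
class RedFlag:
    condition: str  # "HF" or "DM"
    reason: str


def _last_n_all(values: Sequence[float], n: int,
                pred: Callable[[float], bool]) -> bool:
    """True when at least n values exist and the most recent n all satisfy pred."""
    return len(values) >= n and all(pred(v) for v in values[-n:])


def hf_weight_flags(daily_weights_lb: Sequence[float]) -> List[RedFlag]:
    """Expects one weight per day, oldest first."""
    w = daily_weights_lb
    flags = []
    if len(w) >= 2 and w[-1] - w[-2] >= HF_GAIN_1D_LB:
        flags.append(RedFlag("HF", "weight gain of 3 lb or more in 1 day"))
    # Index -8 is the reading taken seven days before the latest one.
    if len(w) >= 8 and w[-1] - w[-8] >= HF_GAIN_1WK_LB:
        flags.append(RedFlag("HF", "weight gain of 5 lb or more in 1 week"))
    return flags


def dm_glucose_flags(readings_mg_dl: Sequence[float]) -> List[RedFlag]:
    """Expects glucose readings in mg/dL, oldest first."""
    flags = []
    if _last_n_all(readings_mg_dl, LOW_REPEATS, lambda g: g < BG_LOW_MG_DL):
        flags.append(RedFlag("DM", "glucose below 54 mg/dL"))
    if _last_n_all(readings_mg_dl, HIGH_REPEATS, lambda g: g > BG_HIGH_MG_DL):
        flags.append(RedFlag("DM", "glucose above 300 mg/dL on consecutive readings"))
    return flags


print(hf_weight_flags([180.0, 180.5, 184.0]))
print(dm_glucose_flags([120.0, 310.0, 320.0]))
```

Keeping these rules outside the language model means a flag fires identically regardless of how the conversation went, which is the sense of "deterministic" in the figure.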

3.2 Weekly Check-In Structure

Each call is approximately 15 minutes, structured around three sections: state update (medications, symptoms, daily weights or glucose), open conversation (the patient's questions and concerns), and plan and education (next-week goals and contextual teaching from clinical guidelines). The state update has fixed structured prompts; the open conversation is unscripted. The guideline-anchored education follows the ADA-recommended DSMES timepoint framework[8] for diabetes and the BEAT-HF[2] patient-education content set for HF.

3.3 Escalation Loop

Pharos defines three escalation tiers. Routine findings are summarised into the dashboard, which the care manager reviews the next business day. Escalate findings (a deterministic red flag fired, or model confidence in the absence of a red flag below threshold) trigger an RN callback within four hours during business hours, or a scheduled clinic visit. 911 findings (acute symptoms matching the deterministic life-threat list: crushing chest pain, severe dyspnea, focal neuro symptoms) trigger an immediate instruction to hang up and dial 911, with an offer to conference the call.
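
The tier logic can be sketched as a small triage function. This is a hedged illustration, not the Pharos routing code. It assumes an upstream step has already normalised reported symptoms into the fixed labels shown, and the 0.9 confidence threshold is a placeholder because the manuscript does not state a value. All names are hypothetical.

```python
from enum import Enum
from typing import Set


class Tier(Enum):
    ROUTINE = "routine"          # dashboard summary, reviewed next business day
    ESCALATE = "escalate"        # RN callback within 4 h or scheduled clinic visit
    EMERGENCY_911 = "911"        # instruct the patient to hang up and dial 911


# Life-threat list as given in the text; labels assume upstream normalisation.
LIFE_THREAT_SYMPTOMS = frozenset(
    {"crushing chest pain", "severe dyspnea", "focal neuro symptoms"}
)


def triage(
    reported_symptoms: Set[str],
    red_flag_fired: bool,
    no_flag_confidence: float,
    confidence_threshold: float = 0.9,  # assumption: not specified in the source
) -> Tier:
    """Map one check-in's findings to an escalation tier, most severe first."""
    if reported_symptoms & LIFE_THREAT_SYMPTOMS:
        return Tier.EMERGENCY_911
    if red_flag_fired or no_flag_confidence < confidence_threshold:
        return Tier.ESCALATE
    return Tier.ROUTINE


print(triage({"mild fatigue"}, red_flag_fired=False, no_flag_confidence=0.97))
print(triage({"severe dyspnea"}, red_flag_fired=False, no_flag_confidence=0.99))
```

Checking the life-threat list first ensures that a 911 finding is never downgraded by a high model confidence score.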

§ 4 Evaluation Protocol

Table 1. Pharos evaluation metrics.
Metric | Definition | Target
12-week adherence | Patients completing 10+ of 12 weekly calls | ≥ 80% (Tele-HF baseline ~55%)
Memory recall accuracy | Agent correctly recalls previously-discussed facts at week 12 | ≥ 90%
Red-flag recall | Sensitivity for seeded HF/DM red flags | ≥ 0.95
Time-to-clinician | Median time from red-flag detection to clinician contact | ≤ 4 h (business hours)
30-day readmission delta (HF subset) | Adjusted readmission rate change vs matched cohort | Trend toward reduction
HbA1c change at 6 mo (DM subset) | Adjusted HbA1c change vs matched cohort | −0.4% to −0.6% (MOBILE-comparable[5])
Pass criterion. Pharos v0.1 succeeds if 12-week adherence reaches 80% (substantially above Tele-HF's 55% IVR baseline[3]) and red-flag recall ≥ 0.95 in a 200-patient mixed HF/DM pilot. Clinical-outcome targets are secondary in v0.1 and require a v0.2 RCT to test.
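
The two primary metrics and the pass criterion can be computed as in the sketch below. It assumes adherence is the share of patients completing at least 10 of 12 calls (the Table 1 definition) and that red-flag recall is detected seeded flags divided by all seeded flags. The function names and the example counts are hypothetical, not pilot data.

```python
from typing import List


def adherence_rate(calls_completed: List[int], min_calls: int = 10) -> float:
    """Fraction of patients who completed at least `min_calls` of 12 weekly calls."""
    if not calls_completed:
        raise ValueError("need at least one patient")
    return sum(c >= min_calls for c in calls_completed) / len(calls_completed)


def red_flag_recall(seeded: int, detected: int) -> float:
    """Sensitivity: seeded red flags that were detected, over all seeded flags."""
    if seeded <= 0:
        raise ValueError("need at least one seeded red flag")
    return detected / seeded


def passes_v01(calls_completed: List[int], seeded: int, detected: int) -> bool:
    """Apply the v0.1 pass criterion: adherence >= 80% and recall >= 0.95."""
    return (
        adherence_rate(calls_completed) >= 0.80
        and red_flag_recall(seeded, detected) >= 0.95
    )


# Made-up example: 200 patients, 160 of whom completed 10 or more calls.
example_calls = [12] * 160 + [5] * 40
print(adherence_rate(example_calls))
print(passes_v01(example_calls, seeded=100, detected=95))
```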

§ 5 Expected Contributions

  1. System. An open long-horizon voice agent with clinician oversight for HF and type-2 diabetes.
  2. Memory architecture demonstration. A concrete clinical instantiation of the MemGPT[12] / Mem0[14] patterns showing what tier holds what facts at what retention.
  3. Adherence finding. A quantitative comparison of voice-first vs IVR adherence at 12 weeks — the design lever that the null-result HF RCTs identify as the failure point.

§ 6 Limitations and Risks

The HF and DM populations differ enough that a single system has to negotiate two distinct red-flag profiles and education pathways; Pharos handles this with condition-specific configuration, but the empirical case for the unified architecture is unproven. The clinician dashboard creates new alert burden, and the experience of clinical decision support (CDS)[11] is that alert fatigue dominates outcome unless escalation discipline holds. Voice-only delivery may exclude patients with hearing impairment, which intersects with the elderly HF population; an SMS-fallback mode is required and is a v0.2 task.

The deepest risk is that Pharos succeeds in adherence and still fails to move clinical outcomes, the exact pattern of Tele-HF[3] and BEAT-HF[2]. The v0.1 evaluation is therefore explicit that clinical outcome is a v0.2 question.

§ 7 Conclusion

Pharos is built on the recognition that the chronic-disease management RCTs of the prior decade failed not because remote monitoring is biologically inert but because the engagement substrate was wrong. Voice agents, long-horizon memory, and a clinician-in-the-loop dashboard are the three changes that the underlying literature identifies as load-bearing. Whether they suffice is the empirical question Pharos is built to answer.

References

  1. O'Connor CM, Whellan DJ, Lee KL, et al. Efficacy and Safety of Exercise Training in Patients With Chronic Heart Failure: HF-ACTION RCT. JAMA 301(14):1439–1450, 2009. jamanetwork.com/journals/jama/fullarticle/183708
  2. Ong MK, Romano PS, Edgington S, et al. Effectiveness of Remote Patient Monitoring After Discharge of Hospitalized Patients With Heart Failure: BEAT-HF RCT. JAMA Internal Medicine 176(3):310–318, 2016. jamanetwork.com/.../fullarticle/2488923
  3. Chaudhry SI, Mattera JA, Curtis JP, et al. Telemonitoring in Patients with Heart Failure (Tele-HF). NEJM 363(24):2301–2309, 2010. nejm.org/doi/full/10.1056/NEJMoa1010029
  4. CMS. Hospital Readmissions Reduction Program (HRRP). Centers for Medicare & Medicaid Services. cms.gov/.../hospital-readmissions-reduction-program-hrrp
  5. Martens T, Beck RW, Bailey R, et al. Effect of Continuous Glucose Monitoring on Glycemic Control in T2D Treated With Basal Insulin (MOBILE). JAMA 325(22):2262–2272, 2021. jamanetwork.com/journals/jama/fullarticle/2780593
  6. Downing J, Bollyky J, Schneider J. Connected Glucose Meter and CDE Coaching to Decrease Abnormal BG Excursions: Livongo for Diabetes Program. JMIR 19(7):e234, 2017. jmir.org/2017/7/e234
  7. Sepah SC, Jiang L, Peters AL. Engagement and Outcomes in a Digital Diabetes Prevention Program: 3-year Update (Omada). BMJ Open Diabetes Research & Care, 2017. Sustained ~3% weight loss and HbA1c reduction of ~0.31% at year 3. pmc.ncbi.nlm.nih.gov/articles/PMC5595194
  8. Powers MA, Bardsley JK, Cypress M, et al. Diabetes Self-management Education and Support in Adults With Type 2 Diabetes: A Consensus Report. Diabetes Care 43(7):1636–1649, 2020. diabetesjournals.org/care/article/38/7/1372/30767
  9. Fitzpatrick KK, Darcy A, Vierhile M. Delivering CBT to Young Adults With Depression/Anxiety Using a Fully Automated Conversational Agent (Woebot): RCT. JMIR Mental Health 4(2):e19, 2017. mental.jmir.org/2017/2/e19
  10. Inkster B, Sarda S, Subramanian V. An Empathy-Driven, Conversational AI Agent (Wysa) for Digital Mental Well-Being. JMIR mHealth uHealth 6(11):e12106, 2018. mhealth.jmir.org/2018/11/e12106
  11. Laranjo L, Dunn AG, Tong HL, et al. Conversational agents in healthcare: a systematic review. JAMIA 25(9):1248–1258, 2018. academic.oup.com/jamia/article/25/9/1248/5052181
  12. Packer C, Wooders S, Lin K, et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560, 2023. arxiv.org/abs/2310.08560
  13. Wang Q, Fu Y, Cao Y, Wang S, Tian Z, Ding L. Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. arXiv:2308.15022, 2023. arxiv.org/abs/2308.15022
  14. Chhikara P, Khant D, Aryan S, et al. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413, 2025. arxiv.org/pdf/2504.19413
  15. Integrating avatar technology into a telemedicine application in heart failure patients (Sensely "Molly" pilot). 2023. pmc.ncbi.nlm.nih.gov/articles/PMC9894666
C. Takeoff AI