Chartwalker: A Computer-Use Agent and Execution-Graded Benchmark for Electronic Health Record Operation
Twenty clinical workflows. A sandboxed OpenEMR. A Playwright grader that reads the database to score success. The first public computer-use benchmark for an actual EHR.
Abstract Arndt et al.[11] reported in Annals of Family Medicine (2017) that primary-care physicians spend approximately 5.9 hours per clinic day inside the EHR, including roughly 86 minutes of after-hours "pajama time". Sinsky et al.[12] earlier quantified the broader pattern: roughly two hours of EHR work for every hour of direct patient care. Computer-use agents are now technically capable of operating real desktop and web applications, as demonstrated by Anthropic's Computer Use release[1] and a sequence of GUI-agent benchmarks[2][3][4][5]. Yet no public computer-use agent or benchmark operates an actual EHR. Chartwalker fills this gap with a sandboxed OpenEMR instance, a 20-task clinical workflow suite, and a Playwright-based grading harness that reads database state to assert success, following the execution-graded pattern of WebArena[2] and OSWorld[4]. The pass criterion for the Claude Computer Use baseline is ≥ 70% task completion.
§ 1 Introduction
EHR navigation is a clinically meaningful task class. The Arndt and Sinsky findings[11][12] establish the size of the prize — every clinician hour saved on EHR mechanics is an hour returned to patient care or to clinician well-being. Yet despite the obvious importance, public computer-use agent work has focused on consumer web tasks (Mind2Web[7], WebVoyager[5], VisualWebArena[3]) and desktop apps (OSWorld[4]). No benchmark covers EHR operation, and no agent has been shown to complete clinical workflows end-to-end against a real EHR interface.
Chartwalker is the simplest version of that missing artifact. It is a benchmark, not a model. The architecture decisions are deliberate: a sandboxed OpenEMR runs in Docker, twenty workflows are defined with deterministic post-conditions in the EHR's MySQL/MariaDB backend, and the Playwright-based grader asserts those post-conditions independently of the agent's claimed trajectory.
1.1 Contributions
- A sandboxed OpenEMR evaluation harness with reproducible setup and seeded patient data.
- A 20-task benchmark suite covering order entry, chart review, refill workflow, problem-list reconciliation, and result acknowledgment — each task with a deterministic post-condition gold.
- A public leaderboard with baseline results from Claude Computer Use[1], evaluated with per-task completion, per-step accuracy, and grounded-failure-mode classification.
§ 2 Background and Related Work
2.1 Computer-Use Agents
Anthropic's Computer Use release[1] (October 2024) introduced a screenshot-plus-cursor action loop in which the model receives the screen as an image and emits mouse and keyboard actions. Chartwalker is built on this loop. The agent is granted only screen access — it does not query the OpenEMR API directly — because the explicit point of the evaluation is to measure UI-mediated agency, not API-mediated agency.
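A minimal sketch of that loop, using the Anthropic SDK's computer-use beta (October 2024 identifiers). `execute_action` and `take_screenshot` are harness stubs, not SDK calls:

```python
# Screenshot-plus-cursor action loop against the Anthropic computer-use beta.
import anthropic

client = anthropic.Anthropic()
TOOLS = [{
    "type": "computer_20241022",   # beta computer-use tool identifier
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
}]

def execute_action(action: dict) -> None:
    """Harness stub: translate a model action into sandbox mouse/keyboard calls."""
    raise NotImplementedError

def take_screenshot() -> str:
    """Harness stub: return a base64-encoded, SoM-overlaid PNG of the sandbox."""
    raise NotImplementedError

def run_task(task_description: str, max_steps: int = 40) -> list:
    messages = [{"role": "user", "content": task_description}]
    for _ in range(max_steps):
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        messages.append({"role": "assistant", "content": response.content})
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:          # the model stopped issuing actions
            break
        results = []
        for block in tool_uses:
            execute_action(block.input)   # mouse/keyboard action in the sandbox
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image", "source": {
                    "type": "base64", "media_type": "image/png",
                    "data": take_screenshot(),
                }}],
            })
        messages.append({"role": "user", "content": results})
    return messages
```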
2.2 Execution-Graded Web Benchmarks
WebArena[2] (ICLR 2024) established the methodology Chartwalker inherits: sandboxed full-stack web applications with execution-based graders that read backend state after the agent finishes. VisualWebArena[3] extended the framework to multimodal grounding and popularised Set-of-Mark prompting[9]. OSWorld[4] generalised execution grading to full desktop environments. WebVoyager[5] demonstrated end-to-end multimodal agents on live websites; SeeAct[6] decomposed web agency into action generation plus action grounding, a separation that informs our Chartwalker prompt structure.
2.3 Visual Grounding
Set-of-Mark[9] overlays numbered marks on UI widgets so the model can reference elements by ID rather than by raw coordinates. Ferret-UI[10] demonstrates that UI-specific grounding on high-aspect-ratio screens outperforms generic vision-language models. Chartwalker overlays SoM marks on every screenshot before sending it to the agent, an inexpensive intervention with well-documented benefits.
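A minimal sketch of the overlay step, assuming Playwright's sync API and a simple CSS notion of "interactive element" (the selector below is an assumption, not the benchmark's exact heuristic):

```python
# Set-of-Mark overlay: number every interactive element in the screenshot so
# the model can say "click [7]" instead of guessing pixel coordinates.
from io import BytesIO
from PIL import Image, ImageDraw
from playwright.sync_api import Page

INTERACTIVE = "a, button, input, select, textarea, [role='button']"

def som_screenshot(page: Page) -> tuple[bytes, dict[int, dict]]:
    """Return a marked-up PNG plus a mark-id -> bounding-box map."""
    marks: dict[int, dict] = {}
    img = Image.open(BytesIO(page.screenshot())).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, el in enumerate(page.query_selector_all(INTERACTIVE), start=1):
        box = el.bounding_box()
        if box is None:            # detached or invisible element
            continue
        marks[i] = box
        draw.rectangle(
            [box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"]],
            outline="red", width=2,
        )
        draw.text((box["x"] + 2, box["y"] + 2), str(i), fill="red")
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue(), marks
```

The id-to-box map is what lets the harness translate "click [7]" back into a concrete cursor action.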
2.4 Generalist Web Agency
Mind2Web[7] evaluates agents across 137 websites with cross-domain generalisation as the central concern. Chartwalker inherits the cross-domain framing: while the v0.1 benchmark targets OpenEMR specifically, the task taxonomy is portable to other EHRs (OpenMRS, GNU Health), and a v0.2 effort should add at least one additional EHR for true generalisation testing.
2.5 Why an EHR-Specific Benchmark
Existing GUI-agent benchmarks systematically underrepresent the patterns that dominate EHR work: time-bounded ordering, structured-form entry with terminology binding, problem-list reconciliation, refill workflows with administrative steps. AgentBench[8] demonstrated that LLM agent performance varies dramatically by task category — aggregate scores conceal large per-category spreads. Chartwalker's per-category reporting is designed to surface the categories where current agents fail, which is the actionable result for downstream agent designers.
§ 3 Proposed Approach
3.1 Sandbox
OpenEMR (latest stable release) runs in Docker with MariaDB as the backend. A seed script populates the sandbox with 50 synthetic patients drawn from Synthea (see Atrium, Project 01), each with a realistic baseline of conditions, medications, allergies, and recent encounters. The sandbox is reset between every task evaluation so that no task can be confounded by state from a previous task.
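A minimal sketch of the per-task reset, assuming a Docker Compose file and a harness-side seed script (both names illustrative):

```python
# Per-task sandbox reset: tear down the OpenEMR + MariaDB containers, recreate
# them, and re-run the Synthea seed so every task starts from identical state.
import subprocess

def reset_sandbox(compose_file: str = "docker-compose.yml") -> None:
    base = ["docker", "compose", "-f", compose_file]
    subprocess.run(base + ["down", "--volumes"], check=True)      # drop all state
    subprocess.run(base + ["up", "--detach", "--wait"], check=True)  # wait for healthchecks
    # seed_patients.py is a harness script, not part of OpenEMR.
    subprocess.run(["python", "seed_patients.py", "--count", "50"], check=True)
```

A faster variant restores a MariaDB dump of the seeded state instead of recreating the containers; the full teardown is the conservative default.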
3.2 The 20-Task Suite
| Category | Tasks | Example |
|---|---|---|
| Order entry | 5 | Order metformin 500 mg BID for patient P0001. |
| Chart review | 4 | Identify the patient's most recent HbA1c value and add a note acknowledging it. |
| Refill workflow | 3 | Refill atorvastatin for 90 days with two refills; cancel an expired prescription. |
| Problem-list reconciliation | 4 | Mark "Type 2 diabetes — uncontrolled" as resolved; add "Type 2 diabetes — controlled". |
| Result acknowledgment | 4 | Acknowledge a returned BMP, flag potassium as abnormal, message the patient. |
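Each row in the table expands into a machine-readable task specification. A sketch of its shape (field names are ours, not an OpenEMR construct; the SQL placeholder is illustrative):

```python
# Illustrative task specification: one natural-language instruction plus the
# SQL assertions the grader will run after the agent finishes.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    category: str                 # e.g. "order_entry"
    instruction: str              # shown verbatim to the agent
    postconditions: list[str] = field(default_factory=list)  # SQL; each must return a row

EXAMPLE = Task(
    task_id="order-entry-01",
    category="order_entry",
    instruction="Order metformin 500 mg BID for patient P0001.",
    postconditions=[
        "SELECT 1 FROM prescriptions WHERE drug LIKE '%metformin%' /* ... */",
    ],
)
```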
3.3 Execution-Based Grading
Each task has a pre-defined post-condition: a set of assertions over the OpenEMR database that must hold after the agent finishes. For example, "Order metformin 500 mg BID for P0001" asserts that a matching prescription row exists with the right patient, drug, and dose. A minimal sketch of that assertion set follows; the table and column names are assumptions, not verified against the OpenEMR schema:
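```python
# Illustrative post-conditions for "Order metformin 500 mg BID for P0001".
# Table and column names are assumed (the shipped grader binds to the actual
# OpenEMR schema); each query must return at least one row to pass.
POSTCONDITIONS = [
    # 1. A prescription row exists for the right patient and drug.
    """SELECT 1 FROM prescriptions p
         JOIN patient_data pd ON pd.pid = p.patient_id
        WHERE pd.pubpid = 'P0001' AND p.drug LIKE '%metformin%'""",
    # 2. The dose on that row matches the order.
    """SELECT 1 FROM prescriptions p
         JOIN patient_data pd ON pd.pid = p.patient_id
        WHERE pd.pubpid = 'P0001' AND p.drug LIKE '%metformin%'
          AND p.dosage LIKE '%500%'""",
]
```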
The grading harness drives the agent's browser session through Playwright, then connects to the MariaDB backend and asserts each post-condition. The task verdict is binary; partial credit is reported separately as the fraction of post-conditions met.
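A minimal grading sketch, using `pymysql` as a stand-in for whatever MySQL client the harness ships with (connection parameters illustrative):

```python
# Execution-based grading: run each post-condition against the sandbox
# database after the agent's session ends.
import pymysql

def grade(postconditions: list[str]) -> tuple[bool, float]:
    """Return (binary verdict, fraction of post-conditions met)."""
    conn = pymysql.connect(host="localhost", user="openemr",
                           password="openemr", database="openemr")
    try:
        passed = 0
        with conn.cursor() as cur:
            for sql in postconditions:
                cur.execute(sql)
                if cur.fetchone() is not None:
                    passed += 1
        return passed == len(postconditions), passed / len(postconditions)
    finally:
        conn.close()
```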
3.4 Agent Prompting
The agent receives: (i) the task description in natural language, (ii) the initial OpenEMR screenshot with Set-of-Mark[9] overlay, (iii) a short, fixed system prompt describing OpenEMR's navigation idioms. No additional task-specific scaffolding is provided. The architecture is intentionally bare so that the benchmark measures the agent's intrinsic capability rather than the operator's ability to engineer a workflow-specific prompt.
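For concreteness, a sketch of what the fixed system prompt might contain. The wording below is illustrative, including the OpenEMR navigation hints; the released benchmark pins an exact string so every evaluated agent sees identical scaffolding:

```python
# Illustrative fixed system prompt (exact wording is pinned in the release).
SYSTEM_PROMPT = """\
You are operating OpenEMR through its web interface.
- Find patients through the patient finder in the top navigation bar.
- Orders, prescriptions, and problem lists live under the patient's chart tabs.
- Screenshot elements carry numbered red marks; refer to them as [N].
- Declare the task complete only after the EHR confirms the change on screen.
"""
```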
§ 4 Evaluation Protocol
4.1 Metrics
| Metric | Definition | Reporting |
|---|---|---|
| Task completion rate | Binary: all post-conditions met. | Overall + per category. |
| Partial credit | Mean fraction of post-conditions met. | Per category. |
| Step efficiency | Mean actions per completed task. | Per category. |
| Failure-mode classification | One of: misclick, terminology-binding error, navigation loss, premature termination. | Distribution across failed tasks. |
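A minimal aggregation sketch over a run's results; the `TaskResult` record is the harness's own format (illustrative), not a standard API:

```python
# Aggregate the four metrics in the table above from per-task results.
from collections import Counter
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    category: str
    completed: bool           # all post-conditions met
    partial: float            # fraction of post-conditions met
    steps: int                # actions emitted by the agent
    failure_mode: str | None  # None when completed

def report(results: list[TaskResult]) -> dict:
    done = [r for r in results if r.completed]
    return {
        "task_completion_rate": len(done) / len(results),
        "partial_credit": mean(r.partial for r in results),
        "step_efficiency": mean(r.steps for r in done) if done else None,
        "failure_modes": Counter(r.failure_mode for r in results if not r.completed),
    }
```

The same function runs per category to produce the per-category columns.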
4.2 Failure-Mode Audit
Every failed task is classified into one of four primary failure modes, drawn from the GUI-agent literature. Misclick: action grounded to the wrong widget, often a near-neighbour in the SoM overlay. Terminology-binding error: the model selected a clinically incorrect code (e.g., metformin extended-release versus immediate-release). Navigation loss: the model entered a screen from which it could not return. Premature termination: the model declared the task complete before all post-conditions held. Reporting this distribution is the practically useful output for downstream agent designers.
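To keep leaderboard submissions comparable, the four modes form a closed vocabulary; a sketch (labels are ours):

```python
# Closed failure-mode vocabulary for Section 4.2 classification.
from enum import Enum

class FailureMode(Enum):
    MISCLICK = "misclick"                            # grounded to the wrong widget
    TERMINOLOGY_BINDING = "terminology_binding"      # clinically incorrect code selected
    NAVIGATION_LOSS = "navigation_loss"              # stuck on an unrecoverable screen
    PREMATURE_TERMINATION = "premature_termination"  # declared done before post-conditions held
```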
§ 5 Expected Contributions
- Benchmark. The first public, reproducible computer-use benchmark for an actual EHR.
- Failure-mode taxonomy. A classified failure-mode distribution from a frontier computer-use agent on clinical workflows.
- Leaderboard. Initial cross-agent results (Claude Computer Use and any subsequent computer-use systems) with quarterly re-evaluation.
§ 6 Limitations and Risks
OpenEMR is open-source and stylistically distinct from production EHRs like Epic and Cerner; results on Chartwalker do not directly transfer to those systems. The architectural pattern — sandbox plus post-condition grader — does, but a v0.2 effort should add at least one additional EHR target. The Set-of-Mark[9] overlay is also an artifact of evaluation that production systems would not have; agents that depend heavily on SoM marks may underperform on raw EHR screens.
A separate risk: writing prescriptions into a sandboxed EHR is safe, but an agent that can write them reliably is a precursor to systems that could write them into production EHRs. Chartwalker's release is explicitly limited to the sandbox configuration and includes guidance against repointing the agent at production systems without institutional QA, in keeping with the broader literature on clinical LLM evaluation gaps.
§ 7 Conclusion
Chartwalker takes the well-established sandbox-plus-execution-grader pattern from WebArena[2] and OSWorld[4], applies it to the place clinicians actually spend their working day, and exposes a leaderboard that surfaces the per-category failure modes that aggregate scores conceal. It is the simplest version of a benchmark that should exist and currently does not.
References
- [1] Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and a new Claude 3.5 Haiku. 2024. anthropic.com/news/3-5-models-and-computer-use
- [2] Zhou S, Xu FF, Zhu H, et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR, 2024. arxiv.org/abs/2307.13854
- [3] Koh JY, Lo R, Jang L, et al. VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks. ACL, 2024. arxiv.org/abs/2401.13649
- [4] Xie T, Zhang D, Chen J, et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS, 2024. arxiv.org/abs/2404.07972
- [5] He H, Yao W, Ma K, et al. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL, 2024. arxiv.org/abs/2401.13919
- [6] Zheng B, Gou B, Kil J, Sun H, Su Y. GPT-4V(ision) is a Generalist Web Agent, if Grounded (SeeAct). ICML, 2024. arxiv.org/abs/2401.01614
- [7] Deng X, Gu Y, Zheng B, et al. Mind2Web: Towards a Generalist Agent for the Web. NeurIPS Spotlight, 2023. arxiv.org/abs/2306.06070
- [8] Liu X, Yu H, Zhang H, et al. AgentBench: Evaluating LLMs as Agents. ICLR, 2024. arxiv.org/abs/2308.03688
- [9] Yang J, Zhang H, Li F, et al. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. 2023. arxiv.org/abs/2310.11441
- [10] You H, Zhang H, Gan Z, et al. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. ECCV, 2024. arxiv.org/abs/2404.05719
- [11] Arndt BG, Beasley JW, Watkinson MD, et al. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations. Annals of Family Medicine, 15(5):419–426, 2017. annfammed.org/content/15/5/419
- [12] Sinsky C, Colligan L, Li L, et al. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Annals of Internal Medicine, 165(11):753–760, 2016. acpjournals.org/doi/10.7326/M16-0961