# Agent eval suite
dAvePi’s positioning — that agents build apps on it — is testable. A harness in eval/ runs Claude against 10 prompts of escalating complexity, scoring each with a programmatic check. The eval runs nightly, commits the pass-rate back to main, and feeds both the README badge and this page.
## Latest pass-rate

The badge above is rendered from eval/results/badge.json on main. The nightly workflow commits that file (plus the per-prompt breakdown in eval/results/latest.json) after every run, so this number is at most ~24 hours behind the latest commit.
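The badge itself is read from that file through a shields.io endpoint badge (see Reading the score below). As a sketch of what the workflow likely writes, the keys below follow the standard shields.io endpoint schema, while the label, message, and color values are purely illustrative:

```js
// Illustrative values only; the keys are the standard shields.io
// endpoint-badge schema (schemaVersion, label, message, color).
const badge = {
  schemaVersion: 1,
  label: 'agent eval',
  message: '10/10',
  color: 'brightgreen',
};
// e.g. fs.writeFileSync('eval/results/badge.json', JSON.stringify(badge, null, 2));
```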
## What the prompts cover

Each is a single, concrete change against a tiny task-tracker project. They build on each other — prompt 3’s starting state is the deterministic outcome of prompts 1 and 2.
| # | Prompt | What it tests |
|---|---|---|
| 01 | Add title (required) + done (default false) fields to task. | Basic field declaration. |
| 02 | Add a status state machine: todo → in_progress → done. | State machines with terminal states. |
| 03 | Add a parent belongsTo relation from task to project. | Relations + foreign-key naming. |
| 04 | Add a computed displayLabel field joining title and status. | computed: field functions. |
| 05 | Add an aggregation countByStatus. | Aggregation declaration with $group. |
| 06 | Restrict DELETE to admins via acl.delete. | Document-level ACL. |
| 07 | Add a file field attachment (PDFs only, 1 MB max). | type: 'File' + accept / maxBytes config. |
| 08 | Mark title + a new description field as searchable: true. | Full-text search opt-in. |
| 09 | Add internalNotes field readable only by admins. | Field-level ACL (read), without restricting writes. |
| 10 | Add a second aggregation countByProject, sorted by count. | Multiple aggregations + sort stages. |
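To make the table concrete, here is a rough sketch of the kind of schema a solved prompt 01 and 02 might leave behind. The actual dAvePi schema syntax is not reproduced on this page, so every field spelling below (type, required, default, states, transitions) is an illustrative assumption, not the framework’s documented vocabulary.

```js
// Hypothetical schema fragment: field spellings are assumptions, not dAvePi's
// documented API. Shown only to anchor what prompts 01 and 02 ask for.
module.exports = {
  name: 'task',
  fields: {
    title: { type: 'String', required: true },   // prompt 01
    done: { type: 'Boolean', default: false },   // prompt 01
    status: {                                     // prompt 02
      type: 'State',
      states: ['todo', 'in_progress', 'done'],
      transitions: { todo: ['in_progress'], in_progress: ['done'] }, // 'done' is terminal
    },
  },
};
```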
## How a prompt is scored

Each prompt directory contains three files:

- prompt.md — the natural-language task the agent receives.
- apply.js — a deterministic reference implementation. The harness uses it to seed the starting state for downstream prompts (so prompt 5 starts from prompts 1-4 already done correctly), and the self-tests use it to prove the check / apply pair is internally consistent (a perfect agent would always pass).
- check.js — a function (projectRoot) => { ok, message }. It require()s the agent’s output (the schema file) and asserts on the resulting object. Returns ok: false with a useful message on any deviation.
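As a sketch, a check.js for prompt 01 could look like the following. It assumes the agent’s output lands in a schemas/task.js module that exports an object with a fields map; the real checks may locate and inspect the schema differently.

```js
// Hypothetical check for prompt 01. The schema path and object shape are
// assumptions; only the (projectRoot) => { ok, message } contract comes
// from the description above.
const path = require('path');

module.exports = function check(projectRoot) {
  const schema = require(path.join(projectRoot, 'schemas', 'task.js'));
  const fields = schema.fields || {};

  if (!fields.title || fields.title.required !== true) {
    return { ok: false, message: 'expected a required title field on task' };
  }
  if (!fields.done || fields.done.default !== false) {
    return { ok: false, message: 'expected a done field defaulting to false' };
  }
  return { ok: true, message: 'all assertions passed' };
};
```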
The harness invokes the agent with a small tool surface (read_file, write_file, list_directory) scoped to the project directory. A system prompt seeded from the project’s own agent.md tells Claude the framework’s conventions — the same content that ships in every scaffolded project, so the eval exercises the same path a real user’s agent would take.
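In Anthropic Messages API terms, that tool surface amounts to three tool definitions along the lines of the sketch below. The tool names come from this page; the descriptions, input schemas, and the commented request are assumptions about how the harness is wired.

```js
// Sketch of the harness's tool surface in the Anthropic Messages API format.
// Only the three tool names are taken from the docs; everything else here
// (descriptions, schemas, the commented request) is an illustrative guess.
const tools = [
  {
    name: 'read_file',
    description: 'Read a file inside the scaffolded project directory.',
    input_schema: {
      type: 'object',
      properties: { path: { type: 'string' } },
      required: ['path'],
    },
  },
  {
    name: 'write_file',
    description: 'Create or overwrite a file inside the project directory.',
    input_schema: {
      type: 'object',
      properties: { path: { type: 'string' }, content: { type: 'string' } },
      required: ['path', 'content'],
    },
  },
  {
    name: 'list_directory',
    description: 'List the entries of a directory inside the project.',
    input_schema: {
      type: 'object',
      properties: { path: { type: 'string' } },
    },
  },
];

// The system prompt is seeded from the project's own agent.md:
// const response = await client.messages.create({
//   model: MODEL_ID, // whichever mid-tier Claude model the harness targets
//   max_tokens: 4096,
//   system: fs.readFileSync(path.join(projectRoot, 'agent.md'), 'utf8'),
//   tools,
//   messages,
// });
```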
## Reading the score

The latest run is committed to eval/results/latest.json and rendered on the README via shields.io pointed at eval/results/badge.json.

Each result includes:

- name — the prompt directory name.
- passed — boolean.
- message — what the check found (or "all assertions passed").
- durationMs — wall clock of the agent’s turn-loop for that prompt.
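For illustration, one entry in latest.json might look like the object below; only the four keys are documented here, the values are made up.

```js
// Made-up example entry; only the four keys are documented above.
const entry = {
  name: '03-belongs-to-relation', // hypothetical prompt directory name
  passed: true,
  message: 'all assertions passed',
  durationMs: 48211,
};
```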
A failed prompt doesn’t fail the workflow — the eval keeps running through the remaining prompts and publishes the partial-pass result. The exit code is captured as a warning, not a hard failure, so the badge updates regardless.
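A rough outline of that keep-going behavior, with runPrompt and writeResults as hypothetical stand-ins for the harness’s real internals:

```js
// Outline only: runPrompt and writeResults are hypothetical stand-ins.
// The point is that a failed check is recorded, not thrown, so every
// prompt runs and the partial pass-rate still gets published.
async function runAll(prompts) {
  const results = [];

  for (const prompt of prompts) {
    const started = Date.now();
    let passed = false;
    let message;
    try {
      ({ ok: passed, message } = await runPrompt(prompt)); // hypothetical helper
    } catch (err) {
      message = err.message; // a crash counts as a failed prompt, not an abort
    }
    results.push({ name: prompt.name, passed, message, durationMs: Date.now() - started });
  }

  await writeResults(results); // hypothetical helper: writes latest.json and badge.json
  // Non-zero exit means "some prompts failed"; the workflow treats it as a warning.
  process.exitCode = results.every((r) => r.passed) ? 0 : 1;
}
```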
## Cost and cadence

The eval runs nightly at 06:00 UTC. Each full run is 10 prompts × roughly 5-15 model turns each, so ~50-150 messages against a mid-tier Claude model. Cost per run is in the dollars-not-tens range, but the budget is the maintainer’s ANTHROPIC_API_KEY — we don’t publish the bill.
To trigger an ad-hoc run after a meaningful change (new framework feature, agent.md tweak, etc.), use the workflow’s manual workflow_dispatch trigger from the Actions tab.
## What this is and isn’t

**Is.** A regression guard against changes to the framework that break the agent experience — schema vocabulary churn, missing fields in _describe, agent.md drift. If a refactor lands and the eval drops from 10/10 to 7/10, the next morning’s commit shows exactly which prompts regressed.

**Isn’t.** A benchmark of model intelligence. Different Claude versions will score differently, but that’s a side effect — the goal is to detect framework regressions, not to rank models.

**Isn’t.** Exhaustive. The 10 prompts cover the common feature surface, not every edge case. The eval is a smoke test for the agent-first story, not a substitute for the framework’s unit tests.
## Running it locally

```bash
cd eval
npm install
ANTHROPIC_API_KEY=sk-ant-... node bin/run-eval.js

# Or a single prompt:
ANTHROPIC_API_KEY=sk-ant-... node bin/run-eval.js --prompt 03

# Or with the stub agent (replays a fixture; useful for testing the
# harness itself without spending tokens):
EVAL_AGENT=stub node bin/run-eval.js --prompt 01
```

Setting EVAL_KEEP_TEMP=1 preserves the scratch directory after each prompt so you can diff the agent’s output against the canonical apply.js.
## See also

- The harness source on GitHub.
- Concepts → Why agents come first — the design rationale this eval exists to validate.
- The nightly workflow.