Agent eval suite

dAvePi’s positioning (that agents build apps on it) is testable. A harness in eval/ runs Claude against 10 prompts of escalating complexity and scores each with a programmatic check. The eval runs nightly; the resulting pass rate is committed back to main and feeds both the README badge and this page.

[Badge: agent eval pass rate]

The badge above is rendered from eval/results/badge.json on main. The nightly workflow commits that file (plus the per-prompt breakdown in eval/results/latest.json) after every run, so this number is at most ~24 hours behind the latest commit.
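shields.io reads JSON like this through its endpoint-badge feature, so badge.json plausibly follows that schema (the label and message values below are illustrative, not a real run):

```json
{
  "schemaVersion": 1,
  "label": "agent eval",
  "message": "10/10",
  "color": "brightgreen"
}
```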

Each prompt is a single, concrete change against a tiny task-tracker project. The prompts build on each other: prompt 3’s starting state is the deterministic outcome of prompts 1 and 2.

| # | Prompt | Tests |
| --- | --- | --- |
| 01 | Add title (required) + done (default false) fields to task. | Basic field declaration. |
| 02 | Add a status state machine: todo → in_progress → done. | State machines with terminal states. |
| 03 | Add a parent belongsTo relation from task to project. | Relations + foreign-key naming. |
| 04 | Add a computed displayLabel field joining title and status. | computed: field functions. |
| 05 | Add an aggregation countByStatus. | Aggregation declaration with $group. |
| 06 | Restrict DELETE to admins via acl.delete. | Document-level ACL. |
| 07 | Add a file field attachment (PDFs only, 1 MB max). | type: 'File' + accept / maxBytes config. |
| 08 | Mark title + a new description field as searchable: true. | Full-text search opt-in. |
| 09 | Add internalNotes field readable only by admins. | Field-level ACL (read), without restricting writes. |
| 10 | Add a second aggregation countByProject, sorted by count. | Multiple aggregations + sort stages. |

Each prompt directory contains three files:

  • prompt.md — the natural-language task the agent receives.
  • apply.js — a deterministic reference implementation. The harness uses it to seed the starting state for downstream prompts (so prompt 5 starts from prompts 1-4 already done correctly), and the self-tests use it to prove the check / apply pair is internally consistent (a perfect agent would always pass).
  • check.js — a function (projectRoot) => { ok, message }. It require()s the agent’s output (the schema file) and asserts on the resulting object. Returns ok: false with a useful message on any deviation; a sketch follows below.
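Here is a hypothetical check.js for prompt 01. Only the (projectRoot) => { ok, message } contract comes from the harness; the schema/task.js path and the field-object shape are assumptions about the framework, not its documented layout:

```js
// Hypothetical check.js for prompt 01 (title required, done defaults to false).
// The schema path and field shape below are assumptions.
const path = require('path');

module.exports = function check(projectRoot) {
  let task;
  try {
    const schemaPath = path.join(projectRoot, 'schema', 'task.js'); // assumed layout
    delete require.cache[require.resolve(schemaPath)]; // re-read the agent's output
    task = require(schemaPath);
  } catch (err) {
    return { ok: false, message: `could not load schema: ${err.message}` };
  }

  if (!task.fields || !task.fields.title) {
    return { ok: false, message: 'missing field: title' };
  }
  if (task.fields.title.required !== true) {
    return { ok: false, message: 'title is not marked required' };
  }
  if (!task.fields.done || task.fields.done.default !== false) {
    return { ok: false, message: 'done does not default to false' };
  }
  return { ok: true, message: 'all assertions passed' };
};
```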

The harness invokes the agent with a small tool surface (read_file, write_file, list_directory) scoped to the project directory. A system prompt seeded from the project’s own agent.md tells Claude the framework’s conventions — same content that ships in every scaffolded project, so the eval tests the same path a real user would have.
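The tool definitions are plausibly the standard Anthropic tool-use shape; a sketch (the three names come from the prose above, but the descriptions and JSON-schema details are assumptions):

```js
// Sketch of the tool surface handed to the model, in Anthropic's
// tool-use format. Descriptions and schemas are assumptions.
const tools = [
  {
    name: 'read_file',
    description: 'Read a file inside the project directory.',
    input_schema: {
      type: 'object',
      properties: {
        path: { type: 'string', description: 'Path relative to the project root.' },
      },
      required: ['path'],
    },
  },
  {
    name: 'write_file',
    description: 'Create or overwrite a file inside the project directory.',
    input_schema: {
      type: 'object',
      properties: {
        path: { type: 'string' },
        content: { type: 'string' },
      },
      required: ['path', 'content'],
    },
  },
  {
    name: 'list_directory',
    description: 'List the entries of a directory inside the project.',
    input_schema: {
      type: 'object',
      properties: {
        path: { type: 'string' },
      },
      required: ['path'],
    },
  },
];
```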

The latest run is committed to eval/results/latest.json and rendered on the README via shields.io pointed at eval/results/badge.json.

Each result includes:

  • name — the prompt directory name.
  • passed — boolean.
  • message — what the check found (or "all assertions passed").
  • durationMs — wall-clock time of the agent’s turn loop for that prompt.
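Put together, a single entry in latest.json has roughly this shape (the directory name and duration below are invented for illustration):

```json
{
  "name": "03-belongs-to-relation",
  "passed": true,
  "message": "all assertions passed",
  "durationMs": 48210
}
```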

A failed prompt doesn’t fail the workflow — the eval keeps running through the remaining prompts and publishes the partial-pass result. The exit code is captured as a warning, not a hard failure, so the badge updates regardless.
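In harness terms that tolerance is just a per-prompt try/catch; a sketch, where runAgent and the prompt object are hypothetical stand-ins for whatever bin/run-eval.js actually does:

```js
// Illustrative only; the real bin/run-eval.js may be structured differently.
async function runAll(prompts) {
  const results = [];
  for (const prompt of prompts) {
    const started = Date.now();
    let outcome;
    try {
      await runAgent(prompt); // let the model take its turns (hypothetical helper)
      outcome = prompt.check(prompt.projectRoot); // (projectRoot) => { ok, message }
    } catch (err) {
      // A crash counts as a failed prompt, not a dead run.
      outcome = { ok: false, message: err.message };
    }
    results.push({
      name: prompt.name,
      passed: outcome.ok,
      message: outcome.message,
      durationMs: Date.now() - started,
    });
  }
  return results; // written out even on partial passes, so the badge always updates
}
```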

The eval runs nightly at 06:00 UTC. Each full run is 10 prompts × roughly 5-15 model turns each, so ~50-150 messages against a mid-tier Claude model. Cost per run is in the single dollars, not the tens; the budget is the maintainer’s ANTHROPIC_API_KEY, and we don’t publish the bill.

To trigger an ad-hoc run after a meaningful change (new framework feature, agent.md tweak, etc.), the workflow also has a workflow_dispatch trigger, so it can be started manually from the Actions tab.
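In GitHub Actions terms, the schedule and the manual trigger together presumably look something like this (a sketch, not the committed workflow file):

```yaml
on:
  schedule:
    - cron: '0 6 * * *'   # nightly at 06:00 UTC
  workflow_dispatch:      # manual runs from the Actions tab
```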

Is. A regression guard against changes to the framework that break the agent experience — schema vocabulary churn, missing fields in _describe, agent.md drift. If a refactor lands and the eval drops from 10/10 to 7/10, the next morning’s commit shows exactly which prompts regressed.

Isn’t. A benchmark of model intelligence. Different Claude versions will score differently, but that’s a side effect — the goal is to detect framework regressions, not to rank models.

Isn’t. Exhaustive. The 10 prompts cover the common feature surface, not every edge case. The eval is a smoke test for the agent-first story, not a substitute for the framework’s unit tests.

```sh
cd eval
npm install
ANTHROPIC_API_KEY=sk-ant-... node bin/run-eval.js

# Or a single prompt:
ANTHROPIC_API_KEY=sk-ant-... node bin/run-eval.js --prompt 03

# Or with the stub agent (replays a fixture; useful for testing the
# harness itself without spending tokens):
EVAL_AGENT=stub node bin/run-eval.js --prompt 01
```

Setting EVAL_KEEP_TEMP=1 preserves the scratch directory after each prompt so you can diff the agent’s output against what the canonical apply.js would have produced.
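For example (the preserved path is whatever the harness prints; the placeholders below are not real paths):

```sh
EVAL_KEEP_TEMP=1 node bin/run-eval.js --prompt 04
# then compare by hand, e.g.:
# diff -r <preserved-scratch-dir> <dir-built-by-apply.js>
```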