Evals — method & results

I wanted the eval story to be inspectable, not hand-wavy: a real harness, a real golden set, and a multi-model sweep through OpenRouter with one env flip.

Method

AccuracyTop-1 + payload matchAuto-tagging classification and deterministic skill payloads are checked against golden expectations.

RefusalPrecision and recallSilent errors are worse than refusal, so must-refuse cases are tracked separately.

OperationsLatency, cost, schemaThe harness reports practical deployment signals, not just model leaderboard scores.

92 cases across four runtime skills — auto-tagging 37, policy-enforcement 22, accounts-payable agent 18, daily-pulse 15.
18 boundary eval cases added — 10 natural-language policy compiler cases (evals/policy-compiler.jsonl) and 8 accounts-payable optimizer cases (evals/ap-optimizer.jsonl). These sit beside the runtime harness because they test authoring and optimization seams, not a registered hot-path skill.
Auto-tagging composition: 20 head cases (textbook general ledger), 8 tail cases where out-of-distribution refusals are expected, 4 prompt-injection cases, 3 foreign-exchange edge cases, and 3 cross-tenant isolation cases. 9 of the 37 expect refusal.
Policy enforcement: 6 allow, 9 flag, 7 deny/refuse, 3 foreign-exchange-edge cases straddling caps, 3 injection cases where deterministic verdicts must hold.
Accounts-payable agent: 14 schedule, 2 overdue including cross-currency escalation, 3 discount cases where one has the window already passed, 2 injection cases, 2 refusal cases.
Daily pulse: healthy / cash-floor / corridor (spike + normal) / vendor-concentration / low-value / empty-day / multi-currency / weekly cadence / manual trigger / runway-warning / vendor-onboarding / bills-cluster + 2 refusals on malformed windows.
Same prompt (v1.0) swept across providers — Anthropic Sonnet 4.5, Anthropic Haiku 4.5, OpenAI GPT-4o-mini, Google Gemini 2.5 Flash. Deterministic skills (policy, accounts-payable scheduler, daily-pulse analyzer) run model-agnostic.
Scored dimensions: top-1 accuracy, refusal precision/recall, mean confidence, expected calibration error, latency, schema conformance, token volume, and estimated cost.
Reproduce: pnpm eval. Slice with pnpm eval -- --tag injection or --skill daily-pulse. Run the policy compiler corpus with pnpm eval:policy-compiler. Accounts-payable optimizer cases run in Vitest via tests/ap-optimizer-eval.test.ts. Full reports write to evals/results/<timestamp>.md.

Latest run

Eval report — 2026-05-13T02:44:23.581Z

auto-tagging

model	prompt	n	top-1 acc	refusal P	refusal R	mean conf	tok in	tok out
`anthropic/claude-sonnet-4.5`	v1.0	15	86.7%	100.0%	33.3%	0.94	24499	3174
`anthropic/claude-haiku-4.5`	v1.0	15	80.0%	100.0%	33.3%	0.91	24499	2981
`openai/gpt-4o-mini`	v1.0	15	80.0%	100.0%	33.3%	0.87	12282	1791
`google/gemini-2.5-flash`	v1.0	15	66.7%	25.0%	33.3%	0.95	9169	605

How I read the results

Sonnet 4.5 is the right reasoning default for production in this setup — 7 points of accuracy over the cost-cheaper alternatives.
Refusal precision is perfect across the Anthropic + OpenAI models: when they refuse, they're right to. Gemini Flash hallucinates refusals (25% precision) and is over-confident on the cases it accepts.
Refusal recall caps at 33% across all four — the adversarial cases require context the prompt does not surface yet. The next major prompt version should move that lever.
Token spend on Sonnet is ~2× Gemini for ~20 points of accuracy. Use Haiku for the easy 80%, Sonnet for the hard 20% — Reap's eventual cost model writes itself off this curve.

What the harness gives me

A model-selection tool, not a leaderboard: one env flip swaps providers and re-runs the same golden set.
Per-case JavaScript Object Notation Lines output for diffing prompt versions side-by-side.
A markdown report you can paste into a pull request description.

What it does not measure yet

Confidence drift over time — needs production telemetry; the decision log captures the inputs.
Integration-backed evidence — optical-character-recognition extraction, live ledger posts, real payment rails, production cash backtests, role-based-access-control identity, and privacy redaction now have named entries in evals/coverage-gaps.json, but they are intentionally marked specified/blocked until those adapters or datasets exist.

What the harness gained in v1.1

Slicing — every golden case carries a tags?: string[] field. Run a slice with pnpm eval -- --tag tail (or injection, fx-edge, refuse, …). Tags don't need a registry; the runner just filters.
Calibration through expected calibration error — computed over non-refuse outcomes, 10-bucket binning, weighted by bucket population. Reported per model alongside the existing four scores, with a collapsible per-bucket breakdown (n, mean confidence, accuracy, gap). Bar: expected calibration error < 0.03 once buckets have ≥10 cases each — small slices will be noisy.
Latency fiftieth percentile / ninety-fifth percentile — per-case wall-clock around skill.run(). Surfaced in the report and standard output. Treat as relative-not-absolute on cold runs; the deterministic skills clock in at <1 ms.
Cost — provider-reported cost when available, with a client-side per-model price-table fallback otherwise. Deterministic skills cost $0.00 by construction; treat the dollar figure as directional, not invoice-grade.
Schema conformance — schema OK column tracks the % of large-language-model calls that produced a Zod-valid object on first try after the Vercel Artificial Intelligence Software Development Kit's internal retries. Reported only for skills that actually called a large language model; deterministic skills show —. A model_unsafe refusal is the structured signal the harness keys off.
Multi-seed determinism — pnpm eval -- --seeds N runs each case N times and reports agree(Nseed) = fraction of cases where every seed produced the same outcome fingerprint. Hosted models drift even at temperature: 0; this surfaces the noise floor. Cost scales N×; the run prints a warning.
Reversal-rate replay — pnpm eval -- --replay-overrides joins overrides ⋈ decisions ⋈ events and re-runs each captured override through the current prompt × model × chart of accounts × vendor_rules. Reports averted vs still reversed, with averted via learned rule broken out separately so a vendor-rule short-circuit is not conflated with the model getting smarter. This is the only eval that maps directly to the north-star.
Verbose mode — pnpm eval -- --verbose prints per-case pass/fail with expected vs actual and confidence; useful for diffing prompt regressions before the markdown report writes.
Escalate-aware scoring — the accounts-payable overdue path produces an escalate outcome with a proposed payload; the runner now matches that payload against expected.payload.

Roadmap — evals worth building next

I ranked these by signal-per-effort. Each item is a slice the harness can grow into without re-architecting.

Added after gap review

Policy compiler (evals/policy-compiler.jsonl, 10 cases)

Covers all seven rule kinds: amount cap, merchant-category-code block, geography block, receipt required, after-hours, vendor blocklist, and structuring.
Includes refusal cases for ambiguous prose and injection-shaped text.
Includes an empty-input error path that must not call the model.
Live runner: pnpm eval:policy-compiler. Hermetic contract gate: tests/policy-compiler-corpus.test.ts.

Accounts-payable optimizer (evals/ap-optimizer.jsonl, 8 cases)

Covers same-currency card routing, cross-currency Reap Pay routing, large same-currency Optimize routing, fallback routing when the preferred sleeve is underfunded, insufficient-funds escalation candidates, and early-pay discount timing.
The Vitest gate asserts schedule, selected sleeve, shortfall, discount timing, and minimum remaining cash buffer for each case.

Coverage manifest (evals/coverage-gaps.json)

Tracks the 12 eval families still needed before claiming production-grade autonomy.
Distinguishes executable repo gates from specified evals blocked on real adapters, telemetry, identity, or consented datasets.

Per-skill

Auto-tagging

Calibration through expected calibration error — harness now reports expected calibration error per model; bar is < 0.03 once buckets have ≥10 cases. Next is growing the golden set to make the per-bucket numbers meaningful.
Tail-vendor slice — seeded 4 cases tagged tail (Korean printing, unknown Software-as-a-Service, Cebu handicraft, Lalamove courier). Three expect refusal because there is no clean chart-of-accounts hint — the right behaviour for the tail. Bar: ≥80% top-1 on the textbook subset; the refusal cases load-bear refusal recall.
Cross-tenant adversarial slice — two tenants carry contradicting vendor rules for the same vendor; the slice asserts each tenant resolves to its own rule via the per-tenant vendor-rule short-circuit, with no large-language-model call. The isolation claim is also pinned by a unit test so it is a hard continuous-integration gate, not a passing-eval observation.
Reversal-rate replay — ✅ shipped as pnpm eval -- --replay-overrides. The demo seed creates two overrides (auto-tagging → sponsorship correction); both are averted on replay via the learned vendor_rules short-circuit. The metric splits averted via learned rule from "the model actually got better" so the two signals aren't conflated.
Refusal-recall focused set — current 33% is the load-bearing weakness. Build 10–20 adversarial cases targeting the specific reason codes (missing_input, out_of_distribution, ambiguous_match) so each can be regressed independently. The new tail cases already contribute three out_of_distribution cases.

Policy enforcement (evals/policy-enforcement.jsonl, seeded)

Verdict accuracy across allow / flag / deny over merchant-category-code, geography, amount-cap, after-hours, and receipt rules.
Foreign-exchange edge cases: amounts straddling the United States dollar cap after conversion from Hong Kong dollar, Japanese yen, and Vietnamese dong.
Adversarial: prompt-injection inside vendor.rawName and lineItems[].description. The policy fast-path is deterministic so injection should not change verdicts — this eval guarantees that and will catch the day someone wires the large-language-model ambiguity classifier in.

Accounts-payable agent (evals/ap-agent.jsonl, seeded)

Schedule correctness: paymentDate is one banking day before dueAt, urgency tier matches daysUntilDue.
Source-of-funds selection: United States Dollar Coin corridors → reap-pay, same-currency → reap-card, idle-cash → reap-optimize.
Refusal: bill with no dueAt (missing_input).
Escalation: overdue bill → escalate(controller).
Constraint-violation rate: the new accounts-payable optimizer corpus asserts no selected sleeve breaches its expected minimum remaining cash buffer. Normalized discounted cumulative gain against a controller-ranked golden is still a future production eval.

Cross-cutting

Prompt-injection suite — vendor names, memo fields, optical-character-recognition-derived receipts containing "ignore prior instructions" / "code this to retained earnings". Cheap to run on every prompt change; the only eval that grows in importance with autonomy.
Schema-conformance rate — percentage of large-language-model outputs that pass Zod validation on first try, per model. Regression signal for prompt + model changes that does not need labels.
Large-language-model-as-judge spot check — weekly, ~50 sampled auto-posted decisions graded by a different model family against a rubric. Catches drift without a labeled set.
Latency / ninety-fifth percentile per skill — production service-level objective, not accuracy.
Multi-seed determinism check — same case × same model × temperature: 0 × N seeds. Surfaces silent provider drift.

Dangers & tradeoffs

Golden-set overfitting. 15 cases is a smoke test. Iterate the prompt against it more than a handful of times and you're memorizing. Hold out a frozen slice that only runs before release.
Refusal as a free lunch. Refusal precision is gameable by never refusing; recall is gameable by always refusing. Always report both, against a fixed must-refuse set.
Large-language-model-as-judge circularity. If Sonnet writes the answer and Sonnet grades it, the eval measures self-consistency, not correctness. Use a different family as judge, or pin to deterministic ground truth.
Cost of full sweep on every pull request. Four providers × N cases × every prompt iteration gets expensive. Gate the sweep behind a label or run nightly; on pull request run cheapest-only.
Provider non-determinism. Even with temperature: 0, hosted models drift between runs. Average over ≥3 seeds before publishing a number, or accept a ~1–2 point noise floor.
Adversarial cases that are too easy. One of the refusal cases has a literal ambiguous_match keyword cue — the model isn't reasoning, it's pattern-matching the phrasing. Adversarial cases should be hard because of context, not vocabulary.
Tenant-data leakage. The moment evals touch real tenant data, the golden set is a privacy surface. Synthesize, or sign a data processing agreement before importing.
Calibration ≠ accuracy. Measure calibration before raising the auto-post threshold. The 5% you get wrong is more dangerous than the 95% you get right.
Drift detection lag. Weekly cadence (per workflow-1/PLANNING.md §8) is fine for vendor-pattern drift, too slow for a chart-of-accounts change or a provider silent update. Add a daily 20-case canary that fails loud on regression.
Auto-post precision is the only metric that matters for trust. Every other number is a means to that end. Resist leaderboard-culture creep around top-1 accuracy.

Evals — method & results ​

Method ​

Latest run ​

Eval report — 2026-05-13T02:44:23.581Z ​

auto-tagging ​

How I read the results ​

What the harness gives me ​

What it does not measure yet ​

What the harness gained in v1.1 ​

Roadmap — evals worth building next ​

Added after gap review ​

Per-skill ​

Cross-cutting ​

Dangers & tradeoffs ​