Solution narrative
This is the full design story: how I made the system think, what I built end to end, where I left honest seams, and which trade-offs I chose under the timebox.
TL;DR
A unified agentic platform for finance ops. Auto-tagging is built end-to-end. Policy enforcement, the Accounts Payable agent, and a scheduled Daily CFO Pulse run on the same platform — shared runtime, audit, autonomy ladder, and override learning loop. The framing: three reactive skills sit on one platform, and a proactive CFO layer now speaks first.
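A minimal sketch of how the autonomy ladder could gate a skill's output, using the four outcomes this narrative names (apply, suggest, refuse, escalate). The numeric rungs, their meanings, the `lowRisk` flag, and the confidence floor are all illustrative assumptions; the source only says that the ladder has four rungs.

```typescript
// The four outcomes named in the narrative.
type Outcome = "apply" | "suggest" | "refuse" | "escalate";

// Hypothetical rung meanings (the source does not name the rungs):
// 0 and 1 never auto-apply, 2 auto-applies low-risk items, 3 auto-applies any confident item.
type Rung = 0 | 1 | 2 | 3;

interface Proposal {
  outcome: Outcome;
  confidence: number; // 0..1
  lowRisk: boolean; // assumption: each skill marks its own proposals
}

// Clamp what the skill wants to do to what the tenant's rung allows.
// Refusals, escalations, and suggestions pass through; only "apply" can be downgraded.
function gate(p: Proposal, rung: Rung, minConfidence = 0.9): Outcome {
  if (p.outcome !== "apply") return p.outcome;
  if (p.confidence < minConfidence) return "suggest";
  if (rung === 3) return "apply";
  if (rung === 2 && p.lowRisk) return "apply";
  return "suggest";
}
```

The point of the shape is that a tenant moving up a rung changes only the `rung` argument; the skill's own logic stays the same.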
The honest positioning: this is not yet a full autonomous CFO. It is the control plane and the first four finance skills that make an AI CFO credible: a typed financial event substrate, bounded cognition, proactive review surfaces, a cash-risk view, refusal-first safety, audit, evals, and a learning loop. To become a true AI CFO, the same spine needs live Reap rails, accounting writes, richer forecasts, and hardened scheduled risk monitoring.
"I treat the LLM as a constrained tool inside a deterministic workflow, not as the orchestrator. Every decision is versioned, logged, and reversible. The agent ships with four outcomes — apply, suggest, refuse, escalate — and a four-rung autonomy ladder that lets a tenant adopt at their own pace. Overrides feed three loops: per-vendor rules, retrieval index, and the eval set. Multi-tenant isolation is enforced at the data layer. The MVP runs end-to-end on mocked integrations with a real eval harness, and the production path swaps the synchronous runner for Inngest and the mocks for live adapters without rewriting the agent logic."
What it takes to become a true AI CFO
The bar is higher than "a chatbot over transactions." A real AI CFO has to notice problems before the finance team does, explain the evidence, and either take the safe action or escalate early enough that humans can still change the outcome.
| Capability | What is in this repo | What needs to exist in production |
|---|---|---|
| Financial memory | Canonical FinanceEvent union, vendors, CoA, policies, balances, decisions, overrides | Live Reap ledger feeds, real accounting sync, payroll/tax/AR/imported bank data, versioned entity snapshots |
| Proactive risk loop | Dashboard, review queue, treasury cockpit, six-week AP cash forecast, policy flags, fraud holds | Scheduled watch jobs that emit "next problem" alerts: cash-floor breach, FX exposure, vendor-bank change, policy drift, duplicate invoice, close-risk |
| Action rails | Accept/override actions, dual-control AP approval, recommendation-only payment plan | Xero/QBO/NetSuite writes, Reap Pay/Card/Optimize execution adapters, notification channels, idempotent retries, rollback windows |
| CFO judgment layer | Deterministic AP optimizer, policy engine, LLM auto-tagging, rationale panel | Scenario planning, variance explanations, working-capital recommendations, board-ready weekly digest, tenant-specific risk appetite |
| Trust and governance | Refusal contract, autonomy ladder, tenant-filtered queries, eval harness, versioned decisions | Real RBAC, Postgres RLS, DB-enforced immutable audit events, SOC 2 control map, prompt redaction, data residency policy |
| Self-improvement | Overrides write vendor rules and eval rows; retrieval embedding seam exists | Tenant-partitioned vector retrieval, shadow-mode replay, calibration tracking, promotion gates before autonomy increases |
The goal isn't to pretend the whole CFO is finished. The goal is to prove the hardest architectural slice: the agent can sit on rich financial data, make bounded decisions, refuse unsafe work, learn from corrections, and surface the next operational issue before it becomes a month-end cleanup task.
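To make one of the "next problem" alerts from the table concrete, here is a rough sketch of a cash-floor breach check over a weekly forecast. The `WeekFlow` shape, the function name, and the idea of a single per-tenant floor are assumptions for illustration, not the repo's forecast model.

```typescript
interface WeekFlow {
  weekStart: string; // ISO date of the week's first day
  inflow: number;
  outflow: number;
}

interface Breach {
  weekStart: string;
  projectedBalance: number;
  shortfall: number;
}

// Walk the forecast week by week and report the first week in which the
// projected balance drops below the floor, or null if it never does.
function firstCashFloorBreach(
  openingBalance: number,
  floor: number,
  weeks: WeekFlow[],
): Breach | null {
  let balance = openingBalance;
  for (const week of weeks) {
    balance += week.inflow - week.outflow;
    if (balance < floor) {
      return {
        weekStart: week.weekStart,
        projectedBalance: balance,
        shortfall: floor - balance,
      };
    }
  }
  return null;
}
```

A scheduled watch job could run a check like this after each forecast refresh and escalate only when the result is non-null, which keeps the alert deterministic and easy to explain.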
Reactive → Proactive
The first three skills are reactive: a Reap event arrives, and a bounded skill classifies it, enforces a rule, or schedules a payment. The new Daily CFO Pulse is the agency layer on top: a scheduled_pulse event fires on a schedule, wakes the platform up, runs deterministic analysis over the same Reap rails, writes one daily-pulse decision, and renders /brief.
The product thesis is now: reactive event skills + proactive CFO pulse over Reap's privileged rails.
| Layer | Built behavior |
|---|---|
| Cognition layer | Auto-tagging, policy enforcement, and AP agent react to individual card, payout, receipt, FX, and bill events. |
| Agency layer | daily-pulse looks across Reap Direct, Pay, Card, and Optimize signals and tells the operator what to do this week. |
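The agency layer can be pictured as a pure function from a scheduled_pulse event plus cross-rail signals to a single decision. The sketch below is an assumption-heavy illustration: the `PulseSignal` shape, the severity ranking, the in-memory `store`, and the one-decision-per-tenant-per-day key are hypothetical, and the real skill may look quite different.

```typescript
interface ScheduledPulse {
  type: "scheduled_pulse";
  tenantId: string;
  date: string; // ISO date the pulse covers
}

// Hypothetical signal shape; the four sources mirror the rails named in the table.
interface PulseSignal {
  source: "direct" | "pay" | "card" | "optimize";
  severity: number; // higher means more urgent
  message: string;
}

interface PulseDecision {
  skill: "daily-pulse";
  tenantId: string;
  date: string;
  actions: string[];
}

// Deterministic: rank signals by severity, keep the top few, write one decision.
// A second run for the same tenant and date returns the stored decision unchanged.
function runDailyPulse(
  event: ScheduledPulse,
  signals: PulseSignal[],
  store: Map<string, PulseDecision>,
  maxActions = 3,
): PulseDecision {
  const key = `${event.tenantId}:${event.date}`;
  const existing = store.get(key);
  if (existing !== undefined) return existing;

  const actions = [...signals]
    .sort((a, b) => b.severity - a.severity)
    .slice(0, maxActions)
    .map((s) => `[${s.source}] ${s.message}`);

  const decision: PulseDecision = {
    skill: "daily-pulse",
    tenantId: event.tenantId,
    date: event.date,
    actions,
  };
  store.set(key, decision);
  return decision;
}
```

Keeping the ranking deterministic means the brief can be replayed and audited like any other decision row.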
Self-grade against the three axes
Product thinking
I designed for two readers inside the product. The controller or accounts-payable clerk lives in the queue every day; the chief financial officer touches the dashboard, treasury cockpit, forecast, and morning brief. The wedge is finance-agent infrastructure on Reap rails: multi-currency fiat and USDC spend and bills, a typed refusal contract, and an autonomy ladder that lets each tenant graduate skill by skill. Every row carries confidence, evidence, and an override path because the "Why" panel is the trust surface. The north-star metric I would watch is reversal rate (the share of auto-applied entries that are later overridden), plotted against token cost so improvement is visible, not asserted.
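As a rough illustration of the north-star metric, reversal rate can be computed directly from decision rows. The `DecisionRow` fields below are assumed names for this sketch, not the repo's schema.

```typescript
interface DecisionRow {
  outcome: "apply" | "suggest" | "refuse" | "escalate";
  overridden: boolean; // a human later changed the result
  tokens: number; // model tokens spent; 0 when a vendor rule short-circuited
}

// Share of auto-applied decisions that were later overridden.
// Returns 0 when nothing was auto-applied, to avoid dividing by zero.
function reversalRate(rows: DecisionRow[]): number {
  const applied = rows.filter((r) => r.outcome === "apply");
  if (applied.length === 0) return 0;
  const reversed = applied.filter((r) => r.overridden).length;
  return reversed / applied.length;
}

// Total token spend over the same window, for the other axis of the plot.
function totalTokens(rows: DecisionRow[]): number {
  return rows.reduce((sum, r) => sum + r.tokens, 0);
}
```

Overridden suggestions are deliberately excluded: a corrected suggestion is the review queue working as intended, whereas a corrected auto-apply is a trust cost.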
Architecture & cognition
Code orchestrates; the large language model is a tool. I made the vendor-rule short-circuit bypass the model entirely, and when the model is used its job is typed classification behind a Zod schema. All Reap-shaped events flow through one FinanceEvent union. The runtime is synchronous today and targets durable execution through Inngest; the skill interface is shaped so that swap is a deployment change, not a rewrite. The tenant identifier is a first-class column on every table and a filter on every query. Decision payloads are already shaped for accounting and payment adapters, but the real ledger and real payment rails remain explicit seams.
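The control flow described here can be sketched as follows. This is an illustration under stated assumptions, not the repo's code: `parseModelOutput` is a hand-rolled stand-in for the Zod schema so the example has no dependencies, the `CardEvent` shape and the tenant-prefixed rule key are hypothetical, the model is injected as a plain synchronous function, and treating a vendor-rule hit as an apply is an assumption. The two refusal reason codes are the ones this narrative names.

```typescript
type TagDecision =
  | { outcome: "apply"; account: string; source: "vendor_rule" }
  | { outcome: "suggest"; account: string; confidence: number; source: "model" }
  | { outcome: "refuse"; reason: "missing_input" | "out_of_distribution" };

// Hypothetical member of the FinanceEvent union.
interface CardEvent {
  kind: "card";
  tenantId: string;
  vendor: string;
  amount: number;
}

// The model is just a function whose output is untrusted until validated.
type Model = (event: CardEvent) => unknown;

// Stand-in for a schema parse: returns a typed value or null on any mismatch.
function parseModelOutput(raw: unknown): { account: string; confidence: number } | null {
  if (typeof raw !== "object" || raw === null) return null;
  const record = raw as Record<string, unknown>;
  const account = record["account"];
  const confidence = record["confidence"];
  if (typeof account !== "string" || typeof confidence !== "number") return null;
  return { account, confidence };
}

function tag(
  event: CardEvent,
  vendorRules: Map<string, string>, // key: `${tenantId}:${vendor}` -> account code
  chartOfAccounts: Set<string>,
  model: Model,
): TagDecision {
  // 1. Vendor-rule short-circuit: the model is never called on this path.
  const ruled = vendorRules.get(`${event.tenantId}:${event.vendor}`);
  if (ruled !== undefined) {
    return { outcome: "apply", account: ruled, source: "vendor_rule" };
  }
  // 2. Model output must pass the schema, otherwise refuse.
  const parsed = parseModelOutput(model(event));
  if (parsed === null) return { outcome: "refuse", reason: "missing_input" };
  // 3. The account must exist in the tenant's chart of accounts, otherwise refuse.
  if (!chartOfAccounts.has(parsed.account)) {
    return { outcome: "refuse", reason: "out_of_distribution" };
  }
  return { outcome: "suggest", account: parsed.account, confidence: parsed.confidence, source: "model" };
}
```

Because the model is an injected function, swapping providers or stubbing it in evals does not touch the orchestration code.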
Production-ready
For production readiness, I focused on the failure modes that would break trust first. Every decision is a versioned row that records the prompt, model, chart-of-accounts, and rules versions alongside evidence, confidence, rationale, and an idempotency key. Refusal is a first-class outcome with a structured reason code: schema failure becomes refuse(missing_input), off-chart-of-accounts output becomes refuse(out_of_distribution), and accounts-payable money-movement holds escalate instead of guessing. The idempotency key is a hash of the full version tuple, so a replay under the same versions deduplicates, while a replay after a prompt, model, chart-of-accounts, or policy edit gets a fresh key. Decisions are immutable by convention, vendor rules are append-only through supersession, and overrides are the rollback channel. Accounts-payable payments at or above $10k USD-equivalent require dual control; the second approval is the only path to execution.
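Two of these guarantees are small enough to sketch. The field names in `VersionTuple`, the inclusion of an event identifier, and the single-approver rule below the threshold are assumptions for illustration. The hash is FNV-1a, chosen only to keep the example dependency-free; a production key would more plausibly use a cryptographic hash such as SHA-256.

```typescript
interface VersionTuple {
  eventId: string; // assumption: the key is scoped to one event
  promptVersion: string;
  modelVersion: string;
  coaVersion: string; // chart-of-accounts version
  rulesVersion: string;
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Fixed field order plus JSON encoding, so adjacent values cannot run together ambiguously.
function idempotencyKey(v: VersionTuple): string {
  return fnv1a(
    JSON.stringify([v.eventId, v.promptVersion, v.modelVersion, v.coaVersion, v.rulesVersion]),
  );
}

const DUAL_CONTROL_THRESHOLD_USD = 10_000;

// At or above the threshold, two distinct approvers are required before execution.
// Below it, this sketch assumes one approver is enough.
function canExecute(amountUsdEquivalent: number, approverIds: string[]): boolean {
  const distinctApprovers = new Set(approverIds).size;
  const required = amountUsdEquivalent >= DUAL_CONTROL_THRESHOLD_USD ? 2 : 1;
  return distinctApprovers >= required;
}
```

Counting distinct approver identifiers, rather than approvals, is what stops one person from satisfying dual control by approving twice.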
Read next
- Cognition flow — how a FinanceEvent walks from ingest to decision log
- What's mocked, trade-offs, next steps — boundaries, what's real, what I'd build next
- Evals — method & results — multi-model sweep, calibration, replay