Eval Lane · The Referee
The Eval Lane is the most common Defendable Run. It is where an AI agent’s work gets tested against a declared rulebook and turned into a receipt.
The referee is a rulebook, not a judge. (See Rulebook Engine for the full doctrine.)
The flow
Flight Sheet (rulebook) ───► Assignment (work order) ───► Submission (agent output) ───► Audit (referee runs rules) ───► Findings (ranked by tier h/m/l) ───► Approval (human) ───► Eval Receipt (chain-linked)
Every step is declared, deterministic, and human-approvable.
1. Flight Sheet — the rulebook
A Flight Sheet is a versioned, declared eval template. It carries:
- slug, name, version, lane (agent for this lane).
- assignment_instructions: what the agent is told to do (evidence-only, label assumptions, flag missing inputs, no fabrication).
- required_inputs: what the agent needs to be given.
- expected_outputs: what the submission must contain.
- eval_spec: the executable spec the referee runs:
  - required_output_schema: structure the submission must match.
  - deterministic_checks: structure / schema / required-fields auto rules.
  - math_checks: formulas the engine recomputes from agent inputs.
  - evidence_checks: evidence-non-empty / assumptions-labeled rules.
  - rules: declared yes/no policy gates in the DSL.
  - penalty: variable-penalty bands for magnitude-driven flags.
- audit_checks: the rule list each check belongs to, with kind (auto/checklist), category, and severity.
Flight Sheets are content, not migrations. The library is upserted by slug at boot from api/flight_sheets/*.json.
2. Assignment — the work order
When a client creates an Eval Run, they pick a Flight Sheet and copy (or customize) the assignment_instructions into the Run. The Run also carries:
- The agent profile (the stack — model + harness + tools + runtime).
- The evidence the agent will need (property memo, T-12, source documents, etc.).
3. Submission — the agent’s output
The agent runs anywhere (owner compute, hosted, hybrid). It produces a structured JSON submission that conforms to the Flight Sheet’s required_output_schema. The canonical submission shape:
{
  "assignment_id": "<flight-sheet slug>",
  "agent_summary": "<one sentence>",
  "inputs_used": ["<field>", "..."],
  "missing_inputs": [],
  "claims": [
    {"claim": "<one>", "evidence_reference": "<source>", "confidence": "provided"}
  ],
  "calculations": [
    {
      "name": "DSCR",
      "formula": "noi / annual_debt_service",
      "inputs": {"noi": 920000, "annual_debt_service": 706253},
      "result": 1.303,
      "units": "ratio"
    }
  ],
  "risks": [],
  "assumptions": [],
  "open_questions": [],
  "final_output": "PASS",
  "self_check": {
    "all_required_sections_completed": true,
    "all_numbers_have_sources": true,
    "assumptions_labeled": true,
    "missing_inputs_disclosed": true
  }
}
Math has to be re-derivable. Every calculations[] entry carries its own formula + inputs + claimed result. The referee recomputes the result from those inputs and flags the entry if the two disagree.
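As an illustration of how a calculations[] entry could be re-derived, here is a minimal sketch of a whitelist-based AST evaluator. It is not the engine's actual code: the names safe_eval and recheck, the set of allowed operators, and the 0.5% relative tolerance are all assumptions made for this example.

```python
import ast
import operator

# Whitelisted arithmetic operators; any other node type is rejected.
# Pow is included because amortization-style formulas need exponents.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(formula, inputs):
    """Evaluate an arithmetic formula using only names present in `inputs`."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
            raise ValueError("only numeric constants are allowed")
        if isinstance(node, ast.Name):
            if node.id not in inputs:
                raise KeyError(f"missing input: {node.id}")
            return inputs[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed expression: {type(node).__name__}")

    return walk(ast.parse(formula, mode="eval"))


def recheck(calc, rel_tol=0.005):
    """Recompute one calculations[] entry; rel_tol is an assumed tolerance."""
    recomputed = safe_eval(calc["formula"], calc["inputs"])
    claimed = calc["result"]
    rel_err = abs(recomputed - claimed) / max(abs(recomputed), 1e-12)
    return {
        "name": calc["name"],
        "recomputed": recomputed,
        "claimed": claimed,
        "rel_err": rel_err,
        "ok": rel_err <= rel_tol,
    }


dscr = {
    "name": "DSCR",
    "formula": "noi / annual_debt_service",
    "inputs": {"noi": 920000, "annual_debt_service": 706253},
    "result": 1.303,
}
print(recheck(dscr)["ok"])
```

Because the evaluator walks the parsed tree and only dispatches on whitelisted node types, function calls, attribute access, and unknown names raise instead of executing.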
4. Audit — the referee runs the rules
The referee runs the rulebook engine over the submission:
- Structure: required fields present? Missing → flag.
- Schema: fields obey declared types? Wrong type → flag.
- Math re-derivation: recompute every calculations[] result from its own inputs via a safe AST evaluator (handles the mortgage amortization formula, compound expressions). Beyond tolerance → flag with variable penalty (small miss = mid; ≥10% rel or material absolute $ = high).
- Evidence: required fields non-empty; assumptions labeled; missing inputs disclosed.
- Policy DSL: declared gates (e.g. DSCR >= 1.20) checked against the agent’s own numbers. Missing operand → skip (never false-flag).
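The variable penalty and the skip-on-missing-operand behaviour can be sketched as two small functions. This is a hedged illustration only: penalty_tier and check_gate are hypothetical names, the three-token gate syntax is a simplification of whatever the real DSL accepts, and the 0.5% tolerance and $10,000 materiality threshold are placeholder values, not documented ones.

```python
import operator

_COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def penalty_tier(rel_err, abs_err, rel_tol=0.005, material_abs=10_000.0):
    """Return None within tolerance, otherwise 'mid' or 'high'.

    rel_tol and material_abs are assumed placeholder thresholds.
    """
    if rel_err <= rel_tol:
        return None
    if rel_err >= 0.10 or abs_err >= material_abs:
        return "high"
    return "mid"


def check_gate(gate, values):
    """Evaluate a gate written as '<name> <op> <threshold>'.

    Returns 'pass', 'flag', or 'skip' when the operand is missing.
    """
    name, op, threshold = gate.split()
    if values.get(name) is None:
        return "skip"  # never false-flag on a missing operand
    passed = _COMPARATORS[op](values[name], float(threshold))
    return "pass" if passed else "flag"


print(check_gate("DSCR >= 1.20", {"DSCR": 1.303}))  # pass
print(check_gate("DSCR >= 1.20", {"DSCR": 1.022}))  # flag
print(check_gate("DSCR >= 1.20", {}))               # skip
```

Returning a three-valued result keeps "the rule said no" distinct from "the rule could not be evaluated", which is what lets a missing operand avoid a false flag.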
No LLM is called on the receipt path. A judge-model slot exists for advisory hallucination / readiness signals later, but it never gates the receipt.
5. Findings — the ranked report
The referee emits:
- Score = % of declared rules satisfied.
- Severity = honey · jelly · propolis (driven by the highest-tier flag).
- Risk breakdown = {high, mid, low} count of flag tiers.
- Client ready boolean.
- Recommended action = resubmit | review | approve | reject (what the rulebook implies).
- Findings = the flag list ranked high → low (catastrophic events surface first), each with its spot of the foul ("DSCR recomputed 1.022 — gate 1.20 — MISMATCH").
Each flag is also sorted into one of three buckets via the three-bucket taxonomy:
- work-defect — fixable; correct & resubmit (the agent).
- deal-finding — the math is right, the rule says no (the client).
- stack-fit — the model/compute is below the lane (the operator). This is the sale.
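The score, risk breakdown, and ranking described above might be computed along these lines. This is a sketch under assumptions: the per-rule result shape (rule, status, tier, bucket, detail) and the function name summarize are invented for the example, and it only handles pass and flag statuses because the text does not say how skipped rules count toward the score.

```python
from collections import Counter

# Ranking order for findings: high first, low last.
_TIER_ORDER = {"high": 0, "mid": 1, "low": 2}


def summarize(results):
    """Build score, risk breakdown, and ranked findings from per-rule results."""
    flags = [r for r in results if r["status"] == "flag"]
    passed = sum(1 for r in results if r["status"] == "pass")
    score = round(100.0 * passed / len(results), 1) if results else 0.0
    counts = Counter(f["tier"] for f in flags)
    # sorted() is stable, so flags within a tier keep their original order.
    ranked = sorted(flags, key=lambda f: _TIER_ORDER[f["tier"]])
    return {
        "score": score,
        "risk_breakdown": {tier: counts.get(tier, 0) for tier in _TIER_ORDER},
        "findings": ranked,
    }


results = [
    {"rule": "required_fields", "status": "pass"},
    {"rule": "assumptions_labeled", "status": "flag", "tier": "low",
     "bucket": "work-defect", "detail": "one assumption unlabeled"},
    {"rule": "dscr_gate", "status": "flag", "tier": "high",
     "bucket": "deal-finding", "detail": "DSCR below gate"},
    {"rule": "schema_types", "status": "pass"},
]
report = summarize(results)
print(report["score"], report["risk_breakdown"])
```

The bucket field is carried through untouched here; the sketch does not attempt to classify flags into the three buckets.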
6. Approval — the human signs
Approval is required. The portal’s Approve button (or POST /runs/{id}/approve) records the approver’s identity into the receipt payload. No receipt without approval.
7. Eval Receipt — the chain-linked record
The eval receipt is minted via POST /runs/{id}/receipt:
- Schema: defendablecloud.eval-receipt/v1.
- Payload: Flight Sheet version, agent profile, assignment text, evidence hashes, submission hash, findings (ranked), verdict, approver, timestamp.
- Chain: parent_hash points at the org’s prior receipt. SHA-256 over the canonical payload.
- Render: PDF via fpdf2, regenerable from the payload.
- Share: the public API share endpoint GET /share/{token} returns the PublicReceipt (full payload + receipt_sha256 + a server-recomputed verified boolean). Anyone can recompute and verify client-side, no auth. The Vault SPA may render a friendlier /r/<token> page on top of this endpoint; /r/<token> is the app’s client-side view, distinct from the API path /share/{token}.
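A hash-chained receipt and its client-side verification could look roughly like the following. The canonical form used here (sorted keys, compact separators, UTF-8) is an assumption, since the text does not define the canonicalization; placing schema and parent_hash inside the hashed payload is likewise a guess, and mint and verify are hypothetical helper names.

```python
import hashlib
import json
from typing import Optional


def canonical_bytes(payload):
    # Assumed canonical form: sorted keys, compact separators, UTF-8.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def receipt_sha256(payload):
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


def mint(body, parent_hash: Optional[str]):
    """Wrap a body into a receipt linked to the org's prior receipt hash."""
    payload = dict(body)
    payload["schema"] = "defendablecloud.eval-receipt/v1"
    payload["parent_hash"] = parent_hash
    return {"payload": payload, "receipt_sha256": receipt_sha256(payload)}


def verify(receipt):
    """Recompute the hash from the payload and compare with the stored one."""
    return receipt_sha256(receipt["payload"]) == receipt["receipt_sha256"]


first = mint({"verdict": "approve", "approver": "reviewer@example.com"}, None)
second = mint({"verdict": "review", "approver": "reviewer@example.com"},
              first["receipt_sha256"])
print(verify(first), verify(second))
```

Because each payload embeds the previous receipt's hash, altering an earlier receipt changes its hash and breaks the parent_hash link in every later one.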
The Flight Sheet library
The eval lane is powered by a live library of ~50 Flight Sheets covering CRE underwriting, lending, dataset QA, compute attestation, document drafting, evidence extraction, financial math, GenAI checks, compliance, and repair-eligibility. Lanes are added by extending the library content (not by migrations).
The library is loaded at API boot from api/flight_sheets/*.json. The loader upserts by slug and deactivates anything absent. Historical Runs keep their flight_sheet_id.
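The upsert-and-deactivate behaviour can be pictured with an in-memory store standing in for the database. This is a sketch, not the actual loader: load_library, the active flag, and the dict-based store are assumptions for illustration.

```python
import json
from pathlib import Path


def load_library(directory, store):
    """Upsert every *.json sheet by slug; deactivate slugs absent from disk."""
    seen = set()
    for path in sorted(Path(directory).glob("*.json")):
        sheet = json.loads(path.read_text(encoding="utf-8"))
        slug = sheet["slug"]
        # Merge over any existing row so the slug stays the stable identity.
        store[slug] = {**store.get(slug, {}), **sheet, "active": True}
        seen.add(slug)
    for slug, row in store.items():
        if slug not in seen:
            # Rows are deactivated, never deleted, so older references survive.
            row["active"] = False
    return store
```

Deactivating instead of deleting is what allows a historical Run to keep pointing at a sheet that has since left the library.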
Authoring discipline: Flight Sheets are authored via a forge that validates against a rule registry — never hand-edited. Unknown auto-rule keys are rejected, not silently accepted. (Forge tooling and the Kit-side rulebook sync are tracked separately; this docs sprint covers the lane doctrine itself.)
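The "rejected, not silently accepted" rule amounts to a strict registry lookup at authoring time. The sketch below is hypothetical throughout: the registry contents, the key field on each audit check, and the names UnknownRuleError and validate_audit_checks are invented for illustration and do not describe the forge's real interface.

```python
# Hypothetical registry of auto-rule keys the engine knows how to run.
KNOWN_AUTO_RULES = {
    "required_fields_present",
    "schema_types_match",
    "math_rederivation",
    "evidence_non_empty",
    "assumptions_labeled",
}


class UnknownRuleError(ValueError):
    """Raised when a Flight Sheet references an auto rule not in the registry."""


def validate_audit_checks(audit_checks, registry=KNOWN_AUTO_RULES):
    """Reject any auto check whose key is missing from the registry."""
    unknown = [
        check["key"]
        for check in audit_checks
        if check.get("kind") == "auto" and check["key"] not in registry
    ]
    if unknown:
        raise UnknownRuleError("unknown auto-rule keys: " + ", ".join(unknown))
    return audit_checks


checks = [
    {"key": "math_rederivation", "kind": "auto",
     "category": "math", "severity": "high"},
    {"key": "reviewer_signoff", "kind": "checklist",
     "category": "process", "severity": "low"},
]
validate_audit_checks(checks)
```

Failing loudly at authoring time means a misspelled rule key cannot turn into a check that silently never runs.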
The trust boundary
“DefendableCloud is not AI judging AI. It is agent work tested against a declared rulebook.”
The proof layer is math and code. The referee has the rulebook. The referee throws flags. The human owns the final trust decision.
🐝 Flight sheet · submission · audit · approval · receipt. The referee is a rulebook. To the shed.