Golden sessions, replays, rubric scoring, trace-to-candidate triage, and deterministic knowledge-artifact evals for regression coverage of agent outputs and knowledge-loop outputs.
# Evals
The evals module captures agent outputs that mattered (golden sessions), replays them against the current agent config (replays), scores the replays against rating dimensions (rubrics), and surfaces negative-feedback prompts that should become next week's goldens (candidates). All four live on `/admin/agent-review/evals`.
## Key concepts
- **Golden session** — a completed session promoted as a regression baseline. The promotion freezes a `GoldenSessionSnapshot` into `sessions.metadata.golden` so replays reproduce the same starting conditions even when shared-context lessons are edited later.
- **Replay** — a re-run of a golden's snapshot against the live agent runtime. Produces an `EvalComparison` (text similarity + optional rubric scores) attached to the replay session's `metadata.eval_result`.
- **Rubric** — a criteria set attached to a golden via `metadata.golden.criteriaSetId`. When present, replays are LLM-judged per rating dimension and emit a `RubricScoreResult`.
- **Candidate** — a dedup'd user prompt extracted from negative-feedback rows. Admins promote them into golden sets via the existing PromoteAsGolden dialog, or dismiss them.
## Surfaces
`/admin/agent-review/evals` is the operator's home for this module. It renders three sections, top to bottom:
1. **Intro** — short copy explaining what goldens are and how to promote one from a session detail page.
2. **Candidates** — the triage queue (see below). Capped at 5 visible cards by default, with a "View all N" expander.
3. **Golden sessions** — the `<GoldenSessionManager>` showing golden sets, single + bulk replay actions, and rubric attachments.
## Candidates — trace-to-prompt promotion
The Candidates queue surfaces deduplicated negative-feedback prompts so operators can promote them into golden sets in one click.
### Eligibility (strict)
A `feedback` row qualifies when ALL hold:
- `rating = 'negative'`, `status = 'pending'`
- `session_id IS NOT NULL`, `agent_id IS NOT NULL`
- `source_type IN ('response', 'session')` (chat is deferred; extraction/tool/observation are excluded because their first session events are machine-generated)
- The session has at least one `session_events` row with `event_type = 'user.message'` and an extractable message via `extractMessageText`
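The rules above can be read as a single predicate. The following is a minimal sketch, not the module's actual code: the `FeedbackRowSketch` type and the `isEligibleCandidate` name are hypothetical, and the session-level rule (a `user.message` event with extractable text) is passed in as a precomputed flag because it depends on `session_events`.

```typescript
// Hypothetical row shape; column names follow the eligibility rules, the type is assumed.
type FeedbackRowSketch = {
  id: string;
  rating: string;
  status: string;
  session_id: string | null;
  agent_id: string | null;
  source_type: string;
};

const ELIGIBLE_SOURCE_TYPES = ["response", "session"];

// True only when every row-level rule holds AND the session has an extractable
// user.message (supplied by the caller as `hasUserPrompt`).
function isEligibleCandidate(row: FeedbackRowSketch, hasUserPrompt: boolean): boolean {
  return (
    row.rating === "negative" &&
    row.status === "pending" &&
    row.session_id !== null &&
    row.agent_id !== null &&
    ELIGIBLE_SOURCE_TYPES.indexOf(row.source_type) !== -1 &&
    hasUserPrompt
  );
}

const baseRow: FeedbackRowSketch = {
  id: "f1",
  rating: "negative",
  status: "pending",
  session_id: "s1",
  agent_id: "a1",
  source_type: "response",
};
```

A `chat` row, an already-resolved row, or a row whose session has no user prompt is rejected by the same predicate.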
### Dedup
Key = `(agent_id, prompt.trim().toLowerCase().replace(/\s+/g, " "))`. Two qualifying feedback rows for the same prompt + same agent collapse into one candidate with `occurrenceCount: 2` and both feedback IDs.
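As an illustration of the dedup key, here is a small sketch of the in-memory grouping. The normalization expression is taken from the key definition above; the `QualifyingFeedback` and `CandidateGroupSketch` types and the `groupCandidates` name are assumptions made for the example, not the real `CandidateGolden` shape.

```typescript
// Hypothetical input/output shapes for illustration only.
type QualifyingFeedback = { feedbackId: string; agentId: string; prompt: string };
type CandidateGroupSketch = {
  agentId: string;
  normalizedPrompt: string;
  occurrenceCount: number;
  feedbackIds: string[];
};

// Same normalization as the documented key: trim, lowercase, collapse whitespace runs.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

function groupCandidates(rows: QualifyingFeedback[]): CandidateGroupSketch[] {
  const groups = new Map<string, CandidateGroupSketch>();
  for (const row of rows) {
    const normalizedPrompt = normalizePrompt(row.prompt);
    // Serializing the tuple avoids collisions a plain separator character could cause.
    const key = JSON.stringify([row.agentId, normalizedPrompt]);
    const existing = groups.get(key);
    if (existing) {
      existing.occurrenceCount += 1;
      existing.feedbackIds.push(row.feedbackId);
    } else {
      groups.set(key, {
        agentId: row.agentId,
        normalizedPrompt,
        occurrenceCount: 1,
        feedbackIds: [row.feedbackId],
      });
    }
  }
  return Array.from(groups.values());
}

const grouped = groupCandidates([
  { feedbackId: "f1", agentId: "a1", prompt: "  Summarize  the\nreport " },
  { feedbackId: "f2", agentId: "a1", prompt: "summarize the report" },
  { feedbackId: "f3", agentId: "a2", prompt: "Summarize the report" },
]);
```

Here `f1` and `f2` collapse into one group for agent `a1`, while `f3` stays separate because the agent differs.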
### Actions
- **Promote to golden** — opens the existing `<PromoteAsGoldenButton>` pre-filled with the candidate's representative session (most recent feedback in the group). On submit success, the group's feedback rows flip to `status='applied'` via `PATCH /api/admin/evals/candidates` with `action='applied'`.
- **Dismiss** — `<ConfirmDialog>` → PATCH with `action='dismissed'`. Race-guarded by `.eq("status","pending")`.
### Architecture
`listCandidateGoldens()` issues exactly TWO DB roundtrips regardless of candidate count: one `feedback` + `agents` join, one batched `session_events` `.in(session_ids)` lookup. Group / sort / dedup happen in memory.
```
feedback (rating='negative', status='pending', src∈{response,session}, ...)
│ join agents → agent slug
▼
session_events (.in(session_ids), event_type='user.message', order by sequence asc)
│ take first per session, extractMessageText → prompt
▼
group by (agent_id, normalized_prompt) → CandidateGolden[]
```
`resolveCandidateGroup(feedbackIds, action)` issues ONE bulk UPDATE: `.update({status, reviewed_by_user_id, reviewed_at}).in("id", ids).eq("tenant_id", t).eq("status","pending").select("id")`. Returns `{ updated: N }`.
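To make the race guard concrete, the sketch below simulates the bulk UPDATE against an in-memory array. It is an illustration under assumptions: `resolveInMemory` and the `RowSketch` type are hypothetical stand-ins, and the real function runs the filters in the database rather than in a loop.

```typescript
type StatusSketch = "pending" | "reviewed" | "applied" | "dismissed";
type RowSketch = { id: string; tenantId: string; status: StatusSketch };

// In-memory stand-in for the bulk UPDATE: a row changes only if it is in `ids`,
// belongs to the tenant, AND is still pending (the race guard).
function resolveInMemory(
  table: RowSketch[],
  ids: string[],
  tenantId: string,
  action: "dismissed" | "applied",
): { updated: number } {
  if (ids.length === 0) return { updated: 0 }; // empty input short-circuits
  let updated = 0;
  for (const row of table) {
    if (ids.indexOf(row.id) !== -1 && row.tenantId === tenantId && row.status === "pending") {
      row.status = action;
      updated += 1;
    }
  }
  return { updated };
}

const feedbackTable: RowSketch[] = [
  { id: "f1", tenantId: "t1", status: "pending" },
  { id: "f2", tenantId: "t1", status: "pending" },
];

// Admin A promotes first; Admin B's later dismiss works from a stale view.
const adminA = resolveInMemory(feedbackTable, ["f1", "f2"], "t1", "applied");
const adminB = resolveInMemory(feedbackTable, ["f1", "f2"], "t1", "dismissed");
```

Admin B's call matches no pending rows, so it reports zero updates and Admin A's `applied` status is left intact.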
## API reference
### Server actions (`features/evals/server/candidates.ts`)
```ts
listCandidateGoldens(opts?: { limit?: number }): Promise<CandidateGolden[]>;
// limit: clamped to [1, 100], default 20
// Admin-gated, tenant-scoped, two DB roundtrips
resolveCandidateGroup(
feedbackIds: string[],
action: "dismissed" | "applied",
): Promise<{ updated: number }>;
// Bulk UPDATE with race guard. Empty array short-circuits with updated:0.
```
### Routes
| Method | Path | Body / Query | Description |
| ------ | ------------------------------- | ---------------------------------------------- | ------------------------------------------- |
| GET | `/api/admin/evals/candidates` | `?limit=N` (1-100, default 20) | List candidate prompts for the active tenant |
| PATCH | `/api/admin/evals/candidates` | `{ feedbackIds: uuid[1..200], action: enum }` | Bulk resolve a candidate group |
| GET | `/api/admin/evals/golden` | `?set=slug&limit=N` | List golden sessions |
| POST | `/api/admin/evals/golden` | `{ sessionId, set?, note?, criteriaSetId? }` | Promote a completed session to golden |
| POST | `/api/admin/evals/runs` | `{ goldenSessionId }` | Replay a single golden |
| POST | `/api/admin/evals/runs/set` | `{ setName }` | Bulk-replay every golden in a set |
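The constraints in the candidates rows of the table can be sketched as two small parsers. This is a hedged illustration: `parseLimit` and `parseResolveBody` are hypothetical names, the UUID pattern is a generic one, and whether the real route clamps or rejects an out-of-range `limit` is an assumption carried over from the server action's documented clamping.

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

type ResolveBodySketch = { feedbackIds: string[]; action: "dismissed" | "applied" };

// ?limit=N: default 20, clamped into [1, 100] (clamping is assumed here).
function parseLimit(raw: string | null): number {
  const parsed = raw === null ? NaN : Number.parseInt(raw, 10);
  if (Number.isNaN(parsed)) return 20;
  return Math.min(100, Math.max(1, parsed));
}

// PATCH body: 1..200 UUIDs plus an action from the two-value enum.
// Returns null for any body that violates the documented shape.
function parseResolveBody(body: unknown): ResolveBodySketch | null {
  if (typeof body !== "object" || body === null) return null;
  const { feedbackIds, action } = body as { feedbackIds?: unknown; action?: unknown };
  if (!Array.isArray(feedbackIds)) return null;
  if (feedbackIds.length < 1 || feedbackIds.length > 200) return null;
  const ids: string[] = [];
  for (const id of feedbackIds) {
    if (typeof id !== "string" || !UUID_PATTERN.test(id)) return null;
    ids.push(id);
  }
  const resolvedAction =
    action === "dismissed" ? "dismissed" : action === "applied" ? "applied" : null;
  if (resolvedAction === null) return null;
  return { feedbackIds: ids, action: resolvedAction };
}

const sampleId = "123e4567-e89b-12d3-a456-426614174000";
const validBody = parseResolveBody({ feedbackIds: [sampleId], action: "applied" });
```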
## Design decisions
- **No new DB schema.** Candidates v1 reuses the existing `feedback` table. The status column already has `pending / reviewed / applied / dismissed` values; PR 3 just adds new transitions.
- **Anchor candidates on first `user.message` only.** Not `session.created` / `session.claimed` — those are machine triggers, not user prompts. Multi-turn anchoring (cherry-picking the rejected turn) is a v2 follow-up.
- **One PATCH endpoint with `action` discriminator.** Both Dismiss and the post-promote feedback-flip flow through the same route — keeps the API surface minimal.
- **Race guard on resolve.** `.eq("status","pending")` on every bulk UPDATE ensures Admin A's promote-flip survives Admin B's stale dismiss.
- **Layout — cap visible at 5.** The triage section sits above `<GoldenSessionManager>`, which is the primary surface admins return to. Capping keeps the manager above the fold.
- **Codex was quota-exhausted during spec review.** Claude ran the Codex-equivalent pattern review; Gemini provided the third opinion. Acceptable per `.claude/rules/multi-model-review.md` for a medium-risk, admin-only surface with no new schema.
## Knowledge-artifact evals
Golden-session replay scores agent **runs**; the loop closure-scorer
(`features/loops/server/closure-scorer.ts`) scores loop **wiring**. The
knowledge-artifact harness (`features/evals/knowledge/`) scores the third thing:
the **outputs** a knowledge loop produces — an evidence claim, a protocol, a
product-knowledge record — proving they are evidence-backed, schema-complete,
actionable, safety-labeled, renderable, and tenant-scoped. It is a deterministic
pure library (no DB, no LLM) so it runs in vitest and a CI scorecard.
### The six dimensions
A `KnowledgeRubric` scores a normalized `KnowledgeArtifact` across six axes
(snake_case so the judge bridge can build `${key}_score` keys):
| Dimension | Question |
| ------------------------- | -------------------------------------------------------------------- |
| `source_grounding` | Is the claim backed by citable sources/signals, not asserted bare? |
| `schema_completeness` | Are the required + load-bearing fields populated? |
| `actionability` | Can a human/agent act on it — concrete, specific, non-vague? |
| `safety_human_gate` | Is the safety posture + human gate correctly labeled (no auto-approve)? |
| `workspace_renderability` | Does it route to a real surface with a non-stub payload? |
| `tenant_isolation` | Is it scoped to its tenant with no cross-tenant leakage? |
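The snake_case naming matters because score keys are derived by string templating. A minimal sketch of that derivation follows; the constant and function names are hypothetical, and only the six dimension keys and the `${key}_score` pattern come from the text above.

```typescript
const KNOWLEDGE_DIMENSIONS = [
  "source_grounding",
  "schema_completeness",
  "actionability",
  "safety_human_gate",
  "workspace_renderability",
  "tenant_isolation",
] as const;

type KnowledgeDimensionKey = (typeof KNOWLEDGE_DIMENSIONS)[number];
type KnowledgeScoreKey = `${KnowledgeDimensionKey}_score`;

// Builds the `${key}_score` key for one dimension.
function scoreKey(dimension: KnowledgeDimensionKey): KnowledgeScoreKey {
  return `${dimension}_score` as KnowledgeScoreKey;
}

const allScoreKeys: KnowledgeScoreKey[] = KNOWLEDGE_DIMENSIONS.map(scoreKey);
```

Typing the keys as a template literal type lets the compiler reject a misspelled score key at the call site.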
### Model: artifact → adapter → rubric → suite
- **`KnowledgeArtifact`** — a tenant entity (or agent output) projected to a
tenant-agnostic shape (`fields`, `evidenceRefs`, `safety`, `render`).
- **Adapter** — a per-tenant function that maps a real entity to a
`KnowledgeArtifact` (`evidenceClaimToArtifact`, `therapyPlanToArtifact`,
`pfIdeaToArtifact`, `pfKnowledgeToArtifact`). This is the model-agnostic seam:
the same rubric grades any entity, and a new type slots in behind a sibling
adapter with no harness change. Sprinter `pf_knowledge` (the evidence-backed
record the product-knowledge loop **produces**) is graded directly against its
10 real seed records.
- **Rubric** — six dimensions, each composing a reusable check from
`features/evals/knowledge/checks.ts` (`requireEvidenceRefs`, `requireFields`,
`requireActionablePayload`, `requireSafetyLabel`, `requireRenderable`,
`assertTenantScope`) against the tenant's real field names. A rubric SHOULD be
a **lens over its `KnowledgeLoopDefinition`** (`features/loops/lib/knowledge-loop.ts`):
the `pfKnowledgeRubric` derives its render surfaces, lifecycle statuses
(`draft`/`published`), evidence fields, and review gate from the loop's
declared `surfaces` / `records` / `reviewGate` — so the eval can never drift
from what the loop declares.
- **Suite** — a rubric + labeled fixtures (good → PASS, bad → FAIL on one named
dimension). `runKnowledgeEvalSuite` enforces two contracts: every fixture
behaves as labeled, **and** every dimension has ≥1 failing fixture targeting
it (proof the dimension is load-bearing).
No parallel systems: `scoreKnowledgeArtifact` mirrors `scoreLoopClosureFromFacts`,
emits the existing `RubricScoreResult` surface shape
(`knowledgeResultToRubricScoreResult`), and bridges the same rubric to the live
`runRubricJudge` LLM path via `knowledgeRubricToCriteriaDimensions`. The harness
re-exports `KnowledgeSafetyClass` from the loop primitive rather than redefining
it — one safety taxonomy across the loop and its evals.
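The two suite contracts can be sketched as a single check. This is an illustrative simplification, not the real `runKnowledgeEvalSuite`: the types, the boolean-returning checks, and the `checkSuiteContracts` name are assumptions, and a bad fixture is treated as valid when its named dimension is among the failures.

```typescript
type ArtifactSketch = { tenantId: string; evidenceRefs: string[] };
// Dimension key -> check returning true when the artifact passes.
type RubricSketch = Record<string, (artifact: ArtifactSketch) => boolean>;
// A fixture without `failsDimension` is a good fixture and must pass everything.
type FixtureSketch = { name: string; artifact: ArtifactSketch; failsDimension?: string };

function checkSuiteContracts(rubric: RubricSketch, fixtures: FixtureSketch[]): string[] {
  const problems: string[] = [];
  const dimensions = Object.keys(rubric);
  const covered = new Set<string>();
  for (const fixture of fixtures) {
    const failed = dimensions.filter((d) => !rubric[d](fixture.artifact));
    if (fixture.failsDimension === undefined) {
      // Contract 1a: good fixtures pass every dimension.
      if (failed.length > 0) problems.push(`${fixture.name}: expected PASS, failed ${failed.join(", ")}`);
    } else if (failed.indexOf(fixture.failsDimension) === -1) {
      // Contract 1b: bad fixtures fail on their named dimension.
      problems.push(`${fixture.name}: expected FAIL on ${fixture.failsDimension}`);
    } else {
      covered.add(fixture.failsDimension);
    }
  }
  // Contract 2: every dimension is targeted by at least one failing fixture.
  for (const d of dimensions) {
    if (!covered.has(d)) problems.push(`${d}: no failing fixture targets this dimension`);
  }
  return problems;
}

const demoRubric: RubricSketch = {
  source_grounding: (a) => a.evidenceRefs.length > 0,
  tenant_isolation: (a) => a.tenantId === "tenant-a",
};
const demoFixtures: FixtureSketch[] = [
  { name: "good", artifact: { tenantId: "tenant-a", evidenceRefs: ["src-1"] } },
  { name: "bare-claim", artifact: { tenantId: "tenant-a", evidenceRefs: [] }, failsDimension: "source_grounding" },
];
const leakyFixture: FixtureSketch = {
  name: "leaky",
  artifact: { tenantId: "tenant-b", evidenceRefs: ["src-1"] },
  failsDimension: "tenant_isolation",
};

const incompleteSuite = checkSuiteContracts(demoRubric, demoFixtures);
const completeSuite = checkSuiteContracts(demoRubric, demoFixtures.concat([leakyFixture]));
```

Without `leakyFixture`, the sketch reports that `tenant_isolation` has no failing fixture, which is the coverage gap the second contract exists to catch.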
### Source grounding is due at publish, not at draft
`requireEvidenceRefs({ draftStatuses })` exempts in-progress records from the
grounding floor — a research-KG population loop drafts a record first and grounds
it before publish. A draft with zero sources passes `source_grounding`; the same
record, once it reaches a `finalizedStatuses` value, MUST be sourced. This is how
the pf_knowledge suite grades real `draft` seeds as trustworthy while still
failing a `published` record that cites nothing.
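The draft exemption can be illustrated with a simplified check factory. This sketch is hypothetical: it treats every status outside `draftStatuses` as requiring sources, which is an assumption, and the `CheckResultSketch` shape is invented for the example rather than taken from `checks.ts`.

```typescript
type GroundedArtifactSketch = { status: string; evidenceRefs: string[] };
type CheckResultSketch = { pass: boolean; reason: string };

// Simplified stand-in for requireEvidenceRefs({ draftStatuses }).
function requireEvidenceRefsSketch(opts: {
  draftStatuses: string[];
}): (artifact: GroundedArtifactSketch) => CheckResultSketch {
  return (artifact) => {
    if (artifact.evidenceRefs.length > 0) {
      return { pass: true, reason: "has evidence refs" };
    }
    if (opts.draftStatuses.indexOf(artifact.status) !== -1) {
      // Grounding is not yet due while the record is still a draft.
      return { pass: true, reason: "draft: grounding not yet due" };
    }
    return { pass: false, reason: "non-draft record cites no sources" };
  };
}

const groundingCheck = requireEvidenceRefsSketch({ draftStatuses: ["draft"] });
const draftResult = groundingCheck({ status: "draft", evidenceRefs: [] });
const publishedBare = groundingCheck({ status: "published", evidenceRefs: [] });
const publishedSourced = groundingCheck({ status: "published", evidenceRefs: ["src-1"] });
```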
### Renderability verification (do published surfaces resolve?)
The `workspace_renderability` dimension trusts a `knownSurfaces` list. To prove
those surfaces are real (not stubs), `surface-resolution.ts` resolves a
`KnowledgeLoopDefinition`'s declared `surfaces[]` against injected per-kind
resolvers (`resolveKnowledgeSurfaces` / `unresolvedKnowledgeSurfaces`), and
`makeSurfaceProbe(resolveSurface)` turns a resolver into a `RenderabilityProbe`
that `requireRenderable({ probe })` treats as authoritative. The per-tenant
`renderability.test.ts` files wire real resolvers — Sprinter page surfaces →
`TenantModule.pages` component map; DOC'S workspace / entity-list surfaces →
declared workspaces + entity types (+ platform system types) — and assert every
loop-declared surface resolves. (PR #2584 shipped no `validateManifestRenderability`;
this bridge is the deterministic, CI-safe proof in its place.)
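The resolver-to-probe bridge might look roughly like the sketch below. All shapes here are assumptions for illustration: the real `KnowledgeLoopDefinition` surface type, `RenderabilityProbe` return value, and resolver signatures are not shown in this document, so the names carry a `Sketch` suffix.

```typescript
type SurfaceSketch = { kind: string; target: string };
type ResolveSurfaceSketch = (surface: SurfaceSketch) => boolean;
type ProbeSketch = (surface: SurfaceSketch) => { renderable: boolean; reason: string };

// Combine per-kind resolvers; a surface of an unknown kind never resolves.
function resolverByKind(resolvers: Record<string, (target: string) => boolean>): ResolveSurfaceSketch {
  return (surface) => {
    const resolve = resolvers[surface.kind];
    return resolve !== undefined && resolve(surface.target);
  };
}

// Surfaces a loop declares that the injected resolvers cannot resolve.
function unresolvedSurfacesSketch(surfaces: SurfaceSketch[], resolve: ResolveSurfaceSketch): SurfaceSketch[] {
  return surfaces.filter((surface) => !resolve(surface));
}

// Turn a resolver into a probe a renderability check can treat as authoritative.
function makeSurfaceProbeSketch(resolve: ResolveSurfaceSketch): ProbeSketch {
  return (surface) =>
    resolve(surface)
      ? { renderable: true, reason: "surface resolves" }
      : { renderable: false, reason: `unresolved ${surface.kind} surface: ${surface.target}` };
}

// Hypothetical page map standing in for a tenant's page registry.
const knownPages = ["knowledge-board"];
const resolveDemo = resolverByKind({ page: (target) => knownPages.indexOf(target) !== -1 });
const declaredSurfaces: SurfaceSketch[] = [
  { kind: "page", target: "knowledge-board" },
  { kind: "page", target: "missing-page" },
  { kind: "workspace", target: "research" },
];
const unresolvedDemo = unresolvedSurfacesSketch(declaredSurfaces, resolveDemo);
const probeDemo = makeSurfaceProbeSketch(resolveDemo);
```

A per-tenant test in this style would assert that the unresolved list is empty for every loop-declared surface.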
### Running the evals
```bash
# Contract tests (good→pass, bad→fail-on-dimension, full coverage, schema drift)
pnpm exec vitest run features/evals/knowledge \
features/custom/tenants/docs/evals features/custom/tenants/sprinter/evals
# Deterministic scorecard — exit 0 = all pass, 1 = a suite failed
pnpm evals:knowledge # human-readable
pnpm evals:knowledge --json # machine-readable
pnpm evals:knowledge --out DIR # write summary.json + summary.md
```
### Adding a tenant suite (extension recipe)
1. Create `features/custom/tenants/<slug>/evals/`.
2. Write an **adapter** mapping your entity's real fields to a `KnowledgeArtifact`.
3. Write a **rubric** composing the six dimensions from the reusable checks,
keyed on your entity's real `json_schema` field names.
4. Write **fixtures**: good artifacts that PASS + one bad artifact per dimension
(`failsDimension`). Real seed records make the best good fixtures.
5. Export a `KnowledgeEvalSuite` and add a co-located `*.test.ts` that runs it
plus a **drift guard** asserting your rubric only references fields the real
entity-type schema declares.
6. Register the suite in `scripts/evals/knowledge-scorecard.ts`. Step 5's
co-located test runs in CI automatically; step 6 is what surfaces the suite
in `pnpm evals:knowledge` — skip it and the suite is CI-covered but invisible
in the scorecard.
The DOC'S/Praxium (`evidence-claim`, `therapy-plan`) and Sprinter (`pf_idea`,
`pf_knowledge`) suites are the worked examples. `pf_knowledge` is the canonical
loop-derived rubric; `renderability.test.ts` in each tenant's `evals/` is the
worked surface-resolution proof.
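Steps 2 and 5 of the recipe can be sketched for a made-up entity. Everything named here is hypothetical: the `HypotheticalRecord` entity, the inner shapes of `safety` and `render`, and the `findDriftedFields` helper are assumptions, with only the four top-level artifact sections (`fields`, `evidenceRefs`, `safety`, `render`) taken from the model description.

```typescript
// A made-up tenant entity, used only for illustration.
type HypotheticalRecord = {
  tenant_id: string;
  title: string;
  summary: string;
  sources: string[];
  status: string;
};

// Assumed artifact shape; the inner safety/render fields are placeholders.
type KnowledgeArtifactSketch = {
  tenantId: string;
  fields: Record<string, unknown>;
  evidenceRefs: string[];
  safety: { requiresHumanReview: boolean };
  render: { surface: string };
};

// Step 2: adapter mapping the entity's real field names to the artifact.
function hypotheticalRecordToArtifact(record: HypotheticalRecord): KnowledgeArtifactSketch {
  return {
    tenantId: record.tenant_id,
    fields: { title: record.title, summary: record.summary, status: record.status },
    evidenceRefs: record.sources.slice(),
    safety: { requiresHumanReview: true },
    render: { surface: "knowledge-board" },
  };
}

// Step 5: drift guard listing rubric fields the entity schema does not declare.
function findDriftedFields(rubricFields: string[], schemaProperties: string[]): string[] {
  return rubricFields.filter((field) => schemaProperties.indexOf(field) === -1);
}

const schemaProperties = ["title", "summary", "sources", "status"];
const artifactDemo = hypotheticalRecordToArtifact({
  tenant_id: "tenant-a",
  title: "Pricing page objections",
  summary: "Three recurring objections from support tickets.",
  sources: ["ticket-101", "ticket-204"],
  status: "draft",
});
const noDrift = findDriftedFields(["title", "summary", "sources"], schemaProperties);
const withDrift = findDriftedFields(["title", "citation_list"], schemaProperties);
```

In a co-located test, the drift guard would assert that the returned list is empty, so a renamed schema field fails CI instead of silently weakening the rubric.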
## Related modules
- `features/feedback/**` — the unified feedback table that powers Candidates as a query view
- `features/responses/**` — `CriteriaSetDimension` + `computeResponseScore`; the rubric LLM-judge path and the knowledge-rubric criteria bridge
- `features/loops/**` — `scoreLoopClosureFromFacts` (the facts→score→gaps pattern the knowledge scorer mirrors)
- `features/sessions/**` — `session_events` is the source of truth for candidate prompts
- `features/context/**` — frozen `shared_context` baked into `GoldenSessionSnapshot`
- `features/custom/tenants/{docs,sprinter}/evals/**` — the tenant knowledge-eval suites
## For agents
When triaging candidates programmatically:
1. `GET /api/admin/evals/candidates` to fetch the queue.
2. For each candidate, `POST /api/admin/evals/golden { sessionId: representativeSessionId, set, criteriaSetId? }` to promote.
3. `PATCH /api/admin/evals/candidates { feedbackIds, action: 'applied' }` to flip the group's feedback rows out of `pending`.
4. Or, instead of steps 2 and 3, `PATCH ... action: 'dismissed'` to drop a candidate.
All routes are admin-gated. Background jobs should call the server actions directly (`listCandidateGoldens`, `resolveCandidateGroup`) under `withToolContext({ tenantId })`.
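The per-candidate steps can be sketched as a pure planner that returns the requests to send, leaving transport and auth to the caller. This is an illustration under assumptions: `planTriage` and the `Decision` type are hypothetical, and the `feedbackIds` field name on the candidate is assumed (only `representativeSessionId` appears in the steps above).

```typescript
type CandidateSketch = { representativeSessionId: string; feedbackIds: string[] };
type Decision =
  | { kind: "promote"; set: string; criteriaSetId?: string }
  | { kind: "dismiss" };
type PlannedRequest = { method: "POST" | "PATCH"; path: string; body: Record<string, unknown> };

// Returns the requests for one candidate, in the order they should be sent.
function planTriage(candidate: CandidateSketch, decision: Decision): PlannedRequest[] {
  const resolvePath = "/api/admin/evals/candidates";
  if (decision.kind === "dismiss") {
    return [
      { method: "PATCH", path: resolvePath, body: { feedbackIds: candidate.feedbackIds, action: "dismissed" } },
    ];
  }
  const promoteBody: Record<string, unknown> = {
    sessionId: candidate.representativeSessionId,
    set: decision.set,
  };
  if (decision.criteriaSetId !== undefined) {
    promoteBody.criteriaSetId = decision.criteriaSetId;
  }
  return [
    { method: "POST", path: "/api/admin/evals/golden", body: promoteBody },
    // Send this only after the POST succeeds, mirroring the UI's on-success flip.
    { method: "PATCH", path: resolvePath, body: { feedbackIds: candidate.feedbackIds, action: "applied" } },
  ];
}

const candidateDemo: CandidateSketch = { representativeSessionId: "s-1", feedbackIds: ["f1", "f2"] };
const promotePlan = planTriage(candidateDemo, { kind: "promote", set: "regressions" });
const dismissPlan = planTriage(candidateDemo, { kind: "dismiss" });
```

Keeping the planner pure makes the decision logic testable without an admin session; a background job would skip the HTTP layer and call the server actions directly, as noted above.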