Evals

Golden sessions, replays, rubric scoring, trace-to-candidate triage, and deterministic knowledge-artifact evals for regression coverage of agent outputs and knowledge-loop outputs.

# Evals

The evals module captures agent outputs that mattered (golden sessions), replays them against the current agent config (replays), scores the replays against rating dimensions (rubrics), and surfaces negative-feedback prompts that should become next week's goldens (candidates). All four live on `/admin/agent-review/evals`.

## Key concepts

- **Golden session** — a completed session promoted as a regression baseline. The promotion freezes a `GoldenSessionSnapshot` into `sessions.metadata.golden` so replays reproduce the same starting conditions even when shared-context lessons are edited later.
- **Replay** — a re-run of a golden's snapshot against the live agent runtime. Produces an `EvalComparison` (text similarity + optional rubric scores) attached to the replay session's `metadata.eval_result`.
- **Rubric** — a criteria set attached to a golden via `metadata.golden.criteriaSetId`. When present, replays are LLM-judged per rating dimension and emit a `RubricScoreResult`.
- **Candidate** — a dedup'd user prompt extracted from negative-feedback rows. Admins promote them into golden sets via the existing PromoteAsGolden dialog, or dismiss them.

## Surfaces

`/admin/agent-review/evals` is the operator's home for this module. It renders three sections, top to bottom:

1. **Intro** — short copy explaining what goldens are and how to promote one from a session detail page.
2. **Candidates** — the triage queue (see below). Capped at 5 visible cards by default, with a "View all N" expander.
3. **Golden sessions** — the `<GoldenSessionManager>` showing golden sets, single + bulk replay actions, and rubric attachments.

## Candidates — trace-to-prompt promotion

The Candidates queue surfaces deduplicated negative-feedback prompts so operators can promote them into golden sets in one click.

### Eligibility (strict)

A `feedback` row qualifies when ALL hold:

- `rating = 'negative'`, `status = 'pending'`
- `session_id IS NOT NULL`, `agent_id IS NOT NULL`
- `source_type IN ('response', 'session')` (chat is deferred; extraction/tool/observation excluded because their first session events are machine-generated)
- The session has at least one `session_events` row with `event_type = 'user.message'` and an extractable message via `extractMessageText`
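The rules above can be read as a single predicate. The following is a minimal sketch, not the module's actual query: the `FeedbackRow` shape and the `isEligibleFeedback` name are hypothetical, and the fourth rule (a `user.message` event with extractable text) is passed in as a flag because it depends on `session_events` rather than on the feedback row.

```ts
// Hypothetical row shape; column names follow the eligibility list.
interface FeedbackRow {
  id: string;
  rating: string;
  status: string;
  session_id: string | null;
  agent_id: string | null;
  source_type: string;
}

const ELIGIBLE_SOURCE_TYPES = ["response", "session"];

// Row-level checks plus a caller-supplied flag for the session_events rule.
function isEligibleFeedback(
  row: FeedbackRow,
  hasExtractableUserMessage: boolean,
): boolean {
  return (
    row.rating === "negative" &&
    row.status === "pending" &&
    row.session_id !== null &&
    row.agent_id !== null &&
    ELIGIBLE_SOURCE_TYPES.indexOf(row.source_type) !== -1 &&
    hasExtractableUserMessage
  );
}
```

A `chat` row, a row without a session, or a session whose first events are machine-generated all fail this predicate and never reach the dedup step.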

### Dedup

Key = `(agent_id, prompt.trim().toLowerCase().replace(/\s+/g, " "))`. Two qualifying feedback rows for the same prompt + same agent collapse into one candidate with `occurrenceCount: 2` and both feedback IDs.
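As an illustration of the key above, here is a small in-memory grouping sketch. The normalization expression is taken from the key definition; the `QualifyingFeedback` and `CandidateGroup` shapes and the `groupCandidates` helper are hypothetical stand-ins, not the real `CandidateGolden` type.

```ts
// Hypothetical shapes for illustration only.
interface QualifyingFeedback {
  feedbackId: string;
  agentId: string;
  prompt: string;
}

interface CandidateGroup {
  agentId: string;
  normalizedPrompt: string;
  occurrenceCount: number;
  feedbackIds: string[];
}

// Same normalization as the dedup key: trim, lowercase, collapse whitespace.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// Collapse rows that share (agentId, normalizedPrompt) into one group,
// preserving first-seen order.
function groupCandidates(rows: QualifyingFeedback[]): CandidateGroup[] {
  const byKey: Record<string, CandidateGroup> = {};
  const groups: CandidateGroup[] = [];
  for (const row of rows) {
    const normalizedPrompt = normalizePrompt(row.prompt);
    const key = JSON.stringify([row.agentId, normalizedPrompt]);
    const existing = byKey[key];
    if (existing !== undefined) {
      existing.occurrenceCount += 1;
      existing.feedbackIds.push(row.feedbackId);
    } else {
      const group: CandidateGroup = {
        agentId: row.agentId,
        normalizedPrompt,
        occurrenceCount: 1,
        feedbackIds: [row.feedbackId],
      };
      byKey[key] = group;
      groups.push(group);
    }
  }
  return groups;
}
```

Because the agent is part of the key, the same prompt sent to two different agents stays as two candidates.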

### Actions

- **Promote to golden** — opens the existing `<PromoteAsGoldenButton>` pre-filled with the candidate's representative session (most recent feedback in the group). On submit success, the group's feedback rows flip to `status='applied'` via `PATCH /api/admin/evals/candidates` with `action='applied'`.
- **Dismiss** — `<ConfirmDialog>` → PATCH with `action='dismissed'`. Race-guarded by `.eq("status","pending")`.

### Architecture

`listCandidateGoldens()` issues exactly TWO DB roundtrips regardless of candidate count: one `feedback` + `agents` join, one batched `session_events` `.in(session_ids)` lookup. Group / sort / dedup happen in memory.

```
feedback (rating='negative', status='pending', src∈{response,session}, ...)
   │ join agents → agent slug
   ▼
session_events (.in(session_ids), event_type='user.message', order by sequence asc)
   │ take first per session, extractMessageText → prompt
   ▼
group by (agent_id, normalized_prompt) → CandidateGolden[]
```

`resolveCandidateGroup(feedbackIds, action)` issues ONE bulk UPDATE: `.update({status, reviewed_by_user_id, reviewed_at}).in("id", ids).eq("tenant_id", t).eq("status","pending").select("id")`. Returns `{ updated: N }`.
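The effect of the `.eq("status","pending")` guard is easiest to see with an in-memory stand-in for the UPDATE. This sketch does not touch a database; `resolveGroupInMemory` and the row shape are hypothetical and only mirror the filter chain described above.

```ts
type FeedbackStatus = "pending" | "reviewed" | "applied" | "dismissed";

// Hypothetical row shape; only the columns the guard cares about.
interface FeedbackRow {
  id: string;
  tenantId: string;
  status: FeedbackStatus;
}

// Mimics: UPDATE ... WHERE id IN (ids) AND tenant_id = t AND status = 'pending'.
// Rows that another admin already resolved are skipped, not overwritten.
function resolveGroupInMemory(
  rows: FeedbackRow[],
  tenantId: string,
  ids: string[],
  action: "dismissed" | "applied",
): { updated: number } {
  if (ids.length === 0) return { updated: 0 };
  let updated = 0;
  for (const row of rows) {
    const targeted = ids.indexOf(row.id) !== -1;
    if (targeted && row.tenantId === tenantId && row.status === "pending") {
      row.status = action;
      updated += 1;
    }
  }
  return { updated };
}
```

If Admin A applies a group and Admin B then dismisses the same IDs from a stale page, the second call matches no `pending` rows and reports `updated: 0`, so the first resolution stands.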

## API reference

### Server actions (`features/evals/server/candidates.ts`)

```ts
listCandidateGoldens(opts?: { limit?: number }): Promise<CandidateGolden[]>;
//  limit: clamped to [1, 100], default 20
//  Admin-gated, tenant-scoped, two DB roundtrips

resolveCandidateGroup(
  feedbackIds: string[],
  action: "dismissed" | "applied",
): Promise<{ updated: number }>;
//  Bulk UPDATE with race guard. Empty array short-circuits with updated:0.
```
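For reference, the `limit` clamp noted in the comments could look like the sketch below. `clampLimit` is a hypothetical helper, and treating non-finite input as "use the default" and flooring fractional input are assumptions, not documented behavior.

```ts
const DEFAULT_LIMIT = 20;
const MIN_LIMIT = 1;
const MAX_LIMIT = 100;

// Clamp to [1, 100]; fall back to 20 when no usable number is given.
function clampLimit(raw?: number): number {
  if (raw === undefined || !isFinite(raw)) return DEFAULT_LIMIT;
  return Math.min(MAX_LIMIT, Math.max(MIN_LIMIT, Math.floor(raw)));
}
```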

### Routes

| Method | Path                            | Body / Query                                   | Description                                 |
| ------ | ------------------------------- | ---------------------------------------------- | ------------------------------------------- |
| GET    | `/api/admin/evals/candidates`   | `?limit=N` (1-100, default 20)                 | List candidate prompts for the active tenant |
| PATCH  | `/api/admin/evals/candidates`   | `{ feedbackIds: uuid[1..200], action: enum }`  | Bulk resolve a candidate group              |
| GET    | `/api/admin/evals/golden`       | `?set=slug&limit=N`                            | List golden sessions                        |
| POST   | `/api/admin/evals/golden`       | `{ sessionId, set?, note?, criteriaSetId? }`   | Promote a completed session to golden       |
| POST   | `/api/admin/evals/runs`         | `{ goldenSessionId }`                          | Replay a single golden                      |
| POST   | `/api/admin/evals/runs/set`     | `{ setName }`                                  | Bulk-replay every golden in a set           |

## Design decisions

- **No new DB schema.** Candidates v1 reuses the existing `feedback` table. The status column already has `pending / reviewed / applied / dismissed` values; PR 3 just adds new transitions.
- **Anchor candidates on first `user.message` only.** Not `session.created` / `session.claimed` — those are machine triggers, not user prompts. Multi-turn anchoring (cherry-picking the rejected turn) is a v2 follow-up.
- **One PATCH endpoint with `action` discriminator.** Both Dismiss and the post-promote feedback-flip flow through the same route — keeps the API surface minimal.
- **Race guard on resolve.** `.eq("status","pending")` on every bulk UPDATE ensures Admin A's promote-flip survives Admin B's stale dismiss.
- **Layout — cap visible at 5.** The triage section sits above `<GoldenSessionManager>`, which is the primary surface admins return to. Capping keeps the manager above the fold.
- **Codex was quota-exhausted during spec review.** Claude ran the Codex-equivalent pattern review; Gemini provided the third opinion. Acceptable per `.claude/rules/multi-model-review.md` for a medium-risk, admin-only surface with no new schema.

## Knowledge-artifact evals

Golden-session replay scores agent **runs**; the loop closure-scorer
(`features/loops/server/closure-scorer.ts`) scores loop **wiring**. The
knowledge-artifact harness (`features/evals/knowledge/`) scores the third thing:
the **outputs** a knowledge loop produces — an evidence claim, a protocol, a
product-knowledge record — proving they are evidence-backed, schema-complete,
actionable, safety-labeled, renderable, and tenant-scoped. It is a deterministic
pure library (no DB, no LLM) so it runs in vitest and a CI scorecard.

### The six dimensions

A `KnowledgeRubric` scores a normalized `KnowledgeArtifact` across six axes
(snake_case so the judge bridge can build `${key}_score` keys):

| Dimension                 | Question                                                              |
| ------------------------- | -------------------------------------------------------------------- |
| `source_grounding`        | Is the claim backed by citable sources/signals, not asserted bare?   |
| `schema_completeness`     | Are the required + load-bearing fields populated?                    |
| `actionability`           | Can a human/agent act on it — concrete, specific, non-vague?         |
| `safety_human_gate`       | Is the safety posture + human gate correctly labeled (no auto-approve)? |
| `workspace_renderability` | Does it route to a real surface with a non-stub payload?             |
| `tenant_isolation`        | Is it scoped to its tenant with no cross-tenant leakage?             |

### Model: artifact → adapter → rubric → suite

- **`KnowledgeArtifact`** — a tenant entity (or agent output) projected to a
  tenant-agnostic shape (`fields`, `evidenceRefs`, `safety`, `render`).
- **Adapter** — a per-tenant function that maps a real entity to a
  `KnowledgeArtifact` (`evidenceClaimToArtifact`, `therapyPlanToArtifact`,
  `pfIdeaToArtifact`, `pfKnowledgeToArtifact`). This is the model-agnostic seam:
  the same rubric grades any entity, and a new type slots in behind a sibling
  adapter with no harness change. Sprinter `pf_knowledge` (the evidence-backed
  record the product-knowledge loop **produces**) is graded directly against its
  10 real seed records.
- **Rubric** — six dimensions, each composing a reusable check from
  `features/evals/knowledge/checks.ts` (`requireEvidenceRefs`, `requireFields`,
  `requireActionablePayload`, `requireSafetyLabel`, `requireRenderable`,
  `assertTenantScope`) against the tenant's real field names. A rubric SHOULD be
  a **lens over its `KnowledgeLoopDefinition`** (`features/loops/lib/knowledge-loop.ts`):
  the `pfKnowledgeRubric` derives its render surfaces, lifecycle statuses
  (`draft`/`published`), evidence fields, and review gate from the loop's
  declared `surfaces` / `records` / `reviewGate` — so the eval can never drift
  from what the loop declares.
- **Suite** — a rubric + labeled fixtures (good → PASS, bad → FAIL on one named
  dimension). `runKnowledgeEvalSuite` enforces two contracts: every fixture
  behaves as labeled, **and** every dimension has ≥1 failing fixture targeting
  it (proof the dimension is load-bearing).
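The two suite contracts can be sketched independently of the real harness. Everything named here (`LabeledFixture`, `checkSuiteContracts`, the `failedDimensions` callback) is hypothetical, and the sketch assumes a bad fixture must fail exactly its one named dimension; the actual `runKnowledgeEvalSuite` signature is not shown in this document.

```ts
// Hypothetical fixture label: no `failsDimension` means "expected to PASS".
interface LabeledFixture<A> {
  name: string;
  artifact: A;
  failsDimension?: string;
}

// Returns a list of contract violations; an empty list means the suite holds.
function checkSuiteContracts<A>(
  dimensions: string[],
  fixtures: LabeledFixture<A>[],
  failedDimensions: (artifact: A) => string[],
): string[] {
  const problems: string[] = [];
  const covered: Record<string, boolean> = {};
  for (const fixture of fixtures) {
    const failed = failedDimensions(fixture.artifact);
    if (fixture.failsDimension === undefined) {
      // Contract 1a: good fixtures pass every dimension.
      if (failed.length > 0) {
        problems.push(`${fixture.name}: expected PASS, failed ${failed.join(", ")}`);
      }
    } else {
      // Contract 1b: bad fixtures fail only the dimension they name.
      covered[fixture.failsDimension] = true;
      if (failed.length !== 1 || failed[0] !== fixture.failsDimension) {
        problems.push(`${fixture.name}: expected FAIL on ${fixture.failsDimension} only`);
      }
    }
  }
  // Contract 2: every dimension is targeted by at least one failing fixture.
  for (const dimension of dimensions) {
    if (covered[dimension] !== true) {
      problems.push(`${dimension}: no failing fixture targets this dimension`);
    }
  }
  return problems;
}
```

The second loop is what makes a dimension load-bearing: a rubric axis that no fixture can fail is reported as a gap even when every fixture behaves as labeled.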

No parallel systems: `scoreKnowledgeArtifact` mirrors `scoreLoopClosureFromFacts`,
emits the existing `RubricScoreResult` surface shape
(`knowledgeResultToRubricScoreResult`), and bridges the same rubric to the live
`runRubricJudge` LLM path via `knowledgeRubricToCriteriaDimensions`. The harness
re-exports `KnowledgeSafetyClass` from the loop primitive rather than redefining
it — one safety taxonomy across the loop and its evals.

### Source grounding is due at publish, not at draft

`requireEvidenceRefs({ draftStatuses })` exempts in-progress records from the
grounding floor — a research-KG population loop drafts a record first and grounds
it before publish. A draft with zero sources passes `source_grounding`; the same
record, once it reaches a `finalizedStatuses` value, MUST be sourced. This is how
the pf_knowledge suite grades real `draft` seeds as trustworthy while still
failing a `published` record that cites nothing.
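A minimal sketch of the draft exemption, assuming a simplified artifact with only a `status` and a list of `evidenceRefs`. `passesSourceGrounding` is a hypothetical helper for illustration and is not the signature of `requireEvidenceRefs`.

```ts
// Hypothetical, simplified artifact shape.
interface GroundableArtifact {
  status: string;
  evidenceRefs: string[];
}

// Drafts are exempt from the grounding floor; anything else must cite a source.
function passesSourceGrounding(
  artifact: GroundableArtifact,
  draftStatuses: string[],
): boolean {
  if (draftStatuses.indexOf(artifact.status) !== -1) return true;
  return artifact.evidenceRefs.length > 0;
}
```

With `draftStatuses = ["draft"]`, an unsourced `draft` passes while the same record marked `published` fails, which is the behavior the pf_knowledge suite relies on.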

### Renderability verification (do published surfaces resolve?)

The `workspace_renderability` dimension trusts a `knownSurfaces` list. To prove
those surfaces are real (not stubs), `surface-resolution.ts` resolves a
`KnowledgeLoopDefinition`'s declared `surfaces[]` against injected per-kind
resolvers (`resolveKnowledgeSurfaces` / `unresolvedKnowledgeSurfaces`), and
`makeSurfaceProbe(resolveSurface)` turns a resolver into a `RenderabilityProbe`
that `requireRenderable({ probe })` treats as authoritative. The per-tenant
`renderability.test.ts` files wire real resolvers — Sprinter page surfaces →
`TenantModule.pages` component map; DOC'S workspace / entity-list surfaces →
declared workspaces + entity types (+ platform system types) — and assert every
loop-declared surface resolves. (PR #2584 shipped no `validateManifestRenderability`;
this bridge is the deterministic, CI-safe proof in its place.)
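The resolution step can be pictured as a lookup of each declared surface against a resolver for its kind. This is a sketch under assumptions: `DeclaredSurface`, `SurfaceResolver`, and `findUnresolvedSurfaces` are hypothetical names, and the real `resolveKnowledgeSurfaces` / `unresolvedKnowledgeSurfaces` signatures are not reproduced here.

```ts
// Hypothetical surface declaration: a kind (e.g. "page") plus an identifier.
interface DeclaredSurface {
  kind: string;
  id: string;
}

// A resolver answers: does this surface exist in the real tenant wiring?
type SurfaceResolver = (surface: DeclaredSurface) => boolean;

// Returns every declared surface that has no resolver for its kind
// or whose resolver reports it as missing.
function findUnresolvedSurfaces(
  surfaces: DeclaredSurface[],
  resolvers: Record<string, SurfaceResolver>,
): DeclaredSurface[] {
  return surfaces.filter((surface) => {
    const resolve = resolvers[surface.kind];
    return resolve === undefined || !resolve(surface);
  });
}
```

A per-tenant test in this style would inject resolvers backed by the tenant's real page map or workspace list and assert that the returned array is empty.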

### Running the evals

```bash
# Contract tests (good→pass, bad→fail-on-dimension, full coverage, schema drift)
pnpm exec vitest run features/evals/knowledge \
  features/custom/tenants/docs/evals features/custom/tenants/sprinter/evals

# Deterministic scorecard — exit 0 = all pass, 1 = a suite failed
pnpm evals:knowledge            # human-readable
pnpm evals:knowledge --json     # machine-readable
pnpm evals:knowledge --out DIR  # write summary.json + summary.md
```

### Adding a tenant suite (extension recipe)

1. Create `features/custom/tenants/<slug>/evals/`.
2. Write an **adapter** mapping your entity's real fields to a `KnowledgeArtifact`.
3. Write a **rubric** composing the six dimensions from the reusable checks,
   keyed on your entity's real `json_schema` field names.
4. Write **fixtures**: good artifacts that PASS + one bad artifact per dimension
   (`failsDimension`). Real seed records make the best good fixtures.
5. Export a `KnowledgeEvalSuite` and add a co-located `*.test.ts` that runs it
   plus a **drift guard** asserting your rubric only references fields the real
   entity-type schema declares.
6. Register the suite in `scripts/evals/knowledge-scorecard.ts`. Step 5's
   co-located test runs in CI automatically; step 6 is what surfaces the suite
   in `pnpm evals:knowledge` — skip it and the suite is CI-covered but invisible
   in the scorecard.

The DOC'S/Praxium (`evidence-claim`, `therapy-plan`) and Sprinter (`pf_idea`,
`pf_knowledge`) suites are the worked examples. `pf_knowledge` is the canonical
loop-derived rubric; `renderability.test.ts` in each tenant's `evals/` is the
worked surface-resolution proof.

## Related modules

- `features/feedback/**` — the unified feedback table that powers Candidates as a query view
- `features/responses/**` — `CriteriaSetDimension` + `computeResponseScore`; the rubric LLM-judge path and the knowledge-rubric criteria bridge
- `features/loops/**` — `scoreLoopClosureFromFacts` (the facts→score→gaps pattern the knowledge scorer mirrors)
- `features/sessions/**` — `session_events` is the source of truth for candidate prompts
- `features/context/**` — frozen `shared_context` baked into `GoldenSessionSnapshot`
- `features/custom/tenants/{docs,sprinter}/evals/**` — the tenant knowledge-eval suites

## For agents

When triaging candidates programmatically:

1. `GET /api/admin/evals/candidates` to fetch the queue.
2. For each candidate, `POST /api/admin/evals/golden { sessionId: representativeSessionId, set, criteriaSetId? }` to promote.
3. `PATCH /api/admin/evals/candidates { feedbackIds, action: 'applied' }` to flip the group's feedback rows out of `pending`.
4. Or `PATCH ... action: 'dismissed'` to drop a candidate.

All routes are admin-gated. Background jobs should call the server actions directly (`listCandidateGoldens`, `resolveCandidateGroup`) under `withToolContext({ tenantId })`.
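The steps above can be sketched as a pure planning function that turns one triage decision into the ordered HTTP requests to send. The routes and body fields come from this page; `CandidateSummary`, `Decision`, `PlannedRequest`, and `planTriage` are hypothetical names, and sending the requests (plus auth) is left to the caller.

```ts
// Hypothetical minimal view of a candidate returned by the GET route.
interface CandidateSummary {
  representativeSessionId: string;
  feedbackIds: string[];
}

type Decision =
  | { kind: "promote"; set: string; criteriaSetId?: string }
  | { kind: "dismiss" };

interface PlannedRequest {
  method: "POST" | "PATCH";
  path: string;
  body: Record<string, unknown>;
}

// Promote = POST golden, then PATCH 'applied'. Dismiss = PATCH 'dismissed'.
function planTriage(candidate: CandidateSummary, decision: Decision): PlannedRequest[] {
  const candidatesPath = "/api/admin/evals/candidates";
  if (decision.kind === "dismiss") {
    return [
      {
        method: "PATCH",
        path: candidatesPath,
        body: { feedbackIds: candidate.feedbackIds, action: "dismissed" },
      },
    ];
  }
  const goldenBody: Record<string, unknown> = {
    sessionId: candidate.representativeSessionId,
    set: decision.set,
  };
  if (decision.criteriaSetId !== undefined) {
    goldenBody["criteriaSetId"] = decision.criteriaSetId;
  }
  return [
    { method: "POST", path: "/api/admin/evals/golden", body: goldenBody },
    {
      method: "PATCH",
      path: candidatesPath,
      body: { feedbackIds: candidate.feedbackIds, action: "applied" },
    },
  ];
}
```

When executing a promote plan, send the PATCH only after the POST succeeds, matching the UI flow in which feedback rows flip to `applied` on submit success.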
