Unified Agent Intelligence
Consolidate feedback intake, self-improvement, admin review, and eval systems into a thin harness around two primitives — feedback inputs and shared_context outputs — that turn every agent interaction into durable workspace intelligence.
## Problem
Amble has six parallel channels for capturing agent feedback, and only one of them (corrections and lessons) closes the loop into durable workspace intelligence. User memories close the loop for a single user only; the rest are either write-only dead ends or one-shot fixes that never generalize into workspace knowledge.
**What exists today:**
| Channel | Writes to | Reads from | End-to-end? |
|---|---|---|---|
| Corrections / Lessons | `shared_context` JSON blobs via `addCorrection`/`addLesson` tools | Every agent prompt via `loadSharedContextPrompt()` | YES |
| User memories | `user_memories` via `saveMemory` tool | User-scoped prompt injection | YES (user-scoped only) |
| Chat feedback | `chat_feedback` via thumbs-up/down buttons | Nothing | **NO** |
| Response scoring | `entity_responses` via form submission | Aggregation views, field promotion | Partial (no learning loop) |
| Extraction rejection | Inngest event → `feedback-rerun.ts` | One-shot retry with feedback instruction | **Limited** (no durable lesson) |
| Session events | `session_events` via append | Audit trail only | **NO** |
**The core problems:**
1. **Fragmented intake.** Feedback on agent output lives in four places (`chat_feedback`, `entity_responses.status`, Inngest rejection events, implicit in `session_events`). No unified read path means no systematic review.
2. **Brittle storage.** Corrections and lessons are stored as JSON arrays inside a key/value `shared_context` table. Individual items cannot be edited, deactivated, or audited without rewriting the whole array. The 8-item prompt cap is treated as a data cap — any 9th lesson falls off forever.
3. **No review pipeline.** Nothing processes `chat_feedback`, response rejections, or session failures into durable learnings. The one system that does learn durably (corrections/lessons) requires agents to proactively call tools — there is no review loop.
4. **No eval system.** Extraction accuracy, agent output quality, and regression detection have no golden-set mechanism. "Did this change improve or regress agent behavior on canonical scenarios?" cannot be answered. Today, scoring only covers entity-response field values — chat behavior, tool-call sequences, and full-session outcomes have no reference set.
5. **Surface sprawl.** `/sessions` is a user-facing page that exposes raw execution machinery customers don't need. PR 707 proposed `/admin/conversations` as another standalone surface (not yet merged to dev). Admin has no single place to review agent work, manage workspace knowledge, or evaluate quality.
This is vibecoded sprawl. It ships features that look like feedback loops but don't close. The fix is to collapse the surface to a thin harness — two tables, one review task, one admin section — and let the durable knowledge (workspace rules, learned patterns, eval results) become the asset that grows over time.
## Goals
1. **Unify feedback intake.** All agent-output evaluations (chat thumbs, response rejections, extraction rejections, session failures, admin observations) flow into a single `feedback` table with consistent shape.
2. **Normalize workspace knowledge.** Corrections, lessons, routing, insights, and guidelines become individually-addressable rows in `shared_context`, not JSON array items. Each carries provenance, scope, active state, and lifecycle metadata.
3. **Close the self-improvement loop.** A scheduled feedback review agent reads pending feedback, identifies patterns, and generates durable `shared_context` entries. Human admins can review, edit, approve, or dismiss.
4. **Add session-centric golden-set evals.** Any completed session (chat, extraction, response, tool, mixed) can be flagged as golden. Replay its inputs against a candidate agent or config, compare the new session's tool calls, outputs, and mutations to the golden, and report divergence. Persona-driven canonical sessions become regression coverage for user journeys.
5. **Consolidate admin surfaces.** One `/admin/agent-review` section provides the conversation browser, sessions browser, feedback review, shared_context management, and eval reporting. PR 707's conversation-browser work, if still pending, gets rebased onto this structure rather than shipped standalone.
6. **Remove user-facing surfaces that customers don't need.** `/sessions` redirects to `/tasks`. Session detail remains for deep drill-down from admin and entity panels.
## Non-goals
- Not rebuilding the agent prompt construction path — `loadSharedContextPrompt()` keeps its signature; internals change.
- Not changing `user_memories` — user-scoped memory stays as-is, separate from workspace knowledge.
- Not changing `entity_responses` or `criteria_sets` — the response/scoring system is sound, just gets wired into feedback generation.
- Not changing `session_events` — it stays append-only telemetry; feedback is separate.
- Not introducing a new primitive for evals — golden sets are sessions flagged as `metadata.golden`, replayed through the existing session executor, and compared via a new `features/evals/lib/compare.ts` dispatched by `session_type`.
## Design
### Two tables, purpose-built
**`feedback` (new) — inputs to the self-improvement system.**
Replaces `chat_feedback` and absorbs all signals that evaluate agent output. Lightweight, append-mostly, has a processing lifecycle.
```sql
CREATE TABLE feedback (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
  -- Source of the feedback
  source_type text NOT NULL CHECK (source_type IN (
    'chat', 'response', 'extraction', 'tool', 'session', 'observation'
  )),
  rating text NOT NULL CHECK (rating IN ('positive', 'negative', 'neutral')),
  comment text,
  -- Links to the agent output being evaluated
  session_id uuid REFERENCES sessions(id) ON DELETE SET NULL,
  entity_id uuid REFERENCES entities(id) ON DELETE SET NULL,
  agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  context jsonb NOT NULL DEFAULT '{}'::jsonb,
  -- context carries source-specific metadata:
  --   chat: { chat_id, message_index }
  --   response: { response_id, criteria_set_id, field_name }
  --   extraction: { field_name, entity_type_id, rejected_value }
  --   tool: { tool_slug, tool_run_id }
  --   session: { failure_reason, event_id }
  -- Processing lifecycle
  status text NOT NULL DEFAULT 'pending' CHECK (status IN (
    'pending', 'reviewed', 'applied', 'dismissed'
  )),
  reviewed_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  reviewed_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,
  reviewed_at timestamptz,
  review_notes text,
  created_by uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_feedback_tenant_status ON feedback(tenant_id, status, created_at DESC);
CREATE INDEX idx_feedback_session ON feedback(session_id) WHERE session_id IS NOT NULL;
CREATE INDEX idx_feedback_entity ON feedback(entity_id) WHERE entity_id IS NOT NULL;
CREATE INDEX idx_feedback_agent ON feedback(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_feedback_pending ON feedback(tenant_id, created_at DESC) WHERE status = 'pending';
```
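The per-source `context` shapes listed in the SQL comments could be mirrored in application code as a discriminated union, so that a row carrying the wrong keys for its `source_type` fails at compile time. The following is a minimal sketch under stated assumptions: `FeedbackInput`, `FeedbackRow`, and `buildFeedbackRow` are hypothetical names, `field_name` and `event_id` are treated as optional because the intake descriptions omit them, and the `observation` context is left open because the schema does not specify it.

```typescript
// Hypothetical types mirroring the `feedback` DDL; not an existing module.
type Rating = "positive" | "negative" | "neutral";

type FeedbackInput =
  | { source_type: "chat"; context: { chat_id: string; message_index: number } }
  | { source_type: "response"; context: { response_id: string; criteria_set_id: string; field_name?: string } }
  | { source_type: "extraction"; context: { field_name: string; entity_type_id: string; rejected_value: unknown } }
  | { source_type: "tool"; context: { tool_slug: string; tool_run_id: string } }
  | { source_type: "session"; context: { failure_reason: string; event_id?: string } }
  | { source_type: "observation"; context: Record<string, unknown> }; // shape not specified in the schema

type FeedbackRow = FeedbackInput & {
  tenant_id: string;
  rating: Rating;
  comment: string | null;
  session_id: string | null;
  status: "pending" | "reviewed" | "applied" | "dismissed";
  created_by: string;
};

// Builds an insertable row; new feedback starts in the 'pending' state.
function buildFeedbackRow(
  input: FeedbackInput,
  base: { tenant_id: string; rating: Rating; created_by: string; comment?: string; session_id?: string },
): FeedbackRow {
  return {
    ...input,
    tenant_id: base.tenant_id,
    rating: base.rating,
    comment: base.comment ?? null,
    session_id: base.session_id ?? null,
    status: "pending",
    created_by: base.created_by,
  };
}
```

With this shape, a chat thumbs-down would be built as `buildFeedbackRow({ source_type: "chat", context: { chat_id, message_index } }, { tenant_id, rating: "negative", created_by })`.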
**`shared_context` (restructured) — outputs of the self-improvement system.**
Keeps its current name. Drops the `key`/`value` JSON-blob shape. Each correction, lesson, insight, or guideline becomes its own row with individual lifecycle and provenance.
```sql
-- New structure (requires drop + recreate, see migration section)
CREATE TABLE shared_context (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
  -- What kind of knowledge
  type text NOT NULL CHECK (type IN (
    'correction', 'lesson', 'routing', 'insight', 'guideline'
  )),
  content text NOT NULL,
  context text, -- when this applies (e.g., "when extracting valuation for PE deals")
  -- Scoping (NULL = tenant-wide; set for narrower scope)
  agent_id uuid REFERENCES agents(id) ON DELETE CASCADE,
  entity_type_id uuid REFERENCES entity_types(id) ON DELETE CASCADE,
  -- Lifecycle
  active boolean NOT NULL DEFAULT true,
  -- Provenance
  source_feedback_id uuid REFERENCES feedback(id) ON DELETE SET NULL,
  created_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  created_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,
  metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_shared_context_tenant_type_active ON shared_context(tenant_id, type, active, created_at DESC);
CREATE INDEX idx_shared_context_agent ON shared_context(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_shared_context_entity_type ON shared_context(entity_type_id) WHERE entity_type_id IS NOT NULL;
CREATE INDEX idx_shared_context_source_feedback ON shared_context(source_feedback_id) WHERE source_feedback_id IS NOT NULL;
```
Why keep the name `shared_context`:
- Zero rename churn in code, tests, and docs that reference it.
- Semantically accurate: context shared across all agent runs for this tenant.
- `loadSharedContextPrompt()` keeps its signature — only internals change.
- `CONTEXT_KEYS` is removed; types are now enum values on rows, not JSON keys.
### The feedback-to-knowledge pipeline
```
Agent produces output
↓
Human/system evaluates (thumbs, reject, score, flag)
↓
feedback row inserted (status='pending')
↓
Feedback Review Agent (scheduled task, hourly)
↓
Reads pending feedback, groups by pattern
↓
Decides action per feedback:
• Create shared_context row (correction/lesson/insight)
• Update agent config (prompt tweaks, tool groups)
• Update entity/entity_type (structural fix)
• Dismiss (noise, already addressed, user error)
↓
Marks feedback as applied/dismissed with review_notes
↓
loadSharedContextPrompt() reads active rows, formats by type,
injects top N per type into every future agent prompt
```
### Feedback intake points
Existing channels route into the `feedback` table:
- **Chat thumbs (existing UI)** — `features/chat/components/message-actions-bar.tsx` writes to `feedback` with `source_type='chat'`, `context={chat_id, message_index}`. `chat_feedback` table backfilled and dropped.
- **Response rejection** — When a response status transitions to `rejected` with a `rejection_reason`, `submitResponseAdmin()` writes a `feedback` row with `source_type='response'`, `context={response_id, criteria_set_id}`. The existing `extraction/result-rejected` Inngest event continues to fire.
- **Extraction rejection** — `feedback-rerun.ts` continues to run the one-shot retry and also writes a `feedback` row with `source_type='extraction'`, `rating='negative'`, `status='applied'` (the retry itself is the application). Because these rows never enter `pending`, the review agent has to list `applied` extraction feedback explicitly to pick up patterns across rejections.
- **Tool results** — Agents and admins can call a new `addFeedback` tool to record observations about tool outputs.
- **Session failures** — Sessions that transition to `failed` status automatically generate a `feedback` row with `source_type='session'`, `rating='negative'`, `context={failure_reason}`.
- **Admin observation** — Admins can add a feedback row manually from any session or conversation via the review UI ("Flag this response / message / session").
### Workspace knowledge authoring
`shared_context` rows are written by:
- **`addCorrection` tool** — Updated to `INSERT` a `type='correction'` row. Existing parameters are unchanged: `(content: string, context?: string)`. It additionally accepts optional `agentId` and `entityTypeId` for scoping, so existing callers keep working.
- **`addLesson` tool** — Updated to `INSERT` a `type='lesson'` row. Signature unchanged.
- **New `addInsight` tool** — For agent observations that aren't corrections or lessons but represent durable patterns.
- **Admin UI** — Direct CRUD for all five types (correction, lesson, routing, insight, guideline) via `/admin/agent-review/knowledge`. Routing and guideline rows are admin-authored today; agents can propose them via `addLesson` and admins can re-type as needed.
- **Feedback review agent** — Writes rows via the tools above, setting `source_feedback_id` for provenance.
`shared_context` rows are read by:
- **`loadSharedContextPrompt(admin, tenantId, { agentId?, entityTypeId? })`** — Same signature, new internals. Queries active rows, filters by scope (global + agent-specific + entity-type-specific where relevant), ranks by recency and usage, returns top N per type formatted as prompt sections.
- **Admin UI** — list, detail, edit, deactivate, delete.
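The scope filter and per-type cap described for `loadSharedContextPrompt()` could look roughly like the sketch below. It is illustrative only: `selectForPrompt`, `inScope`, and `formatPrompt` are hypothetical helper names, ranking uses recency alone (the schema above has no usage column), the default cap of 8 is borrowed from the existing prompt cap, and the output format is invented for the example.

```typescript
type SharedContextType = "correction" | "lesson" | "routing" | "insight" | "guideline";

interface SharedContextRow {
  id: string;
  type: SharedContextType;
  content: string;
  context: string | null;
  agent_id: string | null;       // null = not scoped to an agent
  entity_type_id: string | null; // null = not scoped to an entity type
  active: boolean;
  created_at: string;            // ISO-8601 UTC, so string order equals time order
}

interface Scope { agentId?: string; entityTypeId?: string }

type Selection = Partial<Record<SharedContextType, SharedContextRow[]>>;

// A row applies when each scope column is either NULL or matches the caller.
function inScope(row: SharedContextRow, scope: Scope): boolean {
  const agentOk = row.agent_id === null || row.agent_id === scope.agentId;
  const typeOk = row.entity_type_id === null || row.entity_type_id === scope.entityTypeId;
  return agentOk && typeOk;
}

// Keeps the newest `perType` active, in-scope rows for each type.
// Rows beyond the cap stay in the table; they are only left out of the prompt.
function selectForPrompt(rows: SharedContextRow[], scope: Scope, perType = 8): Selection {
  const selected: Selection = {};
  const newestFirst = rows
    .filter((r) => r.active && inScope(r, scope))
    .sort((a, b) => b.created_at.localeCompare(a.created_at));
  for (const row of newestFirst) {
    const bucket = selected[row.type] ?? [];
    if (bucket.length < perType) bucket.push(row);
    selected[row.type] = bucket;
  }
  return selected;
}

// Hypothetical prompt format: one section per type, one bullet per row.
function formatPrompt(selected: Selection): string {
  const sections: string[] = [];
  for (const type of Object.keys(selected) as SharedContextType[]) {
    const lines = (selected[type] ?? []).map((r) =>
      r.context ? `- ${r.content} (applies: ${r.context})` : `- ${r.content}`,
    );
    sections.push(`## ${type}\n${lines.join("\n")}`);
  }
  return sections.join("\n\n");
}
```

The point of the sketch is that the cap applies at read time, so a ninth lesson is deprioritized rather than lost.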
### Feedback Review Agent
A system agent, registered once and available to every tenant via `is_system=true` (per existing agent conventions), that runs on a scheduled heartbeat for each tenant that opts in.
**Configuration (default):**
- Heartbeat cron: hourly per tenant (skipped when no pending feedback exists)
- Tool groups: `context`, `entities-read`, `agents-admin`, `feedback-review`
- Instructions (base prompt):
> You are the Feedback Review Agent. Every hour, review pending feedback in this workspace.
> For each feedback or group of related feedback:
> 1. Read the source context (chat, entity, session) to understand what happened
> 2. Decide if this represents a durable pattern or a one-off
> 3. If durable: create a correction, lesson, or insight via addCorrection/addLesson/addInsight with source_feedback_id provenance
> 4. If structural: consider updating agent config or entity type schema
> 5. If noise: dismiss with review_notes explaining why
> 6. Mark feedback as applied or dismissed
> Prefer to dismiss weak signals rather than pollute shared_context with low-value rules.
**Tools (new):**
- `listFeedback({ status, sourceType?, agentId?, limit })` — paginate pending feedback
- `getFeedback(id)` — detail view with linked source context
- `reviewFeedback(id, action: 'applied' | 'dismissed', notes: string)` — mark processed
- `addInsight(content, context?)` — new shared_context type for observations
- `updateSharedContext(id, { active?, content?, context? })` — tune existing rules
- `deactivateSharedContext(id, reason)` — soft-delete a stale rule
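The status change behind `reviewFeedback` could be modeled as a small pure function. This is a sketch, not the tool's implementation: `applyReview` is a hypothetical name, and two rules in it are assumptions rather than requirements stated here, namely that `applied` and `dismissed` are terminal and that review notes must be non-empty.

```typescript
type FeedbackStatus = "pending" | "reviewed" | "applied" | "dismissed";
type ReviewAction = "applied" | "dismissed";

interface ReviewableFeedback {
  id: string;
  status: FeedbackStatus;
  reviewed_by_agent_id: string | null;
  reviewed_at: string | null;
  review_notes: string | null;
}

// Returns an updated copy of the feedback row; the input is not mutated.
function applyReview(
  fb: ReviewableFeedback,
  action: ReviewAction,
  notes: string,
  reviewerAgentId: string,
  now: string, // ISO timestamp supplied by the caller, which keeps the function deterministic
): ReviewableFeedback {
  // Assumption: once applied or dismissed, a row is not reviewed again.
  if (fb.status === "applied" || fb.status === "dismissed") {
    throw new Error(`feedback ${fb.id} is already ${fb.status}`);
  }
  // Assumption: every decision must carry an explanation for the audit trail.
  const trimmed = notes.trim();
  if (trimmed === "") {
    throw new Error("review notes are required");
  }
  return {
    ...fb,
    status: action,
    reviewed_by_agent_id: reviewerAgentId,
    reviewed_at: now,
    review_notes: trimmed,
  };
}
```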
**Guardrails:**
- The review agent runs under a permission role that can read feedback and other workspace data but cannot delete either
- It cannot create new agents or delete entity types
- All actions are logged in `session_events` for audit
- Admins can disable the heartbeat and review feedback manually
### Admin "Agent Review" section
Replaces `/admin/conversations` (PR 707) and absorbs `/sessions`. Lives at `/admin/agent-review` under the "AI & Agents" group in `ADMIN_SECTIONS`.
**Tabs:**
1. **Conversations** — Chat observability. Queries `chats` + `messages` + `feedback` table. Browse all chats, view transcripts, see thumbs and feedback inline, link to related sessions. Admin can add observation feedback directly from messages. (If PR 707 lands concurrently, its implementation is rebased onto this structure and its `/admin/conversations` route redirects here.)
2. **Sessions** — Execution browser for all session types (agent/response/tool/mixed). Moved from user-facing `/sessions` page. Filter by status, type, agent, date. Link to session detail transcript.
3. **Feedback** — Unified feed across all sources. Filter by status (pending/reviewed/applied/dismissed), source_type, rating, agent. Click through to source (chat message, entity, session). Bulk actions: mark reviewed, dismiss with shared note.
4. **Knowledge** — Manage `shared_context` rows. Tabs within: corrections, lessons, routing, insights, guidelines. Each row shows content, context, scope (agent/entity_type), active state, provenance (which feedback generated it, which agent wrote it). CRUD: add, edit, deactivate, delete.
5. **Evals** — Golden set management and accuracy reports. List of eval sets (named groups of golden sessions, keyed by `metadata.golden.set`), per-set accuracy dashboards, trend charts, drill-down into specific failed cases, link to trigger re-runs.
### Eval system (session-centric golden sets)
**Any session can be a golden case.** An agent session (chat, extraction, response, tool, mixed) captures inputs, execution (all tool calls via session_events), and outputs. Promoting a session as "golden" turns it into a reusable regression test: replay the same inputs against a new agent or config, compare the new session's behavior to the golden one, and report divergence.
This generalizes from "entity extraction accuracy" to "does the agent still behave the way we want for this scenario." Response-level evals (compare extracted field values to expected values) become one case; chat behavior (did the agent use the right tools, produce the right summary) and tool-session behavior (did it call the right API with the right params) are all covered by the same mechanism.
**Defining a golden session:**
- Admin reviews a completed session in `/admin/agent-review/sessions` or `/admin/agent-review/conversations`
- Admin clicks "Promote as golden" on a session whose outputs they validated as correct
- System writes `sessions.metadata.golden = { set: '<set-name>', promoted_at, promoted_by, snapshot }` where `snapshot` captures the reproducible inputs (agent_id, initial message, entity_id, view_context, tool_grants, etc.)
- Golden sessions are immutable: `sessions.status` is frozen to `completed` and the session cannot be deleted while golden (soft-enforced in server actions + RLS)
- Golden sessions can be organized into named sets (`'pe-extraction'`, `'chat-smoke'`, `'tool-accuracy'`) via `metadata.golden.set`
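The promotion step and the action-layer guard could be sketched as two pure functions. The names `promoteAsGolden` and `assertMutable` are hypothetical, and the snapshot fields are limited to the ones named above; the real snapshot would carry whatever the session executor needs to reproduce the run.

```typescript
interface GoldenSnapshot {
  agent_id: string;
  initial_message?: string;
  entity_id?: string;
  view_context?: unknown;
  tool_grants?: string[];
}

interface GoldenMeta {
  set: string;
  promoted_at: string;
  promoted_by: string;
  snapshot: GoldenSnapshot;
}

interface SessionRecord {
  id: string;
  status: string;
  metadata: { golden?: GoldenMeta; [key: string]: unknown };
}

// Writes metadata.golden on a copy of the session; only completed sessions qualify.
function promoteAsGolden(
  session: SessionRecord,
  set: string,
  promotedBy: string,
  snapshot: GoldenSnapshot,
  now: string,
): SessionRecord {
  if (session.status !== "completed") {
    throw new Error(`session ${session.id} is ${session.status}; only completed sessions can be promoted`);
  }
  const golden: GoldenMeta = { set, promoted_at: now, promoted_by: promotedBy, snapshot };
  return { ...session, metadata: { ...session.metadata, golden } };
}

// Guard intended to run at the top of delete and status-change server actions.
function assertMutable(session: SessionRecord, op: "delete" | "status-change"): void {
  if (session.metadata.golden) {
    throw new Error(`cannot ${op} session ${session.id} while it is golden`);
  }
}
```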
**Running an eval:**
- Eval run triggered from admin UI, scheduled, or on post-deploy hook
- For each golden session in the target set:
  1. Read the input snapshot from `sessions.metadata.golden.snapshot`
  2. Create a new session with those inputs (possibly using a different agent_id to test a candidate config)
  3. Execute the session to completion via the session executor
  4. Compare the new session's events + outputs to the golden session's events + outputs
- The run writes comparison results to `sessions.metadata.eval_result` on the replay session, linking back to the golden via `metadata.eval_source = golden_session_id`
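Steps 1 and 2 of the replay loop, plus the back-link to the golden session, can be expressed as a pure planning function. This is only a sketch: `buildReplayRequests` and `ReplayRequest` are hypothetical names, the snapshot is reduced to three fields, and executing the session and comparing results are left to the session executor and `compare.ts`.

```typescript
interface Snapshot {
  agent_id: string;
  initial_message?: string;
  entity_id?: string;
}

interface GoldenSession {
  id: string;
  metadata: { golden: { set: string; snapshot: Snapshot } };
}

interface ReplayRequest extends Snapshot {
  metadata: { eval_source: string }; // id of the golden session being replayed
}

// Produces one replay request per golden session in the target set.
// Passing candidateAgentId swaps in the agent under test; otherwise the
// original agent is replayed.
function buildReplayRequests(
  goldens: GoldenSession[],
  set: string,
  candidateAgentId?: string,
): ReplayRequest[] {
  return goldens
    .filter((g) => g.metadata.golden.set === set)
    .map((g) => ({
      ...g.metadata.golden.snapshot,
      agent_id: candidateAgentId ?? g.metadata.golden.snapshot.agent_id,
      metadata: { eval_source: g.id },
    }));
}
```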
**Comparison logic (`features/evals/lib/compare.ts`), dispatched by session_type:**
```ts
type ComparisonResult = {
  session_type: SessionType
  dimensions: Record<string, DimensionResult> // field-by-field comparison
  overall_accuracy: number // 0..1
  divergences: Divergence[] // ordered list of mismatches for drill-down
}
// agent | chat sessions: compare tool-call sequence + final message
// - tool_calls: set overlap (order-insensitive) with expected tools
// - tool_args: per-tool, compare input args via jsonDiff
// - final_message: semantic similarity score (optional) + keyword presence
// - entities_created: compare what entities the session created
// - entities_updated: compare which fields were mutated
// response sessions: compare values per criteria dimension
// - numeric: tolerance-based match
// - text: exact or semantic similarity
// - select: exact match
// - relation-rank: order-aware set comparison
// tool sessions: compare tool output and any entity mutations
// - output: jsonDiff against golden output
// - entity mutations: same as agent
// extraction tasks (session_type='agent' with output_type='field'|'fields'):
// - submitResponse values: dimension-by-dimension comparison
// - field metadata: source attribution, confidence
```
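Two of the dimensions named in the comments, the order-insensitive tool-call overlap and the tolerance-based numeric match, are simple enough to sketch directly. The choices below are assumptions, not decisions recorded in this spec: overlap is scored as a Jaccard index, numeric tolerance is relative with a 1% default, and `DimensionResult` is given a minimal hypothetical shape.

```typescript
// Hypothetical minimal shape; the real DimensionResult may carry more detail.
interface DimensionResult {
  passed: boolean;
  score: number; // 0..1
}

// Removes duplicates while keeping first-seen order.
function unique(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

// Order-insensitive overlap of tool names: |A and B| / |A or B| (Jaccard index).
function compareToolCalls(golden: string[], replay: string[], threshold = 1): DimensionResult {
  const a = unique(golden);
  const b = unique(replay);
  const shared = a.filter((t) => b.indexOf(t) !== -1).length;
  const unionSize = a.length + b.length - shared;
  if (unionSize === 0) return { passed: true, score: 1 }; // neither run called any tool
  const score = shared / unionSize;
  return { passed: score >= threshold, score };
}

// Relative-tolerance numeric match; the score falls off linearly with relative error.
function compareNumeric(golden: number, replay: number, relTol = 0.01): DimensionResult {
  const scale = Math.max(Math.abs(golden), Math.abs(replay));
  if (scale === 0) return { passed: true, score: 1 }; // both values are zero
  const diff = Math.abs(golden - replay);
  return { passed: diff <= relTol * scale, score: Math.max(0, 1 - diff / scale) };
}
```

A Jaccard score ignores how many times a tool was called; if call counts matter for a scenario, a multiset comparison would be the stricter alternative.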
**Scoring and reporting:**
- Per dimension: pass/fail + similarity score
- Per session: overall accuracy % (weighted by dimension importance)
- Per agent: accuracy across all golden sessions in a set
- Per set: accuracy trend over time (chart)
- Admin UI shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
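The per-session figure could be computed as a weighted mean of dimension scores. This is one possible reading of "weighted by dimension importance": the function name `overallAccuracy`, the default weight of 1 for unlisted dimensions, and the choice to return 0 when there are no dimensions are all assumptions made for the sketch.

```typescript
interface DimensionResult {
  passed: boolean;
  score: number; // 0..1
}

// Weighted mean of dimension scores. Dimensions without an explicit weight count as 1.
function overallAccuracy(
  dimensions: Record<string, DimensionResult>,
  weights: Record<string, number> = {},
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const name of Object.keys(dimensions)) {
    const weight = weights[name] ?? 1;
    weightedSum += weight * dimensions[name].score;
    totalWeight += weight;
  }
  // Assumption: a session with nothing to compare counts as unverified, not as a pass.
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}
```

For example, a session with a perfect `tool_calls` score weighted 3 and a `final_message` score of 0.5 weighted 1 works out to (3 × 1 + 1 × 0.5) / 4 = 0.875.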
**Self-improvement integration:**
- When a replay fails on a golden session, the divergence becomes a feedback row (`source_type='session'`, context includes golden_session_id and replay_session_id)
- Feedback review agent picks it up, analyzes the divergence, generates a correction or adjusts agent config
- Accuracy-trend dashboard surfaces regressions within the admin review section
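Turning a failed replay into a feedback row could be as small as the sketch below. The names are hypothetical, the pass threshold of 1.0 is an assumption, and so is the `failure_reason` value `'eval-divergence'`; only `source_type='session'` and the two session ids in `context` come from the description above.

```typescript
interface ReplayOutcome {
  golden_session_id: string;
  replay_session_id: string;
  overall_accuracy: number; // 0..1
  divergences: string[];    // human-readable mismatch summaries
}

interface FeedbackDraft {
  source_type: "session";
  rating: "negative";
  status: "pending";
  session_id: string; // the replay session that diverged
  comment: string;
  context: { golden_session_id: string; replay_session_id: string; failure_reason: string };
}

// Returns null when the replay passed, otherwise a pending feedback draft
// that the review agent can pick up on its next heartbeat.
function feedbackFromReplay(outcome: ReplayOutcome, passThreshold = 1): FeedbackDraft | null {
  if (outcome.overall_accuracy >= passThreshold) return null;
  const pct = Math.round(outcome.overall_accuracy * 100);
  return {
    source_type: "session",
    rating: "negative",
    status: "pending",
    session_id: outcome.replay_session_id,
    comment: `Replay scored ${pct}% against golden; ${outcome.divergences.length} divergence(s)`,
    context: {
      golden_session_id: outcome.golden_session_id,
      replay_session_id: outcome.replay_session_id,
      failure_reason: "eval-divergence", // hypothetical label
    },
  };
}
```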
**Persona-driven golden sets:**
- `content/docs/personas/*.mdx` defines user personas and their jobs-to-be-done
- Each persona should have at least one golden session capturing a canonical interaction (e.g., pe-analyst asking for deal comparisons, sales-ops adding contacts)
- Persona-based golden sets give us user-journey regression coverage, not just field-level accuracy
**No new tables in phase 5.** `sessions.metadata.golden` and `sessions.metadata.eval_result` carry all the data. If we later need cross-run aggregation dashboards (accuracy over the last 10 runs), we can add an `eval_runs` table in phase 5b:
```sql
CREATE TABLE eval_runs (
  id uuid PRIMARY KEY,
  tenant_id uuid NOT NULL,
  name text NOT NULL, -- 'golden-pe-extraction', 'chat-smoke-2026-04-15'
  set_name text, -- the golden set being evaluated
  agent_id uuid, -- candidate agent being tested
  triggered_by text, -- 'manual' | 'scheduled' | 'post-deploy'
  started_at timestamptz NOT NULL DEFAULT now(),
  completed_at timestamptz,
  golden_session_count integer,
  passed_count integer,
  failed_count integer,
  results jsonb NOT NULL DEFAULT '{}'::jsonb,
  -- { bySession: { goldenSessionId: { passed, overall_accuracy, divergences } }, ... }
  created_by uuid
);
```
Phase 5 ships without `eval_runs`; per-session metadata is enough to list recent replays and drill into divergences. Add the table when the UI needs cross-run charts.
### Migration path
**Phase 1: schema**
1. Create `feedback` table
2. Backfill `feedback` from `chat_feedback` (source_type='chat', rating mapping from 'up'/'down')
3. Create new `shared_context` table structure (temporary name `shared_context_v2`, or drop-and-recreate in same migration)
4. Backfill from old `shared_context` JSON arrays — expand each array item into a row with appropriate `type` and parsed `content`/`context`
5. Drop old `shared_context`, rename `shared_context_v2` → `shared_context`
6. Mark `chat_feedback` as deprecated; drop in next release
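The two backfills in steps 2 and 4 reduce to small mapping functions. The sketch below rests on assumptions about legacy data that this document does not spell out: legacy array items are taken to be either plain strings or `{ content, context? }` objects, and the legacy keys are taken to be `corrections` and `lessons`. Check both against the real `CONTEXT_KEYS` and stored values before relying on it. All names are hypothetical.

```typescript
// Step 2: chat_feedback thumbs map onto the new rating values.
function ratingFromThumb(thumb: "up" | "down"): "positive" | "negative" {
  return thumb === "up" ? "positive" : "negative";
}

// Step 4: one legacy key/value row expands into one normalized row per array item.
type LegacyItem = string | { content: string; context?: string }; // assumed shapes
interface LegacyRow { tenant_id: string; key: string; value: LegacyItem[] }

interface NormalizedRow {
  tenant_id: string;
  type: "correction" | "lesson";
  content: string;
  context: string | null;
  active: boolean;
}

// Assumed legacy key names; replace with the actual CONTEXT_KEYS values.
const KEY_TO_TYPE: Record<string, NormalizedRow["type"]> = {
  corrections: "correction",
  lessons: "lesson",
};

function expandLegacyRow(row: LegacyRow): NormalizedRow[] {
  const type = KEY_TO_TYPE[row.key];
  if (!type) return []; // unknown keys are skipped here; a real migration should log them
  return row.value.map((item) => ({
    tenant_id: row.tenant_id,
    type,
    content: typeof item === "string" ? item : item.content,
    context: typeof item === "string" ? null : item.context ?? null,
    active: true,
  }));
}
```

In the actual migration this logic would live in SQL (for example via `jsonb_array_elements`); the TypeScript form is convenient for unit-testing the mapping against exported rows first.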
**Phase 2: code**
1. Rewrite `features/context/server/actions.ts` and `features/context/lib/load.ts` to read/write normalized rows. Preserve public signatures.
2. Update `addCorrection`/`addLesson` tools to write rows (signatures unchanged, internals changed)
3. Update `features/chat/feedback/server/actions.ts` to write to `feedback` table
4. Update chat feedback API routes to query `feedback` table
5. Wire response rejection flow to insert feedback row
6. Wire extraction rejection flow to insert feedback row alongside the retry
7. Wire session failure to insert feedback row
**Phase 3: UI**
1. Build `features/feedback/` module (types, hooks, components)
2. Build `features/evals/` module (types, compare logic, reporting components)
3. Build new `/admin/agent-review` section with 5 tabs
4. Add `agent-review` entry to `ADMIN_SECTIONS` under "AI & Agents" group
5. Delete `app/(app)/sessions/page.tsx`, add redirect to `/tasks`
6. Keep `/sessions/[id]` detail page — linked from admin and entity panels
7. If PR 707 lands concurrently, add a redirect from `/admin/conversations` → `/admin/agent-review/conversations`
**Phase 4: review agent**
1. Create system agent record (feedback-review-agent)
2. Create task template for "Feedback Review"
3. Configure heartbeat
4. Write system prompt and guardrails
5. Ship disabled by default — admin opts in per workspace
**Phase 5: evals**
1. Add "Promote as golden" action to session detail + conversation detail (writes `metadata.golden`)
2. Add protection in session actions: block delete and block status changes when `metadata.golden` is set
3. Build input-snapshot extractor (`features/evals/server/snapshot.ts`) that captures reproducible inputs from session + events
4. Build replay runner (`features/evals/server/replay.ts`) that creates a new session from a snapshot, invokes the session executor, and sets `metadata.eval_source`
5. Build comparison logic per session_type (`compare-agent.ts`, `compare-response.ts`, `compare-tool.ts`)
6. Build admin eval UI: golden session manager, replay trigger, pass/fail matrix, divergence drill-down
7. Seed persona-driven golden sessions for `pe-analyst`, `sales-ops`, `first-time-admin` as a starter set
8. Wire replay failures into `feedback` table as `source_type='session'` entries so the review agent picks up regressions
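Phases 1 and 2 form the foundation and should ship together, because phase 1 replaces the `shared_context` structure that the phase 2 code reads and writes. Phases 3 through 5 can then land incrementally, each independently shippable.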
## Trade-offs
**Why a new `feedback` table instead of generalizing `session_events`?**
Session events are append-only audit telemetry — "agent called tool X at time Y." Feedback is evaluation — "the output was bad because Z." They have different lifecycles (events are immutable; feedback has a status machine), different query patterns (events are per-session; feedback is cross-session aggregate), and different readers (events for replay; feedback for learning). Mixing them would blur the intent and bloat the events table.
**Why a new `shared_context` structure instead of an entity type?**
Considered making `shared_context` rows entities of type `agent-knowledge` or `learning`. Pros: zero new tables, standard entity CRUD, tags/relations free. Cons: entity overhead on the hot path (every agent prompt loads shared context), entity semantics don't fit ephemeral lifecycle knowledge well, and the admin UI for entities is geared toward business records, not system metadata. The dedicated table is ~50 lines of migration and wins on performance, semantics, and purpose-fit.
**Why drop `chat_feedback` instead of keeping both?**
Keeping `chat_feedback` as a chat-specific table + a generic `feedback` table means chat thumbs would write to two places, or chat-specific feedback would split from everything else. Consolidating to one table means one query path, one review pipeline, one admin view. Migration is straightforward (tens of rows in practice).
**Why keep `shared_context` as the name?**
Avoids rename churn across `loadSharedContextPrompt()`, tests, and docs (`CONTEXT_KEYS` goes away either way, since types become values on rows). The semantic meaning ("context shared across all agent runs for this tenant") still fits the new structure, and keeping the name matches user preference.
**Why not merge feedback and shared_context into one table?**
Feedback is input (evaluation signal); shared_context is output (distilled knowledge). A single correction in shared_context can be produced by reviewing 5-10 feedback signals. The write patterns, lifecycles, and read patterns are different enough that one table would force either a `status='feedback'` vs `status='rule'` split (back to a type column hack) or an awkward denormalization. Two tables is cleaner.
**Why defer `eval_runs` table?**
Starting with per-session metadata (`sessions.metadata.eval_result`, linked to golden via `metadata.eval_source`) is enough for the first phase. Replays are themselves sessions — listing them, filtering by status, drilling in, all work via the existing session queries. If we need cross-run aggregation dashboards or eval history charts, add the table then. YAGNI.
**Why session-centric evals instead of entity+response?**
Considered the narrower approach: tag entities as eval set, promote responses as golden, compare new responses. This works for extraction accuracy but misses chat behavior, tool-call correctness, and multi-step agent journeys. Session-centric evals let us regress-test any agent output — including the persona journeys defined in `content/docs/personas/` — with one mechanism. The entity+response case is a specialization (response sessions compared dimension-by-dimension). One primitive covers every case.
**Why not snapshot session inputs into a dedicated `golden_cases` table?**
Considered extracting golden input snapshots into their own table so original sessions could be freely modified or deleted. Decided against it: the session record already contains the inputs (the first events in `session_events`), and snapshotting to `metadata.golden.snapshot` at promotion time locks in the reproducible view. This avoids a parallel data model and lets admins browse golden sessions through the same session list UI they already use. Immutability is soft-enforced in server actions and RLS rather than by a schema constraint.
## Acceptance Criteria
### Feedback intake
- [ ] `feedback` table exists with the schema above, RLS policies, and indexes
- [ ] Chat thumbs write to `feedback` with `source_type='chat'` and correct context
- [ ] Response rejection writes to `feedback` with `source_type='response'`
- [ ] Extraction rejection writes to `feedback` alongside the retry
- [ ] Session failure writes to `feedback` automatically
- [ ] `chat_feedback` data fully backfilled into `feedback`
- [ ] `chat_feedback` table dropped after one release
### Workspace knowledge
- [ ] `shared_context` normalized table exists with schema above, RLS, indexes
- [ ] All existing JSON-array rows backfilled into individual rows
- [ ] `addCorrection`/`addLesson` tools write rows (same signatures)
- [ ] New `addInsight` tool exists
- [ ] `loadSharedContextPrompt()` signature unchanged; reads from new structure
- [ ] Scoping works: `agent_id`-specific rules only load for that agent, `entity_type_id`-specific rules only load for that type
- [ ] `active = false` rows excluded from prompt injection
- [ ] Provenance links preserved (source_feedback_id, created_by_agent_id, created_by_user_id)
### Feedback review agent
- [ ] System agent exists with correct tool groups and permissions
- [ ] Heartbeat task runs on schedule
- [ ] Review agent can list, read, and process pending feedback
- [ ] Review agent creates `shared_context` rows with provenance
- [ ] Review agent can update agent configs when appropriate
- [ ] Review agent marks feedback as applied/dismissed with notes
- [ ] All review actions logged in `session_events`
- [ ] Admin can disable review agent per workspace
### Admin UI
- [ ] `/admin/agent-review` exists under "AI & Agents"
- [ ] `agent-review` in `ADMIN_SECTIONS`
- [ ] Conversations tab works (from PR 707, updated data source)
- [ ] Sessions tab works (moved from user-facing)
- [ ] Feedback tab shows unified feed with filters
- [ ] Knowledge tab manages shared_context rows (CRUD)
- [ ] Evals tab shows golden set reports
- [ ] `/sessions` redirects to `/tasks`
- [ ] Session detail page `/sessions/[id]` remains accessible from admin and entity panels
- [ ] `/admin/conversations` redirects to `/admin/agent-review/conversations`
### Eval system
- [ ] Any completed session can be promoted as golden via `/admin/agent-review/sessions` or `/admin/agent-review/conversations`
- [ ] Golden sessions capture reproducible input snapshot in `sessions.metadata.golden.snapshot`
- [ ] Golden sessions are immutable (protected from deletion + status changes while flagged)
- [ ] Admin can organize golden sessions into named sets (`pe-extraction`, `chat-smoke`, etc.)
- [ ] Eval run replays each golden session's inputs against target agent and produces a replay session
- [ ] Comparison logic dispatches by `session_type`:
  - `agent`/`chat`: tool-call set overlap, args jsonDiff, entities created/updated, final message similarity
  - `response`: per-dimension value comparison (numeric tolerance, exact, semantic)
  - `tool`: output jsonDiff + entity mutations
- [ ] Eval report shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
- [ ] Accuracy trend tracked per golden set over time
- [ ] Persona-driven golden sets exist for at least `pe-analyst`, `sales-ops`, `first-time-admin`
- [ ] Failed replays automatically generate a `feedback` row linking the golden and replay sessions, picked up by the review agent
### Quality gates
- [ ] All new migrations are reversible
- [ ] `pnpm test` passes with updated and new tests
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm build` pass
- [ ] `documents/DATABASE.md` reflects new schema
- [ ] `/docs/features/agent-system` updated with feedback + evals sections
- [ ] `content/docs/data-model.mdx` updated with new tables
- [ ] Feature docs for `/admin/agent-review` added to `content/docs/features/`
## Files
### Database
- `supabase/migrations/20260415010000_feedback_table.sql` — create `feedback` + backfill from `chat_feedback`
- `supabase/migrations/20260415010100_shared_context_normalized.sql` — restructure `shared_context` + backfill from JSON arrays
- `supabase/migrations/20260415010200_drop_chat_feedback.sql` — drop deprecated `chat_feedback` (next release after phase 1 ships)
### Feedback module (new)
- `features/feedback/types.ts` — `FeedbackRecord`, `FeedbackSourceType`, `FeedbackStatus`
- `features/feedback/server/actions.ts` — `createFeedback`, `listFeedback`, `reviewFeedback`, `getFeedback`
- `features/feedback/server/queries.ts` — tenant-scoped query builders with joins
- `features/feedback/hooks/use-feedback-list.ts` — React Query hook
- `features/feedback/components/feedback-list.tsx` — list component with filters
- `features/feedback/components/feedback-detail.tsx` — detail view with source context
- `app/api/feedback/route.ts` — POST (create), GET (list with filters)
- `app/api/feedback/[id]/route.ts` — GET (detail), PATCH (review)
### Context module (refactored)
- `features/context/types.ts` — rewritten: `SharedContextRecord`, `SharedContextType` enum
- `features/context/server/actions.ts` — `createSharedContextRule`, `updateSharedContextRule`, `deactivateSharedContextRule`, `listSharedContext`
- `features/context/lib/load.ts` — `loadSharedContextPrompt()` signature unchanged; reads normalized rows
- `features/context/lib/format.ts` — format functions unchanged externally
- `features/context/components/shared-context-manager.tsx` — admin CRUD component
- `app/api/admin/shared-context/route.ts` — GET (list), POST (create)
- `app/api/admin/shared-context/[id]/route.ts` — PATCH (update), DELETE (hard delete if needed)
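
One way the normalized read path could avoid treating the 8-item prompt cap as a data cap is to apply the cap at read time only. The sketch below is a hedged illustration, not the actual `load.ts`: the record fields (`isActive`, `agentId`, `updatedAt`), the `"correction" | "lesson" | "insight"` type values, and the helper names `selectRulesForPrompt` and `formatRules` are assumptions, and the ordering by recency is just one possible policy.

```typescript
// Sketch only: field names and type values are assumptions, not the real schema.
type SharedContextType = "correction" | "lesson" | "insight";

interface SharedContextRecord {
  id: string;
  type: SharedContextType;
  content: string;
  agentId: string | null; // null = workspace-wide rule
  isActive: boolean;      // deactivated rows stay stored but are never injected
  updatedAt: string;      // ISO 8601 UTC timestamp, e.g. "2026-04-15T00:00:00Z"
}

const PROMPT_CAP = 8;

// Pick the rules to inject for one agent. The cap limits what is read into the
// prompt; it never deletes or truncates stored rows.
function selectRulesForPrompt(
  rows: SharedContextRecord[],
  agentId: string,
  cap: number = PROMPT_CAP,
): SharedContextRecord[] {
  return rows
    .filter((row) => row.isActive && (row.agentId === null || row.agentId === agentId))
    // Same-format UTC ISO strings sort chronologically as plain strings.
    .sort((a, b) => b.updatedAt.localeCompare(a.updatedAt))
    .slice(0, cap);
}

// Render the selected rules as a bullet list for prompt injection.
function formatRules(rows: SharedContextRecord[]): string {
  return rows.map((row) => `- [${row.type}] ${row.content}`).join("\n");
}
```

Because `filter` returns a new array, the stored rows are left untouched; a ninth lesson remains in the table and becomes eligible again as soon as a newer rule is deactivated.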
### Tools (updated)
- `features/tools/context-tools.ts` — `addCorrection`, `addLesson` updated to write rows; new `addInsight`, `updateSharedContext`, `deactivateSharedContext`
- `features/tools/feedback-tools.ts` (new) — `listFeedback`, `getFeedback`, `reviewFeedback`, `addFeedback`
### Feedback intake wiring
- `features/chat/components/message-actions-bar.tsx` — write to `/api/feedback` instead of `/api/chats/[id]/feedback`
- `features/chat/feedback/server/actions.ts` — rewrite to use `feedback` table
- `features/responses/server/actions.ts` — on rejection, insert feedback row
- `features/inngest/functions/feedback-rerun.ts` — insert feedback row alongside retry
- `features/sessions/server/event-log.ts` — on session failure transition, insert feedback row
### Feedback review agent
- `features/agents/system-agents.ts` — register `feedback-review-agent`
- `features/tasks/system-tasks.ts` — register "Feedback Review" task template
- Migration to seed review agent + task
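
The conservative starting policy described under Open Questions (create a rule only when 2+ feedback signals support the same pattern, or when an admin manually flags a pattern) could be sketched as a simple grouping step. Everything here is hypothetical: `patternKey` stands in for however the review agent decides that two feedback rows describe the same issue, and `adminFlagged`, `RuleCandidate`, and `findRuleCandidates` are illustrative names only.

```typescript
// Sketch only: these shapes are assumptions, not the planned feedback schema.
interface FeedbackSignal {
  id: string;
  patternKey: string;    // hypothetical label for the underlying issue
  adminFlagged: boolean; // hypothetical marker for a manual admin flag
}

interface RuleCandidate {
  patternKey: string;
  feedbackIds: string[];
  reason: "threshold" | "admin-flag";
}

// Group signals by pattern and keep only the groups that meet the policy.
function findRuleCandidates(signals: FeedbackSignal[], minSignals = 2): RuleCandidate[] {
  const groups = new Map<string, FeedbackSignal[]>();
  for (const signal of signals) {
    const group = groups.get(signal.patternKey) ?? [];
    group.push(signal);
    groups.set(signal.patternKey, group);
  }

  const candidates: RuleCandidate[] = [];
  groups.forEach((group, patternKey) => {
    const flagged = group.some((signal) => signal.adminFlagged);
    if (group.length >= minSignals || flagged) {
      candidates.push({
        patternKey,
        feedbackIds: group.map((signal) => signal.id),
        reason: flagged ? "admin-flag" : "threshold",
      });
    }
  });
  return candidates;
}
```

Keeping `minSignals` as a parameter makes it straightforward to tune how eager the agent is without touching the grouping logic.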
### Evals module (new)
- `features/evals/types.ts` — `GoldenSessionSnapshot`, `EvalComparison`, `Divergence`, `DimensionResult`
- `features/evals/server/actions.ts` — `promoteSessionToGolden`, `revokeGoldenStatus`, `replayGoldenSession`, `listGoldenSessions`, `listReplaysForGolden`
- `features/evals/server/snapshot.ts` — extract reproducible input snapshot from a session + events
- `features/evals/server/replay.ts` — construct new session from snapshot, invoke session executor
- `features/evals/lib/compare.ts` — top-level comparator dispatching by `session_type`
- `features/evals/lib/compare-agent.ts` — tool-call sequence, args, entity mutations, final message
- `features/evals/lib/compare-response.ts` — per-dimension value comparison (reuses existing scoring primitives)
- `features/evals/lib/compare-tool.ts` — tool output + mutation comparison
- `features/evals/lib/similarity.ts` — helpers (numeric tolerance, keyword presence, optional semantic similarity)
- `features/evals/components/eval-report.tsx` — pass/fail matrix + trend
- `features/evals/components/golden-session-manager.tsx` — list, promote, organize into sets
- `features/evals/components/replay-diff.tsx` — side-by-side divergence drill-down
- `features/sessions/server/actions.ts` — add `is_golden` helper (reads `metadata.golden`) + guard against delete/status-change when golden
- `app/api/admin/evals/golden/route.ts` — GET (list golden sessions), POST (promote session)
- `app/api/admin/evals/golden/[id]/route.ts` — DELETE (revoke), PATCH (rename set)
- `app/api/admin/evals/runs/route.ts` — POST (trigger replay for a set or single session)
- `app/api/admin/evals/runs/[id]/route.ts` — GET (replay results + diff)
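
The golden guard in `features/sessions/server/actions.ts` could take a shape like the following. This is a sketch under assumptions: the `SessionRow` fields, the `SessionMutation` union, and the `checkMutation` name are hypothetical; only the rule itself (golden status is read from `metadata.golden`, and deletion and status changes are blocked while a session is flagged) comes from the spec.

```typescript
// Sketch only: row shape and mutation kinds are assumptions for illustration.
interface SessionRow {
  id: string;
  status: string;
  metadata: { golden?: { set?: string; snapshot?: unknown } | null } | null;
}

type SessionMutation =
  | { kind: "delete" }
  | { kind: "status-change"; nextStatus: string }
  | { kind: "rename-set"; set: string };

type GuardResult = { allowed: true } | { allowed: false; reason: string };

// A session counts as golden when metadata.golden is present and non-null.
function isGolden(session: SessionRow): boolean {
  return session.metadata?.golden != null;
}

// Block deletes and status changes on golden sessions; allow everything else.
function checkMutation(session: SessionRow, mutation: SessionMutation): GuardResult {
  if (!isGolden(session)) return { allowed: true };
  if (mutation.kind === "delete" || mutation.kind === "status-change") {
    return {
      allowed: false,
      reason: `Session ${session.id} is golden; revoke golden status before a ${mutation.kind}.`,
    };
  }
  return { allowed: true };
}
```

Returning a result object rather than throwing lets the API routes map a blocked mutation to whatever HTTP status they choose.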
### Admin Agent Review section
- `app/(app)/admin/agent-review/page.tsx` — landing page / tab container
- `app/(app)/admin/agent-review/conversations/page.tsx` — absorbs PR 707
- `app/(app)/admin/agent-review/conversations/[id]/page.tsx` — absorbs PR 707 detail
- `app/(app)/admin/agent-review/sessions/page.tsx` — moved from user-facing
- `app/(app)/admin/agent-review/feedback/page.tsx` — new
- `app/(app)/admin/agent-review/knowledge/page.tsx` — new
- `app/(app)/admin/agent-review/evals/page.tsx` — new
- `features/admin/lib/sections.ts` — add `agent-review` entry
### Removed / redirected
- `app/(app)/sessions/page.tsx` — delete, redirect to `/tasks`
- `app/(app)/admin/conversations/**` — redirect to `/admin/agent-review/conversations/**`
- `chat_feedback` table — drop after backfill + one release
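
Assuming the app is a Next.js project (the `app/(app)/.../page.tsx` layout suggests this, but the spec does not say so), the two redirects could be declared with the `redirects()` option in the Next.js config. The sketch below defines a local `Redirect` type so it stands alone; choosing `permanent: false` is an assumption, made so the redirects can be reversed if the routes change again.

```typescript
// Local copy of the object shape that Next.js `redirects()` entries use.
interface Redirect {
  source: string;
  destination: string;
  permanent: boolean;
}

const agentReviewRedirects: Redirect[] = [
  // Exact match only, so /sessions/[id] detail pages stay reachable.
  { source: "/sessions", destination: "/tasks", permanent: false },
  // `:path*` matches zero or more segments, covering the list and detail pages.
  {
    source: "/admin/conversations/:path*",
    destination: "/admin/agent-review/conversations/:path*",
    permanent: false,
  },
];

// In a Next.js config this function would be the `redirects` option.
async function redirects(): Promise<Redirect[]> {
  return agentReviewRedirects;
}
```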
### Docs
- `content/docs/features/agent-system.mdx` — add feedback + evals + review sections
- `content/docs/data-model.mdx` — document `feedback` and new `shared_context`
- `content/docs/features/agent-review.mdx` (new) — admin section guide
- `documents/DATABASE.md` — update table inventory
- `documents/CHANGELOG.md` — entry when phase 1 ships
## Open Questions
1. **Should `feedback` rows be user-visible?** Users can see their own chat feedback today. Decision: yes for a user's own feedback (`created_by = user`); no for aggregate views or other users' feedback, which stay admin-only. RLS policies enforce both rules.
2. **How aggressive should the review agent be?** Over-eager creation of rules pollutes shared_context; under-eager leaves signal on the table. Start conservative — review agent creates rules only when 2+ feedback signals support the same pattern, or an admin manually flags a pattern.
3. **Per-agent shared_context scoping — opt-in or automatic?** When an agent writes a lesson from its own session, should it default to `agent_id = self` (narrow scope) or `agent_id = null` (workspace-wide)? Default to workspace-wide with explicit agent-specific opt-in, since most learnings should generalize.
4. **Eval run scheduling — manual or automatic?** Start manual (admin triggers from UI). Phase 2 could add post-deploy automatic runs and scheduled regression checks.
5. **Backfill of extraction rejections into `feedback`?** The `feedback-rerun.ts` path is live today but doesn't record feedback. Should historical extraction rejections, with their rejection reasons, be backfilled from `entity_responses` rows where `status='rejected'`? Decision: yes, one-time backfill as part of phase 1 migration.
6. **How reproducible are chat session inputs?** Chat sessions depend on user_info, workspace context, skills, and shared_context at the time of the original run — all of which change over time. Replay fidelity will drift. Decision: snapshot the full prompt context (entity types, shared_context rules, memories, skills) into `metadata.golden.snapshot` at promotion time. Replays use the snapshot, not live context. Admins can opt into "live context" replays if they want to test whether new shared_context rules actually improve behavior on past scenarios.
7. **Semantic similarity for text comparisons — embedding model?** Comparing chat final messages or text field values benefits from embedding-based similarity. Decision: phase 5 ships with keyword-overlap + exact-match. Phase 5b adds embedding similarity if the signal-to-noise ratio warrants the cost/complexity.
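
The keyword-overlap plus exact-match comparison proposed for phase 5 in question 7 could look roughly like this. It is a sketch, not the planned `similarity.ts`: interpreting "keyword overlap" as Jaccard similarity over lowercase word sets is an assumption, as are the function names, the 0.6 threshold, and the ASCII-only tokenizer.

```typescript
// Split text into a set of lowercase alphanumeric tokens (ASCII only).
function tokenize(text: string): Set<string> {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words);
}

// Jaccard similarity: shared tokens divided by total distinct tokens.
function keywordOverlap(a: string, b: string): number {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  if (tokensA.size === 0 && tokensB.size === 0) return 1; // two empty texts agree

  let shared = 0;
  tokensA.forEach((word) => {
    if (tokensB.has(word)) shared += 1;
  });
  return shared / (tokensA.size + tokensB.size - shared);
}

// Exact match first, then fall back to keyword overlap against a threshold.
// The 0.6 default is an arbitrary placeholder and would need tuning.
function textsMatch(golden: string, replay: string, threshold = 0.6): boolean {
  if (golden.trim() === replay.trim()) return true;
  return keywordOverlap(golden, replay) >= threshold;
}
```

A set-based measure ignores word order and repetition, which keeps it cheap but means it cannot tell a paraphrase from a contradiction that reuses the same words; that limitation is the kind of noise an embedding-based phase 5b would address.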