Unified Agent Intelligence

Consolidate feedback intake, self-improvement, admin review, and eval systems into a thin harness around two primitives — feedback inputs and shared_context outputs — that turn every agent interaction into durable workspace intelligence.

Problem

Amble has six parallel channels for capturing agent feedback, and only one of them (corrections and lessons) closes the loop into durable workspace intelligence. User memories also persist, but only per user. The rest are either write-only dead ends or one-shot fixes that never generalize into workspace knowledge.

What exists today:

Channel	Writes to	Reads from	End-to-end?
Corrections / Lessons	`shared_context` JSON blobs via `addCorrection`/`addLesson` tools	Every agent prompt via `loadSharedContextPrompt()`	YES
User memories	`user_memories` via `saveMemory` tool	User-scoped prompt injection	YES (user-scoped only)
Chat feedback	`chat_feedback` via thumbs-up/down buttons	Nothing	NO
Response scoring	`entity_responses` via form submission	Aggregation views, field promotion	Partial (no learning loop)
Extraction rejection	Inngest event → `feedback-rerun.ts`	One-shot retry with feedback instruction	Limited (no durable lesson)
Session events	`session_events` via append	Audit trail only	NO

The core problems:

Fragmented intake. Feedback on agent output lives in four places (chat_feedback, entity_responses.status, Inngest rejection events, implicit in session_events). No unified read path means no systematic review.
Brittle storage. Corrections and lessons are stored as JSON arrays inside a key/value shared_context table. Individual items cannot be edited, deactivated, or audited without rewriting the whole array. The 8-item prompt cap is treated as a data cap — any 9th lesson falls off forever.
No review pipeline. Nothing processes chat_feedback, response rejections, or session failures into durable learnings. The one system that does learn durably (corrections/lessons) requires agents to proactively call tools — there is no review loop.
No eval system. Extraction accuracy, agent output quality, and regression detection have no golden-set mechanism. "Did this change improve or regress agent behavior on canonical scenarios?" cannot be answered. Today, scoring only covers entity-response field values — chat behavior, tool-call sequences, and full-session outcomes have no reference set.
Surface sprawl. /sessions is a user-facing page that exposes raw execution machinery customers don't need. PR 707 proposed /admin/conversations as another standalone surface (not yet merged to dev). Admin has no single place to review agent work, manage workspace knowledge, or evaluate quality.

This is vibecoded sprawl. It ships features that look like feedback loops but don't close. The fix is to collapse the surface to a thin harness — two tables, one review task, one admin section — and let the durable knowledge (workspace rules, learned patterns, eval results) become the asset that grows over time.

Goals

Unify feedback intake. All agent-output evaluations (chat thumbs, response rejections, extraction rejections, session failures, admin observations) flow into a single feedback table with consistent shape.
Normalize workspace knowledge. Corrections, lessons, routing, insights, and guidelines become individually-addressable rows in shared_context, not JSON array items. Each carries provenance, scope, active state, and lifecycle metadata.
Close the self-improvement loop. A scheduled feedback review agent reads pending feedback, identifies patterns, and generates durable shared_context entries. Human admins can review, edit, approve, or dismiss.
Add session-centric golden-set evals. Any completed session (chat, extraction, response, tool, mixed) can be flagged as golden. Replay its inputs against a candidate agent or config, compare the new session's tool calls, outputs, and mutations to the golden, and report divergence. Persona-driven canonical sessions become regression coverage for user journeys.
Consolidate admin surfaces. One /admin/agent-review section provides the conversation browser, sessions browser, feedback review, shared_context management, and eval reporting. PR 707's conversation-browser work, if still pending, gets rebased onto this structure rather than shipped standalone.
Remove user-facing surfaces that customers don't need. /sessions redirects to /tasks. Session detail remains for deep drill-down from admin and entity panels.

Non-goals

Not rebuilding the agent prompt construction path — loadSharedContextPrompt() keeps its signature; internals change.
Not changing user_memories — user-scoped memory stays as-is, separate from workspace knowledge.
Not changing entity_responses or criteria_sets — the response/scoring system is sound, just gets wired into feedback generation.
Not changing session_events — it stays append-only telemetry; feedback is separate.
Not introducing a new primitive for evals — golden sets are sessions flagged as metadata.golden, replayed through the existing session executor, and compared via a new features/evals/lib/compare.ts dispatched by session_type.

Design

Two tables, purpose-built

feedback (new) — inputs to the self-improvement system.

Replaces chat_feedback and absorbs all signals that evaluate agent output. Lightweight, append-mostly, has a processing lifecycle.

CREATE TABLE feedback (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,

  -- Source of the feedback
  source_type text NOT NULL CHECK (source_type IN (
    'chat', 'response', 'extraction', 'tool', 'session', 'observation'
  )),
  rating text NOT NULL CHECK (rating IN ('positive', 'negative', 'neutral')),
  comment text,

  -- Links to the agent output being evaluated
  session_id uuid REFERENCES sessions(id) ON DELETE SET NULL,
  entity_id uuid REFERENCES entities(id) ON DELETE SET NULL,
  agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  context jsonb NOT NULL DEFAULT '{}'::jsonb,
  -- context carries source-specific metadata:
  --   chat: { chat_id, message_index }
  --   response: { response_id, criteria_set_id, field_name }
  --   extraction: { field_name, entity_type_id, rejected_value }
  --   tool: { tool_slug, tool_run_id }
  --   session: { failure_reason, event_id }

  -- Processing lifecycle
  status text NOT NULL DEFAULT 'pending' CHECK (status IN (
    'pending', 'reviewed', 'applied', 'dismissed'
  )),
  reviewed_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  reviewed_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,
  reviewed_at timestamptz,
  review_notes text,

  created_by uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX idx_feedback_tenant_status ON feedback(tenant_id, status, created_at DESC);
CREATE INDEX idx_feedback_session ON feedback(session_id) WHERE session_id IS NOT NULL;
CREATE INDEX idx_feedback_entity ON feedback(entity_id) WHERE entity_id IS NOT NULL;
CREATE INDEX idx_feedback_agent ON feedback(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_feedback_pending ON feedback(tenant_id, created_at DESC) WHERE status = 'pending';
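
The row shape above maps fairly directly onto TypeScript types. The following is a minimal sketch, not the contents of features/feedback/types.ts: the column names and context keys come from the schema, while DOCUMENTED_CONTEXT_KEYS and checkContext are hypothetical helpers, and the schema does not say whether the listed context keys are mandatory.

```typescript
// Hypothetical TypeScript mirror of the feedback table; names are illustrative.
type FeedbackSourceType =
  | "chat" | "response" | "extraction" | "tool" | "session" | "observation";
type FeedbackRating = "positive" | "negative" | "neutral";
type FeedbackStatus = "pending" | "reviewed" | "applied" | "dismissed";

interface FeedbackRecord {
  id: string;
  tenant_id: string;
  source_type: FeedbackSourceType;
  rating: FeedbackRating;
  comment: string | null;
  session_id: string | null;
  entity_id: string | null;
  agent_id: string | null;
  context: Record<string, unknown>;
  status: FeedbackStatus;
  reviewed_by_agent_id: string | null;
  reviewed_by_user_id: string | null;
  reviewed_at: string | null;
  review_notes: string | null;
  created_by: string;
  created_at: string;
}

// Context keys listed in the schema comment. 'observation' lists none.
const DOCUMENTED_CONTEXT_KEYS: Record<FeedbackSourceType, readonly string[]> = {
  chat: ["chat_id", "message_index"],
  response: ["response_id", "criteria_set_id", "field_name"],
  extraction: ["field_name", "entity_type_id", "rejected_value"],
  tool: ["tool_slug", "tool_run_id"],
  session: ["failure_reason", "event_id"],
  observation: [],
};

// Reports documented keys that are absent and keys that are not documented.
// Whether a missing key should reject the insert is left to the caller.
function checkContext(
  sourceType: FeedbackSourceType,
  context: Record<string, unknown>,
): { missing: string[]; unknown: string[] } {
  const documented = DOCUMENTED_CONTEXT_KEYS[sourceType];
  return {
    missing: documented.filter((key) => !(key in context)),
    unknown: Object.keys(context).filter((key) => documented.indexOf(key) === -1),
  };
}
```

A caller such as a createFeedback action could log the result of checkContext rather than reject on it, since some intake points described later pass only a subset of the listed keys.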

shared_context (restructured) — outputs of the self-improvement system.

Keeps its current name. Drops the key/value JSON-blob shape. Each correction, lesson, insight, or guideline becomes its own row with individual lifecycle and provenance.

-- New structure (requires drop + recreate, see migration section)
CREATE TABLE shared_context (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,

  -- What kind of knowledge
  type text NOT NULL CHECK (type IN (
    'correction', 'lesson', 'routing', 'insight', 'guideline'
  )),
  content text NOT NULL,
  context text, -- when this applies (e.g., "when extracting valuation for PE deals")

  -- Scoping (NULL = tenant-wide; set for narrower scope)
  agent_id uuid REFERENCES agents(id) ON DELETE CASCADE,
  entity_type_id uuid REFERENCES entity_types(id) ON DELETE CASCADE,

  -- Lifecycle
  active boolean NOT NULL DEFAULT true,

  -- Provenance
  source_feedback_id uuid REFERENCES feedback(id) ON DELETE SET NULL,
  created_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  created_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,

  metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX idx_shared_context_tenant_type_active ON shared_context(tenant_id, type, active, created_at DESC);
CREATE INDEX idx_shared_context_agent ON shared_context(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_shared_context_entity_type ON shared_context(entity_type_id) WHERE entity_type_id IS NOT NULL;
CREATE INDEX idx_shared_context_source_feedback ON shared_context(source_feedback_id) WHERE source_feedback_id IS NOT NULL;

Why keep the name shared_context:

Zero rename churn in code, tests, and docs that reference it.
Semantically accurate: context shared across all agent runs for this tenant.
loadSharedContextPrompt() keeps its signature — only internals change.
CONTEXT_KEYS is removed; types are now enum values on rows, not JSON keys.

The feedback-to-knowledge pipeline

Agent produces output
  ↓
Human/system evaluates (thumbs, reject, score, flag)
  ↓
feedback row inserted (status='pending')
  ↓
Feedback Review Agent (scheduled task, hourly)
  ↓
Reads pending feedback, groups by pattern
  ↓
Decides action per feedback:
  • Create shared_context row (correction/lesson/insight)
  • Update agent config (prompt tweaks, tool groups)
  • Update entity/entity_type (structural fix)
  • Dismiss (noise, already addressed, user error)
  ↓
Marks feedback as applied/dismissed with review_notes
  ↓
loadSharedContextPrompt() reads active rows, formats by type,
injects top N per type into every future agent prompt
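
The "groups by pattern" step is not specified further. One possible reading, sketched below, buckets pending rows by source, agent, and the most specific context discriminator. The key composition, patternKey, and groupByPattern are all assumptions made for illustration; the review agent may well group by other signals.

```typescript
interface PendingFeedback {
  id: string;
  source_type: string;
  agent_id: string | null;
  context: Record<string, unknown>;
}

// Assumed grouping key: source + agent + the first context discriminator present.
function patternKey(item: PendingFeedback): string {
  const detail =
    item.context["field_name"] ??
    item.context["tool_slug"] ??
    item.context["failure_reason"] ??
    "";
  return [item.source_type, item.agent_id ?? "any-agent", String(detail)].join("|");
}

// Buckets feedback rows that share a pattern key, preserving input order.
function groupByPattern(items: PendingFeedback[]): Map<string, PendingFeedback[]> {
  const groups = new Map<string, PendingFeedback[]>();
  for (const item of items) {
    const key = patternKey(item);
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(item);
    } else {
      groups.set(key, [item]);
    }
  }
  return groups;
}
```

Groups with several members are the likelier candidates for a durable shared_context row; singletons are the likelier candidates for dismissal.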

Feedback intake points

Existing channels route into the feedback table:

Chat thumbs (existing UI) — features/chat/components/message-actions-bar.tsx writes to feedback with source_type='chat', context={chat_id, message_index}. chat_feedback table backfilled and dropped.
Response rejection — When a response status transitions to rejected with a rejection_reason, submitResponseAdmin() writes a feedback row with source_type='response', context={response_id, criteria_set_id}. The existing extraction/result-rejected Inngest event continues to fire.
Extraction rejection — feedback-rerun.ts continues to run the one-shot retry, AND writes a feedback row with source_type='extraction', rating='negative', status='applied' (the retry itself is the application). Because these rows never sit in pending, the review agent has to query them explicitly (listFeedback with status='applied' and sourceType='extraction') to pick up patterns across rejections.
Tool results — Agents and admins can call a new addFeedback tool to record observations about tool outputs.
Session failures — Sessions that transition to failed status automatically generate a feedback row with source_type='session', rating='negative', context={failure_reason}.
Admin observation — Admins can add a feedback row manually from any session or conversation via the review UI ("Flag this response / message / session").

Workspace knowledge authoring

shared_context rows are written by:

addCorrection tool — Updated to INSERT a type='correction' row. The existing (content: string, context?: string) call shape keeps working; the change is backward-compatible and adds optional agentId and entityTypeId parameters for scoping.
addLesson tool — Updated to INSERT a type='lesson' row. Signature unchanged.
New addInsight tool — For agent observations that aren't corrections or lessons but represent durable patterns.
Admin UI — Direct CRUD for all five types (correction, lesson, routing, insight, guideline) via /admin/agent-review/knowledge. Routing and guideline rows are admin-authored today; agents can propose them via addLesson and admins can re-type as needed.
Feedback review agent — Writes rows via the tools above, setting source_feedback_id for provenance.

shared_context rows are read by:

loadSharedContextPrompt(admin, tenantId, { agentId?, entityTypeId? }) — Same signature, new internals. Queries active rows, filters by scope (tenant-wide rows plus rows scoped to the requesting agent or entity type), ranks by recency and, where tracked, usage (the schema above has no usage column, so a counter would have to live in metadata), and returns the top N per type formatted as prompt sections.
Admin UI — list, detail, edit, deactivate, delete.
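
The read path described above can be sketched as a pure function over already-fetched rows. This is an illustration under stated assumptions, not the actual internals of loadSharedContextPrompt(): buildPromptSections and inScope are hypothetical names, the "## type" section format is invented, ranking uses recency only, and the default of 8 simply mirrors today's prompt cap.

```typescript
type SharedContextType = "correction" | "lesson" | "routing" | "insight" | "guideline";

interface SharedContextRow {
  id: string;
  type: SharedContextType;
  content: string;
  context: string | null;
  agent_id: string | null;
  entity_type_id: string | null;
  active: boolean;
  created_at: string; // ISO 8601 in UTC, so string order matches time order
}

interface PromptScope {
  agentId?: string;
  entityTypeId?: string;
}

// A row applies when each scoping column is NULL (tenant-wide) or equals the request.
function inScope(row: SharedContextRow, scope: PromptScope): boolean {
  const agentOk = row.agent_id === null || row.agent_id === scope.agentId;
  const typeOk = row.entity_type_id === null || row.entity_type_id === scope.entityTypeId;
  return agentOk && typeOk;
}

// Keeps the newest topN active, in-scope rows per type and formats them as sections.
function buildPromptSections(rows: SharedContextRow[], scope: PromptScope, topN = 8): string {
  const eligible = rows
    .filter((row) => row.active && inScope(row, scope))
    .sort((a, b) => b.created_at.localeCompare(a.created_at)); // newest first
  const byType = new Map<SharedContextType, SharedContextRow[]>();
  for (const row of eligible) {
    const bucket = byType.get(row.type) ?? [];
    if (bucket.length < topN) bucket.push(row);
    byType.set(row.type, bucket);
  }
  const sections: string[] = [];
  byType.forEach((items, type) => {
    const lines = items.map((r) => (r.context ? `- ${r.content} (${r.context})` : `- ${r.content}`));
    sections.push(`## ${type}\n${lines.join("\n")}`);
  });
  return sections.join("\n\n");
}
```

Because the cap is applied at read time, older rows stay in the table and remain editable; they are only left out of the prompt.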

Feedback Review Agent

A system agent (registered once, available to every tenant via is_system=true per existing agent conventions) that runs on a scheduled heartbeat per tenant that opts in.

Configuration (default):

Heartbeat cron: hourly per tenant (skipped when no pending feedback exists)
Tool groups: context, entities-read, agents-admin, feedback-review
Instructions (base prompt):
You are the Feedback Review Agent. Every hour, review pending feedback in this workspace. For each feedback or group of related feedback:
1. Read the source context (chat, entity, session) to understand what happened
2. Decide if this represents a durable pattern or a one-off
3. If durable: create a correction, lesson, or insight via addCorrection/addLesson/addInsight with source_feedback_id provenance
4. If structural: consider updating agent config or entity type schema
5. If noise: dismiss with review_notes explaining why
6. Mark feedback as applied or dismissed
Prefer to dismiss weak signals rather than pollute shared_context with low-value rules.

Tools (new):

listFeedback({ status, sourceType?, agentId?, limit }) — paginate pending feedback
getFeedback(id) — detail view with linked source context
reviewFeedback(id, action: 'applied' | 'dismissed', notes: string) — mark processed
addInsight(content, context?) — new shared_context type for observations
updateSharedContext(id, { active?, content?, context? }) — tune existing rules
deactivateSharedContext(id, reason) — soft-delete a stale rule
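
A guard around reviewFeedback keeps the status lifecycle consistent. The sketch below assumes a transition table that the document does not spell out (terminal states cannot be reopened, and notes are mandatory); the function shape and the extra reviewer and clock parameters are illustrative, not the tool's real signature.

```typescript
type FeedbackStatus = "pending" | "reviewed" | "applied" | "dismissed";

interface ReviewableFeedback {
  id: string;
  status: FeedbackStatus;
  reviewed_by_agent_id: string | null;
  reviewed_at: string | null;
  review_notes: string | null;
}

// Assumed transitions: applied and dismissed are terminal.
const NEXT_STATUSES: Record<FeedbackStatus, readonly FeedbackStatus[]> = {
  pending: ["reviewed", "applied", "dismissed"],
  reviewed: ["applied", "dismissed"],
  applied: [],
  dismissed: [],
};

// Returns an updated copy of the row, or throws when the transition is not allowed.
function reviewFeedback(
  row: ReviewableFeedback,
  action: "applied" | "dismissed",
  notes: string,
  reviewerAgentId: string,
  now: Date,
): ReviewableFeedback {
  if (NEXT_STATUSES[row.status].indexOf(action) === -1) {
    throw new Error(`Cannot move feedback ${row.id} from ${row.status} to ${action}`);
  }
  const trimmed = notes.trim();
  if (trimmed.length === 0) {
    throw new Error("review notes are required");
  }
  return {
    ...row,
    status: action,
    reviewed_by_agent_id: reviewerAgentId,
    reviewed_at: now.toISOString(),
    review_notes: trimmed,
  };
}
```

Passing the clock in as a parameter keeps the function deterministic in tests; a real action would more likely let the database set reviewed_at.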

Guardrails:

The review agent's permission role lets it read feedback and workspace data but not delete either
It cannot create new agents or delete entity types
All actions are logged in session_events for audit
Admins can disable the heartbeat and review feedback manually

Admin "Agent Review" section

Replaces /admin/conversations (PR 707) and absorbs /sessions. Lives at /admin/agent-review under the "AI & Agents" group in ADMIN_SECTIONS.

Tabs:

Conversations — Chat observability. Queries chats + messages + feedback table. Browse all chats, view transcripts, see thumbs and feedback inline, link to related sessions. Admin can add observation feedback directly from messages. (If PR 707 lands concurrently, its implementation is rebased onto this structure and its /admin/conversations route redirects here.)
Sessions — Execution browser for all session types (agent/response/tool/mixed). Moved from user-facing /sessions page. Filter by status, type, agent, date. Link to session detail transcript.
Feedback — Unified feed across all sources. Filter by status (pending/reviewed/applied/dismissed), source_type, rating, agent. Click through to source (chat message, entity, session). Bulk actions: mark reviewed, dismiss with shared note.
Knowledge — Manage shared_context rows. Tabs within: corrections, lessons, routing, insights, guidelines. Each row shows content, context, scope (agent/entity_type), active state, provenance (which feedback generated it, which agent wrote it). CRUD: add, edit, deactivate, delete.
Evals — Golden set management and accuracy reports. List of golden sets (named groups of golden sessions), per-set accuracy dashboards, trend charts, drill-down into specific failed cases, link to trigger re-runs.

Eval system (session-centric golden sets)

Any session can be a golden case. An agent session (chat, extraction, response, tool, mixed) captures inputs, execution (all tool calls via session_events), and outputs. Promoting a session as "golden" turns it into a reusable regression test: replay the same inputs against a new agent or config, compare the new session's behavior to the golden one, and report divergence.

This generalizes from "entity extraction accuracy" to "does the agent still behave the way we want for this scenario." Response-level evals (compare extracted field values to expected values) become one case; chat behavior (did the agent use the right tools, produce the right summary) and tool-session behavior (did it call the right API with the right params) are all covered by the same mechanism.

Defining a golden session:

Admin reviews a completed session in /admin/agent-review/sessions or /admin/agent-review/conversations
Admin clicks "Promote as golden" on a session whose outputs they validated as correct
System writes sessions.metadata.golden = { set: '<set-name>', promoted_at, promoted_by, snapshot } where snapshot captures the reproducible inputs (agent_id, initial message, entity_id, view_context, tool_grants, etc.)
Golden sessions are immutable: sessions.status is frozen to completed and the session cannot be deleted while golden (soft-enforced in server actions + RLS)
Golden sessions can be organized into named sets ('pe-extraction', 'chat-smoke', 'tool-accuracy') via metadata.golden.set
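
The promotion and immutability rules above can be sketched at the action layer as follows. The metadata.golden shape comes from the text; SessionRecord, promoteToGolden, and assertMutable are hypothetical names, and the real guard would also need the RLS side that this sketch leaves out.

```typescript
interface GoldenMarker {
  set: string;
  promoted_at: string;
  promoted_by: string;
  snapshot: Record<string, unknown>;
}

interface SessionRecord {
  id: string;
  status: string;
  metadata: { golden?: GoldenMarker } & Record<string, unknown>;
}

function isGolden(session: SessionRecord): boolean {
  return session.metadata.golden !== undefined;
}

// Only completed sessions can be promoted; returns a copy with metadata.golden set.
function promoteToGolden(
  session: SessionRecord,
  setName: string,
  promotedBy: string,
  snapshot: Record<string, unknown>,
  now: Date,
): SessionRecord {
  if (session.status !== "completed") {
    throw new Error(`Session ${session.id} is ${session.status}; only completed sessions can be golden`);
  }
  const golden: GoldenMarker = {
    set: setName,
    promoted_at: now.toISOString(),
    promoted_by: promotedBy,
    snapshot,
  };
  return { ...session, metadata: { ...session.metadata, golden } };
}

// Call before delete or status-change actions; throws while the session is golden.
function assertMutable(session: SessionRecord, operation: "delete" | "status-change"): void {
  if (isGolden(session)) {
    throw new Error(`Session ${session.id} is golden; ${operation} is blocked until golden status is revoked`);
  }
}
```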

Running an eval:

Eval run triggered from admin UI, scheduled, or on post-deploy hook
For each golden session in the target set:
1. Read the input snapshot from sessions.metadata.golden.snapshot
2. Create a new session with those inputs (possibly using a different agent_id to test a candidate config)
3. Execute the session to completion via the session executor
4. Compare the new session's events + outputs to the golden session's events + outputs
Writes comparison results to sessions.metadata.eval_result on the replay session, linking back to the golden via metadata.eval_source = golden_session_id
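
The replay loop above can be sketched with the executor and comparator injected as functions. Everything here is illustrative: the real session executor is presumably asynchronous and takes richer inputs, the snapshot is reduced to two fields, and replayGoldenSet is a hypothetical name. Only the metadata.eval_source and metadata.eval_result keys come from the text.

```typescript
interface GoldenSnapshot {
  agent_id: string;
  initial_message: string;
}

interface SessionRun {
  id: string;
  output: string;
  metadata: Record<string, unknown>;
}

interface GoldenCase {
  session: SessionRun; // the golden session itself
  snapshot: GoldenSnapshot; // reproducible inputs captured at promotion time
}

// Synchronous stand-ins for the session executor and compare.ts, for brevity.
type Executor = (inputs: GoldenSnapshot) => SessionRun;
type Comparator = (golden: SessionRun, replay: SessionRun) => { overall_accuracy: number };

// Replays every golden case, optionally against a candidate agent, and
// annotates each replay with a link to its golden and the comparison result.
function replayGoldenSet(
  cases: GoldenCase[],
  execute: Executor,
  compare: Comparator,
  candidateAgentId?: string,
): SessionRun[] {
  return cases.map((goldenCase) => {
    const inputs = candidateAgentId
      ? { ...goldenCase.snapshot, agent_id: candidateAgentId }
      : goldenCase.snapshot;
    const replay = execute(inputs);
    return {
      ...replay,
      metadata: {
        ...replay.metadata,
        eval_source: goldenCase.session.id,
        eval_result: compare(goldenCase.session, replay),
      },
    };
  });
}
```

Injecting the executor keeps the loop testable without running a model; the golden session itself is never modified.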

Comparison logic (features/evals/lib/compare.ts), dispatched by session_type:

type ComparisonResult = {
  session_type: SessionType
  dimensions: Record<string, DimensionResult> // field-by-field comparison
  overall_accuracy: number // 0..1
  divergences: Divergence[] // ordered list of mismatches for drill-down
}

// agent | chat sessions: compare tool-call sequence + final message
// - tool_calls: set overlap (order-insensitive) with expected tools
// - tool_args: per-tool, compare input args via jsonDiff
// - final_message: semantic similarity score (optional) + keyword presence
// - entities_created: compare what entities the session created
// - entities_updated: compare which fields were mutated

// response sessions: compare values per criteria dimension
// - numeric: tolerance-based match
// - text: exact or semantic similarity
// - select: exact match
// - relation-rank: order-aware set comparison

// tool sessions: compare tool output and any entity mutations
// - output: jsonDiff against golden output
// - entity mutations: same as agent

// extraction tasks (session_type='agent' with output_type='field'|'fields'):
// - submitResponse values: dimension-by-dimension comparison
// - field metadata: source attribution, confidence
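
Two of the comparison primitives named in the comments above are small enough to sketch directly. The choice of Jaccard similarity for "set overlap" and of a relative tolerance with a 1% default for "tolerance-based match" are assumptions made for illustration; the document does not fix either, and toolCallOverlap and numericMatch are hypothetical names.

```typescript
// Order-insensitive overlap between two tool-call lists, as Jaccard similarity:
// |intersection| / |union| over the distinct tool names. Returns a value in 0..1.
function toolCallOverlap(goldenTools: string[], replayTools: string[]): number {
  const golden = new Set(goldenTools);
  const replay = new Set(replayTools);
  if (golden.size === 0 && replay.size === 0) return 1; // nothing expected, nothing called
  let shared = 0;
  golden.forEach((tool) => {
    if (replay.has(tool)) shared += 1;
  });
  return shared / (golden.size + replay.size - shared);
}

// Tolerance-based numeric match: the values agree when their difference is
// within relTol of the larger magnitude. Exact equality always passes.
function numericMatch(expected: number, actual: number, relTol = 0.01): boolean {
  if (expected === actual) return true;
  const scale = Math.max(Math.abs(expected), Math.abs(actual));
  return Math.abs(expected - actual) <= relTol * scale;
}
```

Because the overlap ignores order and repetition, a replay that calls the right tools in a different sequence still scores 1; a sequence-aware comparison would need a separate dimension.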

Scoring and reporting:

Per dimension: pass/fail + similarity score
Per session: overall accuracy % (weighted by dimension importance)
Per agent: accuracy across all golden sessions in a set
Per set: accuracy trend over time (chart)
Admin UI shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
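
The per-session and per-agent roll-ups reduce to a weighted mean and a grouped mean. A minimal sketch, assuming each dimension carries a similarity in 0..1 and a non-negative weight; DimensionScore, overallAccuracy, and accuracyByAgent are hypothetical names, and how weights are assigned is not covered by the document.

```typescript
interface DimensionScore {
  name: string;
  similarity: number; // 0..1
  weight: number; // relative importance, must not be negative
}

// Per session: weighted mean of dimension similarities.
function overallAccuracy(dimensions: DimensionScore[]): number {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("at least one dimension with positive weight is required");
  }
  const weighted = dimensions.reduce((sum, d) => sum + d.weight * d.similarity, 0);
  return weighted / totalWeight;
}

// Per agent: unweighted mean of session accuracies within a set.
function accuracyByAgent(
  results: Array<{ agent_id: string; accuracy: number }>,
): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const result of results) {
    const current = totals.get(result.agent_id) ?? { sum: 0, count: 0 };
    totals.set(result.agent_id, {
      sum: current.sum + result.accuracy,
      count: current.count + 1,
    });
  }
  const means = new Map<string, number>();
  totals.forEach((t, agentId) => means.set(agentId, t.sum / t.count));
  return means;
}
```

For example, a session whose tool_calls dimension (weight 3) matches fully while final_message (weight 1) misses entirely scores 3 / 4 = 0.75.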

Self-improvement integration:

When a replay fails on a golden session, the divergence becomes a feedback row (source_type='session', context includes golden_session_id and replay_session_id)
Feedback review agent picks it up, analyzes the divergence, generates a correction or adjusts agent config
Accuracy-trend dashboard surfaces regressions within the admin review section
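
Turning a failed replay into a feedback row could look like the sketch below. The source_type and the two context keys come from the text; the pass threshold, the choice to link session_id to the replay session, and the replayFailureToFeedback name are assumptions. created_by is passed in because the feedback schema declares it NOT NULL.

```typescript
interface ReplayOutcome {
  golden_session_id: string;
  replay_session_id: string;
  agent_id: string;
  overall_accuracy: number; // 0..1
  divergences: string[]; // human-readable mismatch descriptions
}

interface FeedbackInsert {
  tenant_id: string;
  source_type: "session";
  rating: "negative";
  comment: string;
  session_id: string;
  agent_id: string;
  context: { golden_session_id: string; replay_session_id: string };
  status: "pending";
  created_by: string;
}

// Returns an insert payload for a failed replay, or null when the replay passed.
// passThreshold is an assumption; by default only a perfect score passes.
function replayFailureToFeedback(
  outcome: ReplayOutcome,
  tenantId: string,
  createdBy: string,
  passThreshold = 1,
): FeedbackInsert | null {
  if (outcome.overall_accuracy >= passThreshold) return null;
  return {
    tenant_id: tenantId,
    source_type: "session",
    rating: "negative",
    comment: outcome.divergences.join("; "),
    session_id: outcome.replay_session_id, // assumed: link the replay, not the golden
    agent_id: outcome.agent_id,
    context: {
      golden_session_id: outcome.golden_session_id,
      replay_session_id: outcome.replay_session_id,
    },
    status: "pending",
    created_by: createdBy,
  };
}
```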

Persona-driven golden sets:

content/docs/personas/*.mdx defines user personas and their jobs-to-be-done
Each persona should have at least one golden session capturing a canonical interaction (e.g., pe-analyst asking for deal comparisons, sales-ops adding contacts)
Persona-based golden sets give us user-journey regression coverage, not just field-level accuracy

No new tables in phase 5. sessions.metadata.golden and sessions.metadata.eval_result carry all the data. If we later need cross-run aggregation dashboards (accuracy over the last 10 runs), we can add an eval_runs table in phase 5b:

CREATE TABLE eval_runs (
  id uuid PRIMARY KEY,
  tenant_id uuid NOT NULL,
  name text NOT NULL, -- 'golden-pe-extraction', 'chat-smoke-2026-04-15'
  set_name text, -- the golden set being evaluated
  agent_id uuid, -- candidate agent being tested
  triggered_by text, -- 'manual' | 'scheduled' | 'post-deploy'
  started_at timestamptz NOT NULL DEFAULT now(),
  completed_at timestamptz,
  golden_session_count integer,
  passed_count integer,
  failed_count integer,
  results jsonb NOT NULL DEFAULT '{}'::jsonb,
  -- { bySession: { goldenSessionId: { passed, overall_accuracy, divergences } }, ... }
  created_by uuid
);

Phase 5 ships without eval_runs; per-session metadata is enough to list recent replays and drill into divergences. Add the table when the UI needs cross-run charts.

Migration path

Phase 1: schema

Create feedback table
Backfill feedback from chat_feedback (source_type='chat', rating mapping from 'up'/'down')
Create new shared_context table structure (temporary name shared_context_v2, or drop-and-recreate in same migration)
Backfill from old shared_context JSON arrays — expand each array item into a row with appropriate type and parsed content/context
Drop old shared_context, rename shared_context_v2 → shared_context
Mark chat_feedback as deprecated; drop in next release
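
The two backfills in this phase are mostly mapping logic. The sketch below shows one way to express them, with heavy assumptions: the legacy key names ('corrections', 'lessons', 'routing'), the item shapes (plain strings or objects with content and context), and the fallback to 'neutral' are all hypothetical, since the document only states that items live in JSON arrays and that ratings map from 'up'/'down'. The real migration would be SQL; this is the same logic in TypeScript for readability.

```typescript
type SharedContextType = "correction" | "lesson" | "routing" | "insight" | "guideline";

interface LegacyContextRow {
  tenant_id: string;
  key: string; // assumed legacy keys, e.g. 'corrections'
  value: unknown; // JSON array blob
}

interface NormalizedContextRow {
  tenant_id: string;
  type: SharedContextType;
  content: string;
  context: string | null;
}

// Assumed mapping from legacy JSON keys to the new type column.
const KEY_TO_TYPE: Record<string, SharedContextType | undefined> = {
  corrections: "correction",
  lessons: "lesson",
  routing: "routing",
};

// Expands one key/value blob into one row per array item; unknown shapes are skipped.
function expandLegacyRow(legacy: LegacyContextRow): NormalizedContextRow[] {
  const type = KEY_TO_TYPE[legacy.key];
  if (type === undefined || !Array.isArray(legacy.value)) return [];
  const rows: NormalizedContextRow[] = [];
  for (const item of legacy.value) {
    if (typeof item === "string") {
      rows.push({ tenant_id: legacy.tenant_id, type, content: item, context: null });
    } else if (item !== null && typeof item === "object" && typeof item.content === "string") {
      const context = typeof item.context === "string" ? item.context : null;
      rows.push({ tenant_id: legacy.tenant_id, type, content: item.content, context });
    }
  }
  return rows;
}

// chat_feedback stores 'up'/'down'; anything else is treated as neutral (assumption).
function mapLegacyRating(value: string): "positive" | "negative" | "neutral" {
  if (value === "up") return "positive";
  if (value === "down") return "negative";
  return "neutral";
}
```

Items that do not match either assumed shape are dropped silently here; a real backfill should count and report them so that no legacy knowledge is lost unnoticed.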

Phase 2: code

Rewrite features/context/server/actions.ts and features/context/lib/load.ts to read/write normalized rows. Preserve public signatures.
Update addCorrection/addLesson tools to write rows (signatures unchanged, internals changed)
Update features/chat/feedback/server/actions.ts to write to feedback table
Update chat feedback API routes to query feedback table
Wire response rejection flow to insert feedback row
Wire extraction rejection flow to insert feedback row alongside the retry
Wire session failure to insert feedback row

Phase 3: UI

Build features/feedback/ module (types, hooks, components)
Build features/evals/ module (types, compare logic, reporting components)
Build new /admin/agent-review section with 5 tabs
Add agent-review entry to ADMIN_SECTIONS under "AI & Agents" group
Delete app/(app)/sessions/page.tsx, add redirect to /tasks
Keep /sessions/[id] detail page — linked from admin and entity panels
If PR 707 lands concurrently, add a redirect from /admin/conversations → /admin/agent-review/conversations

Phase 4: review agent

Create system agent record (feedback-review-agent)
Create task template for "Feedback Review"
Configure heartbeat
Write system prompt and guardrails
Ship disabled by default — admin opts in per workspace

Phase 5: evals

Add "Promote as golden" action to session detail + conversation detail (writes metadata.golden)
Add protection in session actions: block delete and block status changes when metadata.golden is set
Build input-snapshot extractor (features/evals/server/snapshot.ts) that captures reproducible inputs from session + events
Build replay runner (features/evals/server/replay.ts) that creates a new session from a snapshot, invokes the session executor, and sets metadata.eval_source
Build comparison logic per session_type (compare-agent.ts, compare-response.ts, compare-tool.ts)
Build admin eval UI: golden session manager, replay trigger, pass/fail matrix, divergence drill-down
Seed persona-driven golden sessions for pe-analyst, sales-ops, first-time-admin as a starter set
Wire replay failures into feedback table as source_type='session' entries so the review agent picks up regressions

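Phases 1 and 2 are the foundation and should ship together: phase 1 replaces the shared_context structure that the existing read/write code depends on, and phase 2 updates that code. Phases 3-5 can then land incrementally, each independently shippable.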

Trade-offs

Why a new feedback table instead of generalizing session_events? Session events are append-only audit telemetry — "agent called tool X at time Y." Feedback is evaluation — "the output was bad because Z." They have different lifecycles (events are immutable; feedback has a status machine), different query patterns (events are per-session; feedback is cross-session aggregate), and different readers (events for replay; feedback for learning). Mixing them would blur the intent and bloat the events table.

Why a new shared_context structure instead of an entity type? Considered making shared_context rows entities of type agent-knowledge or learning. Pros: zero new tables, standard entity CRUD, tags/relations free. Cons: entity overhead on the hot path (every agent prompt loads shared context), entity semantics don't fit ephemeral lifecycle knowledge well, and the admin UI for entities is geared toward business records, not system metadata. The dedicated table is ~50 lines of migration and wins on performance, semantics, and purpose-fit.

Why drop chat_feedback instead of keeping both? Keeping chat_feedback as a chat-specific table + a generic feedback table means chat thumbs would write to two places, or chat-specific feedback would split from everything else. Consolidating to one table means one query path, one review pipeline, one admin view. Migration is straightforward (tens of rows in practice).

Why keep shared_context as the name? It avoids rename churn across loadSharedContextPrompt(), tests, and docs. CONTEXT_KEYS is the one identifier that goes away, because types become values on rows rather than JSON keys. The semantic meaning ("context shared across all agent runs for this tenant") still fits the new structure, and keeping the name matches the stated user preference.

Why not merge feedback and shared_context into one table? Feedback is input (evaluation signal); shared_context is output (distilled knowledge). A single correction in shared_context can be produced by reviewing 5-10 feedback signals. The write patterns, lifecycles, and read patterns are different enough that one table would force either a status='feedback' vs status='rule' split (back to a type column hack) or an awkward denormalization. Two tables is cleaner.

Why defer eval_runs table? Starting with per-session metadata (sessions.metadata.eval_result, linked to golden via metadata.eval_source) is enough for the first phase. Replays are themselves sessions — listing them, filtering by status, drilling in, all work via the existing session queries. If we need cross-run aggregation dashboards or eval history charts, add the table then. YAGNI.

Why session-centric evals instead of entity+response? Considered the narrower approach: tag entities as eval set, promote responses as golden, compare new responses. This works for extraction accuracy but misses chat behavior, tool-call correctness, and multi-step agent journeys. Session-centric evals let us regress-test any agent output — including the persona journeys defined in content/docs/personas/ — with one mechanism. The entity+response case is a specialization (response sessions compared dimension-by-dimension). One primitive covers every case.

Why not snapshot session inputs into a dedicated golden_cases table? Considered extracting golden input snapshots into their own table so original sessions could be freely modified or deleted. Decided against: the session record already contains the inputs (first events in session_events), and snapshotting to metadata.golden.snapshot at promotion time locks in the reproducible view. This avoids a parallel data model and lets admins browse golden sessions through the same session list UI they already use. Immutability is enforced at the action layer rather than at a schema level.

Acceptance Criteria

Feedback intake

feedback table exists with the schema above, RLS policies, and indexes
Chat thumbs write to feedback with source_type='chat' and correct context
Response rejection writes to feedback with source_type='response'
Extraction rejection writes to feedback alongside the retry
Session failure writes to feedback automatically
chat_feedback data fully backfilled into feedback
chat_feedback table dropped after one release

Workspace knowledge

shared_context normalized table exists with schema above, RLS, indexes
All existing JSON-array rows backfilled into individual rows
addCorrection/addLesson tools write rows (same signatures)
New addInsight tool exists
loadSharedContextPrompt() signature unchanged; reads from new structure
Scoping works: agent_id-specific rules only load for that agent, entity_type_id-specific rules only load for that type
active = false rows excluded from prompt injection
Provenance links preserved (source_feedback_id, created_by_agent_id, created_by_user_id)

Feedback review agent

System agent exists with correct tool groups and permissions
Heartbeat task runs on schedule
Review agent can list, read, and process pending feedback
Review agent creates shared_context rows with provenance
Review agent can update agent configs when appropriate
Review agent marks feedback as applied/dismissed with notes
All review actions logged in session_events
Admin can disable review agent per workspace

All new migrations are reversible
pnpm test passes with updated and new tests
pnpm typecheck, pnpm lint, pnpm build pass
documents/DATABASE.md reflects new schema
/docs/features/agent-system updated with feedback + evals sections
content/docs/data-model.mdx updated with new tables
Feature docs for /admin/agent-review added to content/docs/features/

Files

Database

supabase/migrations/20260415010000_feedback_table.sql — create feedback + backfill from chat_feedback
supabase/migrations/20260415010100_shared_context_normalized.sql — restructure shared_context + backfill from JSON arrays
supabase/migrations/20260415010200_drop_chat_feedback.sql — drop deprecated chat_feedback (next release after phase 1 ships)

Feedback module (new)

features/feedback/types.ts — FeedbackRecord, FeedbackSourceType, FeedbackStatus
features/feedback/server/actions.ts — createFeedback, listFeedback, reviewFeedback, getFeedback
features/feedback/server/queries.ts — tenant-scoped query builders with joins
features/feedback/hooks/use-feedback-list.ts — React Query hook
features/feedback/components/feedback-list.tsx — list component with filters
features/feedback/components/feedback-detail.tsx — detail view with source context
app/api/feedback/route.ts — POST (create), GET (list with filters)
app/api/feedback/[id]/route.ts — GET (detail), PATCH (review)

Context module (refactored)

features/context/types.ts — rewritten: SharedContextRecord, SharedContextType enum
features/context/server/actions.ts — createSharedContextRule, updateSharedContextRule, deactivateSharedContextRule, listSharedContext
features/context/lib/load.ts — loadSharedContextPrompt() signature unchanged; reads normalized rows
features/context/lib/format.ts — format functions unchanged externally
features/context/components/shared-context-manager.tsx — admin CRUD component
app/api/admin/shared-context/route.ts — GET (list), POST (create)
app/api/admin/shared-context/[id]/route.ts — PATCH (update), DELETE (hard delete if needed)

Tools (updated)

features/tools/context-tools.ts — addCorrection, addLesson updated to write rows; new addInsight, updateSharedContext, deactivateSharedContext
features/tools/feedback-tools.ts (new) — listFeedback, getFeedback, reviewFeedback, addFeedback

Feedback intake wiring

features/chat/components/message-actions-bar.tsx — write to /api/feedback instead of /api/chats/[id]/feedback
features/chat/feedback/server/actions.ts — rewrite to use feedback table
features/responses/server/actions.ts — on rejection, insert feedback row
features/inngest/functions/feedback-rerun.ts — insert feedback row alongside retry
features/sessions/server/event-log.ts — on session failure transition, insert feedback row

Feedback review agent

features/agents/system-agents.ts — register feedback-review-agent
features/tasks/system-tasks.ts — register "Feedback Review" task template
Migration to seed review agent + task

Evals module (new)

features/evals/types.ts — GoldenSessionSnapshot, EvalComparison, Divergence, DimensionResult
features/evals/server/actions.ts — promoteSessionToGolden, revokeGoldenStatus, replayGoldenSession, listGoldenSessions, listReplaysForGolden
features/evals/server/snapshot.ts — extract reproducible input snapshot from a session + events
features/evals/server/replay.ts — construct new session from snapshot, invoke session executor
features/evals/lib/compare.ts — top-level comparator dispatching by session_type
features/evals/lib/compare-agent.ts — tool-call sequence, args, entity mutations, final message
features/evals/lib/compare-response.ts — per-dimension value comparison (reuses existing scoring primitives)
features/evals/lib/compare-tool.ts — tool output + mutation comparison
features/evals/lib/similarity.ts — helpers (numeric tolerance, keyword presence, optional semantic similarity)
features/evals/components/eval-report.tsx — pass/fail matrix + trend
features/evals/components/golden-session-manager.tsx — list, promote, organize into sets
features/evals/components/replay-diff.tsx — side-by-side divergence drill-down
features/sessions/server/actions.ts — add is_golden helper (reads metadata.golden) + guard against delete/status-change when golden
app/api/admin/evals/golden/route.ts — GET (list golden sessions), POST (promote session)
app/api/admin/evals/golden/[id]/route.ts — DELETE (revoke), PATCH (rename set)
app/api/admin/evals/runs/route.ts — POST (trigger replay for a set or single session)
app/api/admin/evals/runs/[id]/route.ts — GET (replay results + diff)
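
As a rough picture of how `compare.ts` might dispatch by session type, the sketch below uses a discriminated union and returns a list of divergences. The shapes of `SessionRun`, `Divergence`, and `EvalComparison` are assumptions (the real definitions belong in `features/evals/types.ts`), the three session-type values are inferred from the comparator file names, and the agent branch checks only the tool-call name sequence, leaving out args, entity mutations, and the final message.

```typescript
// Hypothetical shapes; the real types in features/evals/types.ts may differ.
interface ToolCall {
  name: string;
}

type SessionRun =
  | { sessionType: "agent"; toolCalls: ToolCall[] }
  | { sessionType: "response"; values: Record<string, string | number> }
  | { sessionType: "tool"; output: string };

interface Divergence {
  path: string;
  expected: unknown;
  actual: unknown;
}

interface EvalComparison {
  passed: boolean;
  divergences: Divergence[];
}

// Agent sessions: compare tool-call names position by position.
function compareAgent(expected: ToolCall[], actual: ToolCall[]): Divergence[] {
  const out: Divergence[] = [];
  const len = Math.max(expected.length, actual.length);
  for (let i = 0; i < len; i++) {
    const e = expected[i]?.name;
    const a = actual[i]?.name;
    if (e !== a) out.push({ path: `toolCalls[${i}].name`, expected: e, actual: a });
  }
  return out;
}

// Response sessions: strict equality for every dimension in the golden run.
function compareResponse(
  expected: Record<string, string | number>,
  actual: Record<string, string | number>,
): Divergence[] {
  return Object.keys(expected)
    .filter((key) => expected[key] !== actual[key])
    .map((key) => ({ path: `values.${key}`, expected: expected[key], actual: actual[key] }));
}

// Top-level comparator: dispatch on the session type of both runs.
function compare(golden: SessionRun, replay: SessionRun): EvalComparison {
  let divergences: Divergence[];
  if (golden.sessionType === "agent" && replay.sessionType === "agent") {
    divergences = compareAgent(golden.toolCalls, replay.toolCalls);
  } else if (golden.sessionType === "response" && replay.sessionType === "response") {
    divergences = compareResponse(golden.values, replay.values);
  } else if (golden.sessionType === "tool" && replay.sessionType === "tool") {
    divergences =
      golden.output === replay.output
        ? []
        : [{ path: "output", expected: golden.output, actual: replay.output }];
  } else {
    // A replay of a different session type is itself a divergence.
    divergences = [
      { path: "sessionType", expected: golden.sessionType, actual: replay.sessionType },
    ];
  }
  return { passed: divergences.length === 0, divergences };
}
```

Returning a path for each divergence is what would let a drill-down view point at the exact step or dimension where a replay left the golden run.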

Admin Agent Review section

app/(app)/admin/agent-review/page.tsx — landing page / tab container
app/(app)/admin/agent-review/conversations/page.tsx — absorbs PR 707
app/(app)/admin/agent-review/conversations/[id]/page.tsx — absorbs PR 707 detail
app/(app)/admin/agent-review/sessions/page.tsx — moved from user-facing
app/(app)/admin/agent-review/feedback/page.tsx — new
app/(app)/admin/agent-review/knowledge/page.tsx — new
app/(app)/admin/agent-review/evals/page.tsx — new
features/admin/lib/sections.ts — add agent-review entry

Removed / redirected

app/(app)/sessions/page.tsx — delete, redirect to /tasks
app/(app)/admin/conversations/** — redirect to /admin/agent-review/conversations/**
chat_feedback table — drop after backfill + one release

Docs

content/docs/features/agent-system.mdx — add feedback + evals + review sections
content/docs/data-model.mdx — document feedback and new shared_context
content/docs/features/agent-review.mdx (new) — admin section guide
documents/DATABASE.md — update table inventory
documents/CHANGELOG.md — entry when phase 1 ships

Open Questions

Should feedback rows be user-visible? Users can see their own chat feedback today. Decision: yes for user's own feedback (created_by = user), no for aggregate or other users' feedback (admin-only). RLS handles this.
How aggressive should the review agent be? Over-eager rule creation pollutes shared_context; under-eager creation leaves signal unused. Decision: start conservative. The review agent creates rules only when 2+ feedback signals support the same pattern, or when an admin manually flags a pattern.
Per-agent shared_context scoping: opt-in or automatic? When an agent writes a lesson from its own session, should it default to agent_id = self (narrow scope) or agent_id = null (workspace-wide)? Decision: default to workspace-wide with explicit agent-specific opt-in, since most learnings should generalize.
Eval run scheduling: manual or automatic? Decision: start manual (an admin triggers runs from the UI). Phase 2 could add automatic post-deploy runs and scheduled regression checks.
Backfill of extraction rejections into feedback? The feedback-rerun.ts path is live today but does not record feedback. Should historical extraction rejections (entity_responses rows where status='rejected', together with their rejection reasons) be backfilled? Decision: yes, as a one-time backfill in the phase 1 migration.
How reproducible are chat session inputs? Chat sessions depend on user_info, workspace context, skills, and shared_context at the time of the original run — all of which change over time. Replay fidelity will drift. Decision: snapshot the full prompt context (entity types, shared_context rules, memories, skills) into metadata.golden.snapshot at promotion time. Replays use the snapshot, not live context. Admins can opt into "live context" replays if they want to test whether new shared_context rules actually improve behavior on past scenarios.
Semantic similarity for text comparisons — embedding model? Comparing chat final messages or text field values benefits from embedding-based similarity. Decision: phase 5 ships with keyword-overlap + exact-match. Phase 5b adds embedding similarity if the signal-to-noise ratio warrants the cost/complexity.
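
The keyword-overlap plus exact-match baseline from the last question could look roughly like this. It is a sketch under stated assumptions: overlap is measured as Jaccard similarity over lowercase token sets (one reasonable choice, not one the plan specifies), the tokenizer is ASCII-only, and the `0.6` default threshold and the function names are placeholders for illustration.

```typescript
// Split text into a set of lowercase alphanumeric tokens (ASCII-only).
function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((token) => token.length > 0),
  );
}

// Jaccard overlap: shared tokens divided by the size of the union.
function keywordOverlap(a: string, b: string): number {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  // Two texts with no tokens at all are treated as identical.
  if (tokensA.size === 0 && tokensB.size === 0) return 1;
  let shared = 0;
  tokensA.forEach((token) => {
    if (tokensB.has(token)) shared++;
  });
  return shared / (tokensA.size + tokensB.size - shared);
}

// Exact match first (ignoring surrounding whitespace), then keyword overlap.
// The 0.6 threshold is an arbitrary placeholder, not a tuned value.
function textMatches(expected: string, actual: string, minOverlap = 0.6): boolean {
  if (expected.trim() === actual.trim()) return true;
  return keywordOverlap(expected, actual) >= minOverlap;
}
```

For example, "Invoice total is 42" and "invoice TOTAL: 42" share three of four distinct tokens, so they match at the placeholder threshold even though they are not exactly equal. A measure like this cannot see paraphrases with different vocabulary, which is the gap embedding similarity would be meant to close.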
