Unified Agent Intelligence
Consolidate feedback intake, self-improvement, admin review, and eval systems into a thin harness around two primitives — feedback inputs and shared_context outputs — that turn every agent interaction into durable workspace intelligence.
Problem
Amble has six parallel channels for capturing agent feedback, and only one of them closes the loop into durable intelligence. The rest are either write-only dead ends or one-shot fixes that never generalize into workspace knowledge.
What exists today:
| Channel | Writes to | Reads from | End-to-end? |
|---|---|---|---|
| Corrections / Lessons | shared_context JSON blobs via addCorrection/addLesson tools | Every agent prompt via loadSharedContextPrompt() | YES |
| User memories | user_memories via saveMemory tool | User-scoped prompt injection | YES (user-scoped only) |
| Chat feedback | chat_feedback via thumbs-up/down buttons | Nothing | NO |
| Response scoring | entity_responses via form submission | Aggregation views, field promotion | Partial (no learning loop) |
| Extraction rejection | Inngest event → feedback-rerun.ts | One-shot retry with feedback instruction | Limited (no durable lesson) |
| Session events | session_events via append | Audit trail only | NO |
The core problems:
- Fragmented intake. Feedback on agent output lives in four places (`chat_feedback`, `entity_responses.status`, Inngest rejection events, implicit in `session_events`). No unified read path means no systematic review.
- Brittle storage. Corrections and lessons are stored as JSON arrays inside a key/value `shared_context` table. Individual items cannot be edited, deactivated, or audited without rewriting the whole array. The 8-item prompt cap is treated as a data cap — any 9th lesson falls off forever.
- No review pipeline. Nothing processes `chat_feedback`, response rejections, or session failures into durable learnings. The one system that does learn durably (corrections/lessons) requires agents to proactively call tools — there is no review loop.
- No eval system. Extraction accuracy, agent output quality, and regression detection have no golden-set mechanism. "Did this change improve or regress agent behavior on canonical scenarios?" cannot be answered. Today, scoring only covers entity-response field values — chat behavior, tool-call sequences, and full-session outcomes have no reference set.
- Surface sprawl. `/sessions` is a user-facing page that exposes raw execution machinery customers don't need. PR 707 proposed `/admin/conversations` as another standalone surface (not yet merged to dev). Admin has no single place to review agent work, manage workspace knowledge, or evaluate quality.
This is vibecoded sprawl. It ships features that look like feedback loops but don't close. The fix is to collapse the surface to a thin harness — two tables, one review task, one admin section — and let the durable knowledge (workspace rules, learned patterns, eval results) become the asset that grows over time.
Goals
- Unify feedback intake. All agent-output evaluations (chat thumbs, response rejections, extraction rejections, session failures, admin observations) flow into a single `feedback` table with consistent shape.
- Normalize workspace knowledge. Corrections, lessons, routing, insights, and guidelines become individually addressable rows in `shared_context`, not JSON array items. Each carries provenance, scope, active state, and lifecycle metadata.
- Close the self-improvement loop. A scheduled feedback review agent reads pending feedback, identifies patterns, and generates durable `shared_context` entries. Human admins can review, edit, approve, or dismiss.
- Add session-centric golden-set evals. Any completed session (chat, extraction, response, tool, mixed) can be flagged as golden. Replay its inputs against a candidate agent or config, compare the new session's tool calls, outputs, and mutations to the golden, and report divergence. Persona-driven canonical sessions become regression coverage for user journeys.
- Consolidate admin surfaces. One `/admin/agent-review` section provides the conversation browser, sessions browser, feedback review, shared_context management, and eval reporting. PR 707's conversation-browser work, if still pending, gets rebased onto this structure rather than shipped standalone.
- Remove user-facing surfaces that customers don't need. `/sessions` redirects to `/tasks`. Session detail remains for deep drill-down from admin and entity panels.
Non-goals
- Not rebuilding the agent prompt construction path — `loadSharedContextPrompt()` keeps its signature; internals change.
- Not changing `user_memories` — user-scoped memory stays as-is, separate from workspace knowledge.
- Not changing `entity_responses` or `criteria_sets` — the response/scoring system is sound; it just gets wired into feedback generation.
- Not changing `session_events` — it stays append-only telemetry; feedback is separate.
- Not introducing a new primitive for evals — golden sets are sessions flagged as `metadata.golden`, replayed through the existing session executor, and compared via a new `features/evals/lib/compare.ts` dispatched by `session_type`.
Design
Two tables, purpose-built
feedback (new) — inputs to the self-improvement system.
Replaces chat_feedback and absorbs all signals that evaluate agent output. Lightweight, append-mostly, has a processing lifecycle.
```sql
CREATE TABLE feedback (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
-- Source of the feedback
source_type text NOT NULL CHECK (source_type IN (
'chat', 'response', 'extraction', 'tool', 'session', 'observation'
)),
rating text NOT NULL CHECK (rating IN ('positive', 'negative', 'neutral')),
comment text,
-- Links to the agent output being evaluated
session_id uuid REFERENCES sessions(id) ON DELETE SET NULL,
entity_id uuid REFERENCES entities(id) ON DELETE SET NULL,
agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
context jsonb NOT NULL DEFAULT '{}'::jsonb,
-- context carries source-specific metadata:
-- chat: { chat_id, message_index }
-- response: { response_id, criteria_set_id, field_name }
-- extraction: { field_name, entity_type_id, rejected_value }
-- tool: { tool_slug, tool_run_id }
-- session: { failure_reason, event_id }
-- Processing lifecycle
status text NOT NULL DEFAULT 'pending' CHECK (status IN (
'pending', 'reviewed', 'applied', 'dismissed'
)),
reviewed_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
reviewed_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,
reviewed_at timestamptz,
review_notes text,
created_by uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_feedback_tenant_status ON feedback(tenant_id, status, created_at DESC);
CREATE INDEX idx_feedback_session ON feedback(session_id) WHERE session_id IS NOT NULL;
CREATE INDEX idx_feedback_entity ON feedback(entity_id) WHERE entity_id IS NOT NULL;
CREATE INDEX idx_feedback_agent ON feedback(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_feedback_pending ON feedback(tenant_id, created_at DESC) WHERE status = 'pending';
```

shared_context (restructured) — outputs of the self-improvement system.
Keeps its current name. Drops the key/value JSON-blob shape. Each correction, lesson, insight, or guideline becomes its own row with individual lifecycle and provenance.
```sql
-- New structure (requires drop + recreate, see migration section)
CREATE TABLE shared_context (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
-- What kind of knowledge
type text NOT NULL CHECK (type IN (
'correction', 'lesson', 'routing', 'insight', 'guideline'
)),
content text NOT NULL,
context text, -- when this applies (e.g., "when extracting valuation for PE deals")
-- Scoping (NULL = tenant-wide; set for narrower scope)
agent_id uuid REFERENCES agents(id) ON DELETE CASCADE,
entity_type_id uuid REFERENCES entity_types(id) ON DELETE CASCADE,
-- Lifecycle
active boolean NOT NULL DEFAULT true,
-- Provenance
source_feedback_id uuid REFERENCES feedback(id) ON DELETE SET NULL,
created_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
created_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,
metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_shared_context_tenant_type_active ON shared_context(tenant_id, type, active, created_at DESC);
CREATE INDEX idx_shared_context_agent ON shared_context(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_shared_context_entity_type ON shared_context(entity_type_id) WHERE entity_type_id IS NOT NULL;
CREATE INDEX idx_shared_context_source_feedback ON shared_context(source_feedback_id) WHERE source_feedback_id IS NOT NULL;
```

Why keep the name shared_context:
- Zero rename churn in code, tests, and docs that reference it.
- Semantically accurate: context shared across all agent runs for this tenant.
- `loadSharedContextPrompt()` keeps its signature — only internals change.
- `CONTEXT_KEYS` is removed; types are now enum values on rows, not JSON keys.
The feedback-to-knowledge pipeline
```text
Agent produces output
↓
Human/system evaluates (thumbs, reject, score, flag)
↓
feedback row inserted (status='pending')
↓
Feedback Review Agent (scheduled task, hourly)
↓
Reads pending feedback, groups by pattern
↓
Decides action per feedback:
• Create shared_context row (correction/lesson/insight)
• Update agent config (prompt tweaks, tool groups)
• Update entity/entity_type (structural fix)
• Dismiss (noise, already addressed, user error)
↓
Marks feedback as applied/dismissed with review_notes
↓
loadSharedContextPrompt() reads active rows, formats by type,
injects top N per type into every future agent prompt
```

Feedback intake points
Existing channels route into the feedback table:
- Chat thumbs (existing UI) — `features/chat/components/message-actions-bar.tsx` writes to `feedback` with `source_type='chat'`, `context={chat_id, message_index}`. `chat_feedback` table backfilled and dropped.
- Response rejection — When a response status transitions to `rejected` with a `rejection_reason`, `submitResponseAdmin()` writes a `feedback` row with `source_type='response'`, `context={response_id, criteria_set_id}`. The existing `extraction/result-rejected` Inngest event continues to fire.
- Extraction rejection — `feedback-rerun.ts` continues to run the one-shot retry, AND writes a `feedback` row with `source_type='extraction'`, `rating='negative'`, `status='applied'` (the retry itself is the application). The review agent can still pick up patterns across rejections.
- Tool results — Agents and admins can call a new `addFeedback` tool to record observations about tool outputs.
- Session failures — Sessions that transition to `failed` status automatically generate a `feedback` row with `source_type='session'`, `rating='negative'`, `context={failure_reason}` (see the sketch after this list).
- Admin observation — Admins can add a feedback row manually from any session or conversation via the review UI ("Flag this response / message / session").
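To pin down the intake shape, here is a minimal sketch of the session-failure hook, assuming a Supabase service-role client. `createFeedbackForFailedSession` and the `FailedSession` shape are illustrative names, not existing code; only the column names come from the schema above.

```ts
import { createClient } from "@supabase/supabase-js";

// Hypothetical helper: insert a feedback row when a session transitions to
// 'failed'. Column names follow the feedback schema above; the client setup
// and session shape are illustrative.
type FailedSession = {
  id: string;
  tenant_id: string;
  agent_id: string | null;
  created_by: string; // session owner; feedback.created_by is NOT NULL
  failure_reason: string;
};

export async function createFeedbackForFailedSession(session: FailedSession) {
  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY!,
  );

  const { error } = await supabase.from("feedback").insert({
    tenant_id: session.tenant_id,
    source_type: "session", // one of the CHECK-constrained source types
    rating: "negative",
    session_id: session.id,
    agent_id: session.agent_id,
    context: { failure_reason: session.failure_reason },
    status: "pending", // picked up by the feedback review agent
    created_by: session.created_by,
  });

  if (error) throw error;
}
```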
Workspace knowledge authoring
shared_context rows are written by:
- `addCorrection` tool — Updated to `INSERT` a `type='correction'` row. Signature unchanged: `(content: string, context?: string)`. Now also accepts optional `agentId` and `entityTypeId` for scoping (see the sketch after this list).
- `addLesson` tool — Updated to `INSERT` a `type='lesson'` row. Signature unchanged.
- New `addInsight` tool — For agent observations that aren't corrections or lessons but represent durable patterns.
- Admin UI — Direct CRUD for all five types (correction, lesson, routing, insight, guideline) via `/admin/agent-review/knowledge`. Routing and guideline rows are admin-authored today; agents can propose them via `addLesson` and admins can re-type as needed.
- Feedback review agent — Writes rows via the tools above, setting `source_feedback_id` for provenance.
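A minimal sketch of what the updated `addCorrection` internals could look like under the normalized schema. `insertSharedContextRow` and `currentTenantId` are hypothetical stand-ins for the real plumbing in `features/context/server/actions.ts`.

```ts
// Sketch: instead of appending to a JSON array, insert one typed row.
type SharedContextInsert = {
  tenant_id: string;
  type: "correction" | "lesson" | "routing" | "insight" | "guideline";
  content: string;
  context?: string;
  agent_id?: string; // optional narrowing, per the updated signature
  entity_type_id?: string;
};

export async function addCorrection(
  content: string,
  context?: string,
  opts?: { agentId?: string; entityTypeId?: string },
) {
  return insertSharedContextRow({
    tenant_id: currentTenantId(), // hypothetical tenant resolution
    type: "correction",
    content,
    context,
    agent_id: opts?.agentId,
    entity_type_id: opts?.entityTypeId,
  });
}

// Stand-ins for existing server plumbing.
declare function insertSharedContextRow(row: SharedContextInsert): Promise<void>;
declare function currentTenantId(): string;
```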
shared_context rows are read by:
- `loadSharedContextPrompt(admin, tenantId, { agentId?, entityTypeId? })` — Same signature, new internals. Queries active rows, filters by scope (global + agent-specific + entity-type-specific where relevant), ranks by recency and usage, and returns the top N per type formatted as prompt sections (sketched below).
- Admin UI — list, detail, edit, deactivate, delete.
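A sketch of the new read path under the normalized schema, assuming a Supabase client. The ranking and formatting shown are simplified placeholders for the real logic in `features/context/lib`, and `TOP_N_PER_TYPE` is an assumed constant.

```ts
import type { SupabaseClient } from "@supabase/supabase-js";

const TOP_N_PER_TYPE = 8; // assumption: mirrors the old 8-item prompt cap

export async function loadSharedContextPrompt(
  admin: SupabaseClient,
  tenantId: string,
  scope: { agentId?: string; entityTypeId?: string } = {},
): Promise<string> {
  // Tenant-wide rows plus rows scoped to this agent; entity_type_id scoping
  // is analogous and omitted here for brevity.
  const orFilter = scope.agentId
    ? `agent_id.is.null,agent_id.eq.${scope.agentId}`
    : "agent_id.is.null";

  const { data, error } = await admin
    .from("shared_context")
    .select("type, content, context")
    .eq("tenant_id", tenantId)
    .eq("active", true)
    .or(orFilter)
    .order("created_at", { ascending: false });
  if (error) throw error;

  // Group by type, keep top N per type, format as prompt sections.
  const byType = new Map<string, string[]>();
  for (const row of data ?? []) {
    const bucket = byType.get(row.type) ?? [];
    if (bucket.length < TOP_N_PER_TYPE) bucket.push(`- ${row.content}`);
    byType.set(row.type, bucket);
  }
  return [...byType.entries()]
    .map(([type, items]) => `## ${type}\n${items.join("\n")}`)
    .join("\n\n");
}
```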
Feedback Review Agent
A system agent (registered once, available to every tenant via `is_system=true`, per existing agent conventions) that runs on a scheduled heartbeat for each tenant that opts in.
Configuration (default):
- Heartbeat cron: hourly per tenant (skipped when no pending feedback exists)
- Tool groups: `context`, `entities-read`, `agents-admin`, `feedback-review`
- Instructions (base prompt):
You are the Feedback Review Agent. Every hour, review pending feedback in this workspace. For each feedback or group of related feedback:
- Read the source context (chat, entity, session) to understand what happened
- Decide if this represents a durable pattern or a one-off
- If durable: create a correction, lesson, or insight via addCorrection/addLesson/addInsight with source_feedback_id provenance
- If structural: consider updating agent config or entity type schema
- If noise: dismiss with review_notes explaining why
- Mark feedback as applied or dismissed

Prefer to dismiss weak signals rather than pollute shared_context with low-value rules.
Tools (new):
- `listFeedback({ status, sourceType?, agentId?, limit })` — paginate pending feedback
- `getFeedback(id)` — detail view with linked source context
- `reviewFeedback(id, action: 'applied' | 'dismissed', notes: string)` — mark processed (sketched below)
- `addInsight(content, context?)` — new shared_context type for observations
- `updateSharedContext(id, { active?, content?, context? })` — tune existing rules
- `deactivateSharedContext(id, reason)` — soft-delete a stale rule
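As a shape reference, a sketch of `reviewFeedback` with a zod parameter schema. The `Db` helper and tool registration wiring are hypothetical; only the parameters and semantics come from the list above.

```ts
import { z } from "zod";

// Illustrative parameter schema for the reviewFeedback tool.
export const reviewFeedbackParams = z.object({
  id: z.string().uuid(),
  action: z.enum(["applied", "dismissed"]),
  notes: z.string().min(1, "review_notes are required for audit"),
});

// Stand-in persistence interface; real code would use the tenant-scoped client.
type Db = { update(table: string, id: string, patch: object): Promise<void> };

export async function reviewFeedback(
  input: z.infer<typeof reviewFeedbackParams>,
  ctx: { tenantId: string; agentId: string; db: Db },
) {
  // Marks the row processed and records which agent reviewed it.
  await ctx.db.update("feedback", input.id, {
    status: input.action,
    reviewed_by_agent_id: ctx.agentId,
    reviewed_at: new Date().toISOString(),
    review_notes: input.notes,
  });
}
```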
Guardrails:
- The review agent has a permission role that lets it read feedback but not delete it or any other data
- It cannot create new agents or delete entity types
- All actions are logged in `session_events` for audit
- Admins can disable the heartbeat and review feedback manually
Admin "Agent Review" section
Replaces `/admin/conversations` (PR 707) and absorbs `/sessions`. Lives at `/admin/agent-review` under the "AI & Agents" group in `ADMIN_SECTIONS`.
Tabs:
- Conversations — Chat observability. Queries `chats` + `messages` + the `feedback` table. Browse all chats, view transcripts, see thumbs and feedback inline, link to related sessions. Admin can add observation feedback directly from messages. (If PR 707 lands concurrently, its implementation is rebased onto this structure and its `/admin/conversations` route redirects here.)
- Sessions — Execution browser for all session types (agent/response/tool/mixed). Moved from the user-facing `/sessions` page. Filter by status, type, agent, date. Link to session detail transcript.
- Feedback — Unified feed across all sources. Filter by status (pending/reviewed/applied/dismissed), source_type, rating, agent. Click through to source (chat message, entity, session). Bulk actions: mark reviewed, dismiss with shared note.
- Knowledge — Manage `shared_context` rows. Tabs within: corrections, lessons, routing, insights, guidelines. Each row shows content, context, scope (agent/entity_type), active state, provenance (which feedback generated it, which agent wrote it). CRUD: add, edit, deactivate, delete.
- Evals — Golden set management and accuracy reports. List of eval sets (tagged entity groups), per-set accuracy dashboards, trend charts, drill-down into specific failed cases, link to trigger re-runs.
Eval system (session-centric golden sets)
Any session can be a golden case. An agent session (chat, extraction, response, tool, mixed) captures inputs, execution (all tool calls via session_events), and outputs. Promoting a session as "golden" turns it into a reusable regression test: replay the same inputs against a new agent or config, compare the new session's behavior to the golden one, and report divergence.
This generalizes from "entity extraction accuracy" to "does the agent still behave the way we want for this scenario." Response-level evals (compare extracted field values to expected values) become one case; chat behavior (did the agent use the right tools, produce the right summary) and tool-session behavior (did it call the right API with the right params) are all covered by the same mechanism.
Defining a golden session:
- Admin reviews a completed session in `/admin/agent-review/sessions` or `/admin/agent-review/conversations`
- Admin clicks "Promote as golden" on a session whose outputs they validated as correct
- System writes `sessions.metadata.golden = { set: '<set-name>', promoted_at, promoted_by, snapshot }`, where `snapshot` captures the reproducible inputs (agent_id, initial message, entity_id, view_context, tool_grants, etc.) — see the sketch after this list
- Golden sessions are immutable: `sessions.status` is frozen to `completed` and the session cannot be deleted while golden (soft-enforced in server actions + RLS)
- Golden sessions can be organized into named sets (`'pe-extraction'`, `'chat-smoke'`, `'tool-accuracy'`) via `metadata.golden.set`
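A sketch of what `promoteSessionToGolden` might do; the helper functions are named here for illustration only, while the `metadata.golden` shape follows the definition above.

```ts
// metadata.golden shape, per the doc.
type GoldenMetadata = {
  set: string;
  promoted_at: string;
  promoted_by: string;
  snapshot: Record<string, unknown>; // reproducible inputs (agent_id, message, …)
};

export async function promoteSessionToGolden(
  sessionId: string,
  setName: string,
  userId: string,
) {
  const session = await getCompletedSession(sessionId); // must be 'completed'
  const golden: GoldenMetadata = {
    set: setName,
    promoted_at: new Date().toISOString(),
    promoted_by: userId,
    snapshot: await extractInputSnapshot(session), // locks in prompt context
  };
  // Merge into metadata; delete/status-change guards key off metadata.golden.
  await updateSessionMetadata(sessionId, { golden });
}

// Hypothetical helpers — stand-ins for existing session/server code.
declare function getCompletedSession(id: string): Promise<unknown>;
declare function extractInputSnapshot(
  session: unknown,
): Promise<Record<string, unknown>>;
declare function updateSessionMetadata(id: string, patch: object): Promise<void>;
```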
Running an eval:
- Eval run triggered from admin UI, scheduled, or on a post-deploy hook
- For each golden session in the target set (see the replay sketch after this list):
  - Read the input snapshot from `sessions.metadata.golden.snapshot`
  - Create a new session with those inputs (possibly using a different `agent_id` to test a candidate config)
  - Execute the session to completion via the session executor
  - Compare the new session's events + outputs to the golden session's events + outputs
- Write comparison results to `sessions.metadata.eval_result` on the replay session, linking back to the golden via `metadata.eval_source = golden_session_id`
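A sketch of the replay loop under the no-new-tables design, where every replay is itself a session carrying `metadata.eval_source` and `metadata.eval_result`. Every helper shown is a hypothetical stand-in for the session executor and the `features/evals/server/replay.ts` described later.

```ts
type Golden = {
  id: string;
  metadata: {
    golden: { snapshot: { agent_id: string } & Record<string, unknown> };
  };
};

export async function runEval(setName: string, candidateAgentId?: string) {
  const goldens: Golden[] = await listGoldenSessions(setName);
  for (const golden of goldens) {
    const snapshot = golden.metadata.golden.snapshot;
    const replay = await createSessionFromSnapshot(snapshot, {
      agentId: candidateAgentId ?? snapshot.agent_id, // candidate config under test
      metadata: { eval_source: golden.id },           // link back to the golden
    });
    await executeSessionToCompletion(replay.id);      // existing session executor
    const result = await compareSessions(golden.id, replay.id); // compare.ts dispatch
    await updateSessionMetadata(replay.id, { eval_result: result });
  }
}

// Hypothetical stand-ins.
declare function listGoldenSessions(set: string): Promise<Golden[]>;
declare function createSessionFromSnapshot(
  snapshot: Record<string, unknown>,
  opts: { agentId: string; metadata: Record<string, unknown> },
): Promise<{ id: string }>;
declare function executeSessionToCompletion(id: string): Promise<void>;
declare function compareSessions(goldenId: string, replayId: string): Promise<unknown>;
declare function updateSessionMetadata(
  id: string,
  patch: Record<string, unknown>,
): Promise<void>;
```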
Comparison logic (`features/evals/lib/compare.ts`), dispatched by `session_type`:

```ts
type ComparisonResult = {
session_type: SessionType
dimensions: Record<string, DimensionResult> // field-by-field comparison
overall_accuracy: number // 0..1
divergences: Divergence[] // ordered list of mismatches for drill-down
}
// agent | chat sessions: compare tool-call sequence + final message
// - tool_calls: set overlap (order-insensitive) with expected tools
// - tool_args: per-tool, compare input args via jsonDiff
// - final_message: semantic similarity score (optional) + keyword presence
// - entities_created: compare what entities the session created
// - entities_updated: compare which fields were mutated
// response sessions: compare values per criteria dimension
// - numeric: tolerance-based match
// - text: exact or semantic similarity
// - select: exact match
// - relation-rank: order-aware set comparison
// tool sessions: compare tool output and any entity mutations
// - output: jsonDiff against golden output
// - entity mutations: same as agent
// extraction tasks (session_type='agent' with output_type='field'|'fields'):
// - submitResponse values: dimension-by-dimension comparison
// - field metadata: source attribution, confidence
```

Scoring and reporting:
- Per dimension: pass/fail + similarity score
- Per session: overall accuracy % (weighted by dimension importance — see the helper sketch after this list)
- Per agent: accuracy across all golden sessions in a set
- Per set: accuracy trend over time (chart)
- Admin UI shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
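For the per-session roll-up, a minimal helper showing one way to compute the importance-weighted accuracy. The `weight` field on `DimensionResult` is an assumption, not a confirmed part of the type.

```ts
// Importance-weighted mean of dimension scores, yielding 0..1 to match
// ComparisonResult.overall_accuracy.
type DimensionResult = { passed: boolean; score: number; weight?: number };

export function overallAccuracy(
  dimensions: Record<string, DimensionResult>,
): number {
  let total = 0;
  let weights = 0;
  for (const dim of Object.values(dimensions)) {
    const w = dim.weight ?? 1; // unweighted dimensions count once
    total += dim.score * w;
    weights += w;
  }
  return weights === 0 ? 0 : total / weights;
}
```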
Self-improvement integration:
- When a replay fails on a golden session, the divergence becomes a feedback row (`source_type='session'`, context includes `golden_session_id` and `replay_session_id`)
- Feedback review agent picks it up, analyzes the divergence, and generates a correction or adjusts agent config
- Accuracy-trend dashboard surfaces regressions within the admin review section
Persona-driven golden sets:
- `content/docs/personas/*.mdx` defines user personas and their jobs-to-be-done
- Each persona should have at least one golden session capturing a canonical interaction (e.g., pe-analyst asking for deal comparisons, sales-ops adding contacts)
- Persona-based golden sets give us user-journey regression coverage, not just field-level accuracy
No new tables in phase 5. `sessions.metadata.golden` and `sessions.metadata.eval_result` carry all the data. If we later need cross-run aggregation dashboards (accuracy over the last 10 runs), we can add an `eval_runs` table in phase 5b:
```sql
CREATE TABLE eval_runs (
id uuid PRIMARY KEY,
tenant_id uuid NOT NULL,
name text NOT NULL, -- 'golden-pe-extraction', 'chat-smoke-2026-04-15'
set_name text, -- the golden set being evaluated
agent_id uuid, -- candidate agent being tested
triggered_by text, -- 'manual' | 'scheduled' | 'post-deploy'
started_at timestamptz NOT NULL DEFAULT now(),
completed_at timestamptz,
golden_session_count integer,
passed_count integer,
failed_count integer,
results jsonb NOT NULL DEFAULT '{}'::jsonb,
-- { bySession: { goldenSessionId: { passed, overall_accuracy, divergences } }, ... }
created_by uuid
);
```

Phase 5 ships without `eval_runs`; per-session metadata is enough to list recent replays and drill into divergences. Add the table when the UI needs cross-run charts.
Migration path
Phase 1: schema
- Create `feedback` table
- Backfill `feedback` from `chat_feedback` (`source_type='chat'`, rating mapped from 'up'/'down')
- Create the new `shared_context` table structure (temporary name `shared_context_v2`, or drop-and-recreate in the same migration)
- Backfill from the old `shared_context` JSON arrays — expand each array item into a row with the appropriate `type` and parsed `content`/`context`
- Drop the old `shared_context`, rename `shared_context_v2` → `shared_context`
- Mark `chat_feedback` as deprecated; drop in the next release
Phase 2: code
- Rewrite `features/context/server/actions.ts` and `features/context/lib/load.ts` to read/write normalized rows. Preserve public signatures.
- Update `addCorrection`/`addLesson` tools to write rows (signatures unchanged, internals changed)
- Update `features/chat/feedback/server/actions.ts` to write to the `feedback` table
- Update chat feedback API routes to query the `feedback` table
- Wire response rejection flow to insert feedback row
- Wire extraction rejection flow to insert feedback row alongside the retry
- Wire session failure to insert feedback row
Phase 3: UI
- Build `features/feedback/` module (types, hooks, components)
- Build `features/evals/` module (types, compare logic, reporting components)
- Build the new `/admin/agent-review` section with 5 tabs
- Add `agent-review` entry to `ADMIN_SECTIONS` under the "AI & Agents" group
- Delete `app/(app)/sessions/page.tsx`, add redirect to `/tasks`
- Keep `/sessions/[id]` detail page — linked from admin and entity panels
- If PR 707 lands concurrently, add a redirect from `/admin/conversations` → `/admin/agent-review/conversations`
Phase 4: review agent
- Create system agent record (feedback-review-agent)
- Create task template for "Feedback Review"
- Configure heartbeat
- Write system prompt and guardrails
- Ship disabled by default — admin opts in per workspace
Phase 5: evals
- Add "Promote as golden" action to session detail + conversation detail (writes
metadata.golden) - Add protection in session actions: block delete and block status changes when
metadata.goldenis set - Build input-snapshot extractor (
features/evals/server/snapshot.ts) that captures reproducible inputs from session + events - Build replay runner (
features/evals/server/replay.ts) that creates a new session from a snapshot, invokes the session executor, and setsmetadata.eval_source - Build comparison logic per session_type (
compare-agent.ts,compare-response.ts,compare-tool.ts) - Build admin eval UI: golden session manager, replay trigger, pass/fail matrix, divergence drill-down
- Seed persona-driven golden sessions for
pe-analyst,sales-ops,first-time-adminas a starter set - Wire replay failures into
feedbacktable assource_type='session'entries so the review agent picks up regressions
Each phase is independently shippable. Phases 1-2 are the foundation; phases 3-5 can land incrementally.
Trade-offs
Why a new feedback table instead of generalizing session_events?
Session events are append-only audit telemetry — "agent called tool X at time Y." Feedback is evaluation — "the output was bad because Z." They have different lifecycles (events are immutable; feedback has a status machine), different query patterns (events are per-session; feedback is cross-session aggregate), and different readers (events for replay; feedback for learning). Mixing them would blur the intent and bloat the events table.
Why a new shared_context structure instead of an entity type?
Considered making shared_context rows entities of type agent-knowledge or learning. Pros: zero new tables, standard entity CRUD, tags/relations free. Cons: entity overhead on the hot path (every agent prompt loads shared context), entity semantics don't fit ephemeral lifecycle knowledge well, and the admin UI for entities is geared toward business records, not system metadata. The dedicated table is ~50 lines of migration and wins on performance, semantics, and purpose-fit.
Why drop chat_feedback instead of keeping both?
Keeping chat_feedback as a chat-specific table + a generic feedback table means chat thumbs would write to two places, or chat-specific feedback would split from everything else. Consolidating to one table means one query path, one review pipeline, one admin view. Migration is straightforward (tens of rows in practice).
Why keep shared_context as the name?
Avoids rename churn across `loadSharedContextPrompt()`, `CONTEXT_KEYS` call sites, tests, and docs. The semantic meaning ("context shared across all agent runs for this tenant") still fits the new structure, and it matches user preference.
Why not merge feedback and shared_context into one table?
Feedback is input (evaluation signal); shared_context is output (distilled knowledge). A single correction in shared_context can be produced by reviewing 5-10 feedback signals. The write patterns, lifecycles, and read patterns are different enough that one table would force either a status='feedback' vs status='rule' split (back to a type column hack) or an awkward denormalization. Two tables is cleaner.
Why defer eval_runs table?
Starting with per-session metadata (sessions.metadata.eval_result, linked to golden via metadata.eval_source) is enough for the first phase. Replays are themselves sessions — listing them, filtering by status, drilling in, all work via the existing session queries. If we need cross-run aggregation dashboards or eval history charts, add the table then. YAGNI.
Why session-centric evals instead of entity+response?
Considered the narrower approach: tag entities as eval set, promote responses as golden, compare new responses. This works for extraction accuracy but misses chat behavior, tool-call correctness, and multi-step agent journeys. Session-centric evals let us regress-test any agent output — including the persona journeys defined in content/docs/personas/ — with one mechanism. The entity+response case is a specialization (response sessions compared dimension-by-dimension). One primitive covers every case.
Why not snapshot session inputs into a dedicated golden_cases table?
Considered extracting golden input snapshots into their own table so original sessions could be freely modified or deleted. Decided against: the session record already contains the inputs (first events in session_events), and snapshotting to metadata.golden.snapshot at promotion time locks in the reproducible view. This avoids a parallel data model and lets admins browse golden sessions through the same session list UI they already use. Immutability is enforced at the action layer rather than at a schema level.
Acceptance Criteria
Feedback intake
- `feedback` table exists with the schema above, RLS policies, and indexes
- Chat thumbs write to `feedback` with `source_type='chat'` and correct context
- Response rejection writes to `feedback` with `source_type='response'`
- Extraction rejection writes to `feedback` alongside the retry
- Session failure writes to `feedback` automatically
- `chat_feedback` data fully backfilled into `feedback`
- `chat_feedback` table dropped after one release
Workspace knowledge
- `shared_context` normalized table exists with the schema above, RLS, indexes
- All existing JSON-array rows backfilled into individual rows
- `addCorrection`/`addLesson` tools write rows (same signatures)
- New `addInsight` tool exists
- `loadSharedContextPrompt()` signature unchanged; reads from the new structure
- Scoping works: `agent_id`-specific rules only load for that agent; `entity_type_id`-specific rules only load for that type
- `active = false` rows excluded from prompt injection
- Provenance links preserved (`source_feedback_id`, `created_by_agent_id`, `created_by_user_id`)
Feedback review agent
- System agent exists with correct tool groups and permissions
- Heartbeat task runs on schedule
- Review agent can list, read, and process pending feedback
- Review agent creates `shared_context` rows with provenance
- Review agent can update agent configs when appropriate
- Review agent marks feedback as applied/dismissed with notes
- All review actions logged in `session_events`
- Admin can disable review agent per workspace
Admin UI
- `/admin/agent-review` exists under "AI & Agents"
- `agent-review` in `ADMIN_SECTIONS`
- Conversations tab works (from PR 707, updated data source)
- Sessions tab works (moved from user-facing)
- Feedback tab shows unified feed with filters
- Knowledge tab manages shared_context rows (CRUD)
- Evals tab shows golden set reports
- `/sessions` redirects to `/tasks`
- Session detail page `/sessions/[id]` remains accessible from admin and entity panels
- `/admin/conversations` redirects to `/admin/agent-review/conversations`
Eval system
- Any completed session can be promoted as golden via `/admin/agent-review/sessions` or `/admin/agent-review/conversations`
- Golden sessions capture a reproducible input snapshot in `sessions.metadata.golden.snapshot`
- Golden sessions are immutable (protected from deletion + status changes while flagged)
- Admin can organize golden sessions into named sets (`pe-extraction`, `chat-smoke`, etc.)
- Eval run replays each golden session's inputs against the target agent and produces a replay session
- Comparison logic dispatches by `session_type`:
  - `agent`/`chat`: tool-call set overlap, args jsonDiff, entities created/updated, final message similarity
  - `response`: per-dimension value comparison (numeric tolerance, exact, semantic)
  - `tool`: output jsonDiff + entity mutations
- Eval report shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
- Accuracy trend tracked per golden set over time
- Persona-driven golden sets exist for at least `pe-analyst`, `sales-ops`, `first-time-admin`
- Failed replays automatically generate a `feedback` row linking the golden and replay sessions, picked up by the review agent
Quality gates
- All new migrations are reversible
- `pnpm test` passes with updated and new tests
- `pnpm typecheck`, `pnpm lint`, `pnpm build` pass
- `documents/DATABASE.md` reflects the new schema
- `/docs/features/agent-system` updated with feedback + evals sections
- `content/docs/data-model.mdx` updated with new tables
- Feature docs for `/admin/agent-review` added to `content/docs/features/`
Files
Database
- `supabase/migrations/20260415010000_feedback_table.sql` — create `feedback` + backfill from `chat_feedback`
- `supabase/migrations/20260415010100_shared_context_normalized.sql` — restructure `shared_context` + backfill from JSON arrays
- `supabase/migrations/20260415010200_drop_chat_feedback.sql` — drop deprecated `chat_feedback` (next release after phase 1 ships)
Feedback module (new)
- `features/feedback/types.ts` — `FeedbackRecord`, `FeedbackSourceType`, `FeedbackStatus`
- `features/feedback/server/actions.ts` — `createFeedback`, `listFeedback`, `reviewFeedback`, `getFeedback`
- `features/feedback/server/queries.ts` — tenant-scoped query builders with joins
- `features/feedback/hooks/use-feedback-list.ts` — React Query hook
- `features/feedback/components/feedback-list.tsx` — list component with filters
- `features/feedback/components/feedback-detail.tsx` — detail view with source context
- `app/api/feedback/route.ts` — POST (create), GET (list with filters)
- `app/api/feedback/[id]/route.ts` — GET (detail), PATCH (review)
Context module (refactored)
- `features/context/types.ts` — rewritten: `SharedContextRecord`, `SharedContextType` enum
- `features/context/server/actions.ts` — `createSharedContextRule`, `updateSharedContextRule`, `deactivateSharedContextRule`, `listSharedContext`
- `features/context/lib/load.ts` — `loadSharedContextPrompt()` signature unchanged; reads normalized rows
- `features/context/lib/format.ts` — format functions unchanged externally
- `features/context/components/shared-context-manager.tsx` — admin CRUD component
- `app/api/admin/shared-context/route.ts` — GET (list), POST (create)
- `app/api/admin/shared-context/[id]/route.ts` — PATCH (update), DELETE (hard delete if needed)
Tools (updated)
- `features/tools/context-tools.ts` — `addCorrection`, `addLesson` updated to write rows; new `addInsight`, `updateSharedContext`, `deactivateSharedContext`
- `features/tools/feedback-tools.ts` (new) — `listFeedback`, `getFeedback`, `reviewFeedback`, `addFeedback`
Feedback intake wiring
- `features/chat/components/message-actions-bar.tsx` — write to `/api/feedback` instead of `/api/chats/[id]/feedback`
- `features/chat/feedback/server/actions.ts` — rewrite to use the `feedback` table
- `features/responses/server/actions.ts` — on rejection, insert feedback row
- `features/inngest/functions/feedback-rerun.ts` — insert feedback row alongside retry
- `features/sessions/server/event-log.ts` — on session failure transition, insert feedback row
Feedback review agent
- `features/agents/system-agents.ts` — register `feedback-review-agent`
- `features/tasks/system-tasks.ts` — register "Feedback Review" task template
- Migration to seed review agent + task
Evals module (new)
- `features/evals/types.ts` — `GoldenSessionSnapshot`, `EvalComparison`, `Divergence`, `DimensionResult`
- `features/evals/server/actions.ts` — `promoteSessionToGolden`, `revokeGoldenStatus`, `replayGoldenSession`, `listGoldenSessions`, `listReplaysForGolden`
- `features/evals/server/snapshot.ts` — extract reproducible input snapshot from a session + events
- `features/evals/server/replay.ts` — construct new session from snapshot, invoke session executor
- `features/evals/lib/compare.ts` — top-level comparator dispatching by `session_type`
- `features/evals/lib/compare-agent.ts` — tool-call sequence, args, entity mutations, final message
- `features/evals/lib/compare-response.ts` — per-dimension value comparison (reuses existing scoring primitives)
- `features/evals/lib/compare-tool.ts` — tool output + mutation comparison
- `features/evals/lib/similarity.ts` — helpers (numeric tolerance, keyword presence, optional semantic similarity)
- `features/evals/components/eval-report.tsx` — pass/fail matrix + trend
- `features/evals/components/golden-session-manager.tsx` — list, promote, organize into sets
- `features/evals/components/replay-diff.tsx` — side-by-side divergence drill-down
- `features/sessions/server/actions.ts` — add `is_golden` helper (reads `metadata.golden`) + guard against delete/status-change when golden
- `app/api/admin/evals/golden/route.ts` — GET (list golden sessions), POST (promote session)
- `app/api/admin/evals/golden/[id]/route.ts` — DELETE (revoke), PATCH (rename set)
- `app/api/admin/evals/runs/route.ts` — POST (trigger replay for a set or single session)
- `app/api/admin/evals/runs/[id]/route.ts` — GET (replay results + diff)
Admin Agent Review section
- `app/(app)/admin/agent-review/page.tsx` — landing page / tab container
- `app/(app)/admin/agent-review/conversations/page.tsx` — absorbs PR 707
- `app/(app)/admin/agent-review/conversations/[id]/page.tsx` — absorbs PR 707 detail
- `app/(app)/admin/agent-review/sessions/page.tsx` — moved from user-facing
- `app/(app)/admin/agent-review/feedback/page.tsx` — new
- `app/(app)/admin/agent-review/knowledge/page.tsx` — new
- `app/(app)/admin/agent-review/evals/page.tsx` — new
- `features/admin/lib/sections.ts` — add `agent-review` entry
Removed / redirected
- `app/(app)/sessions/page.tsx` — delete, redirect to `/tasks`
- `app/(app)/admin/conversations/**` — redirect to `/admin/agent-review/conversations/**`
- `chat_feedback` table — drop after backfill + one release
Docs
- `content/docs/features/agent-system.mdx` — add feedback + evals + review sections
- `content/docs/data-model.mdx` — document `feedback` and the new `shared_context`
- `content/docs/features/agent-review.mdx` (new) — admin section guide
- `documents/DATABASE.md` — update table inventory
- `documents/CHANGELOG.md` — entry when phase 1 ships
Open Questions
- Should `feedback` rows be user-visible? Users can see their own chat feedback today. Decision: yes for a user's own feedback (`created_by = user`), no for aggregate or other users' feedback (admin-only). RLS handles this.
- How aggressive should the review agent be? Over-eager creation of rules pollutes shared_context; under-eager leaves signal on the table. Start conservative — the review agent creates rules only when 2+ feedback signals support the same pattern, or an admin manually flags a pattern.
- Per-agent shared_context scoping — opt-in or automatic? When an agent writes a lesson from its own session, should it default to `agent_id = self` (narrow scope) or `agent_id = null` (workspace-wide)? Default to workspace-wide with explicit agent-specific opt-in, since most learnings should generalize.
- Eval run scheduling — manual or automatic? Start manual (admin triggers from UI). Phase 2 could add post-deploy automatic runs and scheduled regression checks.
- Backfill of extraction rejections into `feedback`? The `feedback-rerun.ts` path is alive today but doesn't record feedback. Backfill historical extraction rejections from `entity_responses` where `status='rejected'` with rejection reasons? Decision: yes, one-time backfill as part of the phase 1 migration.
- How reproducible are chat session inputs? Chat sessions depend on user_info, workspace context, skills, and shared_context at the time of the original run — all of which change over time. Replay fidelity will drift. Decision: snapshot the full prompt context (entity types, shared_context rules, memories, skills) into `metadata.golden.snapshot` at promotion time. Replays use the snapshot, not live context. Admins can opt into "live context" replays if they want to test whether new shared_context rules actually improve behavior on past scenarios.
- Semantic similarity for text comparisons — embedding model? Comparing chat final messages or text field values benefits from embedding-based similarity. Decision: phase 5 ships with keyword-overlap + exact-match (a sketch follows); phase 5b adds embedding similarity if the signal-to-noise ratio warrants the cost/complexity.
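A sketch of the phase-5 text comparison under that decision — exact match first, then keyword overlap via a Jaccard index. This is a stand-in for what `features/evals/lib/similarity.ts` might contain, not a confirmed implementation.

```ts
// Exact match, else Jaccard overlap over lowercased tokens (short tokens
// dropped as noise). Returns 0..1, matching the dimension score range.
export function textSimilarity(expected: string, actual: string): number {
  if (expected.trim() === actual.trim()) return 1;
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter((t) => t.length > 2));
  const a = tokens(expected);
  const b = tokens(actual);
  if (a.size === 0 && b.size === 0) return 1;
  let overlap = 0;
  for (const t of a) if (b.has(t)) overlap++;
  return overlap / (a.size + b.size - overlap); // |A∩B| / |A∪B|
}
```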