Unified Agent Intelligence

Consolidate feedback intake, self-improvement, admin review, and eval systems into a thin harness around two primitives — feedback inputs and shared_context outputs — that turn every agent interaction into durable workspace intelligence.

Problem

Amble has six parallel channels for capturing agent feedback, and only one of them closes the loop into durable intelligence. The rest are either write-only dead ends or one-shot fixes that never generalize into workspace knowledge.

What exists today:

| Channel | Writes to | Reads from | End-to-end? |
| --- | --- | --- | --- |
| Corrections / Lessons | shared_context JSON blobs via addCorrection/addLesson tools | Every agent prompt via loadSharedContextPrompt() | YES |
| User memories | user_memories via saveMemory tool | User-scoped prompt injection | YES (user-scoped only) |
| Chat feedback | chat_feedback via thumbs-up/down buttons | Nothing | NO |
| Response scoring | entity_responses via form submission | Aggregation views, field promotion | Partial (no learning loop) |
| Extraction rejection | Inngest event → feedback-rerun.ts | One-shot retry with feedback instruction | Limited (no durable lesson) |
| Session events | session_events via append | Audit trail only | NO |

The core problems:

  1. Fragmented intake. Feedback on agent output lives in four places (chat_feedback, entity_responses.status, Inngest rejection events, implicit in session_events). No unified read path means no systematic review.
  2. Brittle storage. Corrections and lessons are stored as JSON arrays inside a key/value shared_context table. Individual items cannot be edited, deactivated, or audited without rewriting the whole array. The 8-item prompt cap is treated as a data cap — any 9th lesson falls off forever.
  3. No review pipeline. Nothing processes chat_feedback, response rejections, or session failures into durable learnings. The one system that does learn durably (corrections/lessons) requires agents to proactively call tools — there is no review loop.
  4. No eval system. Extraction accuracy, agent output quality, and regression detection have no golden-set mechanism. "Did this change improve or regress agent behavior on canonical scenarios?" cannot be answered. Today, scoring only covers entity-response field values — chat behavior, tool-call sequences, and full-session outcomes have no reference set.
  5. Surface sprawl. /sessions is a user-facing page that exposes raw execution machinery customers don't need. PR 707 proposed /admin/conversations as another standalone surface (not yet merged to dev). Admin has no single place to review agent work, manage workspace knowledge, or evaluate quality.

This is vibecoded sprawl. It ships features that look like feedback loops but don't close. The fix is to collapse the surface to a thin harness — two tables, one review task, one admin section — and let the durable knowledge (workspace rules, learned patterns, eval results) become the asset that grows over time.

Goals

  1. Unify feedback intake. All agent-output evaluations (chat thumbs, response rejections, extraction rejections, session failures, admin observations) flow into a single feedback table with consistent shape.
  2. Normalize workspace knowledge. Corrections, lessons, routing, insights, and guidelines become individually-addressable rows in shared_context, not JSON array items. Each carries provenance, scope, active state, and lifecycle metadata.
  3. Close the self-improvement loop. A scheduled feedback review agent reads pending feedback, identifies patterns, and generates durable shared_context entries. Human admins can review, edit, approve, or dismiss.
  4. Add session-centric golden-set evals. Any completed session (chat, extraction, response, tool, mixed) can be flagged as golden. Replay its inputs against a candidate agent or config, compare the new session's tool calls, outputs, and mutations to the golden, and report divergence. Persona-driven canonical sessions become regression coverage for user journeys.
  5. Consolidate admin surfaces. One /admin/agent-review section provides the conversation browser, sessions browser, feedback review, shared_context management, and eval reporting. PR 707's conversation-browser work, if still pending, gets rebased onto this structure rather than shipped standalone.
  6. Remove user-facing surfaces that customers don't need. /sessions redirects to /tasks. Session detail remains for deep drill-down from admin and entity panels.

Non-goals

  • Not rebuilding the agent prompt construction path — loadSharedContextPrompt() keeps its signature; internals change.
  • Not changing user_memories — user-scoped memory stays as-is, separate from workspace knowledge.
  • Not changing entity_responses or criteria_sets — the response/scoring system is sound, just gets wired into feedback generation.
  • Not changing session_events — it stays append-only telemetry; feedback is separate.
  • Not introducing a new primitive for evals — golden sets are sessions flagged as metadata.golden, replayed through the existing session executor, and compared via a new features/evals/lib/compare.ts dispatched by session_type.

Design

Two tables, purpose-built

feedback (new) — inputs to the self-improvement system.

Replaces chat_feedback and absorbs all signals that evaluate agent output. Lightweight, append-mostly, has a processing lifecycle.

CREATE TABLE feedback (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,

  -- Source of the feedback
  source_type text NOT NULL CHECK (source_type IN (
    'chat', 'response', 'extraction', 'tool', 'session', 'observation'
  )),
  rating text NOT NULL CHECK (rating IN ('positive', 'negative', 'neutral')),
  comment text,

  -- Links to the agent output being evaluated
  session_id uuid REFERENCES sessions(id) ON DELETE SET NULL,
  entity_id uuid REFERENCES entities(id) ON DELETE SET NULL,
  agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  context jsonb NOT NULL DEFAULT '{}'::jsonb,
  -- context carries source-specific metadata:
  --   chat: { chat_id, message_index }
  --   response: { response_id, criteria_set_id, field_name }
  --   extraction: { field_name, entity_type_id, rejected_value }
  --   tool: { tool_slug, tool_run_id }
  --   session: { failure_reason, event_id }

  -- Processing lifecycle
  status text NOT NULL DEFAULT 'pending' CHECK (status IN (
    'pending', 'reviewed', 'applied', 'dismissed'
  )),
  reviewed_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  reviewed_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,
  reviewed_at timestamptz,
  review_notes text,

  created_by uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX idx_feedback_tenant_status ON feedback(tenant_id, status, created_at DESC);
CREATE INDEX idx_feedback_session ON feedback(session_id) WHERE session_id IS NOT NULL;
CREATE INDEX idx_feedback_entity ON feedback(entity_id) WHERE entity_id IS NOT NULL;
CREATE INDEX idx_feedback_agent ON feedback(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_feedback_pending ON feedback(tenant_id, created_at DESC) WHERE status = 'pending';
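
The context column stays schemaless at the database level, but the application layer can still constrain it. A minimal sketch of how features/feedback/types.ts could pair each source_type with its context shape; the union below mirrors the comments in the table definition above and is otherwise an assumption, not the final type:

// Illustrative discriminated union for feedback source + context.
// Exact field types are assumptions until the real types land in
// features/feedback/types.ts.
export type FeedbackSourceContext =
  | { source_type: 'chat'; context: { chat_id: string; message_index: number } }
  | { source_type: 'response'; context: { response_id: string; criteria_set_id: string } }
  | { source_type: 'extraction'; context: { field_name: string; entity_type_id: string; rejected_value: unknown } }
  | { source_type: 'tool'; context: { tool_slug: string; tool_run_id: string } }
  | { source_type: 'session'; context: { failure_reason: string; event_id?: string } }
  | { source_type: 'observation'; context: Record<string, unknown> }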

shared_context (restructured) — outputs of the self-improvement system.

Keeps its current name. Drops the key/value JSON-blob shape. Each correction, lesson, insight, or guideline becomes its own row with individual lifecycle and provenance.

-- New structure (requires drop + recreate, see migration section)
CREATE TABLE shared_context (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id uuid NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,

  -- What kind of knowledge
  type text NOT NULL CHECK (type IN (
    'correction', 'lesson', 'routing', 'insight', 'guideline'
  )),
  content text NOT NULL,
  context text, -- when this applies (e.g., "when extracting valuation for PE deals")

  -- Scoping (NULL = tenant-wide; set for narrower scope)
  agent_id uuid REFERENCES agents(id) ON DELETE CASCADE,
  entity_type_id uuid REFERENCES entity_types(id) ON DELETE CASCADE,

  -- Lifecycle
  active boolean NOT NULL DEFAULT true,

  -- Provenance
  source_feedback_id uuid REFERENCES feedback(id) ON DELETE SET NULL,
  created_by_agent_id uuid REFERENCES agents(id) ON DELETE SET NULL,
  created_by_user_id uuid REFERENCES auth.users(id) ON DELETE SET NULL,

  metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX idx_shared_context_tenant_type_active ON shared_context(tenant_id, type, active, created_at DESC);
CREATE INDEX idx_shared_context_agent ON shared_context(agent_id) WHERE agent_id IS NOT NULL;
CREATE INDEX idx_shared_context_entity_type ON shared_context(entity_type_id) WHERE entity_type_id IS NOT NULL;
CREATE INDEX idx_shared_context_source_feedback ON shared_context(source_feedback_id) WHERE source_feedback_id IS NOT NULL;

Why keep the name shared_context:

  • Zero rename churn in code, tests, and docs that reference it.
  • Semantically accurate: context shared across all agent runs for this tenant.
  • loadSharedContextPrompt() keeps its signature — only internals change.
  • CONTEXT_KEYS is removed; types are now enum values on rows, not JSON keys.

The feedback-to-knowledge pipeline

  1. Agent produces output.
  2. Human/system evaluates it (thumbs, reject, score, flag).
  3. A feedback row is inserted with status='pending'.
  4. The Feedback Review Agent (scheduled task, hourly) reads pending feedback and groups it by pattern.
  5. It decides an action per feedback:
     • Create shared_context row (correction/lesson/insight)
     • Update agent config (prompt tweaks, tool groups)
     • Update entity/entity_type (structural fix)
     • Dismiss (noise, already addressed, user error)
  6. It marks the feedback as applied/dismissed with review_notes.
  7. loadSharedContextPrompt() reads active rows, formats by type, and injects the top N per type into every future agent prompt.

Feedback intake points

Existing channels route into the feedback table:

  • Chat thumbs (existing UI) — features/chat/components/message-actions-bar.tsx writes to feedback with source_type='chat', context={chat_id, message_index}. chat_feedback table backfilled and dropped.
  • Response rejection — When a response status transitions to rejected with a rejection_reason, submitResponseAdmin() writes a feedback row with source_type='response', context={response_id, criteria_set_id}. The existing extraction/result-rejected Inngest event continues to fire.
  • Extraction rejection — feedback-rerun.ts continues to run the one-shot retry, AND writes a feedback row with source_type='extraction', rating='negative', status='applied' (the retry itself is the application). The review agent can still pick up patterns across rejections.
  • Tool results — Agents and admins can call a new addFeedback tool to record observations about tool outputs.
  • Session failures — Sessions that transition to failed status automatically generate a feedback row with source_type='session', rating='negative', context={failure_reason}.
  • Admin observation — Admins can add a feedback row manually from any session or conversation via the review UI ("Flag this response / message / session").
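
As a concrete example of one intake point, a sketch of the session-failure hook, assuming a supabase-js admin client and a session record in scope; recordSessionFailureFeedback and the session fields used here are illustrative names, not existing code:

import type { SupabaseClient } from '@supabase/supabase-js'

// Hypothetical helper invoked from the failed-status transition.
export async function recordSessionFailureFeedback(
  admin: SupabaseClient,
  session: { id: string; tenant_id: string; agent_id: string | null; created_by: string },
  failureReason: string,
) {
  const { error } = await admin.from('feedback').insert({
    tenant_id: session.tenant_id,
    source_type: 'session',
    rating: 'negative',
    session_id: session.id,
    agent_id: session.agent_id,
    context: { failure_reason: failureReason },
    created_by: session.created_by,
    // status defaults to 'pending', so the review agent sees it on its next run
  })
  if (error) throw error
}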

Workspace knowledge authoring

shared_context rows are written by:

  • addCorrection tool — Updated to INSERT a type='correction' row. Signature unchanged: (content: string, context?: string). Now also accepts optional agentId and entityTypeId for scoping.
  • addLesson tool — Updated to INSERT a type='lesson' row. Signature unchanged.
  • New addInsight tool — For agent observations that aren't corrections or lessons but represent durable patterns.
  • Admin UI — Direct CRUD for all five types (correction, lesson, routing, insight, guideline) via /admin/agent-review/knowledge. Routing and guideline rows are admin-authored today; agents can propose them via addLesson and admins can re-type as needed.
  • Feedback review agent — Writes rows via the tools above, setting source_feedback_id for provenance.
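
To make the authoring change concrete, a sketch of what the reworked addCorrection handler might look like: the tool-facing signature stays (content, context?) plus the new optional scoping arguments, while the plumbing parameters (admin client, tenant, calling agent) are assumptions about how tools are wired today:

import type { SupabaseClient } from '@supabase/supabase-js'

type AddCorrectionInput = {
  content: string
  context?: string
  agentId?: string       // optional narrow scope
  entityTypeId?: string  // optional narrow scope
}

// Inserts one normalized row instead of appending to a JSON array.
export async function addCorrection(
  admin: SupabaseClient,
  tenantId: string,
  callerAgentId: string,
  input: AddCorrectionInput,
) {
  const { data, error } = await admin
    .from('shared_context')
    .insert({
      tenant_id: tenantId,
      type: 'correction',
      content: input.content,
      context: input.context ?? null,
      agent_id: input.agentId ?? null,           // NULL = tenant-wide
      entity_type_id: input.entityTypeId ?? null,
      created_by_agent_id: callerAgentId,
    })
    .select('id')
    .single()
  if (error) throw error
  return data.id
}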

shared_context rows are read by:

  • loadSharedContextPrompt(admin, tenantId, { agentId?, entityTypeId? }) — Same signature, new internals. Queries active rows, filters by scope (global + agent-specific + entity-type-specific where relevant), ranks by recency and usage, returns top N per type formatted as prompt sections.
  • Admin UI — list, detail, edit, deactivate, delete.
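
A sketch of what the new internals of loadSharedContextPrompt() could look like under this structure. The scope filter, per-type cap, and formatting are illustrative; the real implementation lives in features/context/lib/load.ts and would also rank by usage and handle entity-type scoping the same way:

import type { SupabaseClient } from '@supabase/supabase-js'

const PROMPT_CAP_PER_TYPE = 8 // prompt cap only; rows beyond it stay in the table

export async function loadSharedContextPrompt(
  admin: SupabaseClient,
  tenantId: string,
  opts: { agentId?: string; entityTypeId?: string } = {},
): Promise<string> {
  // Global rows plus rows scoped to the requesting agent; entity_type_id
  // scoping would add a parallel clause.
  const scopeFilters = ['agent_id.is.null']
  if (opts.agentId) scopeFilters.push(`agent_id.eq.${opts.agentId}`)

  const { data, error } = await admin
    .from('shared_context')
    .select('type, content, context')
    .eq('tenant_id', tenantId)
    .eq('active', true)
    .or(scopeFilters.join(','))
    .order('created_at', { ascending: false })
  if (error || !data) return ''

  // Keep the N most recent rows per type, formatted as prompt sections
  const byType = new Map<string, string[]>()
  for (const row of data) {
    const bucket = byType.get(row.type) ?? []
    if (bucket.length < PROMPT_CAP_PER_TYPE) bucket.push(row.content)
    byType.set(row.type, bucket)
  }
  return [...byType.entries()]
    .map(([type, items]) => `${type.toUpperCase()}:\n${items.map((i) => `- ${i}`).join('\n')}`)
    .join('\n\n')
}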

Feedback Review Agent

A system agent (registered once, available to every tenant via is_system=true per existing agent conventions) that runs on a scheduled heartbeat for each tenant that opts in.

Configuration (default):

  • Heartbeat cron: hourly per tenant (skipped when no pending feedback exists)
  • Tool groups: context, entities-read, agents-admin, feedback-review
  • Instructions (base prompt):

    You are the Feedback Review Agent. Every hour, review pending feedback in this workspace. For each feedback or group of related feedback:

    1. Read the source context (chat, entity, session) to understand what happened
    2. Decide if this represents a durable pattern or a one-off
    3. If durable: create a correction, lesson, or insight via addCorrection/addLesson/addInsight with source_feedback_id provenance
    4. If structural: consider updating agent config or entity type schema
    5. If noise: dismiss with review_notes explaining why
    6. Mark feedback as applied or dismissed

    Prefer to dismiss weak signals rather than pollute shared_context with low-value rules.

Tools (new):

  • listFeedback({ status, sourceType?, agentId?, limit }) — paginate pending feedback
  • getFeedback(id) — detail view with linked source context
  • reviewFeedback(id, action: 'applied' | 'dismissed', notes: string) — mark processed
  • addInsight(content, context?) — new shared_context type for observations
  • updateSharedContext(id, { active?, content?, context? }) — tune existing rules
  • deactivateSharedContext(id, reason) — soft-delete a stale rule
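
For the processing side, a sketch of the reviewFeedback tool body, assuming the same supabase-js admin client as the other sketches; the guard that only unprocessed feedback can be transitioned is an assumption about how the status machine should behave:

import type { SupabaseClient } from '@supabase/supabase-js'

export async function reviewFeedback(
  admin: SupabaseClient,
  tenantId: string,
  reviewerAgentId: string,
  id: string,
  action: 'applied' | 'dismissed',
  notes: string,
) {
  const { error } = await admin
    .from('feedback')
    .update({
      status: action,
      review_notes: notes,
      reviewed_by_agent_id: reviewerAgentId,
      reviewed_at: new Date().toISOString(),
    })
    .eq('id', id)
    .eq('tenant_id', tenantId)
    .in('status', ['pending', 'reviewed']) // already applied/dismissed rows are left alone
  if (error) throw error
}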

Guardrails:

  • The review agent has a permission role that lets it read feedback but not delete it or any other data
  • It cannot create new agents or delete entity types
  • All actions are logged in session_events for audit
  • Admins can disable the heartbeat and review feedback manually

Admin "Agent Review" section

Replaces /admin/conversations (PR 707) and absorbs /sessions. Lives at /admin/agent-review under the "AI & Agents" group in ADMIN_SECTIONS.

Tabs:

  1. Conversations — Chat observability. Queries chats + messages + feedback table. Browse all chats, view transcripts, see thumbs and feedback inline, link to related sessions. Admin can add observation feedback directly from messages. (If PR 707 lands concurrently, its implementation is rebased onto this structure and its /admin/conversations route redirects here.)

  2. Sessions — Execution browser for all session types (agent/response/tool/mixed). Moved from user-facing /sessions page. Filter by status, type, agent, date. Link to session detail transcript.

  3. Feedback — Unified feed across all sources. Filter by status (pending/reviewed/applied/dismissed), source_type, rating, agent. Click through to source (chat message, entity, session). Bulk actions: mark reviewed, dismiss with shared note.

  4. Knowledge — Manage shared_context rows. Tabs within: corrections, lessons, routing, insights, guidelines. Each row shows content, context, scope (agent/entity_type), active state, provenance (which feedback generated it, which agent wrote it). CRUD: add, edit, deactivate, delete.

  5. Evals — Golden set management and accuracy reports. List of eval sets (tagged entity groups), per-set accuracy dashboards, trend charts, drill-down into specific failed cases, link to trigger re-runs.

Eval system (session-centric golden sets)

Any session can be a golden case. An agent session (chat, extraction, response, tool, mixed) captures inputs, execution (all tool calls via session_events), and outputs. Promoting a session as "golden" turns it into a reusable regression test: replay the same inputs against a new agent or config, compare the new session's behavior to the golden one, and report divergence.

This generalizes from "entity extraction accuracy" to "does the agent still behave the way we want for this scenario." Response-level evals (compare extracted field values to expected values) become one case; chat behavior (did the agent use the right tools, produce the right summary) and tool-session behavior (did it call the right API with the right params) are all covered by the same mechanism.

Defining a golden session:

  • Admin reviews a completed session in /admin/agent-review/sessions or /admin/agent-review/conversations
  • Admin clicks "Promote as golden" on a session whose outputs they validated as correct
  • System writes sessions.metadata.golden = { set: '<set-name>', promoted_at, promoted_by, snapshot } where snapshot captures the reproducible inputs (agent_id, initial message, entity_id, view_context, tool_grants, etc.)
  • Golden sessions are immutable: sessions.status is frozen to completed and the session cannot be deleted while golden (soft-enforced in server actions + RLS)
  • Golden sessions can be organized into named sets ('pe-extraction', 'chat-smoke', 'tool-accuracy') via metadata.golden.set
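
A sketch of the metadata.golden shape written at promotion time; set, promoted_at, promoted_by, and snapshot come from the list above, while the individual snapshot fields beyond agent_id, initial message, entity_id, view_context, and tool_grants are assumptions:

// Shape stored at sessions.metadata.golden when a session is promoted.
type GoldenSnapshot = {
  agent_id: string
  initial_message?: string
  entity_id?: string
  view_context?: Record<string, unknown>
  tool_grants?: string[]
  // frozen prompt context (entity types, shared_context rules, memories, skills),
  // per Open Question 6 below
  prompt_context?: Record<string, unknown>
}

type GoldenMetadata = {
  set: string          // e.g. 'pe-extraction', 'chat-smoke'
  promoted_at: string  // ISO timestamp
  promoted_by: string  // user id
  snapshot: GoldenSnapshot
}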

Running an eval:

  • Eval run triggered from admin UI, scheduled, or on post-deploy hook
  • For each golden session in the target set:
    1. Read the input snapshot from sessions.metadata.golden.snapshot
    2. Create a new session with those inputs (possibly using a different agent_id to test a candidate config)
    3. Execute the session to completion via the session executor
    4. Compare the new session's events + outputs to the golden session's events + outputs
  • Writes comparison results to sessions.metadata.eval_result on the replay session, linking back to the golden via metadata.eval_source = golden_session_id
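
Putting those steps together, a sketch of one eval run over a golden set. The helper functions are stand-ins for the replay runner, the session executor, and features/evals/lib/compare.ts described below; none of them exist yet under these names:

type GoldenSnapshot = { agent_id: string } & Record<string, unknown>
type GoldenSession = { id: string; metadata: { golden: { snapshot: GoldenSnapshot } } }
type ComparisonResult = { overall_accuracy: number } // trimmed; full shape below

declare function createSessionFromSnapshot(
  snapshot: GoldenSnapshot,
  opts: { agentId: string; evalSource: string },
): Promise<{ id: string }>
declare function executeSession(sessionId: string): Promise<void>
declare function compareSessions(goldenId: string, replayId: string): Promise<ComparisonResult>
declare function saveEvalResult(replayId: string, result: ComparisonResult): Promise<void>

export async function runEvalSet(goldenSessions: GoldenSession[], candidateAgentId?: string) {
  for (const golden of goldenSessions) {
    const snapshot = golden.metadata.golden.snapshot
    // New session from the frozen inputs, optionally against a candidate agent;
    // the replay carries metadata.eval_source = golden.id
    const replay = await createSessionFromSnapshot(snapshot, {
      agentId: candidateAgentId ?? snapshot.agent_id,
      evalSource: golden.id,
    })
    await executeSession(replay.id) // run to completion via the session executor
    const result = await compareSessions(golden.id, replay.id)
    await saveEvalResult(replay.id, result) // written to sessions.metadata.eval_result
  }
}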

Comparison logic (features/evals/lib/compare.ts), dispatched by session_type:

type ComparisonResult = {
  session_type: SessionType
  dimensions: Record<string, DimensionResult> // field-by-field comparison
  overall_accuracy: number // 0..1
  divergences: Divergence[] // ordered list of mismatches for drill-down
}

// agent | chat sessions: compare tool-call sequence + final message
// - tool_calls: set overlap (order-insensitive) with expected tools
// - tool_args: per-tool, compare input args via jsonDiff
// - final_message: semantic similarity score (optional) + keyword presence
// - entities_created: compare what entities the session created
// - entities_updated: compare which fields were mutated

// response sessions: compare values per criteria dimension
// - numeric: tolerance-based match
// - text: exact or semantic similarity
// - select: exact match
// - relation-rank: order-aware set comparison

// tool sessions: compare tool output and any entity mutations
// - output: jsonDiff against golden output
// - entity mutations: same as agent

// extraction tasks (session_type='agent' with output_type='field'|'fields'):
// - submitResponse values: dimension-by-dimension comparison
// - field metadata: source attribution, confidence
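
As one concrete dimension, a sketch of the order-insensitive tool-call comparison for agent/chat sessions. DimensionResult here is a guess at the shape referenced in ComparisonResult above:

type DimensionResult = { passed: boolean; score: number; details?: string }

// Order-insensitive overlap between golden and replay tool-call sets.
function compareToolCalls(goldenTools: string[], replayTools: string[]): DimensionResult {
  const golden = new Set(goldenTools)
  const replay = new Set(replayTools)
  const missing = [...golden].filter((t) => !replay.has(t))
  const extra = [...replay].filter((t) => !golden.has(t))
  const union = new Set([...goldenTools, ...replayTools]).size
  // Jaccard-style score: 1.0 means identical tool sets, order ignored
  const score = union === 0 ? 1 : (union - missing.length - extra.length) / union
  const details = [
    missing.length ? `missing: ${missing.join(', ')}` : null,
    extra.length ? `extra: ${extra.join(', ')}` : null,
  ].filter((d): d is string => d !== null).join('; ')
  return { passed: missing.length === 0 && extra.length === 0, score, details: details || undefined }
}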

Scoring and reporting:

  • Per dimension: pass/fail + similarity score
  • Per session: overall accuracy % (weighted by dimension importance)
  • Per agent: accuracy across all golden sessions in a set
  • Per set: accuracy trend over time (chart)
  • Admin UI shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
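
A sketch of the per-session rollup implied by the first two bullets: overall accuracy as a weighted mean of dimension scores. The default weight of 1 and the weights map are assumptions; the document only specifies "weighted by dimension importance":

type DimensionScore = { passed: boolean; score: number }

function overallAccuracy(
  dimensions: Record<string, DimensionScore>,
  weights: Record<string, number> = {},
): number {
  const entries = Object.entries(dimensions)
  if (entries.length === 0) return 1
  let weighted = 0
  let totalWeight = 0
  for (const [name, result] of entries) {
    const w = weights[name] ?? 1 // dimensions without an explicit weight count equally
    weighted += result.score * w
    totalWeight += w
  }
  return weighted / totalWeight
}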

Self-improvement integration:

  • When a replay fails on a golden session, the divergence becomes a feedback row (source_type='session', context includes golden_session_id and replay_session_id)
  • Feedback review agent picks it up, analyzes the divergence, generates a correction or adjusts agent config
  • Accuracy-trend dashboard surfaces regressions within the admin review section

Persona-driven golden sets:

  • content/docs/personas/*.mdx defines user personas and their jobs-to-be-done
  • Each persona should have at least one golden session capturing a canonical interaction (e.g., pe-analyst asking for deal comparisons, sales-ops adding contacts)
  • Persona-based golden sets give us user-journey regression coverage, not just field-level accuracy

No new tables in phase 5. sessions.metadata.golden and sessions.metadata.eval_result carry all the data. If we later need cross-run aggregation dashboards (accuracy over the last 10 runs), we can add an eval_runs table in phase 5b:

CREATE TABLE eval_runs (
  id uuid PRIMARY KEY,
  tenant_id uuid NOT NULL,
  name text NOT NULL, -- 'golden-pe-extraction', 'chat-smoke-2026-04-15'
  set_name text, -- the golden set being evaluated
  agent_id uuid, -- candidate agent being tested
  triggered_by text, -- 'manual' | 'scheduled' | 'post-deploy'
  started_at timestamptz NOT NULL DEFAULT now(),
  completed_at timestamptz,
  golden_session_count integer,
  passed_count integer,
  failed_count integer,
  results jsonb NOT NULL DEFAULT '{}'::jsonb,
  -- { bySession: { goldenSessionId: { passed, overall_accuracy, divergences } }, ... }
  created_by uuid
);

Phase 5 ships without eval_runs; per-session metadata is enough to list recent replays and drill into divergences. Add the table when the UI needs cross-run charts.
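
For reference, listing the replays of a golden session needs nothing beyond a jsonb containment filter on the existing sessions table, which is the main reason eval_runs can wait. A sketch with supabase-js; the column and metadata key names follow the design above:

import type { SupabaseClient } from '@supabase/supabase-js'

async function listReplaysForGolden(admin: SupabaseClient, tenantId: string, goldenSessionId: string) {
  const { data, error } = await admin
    .from('sessions')
    .select('id, status, created_at, metadata')
    .eq('tenant_id', tenantId)
    .contains('metadata', { eval_source: goldenSessionId }) // jsonb @> containment
    .order('created_at', { ascending: false })
  if (error) throw error
  return data
}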

Migration path

Phase 1: schema

  1. Create feedback table
  2. Backfill feedback from chat_feedback (source_type='chat', rating mapping from 'up'/'down')
  3. Create new shared_context table structure (temporary name shared_context_v2, or drop-and-recreate in same migration)
  4. Backfill from old shared_context JSON arrays — expand each array item into a row with appropriate type and parsed content/context
  5. Drop old shared_context, rename shared_context_v2 → shared_context
  6. Mark chat_feedback as deprecated; drop in next release

Phase 2: code

  1. Rewrite features/context/server/actions.ts and features/context/lib/load.ts to read/write normalized rows. Preserve public signatures.
  2. Update addCorrection/addLesson tools to write rows (signatures unchanged, internals changed)
  3. Update features/chat/feedback/server/actions.ts to write to feedback table
  4. Update chat feedback API routes to query feedback table
  5. Wire response rejection flow to insert feedback row
  6. Wire extraction rejection flow to insert feedback row alongside the retry
  7. Wire session failure to insert feedback row

Phase 3: UI

  1. Build features/feedback/ module (types, hooks, components)
  2. Build features/evals/ module (types, compare logic, reporting components)
  3. Build new /admin/agent-review section with 5 tabs
  4. Add agent-review entry to ADMIN_SECTIONS under "AI & Agents" group
  5. Delete app/(app)/sessions/page.tsx, add redirect to /tasks
  6. Keep /sessions/[id] detail page — linked from admin and entity panels
  7. If PR 707 lands concurrently, add a redirect from /admin/conversations → /admin/agent-review/conversations

Phase 4: review agent

  1. Create system agent record (feedback-review-agent)
  2. Create task template for "Feedback Review"
  3. Configure heartbeat
  4. Write system prompt and guardrails
  5. Ship disabled by default — admin opts in per workspace

Phase 5: evals

  1. Add "Promote as golden" action to session detail + conversation detail (writes metadata.golden)
  2. Add protection in session actions: block delete and block status changes when metadata.golden is set
  3. Build input-snapshot extractor (features/evals/server/snapshot.ts) that captures reproducible inputs from session + events
  4. Build replay runner (features/evals/server/replay.ts) that creates a new session from a snapshot, invokes the session executor, and sets metadata.eval_source
  5. Build comparison logic per session_type (compare-agent.ts, compare-response.ts, compare-tool.ts)
  6. Build admin eval UI: golden session manager, replay trigger, pass/fail matrix, divergence drill-down
  7. Seed persona-driven golden sessions for pe-analyst, sales-ops, first-time-admin as a starter set
  8. Wire replay failures into feedback table as source_type='session' entries so the review agent picks up regressions

Each phase is independently shippable. Phases 1-2 are the foundation; phases 3-5 can land incrementally.

Trade-offs

Why a new feedback table instead of generalizing session_events? Session events are append-only audit telemetry — "agent called tool X at time Y." Feedback is evaluation — "the output was bad because Z." They have different lifecycles (events are immutable; feedback has a status machine), different query patterns (events are per-session; feedback is cross-session aggregate), and different readers (events for replay; feedback for learning). Mixing them would blur the intent and bloat the events table.

Why a new shared_context structure instead of an entity type? Considered making shared_context rows entities of type agent-knowledge or learning. Pros: zero new tables, standard entity CRUD, tags/relations free. Cons: entity overhead on the hot path (every agent prompt loads shared context), entity semantics don't fit ephemeral lifecycle knowledge well, and the admin UI for entities is geared toward business records, not system metadata. The dedicated table is ~50 lines of migration and wins on performance, semantics, and purpose-fit.

Why drop chat_feedback instead of keeping both? Keeping chat_feedback as a chat-specific table + a generic feedback table means chat thumbs would write to two places, or chat-specific feedback would split from everything else. Consolidating to one table means one query path, one review pipeline, one admin view. Migration is straightforward (tens of rows in practice).

Why keep shared_context as the name? Preserves zero rename churn across loadSharedContextPrompt(), CONTEXT_KEYS, tests, docs. The semantic meaning ("context shared across all agent runs for this tenant") still fits the new structure. User preference aligned.

Why not merge feedback and shared_context into one table? Feedback is input (evaluation signal); shared_context is output (distilled knowledge). A single correction in shared_context can be produced by reviewing 5-10 feedback signals. The write patterns, lifecycles, and read patterns are different enough that one table would force either a status='feedback' vs status='rule' split (back to a type column hack) or an awkward denormalization. Two tables is cleaner.

Why defer eval_runs table? Starting with per-session metadata (sessions.metadata.eval_result, linked to golden via metadata.eval_source) is enough for the first phase. Replays are themselves sessions — listing them, filtering by status, drilling in, all work via the existing session queries. If we need cross-run aggregation dashboards or eval history charts, add the table then. YAGNI.

Why session-centric evals instead of entity+response? Considered the narrower approach: tag entities as eval set, promote responses as golden, compare new responses. This works for extraction accuracy but misses chat behavior, tool-call correctness, and multi-step agent journeys. Session-centric evals let us regress-test any agent output — including the persona journeys defined in content/docs/personas/ — with one mechanism. The entity+response case is a specialization (response sessions compared dimension-by-dimension). One primitive covers every case.

Why not snapshot session inputs into a dedicated golden_cases table? Considered extracting golden input snapshots into their own table so original sessions could be freely modified or deleted. Decided against: the session record already contains the inputs (first events in session_events), and snapshotting to metadata.golden.snapshot at promotion time locks in the reproducible view. This avoids a parallel data model and lets admins browse golden sessions through the same session list UI they already use. Immutability is enforced at the action layer rather than at a schema level.

Acceptance Criteria

Feedback intake

  • feedback table exists with the schema above, RLS policies, and indexes
  • Chat thumbs write to feedback with source_type='chat' and correct context
  • Response rejection writes to feedback with source_type='response'
  • Extraction rejection writes to feedback alongside the retry
  • Session failure writes to feedback automatically
  • chat_feedback data fully backfilled into feedback
  • chat_feedback table dropped after one release

Workspace knowledge

  • shared_context normalized table exists with schema above, RLS, indexes
  • All existing JSON-array rows backfilled into individual rows
  • addCorrection/addLesson tools write rows (same signatures)
  • New addInsight tool exists
  • loadSharedContextPrompt() signature unchanged; reads from new structure
  • Scoping works: agent_id-specific rules only load for that agent, entity_type_id-specific rules only load for that type
  • active = false rows excluded from prompt injection
  • Provenance links preserved (source_feedback_id, created_by_agent_id, created_by_user_id)

Feedback review agent

  • System agent exists with correct tool groups and permissions
  • Heartbeat task runs on schedule
  • Review agent can list, read, and process pending feedback
  • Review agent creates shared_context rows with provenance
  • Review agent can update agent configs when appropriate
  • Review agent marks feedback as applied/dismissed with notes
  • All review actions logged in session_events
  • Admin can disable review agent per workspace

Admin UI

  • /admin/agent-review exists under "AI & Agents"
  • agent-review in ADMIN_SECTIONS
  • Conversations tab works (from PR 707, updated data source)
  • Sessions tab works (moved from user-facing)
  • Feedback tab shows unified feed with filters
  • Knowledge tab manages shared_context rows (CRUD)
  • Evals tab shows golden set reports
  • /sessions redirects to /tasks
  • Session detail page /sessions/[id] remains accessible from admin and entity panels
  • /admin/conversations redirects to /admin/agent-review/conversations

Eval system

  • Any completed session can be promoted as golden via /admin/agent-review/sessions or /admin/agent-review/conversations
  • Golden sessions capture reproducible input snapshot in sessions.metadata.golden.snapshot
  • Golden sessions are immutable (protected from deletion + status changes while flagged)
  • Admin can organize golden sessions into named sets (pe-extraction, chat-smoke, etc.)
  • Eval run replays each golden session's inputs against target agent and produces a replay session
  • Comparison logic dispatches by session_type:
    • agent/chat: tool-call set overlap, args jsonDiff, entities created/updated, final message similarity
    • response: per-dimension value comparison (numeric tolerance, exact, semantic)
    • tool: output jsonDiff + entity mutations
  • Eval report shows pass/fail matrix (golden sessions × agent configs) with drill-down to per-dimension divergence
  • Accuracy trend tracked per golden set over time
  • Persona-driven golden sets exist for at least pe-analyst, sales-ops, first-time-admin
  • Failed replays automatically generate a feedback row linking the golden and replay sessions, picked up by the review agent

Quality gates

  • All new migrations are reversible
  • pnpm test passes with updated and new tests
  • pnpm typecheck, pnpm lint, pnpm build pass
  • documents/DATABASE.md reflects new schema
  • /docs/features/agent-system updated with feedback + evals sections
  • content/docs/data-model.mdx updated with new tables
  • Feature docs for /admin/agent-review added to content/docs/features/

Files

Database

  • supabase/migrations/20260415010000_feedback_table.sql — create feedback + backfill from chat_feedback
  • supabase/migrations/20260415010100_shared_context_normalized.sql — restructure shared_context + backfill from JSON arrays
  • supabase/migrations/20260415010200_drop_chat_feedback.sql — drop deprecated chat_feedback (next release after phase 1 ships)

Feedback module (new)

  • features/feedback/types.ts — FeedbackRecord, FeedbackSourceType, FeedbackStatus
  • features/feedback/server/actions.ts — createFeedback, listFeedback, reviewFeedback, getFeedback
  • features/feedback/server/queries.ts — tenant-scoped query builders with joins
  • features/feedback/hooks/use-feedback-list.ts — React Query hook
  • features/feedback/components/feedback-list.tsx — list component with filters
  • features/feedback/components/feedback-detail.tsx — detail view with source context
  • app/api/feedback/route.ts — POST (create), GET (list with filters)
  • app/api/feedback/[id]/route.ts — GET (detail), PATCH (review)

Context module (refactored)

  • features/context/types.ts — rewritten: SharedContextRecord, SharedContextType enum
  • features/context/server/actions.ts — createSharedContextRule, updateSharedContextRule, deactivateSharedContextRule, listSharedContext
  • features/context/lib/load.ts — loadSharedContextPrompt() signature unchanged; reads normalized rows
  • features/context/lib/format.ts — format functions unchanged externally
  • features/context/components/shared-context-manager.tsx — admin CRUD component
  • app/api/admin/shared-context/route.ts — GET (list), POST (create)
  • app/api/admin/shared-context/[id]/route.ts — PATCH (update), DELETE (hard delete if needed)

Tools (updated)

  • features/tools/context-tools.ts — addCorrection, addLesson updated to write rows; new addInsight, updateSharedContext, deactivateSharedContext
  • features/tools/feedback-tools.ts (new) — listFeedback, getFeedback, reviewFeedback, addFeedback

Feedback intake wiring

  • features/chat/components/message-actions-bar.tsx — write to /api/feedback instead of /api/chats/[id]/feedback
  • features/chat/feedback/server/actions.ts — rewrite to use feedback table
  • features/responses/server/actions.ts — on rejection, insert feedback row
  • features/inngest/functions/feedback-rerun.ts — insert feedback row alongside retry
  • features/sessions/server/event-log.ts — on session failure transition, insert feedback row

Feedback review agent

  • features/agents/system-agents.ts — register feedback-review-agent
  • features/tasks/system-tasks.ts — register "Feedback Review" task template
  • Migration to seed review agent + task

Evals module (new)

  • features/evals/types.ts — GoldenSessionSnapshot, EvalComparison, Divergence, DimensionResult
  • features/evals/server/actions.ts — promoteSessionToGolden, revokeGoldenStatus, replayGoldenSession, listGoldenSessions, listReplaysForGolden
  • features/evals/server/snapshot.ts — extract reproducible input snapshot from a session + events
  • features/evals/server/replay.ts — construct new session from snapshot, invoke session executor
  • features/evals/lib/compare.ts — top-level comparator dispatching by session_type
  • features/evals/lib/compare-agent.ts — tool-call sequence, args, entity mutations, final message
  • features/evals/lib/compare-response.ts — per-dimension value comparison (reuses existing scoring primitives)
  • features/evals/lib/compare-tool.ts — tool output + mutation comparison
  • features/evals/lib/similarity.ts — helpers (numeric tolerance, keyword presence, optional semantic similarity)
  • features/evals/components/eval-report.tsx — pass/fail matrix + trend
  • features/evals/components/golden-session-manager.tsx — list, promote, organize into sets
  • features/evals/components/replay-diff.tsx — side-by-side divergence drill-down
  • features/sessions/server/actions.ts — add is_golden helper (reads metadata.golden) + guard against delete/status-change when golden
  • app/api/admin/evals/golden/route.ts — GET (list golden sessions), POST (promote session)
  • app/api/admin/evals/golden/[id]/route.ts — DELETE (revoke), PATCH (rename set)
  • app/api/admin/evals/runs/route.ts — POST (trigger replay for a set or single session)
  • app/api/admin/evals/runs/[id]/route.ts — GET (replay results + diff)

Admin Agent Review section

  • app/(app)/admin/agent-review/page.tsx — landing page / tab container
  • app/(app)/admin/agent-review/conversations/page.tsx — absorbs PR 707
  • app/(app)/admin/agent-review/conversations/[id]/page.tsx — absorbs PR 707 detail
  • app/(app)/admin/agent-review/sessions/page.tsx — moved from user-facing
  • app/(app)/admin/agent-review/feedback/page.tsx — new
  • app/(app)/admin/agent-review/knowledge/page.tsx — new
  • app/(app)/admin/agent-review/evals/page.tsx — new
  • features/admin/lib/sections.ts — add agent-review entry

Removed / redirected

  • app/(app)/sessions/page.tsx — delete, redirect to /tasks
  • app/(app)/admin/conversations/** — redirect to /admin/agent-review/conversations/**
  • chat_feedback table — drop after backfill + one release

Docs

  • content/docs/features/agent-system.mdx — add feedback + evals + review sections
  • content/docs/data-model.mdx — document feedback and new shared_context
  • content/docs/features/agent-review.mdx (new) — admin section guide
  • documents/DATABASE.md — update table inventory
  • documents/CHANGELOG.md — entry when phase 1 ships

Open Questions

  1. Should feedback rows be user-visible? Users can see their own chat feedback today. Decision: yes for user's own feedback (created_by = user), no for aggregate or other users' feedback (admin-only). RLS handles this.
  2. How aggressive should the review agent be? Over-eager creation of rules pollutes shared_context; under-eager leaves signal on the table. Start conservative — review agent creates rules only when 2+ feedback signals support the same pattern, or an admin manually flags a pattern.
  3. Per-agent shared_context scoping — opt-in or automatic? When an agent writes a lesson from its own session, should it default to agent_id = self (narrow scope) or agent_id = null (workspace-wide)? Default to workspace-wide with explicit agent-specific opt-in, since most learnings should generalize.
  4. Eval run scheduling — manual or automatic? Start manual (admin triggers from UI). Phase 2 could add post-deploy automatic runs and scheduled regression checks.
  5. Backfill of extraction rejections into feedback? The feedback-rerun.ts path is alive today but doesn't record feedback. Backfill historical extraction rejections from entity_responses where status='rejected' with rejection reasons? Decision: yes, one-time backfill as part of phase 1 migration.
  6. How reproducible are chat session inputs? Chat sessions depend on user_info, workspace context, skills, and shared_context at the time of the original run — all of which change over time. Replay fidelity will drift. Decision: snapshot the full prompt context (entity types, shared_context rules, memories, skills) into metadata.golden.snapshot at promotion time. Replays use the snapshot, not live context. Admins can opt into "live context" replays if they want to test whether new shared_context rules actually improve behavior on past scenarios.
  7. Semantic similarity for text comparisons — embedding model? Comparing chat final messages or text field values benefits from embedding-based similarity. Decision: phase 5 ships with keyword-overlap + exact-match. Phase 5b adds embedding similarity if the signal-to-noise ratio warrants the cost/complexity.
