Field Extraction
Per-field agentic extraction pipeline with configurable sources, dependency chains, multi-agent consensus, confidence levels, approval workflows, cascade creation, and cross-entity feedback refinement.
<Callout type="danger" title="ARCHIVED & DEPRECATED">
**This documentation is obsolete.** As of PR 620, the legacy execution engine has been removed. **Tasks** and the Unified Sessions primitive fully replace this functionality. Do not build new features on these patterns.
Please refer to [Tasks](/docs/features/tasks) and [Response System](/docs/features/response-system) for the new execution architecture.
</Callout>
## Overview
Field extraction is the system by which AI agents populate entity fields automatically. Rather than a monolithic "enrich this entity" approach, each field runs its own agent loop with targeted instructions, configurable data sources, and dependency ordering. The result is a traceable, auditable pipeline where every extracted value has provenance (which agent, which sources, what confidence) and flows into the canonical response system.
Extraction is deeply integrated with the workflow engine. Each extractable field becomes a workflow `agent_task` node, with dependencies controlling execution order and `wait_human` nodes pausing the pipeline for manual review or approval.
The feedback refinement loop closes the quality loop: an extraction rejected with a reason is automatically re-run with that reason injected as additional instructions. Over time, cross-entity feedback summaries (approval rates, common rejection reasons) improve extraction quality across the entire entity type.
## Key Concepts
### Field Configuration
The `FieldConfig` type drives extraction behavior. It is stored per-field in `EntityTypeConfig.fields`:
```typescript
interface FieldConfig {
  label?: string;
  displayType?: string;
  extraction?: {
    instructions: string; // Detailed prompt for the agent
    agentSlug?: string; // Override agent (falls back to type default)
    sources?: ExtractionSource[]; // Where to search for information
    dependsOn?: string[]; // Fields that must be populated first
    required?: boolean; // Must be populated for workflow completion
    requiresApproval?: boolean; // Pause for human approval before promotion
    validation?: string; // Human-readable validation criteria
    maxSteps?: number; // Max agent loop iterations (default: 5)
    consensus?: {
      agentSlugs: string[]; // Multiple agents to run
      strategy: "majority" | "unanimous" | "best-confidence";
    };
  };
  humanInput?: boolean; // Requires human input, blocks dependents
  connection?: {
    /* ... */
  }; // Connection field config
  actions?: Array<{
    /* ... */
  }>; // Lifecycle event actions
}
```
### Extraction Sources
```typescript
const EXTRACTION_SOURCES = [
  "linked-documents", // Documents connected to this entity
  "all-documents", // All documents in the tenant
  "connected-entities", // Entities linked via relations
  "all-entities", // All entities in the tenant
  "web", // Web search via Exa API
] as const;
```
Sources control which extraction-specific tools are available to the agent during field extraction. Each source type enables corresponding tools:
| Source | Tool Enabled |
| -------------------- | ---------------------------------------------------------------------------------- |
| `linked-documents` | `searchLinkedDocuments` -- hybrid search across documents connected to this entity |
| `all-documents` | `searchLinkedDocuments` with tenant-wide scope |
| `connected-entities` | `searchConnectedEntities` -- entities linked via relations |
| `all-entities` | `searchConnectedEntities` with tenant-wide scope |
| `web` | `webSearch` -- Exa API integration |
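The table above can be read as a lookup from source to tool. The following is a minimal sketch of that lookup, not the pipeline's actual code: the `EnabledTool` shape, the `scope` values, and the `toolsForSources` helper are hypothetical names introduced only for illustration.

```typescript
const EXTRACTION_SOURCES = [
  "linked-documents",
  "all-documents",
  "connected-entities",
  "all-entities",
  "web",
] as const;

type ExtractionSource = (typeof EXTRACTION_SOURCES)[number];

// Hypothetical shape: the tool name from the table plus the scope it runs with.
interface EnabledTool {
  name: "searchLinkedDocuments" | "searchConnectedEntities" | "webSearch";
  scope: "entity" | "tenant" | "web";
}

// One entry per row of the source-to-tool table.
const SOURCE_TOOLS: Record<ExtractionSource, EnabledTool> = {
  "linked-documents": { name: "searchLinkedDocuments", scope: "entity" },
  "all-documents": { name: "searchLinkedDocuments", scope: "tenant" },
  "connected-entities": { name: "searchConnectedEntities", scope: "entity" },
  "all-entities": { name: "searchConnectedEntities", scope: "tenant" },
  web: { name: "webSearch", scope: "web" },
};

// Resolve a field's `sources` config to the tools it enables, without duplicates.
function toolsForSources(sources: ExtractionSource[]): EnabledTool[] {
  const seen = new Set<string>();
  const tools: EnabledTool[] = [];
  for (const source of sources) {
    const tool = SOURCE_TOOLS[source];
    const key = `${tool.name}:${tool.scope}`;
    if (!seen.has(key)) {
      seen.add(key);
      tools.push(tool);
    }
  }
  return tools;
}
```

How the real pipeline resolves a field that lists both the entity-scoped and tenant-wide variant of the same tool is not stated here; the sketch simply keeps both entries.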
### Extraction Results
Each field extraction produces an `ExtractionResult` record:
| Field | Type | Description |
| ---------------------- | --------- | ----------------------------------------------- |
| `run_id` | uuid | Parent extraction run |
| `entity_id` | uuid | Target entity |
| `field_name` | string | Which field was extracted |
| `value` | jsonb | The extracted value |
| `sources` | jsonb | Array of `ExtractionSourceRef` with provenance |
| `confidence` | enum | `high`, `medium`, or `low` |
| `agent_id` | string | Which agent extracted the value |
| `status` | enum | `pending`, `approved`, `rejected`, `superseded` |
| `rejection_reason` | text | Why a reviewer rejected this result |
| `reviewed_by` | uuid | Who approved/rejected |
| `reviewed_at` | timestamp | When review occurred |
| `created_entity_ids` | uuid[] | Entities created via cascade |
| `workflow_node_run_id` | uuid | Cross-reference to workflow node |
| `duration_ms` | integer | Extraction time |
| `tokens_used` | integer | Token consumption |
### Confidence Levels
Agents submit a confidence assessment with each extracted value:
- **high** -- Value is clearly stated in a reliable source, or computed from well-defined inputs.
- **medium** -- Value is inferred or derived from partial information.
- **low** -- Value is a best guess, from ambiguous or conflicting sources.
### Extraction Runs
An `ExtractionRun` groups results from a single extraction execution. Runs track:
- **scope** -- `full` (all fields), `field` (specific fields), or `consensus` (multi-agent).
- **triggered_by** -- User ID or system identifier that initiated the extraction.
- **fields_requested** -- Which fields were targeted (null for full runs).
- **duration_ms** -- Total execution time.
- **status** -- `running`, `completed`, `failed`, or `partial`.
## How It Works
### Single-Field Extraction Pipeline
The `extractField()` function in `features/entities/extraction/extract-field.ts` runs a single field through the agent loop:
1. **Build system prompt** -- Includes entity context (title, type, current content), the field's extraction instructions, any additional instructions (from reruns or workflow overrides), and cross-entity context (parent entity, connected entities).
2. **Configure tools** -- Based on the field's `sources` config, the appropriate extraction tools are enabled (`searchLinkedDocuments`, `searchConnectedEntities`, `getEntityField`, `webSearch`). The `submitValue` tool is always available -- it is the final call that returns the extracted value.
3. **Run agent loop** -- `streamText()` executes the agent with the configured tools. The agent searches for information, reasons about the data, and ultimately calls `submitValue` with the extracted value, sources, and confidence level.
4. **Capture source refs** -- Tool calls are intercepted to auto-capture structured `ExtractionSourceRef` entries. When the agent searches documents, the matching documents become sources. When it reads connected entities, those entities become sources.
5. **Persist result** -- The extracted value and sources are saved to `extraction_results` with `pending` status.
### Full Extraction Pipeline
The `runExtraction()` function orchestrates multi-field extraction:
1. **Dependency sorting** -- `sortFieldsByDependency()` uses Kahn's algorithm to produce execution layers. Fields with no dependencies are in layer 0; fields depending on layer-0 fields are in layer 1, and so on.
2. **Layer-by-layer execution** -- Layers run in order, and the fields within each layer execute in parallel. This maximizes throughput while respecting dependency ordering. Fields within the same layer have no dependencies on each other and can safely run concurrently.
3. **Value locking** -- Fields listed in `entity.metadata.lockedFields` are skipped during extraction. This allows users to manually set a value and prevent agents from overwriting it.
4. **Cross-entity context injection** -- Before extraction begins, the engine pre-fetches the entity's parent (if `parent_id` is set) and connected entities (up to 10 via `entity_relations`). This context is injected into every field extraction's system prompt so agents understand the entity's neighborhood.
5. **Result creation** -- Each field's extraction result is persisted. Previous pending/approved results for the same field are superseded (status set to `superseded`).
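The dependency sorting and value-locking steps above can be sketched as a layered variant of Kahn's algorithm. This is an illustration under stated assumptions, not the source of `sortFieldsByDependency()`: the function name `sortIntoLayers`, its input shape (field name mapped to its `dependsOn` list), and the choice to treat locked or unknown dependencies as already populated are all hypothetical.

```typescript
// Layered Kahn's algorithm: layer 0 holds fields with no pending dependencies,
// layer 1 holds fields that only depend on layer 0, and so on.
function sortIntoLayers(
  dependsOn: Record<string, string[]>,
  lockedFields: string[] = [],
): string[][] {
  const locked = new Set(lockedFields);
  // Locked fields are skipped entirely and never scheduled.
  const fields = Object.keys(dependsOn).filter((f) => !locked.has(f));

  const inDegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const f of fields) {
    inDegree.set(f, 0);
    dependents.set(f, []);
  }
  for (const f of fields) {
    for (const dep of dependsOn[f] ?? []) {
      const list = dependents.get(dep);
      // Assumption: a locked or unknown dependency counts as already populated.
      if (!list) continue;
      list.push(f);
      inDegree.set(f, (inDegree.get(f) ?? 0) + 1);
    }
  }

  const layers: string[][] = [];
  let current = fields.filter((f) => inDegree.get(f) === 0);
  let placed = 0;
  while (current.length > 0) {
    layers.push(current);
    placed += current.length;
    const next: string[] = [];
    for (const f of current) {
      for (const d of dependents.get(f) ?? []) {
        const remaining = (inDegree.get(d) ?? 0) - 1;
        inDegree.set(d, remaining);
        if (remaining === 0) next.push(d);
      }
    }
    current = next;
  }
  // Any field left unplaced sits on a dependency cycle.
  if (placed !== fields.length) throw new Error("Dependency cycle detected");
  return layers;
}

const exampleLayers = sortIntoLayers({
  name: [],
  industry: ["name"],
  revenue: ["name"],
  summary: ["industry", "revenue"],
});
// exampleLayers is [["name"], ["industry", "revenue"], ["summary"]]
```

A runner would then walk the layers in order and await all fields of one layer together (for example with `Promise.all`) before starting the next.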
### Multi-Agent Consensus
When a field has `consensus` config:
```typescript
consensus: {
  agentSlugs: ["analyst", "researcher"],
  strategy: "majority"
}
```
The `extractFieldWithConsensus()` function runs all specified agents in parallel on the same field. The strategy determines how to pick the winner:
- **majority** -- The most common value wins. `compareExtractionValues()` handles fuzzy comparison for strings, numbers, and arrays.
- **unanimous** -- All agents must agree. If they disagree, the extraction fails.
- **best-confidence** -- The result with the highest confidence level wins. Ties broken by agent order.
The winning result is stored with `pending` status; losing results are stored as `superseded`. The agreement ratio is tracked in metadata for UI display.
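To make the three strategies concrete, here is a minimal sketch of winner selection. It is not the body of `extractFieldWithConsensus()`: the `Candidate` and `ConsensusOutcome` types, the `pickWinner` name, and the `sameValue` comparison (a simplified stand-in for `compareExtractionValues()`) are assumptions made for illustration.

```typescript
type Confidence = "high" | "medium" | "low";
type Strategy = "majority" | "unanimous" | "best-confidence";

interface Candidate {
  agentSlug: string;
  value: unknown;
  confidence: Confidence;
}

interface ConsensusOutcome {
  winner: Candidate | null; // null when the strategy cannot pick a value
  agreement: number; // share of agents whose value matches the reported one
}

const CONFIDENCE_RANK: Record<Confidence, number> = { high: 3, medium: 2, low: 1 };

// Simplified comparison: case-insensitive trimmed strings, JSON equality otherwise.
function sameValue(a: unknown, b: unknown): boolean {
  if (typeof a === "string" && typeof b === "string") {
    return a.trim().toLowerCase() === b.trim().toLowerCase();
  }
  return JSON.stringify(a) === JSON.stringify(b);
}

// `candidates` is expected in agent order, which is what breaks ties.
function pickWinner(candidates: Candidate[], strategy: Strategy): ConsensusOutcome {
  if (candidates.length === 0) return { winner: null, agreement: 0 };

  // Fraction of candidates that agree with the given candidate's value.
  const share = (c: Candidate): number =>
    candidates.filter((o) => sameValue(o.value, c.value)).length / candidates.length;

  // Strict ">" keeps the earliest agent when two candidates tie.
  const mostCommon = candidates.reduce((best, c) => (share(c) > share(best) ? c : best));
  const mostConfident = candidates.reduce((best, c) =>
    CONFIDENCE_RANK[c.confidence] > CONFIDENCE_RANK[best.confidence] ? c : best,
  );

  switch (strategy) {
    case "majority":
      return { winner: mostCommon, agreement: share(mostCommon) };
    case "unanimous":
      // Fails (no winner) unless every agent returned an equivalent value.
      return {
        winner: share(mostCommon) === 1 ? mostCommon : null,
        agreement: share(mostCommon),
      };
    case "best-confidence":
      return { winner: mostConfident, agreement: share(mostConfident) };
  }
}
```

Note that "majority" is implemented here as "most common value", matching the description above, so a value can win without more than half of the agents backing it.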
### Approval Workflow
Extracted values start in `pending` status and can be:
- **Approved** -- `approveResult(resultId, userId)` marks the extraction audit row approved and, when the result is linked to an `entity_response`, promotes the canonical response field into entity content through the response system.
- **Rejected** -- `rejectResult(resultId, userId, reason?)` marks the extraction audit row rejected and, when linked, rejects the canonical `entity_response` with the same review note. If a reason is provided, it also triggers the feedback rerun loop.
The important architectural point is that `extraction_results` is now a compatibility/audit surface. Review decisions are synchronized to the response runtime so extraction review cannot drift from the actual promoted/rejected response state.
### Feedback Rerun Loop
When a result is rejected with a reason:
1. The `extraction/result-rejected` Inngest event fires.
2. The `feedbackRerun` Inngest function picks up the event.
3. It re-extracts the field with `additionalInstructions` built from the template: "The previous value `{rejectedValue}` was rejected because: `{reason}`. Please try again with this feedback."
4. The new result goes through the same approval workflow.
Over multiple iterations, this creates a refinement loop where human feedback progressively improves extraction quality.
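As a small illustration of step 3, the instruction string could be assembled as below. The helper name `buildRerunInstructions`, the JSON rendering of non-string values, and the trimming of trailing punctuation are assumptions; only the sentence template comes from the description above.

```typescript
// Builds the `additionalInstructions` text for a feedback rerun.
function buildRerunInstructions(rejectedValue: unknown, reason: string): string {
  // Assumption: non-string values are shown as JSON so numbers and lists stay readable.
  const shown =
    typeof rejectedValue === "string" ? rejectedValue : JSON.stringify(rejectedValue);
  // Drop trailing periods/whitespace so the template does not end in "..".
  const cleanReason = reason.trim().replace(/[.\s]+$/, "");
  return `The previous value ${shown} was rejected because: ${cleanReason}. Please try again with this feedback.`;
}
```

Because the feedback travels as plain prompt text, the same helper works regardless of which agent or model performs the rerun.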
### Cross-Entity Feedback
The `getFieldFeedbackSummary(entityTypeSlug, fieldName)` function aggregates feedback across all entities of a given type:
- **Approval rate** -- Percentage of extractions approved vs rejected.
- **Common rejection reasons** -- Aggregated from `rejection_reason` fields.
- **Confidence calibration** -- Whether high-confidence extractions are actually approved more often.
When 5 or more extractions exist for a field, this feedback summary is injected into the extraction prompt, giving agents historical context about what reviewers accept and reject.
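The aggregation described above can be sketched as follows. This is not the implementation of `getFieldFeedbackSummary()`: the `ReviewedResult` and `FeedbackSummary` shapes, the `summarizeFeedback` name, the top-three cut-off for reasons, and the choice to count only reviewed results toward the threshold of 5 are assumptions.

```typescript
type Confidence = "high" | "medium" | "low";

interface ReviewedResult {
  status: "approved" | "rejected";
  confidence: Confidence;
  rejectionReason?: string;
}

interface FeedbackSummary {
  total: number;
  approvalRate: number; // 0..1
  commonRejectionReasons: string[]; // most frequent first
  approvalRateByConfidence: Record<Confidence, number | null>; // null = no data
}

const MIN_RESULTS_FOR_FEEDBACK = 5;

// Returns null below the threshold, signalling "do not inject feedback yet".
function summarizeFeedback(results: ReviewedResult[]): FeedbackSummary | null {
  if (results.length < MIN_RESULTS_FOR_FEEDBACK) return null;

  const rate = (rows: ReviewedResult[]): number | null =>
    rows.length === 0
      ? null
      : rows.filter((r) => r.status === "approved").length / rows.length;

  // Count identical rejection reasons.
  const reasonCounts = new Map<string, number>();
  for (const r of results) {
    const reason = r.rejectionReason?.trim();
    if (r.status === "rejected" && reason) {
      reasonCounts.set(reason, (reasonCounts.get(reason) ?? 0) + 1);
    }
  }
  const commonRejectionReasons = Array.from(reasonCounts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, 3)
    .map(([reason]) => reason);

  // Confidence calibration: approval rate within each confidence bucket.
  const byConfidence = (level: Confidence): number | null =>
    rate(results.filter((r) => r.confidence === level));

  return {
    total: results.length,
    approvalRate: rate(results) ?? 0,
    commonRejectionReasons,
    approvalRateByConfidence: {
      high: byConfidence("high"),
      medium: byConfidence("medium"),
      low: byConfidence("low"),
    },
  };
}
```

A well-calibrated field shows a higher approval rate in the `high` bucket than in `low`; if the buckets look alike, the confidence label carries little signal for reviewers.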
### Extraction Cascade
Connection fields with `cascadeExtraction` config trigger automatic sub-entity creation:
```typescript
connection: {
  entityTypeSlug: "agent-task",
  cascadeExtraction: {
    enabled: true,
    titleField: "name", // Which property becomes entity title
    autoExtract: true // Trigger extraction on created entities
  }
}
```
When extraction returns a list value for such a field:
1. `parseCascadeItems()` extracts individual items from the list (strings or objects).
2. `deduplicateAgainstExisting()` checks for existing entities with matching titles (case-insensitive).
3. `processCascadeCreation()` creates new entities of the target type, links them via `entity_relations` and `parent_id`.
4. If `autoExtract: true`, each new entity is created through the normal entity lifecycle path, which emits `entity/created` and triggers its extraction workflow.
5. Created entity IDs are stored on the extraction result's `created_entity_ids` for traceability.
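Steps 1 and 2 above can be sketched with two small helpers. These are hypothetical stand-ins for `parseCascadeItems()` and `deduplicateAgainstExisting()`, whose real signatures are not shown here; the names `parseTitles` and `dropExistingTitles` and the decision to skip items without a usable title are assumptions.

```typescript
// Pull a title out of each list item: plain strings are used as-is,
// objects contribute the property named by `titleField`.
function parseTitles(value: unknown, titleField: string): string[] {
  if (!Array.isArray(value)) return [];
  const titles: string[] = [];
  for (const item of value) {
    if (typeof item === "string") {
      if (item.trim()) titles.push(item.trim());
    } else if (item !== null && typeof item === "object") {
      const title = (item as Record<string, unknown>)[titleField];
      if (typeof title === "string" && title.trim()) titles.push(title.trim());
    }
  }
  return titles;
}

// Drop titles that already exist (case-insensitive), including repeats
// within the extracted list itself.
function dropExistingTitles(titles: string[], existingTitles: string[]): string[] {
  const seen = new Set(existingTitles.map((t) => t.trim().toLowerCase()));
  const fresh: string[] = [];
  for (const title of titles) {
    const key = title.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    fresh.push(title);
  }
  return fresh;
}
```

Only the titles that survive both helpers would go on to entity creation, which keeps repeated extraction runs from creating duplicate sub-entities.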
### Value Locking
Users can lock individual field values to prevent extraction from overwriting them. Locked fields are stored in `entity.metadata.lockedFields` as an array of field names. The workflow runtime filters these fields out before seeding node runs, so they never enter the queue.
## API Reference
### Extraction Server Actions (`features/entities/extraction/server/actions.ts`)
| Function | Signature | Description |
| -------------------------- | --------------------------------------------------------- | -------------------------------------------------------------- |
| `createExtractionRun` | `(input) => Promise<ExtractionRun>` | Creates a new extraction run record. |
| `completeExtractionRun` | `(runId, status, durationMs?) => Promise<void>` | Finalizes a run with status and timing. |
| `createExtractionResult` | `(input) => Promise<ExtractionResult>` | Persists a single field extraction result. |
| `getExtractionResults` | `(entityId) => Promise<ExtractionResult[]>` | Latest non-superseded results per field. |
| `getExtractionRunHistory` | `(entityId, limit?) => Promise<ExtractionRun[]>` | Paginated run history. |
| `getFieldHistory` | `(entityId, fieldName) => Promise<ExtractionResult[]>` | All results for a specific field, including superseded. |
| `approveResult` | `(resultId, userId) => Promise<void>` | Approve a pending result. |
| `rejectResult` | `(resultId, userId, reason?) => Promise<void>` | Reject a result, optionally triggering feedback rerun. |
| `supersedePreviousResults` | `(entityId, fieldName) => Promise<void>` | Mark all previous results for a field as superseded. |
| `getFieldFeedbackSummary` | `(entityTypeSlug, fieldName) => Promise<FeedbackSummary>` | Aggregate approval rate and rejection reasons across entities. |
### API Routes
| Route | Method | Description |
| ------------------------------------ | ------ | -------------------------------------------------------------------------------------------------------------------- |
| `/api/extraction/[entityId]` | GET | Latest extraction audit rows per field. Compatibility read surface. |
| `/api/workflows/[entityId]` | GET | Canonical workflow status endpoint. Supports session auth or API key (`entities:read` or `extraction:trigger`). |
| `/api/workflows/[entityId]` | POST | Canonical workflow trigger endpoint. Supports session auth or API key (`extraction:trigger`). |
| `/api/workflows/[entityId]/history` | GET | Canonical workflow-run history endpoint. Supports session auth or API key (`entities:read` or `extraction:trigger`). |
| `/api/extraction/[entityId]/submit` | POST | External agent submission (API key auth, `extraction:submit` scope). |
| `/api/extraction/results/[resultId]` | PATCH | Approve or reject an extraction audit row while synchronizing linked response state. |
### External Agent API
Any external agent (Claude Code, Copilot, OpenClaw, custom scripts) can submit extraction results:
```
POST /api/extraction/[entityId]/submit
Authorization: Bearer sk_*
Content-Type: application/json

{
  "fieldName": "revenue",
  "value": 1500000,
  "sources": [{ "type": "url", "url": "https://..." }],
  "confidence": "high",
  "agentId": "external-researcher"
}
```
Results go through the same approval workflow as internal extractions. Requires an API key with `extraction:submit` scope.
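For a script-based agent, the request above could be assembled as in the sketch below. The `SubmitPayload` type mirrors the example body; the helper name `buildSubmitRequest`, the base URL, and the key value in the usage note are hypothetical, and the response format is not documented here, so the sketch stops at building the request.

```typescript
interface SubmitPayload {
  fieldName: string;
  value: unknown;
  sources: Array<{ type: string; url?: string }>;
  confidence: "high" | "medium" | "low";
  agentId: string;
}

interface SubmitRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds the URL and fetch options for POST /api/extraction/[entityId]/submit.
function buildSubmitRequest(
  baseUrl: string,
  entityId: string,
  apiKey: string, // must carry the `extraction:submit` scope
  payload: SubmitPayload,
): SubmitRequest {
  const root = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${root}/api/extraction/${encodeURIComponent(entityId)}/submit`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    },
  };
}
```

In a runtime with a global `fetch`, the result can be passed straight through, for example `const { url, init } = buildSubmitRequest("https://app.example.com", entityId, apiKey, payload); await fetch(url, init);`, where the host is a placeholder.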
## For Agents
Agents participate in extraction in two roles:
**As extractors** -- Agents are invoked by the extraction pipeline to populate fields. They receive targeted instructions, access to source-specific tools, and entity context. They must call `submitValue` to return their result.
**As orchestrators** -- Via chat tools, agents can trigger extractions (`triggerWorkflow`), inspect extraction status (`getWorkflowStatus`), and retry failed extractions (`retryNode`).
Extraction-specific tools available during field extraction:
- `searchLinkedDocuments` -- Hybrid search across connected documents. Auto-captures document source refs.
- `searchConnectedEntities` -- Search entities linked via relations. Auto-captures entity source refs.
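- `getEntityField` -- Read a specific field from the current or a connected entity. Auto-captures field source refs.
- `webSearch` -- Web search via the Exa API. Enabled only when the field's `sources` include `web`.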
- `submitValue` -- Final call that returns the extracted value, sources, and confidence. Required to complete extraction.
## Design Decisions
**Per-field agent loops instead of per-entity.** Running a separate agent loop per field provides better control (each field gets targeted instructions and tools), better parallelism (independent fields run concurrently), and better traceability (each result is tracked individually). The tradeoff is more agent invocations per entity, a cost accepted in exchange for that control and traceability.
**Dependency sorting via Kahn's algorithm.** Topological sort ensures correct execution order. The layer-based output naturally enables parallel execution within each layer.
**Supersede-on-rerun semantics.** When a field is re-extracted, previous results are superseded rather than deleted. This preserves the full extraction history for auditing and allows comparing old vs new values.
**Feedback injection over fine-tuning.** Instead of fine-tuning models on feedback data, rejection reasons are injected as additional instructions on the next extraction attempt. This is immediate, transparent, and works with any model.
**Cross-entity feedback at 5+ extractions.** The threshold prevents noisy feedback from a single extraction run. With 5+ data points, approval rates and rejection patterns become meaningful signals.
**Value locking as metadata.** Storing locked fields in `entity.metadata.lockedFields` rather than a separate table keeps the locking check local to the entity and avoids additional queries during extraction.
## Related Modules
- **Workflow Engine** (`features/workflows/`) -- Extraction runs behind workflow field `agent_task` nodes, with approval-gated steps parked in `waiting_human`.
- **Document Processing** (`features/documents/`) -- Documents are a primary extraction source via `searchLinkedDocuments`.
- **Agent System** (`features/agents/`) -- Agents execute extractions; agent config determines tool availability.
- **Entity System** (`features/entities/`) -- Entity types, field configs, and content storage.
- **Inngest Functions** (`features/inngest/functions/`) -- `entity-extraction.ts` triggers workflows; `feedback-rerun.ts` handles rejection reruns; `task-completion-cascade.ts` auto-completes parent goals.
- **Shared Context** (`features/context/`) -- Tenant-level corrections injected into extraction prompts.