Document Processing
End-to-end document lifecycle from upload through parsing, chunking, and embedding, with hybrid search, entity linking, signed URL access, and integration as extraction sources.
## Overview
The document processing system manages the full lifecycle of uploaded files -- from initial upload through text extraction, chunking, embedding generation, and search indexing. Documents become first-class data sources for field-population sessions, enabling agents to search document content when populating entity fields.
Documents are dual-represented in the system: a `documents` table row tracks the file and its processing state, while an optional document entity in the entity graph provides the document with tags, relations, and the full entity lifecycle. This dual representation bridges the file storage domain with the entity graph domain.
The processing pipeline runs asynchronously via Inngest, progressing through clearly defined stages. Once processing completes, document chunks are searchable via a hybrid vector+text search system that combines semantic similarity with keyword matching.
## Key Concepts
### Document Record
```typescript
interface DocumentRecord {
  id: string;
  tenant_id: string;
  entity_id: string | null; // Entity this doc was uploaded from
  document_entity_id: string | null; // Document's own entity in the graph
  title: string | null;
  file_name: string | null;
  file_path: string | null; // Path in Supabase Storage
  file_size: number | null;
  file_url?: string | null;
  mime_type: string | null;
  status: DocumentStatus | null; // High-level: uploaded, processing, ready, error
  processing_stage: string | null; // Granular pipeline stage
  page_count: number | null;
  chunk_count: number | null;
  error_message: string | null;
  processed_at: string | null;
  metadata: Record<string, any> | null;
  uploaded_by: string | null;
  created_at: string;
}
```
### Processing Stages
The `DocumentProcessingStage` union type tracks granular pipeline progress:
```typescript
type DocumentProcessingStage =
  | "uploaded" // File stored, processing not started
  | "parsing" // Text extraction in progress
  | "parsed" // Text extracted, pages stored
  | "chunking" // Breaking text into chunks
  | "chunked" // Chunks stored
  | "embedding" // Generating vector embeddings
  | "embedded" // Embeddings stored
  | "ready" // Fully processed, searchable
  | "error"; // Processing failed
```
The high-level `DocumentStatus` provides a simplified view:
```typescript
type DocumentStatus = "uploaded" | "processing" | "ready" | "error";
```
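The relationship between the two types can be sketched as a small mapping function. This is a minimal illustration, not the project's code: `toDocumentStatus` is a hypothetical helper, and it assumes every in-flight stage (from `parsing` through `embedded`) collapses to `processing`, which matches how Step 1 of the pipeline marks a document as `processing/parsing`.

```typescript
type DocumentProcessingStage =
  | "uploaded" | "parsing" | "parsed" | "chunking" | "chunked"
  | "embedding" | "embedded" | "ready" | "error";
type DocumentStatus = "uploaded" | "processing" | "ready" | "error";

// Hypothetical helper: collapse a granular stage into the high-level status.
// Assumption: all intermediate stages map to "processing".
function toDocumentStatus(stage: DocumentProcessingStage): DocumentStatus {
  switch (stage) {
    case "uploaded":
    case "ready":
    case "error":
      // These stage names exist in both types, so they pass through unchanged.
      return stage;
    default:
      return "processing";
  }
}
```

A UI badge could call `toDocumentStatus(doc.processing_stage)` when it only needs the coarse state.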
### Generated-Asset Cataloging (source / origin columns)
As of migration `20260612000000_documents_generated_source.sql`, the `documents` table carries two additional columns that cover AI-produced assets:
- **`source`** (`text`, default `'uploaded'`) — `'uploaded'` for user/agent file uploads; `'generated'` for images and video renders produced by AI tools (`generateImage`, `generateVideo`).
- **`origin`** (`jsonb`) — provenance for generated assets: `{ tool, model?, prompt?, renderId?, entityId? }`. Useful for auditing which model and prompt produced a file.
Generated assets skip the normal processing pipeline (no Inngest `document/uploaded` event, no chunking or embedding) — they arrive as `status: 'ready'` immediately. `listDocuments` returns both uploaded and generated items by default; pass `source: 'generated'` or `source: 'uploaded'` to narrow. See [Asset Studio](/docs/features/studio) for the creation surface that produces generated assets.
### Document Chunks
Documents are split into overlapping chunks for search and embedding:
```typescript
interface DocumentChunkRecord {
  id: string;
  document_id: string;
  chunk_index: number;
  page_number: number | null; // Tracks which page this chunk came from
  content: string;
  metadata: Record<string, any>;
  embedding?: number[] | null; // Vector embedding (if generated)
  created_at: string;
}
```
Chunks maintain their page number association, enabling page-level citations when chunks match search queries.
### Document Pages
Parsed pages are stored individually in the `document_pages` table:
```typescript
interface DocumentPageRecord {
  id: string;
  document_id: string;
  page_number: number;
  content: string;
  metadata: Record<string, any>;
  created_at: string;
}
```
Page storage serves two purposes: providing the source material for chunking, and enabling page-by-page document browsing in the UI.
### Search Results
Hybrid search returns chunks with scoring from both vector and text search:
```typescript
interface DocumentChunkMatch {
  id: string;
  document_id: string;
  chunk_index: number;
  page_number: number | null;
  content: string;
  metadata: Record<string, unknown>;
  similarity: number; // Vector cosine similarity (0-1)
  text_rank: number; // Full-text search rank
  combined_score: number; // Weighted combination
}
```
### Parser Interface
Parsers handle specific MIME types and produce a standardized `ParsedDocument`:
```typescript
interface DocumentParser {
  mimeTypes: string[];
  parse(buffer: Buffer, options?: { maxPages?: number }): Promise<ParsedDocument>;
}

interface ParsedDocument {
  pages: ParsedPage[];
  fullText: string;
  metadata: {
    pageCount: number;
    title?: string;
    author?: string;
    [key: string]: unknown;
  };
}
```
The system selects parsers based on MIME type via `getParser(mimeType)`. Files without a matching parser fall back to plain text extraction.
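MIME-based selection can be sketched as a lookup over a parser registry. The sketch below is illustrative only: the `ParsedPage` shape, the `registry` array, and `plainTextParser` are assumptions, `Uint8Array` stands in for `Buffer` to keep the example free of Node typings, and the real `getParser` may behave differently (for example in how it normalizes MIME parameters).

```typescript
// Assumed page shape; the source does not define ParsedPage.
interface ParsedPage { pageNumber: number; content: string }

interface ParsedDocument {
  pages: ParsedPage[];
  fullText: string;
  metadata: { pageCount: number; [key: string]: unknown };
}

interface DocumentParser {
  mimeTypes: string[];
  parse(buffer: Uint8Array, options?: { maxPages?: number }): Promise<ParsedDocument>;
}

// Hypothetical parser that treats the whole file as one UTF-8 page.
const plainTextParser: DocumentParser = {
  mimeTypes: ["text/plain", "text/markdown"],
  async parse(buffer) {
    const text = new TextDecoder("utf-8").decode(buffer);
    return {
      pages: [{ pageNumber: 1, content: text }],
      fullText: text,
      metadata: { pageCount: 1 },
    };
  },
};

// Hypothetical registry; real parsers (PDF, DOCX, ...) would be added here.
const registry: DocumentParser[] = [plainTextParser];

// Return the first parser that declares the MIME type, ignoring parameters
// such as "; charset=utf-8". Returns undefined when nothing matches, which is
// where a caller would apply the plain-text fallback.
function getParser(mimeType: string): DocumentParser | undefined {
  const normalized = mimeType.split(";")[0].trim().toLowerCase();
  return registry.find((parser) => parser.mimeTypes.includes(normalized));
}
```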
## How It Works
### Upload Flow
The `uploadDocument()` server action handles the complete upload process:
1. **Store file** -- Upload the file to Supabase Storage under `{tenantId}/{documentId}/{fileName}`.
2. **Create document entity** -- If the tenant has a "document" entity type, create an entity in the entity graph with the document's metadata (file name, MIME type, size, processing status). This gives the document a presence in the entity graph for tagging, relations, and search.
3. **Link to parent entity** -- If the document was uploaded from an entity's detail page (via `entityId`), create an `entity_relation` linking the parent entity to the document entity with the `document` relationship type.
4. **Create document record** -- Insert a row in the `documents` table with `status: "uploaded"` and `processing_stage: "uploaded"`.
5. **Emit processing event** -- Fire the `document/uploaded` Inngest event to trigger asynchronous processing.
6. **Log activity** -- Create an activity record for the upload.
### Processing Pipeline (Inngest)
The `documentProcessing` Inngest function runs the processing pipeline in six steps, each as a separate Inngest step for resilience and retry:
**Step 1: Fetch Document** -- Load the document record and mark it as `processing/parsing`.
**Step 2: Parse** -- Download the file from Supabase Storage, detect the appropriate parser based on MIME type, and extract text content:
- MIME-specific parsers handle formats like PDF, DOCX, and others.
- Fallback: text-like files without a parser are treated as UTF-8 plain text. A **binary** format (PDF/image) whose structural parse throws is left empty rather than coerced to UTF-8 (which would index mojibake) — it becomes an AI-extraction candidate instead.
- **AI vision extraction (Gemini):** images, and PDFs whose text layer is effectively empty (scanned docs), are routed to `gemini-2.5-flash` to transcribe real text. This is cost-gated — see _AI Vision Extraction_ below.
- Parsed pages are stored in `document_pages` (existing pages deleted first for idempotent reprocessing).
- Document metadata is updated with page count, extracted title, author, and the `ai_extracted` / `needs_ai_extraction` flags.
- Processing stage advances to `parsed`.
**Step 3: Chunk** -- Read pages back from the database and split them into overlapping chunks:
- The `chunkPages()` function creates chunks of ~1000 characters with 200-character overlap.
- Chunks maintain page number tracking, so each chunk knows which page(s) it spans.
- Chunks are stored in `document_chunks` (existing chunks deleted first).
- Processing stage advances to `chunked`.
- If no chunks are produced (empty document), the pipeline short-circuits to `ready`.
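The chunking step can be sketched as a sliding window over the concatenated page text. This is a simplified illustration under stated assumptions, not the actual `chunkPages()` implementation: `chunkPagesSketch` and its input and output shapes are hypothetical, it cuts at fixed character offsets rather than sentence boundaries, and it records only the page on which each chunk starts.

```typescript
interface PageInput { pageNumber: number; content: string }
interface ChunkOutput { chunkIndex: number; pageNumber: number | null; content: string }

function chunkPagesSketch(pages: PageInput[], chunkSize = 1000, overlap = 200): ChunkOutput[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");

  // Concatenate pages, remembering the offset at which each page begins.
  const pageStarts: { offset: number; pageNumber: number }[] = [];
  let text = "";
  for (const page of pages) {
    if (page.content.length === 0) continue;
    pageStarts.push({ offset: text.length, pageNumber: page.pageNumber });
    text += page.content;
  }

  const chunks: ChunkOutput[] = [];
  const step = chunkSize - overlap; // each window starts 800 chars after the last
  for (let start = 0; start < text.length; start += step) {
    // Page of the chunk's first character: the last page starting at or before it.
    let pageNumber: number | null = null;
    for (const entry of pageStarts) {
      if (entry.offset > start) break;
      pageNumber = entry.pageNumber;
    }
    chunks.push({
      chunkIndex: chunks.length,
      pageNumber,
      content: text.slice(start, start + chunkSize),
    });
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

An empty input yields an empty array, which corresponds to the short-circuit to `ready` described above.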
**Step 4: Embed** -- Generate vector embeddings for each chunk:
- Embeddings are generated in batches of 100 via the `generateEmbeddings()` function (uses OpenAI's embedding API).
- If no OpenAI key is configured, this step completes without embeddings -- the system falls back to text-only search.
- Embeddings are upserted into `document_chunks` using batch operations.
- A 200ms delay between batches prevents rate limiting.
- Processing stage advances to `embedded`.
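The batching and pacing described above might look like the following sketch. The `EmbedFn` type, `toBatches`, and `embedInBatches` are hypothetical names; the real `generateEmbeddings()` signature is not shown in this document, so the embedding call is injected as a parameter rather than assumed.

```typescript
// Injected embedding function: takes a batch of texts, returns one vector each.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Embed all texts in batches of 100, pausing 200 ms between batches.
async function embedInBatches(
  texts: string[],
  embed: EmbedFn,
  batchSize = 100,
  delayMs = 200,
): Promise<number[][]> {
  const vectors: number[][] = [];
  const batches = toBatches(texts, batchSize);
  for (let i = 0; i < batches.length; i++) {
    vectors.push(...(await embed(batches[i])));
    // No delay is needed after the final batch.
    if (i < batches.length - 1) await sleep(delayMs);
  }
  return vectors;
}
```

Passing the embedding function in keeps the pacing logic testable with a stub and independent of any particular provider client.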
**Step 5: Finalize** -- Mark the document as `ready`:
- Update `documents.status` to `ready` and `processing_stage` to `ready`.
- Update the document entity's content with page count, chunk count, embedding status, and processing status.
**Step 6: Trigger field population** -- Find entities linked to this document (via direct `entity_id` or `entity_relations` from the document entity) and trigger matching document-ready actions through the actions/sessions runtime. This enables the "upload document, auto-populate fields" flow -- when a document finishes processing, linked entities can re-run field-population work with the new document as a source.
### AI Vision Extraction (Gemini)
Images and scanned PDFs carry little or no machine-readable text, so the parse step can route the raw file to Gemini flash vision (`features/documents/lib/extraction/gemini-vision.ts`) to transcribe its content, reusing the platform AI providers and cost tracking (`COST_SOURCES.extraction`). It is **cost-gated** so it never runs away:
- **Eligibility** -- an image, or a PDF whose average chars-per-page is below `SCANNED_PDF_MIN_CHARS_PER_PAGE` (50).
- **Auto on upload** only when a Google key is configured, the file is under `MAX_INLINE_AI_BYTES` (15 MB), and the page count is known and within `MAX_AUTO_AI_EXTRACTION_PAGES` (20). A parse that produced no page count (0) is **not** auto-extracted — cost can't be bounded — so it defers to the on-demand path.
- **On-demand** -- otherwise the document is flagged `metadata.needs_ai_extraction` and the detail page shows an **"Extract with AI"** button. It re-runs processing with `forceAiExtraction: true` (the existing reprocess endpoint with an optional body), bypassing the page cap.
- **Graceful degradation** -- each Gemini call has a 120 s timeout; a failure or outage never fails the document (it stays flagged for retry). A successful-but-empty result clears the flag so the button can't loop and re-charge.
Requires `GOOGLE_GENERATIVE_AI_API_KEY` (or `GOOGLE_API_KEY`); absent a key the whole feature is skipped and documents fall back to structural text only.
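The upload-time gating rules can be summarized as a pure decision function. This is a sketch under assumptions, not the code in `gemini-vision.ts`: `decideAiExtraction`, `ParseOutcome`, and the decision labels are hypothetical, 15 MB is taken as 15 × 1024 × 1024 bytes, the page cap is applied to images and PDFs alike, and the forced on-demand path (`forceAiExtraction`) is deliberately not modeled.

```typescript
// Thresholds named in the text. The byte value assumes 1 MB = 1024 * 1024 bytes.
const SCANNED_PDF_MIN_CHARS_PER_PAGE = 50;
const MAX_INLINE_AI_BYTES = 15 * 1024 * 1024;
const MAX_AUTO_AI_EXTRACTION_PAGES = 20;

type AiExtractionDecision = "not-eligible" | "skipped" | "auto" | "on-demand";

interface ParseOutcome {
  mimeType: string;
  pageCount: number; // 0 means the structural parse produced no page count
  totalChars: number; // characters extracted by the structural parse
  fileSize: number; // bytes
}

// Eligible: any image, or a PDF whose text layer is effectively empty.
function isAiEligible(doc: ParseOutcome): boolean {
  if (doc.mimeType.startsWith("image/")) return true;
  if (doc.mimeType !== "application/pdf") return false;
  // With no known page count, average over a single page.
  const pages = Math.max(doc.pageCount, 1);
  return doc.totalChars / pages < SCANNED_PDF_MIN_CHARS_PER_PAGE;
}

function decideAiExtraction(doc: ParseOutcome, hasGoogleKey: boolean): AiExtractionDecision {
  if (!isAiEligible(doc)) return "not-eligible";
  if (!hasGoogleKey) return "skipped"; // feature is off without a key
  const pageCountKnown = doc.pageCount > 0;
  const withinCaps =
    doc.fileSize < MAX_INLINE_AI_BYTES &&
    pageCountKnown &&
    doc.pageCount <= MAX_AUTO_AI_EXTRACTION_PAGES;
  // Outside the caps, the document would be flagged needs_ai_extraction instead.
  return withinCaps ? "auto" : "on-demand";
}
```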
### Failure Handling
If the pipeline exhausts its retries, an Inngest `onFailure` handler flips the document to a terminal `status: "error"` with an `error_message` and updates the document entity's `processing_status`. Without this handler, a document whose retries were exhausted would stay stuck in `processing` indefinitely. Use **Reprocess** to retry.
### Concurrency Control
The Inngest function limits concurrency to 3 executions in total, 2 per tenant, and **1 per document**. The global and per-tenant caps prevent resource exhaustion from bulk uploads while keeping throughput reasonable. The per-document cap serializes reprocessing, so a double-clicked Reprocess / Extract-with-AI can't run two jobs that race the `document_pages` delete/insert.
### Hybrid Search
The `searchDocumentsHybrid()` function combines vector similarity with full-text keyword search:
1. **Generate query embedding** -- Convert the search query to a vector using the same embedding model used for chunks. If no embedding API is available, search falls back to text-only.
2. **Execute hybrid search** -- Call the `hybrid_search_document_chunks` PostgreSQL function, which:
   - Runs vector cosine similarity against chunk embeddings.
   - Runs PostgreSQL full-text search against chunk content.
   - Combines scores using Reciprocal Rank Fusion (RRF) with configurable weights (default: 0.7 vector, 0.3 text).
   - Returns the top N matches sorted by combined score.
3. **Filter by scope** -- Results can be filtered by tenant ID and optionally by specific document IDs (for entity-scoped extraction searches).
A legacy vector-only search function (`searchDocumentChunks`) is maintained for backward compatibility but is deprecated in favor of the hybrid approach.
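The fusion step runs inside the `hybrid_search_document_chunks` SQL function, whose body is not shown here. As an illustration of the technique only, weighted Reciprocal Rank Fusion can be sketched in TypeScript as below. `fuseWithRrf` is a hypothetical name, and the smoothing constant `k = 60` is the conventional RRF default, assumed rather than taken from the source.

```typescript
interface FusedHit { id: string; score: number }

// Weighted RRF: each list contributes weight / (k + rank) for every id it
// contains, where rank is the 1-based position in that list.
function fuseWithRrf(
  vectorRanked: string[], // chunk ids, best vector match first
  textRanked: string[], // chunk ids, best full-text match first
  options: { vectorWeight?: number; textWeight?: number; k?: number; limit?: number } = {},
): FusedHit[] {
  const { vectorWeight = 0.7, textWeight = 0.3, k = 60, limit = 10 } = options;
  const scores = new Map<string, number>();

  const accumulate = (ranked: string[], weight: number) => {
    ranked.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  };
  accumulate(vectorRanked, vectorWeight);
  accumulate(textRanked, textWeight);

  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Because only ranks enter the formula, the raw cosine similarity and full-text rank never need to share a scale. A chunk found by both searches accumulates two contributions and tends to rise above chunks found by only one.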
### Document-Entity Linking
Documents can be linked to entities in two ways:
- **Direct link** -- `documents.entity_id` stores a direct reference to the entity the document was uploaded from. This is set at upload time.
- **Entity graph link** -- The document's own entity (`document_entity_id`) is connected to other entities via `entity_relations`. This enables documents to be linked to multiple entities and discovered through graph traversal.
The `linkDocumentToEntity()` and `unlinkDocumentFromEntity()` server actions manage entity relations between a document entity and other entities. The document picker in the UI uses these to create and remove associations.
`POST /api/documents/[id]/links` and `DELETE /api/documents/[id]/links` accept both session auth and API-key auth (`documents:write` scope). This enables external agents that create entities via `POST /api/entities/upsert` to also attach documents to those entities in the same API-key workflow without requiring a browser session.
### Client Data Fetching
The document library, document detail, chunk viewer, picker, link dialog, and linked-list surfaces use React Query hooks for loading and cache invalidation. This keeps document browsing, linking, and deletion behavior consistent across the admin and detail pages while avoiding one-off `useEffect` fetches in each component.
### File Access
Documents are stored in Supabase Storage with private access. The `getDocumentUrl()` function generates signed URLs valid for 1 hour, enabling secure, time-limited downloads without exposing storage credentials.
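A signed-URL helper along these lines could wrap the Supabase Storage client. The sketch is written against a minimal structural interface (`StorageLike`) that mirrors the shape of `createSignedUrl` in supabase-js v2, so it runs without the library. The helper name, the bucket parameter, and the error handling are assumptions, not the actual `getDocumentUrl()` implementation.

```typescript
// Minimal structural view of the storage client; supabase.storage from
// supabase-js v2 is expected to satisfy it.
interface StorageLike {
  from(bucket: string): {
    createSignedUrl(
      path: string,
      expiresIn: number,
    ): Promise<{ data: { signedUrl: string } | null; error: { message: string } | null }>;
  };
}

// One hour, expressed in seconds as createSignedUrl expects.
const SIGNED_URL_TTL_SECONDS = 60 * 60;

// Hypothetical helper: returns a time-limited download URL for a stored file.
async function getSignedDocumentUrl(
  storage: StorageLike,
  bucket: string, // bucket name is deployment-specific and not given in this document
  filePath: string, // documents.file_path
): Promise<string> {
  const { data, error } = await storage
    .from(bucket)
    .createSignedUrl(filePath, SIGNED_URL_TTL_SECONDS);
  if (error || !data) {
    throw new Error(`Could not sign URL: ${error?.message ?? "no data returned"}`);
  }
  return data.signedUrl;
}
```

A caller would pass the document's `file_path`, and the returned URL stops working once the TTL elapses.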
## API Reference
### Server Actions (`features/documents/server/actions.ts`)
| Function | Signature | Description |
|----------|-----------|-------------|
| `uploadDocument` | `(entityId, file, options?, context?) => Promise<DocumentRecord>` | Upload, store, create entity, emit processing event. |
| `getDocumentById` | `(id) => Promise<DocumentRecord \| null>` | Fetch a single document by ID (tenant-scoped). |
| `listDocuments` | `(entityId?) => Promise<DocumentRecord[]>` | List documents, optionally filtered by entity. |
| `listDocumentsPage` | `(params: ListDocumentsParams) => Promise<{documents, total, page, limit, totalPages}>` | Paginated listing with status, MIME type, and text filters. |
| `getDocumentsByEntity` | `(entityId) => Promise<DocumentRecord[]>` | Documents linked via entity relations (graph traversal). |
| `linkDocumentToEntity` | `(documentId, entityId) => Promise<void>` | Create entity relation between document and entity. Session-auth context. |
| `unlinkDocumentFromEntity` | `(documentId, entityId) => Promise<void>` | Remove entity relation. Session-auth context. |
| `linkDocumentToEntityKeyed` | `(tenantId, documentId, entityId) => Promise<void>` | API-key-friendly variant with explicit tenant scoping. |
| `unlinkDocumentFromEntityKeyed` | `(tenantId, documentId, entityId) => Promise<void>` | API-key-friendly variant with explicit tenant scoping. |
| `getDocumentByIdKeyed` | `(tenantId, id) => Promise<DocumentRecord \| null>` | Fetch a document by ID with explicit tenant scoping (used internally by the keyed link/unlink functions). |
| `deleteDocument` | `(id, tenantIdOverride?) => Promise<void>` | Delete file from storage, remove document record (cascades to chunks/pages), delete associated entity. |
| `getDocumentUrl` | `(id, tenantIdOverride?) => Promise<string>` | Generate a 1-hour signed URL for file download. |
| `getDocumentStats` | `() => Promise<{total, byStatus, byMimeType, totalSize}>` | Aggregate statistics for the document library. |
### HTTP Routes
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/api/documents/[id]/links` | Session or API key (`documents:write`) | Link a document to an entity. Body: `{ entityId: string }`. |
| `DELETE` | `/api/documents/[id]/links` | Session or API key (`documents:write`) | Unlink a document from an entity. Query param: `entityId`. |
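For an external agent using API-key auth, the two requests might be built as follows. The builders are hypothetical, and the `Authorization: Bearer` header is an assumption: this document does not state how the API key is transmitted, so check the platform's API-key documentation for the actual header.

```typescript
interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body?: string };
}

// POST /api/documents/[id]/links with the entity id in the JSON body.
function buildLinkRequest(
  baseUrl: string,
  documentId: string,
  entityId: string,
  apiKey: string,
): PreparedRequest {
  return {
    url: `${baseUrl}/api/documents/${encodeURIComponent(documentId)}/links`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`, // assumed header format
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ entityId }),
    },
  };
}

// DELETE /api/documents/[id]/links with the entity id as a query parameter.
function buildUnlinkRequest(
  baseUrl: string,
  documentId: string,
  entityId: string,
  apiKey: string,
): PreparedRequest {
  const query = `entityId=${encodeURIComponent(entityId)}`;
  return {
    url: `${baseUrl}/api/documents/${encodeURIComponent(documentId)}/links?${query}`,
    init: {
      method: "DELETE",
      headers: { Authorization: `Bearer ${apiKey}` }, // assumed header format
    },
  };
}
```

Each result can be passed straight to `fetch(request.url, request.init)`.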
### Search (`features/documents/server/search.ts`)
| Function | Signature | Description |
|----------|-----------|-------------|
| `searchDocumentsHybrid` | `(query, { tenantId, documentIds?, limit?, vectorWeight?, textWeight? }) => Promise<DocumentChunkMatch[]>` | Hybrid vector + text search with configurable weights. |
| `searchDocumentChunks` | `(query, { tenantId, entityId?, limit? }) => Promise<DocumentChunkMatch[]>` | Legacy vector-only search (deprecated). |
### Listing Parameters
```typescript
interface ListDocumentsParams {
  entityId?: string;
  status?: "uploaded" | "processing" | "ready" | "error";
  mimeType?: string; // Partial match (ilike)
  q?: string; // Search title and file name
  page?: number; // Default: 1
  limit?: number; // Default: 25
  sort?: "created_at" | "title" | "file_size" | "status";
  order?: "asc" | "desc";
}
```
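The `page` and `limit` defaults translate into a row window and a `totalPages` value. The helper below is a hypothetical sketch of that arithmetic; the inclusive `from`/`to` pair is chosen to suit a range-style query and is an assumption about how `listDocumentsPage` fetches rows.

```typescript
interface PageWindow {
  from: number; // zero-based index of the first row on the page
  to: number; // zero-based index of the last row on the page (inclusive)
  totalPages: number;
}

// Compute the row window for a page, using the documented defaults.
function pageWindow(total: number, page = 1, limit = 25): PageWindow {
  const safePage = Math.max(1, Math.floor(page));
  const safeLimit = Math.max(1, Math.floor(limit));
  const from = (safePage - 1) * safeLimit;
  return {
    from,
    to: from + safeLimit - 1,
    totalPages: Math.ceil(total / safeLimit),
  };
}
```

For example, 60 documents at the default limit span three pages, and page 2 covers rows 25 through 49.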
## For Agents
Agents interact with documents primarily through field-population tools:
- **`searchLinkedDocuments`** -- Available during field extraction when `linked-documents` or `all-documents` is in the field's sources config. Performs hybrid search across document chunks and auto-captures document source refs for provenance tracking.
Documents processed by the pipeline become searchable field-population sources. When an entity has linked documents and a field's source config includes `linked-documents`, the agent can search those documents for field values with full page-level citation tracking.
The document-to-field-population pipeline also works automatically: when a document finishes processing, the system fires population events for all linked entities, enabling a "drop a document, watch fields populate" workflow.
### Create records from a spreadsheet document
Spreadsheet and CSV documents can seed the record-import wizard directly from the document detail page ("Create records"). Because the spreadsheet parser stores each worksheet as CSV text in `document_pages.content`, `getSpreadsheetDocumentCsv(documentId)` (in `features/documents/server/document-to-csv.ts`) returns the first sheet's CSV without re-parsing. The wizard (`ImportDialog`, seeded via its `openWithCsv` handle) then runs the standard column→schema mapping and create path (`/api/entities/import` → `parseCsv` → `batchCreateEntities`), producing tenant-scoped, activity-logged records. Only tabular MIME types qualify (`isTabularMimeType` in `features/documents/lib/mime.ts`); PDF/DOCX free-text extraction is out of scope.
## Design Decisions
**Dual representation (document record + entity).** The `documents` table handles file storage concerns (path, size, MIME type, processing status) while the document entity in the entity graph handles information architecture concerns (tags, relations, searchability). This separation lets the document participate in the entity graph without overloading the documents table with entity-specific fields.
**Page-aware chunking.** Chunks maintain their page number association. This enables page-level citations in extraction source refs -- when an agent finds a value in a document, the source ref can point to the specific page, not just the document.
**Hybrid search with RRF.** Pure vector search misses keyword matches; pure text search misses semantic similarity. Reciprocal Rank Fusion combines both signals without requiring careful score normalization. The default 0.7/0.3 weighting favors semantic similarity while preserving keyword precision.
**Graceful embedding degradation.** If no OpenAI key is configured, the system works without embeddings -- hybrid search falls back to text-only via full-text search. This ensures the document system is usable even in environments without embedding API access.
**Inngest step isolation.** Each processing stage (parse, chunk, embed, finalize) is a separate Inngest step. If embedding fails, the document still has its chunks and pages from successful earlier steps. Retrying only re-runs the failed step, not the entire pipeline.
**Auto-population on document ready.** The final processing step triggers document-ready actions for linked entities. This closes the "upload a document, fields auto-populate" loop without requiring manual intervention.
**Dual-auth on document linking routes.** The document linking API (`POST/DELETE /api/documents/[id]/links`) checks for an API key first; if none is present it falls through to session auth. This lets the same endpoint serve both browser-driven UI actions and external agent pipelines that pair document uploads with entity upserts via `POST /api/entities/upsert`. The `*Keyed` server action variants take an explicit `tenantId` so the link/unlink logic can run without a session cookie (which `getActiveTenantId()` would normally read from).
## Related Modules
- **Field Population** (`features/actions/`, `features/sessions/`, `features/responses/`) -- Documents are a primary source. The `searchLinkedDocuments` tool queries document chunks during field-population sessions.
- **Entity System** (`features/entities/`) -- Document entities participate in the entity graph with relations, tags, and search.
- **Inngest Functions** (`features/inngest/functions/`) -- `document-processing.ts` runs the async pipeline; action/session work is triggered on completion.
- **Block System** (`features/blocks/`) -- Document-related blocks can render document previews and status in entity detail views.