Document Processing
End-to-end document lifecycle from upload through parsing, chunking, and embedding, with hybrid search, entity linking, signed URL access, and integration as extraction sources.
Overview
The document processing system manages the full lifecycle of uploaded files -- from initial upload through text extraction, chunking, embedding generation, and search indexing. Documents become first-class data sources for field-population sessions, enabling agents to search document content when populating entity fields.
Documents are dual-represented in the system: a documents table row tracks the file and its processing state, while an optional document entity in the entity graph provides the document with tags, relations, and the full entity lifecycle. This dual representation bridges the file storage domain with the entity graph domain.
The processing pipeline runs asynchronously via Inngest, progressing through clearly defined stages. Once processing completes, document chunks are searchable via a hybrid vector+text search system that combines semantic similarity with keyword matching.
Key Concepts
Document Record
```typescript
interface DocumentRecord {
  id: string;
  tenant_id: string;
  entity_id: string | null;          // Entity this doc was uploaded from
  document_entity_id: string | null; // Document's own entity in the graph
  title: string | null;
  file_name: string | null;
  file_path: string | null;          // Path in Supabase Storage
  file_size: number | null;
  file_url?: string | null;
  mime_type: string | null;
  status: DocumentStatus | null;     // High-level: uploaded, processing, ready, error
  processing_stage: string | null;   // Granular pipeline stage
  page_count: number | null;
  chunk_count: number | null;
  error_message: string | null;
  processed_at: string | null;
  metadata: Record<string, any> | null;
  uploaded_by: string | null;
  created_at: string;
}
```
Processing Stages
The DocumentProcessingStage enum tracks granular pipeline progress:
```typescript
type DocumentProcessingStage =
  | "uploaded"  // File stored, processing not started
  | "parsing"   // Text extraction in progress
  | "parsed"    // Text extracted, pages stored
  | "chunking"  // Breaking text into chunks
  | "chunked"   // Chunks stored
  | "embedding" // Generating vector embeddings
  | "embedded"  // Embeddings stored
  | "ready"     // Fully processed, searchable
  | "error";    // Processing failed
```
The high-level DocumentStatus provides a simplified view:
```typescript
type DocumentStatus = "uploaded" | "processing" | "ready" | "error";
```
Document Chunks
Documents are split into overlapping chunks for search and embedding:
```typescript
interface DocumentChunkRecord {
  id: string;
  document_id: string;
  chunk_index: number;
  page_number: number | null; // Tracks which page this chunk came from
  content: string;
  metadata: Record<string, any>;
  embedding?: number[] | null; // Vector embedding (if generated)
  created_at: string;
}
```
Chunks maintain their page number association, enabling page-level citations when chunks match search queries.
Document Pages
Parsed pages are stored individually in the document_pages table:
```typescript
interface DocumentPageRecord {
  id: string;
  document_id: string;
  page_number: number;
  content: string;
  metadata: Record<string, any>;
  created_at: string;
}
```
Page storage serves two purposes: providing the source material for chunking, and enabling page-by-page document browsing in the UI.
Search Results
Hybrid search returns chunks with scoring from both vector and text search:
```typescript
interface DocumentChunkMatch {
  id: string;
  document_id: string;
  chunk_index: number;
  page_number: number | null;
  content: string;
  metadata: Record<string, unknown>;
  similarity: number;     // Vector cosine similarity (0-1)
  text_rank: number;      // Full-text search rank
  combined_score: number; // Weighted combination
}
```
Parser Interface
Parsers handle specific MIME types and produce a standardized ParsedDocument:
```typescript
interface DocumentParser {
  mimeTypes: string[];
  parse(buffer: Buffer, options?: { maxPages?: number }): Promise<ParsedDocument>;
}

interface ParsedDocument {
  pages: ParsedPage[];
  fullText: string;
  metadata: {
    pageCount: number;
    title?: string;
    author?: string;
    [key: string]: unknown;
  };
}
```
The system selects parsers based on MIME type via `getParser(mimeType)`. Files without a matching parser fall back to plain text extraction.
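A minimal sketch of this registry-with-fallback pattern (the interfaces are restated for self-containment; `ParsedPage`'s shape and the registry internals are assumptions, not the actual implementation):

```typescript
// Hypothetical sketch of MIME-based parser selection with a plain-text fallback.
interface ParsedPage { pageNumber: number; content: string }
interface ParsedDocument {
  pages: ParsedPage[];
  fullText: string;
  metadata: { pageCount: number; title?: string; [key: string]: unknown };
}
interface DocumentParser {
  mimeTypes: string[];
  parse(buffer: Buffer, options?: { maxPages?: number }): Promise<ParsedDocument>;
}

// Fallback parser: treat the file as UTF-8 plain text, one "page" total.
const plainTextParser: DocumentParser = {
  mimeTypes: ["text/plain"],
  async parse(buffer) {
    const fullText = buffer.toString("utf-8");
    return {
      pages: [{ pageNumber: 1, content: fullText }],
      fullText,
      metadata: { pageCount: 1 },
    };
  },
};

// Real registry would also hold PDF, DOCX, etc. parsers.
const registry: DocumentParser[] = [plainTextParser];

function getParser(mimeType: string): DocumentParser {
  // Unknown MIME types fall through to plain-text extraction.
  return registry.find((p) => p.mimeTypes.includes(mimeType)) ?? plainTextParser;
}
```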
How It Works
Upload Flow
The uploadDocument() server action handles the complete upload process:
1. Store file -- Upload the file to Supabase Storage under `{tenantId}/{documentId}/{fileName}`.
2. Create document entity -- If the tenant has a "document" entity type, create an entity in the entity graph with the document's metadata (file name, MIME type, size, processing status). This gives the document a presence in the entity graph for tagging, relations, and search.
3. Link to parent entity -- If the document was uploaded from an entity's detail page (via `entityId`), create an `entity_relation` linking the parent entity to the document entity with the `document` relationship type.
4. Create document record -- Insert a row in the `documents` table with `status: "uploaded"` and `processing_stage: "uploaded"`.
5. Emit processing event -- Fire the `document/uploaded` Inngest event to trigger asynchronous processing.
6. Log activity -- Create an activity record for the upload.
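The storage path from step 1 and the event payload from step 5 can be sketched as follows (helper names and the exact payload shape are illustrative assumptions):

```typescript
// Hypothetical helpers for the upload flow.

// Files are stored under {tenantId}/{documentId}/{fileName} in Supabase Storage.
function storagePath(tenantId: string, documentId: string, fileName: string): string {
  return `${tenantId}/${documentId}/${fileName}`;
}

// The event emitted in step 5; a tenant/document pair is enough for the
// async pipeline to load everything else from the documents table.
function uploadedEvent(tenantId: string, documentId: string) {
  return { name: "document/uploaded", data: { tenantId, documentId } };
}
```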
Processing Pipeline (Inngest)
The documentProcessing Inngest function runs the processing pipeline in six steps, each as a separate Inngest step for resilience and retry:
Step 1: Fetch Document -- Load the document record and mark it as processing/parsing.
Step 2: Parse -- Download the file from Supabase Storage, detect the appropriate parser based on MIME type, and extract text content:
- MIME-specific parsers handle formats like PDF, DOCX, and others.
- Fallback: files without a parser are treated as UTF-8 plain text.
- Parsed pages are stored in `document_pages` (existing pages deleted first for idempotent reprocessing).
- Document metadata is updated with page count, extracted title, and author.
- Processing stage advances to `parsed`.
Step 3: Chunk -- Read pages back from the database and split them into overlapping chunks:
- The `chunkPages()` function creates chunks of ~1000 characters with 200-character overlap.
- Chunks maintain page number tracking, so each chunk knows which page(s) it spans.
- Chunks are stored in `document_chunks` (existing chunks deleted first).
- Processing stage advances to `chunked`.
- If no chunks are produced (empty document), the pipeline short-circuits to `ready`.
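A simplified sketch of page-aware overlapping chunking (the sizes match the pipeline's ~1000/200 defaults; this version chunks within page boundaries, whereas the real `chunkPages()` can span pages):

```typescript
interface PageInput { pageNumber: number; content: string }
interface Chunk { chunkIndex: number; pageNumber: number; content: string }

// Split each page into ~`size`-character chunks, stepping forward by
// (size - overlap) so consecutive chunks share `overlap` characters.
function chunkPages(pages: PageInput[], size = 1000, overlap = 200): Chunk[] {
  const chunks: Chunk[] = [];
  for (const page of pages) {
    let start = 0;
    while (start < page.content.length) {
      chunks.push({
        chunkIndex: chunks.length,
        pageNumber: page.pageNumber, // preserved for page-level citations
        content: page.content.slice(start, start + size),
      });
      if (start + size >= page.content.length) break; // last chunk reached
      start += size - overlap;
    }
  }
  return chunks;
}
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which improves recall for both vector and keyword search.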
Step 4: Embed -- Generate vector embeddings for each chunk:
- Embeddings are generated in batches of 100 via the `generateEmbeddings()` function (uses OpenAI's embedding API).
- If no OpenAI key is configured, this step completes without embeddings -- the system falls back to text-only search.
- Embeddings are upserted into `document_chunks` using batch operations.
- A 200ms delay between batches prevents rate limiting.
- Processing stage advances to `embedded`.
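The batching described above can be sketched as a small helper plus a loop (the helper name and the `generateEmbeddings`/`upsertEmbeddings` call sites in the comment are illustrative):

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Intended use inside the embed step (sketch):
//   for (const batch of toBatches(chunks, 100)) {
//     const vectors = await generateEmbeddings(batch.map((c) => c.content));
//     await upsertEmbeddings(batch, vectors);
//     await new Promise((r) => setTimeout(r, 200)); // 200ms pause between batches
//   }
```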
Step 5: Finalize -- Mark the document as ready:
- Update `documents.status` to `ready` and `processing_stage` to `ready`.
- Update the document entity's content with page count, chunk count, embedding status, and processing status.
Step 6: Trigger field population -- Find entities linked to this document (via direct entity_id or entity_relations from the document entity) and trigger matching document-ready actions through the actions/sessions runtime. This enables the "upload document, auto-populate fields" flow -- when a document finishes processing, linked entities can re-run field-population work with the new document as a source.
Concurrency Control
The Inngest function limits concurrency to 3 total concurrent executions and 2 per tenant. This prevents resource exhaustion from bulk document uploads while ensuring reasonable throughput.
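With Inngest, stacked limits like these can be expressed as an array of concurrency settings, one global and one keyed per tenant. A sketch (assumes an `inngest` client instance in scope; the handler body is elided):

```typescript
const documentProcessing = inngest.createFunction(
  {
    id: "document-processing",
    concurrency: [
      { limit: 3 },                             // total concurrent executions
      { limit: 2, key: "event.data.tenantId" }, // per-tenant cap
    ],
  },
  { event: "document/uploaded" },
  async ({ event, step }) => {
    // step.run("parse", ...), step.run("chunk", ...), etc.
  },
);
```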
Hybrid Search
The searchDocumentsHybrid() function combines vector similarity with full-text keyword search:
1. Generate query embedding -- Convert the search query to a vector using the same embedding model used for chunks. If no embedding API is available, search falls back to text-only.
2. Execute hybrid search -- Call the `hybrid_search_document_chunks` PostgreSQL function, which:
   - Runs vector cosine similarity against chunk embeddings.
   - Runs PostgreSQL full-text search against chunk content.
   - Combines scores using Reciprocal Rank Fusion (RRF) with configurable weights (default: 0.7 vector, 0.3 text).
   - Returns the top N matches sorted by combined score.
3. Filter by scope -- Results can be filtered by tenant ID and optionally by specific document IDs (for entity-scoped extraction searches).
A legacy vector-only search function (searchDocumentChunks) is maintained for backward compatibility but is deprecated in favor of the hybrid approach.
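Weighted RRF over the two ranked lists can be sketched as below. This is an illustration of the technique, not the SQL function's actual implementation; `k = 60` is the conventional RRF constant and an assumption here:

```typescript
// Fuse two ranked lists of chunk ids (best first) into combined scores.
// Each list contributes weight / (k + rank) per id; ids ranked well in
// both lists accumulate the highest combined score.
function rrfScores(
  vectorRanked: string[],
  textRanked: string[],
  vectorWeight = 0.7,
  textWeight = 0.3,
  k = 60,
): Map<string, number> {
  const scores = new Map<string, number>();
  const add = (ids: string[], weight: number) => {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  add(vectorRanked, vectorWeight);
  add(textRanked, textWeight);
  return scores;
}
```

Because RRF operates on ranks rather than raw scores, cosine similarities and full-text ranks never need to be normalized onto a common scale before fusing.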
Document-Entity Linking
Documents can be linked to entities in two ways:
- Direct link -- `documents.entity_id` stores a direct reference to the entity the document was uploaded from. This is set at upload time.
- Entity graph link -- The document's own entity (`document_entity_id`) is connected to other entities via `entity_relations`. This enables documents to be linked to multiple entities and discovered through graph traversal.
The linkDocumentToEntity() and unlinkDocumentFromEntity() server actions manage entity relations between a document entity and other entities. The document picker in the UI uses these to create and remove associations.
POST /api/documents/[id]/links and DELETE /api/documents/[id]/links accept both session auth and API-key auth (documents:write scope). This enables external agents that create entities via POST /api/entities/upsert to also attach documents to those entities in the same API-key workflow without requiring a browser session.
Client Data Fetching
The document library, document detail, chunk viewer, picker, link dialog, and linked-list surfaces now use React Query hooks for loading and cache invalidation. This keeps document browsing, linking, and deletion behavior consistent across the admin and detail pages while avoiding one-off useEffect fetches in each component.
File Access
Documents are stored in Supabase Storage with private access. The getDocumentUrl() function generates signed URLs valid for 1 hour, enabling secure, time-limited downloads without exposing storage credentials.
API Reference
Server Actions (features/documents/server/actions.ts)
| Function | Signature | Description |
|---|---|---|
| `uploadDocument` | `(entityId, file, options?, context?) => Promise<DocumentRecord>` | Upload, store, create entity, emit processing event. |
| `getDocumentById` | `(id) => Promise<DocumentRecord \| null>` | Fetch a single document by ID (tenant-scoped). |
| `listDocuments` | `(entityId?) => Promise<DocumentRecord[]>` | List documents, optionally filtered by entity. |
| `listDocumentsPage` | `(params: ListDocumentsParams) => Promise<{documents, total, page, limit, totalPages}>` | Paginated listing with status, MIME type, and text filters. |
| `getDocumentsByEntity` | `(entityId) => Promise<DocumentRecord[]>` | Documents linked via entity relations (graph traversal). |
| `linkDocumentToEntity` | `(documentId, entityId) => Promise<void>` | Create entity relation between document and entity. Session-auth context. |
| `unlinkDocumentFromEntity` | `(documentId, entityId) => Promise<void>` | Remove entity relation. Session-auth context. |
| `linkDocumentToEntityKeyed` | `(tenantId, documentId, entityId) => Promise<void>` | API-key-friendly variant with explicit tenant scoping. |
| `unlinkDocumentFromEntityKeyed` | `(tenantId, documentId, entityId) => Promise<void>` | API-key-friendly variant with explicit tenant scoping. |
| `getDocumentByIdKeyed` | `(tenantId, id) => Promise<DocumentRecord \| null>` | Fetch a document by ID with explicit tenant scoping (used internally by the keyed link/unlink functions). |
| `deleteDocument` | `(id, tenantIdOverride?) => Promise<void>` | Delete file from storage, remove document record (cascades to chunks/pages), delete associated entity. |
| `getDocumentUrl` | `(id, tenantIdOverride?) => Promise<string>` | Generate a 1-hour signed URL for file download. |
| `getDocumentStats` | `() => Promise<{total, byStatus, byMimeType, totalSize}>` | Aggregate statistics for the document library. |
HTTP Routes
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | `/api/documents/[id]/links` | Session or API key (`documents:write`) | Link a document to an entity. Body: `{ entityId: string }`. |
| DELETE | `/api/documents/[id]/links` | Session or API key (`documents:write`) | Unlink a document from an entity. Query param: `entityId`. |
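For API-key callers, the request shapes can be sketched as builders over these routes (the Bearer authorization scheme is an assumption; the paths, body, and query parameter come from this doc):

```typescript
// Build the POST request to link a document to an entity.
function buildLinkRequest(docId: string, entityId: string, apiKey: string) {
  return {
    url: `/api/documents/${docId}/links`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // key needs documents:write scope
      },
      body: JSON.stringify({ entityId }),
    },
  };
}

// Unlink uses DELETE with entityId passed as a query parameter.
function buildUnlinkRequest(docId: string, entityId: string, apiKey: string) {
  return {
    url: `/api/documents/${docId}/links?entityId=${encodeURIComponent(entityId)}`,
    init: { method: "DELETE", headers: { Authorization: `Bearer ${apiKey}` } },
  };
}
```

Each builder's output can be passed directly to `fetch(url, init)`.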
Search (features/documents/server/search.ts)
| Function | Signature | Description |
|---|---|---|
| `searchDocumentsHybrid` | `(query, { tenantId, documentIds?, limit?, vectorWeight?, textWeight? }) => Promise<DocumentChunkMatch[]>` | Hybrid vector + text search with configurable weights. |
| `searchDocumentChunks` | `(query, { tenantId, entityId?, limit? }) => Promise<DocumentChunkMatch[]>` | Legacy vector-only search (deprecated). |
Listing Parameters
```typescript
interface ListDocumentsParams {
  entityId?: string;
  status?: "uploaded" | "processing" | "ready" | "error";
  mimeType?: string; // Partial match (ilike)
  q?: string;        // Search title and file name
  page?: number;     // Default: 1
  limit?: number;    // Default: 25
  sort?: "created_at" | "title" | "file_size" | "status";
  order?: "asc" | "desc";
}
```
For Agents
Agents interact with documents primarily through field-population tools:
- `searchLinkedDocuments` -- Available during field extraction when `linked-documents` or `all-documents` is in the field's sources config. Performs hybrid search across document chunks and auto-captures document source refs for provenance tracking.
Documents processed by the pipeline become searchable field-population sources. When an entity has linked documents and a field's source config includes linked-documents, the agent can search those documents for field values with full page-level citation tracking.
The document-to-field-population pipeline also works automatically: when a document finishes processing, the system fires population events for all linked entities, enabling a "drop a document, watch fields populate" workflow.
Design Decisions
Dual representation (document record + entity). The documents table handles file storage concerns (path, size, MIME type, processing status) while the document entity in the entity graph handles information architecture concerns (tags, relations, searchability). This separation lets the document participate in the entity graph without overloading the documents table with entity-specific fields.
Page-aware chunking. Chunks maintain their page number association. This enables page-level citations in extraction source refs -- when an agent finds a value in a document, the source ref can point to the specific page, not just the document.
Hybrid search with RRF. Pure vector search misses keyword matches; pure text search misses semantic similarity. Reciprocal Rank Fusion combines both signals without requiring careful score normalization. The default 0.7/0.3 weighting favors semantic similarity while preserving keyword precision.
Graceful embedding degradation. If no OpenAI key is configured, the system works without embeddings -- hybrid search falls back to text-only via full-text search. This ensures the document system is usable even in environments without embedding API access.
Inngest step isolation. Each processing stage (parse, chunk, embed, finalize) is a separate Inngest step. If embedding fails, the document still has its chunks and pages from successful earlier steps. Retrying only re-runs the failed step, not the entire pipeline.
Auto-population on document ready. The final processing step triggers document-ready actions for linked entities. This closes the "upload a document, fields auto-populate" loop without requiring manual intervention.
Dual-auth on document linking routes. The document linking API (POST/DELETE /api/documents/[id]/links) checks for an API key first; if none is present it falls through to session auth. This lets the same endpoint serve both browser-driven UI actions and external agent pipelines that pair document uploads with entity upserts via POST /api/entities/upsert. The *Keyed server action variants take an explicit tenantId so the link/unlink logic can run without a session cookie (which getActiveTenantId() would normally read from).
Related Modules
- Field Population (`features/actions/`, `features/sessions/`, `features/responses/`) -- Documents are a primary source. The `searchLinkedDocuments` tool queries document chunks during field-population sessions.
- Entity System (`features/entities/`) -- Document entities participate in the entity graph with relations, tags, and search.
- Inngest Functions (`features/inngest/functions/`) -- `document-processing.ts` runs the async pipeline; action/session work is triggered on completion.
- Block System (`features/blocks/`) -- Document-related blocks can render document previews and status in entity detail views.