Document Processing

End-to-end document lifecycle from upload through parsing, chunking, and embedding, with hybrid search, entity linking, signed URL access, and integration as extraction sources.

Overview

The document processing system manages the full lifecycle of uploaded files -- from initial upload through text extraction, chunking, embedding generation, and search indexing. Documents become first-class data sources for field-population sessions, enabling agents to search document content when populating entity fields.

Documents are dual-represented in the system: a documents table row tracks the file and its processing state, while an optional document entity in the entity graph provides the document with tags, relations, and the full entity lifecycle. This dual representation bridges the file storage domain with the entity graph domain.

The processing pipeline runs asynchronously via Inngest, progressing through clearly defined stages. Once processing completes, document chunks are searchable via a hybrid vector+text search system that combines semantic similarity with keyword matching.

Key Concepts

Document Record

interface DocumentRecord {
  id: string;
  tenant_id: string;
  entity_id: string | null;         // Entity this doc was uploaded from
  document_entity_id: string | null; // Document's own entity in the graph
  title: string | null;
  file_name: string | null;
  file_path: string | null;         // Path in Supabase Storage
  file_size: number | null;
  file_url?: string | null;
  mime_type: string | null;
  status: DocumentStatus | null;    // High-level: uploaded, processing, ready, error
  processing_stage: string | null;  // Granular pipeline stage
  page_count: number | null;
  chunk_count: number | null;
  error_message: string | null;
  processed_at: string | null;
  metadata: Record<string, any> | null;
  uploaded_by: string | null;
  created_at: string;
}

Processing Stages

The DocumentProcessingStage type (a string-literal union) tracks granular pipeline progress:

type DocumentProcessingStage =
  | "uploaded"   // File stored, processing not started
  | "parsing"    // Text extraction in progress
  | "parsed"     // Text extracted, pages stored
  | "chunking"   // Breaking text into chunks
  | "chunked"    // Chunks stored
  | "embedding"  // Generating vector embeddings
  | "embedded"   // Embeddings stored
  | "ready"      // Fully processed, searchable
  | "error"      // Processing failed

The high-level DocumentStatus provides a simplified view:

type DocumentStatus = "uploaded" | "processing" | "ready" | "error";
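
The relationship between the two types can be illustrated with a small mapping function. This is a sketch under an assumption: the name toDocumentStatus is hypothetical, and the rule that every in-flight stage collapses to "processing" is inferred from the pipeline marking a document as processing/parsing when it starts, not taken from the actual implementation.

```typescript
type DocumentProcessingStage =
  | "uploaded" | "parsing" | "parsed" | "chunking" | "chunked"
  | "embedding" | "embedded" | "ready" | "error";

type DocumentStatus = "uploaded" | "processing" | "ready" | "error";

// Hypothetical helper: collapse a granular stage into the high-level status.
function toDocumentStatus(stage: DocumentProcessingStage): DocumentStatus {
  switch (stage) {
    case "uploaded":
      return "uploaded";
    case "ready":
      return "ready";
    case "error":
      return "error";
    default:
      // parsing, parsed, chunking, chunked, embedding, embedded (assumed mapping)
      return "processing";
  }
}
```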

Document Chunks

Documents are split into overlapping chunks for search and embedding:

interface DocumentChunkRecord {
  id: string;
  document_id: string;
  chunk_index: number;
  page_number: number | null;  // Tracks which page this chunk came from
  content: string;
  metadata: Record<string, any>;
  embedding?: number[] | null; // Vector embedding (if generated)
  created_at: string;
}

Chunks maintain their page number association, enabling page-level citations when chunks match search queries.

Document Pages

Parsed pages are stored individually in the document_pages table:

interface DocumentPageRecord {
  id: string;
  document_id: string;
  page_number: number;
  content: string;
  metadata: Record<string, any>;
  created_at: string;
}

Page storage serves two purposes: providing the source material for chunking, and enabling page-by-page document browsing in the UI.

Search Results

Hybrid search returns chunks with scoring from both vector and text search:

interface DocumentChunkMatch {
  id: string;
  document_id: string;
  chunk_index: number;
  page_number: number | null;
  content: string;
  metadata: Record<string, unknown>;
  similarity: number;      // Vector cosine similarity (0-1)
  text_rank: number;       // Full-text search rank
  combined_score: number;  // Weighted combination
}

Parser Interface

Parsers handle specific MIME types and produce a standardized ParsedDocument:

interface DocumentParser {
  mimeTypes: string[];
  parse(buffer: Buffer, options?: { maxPages?: number }): Promise<ParsedDocument>;
}

interface ParsedDocument {
  pages: ParsedPage[];
  fullText: string;
  metadata: {
    pageCount: number;
    title?: string;
    author?: string;
    [key: string]: unknown;
  };
}

The system selects parsers based on MIME type via getParser(mimeType). Files without a matching parser fall back to plain text extraction.
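
A minimal sketch of how MIME-based selection with a plain-text fallback might look. The registry, the ParserEntry shape (which omits parse() for brevity), the label field, and the normalization of MIME parameters are all assumptions made for illustration; only the getParser(mimeType) name and the fallback behavior come from the description above.

```typescript
// Simplified stand-in for DocumentParser; the real interface also has parse().
interface ParserEntry {
  label: string; // hypothetical field, used here only to identify the parser
  mimeTypes: string[];
}

const pdfParser: ParserEntry = { label: "pdf", mimeTypes: ["application/pdf"] };
const docxParser: ParserEntry = {
  label: "docx",
  mimeTypes: ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
};
const plainTextParser: ParserEntry = { label: "plain-text", mimeTypes: ["text/plain"] };

// Index every parser under each MIME type it declares.
const parserRegistry = new Map<string, ParserEntry>();
for (const parser of [pdfParser, docxParser, plainTextParser]) {
  for (const mimeType of parser.mimeTypes) {
    parserRegistry.set(mimeType.toLowerCase(), parser);
  }
}

function getParser(mimeType: string | null): ParserEntry {
  // Drop parameters such as "; charset=utf-8" before the lookup (assumed normalization).
  const [base = ""] = (mimeType ?? "").split(";");
  // Unknown or missing MIME types fall back to plain text extraction.
  return parserRegistry.get(base.trim().toLowerCase()) ?? plainTextParser;
}
```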

How It Works

Upload Flow

The uploadDocument() server action handles the complete upload process:

1. Store file -- Upload the file to Supabase Storage under {tenantId}/{documentId}/{fileName}.
2. Create document entity -- If the tenant has a "document" entity type, create an entity in the entity graph with the document's metadata (file name, MIME type, size, processing status). This gives the document a presence in the entity graph for tagging, relations, and search.
3. Link to parent entity -- If the document was uploaded from an entity's detail page (via entityId), create an entity_relation linking the parent entity to the document entity with the document relationship type.
4. Create document record -- Insert a row in the documents table with status: "uploaded" and processing_stage: "uploaded".
5. Emit processing event -- Fire the document/uploaded Inngest event to trigger asynchronous processing.
6. Log activity -- Create an activity record for the upload.
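
The storage layout from the first step can be sketched as a small path builder. The helper name buildStoragePath is hypothetical, and the file-name sanitization is an added assumption rather than documented behavior; only the {tenantId}/{documentId}/{fileName} layout comes from the flow above.

```typescript
// Hypothetical helper that builds the object path used for the upload.
function buildStoragePath(tenantId: string, documentId: string, fileName: string): string {
  // Keep only the last path segment so a file name cannot escape its folder
  // (assumed safeguard, not confirmed by the source).
  const safeName = fileName.split(/[\\/]/).pop() || "file";
  return `${tenantId}/${documentId}/${safeName}`;
}
```

Scoping the path by tenant first keeps each tenant's files under a single prefix, which tends to make tenant-level storage policies and cleanup simpler.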

Processing Pipeline (Inngest)

The documentProcessing Inngest function runs the processing pipeline in six steps, each as a separate Inngest step so that a failure can be retried without repeating earlier work:

Step 1: Fetch Document -- Load the document record and mark it as processing/parsing.

Step 2: Parse -- Download the file from Supabase Storage, detect the appropriate parser based on MIME type, and extract text content:

- MIME-specific parsers handle formats like PDF, DOCX, and others.
- Fallback: files without a parser are treated as UTF-8 plain text.
- Parsed pages are stored in document_pages (existing pages deleted first for idempotent reprocessing).
- Document metadata is updated with page count, extracted title, and author.
- Processing stage advances to parsed.

Step 3: Chunk -- Read pages back from the database and split them into overlapping chunks:

- The chunkPages() function creates chunks of ~1000 characters with 200-character overlap.
- Chunks maintain page number tracking, so each chunk knows which page(s) it spans.
- Chunks are stored in document_chunks (existing chunks deleted first).
- Processing stage advances to chunked.
- If no chunks are produced (empty document), the pipeline short-circuits to ready.
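
The overlap scheme can be illustrated with a sliding window. This is a simplified sketch, not the actual chunkPages() implementation: it assumes chunks never cross page boundaries (the real function may let a chunk span pages) and uses hypothetical PageInput and ChunkOutput shapes.

```typescript
interface PageInput {
  page_number: number;
  content: string;
}

interface ChunkOutput {
  chunk_index: number;
  page_number: number;
  content: string;
}

// Sliding-window chunker: each window is `size` characters long and
// starts `size - overlap` characters after the previous one.
function chunkPages(pages: PageInput[], size = 1000, overlap = 200): ChunkOutput[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const step = size - overlap;
  const chunks: ChunkOutput[] = [];
  for (const page of pages) {
    const text = page.content;
    for (let start = 0; start < text.length; start += step) {
      const end = Math.min(start + size, text.length);
      chunks.push({
        chunk_index: chunks.length,
        page_number: page.page_number,
        content: text.slice(start, end),
      });
      if (end === text.length) break; // last window reached the end of the page
    }
  }
  return chunks;
}
```

With these defaults a 2,500-character page yields three chunks starting at offsets 0, 800, and 1,600, and an empty page yields none, which corresponds to the short-circuit case described above.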

Step 4: Embed -- Generate vector embeddings for each chunk:

- Embeddings are generated in batches of 100 via the generateEmbeddings() function (uses OpenAI's embedding API).
- If no OpenAI key is configured, this step completes without embeddings -- the system falls back to text-only search.
- Embeddings are upserted into document_chunks using batch operations.
- A 200ms delay between batches prevents rate limiting.
- Processing stage advances to embedded.
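
The batching and pacing described above might be sketched as follows. The helper names toBatches and embedInBatches are hypothetical, and the embed callback stands in for generateEmbeddings(), whose real signature is not shown in this document.

```typescript
// Split a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed texts batch by batch, pausing between batches to stay under rate limits.
async function embedInBatches(
  texts: string[],
  embed: (batch: string[]) => Promise<number[][]>, // stand-in for generateEmbeddings()
  delayMs = 200,
): Promise<number[][]> {
  const vectors: number[][] = [];
  const batches = toBatches(texts);
  for (let i = 0; i < batches.length; i++) {
    vectors.push(...(await embed(batches[i])));
    // No pause is needed after the final batch.
    if (i < batches.length - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return vectors;
}
```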

Step 5: Finalize -- Mark the document as ready:

- Update documents.status to ready and processing_stage to ready.
- Update the document entity's content with page count, chunk count, embedding status, and processing status.

Step 6: Trigger field population -- Find entities linked to this document (via direct entity_id or entity_relations from the document entity) and trigger matching document-ready actions through the actions/sessions runtime. This enables the "upload document, auto-populate fields" flow -- when a document finishes processing, linked entities can re-run field-population work with the new document as a source.

Concurrency Control

The Inngest function limits concurrency to 3 total concurrent executions and 2 per tenant. This prevents resource exhaustion from bulk document uploads while ensuring reasonable throughput.
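
In Inngest, limits like these are typically expressed through the function's concurrency option, where an entry without a key applies to the function as a whole and an entry with a key applies per distinct key value. The fragment below is a sketch only: the function id and the event.data.tenantId field name are assumptions, and the object is shown standalone rather than passed to createFunction.

```typescript
// Options object of the kind passed as the first argument to inngest.createFunction().
const documentProcessingOptions = {
  id: "document-processing", // hypothetical function id
  concurrency: [
    { limit: 3 }, // at most 3 runs of this function at once
    { key: "event.data.tenantId", limit: 2 }, // at most 2 per tenant (assumed payload field)
  ],
};
```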

Hybrid Search

The searchDocumentsHybrid() function combines vector similarity with full-text keyword search:

1. Generate query embedding -- Convert the search query to a vector using the same embedding model used for chunks. If no embedding API is available, search falls back to text-only.
2. Execute hybrid search -- Call the hybrid_search_document_chunks PostgreSQL function, which:
   - Runs vector cosine similarity against chunk embeddings.
   - Runs PostgreSQL full-text search against chunk content.
   - Combines scores using Reciprocal Rank Fusion (RRF) with configurable weights (default: 0.7 vector, 0.3 text).
   - Returns the top N matches sorted by combined score.
3. Filter by scope -- Results can be filtered by tenant ID and optionally by specific document IDs (for entity-scoped extraction searches).
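
Weighted Reciprocal Rank Fusion can be illustrated in a few lines. This sketch runs in application code purely for explanation; the real scoring lives in the hybrid_search_document_chunks PostgreSQL function, whose body is not shown here. The function name rrfCombine and the smoothing constant k = 60 (a common convention) are assumptions.

```typescript
interface FusedResult {
  id: string;
  combined_score: number;
}

// Each list contributes weight / (k + rank) for every id it contains,
// where rank is the 1-based position of the id in that list.
function rrfCombine(
  vectorRanked: string[],
  textRanked: string[],
  vectorWeight = 0.7,
  textWeight = 0.3,
  k = 60,
): FusedResult[] {
  const scores = new Map<string, number>();
  const accumulate = (ids: string[], weight: number): void => {
    ids.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  };
  accumulate(vectorRanked, vectorWeight);
  accumulate(textRanked, textWeight);
  return Array.from(scores, ([id, combined_score]) => ({ id, combined_score }))
    .sort((a, b) => b.combined_score - a.combined_score);
}
```

Because only ranks enter the formula, the raw cosine similarity and full-text rank never need to be put on a common scale. Passing an empty vector list reproduces the text-only fallback: the output simply follows the text ranking.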

A legacy vector-only search function (searchDocumentChunks) is maintained for backward compatibility but is deprecated in favor of the hybrid approach.

Document-Entity Linking

Documents can be linked to entities in two ways:

- Direct link -- documents.entity_id stores a direct reference to the entity the document was uploaded from. This is set at upload time.
- Entity graph link -- The document's own entity (document_entity_id) is connected to other entities via entity_relations. This enables documents to be linked to multiple entities and discovered through graph traversal.

The linkDocumentToEntity() and unlinkDocumentFromEntity() server actions manage entity relations between a document entity and other entities. The document picker in the UI uses these to create and remove associations.

POST /api/documents/[id]/links and DELETE /api/documents/[id]/links accept both session auth and API-key auth (documents:write scope). This enables external agents that create entities via POST /api/entities/upsert to also attach documents to those entities in the same API-key workflow without requiring a browser session.

Client Data Fetching

The document library, document detail, chunk viewer, picker, link dialog, and linked-list surfaces use React Query hooks for loading and cache invalidation. This keeps document browsing, linking, and deletion behavior consistent across the admin and detail pages and avoids one-off useEffect fetches in each component.

File Access

Documents are stored in Supabase Storage with private access. The getDocumentUrl() function generates signed URLs valid for 1 hour, enabling secure, time-limited downloads without exposing storage credentials.
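
A hedged sketch of the signing step. StorageLike is a local structural type modeled on the call shape of the supabase-js v2 storage client (from(bucket).createSignedUrl(path, expiresIn)), declared here so the example has no dependencies. The bucket name and the helper name signDocumentPath are hypothetical, and the real getDocumentUrl() also resolves the document and tenant first.

```typescript
// Minimal structural type for the part of the storage client used below.
interface StorageLike {
  from(bucket: string): {
    createSignedUrl(
      path: string,
      expiresInSeconds: number,
    ): Promise<{
      data: { signedUrl: string } | null;
      error: { message: string } | null;
    }>;
  };
}

const SIGNED_URL_TTL_SECONDS = 60 * 60; // 1 hour, as described above
const DOCUMENTS_BUCKET = "documents"; // hypothetical bucket name

async function signDocumentPath(storage: StorageLike, filePath: string): Promise<string> {
  const { data, error } = await storage
    .from(DOCUMENTS_BUCKET)
    .createSignedUrl(filePath, SIGNED_URL_TTL_SECONDS);
  if (error || !data) {
    throw new Error(error?.message ?? "Could not sign document URL");
  }
  return data.signedUrl;
}
```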

API Reference

Server Actions (`features/documents/server/actions.ts`)

| Function | Signature | Description |
| --- | --- | --- |
| `uploadDocument` | `(entityId, file, options?, context?) => Promise<DocumentRecord>` | Upload, store, create entity, emit processing event. |
| `getDocumentById` | `(id) => Promise<DocumentRecord \| null>` | Fetch a single document by ID (tenant-scoped). |
| `listDocuments` | `(entityId?) => Promise<DocumentRecord[]>` | List documents, optionally filtered by entity. |
| `listDocumentsPage` | `(params: ListDocumentsParams) => Promise<{documents, total, page, limit, totalPages}>` | Paginated listing with status, MIME type, and text filters. |
| `getDocumentsByEntity` | `(entityId) => Promise<DocumentRecord[]>` | Documents linked via entity relations (graph traversal). |
| `linkDocumentToEntity` | `(documentId, entityId) => Promise<void>` | Create entity relation between document and entity. Session-auth context. |
| `unlinkDocumentFromEntity` | `(documentId, entityId) => Promise<void>` | Remove entity relation. Session-auth context. |
| `linkDocumentToEntityKeyed` | `(tenantId, documentId, entityId) => Promise<void>` | API-key-friendly variant with explicit tenant scoping. |
| `unlinkDocumentFromEntityKeyed` | `(tenantId, documentId, entityId) => Promise<void>` | API-key-friendly variant with explicit tenant scoping. |
| `getDocumentByIdKeyed` | `(tenantId, id) => Promise<DocumentRecord \| null>` | Fetch a document by ID with explicit tenant scoping (used internally by the keyed link/unlink functions). |
| `deleteDocument` | `(id, tenantIdOverride?) => Promise<void>` | Delete file from storage, remove document record (cascades to chunks/pages), delete associated entity. |
| `getDocumentUrl` | `(id, tenantIdOverride?) => Promise<string>` | Generate a 1-hour signed URL for file download. |
| `getDocumentStats` | `() => Promise<{total, byStatus, byMimeType, totalSize}>` | Aggregate statistics for the document library. |

HTTP Routes

| Method | Path | Auth | Description |
| --- | --- | --- | --- |
| `POST` | `/api/documents/[id]/links` | Session or API key (`documents:write`) | Link a document to an entity. Body: `{ entityId: string }`. |
| `DELETE` | `/api/documents/[id]/links` | Session or API key (`documents:write`) | Unlink a document from an entity. Query param: `entityId`. |
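
An external agent might construct the two requests roughly as follows. This is a sketch: the Authorization: Bearer header is an assumption (this document does not specify how the API key is transmitted), and the base URL, helper names, and LinkRequest shape are hypothetical. The resulting objects can be handed to any HTTP client.

```typescript
interface LinkRequest {
  url: string;
  method: "POST" | "DELETE";
  headers: Record<string, string>;
  body?: string;
}

// POST /api/documents/[id]/links with the entity ID in the JSON body.
function buildLinkRequest(baseUrl: string, documentId: string, entityId: string, apiKey: string): LinkRequest {
  return {
    url: `${baseUrl}/api/documents/${encodeURIComponent(documentId)}/links`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`, // assumed header; key needs the documents:write scope
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ entityId }),
  };
}

// DELETE /api/documents/[id]/links with the entity ID as a query parameter.
function buildUnlinkRequest(baseUrl: string, documentId: string, entityId: string, apiKey: string): LinkRequest {
  return {
    url: `${baseUrl}/api/documents/${encodeURIComponent(documentId)}/links?entityId=${encodeURIComponent(entityId)}`,
    method: "DELETE",
    headers: { Authorization: `Bearer ${apiKey}` }, // assumed header
  };
}
```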

Search (`features/documents/server/search.ts`)

| Function | Signature | Description |
| --- | --- | --- |
| `searchDocumentsHybrid` | `(query, { tenantId, documentIds?, limit?, vectorWeight?, textWeight? }) => Promise<DocumentChunkMatch[]>` | Hybrid vector + text search with configurable weights. |
| `searchDocumentChunks` | `(query, { tenantId, entityId?, limit? }) => Promise<DocumentChunkMatch[]>` | Legacy vector-only search (deprecated). |

Listing Parameters

interface ListDocumentsParams {
  entityId?: string;
  status?: "uploaded" | "processing" | "ready" | "error";
  mimeType?: string;   // Partial match (ilike)
  q?: string;          // Search title and file name
  page?: number;       // Default: 1
  limit?: number;      // Default: 25
  sort?: "created_at" | "title" | "file_size" | "status";
  order?: "asc" | "desc";
}
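
The page and limit defaults relate to the totalPages value returned by listDocumentsPage in the usual way. The helper below is a hypothetical sketch of that arithmetic, not the actual implementation; the inclusive from/to pair is an assumption about how the row range might be requested.

```typescript
interface ResolvedPagination {
  page: number;
  limit: number;
  from: number; // zero-based index of the first row on the page
  to: number; // zero-based index of the last row on the page (inclusive)
  totalPages: number;
}

// Hypothetical helper: apply defaults and derive the row range and page count.
function resolvePagination(total: number, page = 1, limit = 25): ResolvedPagination {
  const safeLimit = Math.max(1, Math.floor(limit));
  const safePage = Math.max(1, Math.floor(page));
  const from = (safePage - 1) * safeLimit;
  return {
    page: safePage,
    limit: safeLimit,
    from,
    to: from + safeLimit - 1,
    totalPages: Math.ceil(total / safeLimit),
  };
}
```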

For Agents

Agents interact with documents primarily through field-population tools:

searchLinkedDocuments -- Available during field extraction when linked-documents or all-documents is in the field's sources config. Performs hybrid search across document chunks and auto-captures document source refs for provenance tracking.

Documents processed by the pipeline become searchable field-population sources. When an entity has linked documents and a field's source config includes linked-documents, the agent can search those documents for field values with full page-level citation tracking.

The document-to-field-population pipeline also works automatically: when a document finishes processing, the system fires population events for all linked entities, enabling a "drop a document, watch fields populate" workflow.

Design Decisions

Dual representation (document record + entity). The documents table handles file storage concerns (path, size, MIME type, processing status) while the document entity in the entity graph handles information architecture concerns (tags, relations, searchability). This separation lets the document participate in the entity graph without overloading the documents table with entity-specific fields.

Page-aware chunking. Chunks maintain their page number association. This enables page-level citations in extraction source refs -- when an agent finds a value in a document, the source ref can point to the specific page, not just the document.

Hybrid search with RRF. Pure vector search misses keyword matches; pure text search misses semantic similarity. Reciprocal Rank Fusion combines both signals without requiring careful score normalization. The default 0.7/0.3 weighting favors semantic similarity while preserving keyword precision.

Graceful embedding degradation. If no OpenAI key is configured, the system works without embeddings -- hybrid search falls back to text-only via full-text search. This ensures the document system is usable even in environments without embedding API access.

Inngest step isolation. Each processing stage (parse, chunk, embed, finalize) is a separate Inngest step. If embedding fails, the document still has its chunks and pages from successful earlier steps. Retrying only re-runs the failed step, not the entire pipeline.

Auto-population on document ready. The final processing step triggers document-ready actions for linked entities. This closes the "upload a document, fields auto-populate" loop without requiring manual intervention.

Dual-auth on document linking routes. The document linking API (POST/DELETE /api/documents/[id]/links) checks for an API key first; if none is present it falls through to session auth. This lets the same endpoint serve both browser-driven UI actions and external agent pipelines that pair document uploads with entity upserts via POST /api/entities/upsert. The *Keyed server action variants take an explicit tenantId so the link/unlink logic can run without a session cookie (which getActiveTenantId() would normally read from).

Related Modules

- Field Population (features/actions/, features/sessions/, features/responses/) -- Documents are a primary source. The searchLinkedDocuments tool queries document chunks during field-population sessions.
- Entity System (features/entities/) -- Document entities participate in the entity graph with relations, tags, and search.
- Inngest Functions (features/inngest/functions/) -- document-processing.ts runs the async pipeline; action/session work is triggered on completion.
- Block System (features/blocks/) -- Document-related blocks can render document previews and status in entity detail views.
