# Source Sync
Feed-based and single-URL ingestion of external content into source-item entities, with configurable scraping strategies.
## Overview
Source Sync manages the ingestion of external content — RSS feeds, keyword searches, authenticated web pages — into the platform's `source` and `source-item` entity types. Each `source` entity represents a configured feed (a PubMed query, a news RSS, a competitor blog). Each `source-item` entity represents one article or document extracted from that feed.
The module provides:
- **Feed-level sync** — scheduled or on-demand sync of a configured source, producing batches of source-item entities
- **Single-URL ingestion** — the `extractUrlToArticle` composer and `ingestUrl` agent tool let agents and capture hooks ingest an arbitrary URL outside of feed config
- **Pluggable scraping strategies** — Firecrawl, browser-agent, and readable-HTML fallback are composable primitives any tenant feature can call
Source Sync is a platform module: no product-specific slugs appear in the core code. The `source` and `source-item` entity type slugs are configurable; read them through `getSourceEntityTypeSlug()` / `getSourceItemEntityTypeSlug()` from `features/source-sync/config.ts` rather than hardcoding them.
## Key Concepts
### Entity types
| Entity type | Purpose |
| ------------- | -------------------------------------------------- |
| `source` | A configured feed (URL, type, schedule, strategy) |
| `source-item` | One article/document extracted from a source feed |
The slugs for both are read from `features/source-sync/config.ts` (`SOURCE_ENTITY_TYPE_SLUG`, `SOURCE_ITEM_ENTITY_TYPE_SLUG`) so they can be overridden per deployment.
### Scrape strategies
| Strategy | When used |
| -------------- | ----------------------------------------------------------- |
| `firecrawl` | Default. Requires `FIRECRAWL_API_KEY`. Best for public HTML |
| `browser` | Authenticated pages; requires an external browser connection |
| `http` | Plain fetch + readable-HTML parser. No external dependency |
| `auto` | Module selects based on source type and environment |
### `ScrapeStrategy` type
```ts
type ScrapeStrategy = "auto" | "http" | "firecrawl" | "browser"
```
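To make the table above concrete, here is a minimal sketch of how a strategy value could translate into an ordered list of extractors to try. It is an illustration under stated assumptions, not the module's actual selection logic: `StrategyContext`, `ExtractorName`, and `resolveExtractorOrder` are hypothetical names, and the rule that every strategy ends with the readable-HTML parser is assumed.

```ts
type ScrapeStrategy = "auto" | "http" | "firecrawl" | "browser"
type ExtractorName = "firecrawl" | "browser-agent" | "readable"

// Hypothetical inputs: whether FIRECRAWL_API_KEY is set and whether a browser
// connection was supplied. The real module may consult other signals.
interface StrategyContext {
  firecrawlEnabled: boolean
  hasBrowserConnection: boolean
}

// Returns the extractors to try, in order. The ordering is an assumption:
// Firecrawl first, then the browser agent, then plain fetch + readable HTML.
function resolveExtractorOrder(
  strategy: ScrapeStrategy,
  ctx: StrategyContext,
): ExtractorName[] {
  const order: ExtractorName[] = []
  const wantsFirecrawl = strategy === "firecrawl" || strategy === "auto"
  const wantsBrowser = strategy === "browser" || strategy === "auto"
  if (wantsFirecrawl && ctx.firecrawlEnabled) order.push("firecrawl")
  if (wantsBrowser && ctx.hasBrowserConnection) order.push("browser-agent")
  order.push("readable") // needs no external service, so it is always available
  return order
}
```

Under these assumptions, `"http"` always resolves to `["readable"]`, and `"firecrawl"` without an API key degrades to the same list instead of failing.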
### `ExtractUrlResult` type
Normalized output from `extractUrlToArticle` and the underlying primitives:
```ts
interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
}
```
### `HtmlArticleContent` type
Output of `extractArticleFromHtml`:
```ts
interface HtmlArticleContent {
  title: string | null
  summary: string | null
  body: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  imageUrl: string | null
}
```
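The two shapes overlap but use different field names. The sketch below shows one way parsed HTML content could be normalized into an `ExtractUrlResult`. The helper `toExtractUrlResult` is hypothetical, and the `summary` to `excerpt` and `body` to `text` mapping is an assumption; the module's real normalization may differ.

```ts
// Types copied from this page.
interface HtmlArticleContent {
  title: string | null
  summary: string | null
  body: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  imageUrl: string | null
}

type ExtractorLabel = "firecrawl" | "browser-agent" | "readable" | null

interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: ExtractorLabel
}

// Hypothetical helper: field mapping is assumed, not taken from the module.
function toExtractUrlResult(
  article: HtmlArticleContent,
  rawHtml: string | null,
  extractor: ExtractorLabel,
): ExtractUrlResult {
  return {
    title: article.title,
    text: article.body, // assumed: body maps to text
    excerpt: article.summary, // assumed: summary maps to excerpt
    author: article.author,
    publishedAt: article.publishedAt,
    canonicalUrl: article.canonicalUrl,
    rawHtml,
    extractor,
  }
}
```

Note that `imageUrl` has no counterpart in `ExtractUrlResult`, so this mapping drops it.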
## How It Works
### Feed sync flow
1. A `source` entity is created with `source_type`, `url`, `scrape_strategy`, and schedule fields.
2. A scheduled dispatcher or manual trigger calls the source-sync admin action.
3. For each new item discovered in the feed, the pipeline calls the appropriate scraper (`firecrawlScrapeHtml` → `extractArticleFromHtml`, or `extractArticleWithBrowserAgent`, or plain `fetch` + readable HTML).
4. A `source-item` entity is created via `createEntityKeyed`. Relations to the parent `source` are attached automatically.
### Single-URL ingestion flow (added 2026-04-21)
Any agent with the `ingestUrl` tool, or server code calling `extractUrlToArticle` directly, can ingest a single URL without a configured feed:
1. `extractUrlToArticle({ url, strategy? })` composes the same Firecrawl → browser-agent → readable-HTML fallback chain used by feed sync.
2. The `ingestUrl` tool wraps this, creates the `source-item` entity, and optionally attaches relations by slug.
3. The capture URL routing hook (`features/capture/server/route-url-to-source-item.ts`) fires this path automatically when a capture contains a URL and the tenant has a scout agent configured.
### Capture → source-item routing
When a capture is created:
1. `route-url-to-source-item.ts` scans the capture body for a URL.
2. If a URL is found, the hook checks whether the tenant has the `source-item` entity type configured and an agent with `ingestUrl` in its `customTools`.
3. If both checks pass, it fires a `capture.url.routed` Inngest event that invokes the scout agent.
4. A soft modality-slug hint is included as `linkToEntities` if the capture body contains a matching slug string.
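The routing checks above can be sketched as a pure function that decides whether to route and what hint to attach. This is a simplified illustration, not the code in `route-url-to-source-item.ts`: `AgentConfig`, `RoutingDecision`, `findFirstUrl`, and `routeCapture` are hypothetical names, the URL regex and the case-insensitive slug match are assumptions, and sending the Inngest event is left to the caller.

```ts
// Hypothetical shapes for illustration only.
interface AgentConfig {
  slug: string
  customTools: string[]
}

interface RoutingDecision {
  url: string
  agentSlug: string
  linkToEntities: Array<{ slug: string; relationType: string }>
}

const URL_PATTERN = /https?:\/\/[^\s<>"')\]]+/i

function findFirstUrl(body: string): string | null {
  const match = body.match(URL_PATTERN)
  if (!match) return null
  // Drop trailing sentence punctuation that is rarely part of the URL.
  return match[0].replace(/[.,;:!?]+$/, "")
}

// Returns null when the capture should not be routed.
function routeCapture(
  body: string,
  hasSourceItemType: boolean,
  agents: AgentConfig[],
  modalitySlugs: string[],
): RoutingDecision | null {
  const url = findFirstUrl(body)
  if (!url || !hasSourceItemType) return null
  const scout = agents.find((agent) => agent.customTools.includes("ingestUrl"))
  if (!scout) return null
  // Soft hint: any known modality slug that appears verbatim in the body.
  const lowered = body.toLowerCase()
  const linkToEntities = modalitySlugs
    .filter((slug) => lowered.includes(slug.toLowerCase()))
    .map((slug) => ({ slug, relationType: "about" }))
  return { url, agentSlug: scout.slug, linkToEntities }
}
```

A caller would send the `capture.url.routed` event only when the function returns a non-null decision, which keeps the gating logic testable without an event bus.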
## API Reference
### `extractUrlToArticle(params)` — `features/source-sync`
Composes the full extraction chain for a single URL.
```ts
import { extractUrlToArticle } from "@/features/source-sync"

const result = await extractUrlToArticle({
  url: "https://example.com/article",
  strategy: "firecrawl", // optional, defaults to "firecrawl"
  browserConnection: null, // AgentConnectionInternal | null
  contentSelector: ".article-body", // optional CSS selector
  traceContext: undefined,
})
// result.extractor tells you which path succeeded
```
Parameters are passed as an object to allow future additions without breaking callers.
### `extractArticleFromHtml(params)` — `features/source-sync`
Parse article content from an HTML string using Cheerio + readability heuristics.
```ts
import { extractArticleFromHtml } from "@/features/source-sync"

const article = extractArticleFromHtml({
  html: rawHtmlString,
  pageUrl: "https://example.com/article",
  contentSelector: null, // optional override
})
// returns HtmlArticleContent
```
### `firecrawlScrapeHtml(url)` — `features/source-sync`
Scrape a URL via the Firecrawl API. Returns `{ html, title, description, finalUrl }`. Requires `FIRECRAWL_API_KEY`. Throws on network or API errors.
```ts
import { firecrawlScrapeHtml, isFirecrawlEnabled } from "@/features/source-sync"

if (isFirecrawlEnabled()) {
  const scraped = await firecrawlScrapeHtml(url)
}
```
### `extractArticleWithBrowserAgent(params)` — `features/source-sync`
Extract article content via a configured browser-agent external connection. Used for authenticated or JavaScript-heavy pages.
```ts
import { extractArticleWithBrowserAgent } from "@/features/source-sync"

const article = await extractArticleWithBrowserAgent({
  connection, // AgentConnectionInternal
  url,
  contentSelector: null,
  traceContext: undefined,
})
```
### `shouldUseFirecrawl(strategy, sourceType)` / `shouldUseBrowserAgent(strategy, sourceType)` — `features/source-sync`
Strategy-selection predicates. Feed sync uses them internally; they are also exported for custom ingestion paths.
### `getSourceEntityTypeSlug()` / `getSourceItemEntityTypeSlug()` — `features/source-sync`
Return the configured entity type slugs. Always use these rather than hardcoding `"source"` / `"source-item"`.
### `ingestUrl` agent tool — `features/tools/source-tools.ts`
Registered slug: `"ingestUrl"`. Category: `"source"`. Required permission: `entities.team.create`.
```ts
// Input schema
{
  url: string // required, absolute URL
  sourceSlug?: string // optional parent source entity slug for provenance
  linkToEntities?: Array<{
    slug: string // existing entity slug in this tenant
    relationType: string // default "about"
  }>
}

// Return value
{
  entityId: string
  slug: string
  title: string | null
  excerpt: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
  linked: Array<{ targetSlug, targetId, relationshipType }>
  unresolvedSlugs: string[]
}
```
The tool throws `"Entity type not found: {slug}"` if the tenant has not configured the `source-item` entity type.
### `ingest-url-to-source-item` task convention
Tenants that want a UI-triggerable ingestion task should create a DB task row with slug `ingest-url-to-source-item`, assign it to their scout agent, and set `trigger_type: "manual"` and `output_type: "entity"`. This is a convention and is not platform-enforced; the task is optional per tenant.
## For Agents
The primary agent-facing affordance is the `ingestUrl` tool:
```
Tool: ingestUrl
Input: { url, sourceSlug?, linkToEntities? }
```
Agents should:
1. Call `ingestUrl` to create a source-item from a URL.
2. Optionally pass `linkToEntities` to attach the source-item to relevant modality, topic, or claim entities immediately.
3. After ingestion, note the returned `entityId` — downstream tasks (grading, claim extraction) will reference it.
4. If `unresolvedSlugs` is non-empty, warn the user — those relation targets were not found.
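As a rough illustration of steps 3 and 4, the sketch below turns an `ingestUrl` result into messages an agent could relay. The result shape follows the return value documented above; `summarizeIngest` is a hypothetical helper, and the `string` types on the `linked` fields are assumed because the schema does not spell them out.

```ts
// Shape follows the documented `ingestUrl` return value; field types inside
// `linked` are assumptions.
interface IngestUrlResult {
  entityId: string
  slug: string
  title: string | null
  excerpt: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
  linked: Array<{ targetSlug: string; targetId: string; relationshipType: string }>
  unresolvedSlugs: string[]
}

// Hypothetical helper: builds the lines an agent might report after ingestion.
function summarizeIngest(result: IngestUrlResult): string[] {
  const lines = [`Created source-item ${result.slug} (entityId ${result.entityId}).`]
  if (result.extractor === null) {
    // No extractor succeeded, so the stored metadata is likely incomplete.
    lines.push("Extraction returned no content; metadata may need manual edits.")
  }
  if (result.unresolvedSlugs.length > 0) {
    lines.push(`Warning: relation targets not found: ${result.unresolvedSlugs.join(", ")}.`)
  }
  return lines
}
```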
For reading evidence grades on a source-item, use `getEntity` with `includeResponses: true` (see [Entity System](/docs/features/entity-system)).
## Design Decisions
**Params-object signature for `extractUrlToArticle`** — positional args would require callers to pass `undefined` for every optional after the first. The params object pattern is consistent with the rest of the platform and allows non-breaking additions.
**`ingestUrl` registered in `features/tools/source-tools.ts`, not `features/custom/`** — the tool has zero product-specific logic. Any tenant with `source-item` configured benefits. Keeping it in platform ensures it ships to all forks.
**`extractUrlToArticle` falls back silently** — network or scraper failures return an `EMPTY_RESULT` (all nulls, `extractor: null`) rather than throwing. The `ingestUrl` tool still creates the entity with `title` falling back to the raw URL, so the user sees a record and can fix metadata manually. Throwing on every transient scrape failure would break bulk sweep tasks.
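The silent-fallback behavior can be sketched as a loop that tries each extractor in turn and swallows failures. This is a minimal illustration, assuming each extractor is an async function that either resolves to a result, resolves to `null`, or throws. `UrlExtractor`, `extractWithFallback`, and `titleOrUrl` are hypothetical names, and the real composer may log or trace failures rather than discard them.

```ts
interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
}

// All nulls, matching the EMPTY_RESULT described above.
const EMPTY_RESULT: ExtractUrlResult = {
  title: null, text: null, excerpt: null, author: null,
  publishedAt: null, canonicalUrl: null, rawHtml: null, extractor: null,
}

// Hypothetical extractor signature for this sketch.
type UrlExtractor = (url: string) => Promise<ExtractUrlResult | null>

async function extractWithFallback(
  url: string,
  extractors: UrlExtractor[],
): Promise<ExtractUrlResult> {
  for (const extract of extractors) {
    try {
      const result = await extract(url)
      if (result) return result
    } catch {
      // Swallow the failure and fall through to the next extractor.
    }
  }
  return { ...EMPTY_RESULT }
}

// Title fallback that `ingestUrl` is described as applying: use the raw URL.
function titleOrUrl(result: ExtractUrlResult, url: string): string {
  return result.title ?? url
}
```

Because the function never rejects, a bulk sweep can await it for every URL in a batch without wrapping each call in its own error handling.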
**Capture routing is a thin hook, not a task trigger** — the capture hook fires an Inngest event rather than directly enqueuing a session-executor task. This decouples capture latency from ingestion latency and keeps the capture server action fast.
**No `@modality` mention grammar in this release** — the soft slug-match hint (capture body contains a string matching a modality slug) ships as a convenience. A formal `@slug` mention parser is deferred to `_backlog/idea-capture-mention-parser.md`.
## Related Modules
- [Entity System](/docs/features/entity-system) — `source` and `source-item` are standard entity types
- [Research Library](/docs/features/research-library) — DOC'S worked example using source-sync as the ingestion layer
- [Capture](/docs/features/capture) — capture URL routing hook fires `ingestUrl`
- [Tool System](/docs/features/tool-system) — `ingestUrl` registration and execution
- [Inngest](/docs/integrations/inngest) — feed sync scheduling and `recompute-claim-aggregates-on-response`