Source Sync
Feed-based and single-URL ingestion of external content into source-item entities, with configurable scraping strategies.
Overview
Source Sync manages the ingestion of external content — RSS feeds, keyword searches, authenticated web pages — into the platform's source and source-item entity types. Each source entity represents a configured feed (a PubMed query, a news RSS, a competitor blog). Each source-item entity represents one article or document extracted from that feed.
The module provides:
- Feed-level sync — scheduled or on-demand sync of a configured source, producing batches of source-item entities
- Single-URL ingestion — the extractUrlToArticle composer and ingestUrl agent tool let agents and capture hooks ingest an arbitrary URL outside of feed config
- Pluggable scraping strategies — Firecrawl, browser-agent, and readable-HTML fallback are composable primitives any tenant feature can call
Source Sync is a platform module. No product-specific slugs appear in the core code. The source and source-item entity type slugs are configurable via getSourceEntityTypeSlug() / getSourceItemEntityTypeSlug() from features/source-sync/config.ts.
Key Concepts
Entity types
| Entity type | Purpose |
|---|---|
| source | A configured feed (URL, type, schedule, strategy) |
| source-item | One article/document extracted from a source feed |
The slugs for both are read from features/source-sync/config.ts (SOURCE_ENTITY_TYPE_SLUG, SOURCE_ITEM_ENTITY_TYPE_SLUG) so they can be overridden per deployment.
Scrape strategies
| Strategy | When used |
|---|---|
| firecrawl | Default. Requires FIRECRAWL_API_KEY. Best for public HTML |
| browser | Authenticated pages; requires an external browser connection |
| http | Plain fetch + readable-HTML parser. No external dependency |
| auto | Module selects based on source type and environment |
ScrapeStrategy type
```ts
type ScrapeStrategy = "auto" | "http" | "firecrawl" | "browser"
```
ExtractUrlResult type
Normalized output from extractUrlToArticle and the underlying primitives:
```ts
interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
}
```
HtmlArticleContent type
Output of extractArticleFromHtml:
```ts
interface HtmlArticleContent {
  title: string | null
  summary: string | null
  body: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  imageUrl: string | null
}
```
How It Works
Feed sync flow
- A source entity is created with source_type, url, scrape_strategy, and schedule fields.
- A scheduled dispatcher or manual trigger calls the source-sync admin action.
- For each new item discovered in the feed, the pipeline calls the appropriate scraper (firecrawlScrapeHtml → extractArticleFromHtml, or extractArticleWithBrowserAgent, or plain fetch + readable HTML); a sketch follows this list.
- A source-item entity is created via createEntityKeyed. Relations to the parent source are attached automatically.
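A minimal sketch of that per-item scrape dispatch, assuming only the function signatures documented in the API reference below. The dispatcher name and parameters are hypothetical, and the createEntityKeyed step is omitted:

```ts
// Sketch only: scrapeFeedItem is a hypothetical helper name; the imports are
// the documented source-sync exports, used with their documented signatures.
import {
  extractArticleFromHtml,
  extractArticleWithBrowserAgent,
  firecrawlScrapeHtml,
  shouldUseBrowserAgent,
  shouldUseFirecrawl,
} from "@/features/source-sync"

type ScrapeStrategy = "auto" | "http" | "firecrawl" | "browser" // mirrors the module's type

async function scrapeFeedItem(
  url: string,
  strategy: ScrapeStrategy,
  sourceType: string,
  browserConnection: any, // AgentConnectionInternal | null in the real module
) {
  if (shouldUseBrowserAgent(strategy, sourceType) && browserConnection) {
    // Authenticated or JavaScript-heavy pages go through the browser-agent connection.
    return extractArticleWithBrowserAgent({
      connection: browserConnection,
      url,
      contentSelector: null,
      traceContext: undefined,
    })
  }

  if (shouldUseFirecrawl(strategy, sourceType)) {
    // Firecrawl fetches the page; article structure is parsed locally from its HTML.
    const scraped = await firecrawlScrapeHtml(url)
    return extractArticleFromHtml({
      html: scraped.html,
      pageUrl: scraped.finalUrl,
      contentSelector: null,
    })
  }

  // "http" fallback: plain fetch + readable-HTML parsing, no external dependency.
  const response = await fetch(url)
  return extractArticleFromHtml({
    html: await response.text(),
    pageUrl: url,
    contentSelector: null,
  })
}
```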
Single-URL ingestion flow (added 2026-04-21)
Any agent with the ingestUrl tool, or server code calling extractUrlToArticle directly, can ingest a single URL without a configured feed:
- extractUrlToArticle({ url, strategy? }) composes the same Firecrawl → browser-agent → readable-HTML fallback chain used by feed sync.
- The ingestUrl tool wraps this, creates the source-item entity, and optionally attaches relations by slug.
- The capture URL routing hook (features/capture/server/route-url-to-source-item.ts) fires this path automatically when a capture contains a URL and the tenant has a scout agent configured.
Capture → source-item routing
When a capture is created:
- route-url-to-source-item.ts scans the capture body for a URL.
- If found, it checks whether the tenant has source-item configured AND an agent with ingestUrl in its customTools.
- If both checks pass, it fires a capture.url.routed Inngest event invoking the scout agent.
- A soft modality-slug hint is included as linkToEntities if the capture body contains a matching slug string. A sketch of the hook follows this list.
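The sketch below shows the hook's shape under those rules. The declared lookup helpers and the Inngest client binding are hypothetical stand-ins; only the checks, the event name, and the "about" relation default come from the behavior described above.

```ts
// Sketch of features/capture/server/route-url-to-source-item.ts.
// The declared helpers are assumptions; the routing rules follow the list above.
declare const inngest: { send(event: { name: string; data: unknown }): Promise<unknown> }
declare function tenantHasSourceItemType(tenantId: string): Promise<boolean>
declare function findAgentWithCustomTool(tenantId: string, tool: string): Promise<{ id: string } | null>
declare function matchModalitySlug(tenantId: string, text: string): Promise<string | null>

const URL_PATTERN = /https?:\/\/\S+/

export async function routeUrlToSourceItem(capture: {
  id: string
  tenantId: string
  body: string
}) {
  const url = capture.body.match(URL_PATTERN)?.[0]
  if (!url) return

  // Both checks must pass before any event is fired.
  if (!(await tenantHasSourceItemType(capture.tenantId))) return
  const scoutAgent = await findAgentWithCustomTool(capture.tenantId, "ingestUrl")
  if (!scoutAgent) return

  // Soft hint: pass a modality slug as linkToEntities when the capture body mentions one.
  const modalitySlug = await matchModalitySlug(capture.tenantId, capture.body)

  await inngest.send({
    name: "capture.url.routed",
    data: {
      captureId: capture.id,
      agentId: scoutAgent.id,
      url,
      linkToEntities: modalitySlug ? [{ slug: modalitySlug, relationType: "about" }] : [],
    },
  })
}
```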
API Reference
extractUrlToArticle(params) — features/source-sync
Composes the full extraction chain for a single URL.
```ts
import { extractUrlToArticle } from "@/features/source-sync"

const result = await extractUrlToArticle({
  url: "https://example.com/article",
  strategy: "firecrawl", // optional, defaults to "firecrawl"
  browserConnection: null, // AgentConnectionInternal | null
  contentSelector: ".article-body", // optional CSS selector
  traceContext: undefined,
})
// result.extractor tells you which path succeeded
```
Parameters are passed as an object to allow future additions without breaking callers.
extractArticleFromHtml(params) — features/source-sync
Parse article content from an HTML string using Cheerio + readability heuristics.
```ts
import { extractArticleFromHtml } from "@/features/source-sync"

const article = extractArticleFromHtml({
  html: rawHtmlString,
  pageUrl: "https://example.com/article",
  contentSelector: null, // optional override
})
// returns HtmlArticleContent
```
firecrawlScrapeHtml(url) — features/source-sync
Scrape a URL via the Firecrawl API. Returns { html, title, description, finalUrl }. Requires FIRECRAWL_API_KEY. Throws on network or API errors.
```ts
import { firecrawlScrapeHtml, isFirecrawlEnabled } from "@/features/source-sync"

if (isFirecrawlEnabled()) {
  const scraped = await firecrawlScrapeHtml(url)
}
```
extractArticleWithBrowserAgent(params) — features/source-sync
Extract article content via a configured browser-agent external connection. Used for authenticated or JavaScript-heavy pages.
```ts
import { extractArticleWithBrowserAgent } from "@/features/source-sync"

const article = await extractArticleWithBrowserAgent({
  connection, // AgentConnectionInternal
  url,
  contentSelector: null,
  traceContext: undefined,
})
```
shouldUseFirecrawl(strategy, sourceType) / shouldUseBrowserAgent(strategy, sourceType) — features/source-sync
Strategy-selection predicates. Feed sync uses these internally; exposed for custom ingestion paths.
getSourceEntityTypeSlug() / getSourceItemEntityTypeSlug() — features/source-sync
Return the configured entity type slugs. Always use these rather than hardcoding "source" / "source-item".
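A short usage sketch; the listEntitiesByType helper is a hypothetical stand-in for whatever entity query a custom feature already uses, and only the slug getter is a documented export:

```ts
import { getSourceItemEntityTypeSlug } from "@/features/source-sync"

// Hypothetical entity query helper, declared here only to make the sketch self-contained.
declare function listEntitiesByType(params: { entityTypeSlug: string; limit: number }): Promise<unknown[]>

const recentItems = await listEntitiesByType({
  entityTypeSlug: getSourceItemEntityTypeSlug(), // never hardcode "source-item"
  limit: 20,
})
```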
ingestUrl agent tool — features/tools/source-tools.ts
Registered slug: "ingestUrl". Category: "source". Required permission: entities.team.create.
```ts
// Input schema
{
  url: string // required, absolute URL
  sourceSlug?: string // optional parent source entity slug for provenance
  linkToEntities?: Array<{
    slug: string // existing entity slug in this tenant
    relationType: string // default "about"
  }>
}

// Return value
{
  entityId: string
  slug: string
  title: string | null
  excerpt: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
  linked: Array<{ targetSlug, targetId, relationshipType }>
  unresolvedSlugs: string[]
}
```
The tool throws "Entity type not found: {slug}" if the tenant has not configured the source-item entity type.
ingest-url-to-source-item task convention
Tenants that want a UI-triggerable ingestion task should create a DB task row with slug ingest-url-to-source-item, assigned to their scout agent, trigger_type: "manual", output_type: "entity". This is a convention, not platform-enforced — the task is optional per tenant.
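An illustrative seed row for that convention. Apart from the slug, trigger_type, and output_type values named above, the column names and the agent id are assumptions:

```ts
// Example seed data for the optional per-tenant task row.
const ingestUrlToSourceItemTask = {
  slug: "ingest-url-to-source-item",
  agent_id: "scout-agent-id", // the tenant's scout agent (assumed column name)
  trigger_type: "manual",     // UI-triggerable, not scheduled
  output_type: "entity",      // the task's output is a source-item entity
}
```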
For Agents
The primary agent-facing affordance is the ingestUrl tool:
Tool: ingestUrl
Input: { url, sourceSlug?, linkToEntities? }

Agents should:
- Call ingestUrl to create a source-item from a URL.
- Optionally pass linkToEntities to attach the source-item to relevant modality, topic, or claim entities immediately.
- After ingestion, note the returned entityId — downstream tasks (grading, claim extraction) will reference it.
- If unresolvedSlugs is non-empty, warn the user — those relation targets were not found. An example call follows this list.
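An example call under those guidelines. The executeTool wrapper is a hypothetical stand-in for however the agent runtime invokes tools; the input and return shapes follow the ingestUrl schema above:

```ts
// Hypothetical tool invocation wrapper; the result type mirrors the documented return value.
declare function executeTool(
  name: "ingestUrl",
  input: {
    url: string
    sourceSlug?: string
    linkToEntities?: Array<{ slug: string; relationType: string }>
  },
): Promise<{
  entityId: string
  slug: string
  title: string | null
  excerpt: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
  linked: Array<{ targetSlug: string; targetId: string; relationshipType: string }>
  unresolvedSlugs: string[]
}>

const result = await executeTool("ingestUrl", {
  url: "https://example.com/new-study",
  linkToEntities: [{ slug: "sleep", relationType: "about" }], // example target slug
})

// Downstream tasks (grading, claim extraction) reference result.entityId.
if (result.unresolvedSlugs.length > 0) {
  // Warn the user: these relation targets were not found in the tenant.
  console.warn(`Could not link: ${result.unresolvedSlugs.join(", ")}`)
}
```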
For reading evidence grades on a source-item, use getEntity with includeResponses: true (see Entity System).
Design Decisions
Params-object signature for extractUrlToArticle — positional args would require callers to pass undefined for every optional after the first. The params object pattern is consistent with the rest of the platform and allows non-breaking additions.
ingestUrl registered in features/tools/source-tools.ts, not features/custom/ — the tool has zero product-specific logic. Any tenant with source-item configured benefits. Keeping it in platform ensures it ships to all forks.
extractUrlToArticle falls back silently — network or scraper failures return an EMPTY_RESULT (all nulls, extractor: null) rather than throwing. The ingestUrl tool still creates the entity with title falling back to the raw URL, so the user sees a record and can fix metadata manually. Throwing on every transient scrape failure would break bulk sweep tasks.
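A small illustration of relying on that non-throwing contract when calling the composer directly (the URL is a placeholder):

```ts
import { extractUrlToArticle } from "@/features/source-sync"

const url = "https://example.com/flaky-page" // placeholder
const result = await extractUrlToArticle({ url })

// On scraper failure every field comes back null (EMPTY_RESULT), so a caller
// can still create a record, falling back to the raw URL as the title.
const title = result.title ?? url
const scrapeSucceeded = result.extractor !== null
```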
Capture routing is a thin hook, not a task trigger — the capture hook fires an Inngest event rather than directly enqueuing a session-executor task. This decouples capture latency from ingestion latency and keeps the capture server action fast.
No @modality mention grammar in this release — the soft slug-match hint (capture body contains a string matching a modality slug) ships as a convenience. A formal @slug mention parser is deferred to _backlog/idea-capture-mention-parser.md.
Related Modules
- Entity System — source and source-item are standard entity types
- Research Library — DOC'S worked example using source-sync as the ingestion layer
- Capture — capture URL routing hook fires ingestUrl
- Tool System — ingestUrl registration and execution
- Inngest — feed sync scheduling and recompute-claim-aggregates-on-response