Source Sync

Feed-based and single-URL ingestion of external content into source-item entities, with configurable scraping strategies.

Overview

Source Sync manages the ingestion of external content — RSS feeds, keyword searches, authenticated web pages — into the platform's source and source-item entity types. Each source entity represents a configured feed (a PubMed query, a news RSS, a competitor blog). Each source-item entity represents one article or document extracted from that feed.

The module provides:

Feed-level sync — scheduled or on-demand sync of a configured source, producing batches of source-item entities
Single-URL ingestion — the extractUrlToArticle composer and ingestUrl agent tool let agents and capture hooks ingest an arbitrary URL outside of feed config
Pluggable scraping strategies — Firecrawl, browser-agent, and readable-HTML fallback are composable primitives any tenant feature can call

Source Sync is a platform module. No product-specific slugs appear in the core code. The source and source-item entity type slugs are configurable via getSourceEntityTypeSlug() / getSourceItemEntityTypeSlug() from features/source-sync/config.ts.

Key Concepts

Entity types

Entity type	Purpose
`source`	A configured feed (URL, type, schedule, strategy)
`source-item`	One article/document extracted from a source feed

The slugs for both are read from features/source-sync/config.ts (SOURCE_ENTITY_TYPE_SLUG, SOURCE_ITEM_ENTITY_TYPE_SLUG) so they can be overridden per deployment.
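
As a rough illustration of how the constants and getters might relate, the sketch below resolves each slug from an optional override and falls back to the default. The override record (`SlugOverrides`) and the factory (`makeSlugGetters`) are hypothetical names; the actual per-deployment override mechanism in features/source-sync/config.ts is not shown on this page.

```typescript
// Defaults mirror the constants named above. The override record is a
// hypothetical stand-in for whatever per-deployment mechanism config.ts uses.
const SOURCE_ENTITY_TYPE_SLUG = "source"
const SOURCE_ITEM_ENTITY_TYPE_SLUG = "source-item"

interface SlugOverrides {
  source?: string
  sourceItem?: string
}

function makeSlugGetters(overrides: SlugOverrides = {}) {
  return {
    // Each getter prefers the override and falls back to the default constant.
    getSourceEntityTypeSlug: (): string =>
      overrides.source ?? SOURCE_ENTITY_TYPE_SLUG,
    getSourceItemEntityTypeSlug: (): string =>
      overrides.sourceItem ?? SOURCE_ITEM_ENTITY_TYPE_SLUG,
  }
}

// Callers build lookups from the getters instead of string literals.
const slugGetters = makeSlugGetters({ sourceItem: "research-article" })
const itemTypeSlug = slugGetters.getSourceItemEntityTypeSlug()
```

Code that hardcodes "source-item" would silently miss entities in a deployment configured like the one above, which is why the getters are the recommended access path.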

Scrape strategies

Strategy	When used
`firecrawl`	Default. Requires `FIRECRAWL_API_KEY`. Best for public HTML
`browser`	Authenticated pages; requires an external browser connection
`http`	Plain fetch + readable-HTML parser. No external dependency
`auto`	Module selects based on source type and environment

`ScrapeStrategy` type

type ScrapeStrategy = "auto" | "http" | "firecrawl" | "browser"

`ExtractUrlResult` type

Normalized output from extractUrlToArticle and the underlying primitives:

interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
}

`HtmlArticleContent` type

Output of extractArticleFromHtml:

interface HtmlArticleContent {
  title: string | null
  summary: string | null
  body: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  imageUrl: string | null
}
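
The two shapes overlap but use different field names. One plausible way to bridge them is sketched below; the correspondence (body to text, summary to excerpt) and the helper name `toExtractUrlResult` are assumptions for illustration, not the module's documented mapping.

```typescript
interface HtmlArticleContent {
  title: string | null
  summary: string | null
  body: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  imageUrl: string | null
}

interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
}

// Assumed field correspondence: body -> text, summary -> excerpt.
// imageUrl has no counterpart in ExtractUrlResult and is dropped here.
function toExtractUrlResult(
  article: HtmlArticleContent,
  rawHtml: string | null,
  extractor: ExtractUrlResult["extractor"],
): ExtractUrlResult {
  return {
    title: article.title,
    text: article.body,
    excerpt: article.summary,
    author: article.author,
    publishedAt: article.publishedAt,
    canonicalUrl: article.canonicalUrl,
    rawHtml,
    extractor,
  }
}
```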

How It Works

Feed sync flow

A source entity is created with source_type, url, scrape_strategy, and schedule fields.
A scheduled dispatcher or manual trigger calls the source-sync admin action.
For each new item discovered in the feed, the pipeline calls the appropriate scraper (firecrawlScrapeHtml → extractArticleFromHtml, or extractArticleWithBrowserAgent, or plain fetch + readable HTML).
A source-item entity is created via createEntityKeyed. Relations to the parent source are attached automatically.
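
The per-item loop in the steps above can be sketched with an in-memory store. This is a simplified model, assuming that keyed creation is idempotent on its key; `InMemoryEntityStore`, `syncFeed`, and the key format are hypothetical and do not reflect the real createEntityKeyed signature.

```typescript
interface FeedItem {
  url: string
  title: string
}

interface SourceItemRecord {
  key: string
  title: string
  parentSourceSlug: string
}

// Hypothetical stand-in for the entity layer: creation is a no-op when
// the key already exists, so re-running a sync does not duplicate items.
class InMemoryEntityStore {
  private records = new Map<string, SourceItemRecord>()

  createKeyed(record: SourceItemRecord): boolean {
    if (this.records.has(record.key)) return false
    this.records.set(record.key, record)
    return true
  }

  get size(): number {
    return this.records.size
  }
}

// Returns how many source-items were newly created in this run.
function syncFeed(
  store: InMemoryEntityStore,
  sourceSlug: string,
  feedItems: FeedItem[],
): number {
  let created = 0
  for (const item of feedItems) {
    // Assumed key format: parent source slug plus item URL.
    const key = `${sourceSlug}:${item.url}`
    const wasCreated = store.createKeyed({
      key,
      title: item.title,
      parentSourceSlug: sourceSlug, // models the relation to the parent source
    })
    if (wasCreated) created += 1
  }
  return created
}
```

Scraping is omitted here to keep the focus on the create step; in the real pipeline each new item would pass through the selected scraper before the entity is written.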

Single-URL ingestion flow (added 2026-04-21)

Any agent with the ingestUrl tool, or server code calling extractUrlToArticle directly, can ingest a single URL without a configured feed:

extractUrlToArticle({ url, strategy? }) composes the same Firecrawl → browser-agent → readable-HTML fallback chain used by feed sync.
The ingestUrl tool wraps this, creates the source-item entity, and optionally attaches relations by slug.
The capture URL routing hook (features/capture/server/route-url-to-source-item.ts) fires this path automatically when a capture contains a URL and the tenant has a scout agent configured.
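
The fallback chain described above can be sketched as an ordered list of steps, where the first step that yields body text wins. This is a minimal model rather than the module's implementation: `ChainStep`, `extractWithFallback`, and the rule that a result without text counts as a miss are assumptions.

```typescript
type ExtractorName = "firecrawl" | "browser-agent" | "readable"

interface ExtractUrlResult {
  title: string | null
  text: string | null
  excerpt: string | null
  author: string | null
  publishedAt: string | null
  canonicalUrl: string | null
  rawHtml: string | null
  extractor: ExtractorName | null
}

const EMPTY_RESULT: ExtractUrlResult = {
  title: null,
  text: null,
  excerpt: null,
  author: null,
  publishedAt: null,
  canonicalUrl: null,
  rawHtml: null,
  extractor: null,
}

// One link in the chain. `run` is a hypothetical adapter around a real scraper.
interface ChainStep {
  extractor: ExtractorName
  run: (url: string) => Promise<Partial<ExtractUrlResult> | null>
}

async function extractWithFallback(
  url: string,
  steps: ChainStep[],
): Promise<ExtractUrlResult> {
  for (const step of steps) {
    try {
      const partial = await step.run(url)
      // Assumption: a result without body text is a miss, so keep going.
      if (partial && partial.text) {
        return { ...EMPTY_RESULT, ...partial, extractor: step.extractor }
      }
    } catch {
      // This scraper failed; fall through to the next step.
    }
  }
  // Nothing succeeded: return all nulls instead of throwing.
  return { ...EMPTY_RESULT }
}
```

Passing the steps in the order Firecrawl, browser-agent, readable HTML reproduces the chain named above, and the `extractor` field of the result records which step succeeded.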

Capture → source-item routing

When a capture is created:

route-url-to-source-item.ts scans the capture body for a URL.
If a URL is found, the hook checks whether the tenant has the source-item entity type configured and has an agent with ingestUrl in its customTools.
If both checks pass, fires a capture.url.routed Inngest event invoking the scout agent.
If the capture body contains a string matching a modality slug, that slug is passed along as a soft linkToEntities hint.
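
The routing decision above can be sketched as a pure function that returns either an event to send or null. The tenant and event shapes (`TenantSnapshot`, `CaptureUrlRoutedEvent`), the URL regex, and the "about" relation on the hint are all assumptions for illustration; only the event name and the two checks come from the description above.

```typescript
// Hypothetical shapes: the real tenant and event types are not shown on this page.
interface TenantSnapshot {
  entityTypeSlugs: string[]
  agents: Array<{ slug: string; customTools: string[] }>
  modalitySlugs: string[]
}

interface CaptureUrlRoutedEvent {
  name: "capture.url.routed"
  data: {
    url: string
    agentSlug: string
    linkToEntities: Array<{ slug: string; relationType: string }>
  }
}

function routeCaptureUrl(
  body: string,
  tenant: TenantSnapshot,
  sourceItemSlug: string,
): CaptureUrlRoutedEvent | null {
  // Scan the capture body for the first http(s) URL.
  const match = body.match(/https?:\/\/[^\s<>"')]+/)
  if (!match) return null

  // Both checks must pass: entity type configured, and an agent with ingestUrl.
  if (!tenant.entityTypeSlugs.includes(sourceItemSlug)) return null
  const scout = tenant.agents.find((agent) =>
    agent.customTools.includes("ingestUrl"),
  )
  if (!scout) return null

  // Soft hint: any modality slug that appears verbatim in the body.
  const lowered = body.toLowerCase()
  const linkToEntities = tenant.modalitySlugs
    .filter((slug) => lowered.includes(slug.toLowerCase()))
    .map((slug) => ({ slug, relationType: "about" }))

  return {
    name: "capture.url.routed",
    data: { url: match[0], agentSlug: scout.slug, linkToEntities },
  }
}
```

Returning the event instead of sending it keeps the decision logic testable without an Inngest client.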

API Reference

`extractUrlToArticle(params)` — `features/source-sync`

Composes the full extraction chain for a single URL.

import { extractUrlToArticle } from "@/features/source-sync"

const result = await extractUrlToArticle({
  url: "https://example.com/article",
  strategy: "firecrawl",           // optional, defaults to "firecrawl"
  browserConnection: null,          // AgentConnectionInternal | null
  contentSelector: ".article-body", // optional CSS selector
  traceContext: undefined,
})
// result.extractor tells you which path succeeded

Parameters are passed as an object to allow future additions without breaking callers.

`extractArticleFromHtml(params)` — `features/source-sync`

Parse article content from an HTML string using Cheerio + readability heuristics.

import { extractArticleFromHtml } from "@/features/source-sync"

const article = extractArticleFromHtml({
  html: rawHtmlString,
  pageUrl: "https://example.com/article",
  contentSelector: null, // optional override
})
// returns HtmlArticleContent

`firecrawlScrapeHtml(url)` — `features/source-sync`

Scrape a URL via the Firecrawl API. Returns { html, title, description, finalUrl }. Requires FIRECRAWL_API_KEY. Throws on network or API errors.

import { firecrawlScrapeHtml, isFirecrawlEnabled } from "@/features/source-sync"

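if (isFirecrawlEnabled()) {
  // Throws on network or API errors; catch here if the caller can recover
  const scraped = await firecrawlScrapeHtml(url)
  // scraped has the shape { html, title, description, finalUrl }
}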

`extractArticleWithBrowserAgent(params)` — `features/source-sync`

Extract article content via a configured browser-agent external connection. Used for authenticated or JavaScript-heavy pages.

import { extractArticleWithBrowserAgent } from "@/features/source-sync"

const article = await extractArticleWithBrowserAgent({
  connection,        // AgentConnectionInternal
  url,
  contentSelector: null,
  traceContext: undefined,
})

`shouldUseFirecrawl(strategy, sourceType)` / `shouldUseBrowserAgent(strategy, sourceType)` — `features/source-sync`

Strategy-selection predicates. Feed sync uses these internally; exposed for custom ingestion paths.
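
To make the idea of strategy selection concrete, the sketch below shows one way such predicates could behave. It is not the module's logic: the functions are deliberately named `sketchShouldUse...`, the `ScrapeEnv` flags are a hypothetical third argument (the real predicates take only strategy and sourceType), and "authenticated" is an assumed source_type value.

```typescript
type ScrapeStrategy = "auto" | "http" | "firecrawl" | "browser"

// Hypothetical environment flags, passed explicitly to keep the sketch
// self-contained. The real predicates do not take this argument.
interface ScrapeEnv {
  firecrawlEnabled: boolean
  hasBrowserConnection: boolean
}

// Browser-agent: explicit "browser", or "auto" on an authenticated source,
// and only when a browser connection is available.
function sketchShouldUseBrowserAgent(
  strategy: ScrapeStrategy,
  sourceType: string,
  env: ScrapeEnv,
): boolean {
  if (!env.hasBrowserConnection) return false
  if (strategy === "browser") return true
  return strategy === "auto" && sourceType === "authenticated"
}

// Firecrawl: explicit "firecrawl", or "auto" when the browser agent does
// not claim the source, and only when Firecrawl is enabled.
function sketchShouldUseFirecrawl(
  strategy: ScrapeStrategy,
  sourceType: string,
  env: ScrapeEnv,
): boolean {
  if (!env.firecrawlEnabled) return false
  if (strategy === "firecrawl") return true
  return (
    strategy === "auto" &&
    !sketchShouldUseBrowserAgent(strategy, sourceType, env)
  )
}
```

When both predicates return false, a custom ingestion path would fall through to the plain "http" route.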

`getSourceEntityTypeSlug()` / `getSourceItemEntityTypeSlug()` — `features/source-sync`

Return the configured entity type slugs. Always use these rather than hardcoding "source" / "source-item".

`ingestUrl` agent tool — `features/tools/source-tools.ts`

Registered slug: "ingestUrl". Category: "source". Required permission: entities.team.create.

// Input schema
{
  url: string            // required, absolute URL
  sourceSlug?: string    // optional parent source entity slug for provenance
  linkToEntities?: Array<{
    slug: string         // existing entity slug in this tenant
    relationType?: string // optional, defaults to "about"
  }>
}

// Return value
{
  entityId: string
  slug: string
  title: string | null
  excerpt: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
  linked: Array<{ targetSlug, targetId, relationshipType }>
  unresolvedSlugs: string[]
}

The tool throws "Entity type not found: {slug}" if the tenant has not configured the source-item entity type.
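
A caller-side pre-flight check for the input schema might look like the sketch below. `normalizeIngestUrlInput` is a hypothetical helper, not part of the tool; restricting "absolute URL" to http(s) and treating relationType as omittable (based on the "about" default noted in the schema) are both assumptions.

```typescript
interface IngestUrlInput {
  url: string
  sourceSlug?: string
  linkToEntities?: Array<{ slug: string; relationType?: string }>
}

interface NormalizedIngestUrlInput {
  url: string
  sourceSlug: string | null
  linkToEntities: Array<{ slug: string; relationType: string }>
}

// Hypothetical helper: validates the URL and fills in schema defaults
// before the input is handed to the ingestUrl tool.
function normalizeIngestUrlInput(input: IngestUrlInput): NormalizedIngestUrlInput {
  const url = input.url.trim()
  // Assumption: "absolute URL" means an http or https URL.
  if (!/^https?:\/\/\S+$/i.test(url)) {
    throw new Error(`ingestUrl requires an absolute URL, got: ${url}`)
  }
  return {
    url,
    sourceSlug: input.sourceSlug ?? null,
    linkToEntities: (input.linkToEntities ?? []).map((link) => ({
      slug: link.slug,
      relationType: link.relationType ?? "about", // schema default
    })),
  }
}
```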

`ingest-url-to-source-item` task convention

Tenants that want a UI-triggerable ingestion task should create a DB task row with slug ingest-url-to-source-item, assigned to their scout agent, with trigger_type: "manual" and output_type: "entity". This is a convention and is not enforced by the platform; the task is optional per tenant.

For Agents

The primary agent-facing affordance is the ingestUrl tool:

Tool: ingestUrl
Input: { url, sourceSlug?, linkToEntities? }

Agents should:

Call ingestUrl to create a source-item from a URL.
Optionally pass linkToEntities to attach the source-item to relevant modality, topic, or claim entities immediately.
After ingestion, note the returned entityId — downstream tasks (grading, claim extraction) will reference it.
If unresolvedSlugs is non-empty, warn the user — those relation targets were not found.
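
The post-ingestion steps above can be sketched as a small result handler. `summarizeIngest` and the wording of the warning are hypothetical, and the string types on the `linked` entries are assumed, since the return-value schema lists those fields without types.

```typescript
interface IngestUrlResult {
  entityId: string
  slug: string
  title: string | null
  excerpt: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
  // Field types inside `linked` are assumed to be strings.
  linked: Array<{ targetSlug: string; targetId: string; relationshipType: string }>
  unresolvedSlugs: string[]
}

interface IngestSummary {
  entityId: string // keep for downstream grading or claim-extraction tasks
  linkedSlugs: string[]
  warning: string | null
}

function summarizeIngest(result: IngestUrlResult): IngestSummary {
  // Surface relation targets that were requested but not found.
  const warning =
    result.unresolvedSlugs.length > 0
      ? `Relation targets not found: ${result.unresolvedSlugs.join(", ")}`
      : null
  return {
    entityId: result.entityId,
    linkedSlugs: result.linked.map((link) => link.targetSlug),
    warning,
  }
}
```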

To read evidence grades on a source-item, call getEntity with includeResponses: true (see Entity System).

Design Decisions

Params-object signature for extractUrlToArticle — positional args would require callers to pass undefined for every optional after the first. The params object pattern is consistent with the rest of the platform and allows non-breaking additions.

ingestUrl registered in features/tools/source-tools.ts, not features/custom/ — the tool has zero product-specific logic. Any tenant with source-item configured benefits. Keeping it in platform ensures it ships to all forks.

extractUrlToArticle falls back silently — network or scraper failures return an EMPTY_RESULT (all nulls, extractor: null) rather than throwing. The ingestUrl tool still creates the entity with title falling back to the raw URL, so the user sees a record and can fix metadata manually. Throwing on every transient scrape failure would break bulk sweep tasks.
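
The effect on a bulk sweep can be sketched as follows: every URL yields a record, and failed extractions are marked instead of aborting the batch. `ExtractOutcome`, `toSourceItemDrafts`, and the `needsReview` flag are hypothetical names; only the title fallback to the raw URL comes from the behavior described above.

```typescript
interface ExtractOutcome {
  url: string
  title: string | null
  extractor: "firecrawl" | "browser-agent" | "readable" | null
}

interface SourceItemDraft {
  url: string
  title: string
  needsReview: boolean // hypothetical flag for records needing manual metadata
}

// Every outcome produces a draft, including the all-null ones, so a single
// failed scrape never removes a URL from the batch.
function toSourceItemDrafts(outcomes: ExtractOutcome[]): SourceItemDraft[] {
  return outcomes.map((outcome) => ({
    url: outcome.url,
    title: outcome.title ?? outcome.url, // fall back to the raw URL
    needsReview: outcome.extractor === null,
  }))
}
```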

Capture routing is a thin hook, not a task trigger — the capture hook fires an Inngest event rather than directly enqueuing a session-executor task. This decouples capture latency from ingestion latency and keeps the capture server action fast.

No @modality mention grammar in this release — the soft slug-match hint (capture body contains a string matching a modality slug) ships as a convenience. A formal @slug mention parser is deferred to _backlog/idea-capture-mention-parser.md.

Related

Entity System — source and source-item are standard entity types
Research Library — DOC'S worked example using source-sync as the ingestion layer
Capture — capture URL routing hook fires ingestUrl
Tool System — ingestUrl registration and execution
Inngest — feed sync scheduling and recompute-claim-aggregates-on-response
