RockAI docs

Knowledge sources

Sources are how an agent learns about your business. This page covers every kind of source, the ingestion pipeline, and what to expect after you click "Add".

Source types

| Type | Use it for | What we ingest |
| --- | --- | --- |
| url | One specific page | Crawl + extract main content + chunk + embed |
| sitemap | A whole site at once | Read sitemap, fan out to one CrawlPageJob per URL |
| feed | RSS / Atom blogs | Same as sitemap but reads <item> entries |
| text | FAQs, snippets, anything you can paste | Skip the crawl, chunk + embed directly |
| notion | Notion pages or databases | OAuth into Notion, fetch via API, treat each page as a document |
| google_doc | Google Docs (Workspace) | OAuth, fetch via Drive API, ingest as a document |
| google_sheet | Google Sheets tabs | OAuth, pull every non-empty row, one Document body with Row N: … markers |
| sql | Remote MySQL / PostgreSQL | Direct PDO read-only SELECT; one Document body with Row N: … markers. Credentials AES-256-GCM at rest. |
| file | PDF / DOCX / XLSX uploads | Parsed via Cloudflare toMarkdown (free tier), chunked + embedded |
| woocommerce_products | WP / WooCommerce stores | Synced by the companion WordPress plugin |
| auto | Pages visitors land on | Auto-queued by AutoIndexPageVisit from /v1/widget/init |

SQL database source (MySQL + PostgreSQL)

Connect a read-only MySQL or PostgreSQL database directly as a knowledge source. Every non-empty row your SELECT returns becomes part of the agent's knowledge base, and the LLM can cite individual rows as Row N references.

Setup steps:

  1. Open /app/agents/{id}/sources, scroll to Add SQL database.
  2. Pick the driver (MySQL or PostgreSQL). The port autofills to 3306 or 5432.
  3. Paste host, database name, username, password.
  4. Paste a SELECT query. Every row it returns becomes part of the agent's knowledge base.
  5. Optional column mapping: pick a title_column (used as the row label) and a body_column (used as the row content). When omitted, every column is concatenated as colname: value pairs.
  6. Click Connect & sync. The first SyncSqlSourceJob runs on the crawl queue, indexes the rows, and flips the source to indexed.

Hard rules enforced on every query:

  • SSRF guard. The host runs through UrlSafetyGuard::assertSafe(), the same check the crawler uses. 127.0.0.1, RFC1918 ranges, link-local 169.254.0.0/16 (which contains the AWS metadata IP), and any name that resolves to a private IP are all refused. The source flips to failed with "Unsafe SQL host: Refusing to crawl an internal / loopback / link-local host."
  • SELECT-only. The query must start with SELECT (case-insensitive). Multi-statement queries (anything containing ;) are rejected. The keywords INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, GRANT, REVOKE, CREATE, REPLACE, CALL, EXEC, EXECUTE, MERGE, LOAD are blocked.
  • Read-only transaction. The connector wraps the query in START TRANSACTION READ ONLY (MySQL) / BEGIN TRANSACTION READ ONLY (PostgreSQL) so even a permissive query string can't mutate state.
  • 5,000-row cap per sync. Protects worker memory and the Document table from runaway queries. Cap your SELECT with a LIMIT if you have a big table you don't want fully indexed.
  • 10-second connect timeout. Hosts that can't be reached fail fast, and buyers see the error inline.
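
The SELECT-only rule above can be sketched in a few lines. This is an illustrative Python version, not the product's PHP connector: the function name validate_select is hypothetical, and whole-word keyword matching is an assumption (the rule does not say how keywords are detected).

```python
import re

# Keywords listed in the SELECT-only rule. Matching them as whole words is an
# assumption of this sketch; column names like created_at are not caught.
BLOCKED_KEYWORDS = (
    "INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE", "GRANT",
    "REVOKE", "CREATE", "REPLACE", "CALL", "EXEC", "EXECUTE", "MERGE", "LOAD",
)
_BLOCKED_RE = re.compile(r"\b(" + "|".join(BLOCKED_KEYWORDS) + r")\b", re.IGNORECASE)


def validate_select(query: str) -> None:
    """Raise ValueError unless the query passes the three documented checks."""
    stripped = query.strip()
    if not re.match(r"select\b", stripped, re.IGNORECASE):
        raise ValueError("Query must start with SELECT")
    if ";" in stripped:
        raise ValueError("Multi-statement queries are rejected")
    match = _BLOCKED_RE.search(stripped)
    if match:
        raise ValueError(f"Blocked keyword: {match.group(1).upper()}")
```

One consequence of keyword blocking, at least under the whole-word matching assumed here, is that a harmless call such as SELECT REPLACE(name, 'a', 'b') would also be refused. The read-only transaction remains the actual safety net.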

Credentials at rest:

The host, port, database name, username, and password are stored in sources.credentials_encrypted (text column) and encrypted via Laravel's encrypted:array cast (AES-256-GCM with the install's APP_KEY). The non-sensitive bits (driver, query, title/body column mappings, optional label) live in the regular config JSON column.

Reading the raw column produces ciphertext only; the plaintext is visible only inside worker memory during a sync run and is never logged. Rotating APP_KEY will require buyers to re-enter the password (standard Laravel behaviour; see Security).

What gets indexed:

Each row becomes a labeled line in a single Document body. When you set title_column = "title" and body_column = "body", a row with { title: "Welcome", body: "Hello world" } renders as:

Welcome: Hello world

Without a body column it falls back to a key/value join of every non-null column:

Row 1 — id: 42, name: ACME Corp, plan: Pro, last_seen: 2026-05-13

The Document body is then chunked and embedded through the same IndexDocumentJob pipeline every other source uses, so SQL rows surface in retrieval just like crawled pages or pasted text.
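
The two rendering modes described above can be sketched as follows. This is an illustrative Python version under stated assumptions: render_row and render_document are hypothetical names, the dash separator after "Row N" is copied from the sample line (written as the \u2014 escape), and the sketch assumes the title/body form applies only when both columns are mapped.

```python
from typing import Any, Iterable, Mapping, Optional


def render_row(row: Mapping[str, Any], n: int,
               title_column: Optional[str] = None,
               body_column: Optional[str] = None) -> str:
    """Render one result row as a single labeled line."""
    if title_column and body_column:
        # Mapped form: the title column labels the row, the body column is its content.
        return f"{row[title_column]}: {row[body_column]}"
    # Fallback: key/value join over every non-null column, prefixed with Row N.
    pairs = ", ".join(f"{key}: {value}" for key, value in row.items() if value is not None)
    return f"Row {n} \u2014 {pairs}"


def render_document(rows: Iterable[Mapping[str, Any]],
                    title_column: Optional[str] = None,
                    body_column: Optional[str] = None) -> str:
    """Join all rendered rows into one Document body, one row per line."""
    return "\n".join(
        render_row(row, n, title_column, body_column)
        for n, row in enumerate(rows, start=1)
    )
```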

Re-sync today: manual via the source's Refresh action. Periodic auto-sync isn't scheduled yet, same as Notion / Google Doc / Google Sheet. File a card if you want a cron.

Drivers not shipped yet: MSSQL and Oracle. Both require PHP extensions (pdo_sqlsrv / oci8) that aren't bundled by default and aren't universal across CodeCanyon hosts. Open a feature request if you need them.

Add a source

Open /app/agents/{id}/sources. The Add source modal handles all types in one form. Behind the scenes:

  1. Validate. URLs must be http/https; private hosts (10.x, 192.168.x, 127.x, ::1) are blocked to prevent SSRF.
  2. Create the source row with status = pending.
  3. Dispatch a job: CrawlSourceJob for url/sitemap/feed; IngestNotionPageJob/IngestGoogleDocJob for connected sources; IndexTextSourceJob for pasted text.
  4. The job runs on the crawl queue, fetches content, creates Document rows, then dispatches IndexDocumentJob on the index queue.
  5. The status flips from pending → crawling → done (or failed with an error message you can read in the UI).
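
The validation in step 1 can be sketched as below. This is an illustrative Python version, not the product's validator: validate_source_url and is_blocked_address are hypothetical names, and the sketch only inspects IP literals. A real guard would also resolve hostnames and check every resolved address, which is left out here to keep the example offline.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_address(ip: str) -> bool:
    """True for loopback, private (RFC1918 and similar), and link-local addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_loopback or addr.is_private or addr.is_link_local


def validate_source_url(url: str) -> None:
    """Raise ValueError for non-http(s) URLs and for internal IP literals."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("URL must be http or https")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    try:
        blocked = is_blocked_address(host)
    except ValueError:
        # Not an IP literal. Assumption: DNS resolution and a second check on
        # the resolved addresses would happen here in a full guard.
        return
    if blocked:
        raise ValueError("Refusing private / loopback / link-local host")
```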

"Indexing didn't finish": diagnosing failed pages

A page can land in the Knowledge view with 0 chunks and an amber "Indexing didn't finish" badge. The expanded row now shows the actual error from sources.error when present, plus the crawler used and last-fetched timestamp. Most failures map to one of these:

  • JavaScript-only / SPA pages with no SSR fallback: Cloudflare Browser Rendering executes JS, but if the app fully hydrates client-side and exposes no scrapeable text, the extractor returns under the 200-character floor and the source is marked failed.
  • Bot challenge / Cloudflare protection: the fetched HTML is the challenge page, not the real content. Detected via detectBlocker() heuristics.
  • Login wall: the site requires auth; we don't run authenticated crawls.
  • Soft 404: many sites return a 200-OK "not found" page when a URL is mistyped. looksLike404() rejects these.
  • Queue worker behind: the document row was created but IndexDocumentJob hasn't run yet. Wait a minute and refresh; if it persists, check Queue Health.

Click Reindex on a failed row to retry the same pipeline (URL re-crawl + re-index, or file re-parse for uploads). Persistent failures usually mean the URL itself is unscrapable. Try a different page on the same site, or upload the content as a file.

Switching embedding models

The Cloudflare Vectorize index is provisioned at the exact dimension of the embedding model that was active when it was first created. Changing CLOUDFLARE_EMBED_MODEL from a 768-dim model (bge-base-en-v1.5) to a 1024-dim model (bge-m3, bge-large-en-v1.5) or back will cause every IndexDocumentJob to crash with:

Cloudflare 40012: invalid vector for id="...", expected 768 dimensions, and got 1024 dimensions

Pitchbar now detects this BEFORE sending the upsert, surfaces an actionable error, and ships a recovery command. Known model→dim map (auto-applied when VECTOR_DIM env is unset):

| Model | Dim |
| --- | --- |
| @cf/baai/bge-small-en-v1.5 | 384 |
| @cf/baai/bge-base-en-v1.5 (default) | 768 |
| @cf/baai/bge-large-en-v1.5 | 1024 |
| @cf/baai/bge-m3 | 1024 |
| text-embedding-3-small | 1536 |
| text-embedding-3-large | 3072 |
| text-embedding-ada-002 | 1536 |
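
The resolution order described above (VECTOR_DIM wins, otherwise the known map applies) and the pre-upsert check can be sketched like this. It is an illustrative Python version: resolve_dim and assert_dim_matches are hypothetical names, and raising on an unknown model is an assumption, since the source does not say what happens in that case.

```python
from typing import Mapping, Sequence

# Model-to-dimension map copied from the table above.
MODEL_DIMS = {
    "@cf/baai/bge-small-en-v1.5": 384,
    "@cf/baai/bge-base-en-v1.5": 768,
    "@cf/baai/bge-large-en-v1.5": 1024,
    "@cf/baai/bge-m3": 1024,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def resolve_dim(model: str, env: Mapping[str, str]) -> int:
    """Return VECTOR_DIM when set, otherwise the known dimension of the model."""
    override = env.get("VECTOR_DIM")
    if override:
        return int(override)
    if model not in MODEL_DIMS:
        # Assumption: unknown models require an explicit VECTOR_DIM.
        raise ValueError(f"Unknown embedding model {model!r}; set VECTOR_DIM")
    return MODEL_DIMS[model]


def assert_dim_matches(vector: Sequence[float], index_dim: int) -> None:
    """Fail before the upsert when the vector does not fit the index."""
    if len(vector) != index_dim:
        raise ValueError(
            f"Index expects {index_dim} dimensions but the model produced "
            f"{len(vector)}; rebuild the index for the new model"
        )
```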

Recovery: drop the existing index, recreate at the new dim, and re-dispatch IndexDocumentJob for every document:

php artisan vector:rebuild-index           # interactive: asks before proceeding
php artisan vector:rebuild-index --force   # for automation / CI
php artisan vector:rebuild-index --dim=1024  # override the resolved dim

The command resets every Source to pending, deletes every Chunk row, drops the Vectorize index, recreates it at the target dim, and queues a re-index job per document onto the index queue. File-backed Documents re-index from the persisted text on disk; URL-only Documents need a manual Reindex click (which triggers CrawlPageJob to re-fetch).

Auto-discovery

On the sources page, the Discover button takes a domain and probes it for crawlable pages without you having to list them. We:

  • Read robots.txt for sitemap declarations.
  • Probe a sitemap directly when present.
  • Try a small set of common paths: /about, /pricing, /features, /products, /faq, /docs, /help, /support, /contact.
  • Return a checkable list. Tick which to ingest, hit Add selected.
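
The first and third probes above are simple enough to sketch. This is an illustrative Python version: sitemaps_from_robots and candidate_urls are hypothetical names, and the network fetch itself is left out.

```python
from typing import List
from urllib.parse import urljoin

# Common paths listed in the discovery steps above.
COMMON_PATHS = ["/about", "/pricing", "/features", "/products", "/faq",
                "/docs", "/help", "/support", "/contact"]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Collect the URLs of every 'Sitemap:' declaration in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def candidate_urls(domain_root: str) -> List[str]:
    """Build the list of common-path URLs to probe for a domain."""
    return [urljoin(domain_root, path) for path in COMMON_PATHS]
```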

Sitemap fan-out

Adding a Source of type sitemap dispatches one CrawlPageJob per URL in the sitemap, staggered by a small per-page delay so Cloudflare Browser Rendering doesn't rate-limit on burst. The discoverer (SitemapDiscoverer) handles three input shapes:

  • Domain root (https://example.com): probes /sitemap.xml + /sitemap_index.xml.
  • Direct sitemap URL (https://example.com/sitemap.xml or https://example.com/products/sitemap.xml): fetched verbatim. Before the fix, the discoverer appended a second /sitemap.xml here and the request returned 404.
  • Sitemap index (the <sitemapindex> XML that many CMSes, including WordPress, Shopify, and Webflow, emit by default): recurses one level into each child sitemap and aggregates page URLs.

Output is deduped (so a URL listed in two child sitemaps gets indexed once) and capped at services.crawl.max_pages_per_source (default 500, override via CRAWL_MAX_PAGES_PER_SOURCE). The cap used to be 25, so a buyer adding a 100-URL sitemap silently lost 75 pages; the new default is generous enough for most marketing / docs sites. Very large catalogues should split the sitemap by section anyway.
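
The one-level recursion, dedupe, and cap can be sketched as below. This is an illustrative Python version, not SitemapDiscoverer itself: collect_page_urls is a hypothetical name, the fetch callable is injected so the example runs offline, and the sketch assumes sitemaps that use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def _read_locs(xml_text: str) -> Tuple[bool, List[str]]:
    """Return (is_sitemap_index, list of <loc> values) for one sitemap body."""
    root = ET.fromstring(xml_text)
    is_index = root.tag == SM_NS + "sitemapindex"
    locs = [el.text.strip() for el in root.iter(SM_NS + "loc") if el.text]
    return is_index, locs


def collect_page_urls(sitemap_url: str, fetch: Callable[[str], str],
                      max_pages: int = 500) -> List[str]:
    """Expand a sitemap or sitemap index into a deduped, capped list of page URLs."""
    is_index, locs = _read_locs(fetch(sitemap_url))
    if is_index:
        pages: List[str] = []
        for child_url in locs:
            child_is_index, child_locs = _read_locs(fetch(child_url))
            if not child_is_index:  # recurse one level only
                pages.extend(child_locs)
    else:
        pages = locs
    deduped = list(dict.fromkeys(pages))  # order-preserving dedupe
    return deduped[:max_pages]
```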

Crawler strategies

The crawler is provider-driven. In order of preference:

  1. Cloudflare Browser Rendering: preferred. Full JS rendering, fast, no SSRF risk because egress is on Cloudflare. Used when CLOUDFLARE_ACCOUNT_ID + CLOUDFLARE_API_TOKEN are set.
  2. Browserless: fallback when BROWSERLESS_TOKEN is set. Same headless-Chrome behavior on a different vendor.
  3. Plain HTTP: last resort for server-rendered sites. No JS execution. Free.

Once HTML is in hand, ReadabilityExtractor strips nav, footer, ads, etc., leaving the article body. Pages under 200 chars or detected as 404s are dropped.
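
The preference order and the length floor can be sketched as follows. This is an illustrative Python version: pick_crawler and keep_page are hypothetical names and the returned strategy labels are made up for the example; only the environment variable names and the 200-character floor come from the text above.

```python
from typing import Mapping

MIN_CONTENT_CHARS = 200  # pages shorter than this are dropped


def pick_crawler(env: Mapping[str, str]) -> str:
    """Choose a crawl strategy in the documented order of preference."""
    if env.get("CLOUDFLARE_ACCOUNT_ID") and env.get("CLOUDFLARE_API_TOKEN"):
        return "cloudflare_browser_rendering"
    if env.get("BROWSERLESS_TOKEN"):
        return "browserless"
    return "plain_http"


def keep_page(extracted_text: str) -> bool:
    """Apply the length floor to the extracted article body."""
    return len(extracted_text.strip()) >= MIN_CONTENT_CHARS
```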

File upload parsing

Direct file uploads (Sources → Upload files) are parsed locally first, then handed to the same chunk + embed pipeline crawled pages use. The parser is picked by file extension:

| Extension | Parser | Network call? |
| --- | --- | --- |
| .pdf, .docx, .doc, .xlsx, .xls, .odt, .ods | Cloudflare Workers AI toMarkdown when CF creds are configured; Smalot\PdfParser / PhpOffice\PhpWord otherwise (no local fallback for spreadsheet / OpenDocument formats) | Yes: one multipart POST per file to /ai/tomarkdown (free of cost, 0 Neurons) |
| .csv | League\Csv: emits one segment per row formatted as col: value \| col: value | No |
| .md, .markdown, .txt | Plain text, split on H1/H2 headings | No |

The Cloudflare path is preferred for binary office formats because Smalot and PhpWord are unreliable on real-world documents: Word-exported PDFs that put body text in one big content stream, scanned PDFs with a thin text layer, and DOCX files with nested tables or text frames all tend to extract poorly. Workers AI's toMarkdown returns structured markdown (headings, lists, tables preserved), which feeds the chunker much better.

Pricing: toMarkdown is free for every format above. Only image-to-markdown conversion consumes Workers AI Neurons (we do not send images). When Cloudflare credentials are absent (BYOK OpenAI customers, fresh installs), or when the Cloudflare call fails, the in-process PHP parsers take over so PDF / DOCX / CSV / TXT / MD uploads never silently break.

Spreadsheet and OpenDocument uploads (.xlsx, .xls, .ods, .odt) require Cloudflare Workers AI. There is no local fallback. When those formats are uploaded to a workspace without CLOUDFLARE_ACCOUNT_ID + CLOUDFLARE_API_TOKEN configured, the source row is created with status=failed and the error stamp includes an actionable hint: "Spreadsheet / OpenDocument formats need Cloudflare Workers AI." Admins who need Excel ingestion on a BYOK-OpenAI install should export to CSV (the local League\Csv parser handles that format with no external dependency).
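
The selection rules in this section can be sketched as one function. This is an illustrative Python version: pick_parser and the returned labels are hypothetical, routing .doc to the PhpWord fallback is an assumption, and the runtime fallback after a failed Cloudflare call is not modeled.

```python
import os

OFFICE_EXTENSIONS = {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".odt", ".ods"}
CLOUDFLARE_ONLY = {".xlsx", ".xls", ".ods", ".odt"}  # no local fallback
TEXT_EXTENSIONS = {".md", ".markdown", ".txt"}


def pick_parser(filename: str, has_cloudflare: bool) -> str:
    """Pick a parser label from the file extension and available credentials."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".csv":
        return "league_csv"
    if ext in TEXT_EXTENSIONS:
        return "plain_text"
    if ext in OFFICE_EXTENSIONS:
        if has_cloudflare:
            return "cloudflare_tomarkdown"
        if ext in CLOUDFLARE_ONLY:
            raise ValueError("Spreadsheet / OpenDocument formats need Cloudflare Workers AI.")
        # Assumption: .doc goes to the same local parser as .docx.
        return "smalot_pdfparser" if ext == ".pdf" else "phpoffice_phpword"
    raise ValueError(f"Unsupported extension: {ext or filename}")
```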

Whichever parser ran, the resulting text is persisted under storage/app/private/uploads/{source_id}/segment-N.txt. That's the file the Reindex button reads — you don't need to re-upload the original to re-index.

Chunking and embedding

The extractor's text goes into Chunker, a recursive splitter that prefers semantic boundaries:

  1. Split on markdown headings, then blank lines (paragraphs).
  2. Pack paragraphs greedily up to a target size (~2000 chars / ~500 tokens).
  3. If a paragraph is too big, fall back to sentence boundaries.
  4. Char-window as the absolute last resort.
  5. Add a small overlap between chunks so cross-chunk facts stay linkable.

Each chunk is embedded in a batch (default 100 chunks per call) and upserted into the vector store with metadata: agent_id, document_id, chunk_id, url, workspace_id, source_id, lang.
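
The splitting order above can be sketched as a small recursive splitter. This is an illustrative Python version, not the Chunker class itself: chunk_text and _split_oversized are hypothetical names, the 200-character overlap is an assumption (the text only says "small"), and headings are treated simply as paragraph boundaries. Because the overlap is prepended, a chunk may exceed the target by up to the overlap size.

```python
import re
from typing import List


def _split_oversized(text: str, target: int) -> List[str]:
    """Break one oversized paragraph on sentence boundaries, then by char window."""
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        while len(sentence) > target:  # absolute last resort: fixed char window
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:target])
            sentence = sentence[target:]
        if current and len(current) + 1 + len(sentence) > target:
            pieces.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, target: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into chunks of roughly `target` chars with a small overlap."""
    # 1. Split on blank lines and before markdown heading lines.
    units: List[str] = []
    for para in re.split(r"\n\s*\n|\n(?=#{1,6} )", text):
        para = para.strip()
        if para:
            units.extend([para] if len(para) <= target else _split_oversized(para, target))
    # 2. Pack units greedily up to the target size.
    chunks, current = [], ""
    for unit in units:
        if current and len(current) + 2 + len(unit) > target:
            chunks.append(current)
            current = unit
        else:
            current = f"{current}\n\n{unit}" if current else unit
    if current:
        chunks.append(current)
    # 3. Prefix every chunk after the first with the tail of the previous one.
    if overlap <= 0:
        return chunks
    return [c if i == 0 else chunks[i - 1][-overlap:] + c for i, c in enumerate(chunks)]
```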

Crawl retry policy

Each CrawlPageJob attempts up to 3 times with backoff [30s, 90s, 180s]. The retry path is split by failure class:

  • Rate-limited (HTTP 429, "too many requests" upstream): releases back to the queue with a fresh 60-second delay without burning a retry slot. Every fan-out page on the same workspace tends to hit the same 429 wave; the shared wait is productive.
  • Permanent failure (curl DNS resolve / connection refused, HTTP 400 / 401 / 403 / 404 / 410 / 451, malformed URL): short-circuits via fail() so the Source row gets the real reason immediately instead of being stranded behind two more retries that will deterministically fail.
  • Transient (5xx, network blip): normal retry with backoff.
  • Per-job timeout: 90s. failOnTimeout=true so a worker SIGTERM still flips the source to failed with a customer-readable error.
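
The three failure classes can be sketched as a classifier. This is an illustrative Python version: classify_failure is a hypothetical name, and the error-message markers are assumptions about what the upstream text looks like, not strings confirmed by the source.

```python
from typing import Optional

PERMANENT_STATUSES = {400, 401, 403, 404, 410, 451}

# Assumed substrings for DNS failure, refused connection, and malformed URL.
PERMANENT_MARKERS = ("could not resolve host", "connection refused", "malformed")


def classify_failure(status: Optional[int], error: str = "") -> str:
    """Return 'rate_limited', 'permanent', or 'transient' for a failed crawl."""
    message = error.lower()
    if status == 429 or "too many requests" in message:
        return "rate_limited"  # release with a fresh delay, no retry slot used
    if status in PERMANENT_STATUSES:
        return "permanent"     # fail immediately with the real reason
    if any(marker in message for marker in PERMANENT_MARKERS):
        return "permanent"
    return "transient"         # 5xx or network blip: normal retry with backoff
```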

Buyer-facing error messages on the Sources list are sanitized via SourceErrorPresenter: raw upstream JSON envelopes (Cloudflare 401 bodies, Browserless stack traces) get rewritten to friendly lines like "We couldn't reach this page" or "The crawl service is busy right now — we will retry automatically." Operators still see the full raw message under Show details.

Reindex and preview

From the sources list, each row has:

  • Reindex: re-runs the crawl + chunk + embed pipeline. For uploaded files, the reindex reads back the persisted segment text under storage/app/private/uploads/{source_id}/segment-N.txt, so there is no need to re-upload the original. If the persisted file is missing (pre-fix uploads or a disk wipe), the UI surfaces a "Re-upload" prompt.
  • Preview: shows the extracted documents and a sample of chunks so you can spot bad extraction (e.g. nav bar polluting the text).
  • Delete: removes the source, its documents, its chunks, and the corresponding vector points.

Notion and Google Docs

Both use OAuth. Connect once from /app/integrations; the token is encrypted at rest. After connecting, the source modal lets you pick pages or documents directly.

Re-syncs are manual (per-source Reindex button); we don't poll your Notion / Drive on a schedule. If you change a Notion page, click Reindex on that source.

"My agent doesn't know about the file I just uploaded"

Cloudflare Vectorize has eventual consistency on metadata-filtered queries: even after an upsert returns 200 OK, an agent_id-filtered query against that vector typically returns 0 hits for the first 30 to 60 seconds while the metadata index propagates across edge regions.

Practical consequence: a freshly uploaded file shows up as status=indexed in the Sources page immediately, but the agent won't be able to answer questions about it until the propagation window closes. The upload-success banner reminds the admin of this. If the agent still doesn't return relevant chunks after a minute, open the source's Preview to confirm the extracted text isn't empty. Empty text is a parser-side issue, not a vector-side one.

The same gotcha applies to the very first upload after creating a Cloudflare Vectorize index: the index itself has a ~2 minute provisioning lag before any queries return results, even unfiltered ones.
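
If you script against a fresh upload (for example in an integration test), polling until the filtered query returns hits avoids false negatives during the propagation window. The helper below is a minimal sketch and not part of the product: wait_for_hits is a hypothetical name, the query callable stands in for whatever filtered search you run, and the 90-second timeout and 5-second interval are arbitrary defaults.

```python
import time
from typing import Callable, List


def wait_for_hits(query: Callable[[], List],
                  timeout: float = 90.0,
                  interval: float = 5.0,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> List:
    """Re-run `query` until it returns hits or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        hits = query()
        if hits or clock() >= deadline:
            return hits  # empty list means the window never closed in time
        sleep(interval)
```

The sleep and clock parameters exist only so the helper can be exercised without real waiting.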

Storage and retention

  • Postgres: sources, documents, chunks (text + metadata).
  • Vector store: embeddings. Cloudflare Vectorize when configured, Qdrant otherwise.
  • R2 / object storage: original artifacts (PDFs, images) when uploaded.

Deleting a source cascades: documents, chunks, and vector points all go in one transaction. There's no soft-delete on sources.