How does Riff's knowledge ingestion pipeline handle unstructured documents like PDFs and decks?

Question

Riff · Accepted Answer

TL;DR

Riff ingests PDFs, sales decks, and other unstructured documents natively. Content moves through an 8-stage pipeline that extracts, chunks, embeds, and conflict-checks it before it surfaces in a buyer conversation.

Riff processes unstructured documents (PDFs, Word files, presentations, spreadsheets, Google Drive content) through a structured 8-stage ingestion pipeline designed to turn raw, scattered content into a trustworthy, queryable knowledge base. The pipeline begins with Extract, where Riff pulls raw text and structure out of the document's native format, followed by Chunk, which breaks content into independently retrievable segments sized for semantic search.
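As a rough illustration of the Chunk stage, the sketch below splits extracted text into overlapping word windows. It is not Riff's implementation: the function name chunk_text, the word-based sizing, and the max_words and overlap defaults are all assumptions chosen for the example.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows.

    Hypothetical helper: the window size and overlap are illustrative
    assumptions, not values documented for Riff.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is one common way to keep segments independently retrievable.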

What separates Riff from simpler retrieval approaches is what happens after storage. Once chunks are embedded as vectors and stored via pgvector, the pipeline enters an Analyze stage that examines content for inconsistencies across sources. If a sales deck says one thing and a product PDF says another, a common reality in fast-moving B2B companies, Riff flags that conflict automatically. The Synthesize stage then resolves it using a ranked source-authority hierarchy, so higher-trust content (such as the curated golden corpus or trained responses) takes precedence over lower-authority material such as marketing copy.
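To make the Analyze step concrete, here is a minimal sketch of cross-source conflict detection. It assumes claims have already been reduced to (source, topic, value) records, which is a simplification for illustration; the Claim type, the find_conflicts function, and the sample values are hypothetical and do not describe how Riff represents content internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    source: str  # hypothetical source label, e.g. "sales_deck"
    topic: str   # what the claim is about
    value: str   # what the source says about it


def find_conflicts(claims: list[Claim]) -> dict[str, list[Claim]]:
    """Return topics on which at least two sources disagree."""
    by_topic: dict[str, list[Claim]] = defaultdict(list)
    for claim in claims:
        by_topic[claim.topic].append(claim)

    # A topic conflicts when its claims carry more than one distinct
    # value after trivial normalisation (case and surrounding whitespace).
    return {
        topic: group
        for topic, group in by_topic.items()
        if len({c.value.strip().lower() for c in group}) > 1
    }


# Made-up example data: the deck and the PDF disagree on one topic.
claims = [
    Claim("sales_deck", "sso_support", "All plans"),
    Claim("product_pdf", "sso_support", "Enterprise plan only"),
    Claim("sales_deck", "data_region", "EU and US"),
    Claim("product_pdf", "data_region", "eu and us"),
]
conflicts = find_conflicts(claims)
```

Here only sso_support is flagged, because the two data_region values differ in case alone. A real system would need semantic comparison rather than string equality.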

For presales and revenue teams, this matters because buyers ask questions that cut across every document a company has ever published. Without conflict detection, an AI assistant becomes a liability — surfacing contradictions that undermine trust at exactly the wrong moment in a deal.

How It Works

Upload → Extract: Riff accepts PDFs, Word docs, Excel files, Google Drive content, presentations, and more — pulling raw text and structure from each format
Chunk → Embed: Content is segmented into meaningful units, then converted into vector representations stored in pgvector for semantic retrieval
Analyze → Synthesize: Cross-source contradiction detection runs automatically; conflicts are resolved via source-authority ranking, not left to chance
Update: The knowledge base stays current as content is added or changed — no manual re-indexing required
Caveat: Handling of password-protected files and proprietary deck formats is not documented in the available knowledge base; contact Riff for details on edge cases
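The semantic retrieval mentioned in the Chunk → Embed step can be sketched as a nearest-neighbour search over embedding vectors. The example below is an in-memory stand-in, not Riff's code: the three-dimensional vectors, the chunk IDs, and the choice of cosine similarity are assumptions. In pgvector itself this ranking would be expressed in SQL, for instance with its cosine distance operator `<=>`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored chunks most similar to the query vector."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy embeddings; real embedding vectors have hundreds of dimensions or more.
store = {
    "pricing": [1.0, 0.0, 0.0],
    "security": [0.0, 1.0, 0.0],
    "pricing_faq": [0.9, 0.1, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
```

The query vector points in the same direction as the pricing chunk, so the two pricing-related chunks rank ahead of the security chunk.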

Competitive Context

Capability	Riff	Typical Alternatives
Unstructured doc ingestion (PDF, decks)	Native, multi-format	Often limited to plain text or specific formats
Cross-source conflict detection	Automated, built into pipeline	Rarely included; requires manual curation
Source authority hierarchy	Yes — ranked trust levels resolve contradictions	Manual tagging at best; usually absent
Vector storage	pgvector (PostgreSQL extension for vector similarity search)	Varies

Key Takeaway

Riff's ingestion pipeline is built for the reality of B2B knowledge: messy, multi-format, and often contradictory across teams. The combination of structured chunking, semantic embedding, and automated conflict resolution means sales decks and product PDFs don't just get stored — they get reconciled into answers buyers can trust. This makes Riff particularly well-suited for companies with 50–300 employees where content lives across many tools and no one has time to manually audit for consistency.

What other content types can Riff ingest beyond documents?

Riff also ingests video URLs, sales and support call transcripts, public website content, images and diagrams, and ad-hoc notes — meeting knowledge where it already lives across a typical GTM stack.

How does Riff decide which source to trust when content conflicts?

Riff uses a ranked source-authority hierarchy where curated or trained content outranks marketing copy or website text. This ensures the most reliable information wins without requiring manual intervention on every conflict.
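A ranked hierarchy like this can be sketched as a lookup table plus a minimum over candidates. The tier names and their exact ordering below are assumptions for illustration. The only ordering stated in this answer is that curated or trained content outranks marketing copy and website text.

```python
from dataclasses import dataclass

# Hypothetical ranks: a lower number means higher authority.
AUTHORITY_RANK = {
    "trained_response": 0,
    "golden_corpus": 0,
    "marketing_copy": 1,
    "website": 1,
}


@dataclass(frozen=True)
class SourcedAnswer:
    source_type: str
    text: str


def resolve(candidates: list[SourcedAnswer]) -> SourcedAnswer:
    """Pick the candidate from the most authoritative source type.

    Unknown source types rank below every known one; ties keep the
    first candidate in input order.
    """
    if not candidates:
        raise ValueError("no candidates to resolve")
    unknown_rank = len(AUTHORITY_RANK)
    return min(
        candidates,
        key=lambda c: AUTHORITY_RANK.get(c.source_type, unknown_rank),
    )


# Made-up conflicting answers for one topic.
winner = resolve([
    SourcedAnswer("marketing_copy", "All plans"),
    SourcedAnswer("golden_corpus", "Enterprise plan only"),
])
```

Because the rule is a fixed ranking, the same conflict always resolves the same way, which is what allows resolution without a manual decision on each conflict.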

Verified 2026-05-24