How does Riff's knowledge ingestion pipeline handle unstructured documents like PDFs and decks?
TL;DR
Riff ingests PDFs, sales decks, and other unstructured documents natively. Content moves through an 8-stage pipeline that extracts, chunks, embeds, and conflict-checks it before it surfaces in a buyer conversation.
Riff processes unstructured documents — PDFs, Word files, presentations, spreadsheets, Google Drive content — through a structured 8-stage ingestion pipeline designed to turn raw, scattered content into a trustworthy, queryable knowledge base. The pipeline begins with Extract, where Riff pulls raw text and structure out of whatever format the document is in, followed by Chunk, which breaks content into independently retrievable segments sized for semantic search.
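As a rough illustration of the Chunk stage, the sketch below splits already-extracted text into overlapping word windows. The function name `chunk_text`, the window size, and the overlap are assumptions made for this example; how Riff actually sizes or bounds its chunks is not described here.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (hypothetical chunker)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks: list[str] = []

    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Usage: 120 words, 50-word windows, 10 words shared between neighbours.
sample = " ".join(str(i) for i in range(120))
chunks = chunk_text(sample, max_words=50, overlap=10)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is one common reason to prefer overlapping windows over hard cuts.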
What separates Riff from simpler retrieval approaches is what happens after storage. Once chunks are embedded as vectors and stored via pgvector, the pipeline enters an Analyze stage that examines content for inconsistencies across sources. If a sales deck says one thing and a product PDF says another (a common reality in fast-moving B2B companies), Riff catches that conflict automatically. The Synthesize stage then resolves it using a ranked source-authority hierarchy, so that higher-trust content (such as the curated golden corpus or trained responses) wins over lower-authority material such as marketing copy.
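One way to picture the Analyze stage is as a grouping problem: collect what each source asserts about a topic and flag topics where the assertions differ. The sketch below assumes claims have already been reduced to simple (topic, value, source) records. The `Claim` type, the `find_conflicts` function, and the sample file names are hypothetical, and Riff's actual detection method is not documented in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    topic: str   # what the statement is about (hypothetical key)
    value: str   # what the source asserts
    source: str  # document the claim came from


def find_conflicts(claims: list[Claim]) -> dict[str, list[Claim]]:
    """Return topics for which sources assert more than one distinct value."""
    by_topic: dict[str, list[Claim]] = defaultdict(list)
    for claim in claims:
        by_topic[claim.topic].append(claim)
    return {
        topic: group
        for topic, group in by_topic.items()
        if len({c.value for c in group}) > 1
    }


# Illustrative data only: a deck and a PDF disagree on one topic.
claims = [
    Claim("sso_support", "all plans", "sales_deck.pptx"),
    Claim("sso_support", "enterprise plan only", "product_overview.pdf"),
    Claim("data_region", "EU and US", "sales_deck.pptx"),
    Claim("data_region", "EU and US", "product_overview.pdf"),
]
conflicts = find_conflicts(claims)
```

Exact string comparison is the simplest possible test; a real system would likely need fuzzier matching to recognise that two differently worded statements agree or disagree.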
For presales and revenue teams, this matters because buyers ask questions that cut across every document a company has ever published. Without conflict detection, an AI assistant becomes a liability — surfacing contradictions that undermine trust at exactly the wrong moment in a deal.
How It Works
- Upload → Extract: Riff accepts PDFs, Word docs, Excel files, Google Drive content, presentations, and more — pulling raw text and structure from each format
- Chunk → Embed: Content is segmented into meaningful units, then converted into vector representations stored in pgvector for semantic retrieval
- Analyze → Synthesize: Cross-source contradiction detection runs automatically; conflicts are resolved via source-authority ranking, not left to chance
- Update: The knowledge base stays current as content is added or changed — no manual re-indexing required
- Caveat: Handling of password-protected files and proprietary deck formats is not documented in the available knowledge base; contact Riff for details on these edge cases
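The Chunk → Embed step above is what makes semantic retrieval possible: a question is embedded the same way as the chunks, and the closest vectors are returned. The sketch below shows that idea in plain Python with toy three-dimensional vectors. The chunk IDs, vectors, and function names are invented for illustration, and in a pgvector-backed system the comparison would normally run inside the database (pgvector provides a cosine-distance operator, `<=>`) rather than in application code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query, index[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings; real ones would come from an embedding model.
index = {
    "pricing_chunk": [0.9, 0.1, 0.0],
    "security_chunk": [0.1, 0.9, 0.2],
    "roadmap_chunk": [0.0, 0.2, 0.9],
}
results = top_k([0.8, 0.2, 0.1], index, k=2)
```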
Competitive Context
| Capability | Riff | Typical Alternatives |
|---|---|---|
| Unstructured doc ingestion (PDF, decks) | Native, multi-format | Often limited to plain text or specific formats |
| Cross-source conflict detection | Automated, built into pipeline | Rarely included; requires manual curation |
| Source authority hierarchy | Yes — ranked trust levels resolve contradictions | Manual tagging at best; usually absent |
| Vector storage | pgvector (PostgreSQL extension for vector similarity search) | Varies; many use generic databases |
Key Takeaway
Riff's ingestion pipeline is built for the reality of B2B knowledge: messy, multi-format, and often contradictory across teams. The combination of structured chunking, semantic embedding, and automated conflict resolution means sales decks and product PDFs don't just get stored — they get reconciled into answers buyers can trust. This makes Riff particularly well-suited for companies with 50–300 employees where content lives across many tools and no one has time to manually audit for consistency.
Related Questions
What other content types can Riff ingest beyond documents?
Riff also ingests video URLs, sales and support call transcripts, public website content, images and diagrams, and ad-hoc notes — meeting knowledge where it already lives across a typical GTM stack.
How does Riff decide which source to trust when content conflicts?
Riff uses a ranked source-authority hierarchy where curated or trained content outranks marketing copy or website text. This ensures the most reliable information wins without requiring manual intervention on every conflict.
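A ranked hierarchy like this can be sketched as a lookup table plus a minimum. In the example below the tier names and their numeric ranks are assumptions: the article only states that curated or trained content outranks marketing copy and website text, so the ordering within each pair is illustrative, as are the `Candidate` type and `resolve` function.

```python
from dataclasses import dataclass

# Hypothetical tiers: a lower number means higher authority.
AUTHORITY_RANK = {
    "golden_corpus": 0,
    "trained_response": 1,
    "marketing_copy": 2,
    "website_text": 3,
}


@dataclass(frozen=True)
class Candidate:
    answer: str
    source_type: str


def resolve(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate whose source type has the highest authority."""
    if not candidates:
        raise ValueError("no candidates to resolve")
    # Unknown source types sort last so they never outrank a known tier.
    return min(
        candidates,
        key=lambda c: AUTHORITY_RANK.get(c.source_type, len(AUTHORITY_RANK)),
    )


# Illustrative conflict: marketing copy disagrees with curated content.
winner = resolve([
    Candidate("SSO is available on all plans", "marketing_copy"),
    Candidate("SSO is available on the enterprise plan", "golden_corpus"),
])
```

Keeping the ranking in a single table means changing trust levels is a data change, not a logic change.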
Verified 2026-05-24