Retrieval-augmented generation (RAG) is the default pattern for enterprise SaaS AI features in 2026 — copilots over docs, support assistants, internal search, and workflow suggestions. Most RAG implementations fail in production because teams treat retrieval as a demo problem, not an engineering discipline.
The production RAG stack
1. Ingestion pipeline: parse PDFs, HTML, tickets, and CRM notes with consistent metadata.
2. Chunking strategy: split on semantic boundaries, not fixed token counts; preserve tables and lists.
3. Embedding + vector store: version embeddings when models change; tenant isolation is mandatory.
4. Retrieval layer: hybrid search (keyword + vector), reranking, access-control filters.
5. Generation layer: grounded prompts, citation requirements, refusal policies.
6. Eval + observability: golden questions, hallucination rate, latency, cost per query.
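As one illustration of the retrieval layer, the sketch below merges a keyword ranking and a vector ranking into a single list. It uses reciprocal rank fusion (RRF), which is one common way to combine rankings and is an assumption here, not something the stack above prescribes. The chunk IDs, the function name, and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a
    tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c4", "c3"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, which is the point of hybrid search. A reranker and access-control filters would still run around this step.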
Multi-tenant RAG is non-negotiable
B2B SaaS means strict tenant boundaries. Every chunk, embedding, and retrieval query must enforce org_id (and often role) filters before the model sees context. Cross-tenant leakage is a company-ending bug — not a support ticket.
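A minimal sketch of that rule, assuming each chunk carries an org_id and a set of allowed roles as metadata. The Chunk type, the authorized_context function, and the sample data are hypothetical. The filter runs before ranking, so unauthorized chunks can never reach the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    org_id: str
    allowed_roles: frozenset
    text: str
    score: float  # similarity score from an upstream search step


def authorized_context(candidates, org_id, role, top_k=3):
    """Drop every chunk the caller may not see, then rank what is left."""
    if not org_id:
        # Fail closed: a missing tenant must never mean "search everything".
        raise ValueError("org_id is required for every retrieval call")
    visible = [
        c for c in candidates
        if c.org_id == org_id and role in c.allowed_roles
    ]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Chunk("a1", "acme", frozenset({"admin", "agent"}), "Reset steps", 0.91),
    Chunk("g1", "globex", frozenset({"admin"}), "Globex pricing", 0.99),
    Chunk("a2", "acme", frozenset({"admin"}), "Billing export", 0.85),
    Chunk("a3", "acme", frozenset({"agent"}), "Macro list", 0.40),
]

context = authorized_context(candidates, org_id="acme", role="agent")
print([c.chunk_id for c in context])
```

The highest-scoring candidate belongs to another tenant and is excluded regardless of score. In a real system the same predicate would typically also be pushed into the vector store query itself, so that other tenants' rows are never fetched in the first place.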
Evals separate demos from products
Build a golden set of 50–200 questions per use case with expected citations. Run evals on every prompt change, embedding upgrade, and model swap. Track answer faithfulness, citation accuracy, and 'I don't know' rate when context is insufficient.
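A golden-set harness along these lines might look like the sketch below. All names (GoldenCase, Answer, run_evals) and the sample questions are hypothetical, and the metric definitions are one reasonable reading of the text: citation accuracy requires an exact match on cited sources, and the abstention rate is measured only on questions with no supporting context. Answer faithfulness is left out because it usually needs human or model grading.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    # An empty set means the corpus cannot answer; the system should abstain.
    expected_citations: frozenset = frozenset()


@dataclass
class Answer:
    text: str
    citations: frozenset = frozenset()
    abstained: bool = False


def run_evals(cases, answer_fn):
    """Score answer_fn against the golden set."""
    cited_ok = answerable = abstained_ok = unanswerable = 0
    for case in cases:
        answer = answer_fn(case.question)
        if case.expected_citations:
            answerable += 1
            if not answer.abstained and answer.citations == case.expected_citations:
                cited_ok += 1
        else:
            unanswerable += 1
            if answer.abstained:
                abstained_ok += 1
    return {
        "citation_accuracy": cited_ok / answerable if answerable else None,
        "abstention_rate": abstained_ok / unanswerable if unanswerable else None,
    }


golden = [
    GoldenCase("How do I reset SSO?", frozenset({"kb-12"})),
    GoldenCase("What is the refund window?", frozenset({"policy-3"})),
    GoldenCase("What is the CEO's favorite color?"),
]

# Stand-in for the real RAG pipeline, with one deliberately wrong citation.
canned = {
    "How do I reset SSO?": Answer("Use the admin console.", frozenset({"kb-12"})),
    "What is the refund window?": Answer("30 days.", frozenset({"policy-9"})),
    "What is the CEO's favorite color?": Answer("I don't know.", abstained=True),
}

report = run_evals(golden, canned.__getitem__)
print(report)
```

Running the same harness before and after a prompt, embedding, or model change turns a regression into a visible change in these numbers.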
Enterprise buyers will ask: "Can you show me eval results, audit logs, and how you prevent data leakage between customers?" If the answer is no, the feature is not enterprise-ready.
When RAG is not enough
Some workflows need tool use, transactional APIs, or fine-tuned models — not just document retrieval. Architecture should compose RAG with agents carefully: retrieval for knowledge, tools for actions, humans for approvals on high-risk state changes.
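The division of labor described above can be sketched as a small routing function. The Kind enum, the Request type, and the returned step names are hypothetical. The sketch only decides which path a request takes and does not call any real retriever or tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    KNOWLEDGE = "knowledge"  # answerable from documents
    ACTION = "action"        # changes state through a tool or API


@dataclass
class Request:
    kind: Kind
    payload: str
    high_risk: bool = False


def route(request: Request, approved_by: Optional[str] = None) -> str:
    """Pick the next step: retrieval, a tool call, or a human approval gate."""
    if request.kind is Kind.KNOWLEDGE:
        return "retrieve"
    if request.high_risk and approved_by is None:
        # High-risk state changes wait for a named human approver.
        return "await_approval"
    return "call_tool"


question = Request(Kind.KNOWLEDGE, "What is our SLA for P1 tickets?")
tag_update = Request(Kind.ACTION, "Add tag 'vip' to ticket 481")
refund = Request(Kind.ACTION, "Refund order 9912", high_risk=True)

print(route(question))
print(route(tag_update))
print(route(refund))
print(route(refund, approved_by="finance-lead"))
```

Keeping the approval check in the router, not in the prompt, means the model cannot talk its way past it.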



