What Is RAG? A Plain-English Guide to the Tech That Keeps AI From Making Things Up

Customer Support 2026-05-11 · Satsuma Creative · 9 min read

RAG (Retrieval-Augmented Generation) makes sure your AI only answers from the data you give it — and admits when it doesn't know. This article explains vectors, chunking, retrieval, and reranking in plain English, and why getting RAG right is much harder than getting it working.

TL;DR

  • The analogy: an LLM is a well-read assistant who doesn't know your company. RAG is the workflow of "look up your company's documents first, then have the assistant answer based on what was found."
  • Technical core: chunk your company's documents → compute embeddings (vectors) → embed the visitor's question too → find the most similar passages and stuff them into the prompt → the LLM answers based on those passages
  • Vector = encoding semantic position as numbers. Sentences with similar meanings have similar vectors, so "can I get a refund?" can find the "refund policy" passage.
  • The last mile: don't let the LLM improvise. The system prompt enforces "if it's not written, return UNKNOWN; don't make things up."
  • Most AI customer-support SaaS products have mediocre RAG because their chunking, embedding model, and system prompt are all generic templates, not tuned for any single customer.

Or to put it another way: how do you make an AI answer only from what you give it, and never make things up?


Let's start with an analogy

Imagine you hire a fresh graduate as an assistant. You can train them in two ways:

Option one: have them memorize the entire employee handbook by rote. When tested, they answer from memory.
  • Drawback: memory drifts, gets confused, and fills in gaps with invention
  • And when the handbook is updated, they still remember the old version; you have to retrain them to change it

Option two: teach them how to look things up in the handbook. When tested, they first turn to the right chapter, read only those pages, then answer.
  • Upside: answers always come from the latest version of the handbook
  • When they can't find an answer, they honestly say "the handbook doesn't cover this; let me ask my manager"
  • Adding new chapters requires no retraining; they can use them immediately

Option two is RAG.

The LLM (large language model) is treated as the "assistant who looks things up," not the "assistant who memorizes the book."


RAG stands for Retrieval-Augmented Generation

The name breaks down into three actions:

  1. Retrieval: pull content relevant to the question from the knowledge base
  2. Augmented: stuff the retrieved content into the prompt
  3. Generation: have the LLM answer based on that content

Order matters: retrieve first, then generate. The LLM doesn't see "all the knowledge in the world," only "the passages you've filtered down for it."
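The three actions can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function names (`retrieve`, `augment`, `rag_answer`) are hypothetical, retrieval here is a toy word-overlap ranking standing in for real vector search, and `llm` is any callable you supply that takes a prompt and returns text.

```python
import re
from typing import Callable, List, Set


def words(text: str) -> Set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, knowledge_base: List[str], top_k: int = 2) -> List[str]:
    """Retrieval (toy stand-in): rank passages by words shared with the question."""
    scored = [(len(words(question) & words(p)), p) for p in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop passages that share nothing with the question.
    return [p for score, p in scored[:top_k] if score > 0]


def augment(question: str, passages: List[str]) -> str:
    """Augmentation: stuff the retrieved passages into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer only from the passages below. "
        "If the answer is not there, reply UNKNOWN.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )


def rag_answer(question: str, knowledge_base: List[str], llm: Callable[[str], str]) -> str:
    """Generation: the LLM only ever sees the filtered passages."""
    passages = retrieve(question, knowledge_base)
    return llm(augment(question, passages))
```

The point of the structure is that `llm` never receives the whole knowledge base, only what `retrieve` let through.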


What's that "vector" thing? In plain English

The most mysterious part in the middle is "how do you find content relevant to the question out of a pile of documents?" The traditional approach is keyword search (SQL LIKE), but it has two fatal flaws:

Question: "What do I do if my account is hacked?"
  • Knowledge base entry: "Procedure for handling suspicious logins"
  • Keyword search: no match (neither "account" nor "hacked" appears in the title)

The solution: convert each passage of text into a string of numbers called a "vector," representing its semantic position.

More concretely:

"My account was hacked"        → [0.21, -0.45, 0.78, ..., 0.12]
"Handling suspicious logins"   → [0.19, -0.43, 0.81, ..., 0.15]   ← similar numbers, same region
"TVC ad pricing quote"         → [-0.55, 0.92, -0.30, ..., 0.61]  ← very different numbers, far away

This string of numbers is computed by an AI called an "embedding model," which has read billions of sentences and learned to place things with similar meanings at nearby positions in numerical space.

So retrieval becomes: convert the question to a vector too, then find the passages mathematically closest to it. The measure of closeness is called "cosine similarity": basically, the smaller the angle between two vectors, the more similar they are.

You don't need to understand the math. Just remember this picture: meaning = position. Nearby positions are related.
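For readers who do want to see the math, here is a small sketch of cosine similarity in plain Python. The vectors are an assumption for illustration: they reuse only the visible numbers from the example above, treated as if each vector had just four dimensions, whereas real embeddings have hundreds or thousands. The helper name `most_similar` is hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative 4-dimensional vectors (truncated from the example above).
passages: Dict[str, List[float]] = {
    "Handling suspicious logins": [0.19, -0.43, 0.81, 0.15],
    "TVC ad pricing quote": [-0.55, 0.92, -0.30, 0.61],
}
question_vector = [0.21, -0.45, 0.78, 0.12]  # "My account was hacked"


def most_similar(query: List[float], candidates: Dict[str, List[float]]) -> str:
    """Return the title whose vector has the highest cosine similarity to the query."""
    return max(candidates, key=lambda title: cosine_similarity(query, candidates[title]))
```

With these numbers the "suspicious logins" passage scores close to 1 against the question, while the ad-pricing passage scores below zero, which is the "nearby versus far away" picture in numeric form.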

RAG Is Easy to Build, Hard to Do Well

Sounds easy, right? Chunk the documents, vectorize, search, hand off to the LLM. But making RAG answer accurately involves four engineering pitfalls:

Pitfall 1: Chunking strategy

How do you chunk a 100-page PDF?

  • Too small (50 characters per chunk): information is fragmented, and the LLM can't piece together a complete answer
  • Too large (2,000 characters per chunk): too much content packed into one chunk, the vector represents an "average meaning," and search becomes imprecise

In practice: 300–500 characters per chunk, with 50-character overlap between adjacent chunks (to avoid splitting key sentences in half).

Chunking also needs to respect natural boundaries: you can't split a single FAQ answer in half. Chunk by markdown headings, lists, and paragraph structure, rather than by length alone.
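A minimal sketch of boundary-aware chunking, under simplifying assumptions: the only "natural boundary" it recognizes is a blank line between paragraphs (a real chunker would also look at headings and lists), and the function name `chunk_text` and its parameters are hypothetical. Paragraphs are packed into a chunk until the size limit would be exceeded, and each new chunk starts with the tail of the previous one as overlap.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 400, overlap: int = 50) -> List[str]:
    """Pack whole paragraphs into chunks of roughly max_chars, with overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a chunk.
            # A single oversized paragraph is kept whole rather than cut.
            current = candidate
        else:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
    if current:
        chunks.append(current)
    return chunks
```

Because the overlap is added on top of the next paragraph, a chunk can slightly exceed `max_chars`; the sketch accepts that in exchange for never splitting a paragraph.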

Pitfall 2: Choice of embedding model

The best embedding models for Chinese and English are different.

  • Best for general English: OpenAI text-embedding-3-large, Cohere embed-v3
  • Reliable for Chinese/English bilingual: paraphrase-multilingual-MiniLM-L12-v2, Cohere (what our Xiao-Ai uses)
  • Best for pure Chinese (early 2026): BGE bge-m3, gte-Qwen2

Choose the wrong one and the same knowledge base performs 30% worse.

Pitfall 3: Hybrid Search

Pure vector search has one drawback: for things requiring exact matching (names, product codes, numeric IDs), vectors lose precision.

Example: a player asks "How do I complete quest S-1A?"
  • Pure vector: "S-1A" might be treated as a generic symbol and match answers from other quests
  • Add BM25 keyword search: "S-1A" hits exactly, results are ranked together

In practice, you need dense (vector) + sparse (BM25) hybrid search, then use a reranker model to re-rank the top results.
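The sketch below shows one way the dense and sparse rankings could be combined. Several things here are assumptions rather than anything the article specifies: the fusion method is reciprocal rank fusion (RRF) with the conventional constant `k = 60`, the BM25 scorer is a compact textbook implementation written for this example, and the dense scores in the usage note are hard-coded hypothetical values standing in for real vector similarities.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase tokens; hyphens are kept so codes like 's-1a' stay intact."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with BM25."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def ranks(scores: List[float]) -> List[int]:
    """Rank of each document (1 = best) for a list of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, doc_index in enumerate(order, start=1):
        result[doc_index] = position
    return result


def hybrid_rank(dense_scores: List[float], sparse_scores: List[float], k: int = 60) -> List[int]:
    """Fuse two rankings with reciprocal rank fusion; returns doc indices, best first."""
    dense_r, sparse_r = ranks(dense_scores), ranks(sparse_scores)
    fused = [1 / (k + d) + 1 / (k + s) for d, s in zip(dense_r, sparse_r)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

As a usage example, take three hypothetical entries: "Quest S-1A walkthrough: talk to the blacksmith first.", "Quest S-2B walkthrough: collect ten herbs.", and "How to reset your password." If the dense scores are `[0.80, 0.82, 0.10]`, vector search alone slightly prefers the wrong quest; BM25 ranks the S-1A entry first because the code matches exactly, and the fused ranking puts it on top.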

Pitfall 4: Reranking

After vector search finds the top 20 candidates, use a more precise model (usually smaller but deeper) to re-rank them and keep the top 3–5.

Why? Embedding models are trained for "similarity"; rerankers are trained for "relevance," which is not the same thing. Embeddings pull out things with similar "shape"; rerankers confirm which ones "actually answer the question."

Skipping reranking typically drops top-1 hit rate by 15–20%.
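The two-stage shape of reranking can be sketched as follows. The scorer is the part to treat with suspicion: `toy_relevance` is a hypothetical stand-in that just counts shared words, where a real system would call a trained reranker model. Only the structure (a wide candidate list in, a short re-ordered list out) is the point.

```python
import re
from typing import Callable, List


def toy_relevance(question: str, passage: str) -> float:
    """Stand-in for a reranker model: fraction of question words found in the passage."""
    q_words = set(re.findall(r"\w+", question.lower()))
    p_words = set(re.findall(r"\w+", passage.lower()))
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(
    question: str,
    candidates: List[str],
    score: Callable[[str, str], float] = toy_relevance,
    keep: int = 3,
) -> List[str]:
    """Re-score first-stage candidates (e.g. the top 20) and keep the best few."""
    ordered = sorted(candidates, key=lambda passage: score(question, passage), reverse=True)
    return ordered[:keep]
```

Passing the scorer in as a parameter keeps the first-stage search and the reranker independent, so either one can be swapped without touching the other.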

RAG's "Last Mile": Don't Let the LLM Improvise

After doing all four things above, you'll still see the LLM improvise: it reads the knowledge base snippet and still fills in gaps on its own.

Example:
  • Q: "How much does Plan A cost?"
  • Knowledge base snippet retrieved: "Introduction to Plan A... (no price listed)"
  • LLM answer: "Plan A starts at NT$3,000." ← made up

At this point, the problem isn't just technical; it's rules design:

1. The system prompt states it explicitly: "Answer only from the knowledge base.
   For anything not written there, always say 'I'm not sure; please get a human
   to help.' Don't fill gaps from common sense."

2. Every response must output an ACTION tag ([ANSWER]/[UNKNOWN]/[HANDOFF]), and
   the system routes by tag; the LLM is not allowed to fall back to making
   things up on its own

3. Answers carry citations, so the LLM knows they will be traced

Do these three things together and the LLM will actually behave.

→ Want to see it in action? Xiao-Ai on the Satsuma site runs on exactly this architecture. Ask her "Do you do e-commerce?" and she won't make something up; she'll say "I'm not sure about that. Would you like to leave your email for Satsuma?"
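The tag routing in rule 2 can be sketched like this. It is an illustration under assumptions, not Xiao-Ai's actual code: the function names, the fallback wording, and the handoff message are hypothetical. The key design choice is that any output without a valid tag is treated as UNKNOWN, so an untagged improvisation never reaches the visitor.

```python
import re
from typing import Tuple

ALLOWED_ACTIONS = {"ANSWER", "UNKNOWN", "HANDOFF"}
FALLBACK_REPLY = "I'm not sure about that. Let me get a human to help."
HANDOFF_REPLY = "Connecting you to a human agent."


def parse_action(llm_output: str) -> Tuple[str, str]:
    """Split '[TAG] body' into (tag, body); anything malformed becomes UNKNOWN."""
    match = re.match(r"\s*\[([A-Z]+)\]\s*(.*)", llm_output, flags=re.DOTALL)
    if not match or match.group(1) not in ALLOWED_ACTIONS:
        return "UNKNOWN", ""
    return match.group(1), match.group(2).strip()


def route(llm_output: str) -> str:
    """Decide what the visitor sees, based only on the ACTION tag."""
    action, body = parse_action(llm_output)
    if action == "ANSWER" and body:
        return body
    if action == "HANDOFF":
        return HANDOFF_REPLY
    # UNKNOWN, an empty answer, or a missing tag all end up here.
    return FALLBACK_REPLY
```

For instance, the made-up reply "Plan A starts at NT$3,000." carries no tag, so `route` returns the fallback message instead of passing the invented price along.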

Why Most AI Customer-Support SaaS Have Mediocre RAG

The techniques are all public. Why do different teams produce such different results?

The answer is economics, not technology.

Engineering step     Done carefully                 Done off the shelf
Chunking strategy    Tuned per customer             One ruleset for all
Embedding model      Chosen by language / domain    One model for everything
Hybrid search        Enabled based on KB content    Pure vector by default
Reranking            Add a reranker layer           Skipped to save cost
Rules design         One SOP per customer           Generic system prompt

A SaaS that charges $3,000 a month can't fine-tune all of this for your KB. The economics don't work.

→ If you want RAG that scores above 80, what you need is custom delivery, not SaaS.

Conclusion: RAG Is the Answer, but Not a Silver Bullet

RAG fixes the "AI making things up" bug, but only well-built RAG does. Building it isn't hard; building it well is.

Here's how you can evaluate AI customer support on the market. Ask vendors three questions:

  1. What embedding model do you use? How does it perform on Chinese? (If they can't answer, it's off-the-shelf SaaS)
  2. Do answers come with citations? (If they say "yes, but the feature isn't enabled yet," they haven't built it)
  3. Will it make things up when it doesn't know? (Have them demo live and ask something not in the KB)

Only vendors who can answer all three actually know how to build RAG.

Further reading

  • Why AI customer support always misses the point →
  • Full introduction to AI Coworkers →

Want to play with RAG yourself? Go to the bottom-right corner of the Satsuma Creative homepage.