What goes wrong when you wire ChatGPT directly into customer support? (5 real cases)
Wiring ChatGPT, Claude, or Gemini straight into a support chatbot looks simple. In production, five fatal failure modes show up. This piece breaks down each one with real cases and the technical reasons behind them.
TL;DR
"Get an engineer to hook up the ChatGPT API as customer support" sounds cheap, but these five failures all started exactly that way:
- E-commerce: Customer asks about refunds, AI invents a "3-5 day" timeline (actual is 7-14 days) → flood of complaints
- SaaS: Customer asks about plan differences, AI claims Product A has Product B's features → refund wave
- Gaming: Player asks for compensation, AI casually promises "we'll make it up to you" (unauthorized)
- Finance: AI misstates legal terms, gets reported to regulators
- Subscription: AI claims we have features only competitors offer → customers sign up, discover the truth, leave brutal reviews
- Common root cause: No RAG, no ACTION routing, no system prompt constraints — the LLM is free to improvise
- It's fixable, butnot just by "hooking up the API" — you need RAG + constraints + monitoring + continuous KB maintenance
Or put more bluntly:Why does your ChatGPT feel like a god, but turn into an idiot the moment it's on your website?
Over the past six months I've seen five companieshook up the ChatGPT/Claude API as customer support themselves, and all of them pulled the plug within a month of going live.
It's not an API problem, it's not bad engineering —it's a pattern problem. Below are five real cases (details anonymized), each with the technical reason behind it.
If you're weighing "let's just hook up the LLM API ourselves to save money," please take 8 minutes to read this.
Case 1: E-commerce — customer asks about refunds, AI invents a timeline
Setup: A beauty e-commerce brand wired up the ChatGPT API into a front-end chat widget. System prompt: "You are a customer service rep for XX Company. Please warmly help customers solve their problems."
What broke:
Customer: "How long does a refund take?" ChatGPT: "Hi! Our refunds typically arrive within 7-14 business days, thanks for your patience ✨"
In reality, the company's refund process is: - 5-7 days internal review after submission - Another 14 business days to land after approval - Total: 19-21 days
→ Customers didn't get their money after 7 days and tore the brand apart on Dcard. The company paid out 30x the order value in goodwill vouchers.
Technical reason: When the LLM sees words like "refund" and "how long," it pulls from training data"the average answer for similar situations worldwide"— most e-commerce sites are 7-14 days, so that's what it used. Ithas no idea what your company's actual process is。
→ this bug can't be fixed. No amount of better prompting will solve it.
Case 2: SaaS — customer asks about plan differences, AI crosses product lines
Setup: A B2B SaaS sells both "Business" (NT$3,000/month) and "Enterprise" (NT$15,000/month). Differences: SLA, SSO, API quota.
They wired up the Claude API and dumped all product pages from the website into the prompt.
What broke:
Customer: "Does Business include SSO?" Claude: "Yes! Business fully supports SSO and SAML 2.0 integration."
In reality: - Businessdoes not have SSO(that's Enterprise-only) - Customer signed, discovered the truth, terminated the contract + filed a complaint - Sales rep had to apologize over fried chicken
Technical reason: Claude saw one long document blending two products and couldn't reliably figure outwhich plan SSO belongs to. When the question itself is ambiguous ("Business" could mean either product), it picked "the one that looks more feature-rich" — this is called plausibility bias(plausibility bias).
→ The fix isn't a longer prompt, it'sstructured retrieval(RAG)+ explicit product-tier metadata. The LLM shouldn't be guessing which one is "Business" — the system should identify the product first, then feed the correct passage.
Case 3: Gaming — player asks for compensation, AI makes promises on the spot
Setup: A gaming company just launched a new event and the top-up package had a bug. Players flooded in with questions.
Engineers rushed to wire up GPT-4 as the first line of triage. System prompt: "Please patiently soothe player emotions and help them solve their problems."
What broke:
Player: "I topped up 1,500 but didn't get the package!!" AI: "We're truly sorry for the trouble! I'llsend you two replacement packagesas compensation, plus 5 mystery crystalsplease check your inbox 🎁"
In reality: - The game has no "double replacement" compensation policy - "Mystery crystals" aren't an item in this game (the LLM made it up) - Player screenshotted it to the community → went viral → not honoring it gets you sued for fraud, honoring it sets the precedent for every player going forward
→ A disaster. The head of customer service was summoned that same day.
Technical reason: The LLM has been trained on massive amounts of "customer soothing" examples (from Amazon, Starbucks, and other CS training corpora) — its idea ofwarmthis making promises. It doesn't know your company's compensation policy, so it defaults to "the most generous-sounding, well-meaning promise."
→ Fix: beyond RAG-constrained knowledge bases, the system prompt must explicitly say "Do not promise any compensation, gifts, or discounts." Plus, set an ACTION tag — the moment "compensation"-related language is detected, [ESCALATE] hand off to a human.
Case 4: Finance — AI misstates legal terms
Setup: An online brokerage wired up GPT-4 to answer account, trading, and tax questions.
What broke:
Customer: "Are dividends subject to 30% withholding tax?" AI: "Foreign company dividends are subject to a 30% US withholding tax, but you can file Form W-8BEN to reduce it to 10%."
Sounds professional, right? But this answer: - Applies to the US, butthe W-8BEN calculation logic is different for Taiwanese individual investors - And this brokerage is a Taiwan-domestic broker thatdoesn't offer US equities
→ The customer took the AI's answer to their accountant,who thought he was talking nonsense; the customer complained that the brokerage "gave wrong info via AI"; the brokerage received acompliance warning letter from the FSC。
Technical reason:In heavily regulated domains like law, tax, and medicine, the LLM produces answers thatsound rightbut are actuallyoutdated to a specific training cutofforfor the wrong jurisdiction. In high-stakes industries this is fatal.
→ In these industries, 99% of the time you should not use a general LLM as your customer-facing first line. Instead:
1. The knowledge base should contain only "legally reviewed" fixed passages
2. Any legal or tax question must be force-routed to a [ESCALATE] compliance-certified human
3. Maintain compliance logs in the back office (who asked what, what the AI replied)
Case 5: Subscription service — AI claims competitor features as its own
Setup: A SaaS subscription service wired up the OpenAI API and dumped 30 pages of product info into the prompt.
What broke:
Customer: "Do you support Slack integration?" AI: "Yes!We fully support Slack integration — real-time notifications, bot interactions, andSlash command。」
In reality:The product has no Slack integration. The LLM grabbed the Slack integration descriptions ofcompetitors(Asana, Notion, Linear) from its training data and pasted them onto this company.
→ Customer signed a monthly contract, discovered the truth, demanded a refund.And tweeted about it, getting 12,000 retweets。
Technical reason:LLMs have no instinct to say "I don't know". When the answer to a question isn't in the specific data it has seen, it'll reach forcompetitors with similar functionalityand adapt their answer. This is worst in questions like "feature comparisons」、「integrations」、「API support".
→ Fix: strict RAG + hard-code the system prompt with "Only answer about features documented in the knowledge base below. For anything not documented, always reply 'I'm not certain about this feature — let me connect you with a specialist to confirm.'」。
The common root cause behind all five
Flatten the above and you'll seethe same mechanism is at work every time:
An LLM's training objective is "produce text that sounds plausible," not "only speak when you actually know."
"I don't know" is, for an LLM, acounterintuitive training target— in most fine-tuning data, the assistant "tries its best to help". So unless you hard-code the rules, its default isthe attitude of helping the customer solve their problem— even when it has no idea what the answer is.
This mechanism is fine in scenarios like "helping a student with homework" (the student verifies). But in scenarios like "answering customer questions on behalf of a company", it'slife or death。
Can it be fixed? Yes, but not just by "hooking up the API"
The architecture belowis achievable for any company willing to invest(three months with engineers and you're there):
1. RAG retrieval(向量資料庫,只回答有寫的事)
參考 → blog/02-rag-shi-shen-me.html
2. system prompt 寫死:「沒寫的事一律 [UNKNOWN],不准補腦,
不准承諾任何補償/優惠/功能」
3. ACTION tag 強制路由:
[ANSWER] → 顯示回答
[UNKNOWN] → 轉真人或引導留聯絡
[ESCALATE]→ 高風險詞自動觸發(法律/稅務/補償)
4. citation 系統:每個答案附「來自知識庫第 X 條」,
約束 LLM 不亂講
5. 後台 audit:管理員可看每一輪對話,
揪出「答對問題但語氣不對」、「KB 缺漏」等問題
Our own Satsuma Xiao'ai runs on exactly this architecture.If a visitor pushes her on a question not in the KB, she won't make it up— that's the thing we've worked hardest to guarantee.
→ Try to trip her up yourself →
So why is everyone still rushing to hook up ChatGPT?
I interviewed those 5 companies above and4 of them were engineer-led at the start。
Engineers' logic usually goes: 1. "The OpenAI API is just a few lines of code — I'll knock it out" 2. "Management wants AI, let's ship a demo first" 3. "The cost is just API tokens, dirt cheap" 4. "If I write a longer prompt, that should control it, right?"
Each line is reasonable on its own,but together they kill you. The reason: engineers aren't trained to build "customer-facing services" — they build "internal tools」。
". Internal tools can tolerate hallucination (a coding assistant has a bug, the engineer debugs it themselves). Customer-facing services — every hallucination hits brand trust directly.
→ This isn't an engineering problem, it's a business design problem。
Conclusion: don't try to save this money
"Wire up ChatGPT as customer support yourself"looks like a 60% budget cut in the short term(SaaS monthly fee vs. OpenAI token cost), but you'll pay twice over on the hidden costs below:
| Hidden cost | Estimate |
|---|---|
| First major failure (goodwill vouchers / refunds / legal) | NT$50,000 - 500,000 |
| Engineering hours (prompt tuning, KB cleanup) | 200-400 hours |
| Damage to customer trust (impossible to quantify, very real) | — |
| Rebuilding the knowledge base after bug fixes | 80-160 hours |
→ In most cases, buying SaaS or hiring a custom-build vendor is actually cheaper.
If you still want to wire it up yourself, at least implement the 5 things outlined above.
To put it plainly
- Can tolerate hallucination risk: pure internal tools, prototype demos — OK
- E-commerce / SaaS / subscription services / general service businesses: use our selection guide to decide between SaaS and custom
- Finance / medicine / law / game compensation: never use a general LLM as your first line. If you do build, it has to be a tightly constrained custom solution — happy to chat if we're a fit
- Want to see a real demo:Satsuma Xiao'ai— she'll honestly tell you "I don't know"
Further reading: - Why does AI customer support always miss the point? → - What is RAG? → - Stop buying AI customer support SaaS →
Satsuma Creative
Integrated marketing creative agency. We've seen too many companies spend more money trying to "save money."