May 15, 2026 · By Denis N.

What AI Chatbots Miss About Small Business (I Tested 8 of Them)

I asked eight AI chatbots the same three questions a small business owner might ask. Across the 24 responses, three things were consistently absent. Not one chatbot asked me how many hours I was losing or what the gap was costing me. Not one quantified the cost of doing nothing in my specific business. Not one gave me anything I could save and share with my accountant.

The advice itself was mostly good, which is the part I want to be clear about up front. ChatGPT walked through a four-week implementation plan without being asked. Claude opened the burnout scenario with genuine acknowledgment before moving into operational suggestions. Meta AI suggested a "4D audit" — delete, delegate, defer, do — which is a clean framework I'd be happy to use myself. None of this was bad advice. The post isn't about whether AI is useful for small business problems. It's about a specific gap that showed up across all eight chatbots, and that gap is what I built HiddenDrain to close.

What follows is how the experiment ran, what showed up, and what I think it means if you're trying to figure out where AI tools fit into the operational side of your business.

The setup

Three scenarios, all written to mirror how an actual owner types into a chat box rather than how a prompt engineer crafts a query.

Scenario 1 (generic chaos): "My small business feels chaotic and I'm not sure what's actually broken. Every day feels like firefighting and I never get to the things I keep telling myself I'll do. Where do I even start figuring out what's wrong?"

Scenario 2 (home services): "I run a 6-technician home services business (plumbing, light electrical, appliance repair) across 2 locations. We're profitable but it feels like we're always one bad day away from a disaster. Bookings get mixed up, technicians sometimes show up to the wrong job, and my dispatcher is constantly putting out fires instead of planning the week. What should I actually do about this?"

Scenario 3 (owner overwhelm): "I'm a small business owner working 60+ hours a week and I'm burning out. My team is fine but I'm the one who ends up dealing with everything — every decision, every problem, every customer complaint. I don't have time to think about strategy because I'm too busy keeping the wheels on. How do I fix this?"

I pasted each prompt into eight chatbots — ChatGPT (default), Claude Sonnet 4.6, Microsoft Copilot Smart, DeepSeek instant, Gemini 3 Fast, Grok fast, Meta AI instant, and Perplexity default — and took whatever they gave back, no follow-up questions, no prompt tweaks. Then I ran the same three scenarios through HiddenDrain (HD from here on, since it'll come up often) with answers a real owner might give. Twenty-seven outputs in total, captured between May 11 and 12, 2026.

I scored each one against eight dimensions — not a definitive list of what owners care about, just the things that seemed to separate "reading material" from "something I could act on Monday morning." Did the model ask for a number, did it apply a named framework, did it produce something saveable, how many recommendations did it offer, did it close by inviting more conversation, did it acknowledge the emotional weight, how many words did it produce, and did it tell me what to do tomorrow.
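If you want to run your own version, this is roughly the per-response scoring sheet I filled in by hand. The field names are my own shorthand, not a formal instrument; treat it as a sketch of the rubric rather than the rubric itself.

```python
# Rough per-response scoring sheet, filled in by hand for each of the 27 outputs.
# Field names are my own shorthand, not a formal instrument.
from dataclasses import dataclass

@dataclass
class ResponseScore:
    source: str                  # "ChatGPT", "Claude", ..., "HiddenDrain"
    prompt: int                  # 1, 2, or 3
    asked_for_number: str        # "Yes" / "Partial" / "No"
    named_framework: str         # framework name if one was applied, else ""
    saveable_deliverable: bool   # produced something I could file and share
    recommendation_count: int    # distinct recommendations offered
    closing_question: bool       # ended by inviting more conversation
    emotional_ack: str           # "None" / "Mild" / "Warm"
    word_count: int
    first_step_bridge: bool      # told me what to do tomorrow

def averages(scores: list[ResponseScore], source: str) -> tuple[float, float]:
    """Per-source averages that feed the comparison table: recommendations and words."""
    rows = [s for s in scores if s.source == source]
    n = len(rows)
    return (sum(s.recommendation_count for s in rows) / n,
            sum(s.word_count for s in rows) / n)
```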

The comparison

| Source | Asked for a number? | Named framework? | Saveable deliverable? | Recommendations (avg) | Closing question? | Emotional acknowledgment? | Words (avg) | First-step bridge? |
|---|---|---|---|---|---|---|---|---|
| ChatGPT | No | No | No | ~8 | No | Mild | ~1430 | No |
| Claude | No | No | No | ~5 | Yes | Mild | ~500 | No |
| Copilot | No | No | No | ~5 | Yes | Mild | ~420 | No |
| DeepSeek | Partial | No | No | ~6 | Yes (P1, P3) | Warm | ~670 | Partial |
| Gemini | No | Yes (branded) | No | ~4 | Yes | No | ~500 | No |
| Grok | Partial | No | No | ~7 | Yes (P1, P2) | Mild | ~800 | No |
| Meta AI | Partial | Yes (named) | No | ~6 | Yes | Warmest | ~700 | No |
| Perplexity | No | Yes (SWOT, 5 Whys) | No | ~3 | Suggestions only | No | ~300 | No |
| HiddenDrain | Yes | Yes (TIMWOODS) | Yes (PDF) | 3 | No | Yes (when warranted) | ~350 | Yes |

Scan any column. The pattern shows up before you've read a word of what follows.

What the chatbots got right

Worth being clear about the strengths first, because there are plenty.

On Prompt 3, all eight chatbots named the underlying pattern — some called it the founder trap, some called it hub-and-spoke, some called it reactive mode. ChatGPT used the "central nervous system" image. Meta AI called it "firefighter-in-chief" and offered the cleanest reframe of any model: become "fire marshal" instead. Those are good diagnoses produced from a one-line prompt.

On Prompt 2, the chatbots converged. All eight recommended field service management software, seven named Jobber and Housecall Pro, and five named ServiceTitan (most of them correctly noting it's overkill for a 6-tech business). There is no plausible reading where chatbots "miss" the answer to a well-specified home services scheduling problem — the tools the owner needs are well-known, and the AI knows them.

Seven of eight also ended at least one of their three responses with a clarifying question. Claude's "what part of this feels most stuck for you?" Meta AI's "which of these feels most doable to start with?" That conversational openness is a real chatbot advantage — if you keep talking, they keep responding. And three of the eight (Meta AI, DeepSeek, Claude) opened with empathy on Prompt 3, which is what an exhausted owner actually needs before any operational advice can land.

Claude or Meta AI or DeepSeek will give you good advice if you ask about a business problem. The reason I'm writing the rest of this post is that good advice and a diagnostic aren't the same thing — one of them produces action while the other produces something to read.

What they missed

Five things showed up consistently across the eight chatbots. None of them is an AI failure on its own. Together they describe what chat-shaped interaction can't easily do.

Gap 1: Nobody asked for a number

The chatbots were given one shot, no follow-up, so any clarifying question they asked would have gone unanswered. That's a real limitation of the test — in a multi-turn conversation, some of them might have asked for a number. But the question worth asking is whether a typical SMB owner conducts a multi-turn diagnostic conversation, or pastes their problem once and reads the reply. Most do the second. And even within the single response, none of them paused to say "tell me roughly how many hours per week first" or "how often does this happen" before continuing — the kind of front-loaded clarifier you'd expect from a consultant who needed the data before answering. They suggested the owner do a time audit later. None of them ran one in the conversation.

The reason "we lose 10–15 hours a week to admin" is a useful sentence is that the owner said it out loud. A category of waste in a list isn't the same as a number the owner has committed to.

Gap 2: Specific language got echoed at the surface, not at depth

Every chatbot used "6 technicians" or "60+ hours" back at me — those were in the prompt. None of them mentioned the property management client I lost in the home services scenario, because none of them asked. HD surfaced that detail by asking specifically about lost customers; the full Q&A is on the raw outputs page if you want to see how it played out. Worth being honest about: of course the chatbots didn't ask, the test was one-shot. HD didn't "do better" by some general capability. HD asked because the lost-customer question is built into its diagnostic. The chatbots don't have one.

So the difference here isn't "AI can't be personalised." Sounding personalised is the part chatbots do well. Being personalised at depth requires asking questions the owner doesn't know to volunteer.

Gap 3: Too many recommendations

ChatGPT averaged eight recommendations per response. Grok averaged seven. Meta AI averaged six, and often more once you counted sub-items. The owner who reads any of these walks away with a list. The owner who acts on three things this week is rare; the owner who acts on eight is rarer.

Perplexity was the exception with about three per response — but Perplexity was also the shortest and least diagnostic. Constraint without depth is its own problem.

Gap 4: Chat output mostly stays in chat

A few of the chatbots offer "share conversation" links, which work — sort of. They're hosted URLs someone else can read. But a link isn't a document. It can be revoked, it depends on the platform staying up, and it doesn't sit in a folder alongside your accountant's files. It's a partial workaround, not a deliverable. For owners who actually want to share findings with a co-founder, an advisor, or an accountant in a durable form, chat output is the wrong shape.

Gap 5: Industry context is performative

Every chatbot said "in a home services business" or "in a small business like yours" somewhere. None of them changed the substance of the advice based on industry. Recommendations for a restaurant, a law firm, and a home services company would be roughly interchangeable across most of the eight models.

To be fair: HD's industry dropdown also doesn't currently change the report substantively. The industry input is collected but mostly not yet used to vary the diagnosis. The difference is that HD's recommendations end up industry-specific when the owner mentions industry-specific tools in their answers (Jobber, Notion, QuickBooks), because HD's questions surface those mentions. The chatbots have no question that does that work. So the gap is one HD partly shares — and is something I'm actively working on closing.

None of these are failures of AI; they're what chat-shaped interaction can't easily produce, and they're what an SMB owner needs. If you want the longer version of why generic AI tooling so often fails to move the needle for small businesses, I went deeper on the structural reasons in Why AI keeps failing small businesses.

Three things worth telling you about

A few specific findings the table doesn't capture.

The Claude contamination. In an earlier pilot, Claude's Prompt 1 response mentioned TIMWOODS unprompted — which surprised me. It was the only mention of a named lean framework in any of the early runs, and it landed on the vaguest prompt. Tracing it back, my Claude profile had a HiddenDrain-related skill installed, and the session context was bleeding through. When I re-ran on a clean profile, Claude did not mention TIMWOODS in any of the three responses. Across the fresh eight-chatbot run captured here, the word TIMWOODS appears zero times in 24 responses. Two methodological lessons: profile context bleeds, even when it isn't supposed to, so test from a clean account if you're benchmarking AI tools yourself. And the corrected finding actually strengthens HD's position — without contamination, no chatbot reaches for the TIMWOODS framework on its own.

Perplexity localized. Perplexity's home services response contained the phrase "Mondays north of Yerevan." I'm based in Yerevan and Perplexity picked up my IP. Worth knowing if you test search-grounded models yourself: they lean on session signals (IP, location, recent search context) more than chat-only models do.

DeepSeek is the most number-anchored chatbot. Its Prompt 2 response recommended a "$50–100/month scheduler," a "$25/day lead tech bonus" for cross-coverage, and "1% of weekly revenue into an emergency ops account." Its Prompt 3 set specific thresholds for delegation ("approve refunds up to $50 without asking me"). And its Prompt 1 opened with the most diagnostic-shaped line of any chatbot: "You don't need a full business audit. You need a diagnostic, not a solution yet." That line is essentially the case for HiddenDrain stated by a competitor model. I'm not going to pretend I didn't notice.

What to do with this if you're an SMB owner

A few practical takeaways.

If your problem is vague and you're not sure what's broken, the answer you get will vary a lot between chatbots. Try two if you have the time — the advice that shows up in both is more reliable than either answer alone.

If your problem is well-specified and tools-shaped (the home services scenario is the canonical case), all eight chatbots converge on roughly the same correct answer. Pick the most compact response and ignore the rest. ChatGPT's 1,800-word essay won't serve you better than Claude's 600-word answer when the substance is the same.

If you want a number, you have to ask for it yourself. Tell the chatbot how many hours you suspect you're losing, or how often the problem happens, or what your team rate is. Without your input, none of them will quantify.

If you want a saveable deliverable, chat output is the wrong format. Take the conversation, paste it into a document, edit it down to a one-page summary, and treat that as your output. Or use a structured tool that produces one.

If your problem is overwhelm rather than operational chaos, lead with the human dimension. Meta AI, DeepSeek, and Claude will acknowledge what you said before moving to advice. The other five jump straight to recommendations, which lands poorly because under stress people can't easily absorb operational suggestions that arrive before the problem has been heard. Knowing which is which saves you from typing "I'm burning out" and getting back a Gantt chart.

I built HiddenDrain around the specific gap I kept noticing — nobody asks the questions an owner doesn't know to ask. The free diagnostic does. It asks for hours, frequency, dollar impact, the lost-customer story, what's been tried before. The output is a structured PDF that names which TIMWOODS categories show up in the diagnosis (typically 4–7 of the 8 categories, depending on the business), assigns each a severity rating, ties them to the owner's own answers, and ends with three recommendations with a first-step bridge under each one.

For Scenario 2 in the experiment, HD surfaced the $20,000/year lost property management client because it asked the customer-loss question. None of the eight chatbots produced that number, because none of them asked. That single data point was the most expensive thing about that hypothetical business, and only one of the nine tools surfaced it.

I'd rather you finish this post with a clearer view of what AI is and isn't good at for your business than as a converted lead. The chatbots are good at a lot. They're not built to do the diagnostic work.

Methodology

For anyone who wants to verify, reproduce, or run their own version:

Test window: May 11–12, 2026

Models tested: ChatGPT default, Claude Sonnet 4.6, Microsoft Copilot Smart, DeepSeek instant, Gemini 3 Fast, Grok fast, Meta AI instant, Perplexity default. All accessed through public consumer chat interfaces on free tiers.

HiddenDrain version: May 12, 2026 build. Three things had shipped in this build that aren't in older HD reports. First, a short emotional acknowledgment that opens the report when the owner's answers contain burnout/overwhelm language ("Before the findings — what you described is genuinely common and structurally hard to escape on your own…") rather than going straight to the Drain Level score. Second, the ROI calculator now shows a range with a ±30% confidence band rather than a single point figure — so instead of "$25,574 annual savings," the report shows "$18,000–$32,000" with the assumptions called out. Third, each of the three recommendations has a "First step this week" sentence under it pointing to one concrete action the owner can take in the next seven days. These changes shipped because earlier HD reports felt clinical on emotional scenarios, looked falsely precise in the ROI math, and pointed to destinations without footholds.
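For anyone checking the ROI math, the band is simply ±30% around the point estimate, rounded to round figures. Here's a minimal sketch of the idea, assuming nearest-thousand rounding (the shipped report's $18,000–$32,000 suggests the build rounds the high end a little more conservatively than this):

```python
# Sketch of the +/-30% confidence band. The rounding rule here is an assumption
# for illustration; the shipped HiddenDrain build rounds the high end more
# conservatively (it shows $32,000 where this prints $33,000).
def roi_band(point_estimate: float, band: float = 0.30, step: int = 1_000) -> tuple[int, int]:
    low = round(point_estimate * (1 - band) / step) * step
    high = round(point_estimate * (1 + band) / step) * step
    return int(low), int(high)

low, high = roi_band(25_574)
print(f"${low:,} - ${high:,}")   # $18,000 - $33,000
```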

Method: each prompt sent once, one-shot, with no prompt engineering and no follow-up. The full output of each response was captured. Any clarifying question the chatbot asked at the end was treated as part of the output rather than answered.

Three contamination events worth disclosing:

(1) An earlier pilot had Claude's Prompt 1 response referencing TIMWOODS unprompted. Tracing it back, my Claude profile had a HiddenDrain-related skill installed, and the session context surfaced the framework. The fresh run captured here was run from a clean profile and shows zero TIMWOODS references across the 24 chatbot responses. Both runs are saved.

(2) Perplexity's Prompt 2 response contained the phrase "Mondays north of Yerevan." Perplexity inserted my city as a sample geography based on my IP. None of the other seven chatbots localized in any detectable way.

(3) Perplexity's Prompt 3 response referenced "your 6 technicians and overwhelmed dispatcher" — details from Prompt 2, not Prompt 3. Perplexity's interface treated my three prompts as a continuous thread rather than independent sessions. The other seven chatbots were tested in fresh sessions per prompt and didn't show this behaviour. If you're testing AI tools yourself, start a new chat per scenario.

One more disclosure that matters: I worked with Claude (Anthropic) to design this experiment, score the outputs, and draft this post. Claude is also one of the eight chatbots being scored. I scored conservatively to compensate for the obvious bias risk, and you can verify the scoring against the raw outputs published here. The framing, conclusions, and decision to publish are mine. Claude was a research partner, not a co-author.

Models change. The findings in this post are a snapshot of where eight chatbots were on May 12, 2026. If you re-run these prompts in six months, you may get different answers — that's expected. The structural gaps (no quantification, no saveable deliverable, no constrained recommendation set) are the parts likely to age slowly. The scores will not.

If you find a different result in your own testing, or a gap I missed, I'd want to know about it. The point of running an experiment in public is that other people can run it back.

Written by Denis N. — process improvement specialist based in Yerevan, Armenia. PMP and ACP certified. Eight years applying lean methodology across service teams in IT, retail, and banking.