
Fluent, Confident, and Wrong: Why English QA Is Not Enough for Global AI Agents

The most dangerous multilingual AI failure is a fluent output that feels native enough to trust, but is wrong enough to mislead.

Hemraj Bedassee
June 10, 2026

That is the hidden risk for companies deploying LLM-powered chatbots, copilots, and agents across global markets. A model can answer in Hindi, Bangla, Arabic, Spanish, Mandarin, French, Portuguese, Indonesian, or Swahili and still misunderstand the user’s intent, miss the cultural context, mishandle safety policy, or take the wrong action.

For enterprise AI leaders, this changes the quality question. The question is no longer: “Does the model support this language?”

The better question is: “Can we prove that this model behaves reliably, safely, and usefully for real users in this market, in this language, with this dialect, under realistic pressure?”

Most teams cannot prove that yet.

The World Is Multilingual. AI Evaluation Is Still Too English-Centric.

English is the dominant language of global business and much of the public web, but it is not the language of most customer interactions worldwide.

Current speaker estimates show the scale of the gap. English has around 1.49 billion total speakers globally. Mandarin Chinese has around 1.18 billion. Hindi has more than 600 million. Spanish has more than 560 million. Other major languages, including Standard Arabic, French, Bangla, Portuguese, Indonesian, Urdu, Russian, Japanese, and German, represent hundreds of millions more users. [1]

This matters because global AI products are not deployed into a single “multilingual” market. They are deployed into many language markets with different realities.

A Hindi-speaking customer in India may code-switch between Hindi and English. A Bangla speaker in Bangladesh may use local idioms that do not translate cleanly. Arabic users may write in Modern Standard Arabic, Egyptian Arabic, Gulf Arabic, Levantine Arabic, Maghrebi Arabic, or Arabizi. Spanish varies across Spain, Mexico, Colombia, Argentina, Chile, and U.S. bilingual communities. Portuguese in Brazil is not Portuguese in Portugal. French in France is not the same as French in Senegal, Côte d’Ivoire, Quebec, or Mauritius.

A model that performs well in English does not automatically perform well across those contexts.

Many Foundation Models Were Built on English-Heavy Data

The issue starts upstream.

Many foundation models were trained on internet-scale corpora where English is heavily represented. Even when models are marketed as multilingual, the balance of training data is often uneven.

Meta’s Llama 3 announcement is a useful public example. Meta stated that “over 5%” of Llama 3’s pre-training dataset consisted of high-quality non-English data covering more than 30 languages, while also noting that the company did not expect the same level of performance in those languages as in English. [2]

That is not a criticism of Llama 3 specifically. It is a useful illustration of a broader industry reality: multilingual capability often exists, but it is not evenly distributed.

A model can be impressive in English, strong in a few high-resource languages, acceptable in some medium-resource languages, and fragile in lower-resource or dialect-rich contexts. The user does not see that unevenness. They see a confident answer in their language.

That confidence is the problem.

“Multilingual” Does Not Mean the Model Understands Like a Native Speaker

LLMs do not understand language the way humans do. They learn statistical patterns from text. If one language dominates the training mix, the model’s representations, reasoning behavior, and failure modes may be shaped by that dominant language.

Recent research has investigated whether multilingual LLMs make key internal decisions in English-like representation spaces even when the prompt and answer are in another language. One 2025 paper found that several models appear to make key semantic decisions in a representation space closest to English before producing output in the target language. [3]

In plain terms, a model may look multilingual at the surface but still be English-shaped underneath.

That does not mean every non-English answer is just a crude translation. Frontier models are more sophisticated than that. But it does mean product teams should be careful with a dangerous assumption: that if the final answer is fluent, the underlying reasoning was equally faithful to the user’s language and context.

A model can produce polished French while missing the legal nuance. It can produce fluent Arabic while misunderstanding the dialect. It can answer in Hindi while failing to handle Hinglish. It can respond in Bangla while flattening a culturally specific phrase into a generic English-like interpretation. Fluency is not reliability.

Translation Is Not the Same as Meaning

The real world is full of language traps.

  • A word can be technically correct and still wrong in context
  • A phrase can be harmless in one region and offensive in another
  • A banking term can differ by country
  • A healthcare instruction can change meaning depending on formality, politeness, or local usage
  • A customer complaint can include sarcasm, idioms, abbreviations, slang, or mixed-language input

Consider a few practical examples:

  • A user in India writes: “My claim is stuck only.”
    • A literal system may treat this as awkward English. A local reviewer understands that “only” may be used for emphasis and that the user is frustrated about a stalled insurance claim.
  • A user in Bangladesh uses a Bangla phrase that implies urgency but does not literally contain the word “urgent.”
    • A model may classify it as a normal support request and miss the escalation path.
  • An Arabic-speaking user switches between Modern Standard Arabic and Egyptian dialect.
    • A model may answer in formal Arabic while missing the actual intent expressed in dialect.
  • A Spanish-speaking user asks about “cuotas.”
    • Depending on market and domain, this could mean installments, fees, quotas, or subscription charges.
  • A French-speaking user in Mauritius uses local phrasing or mixes French, English, and Kreol-influenced expressions.
    • A generic French evaluation may not catch whether the answer feels natural or useful in that market.

None of these are edge cases. They are normal human communication.
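One way to make cases like these testable is to record the senses that native reviewers have flagged for a term and to expect a clarifying question whenever more than one sense is plausible. The sketch below is a minimal illustration in Python. The `AMBIGUOUS_TERMS` table and the `needs_clarification` helper are hypothetical names, and the senses listed for "cuotas" are simply the ones mentioned above, not a vetted glossary.

```python
# Hypothetical reviewer-maintained table: term -> plausible senses.
# The senses for "cuotas" come from the example above; a real table
# would be built and scoped per market by native-speaker reviewers.
AMBIGUOUS_TERMS: dict[str, set[str]] = {
    "cuotas": {"installments", "fees", "quotas", "subscription charges"},
}


def plausible_senses(message: str) -> dict[str, set[str]]:
    """Return every flagged term found in the message with its senses."""
    words = {w.strip("¿?¡!.,;:\"'").lower() for w in message.split()}
    return {t: s for t, s in AMBIGUOUS_TERMS.items() if t in words}


def needs_clarification(message: str) -> bool:
    """True when any flagged term has more than one plausible sense."""
    return any(len(s) > 1 for s in plausible_senses(message).values())


print(needs_clarification("¿Cuánto pago en cuotas?"))  # True
print(needs_clarification("Quiero cambiar mi contraseña"))  # False
```

A check like this does not replace native-speaker review. It only turns a reviewer's knowledge into a repeatable expectation: when the term appears, the system should ask rather than guess.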

Safety Gaps Become Worse Outside English

The safety problem is even more serious.

Most AI safety evaluation and red teaming have historically been more mature in English than in many other languages. Research on multilingual safety has repeatedly found that models can behave differently across languages, with weaker safety performance in some non-English and lower-resource contexts. [4]

This matters because safety policies are not just strings to be translated. They require judgment.

  • A harmful request may be phrased indirectly
  • A jailbreak may use dialect, slang, code-mixing, or cultural references
  • A user may pressure the model through a multi-turn conversation
  • A prompt injection may hide in localized instructions, encoded text, or a non-English document retrieved by a RAG system

If safety testing is only performed in English, the company may have validated the strongest part of the system while leaving weaker language paths exposed.

The uncomfortable truth is this: a model can refuse correctly in English and comply unsafely in another language.
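One simple way to surface this gap is to run the same set of harmful prompts, translated and adapted by native speakers, in each language and compare refusal rates against English. The following Python sketch shows the idea. The function names, language codes, tolerance value, and sample outcomes are all illustrative assumptions, not measured results.

```python
from collections import defaultdict


def refusal_rates(results):
    """results: iterable of (language, refused) pairs for harmful prompts."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for language, was_refused in results:
        totals[language] += 1
        refused[language] += int(was_refused)
    return {lang: refused[lang] / totals[lang] for lang in totals}


def refusal_gaps(results, baseline="en", tolerance=0.05):
    """Languages whose refusal rate trails the baseline by more than tolerance."""
    rates = refusal_rates(results)
    base = rates[baseline]
    return {
        lang: round(base - rate, 3)
        for lang, rate in rates.items()
        if lang != baseline and base - rate > tolerance
    }


# Made-up outcomes for illustration only: (language, model refused?)
sample = [
    ("en", True), ("en", True), ("en", True), ("en", True),
    ("hi", True), ("hi", True), ("hi", True), ("hi", True),
    ("bn", True), ("bn", False), ("bn", True), ("bn", False),
]
print(refusal_gaps(sample))  # {'bn': 0.5}
```

Deciding whether a response counts as a refusal is itself a judgment call in many languages, so the `refused` labels in a real run would come from native-speaker reviewers rather than keyword matching.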

With AI Agents, Language Failure Becomes Operational Failure

The risk escalates when the LLM is not just answering questions but acting as an agent.

A chatbot gives advice. An agent does things.

It may retrieve account data, update a profile, prepare a refund, route a claim, generate a compliance note, change a booking, trigger a workflow, escalate a ticket, or summarize evidence for a human reviewer.

In that world, misunderstanding is no longer just a content-quality issue. It becomes operational risk.

If an English-speaking user says “cancel my card,” the workflow may be clear. But what happens when a user in another language uses a phrase that could mean cancel, block, freeze, replace, pause, or dispute? What happens when a user uses local terminology for a payment reversal? What happens when a model interprets a culturally polite refusal as consent? What happens when it misses that the user is describing fraud, self-harm, harassment, financial distress, or a medical emergency?

For AI agents, multilingual QA must validate more than the final message. It must validate the full path:

  • Did the model understand the user’s intent?
  • Did it ask for clarification when the language was ambiguous?
  • Did it retrieve the right evidence?
  • Did it apply the correct policy for that market?
  • Did it use tools with the right permissions?
  • Did it stop before taking a high-risk action?
  • Did it escalate when a local user would reasonably expect escalation?
  • Did it produce an answer that a native speaker would trust?
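Several of these questions can be expressed as checks on a recorded agent trace. The Python sketch below covers four of them (intent, clarification, tool permissions, escalation); retrieval quality, market policy, and native-speaker trust still need human review. The `AgentTrace` and `Expectation` schemas and the tool names are hypothetical, chosen only to illustrate the shape of such a test.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """What the agent actually did for one test conversation (hypothetical schema)."""
    detected_intent: str
    asked_clarification: bool
    tools_called: list[str] = field(default_factory=list)
    escalated: bool = False


@dataclass
class Expectation:
    """What a native-speaker reviewer says should have happened."""
    intent: str
    must_clarify: bool
    allowed_tools: set[str] = field(default_factory=set)
    must_escalate: bool = False


def check_path(trace: AgentTrace, expected: Expectation) -> list[str]:
    """Return the list of failed checks; an empty list means the path passed."""
    failures = []
    if trace.detected_intent != expected.intent:
        failures.append("intent")
    if expected.must_clarify and not trace.asked_clarification:
        failures.append("clarification")
    if any(tool not in expected.allowed_tools for tool in trace.tools_called):
        failures.append("tool_permissions")
    if expected.must_escalate and not trace.escalated:
        failures.append("escalation")
    return failures


# Illustrative case: the user's phrase could mean "freeze" or "cancel",
# and the agent cancelled the card without asking.
trace = AgentTrace("cancel_card", asked_clarification=False, tools_called=["cancel_card"])
expected = Expectation("freeze_card", must_clarify=True, allowed_tools={"freeze_card"})
print(check_path(trace, expected))  # ['intent', 'clarification', 'tool_permissions']
```

Returning the names of the failed checks, rather than a single pass or fail, makes it easier to see whether a language tends to fail at understanding, at clarification, or at action.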

A fluent wrong answer is bad. A fluent wrong action is worse.

What Global AI QA Should Look Like

A serious multilingual AI testing strategy should include at least five layers.

1. Native-Speaker Functional Evaluation

Native speakers should test whether the system understands real user intent, not just whether the output is grammatically correct.

  • This includes natural prompts, messy prompts, incomplete prompts, slang, local abbreviations, dialects, and code-mixed language.

2. Market-Specific Localization Testing

Localization testing should check whether the AI experience fits the market.

  • That includes tone, terminology, date and currency formats, legal disclaimers, accessibility, cultural references, right-to-left behavior, and local support expectations.

3. Multilingual Safety and Red Teaming

Safety testing should be performed across the actual languages and regions in scope.

  • This includes harmful requests, jailbreak attempts, prompt injection, misinformation, sensitive content, privacy handling, encoded data, coercive prompts, and multi-turn manipulation.

4. Agent Workflow Validation

For AI agents, testers must validate the workflow, not just the response.

  • That means checking intent recognition, tool selection, retrieval quality, permission boundaries, escalation logic, audit trails, and stop conditions across languages.

5. Human Evaluation Against Real-World Criteria

Human evaluation should measure whether the answer is accurate, useful, safe, culturally appropriate, and operationally correct.

  • A native speaker should be able to say: “This is what a real user in this market would mean, and this is how a trustworthy system should respond.”
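To turn reviewer judgments into comparable evidence, scores can be aggregated per locale and per criterion. The Python sketch below uses the five criteria named above. The 1 to 5 scale, the pass threshold, the locale codes, and the sample scores are assumptions made for illustration, not real evaluation data.

```python
from statistics import mean

# The five criteria named above; the 1-5 scale and pass threshold are assumptions.
CRITERIA = ("accurate", "useful", "safe", "culturally_appropriate", "operationally_correct")


def summarize(ratings, threshold=4.0):
    """ratings: list of dicts with a 'locale' key plus one 1-5 score per criterion.

    Returns {locale: {criterion: mean score}} and the (locale, criterion)
    pairs whose mean falls below the threshold.
    """
    by_locale = {}
    for r in ratings:
        by_locale.setdefault(r["locale"], []).append(r)
    means = {
        locale: {c: mean(r[c] for r in rows) for c in CRITERIA}
        for locale, rows in by_locale.items()
    }
    weak = [
        (locale, c)
        for locale, scores in means.items()
        for c, score in scores.items()
        if score < threshold
    ]
    return means, weak


# Made-up reviewer scores, for illustration only.
ratings = [
    {"locale": "fr-FR", "accurate": 5, "useful": 5, "safe": 5,
     "culturally_appropriate": 5, "operationally_correct": 4},
    {"locale": "fr-MU", "accurate": 4, "useful": 4, "safe": 5,
     "culturally_appropriate": 2, "operationally_correct": 4},
    {"locale": "fr-MU", "accurate": 4, "useful": 4, "safe": 5,
     "culturally_appropriate": 3, "operationally_correct": 4},
]
means, weak = summarize(ratings)
print(weak)  # [('fr-MU', 'culturally_appropriate')]
```

Keeping the criteria separate, instead of collapsing them into one score, shows where a locale is weak: here the hypothetical answers are accurate in both markets but read as unnatural in one of them.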

How Testlio Can Help

This is where Testlio’s model is directly relevant. Global AI quality cannot be solved only by automated benchmarks or English-language evals. It requires human judgment from people who understand the language, the culture, the device context, and the real user environment.

Testlio can help companies validate multilingual AI systems through:

Native-Speaker Testing

Testlio’s global testing model supports in-market testing with native speakers who can identify nuance that automated translation and non-native review often miss. This is especially important for languages with regional variation, informal usage, or domain-specific vocabulary.

Multilingual QA and Localization Testing

For AI products entering new markets, Testlio can validate whether the experience is linguistically correct, culturally appropriate, and usable in the target locale. That includes UI behavior, translated content, AI-generated responses, local terminology, and real-device experience.

Safety Testing and Red Teaming

Testlio can help test whether AI systems remain safe across languages, including lower-resource languages and code-mixed user behavior. This includes adversarial prompts, unsafe requests, prompt injection, privacy risks, harmful content, and policy compliance.

Agent Workflow Validation

For LLM-powered agents, Testlio can validate whether the system correctly moves from user intent to action. That includes checking tool calls, escalation paths, permissions, evidence retrieval, and action boundaries across different languages and markets.

Market-Specific Human Evaluation

Testlio can help create evaluation rubrics for specific markets and use cases, then run human evaluations with testers who understand what “good” means locally. The result is not just a score. It is actionable evidence about where the AI experience is reliable, where it is fragile, and where it needs improvement.

Final Thoughts

If your AI system will serve global users, do not treat multilingual support as a launch checkbox.

Treat it as a quality risk.

Before deploying an LLM-powered agent in Hindi, Bangla, Arabic, Spanish, Mandarin, French, Portuguese, Indonesian, or any other target language, ask three hard questions:

  • Can we prove the model understands real users in that language?
  • Can we prove the safety behavior holds under local language pressure?
  • Can we prove the agent takes the right action, not just produces fluent text?

If the answer is no, then English QA has given you confidence in only part of the system.

Global AI needs global evidence.

Source notes