Key takeaways
Despite significant advances in artificial intelligence, even the most sophisticated large language models continue to produce frequent translation errors that undermine their reliability, according to new research from Alibaba released in late October.
In a paper published on October 28, 2025, researchers from Alibaba International Digital Commerce and Tianjin University unveiled findings that challenge the perceived accuracy of AI translation systems.
Their evaluation of 17 leading models revealed hallucination rates ranging from 33% to nearly 60%, depending on model architecture and language combination.
"A critical challenge in addressing LLM hallucinations is the inadequacy of existing evaluation benchmarks," the researchers stated in their paper. They warned that current testing methods allow many models to achieve near-zero hallucination rates on traditional evaluations, "masking their true vulnerabilities."
The research team, led by Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, and Kaifu Zhang, developed HalloMTBench, a new benchmark specifically designed to expose weaknesses in modern AI translation systems.
The benchmark includes 5,435 expert-verified samples across 11 English-to-X language pairs and is now publicly available on HuggingFace.
New framework exposes hidden translation failures
The researchers introduced a diagnostic framework that categorizes hallucinations into two main types: instruction detachment, where models translate into the wrong language or fail to translate entirely, and source detachment, where content is incorrectly added or omitted from translations.
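The two-way taxonomy can be sketched as a simple classifier. This is a minimal illustrative sketch, not the paper's actual evaluation pipeline: the heuristics (an external language-ID result and a crude source/output length ratio) and all thresholds are assumptions for demonstration only.

```python
from enum import Enum

class HallucinationType(Enum):
    INSTRUCTION_DETACHED = "instruction_detachment"  # wrong language, or no translation at all
    SOURCE_DETACHED = "source_detachment"            # content added to or omitted from the source
    NONE = "none"

def classify(source: str, output: str, target_lang: str, detected_lang: str) -> HallucinationType:
    """Toy heuristic classifier. `detected_lang` is assumed to come from a
    separate language-identification step; the length-ratio bounds below are
    illustrative, not taken from the paper."""
    # Instruction detachment: empty output, or output not in the requested language.
    if not output.strip() or detected_lang != target_lang:
        return HallucinationType.INSTRUCTION_DETACHED
    # Source detachment: output length wildly out of proportion to the source,
    # used here as a crude proxy for added or omitted content.
    ratio = len(output) / max(len(source), 1)
    if ratio < 0.3 or ratio > 3.0:
        return HallucinationType.SOURCE_DETACHED
    return HallucinationType.NONE
```

In a real evaluator the two checks would be replaced by stronger signals (a language-ID model and semantic source-coverage scoring), but the decision structure mirrors the taxonomy: instruction-level failures are checked before source-level ones.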
"This taxonomy provides a clear and actionable lens for analyzing LLM translation behaviors," the research team explained.
Using this framework, they evaluated prominent models including GPT-4o, Claude-3.7-Sonnet, and various open-source alternatives.
GPT-4o-mini achieved the lowest hallucination rate, closely followed by Claude-3.7-Sonnet and GPT-4o. At the opposite end of the spectrum, ByteDance's Seed-X-PPO-7B exhibited the highest error rate.
The findings "confirm that susceptibility to translation hallucination remains a pervasive issue, even among otherwise state-of-the-art models," according to the researchers.
Distinct patterns emerge across models and languages
The study revealed that error patterns varied significantly between different AI systems. Qwen3-Max showed a strong tendency toward adding extraneous content, while GPT-4o-mini and Gemini-2.0-Flash were more likely to produce output in an incorrect language.
The research identified several "hallucination triggers" that increase error rates. Smaller open-source models proved more susceptible to mistakes than larger proprietary systems.
Models enhanced with reinforcement learning techniques tended to produce more wrong-language errors. Text length also played a role, with very short texts (0-29 characters) or very long passages (over 499 characters) triggering higher failure rates.
Language-specific performance gaps were notable as well. English-Portuguese, English-Japanese, and English-Vietnamese translations showed the highest hallucination rates, while English-Chinese translations were less affected.
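The trigger patterns above lend themselves to a pre-translation risk check. In this sketch, the character thresholds (0-29 and over 499) and the high-risk language pairs come from the findings reported here, while the flag names and the function itself are hypothetical:

```python
# High-risk English-to-X pairs reported in the study (Portuguese, Japanese, Vietnamese).
HIGH_RISK_PAIRS = {("en", "pt"), ("en", "ja"), ("en", "vi")}

def risk_flags(source: str, src_lang: str, tgt_lang: str) -> list[str]:
    """Flag inputs that match the study's reported hallucination triggers.
    Illustrative only; the study measured correlations, not a scoring rule."""
    flags = []
    n = len(source)
    if n <= 29:
        flags.append("very-short-input")    # 0-29 characters triggered more failures
    elif n > 499:
        flags.append("very-long-input")     # passages over 499 characters did too
    if (src_lang, tgt_lang) in HIGH_RISK_PAIRS:
        flags.append("high-risk-language-pair")
    return flags
```

A deployment might route flagged inputs to a stronger model or to human review rather than refusing them outright.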
Implications for AI translation reliability
The research heightens concerns about the practical deployment of AI translation systems in real-world applications.
The researchers emphasized that language-specific performance gaps "underscore the necessity of broad linguistic coverage in evaluation," cautioning that assessments limited to a few languages can "paint an incomplete, overly optimistic picture."
These distinct "hallucination fingerprints" demonstrate that "models fail in fundamentally different ways," the team noted.
They concluded that collecting diverse samples across models and language pairs "is not just a reasonable approach but a necessary one to build a comprehensive and unbiased benchmark."
The findings come as businesses and organizations increasingly rely on AI-powered translation for international communications, e-commerce, and content localization.
The research suggests that despite the impressive capabilities of large language models, significant reliability challenges remain in multilingual applications.
The complete dataset and evaluation tools are available through HuggingFace, enabling other researchers and developers to assess translation hallucinations in their own AI systems.