Key takeaways
Despite significant advances in artificial intelligence, even the most sophisticated large language models continue to produce frequent translation errors that undermine their reliability, according to new research from Alibaba released in late October.
In a paper published on October 28, 2025, researchers from Alibaba International Digital Commerce and Tianjin University unveiled findings that challenge the perceived accuracy of AI translation systems.
Their evaluation of 17 leading models revealed hallucination rates ranging from 33% to nearly 60%, depending on model architecture and language combination.
"A critical challenge in addressing LLM hallucinations is the inadequacy of existing evaluation benchmarks," the researchers stated in their paper. They warned that current testing methods allow many models to achieve near-zero hallucination rates on traditional evaluations, "masking their true vulnerabilities."
The research team, led by Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, and Kaifu Zhang, developed HalloMTBench, a new benchmark specifically designed to expose weaknesses in modern AI translation systems.
The benchmark includes 5,435 expert-verified samples across 11 English-to-X language pairs and is now publicly available on HuggingFace.
New framework exposes hidden translation failures
The researchers introduced a diagnostic framework that categorizes hallucinations into two main types: instruction detachment, where models translate into the wrong language or fail to translate entirely, and source detachment, where content is incorrectly added or omitted from translations.
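The two-way taxonomy can be sketched as a simple classifier. This is a minimal illustrative sketch, not the paper's actual evaluation pipeline: the heuristics (an external language-ID result and a crude source/output length ratio) and all thresholds are assumptions for demonstration only.

```python
from enum import Enum

class HallucinationType(Enum):
    INSTRUCTION_DETACHED = "instruction_detachment"  # wrong language, or no translation at all
    SOURCE_DETACHED = "source_detachment"            # content added to or omitted from the source
    NONE = "none"

def classify(source: str, output: str, target_lang: str, detected_lang: str) -> HallucinationType:
    """Toy heuristic classifier. `detected_lang` is assumed to come from a
    separate language-identification step; the length-ratio bounds below are
    illustrative, not taken from the paper."""
    # Instruction detachment: empty output, or output not in the requested language.
    if not output.strip() or detected_lang != target_lang:
        return HallucinationType.INSTRUCTION_DETACHED
    # Source detachment: output length wildly out of proportion to the source,
    # used here as a crude proxy for added or omitted content.
    ratio = len(output) / max(len(source), 1)
    if ratio < 0.3 or ratio > 3.0:
        return HallucinationType.SOURCE_DETACHED
    return HallucinationType.NONE
```

In a real evaluator the two checks would be replaced by stronger signals (a language-ID model and semantic source-coverage scoring), but the decision structure mirrors the taxonomy: instruction-level failures are checked before source-level ones.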
"This taxonomy provides a clear and actionable lens for analyzing LLM translation behaviors," the research team explained.
Using this framework, they evaluated prominent models including GPT-4o, Claude-3.7-Sonnet, and various open-source alternatives.
GPT-4o-mini achieved the lowest hallucination rate, closely followed by Claude-3.7-Sonnet and GPT-4o. At the opposite end of the spectrum, ByteDance's Seed-X-PPO-7B exhibited the highest error rate.
The findings "confirm that susceptibility to translation hallucination remains a pervasive issue, even among otherwise state-of-the-art models," according to the researchers.
Distinct patterns emerge across models and languages
The study revealed that error patterns varied significantly between different AI systems. Qwen3-Max showed a strong tendency toward adding extraneous content, while GPT-4o-mini and Gemini-2.0-Flash were more likely to produce output in an incorrect language.
The research identified several "hallucination triggers" that increase error rates. Smaller open-source models proved more susceptible to mistakes than larger proprietary systems.
Models enhanced with reinforcement learning techniques tended to produce more wrong-language errors. Text length also played a role, with very short texts (0-29 characters) or very long passages (over 499 characters) triggering higher failure rates.
Language-specific performance gaps were notable as well. English-Portuguese, English-Japanese, and English-Vietnamese translations showed the highest hallucination rates, while English-Chinese translations were less affected.
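The trigger patterns above lend themselves to a pre-translation risk check. In this sketch, the character thresholds (0-29 and over 499) and the high-risk language pairs come from the findings reported here, while the flag names and the function itself are hypothetical:

```python
# High-risk English-to-X pairs reported in the study (Portuguese, Japanese, Vietnamese).
HIGH_RISK_PAIRS = {("en", "pt"), ("en", "ja"), ("en", "vi")}

def risk_flags(source: str, src_lang: str, tgt_lang: str) -> list[str]:
    """Flag inputs that match the study's reported hallucination triggers.
    Illustrative only; the study measured correlations, not a scoring rule."""
    flags = []
    n = len(source)
    if n <= 29:
        flags.append("very-short-input")    # 0-29 characters triggered more failures
    elif n > 499:
        flags.append("very-long-input")     # passages over 499 characters did too
    if (src_lang, tgt_lang) in HIGH_RISK_PAIRS:
        flags.append("high-risk-language-pair")
    return flags
```

A deployment might route flagged inputs to a stronger model or to human review rather than refusing them outright.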
Implications for AI translation reliability
The research heightens concerns about the practical deployment of AI translation systems in real-world applications.
The researchers emphasized that language-specific performance gaps "underscore the necessity of broad linguistic coverage in evaluation," cautioning that assessments limited to a few languages can "paint an incomplete, overly optimistic picture."
These distinct "hallucination fingerprints" demonstrate that "models fail in fundamentally different ways," the team noted.
They concluded that collecting diverse samples across models and language pairs "is not just a reasonable approach but a necessary one to build a comprehensive and unbiased benchmark."
The findings come as businesses and organizations increasingly rely on AI-powered translation for international communications, e-commerce, and content localization.
The research suggests that despite the impressive capabilities of large language models, significant reliability challenges remain in multilingual applications.
The complete dataset and evaluation tools are available through HuggingFace, enabling other researchers and developers to assess translation hallucinations in their own AI systems.