MT Quality Gets a Boost with Shared Translation Memories from TAUS Data Association
Pilot conducted by Microsoft
Output quality is often cited as the main barrier to greater adoption of machine translation. This study illustrates how training statistical machine translation (SMT) engines on domain-specific data improves output quality. See also the results of the Cross Language pilot, which used human evaluation.
TDA members are seeking quality improvements to better meet the demand for real-time translation, as well as the need to increase productivity rates in the industry.
At the MT Summit in Ottawa (August 28), Chris Wendt of Microsoft presented the findings of a recent pilot project that used translation memories from more than ten TDA members to train the Microsoft statistical machine translation engine. The main tests were performed on Chinese and German, with customization for Sybase iAnywhere. Additional tests were run on Polish and Japanese, with customization for Adobe and Dell. BLEU scores rose consistently and significantly, with increases between 22% and 74% compared to engines trained only on Microsoft or generally available data.
The conclusions at a glance:
- The best results were achieved by using the maximum available data within the industry domain of computer-related technical documents for translation model training, followed by language model training on the customer-specific target language data.
- The diversity in training data from 13 to 15 different TDA member organizations in the IT sector had a positive effect on the results. Microsoft’s large data pool by itself did not give the hoped-for boost in MT quality output.
- A system can be customized with small amounts of target language material, as long as a diverse set of in-domain parallel data is available.
- Small data providers benefit more from sharing translation memories than large data providers, but all data providers benefit.
- An MT system trained with combined data can deliver significantly improved translation quality, compared to a system trained only with the provider’s own data plus baseline training.
Overview of BLEU scores and percentage increases in quality

Microsoft’s MT engine is a linguistically informed SMT engine. It uses parallel data for translation model training and target language data for target language model customization, and it applies syntactic reordering techniques. The translation training data set consisted of 7.5 million segments for Chinese and 8.5 million for German.

A subset of the data was reserved for the BLEU scoring.
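As a rough illustration of the evaluation setup, the sketch below reserves a held-out subset of segment pairs for scoring and computes a percentage increase between two BLEU scores. It assumes the reported 22% to 74% figures are relative increases over the baseline score, which the article does not state explicitly. The function names, the 5% held-out fraction, and the example scores are hypothetical and are not taken from the pilot.

```python
import random


def split_held_out(segments, held_out_fraction=0.01, seed=0):
    """Shuffle segment pairs and reserve a fraction of them for scoring.

    Returns (training_segments, held_out_segments). The fraction and the
    seed are illustrative defaults, not values from the pilot.
    """
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def relative_increase(baseline_bleu, customized_bleu):
    """Percentage increase of the customized score over the baseline."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (customized_bleu - baseline_bleu) / baseline_bleu * 100.0


# Made-up data and scores, for illustration only.
pairs = [(f"source {i}", f"target {i}") for i in range(1000)]
train, test = split_held_out(pairs, held_out_fraction=0.05)
print(len(train), len(test))          # 950 50
print(relative_increase(20.0, 30.0))  # 50.0
```

Keeping the held-out segments out of training matters because scoring an engine on segments it has already seen would overstate the improvement.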
Chris Wendt's presentation



