The data normalization issue
By TAUS Data Association
As data sharing gradually enters the mainstream, one key issue that all players need to address is translation memory quality. Part of the TDA agenda is to deliver a uniquely authoritative, or curated, source of language data, especially as anyone can scrape the web and build parallel corpora of extremely uneven, and therefore unproductive, translation data.
There are two key pain points in data quality: source data from the author, and parallel corpus data in TMs. The User Conference looked at both in a panel session on data cleaning strategies among data providers, moderated by Karen Combe of PTC. Typical issues on the TM side include excessive inline tags, irrelevant bits of data, mistranslated homonyms, acronyms spelled out in target versions, one-into-two sentence mismatches, punctuation inconsistencies and upper/lower case mismatches, among others. All of these cause avoidable problems in the SMT training process. The question is: how can they be fixed or avoided, ideally with an automated solution?
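As an illustration of what automated detection might look like, here is a minimal Python sketch that flags three of the TM issues listed above: excessive inline tags, punctuation inconsistencies and case mismatches. The function name `flag_issues`, the tag pattern, the `max_tags` threshold and the set of terminal punctuation marks are all assumptions made for this example, not tools or settings presented by the panel, and the checks only make sense for language pairs that share Latin-style casing and punctuation.

```python
import re

# Assumed inline-tag syntax: anything in angle brackets, e.g. <b> or <ph id="1"/>.
TAG_RE = re.compile(r"<[^<>]+>")
# Assumed set of sentence-final punctuation marks to compare.
TERMINAL_PUNCT = ".!?:;"


def final_punct(text):
    """Return the last character if it is terminal punctuation, else ''."""
    return text[-1] if text and text[-1] in TERMINAL_PUNCT else ""


def flag_issues(source, target, max_tags=5):
    """Return a list of issue labels for one source/target segment pair."""
    issues = []
    tag_count = max(len(TAG_RE.findall(source)), len(TAG_RE.findall(target)))
    if tag_count > max_tags:  # hypothetical threshold
        issues.append("too_many_tags")

    # Compare punctuation and casing on the text with tags stripped out.
    src_plain = TAG_RE.sub("", source).strip()
    tgt_plain = TAG_RE.sub("", target).strip()
    if src_plain and tgt_plain:
        if final_punct(src_plain) != final_punct(tgt_plain):
            issues.append("punctuation_mismatch")
        if src_plain[0].isupper() != tgt_plain[0].isupper():
            issues.append("case_mismatch")
    return issues


if __name__ == "__main__":
    print(flag_issues("Click <b>OK</b>.", "cliquez sur <b>OK</b>"))
```

Flagging rather than deleting keeps the decision about what to do with a suspect segment separate from the detection step.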
Intel argued that some instances could be cleaned automatically (e.g. trademark codes, formatting, suspect characters, and converting escape sequences back into characters), while others need to be thrown out, possibly anywhere from 2 to 6% of all segments. The art is to find the sweet spot between adjusting and ditching.
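The split between adjusting and ditching can be sketched in Python as follows. This is not Intel's tooling: the regular expressions, the reading of "trademark codes" as trademark symbols to be stripped, the assumption that escape sequences are HTML entities, and the length-ratio discard rule (`max_ratio`) are all hypothetical choices made for the example.

```python
import html
import re
import unicodedata

# Hypothetical patterns; a real pipeline would tune these to its own data.
TRADEMARK_RE = re.compile(r"\((?:tm|r)\)|[\u2122\u00ae]", re.IGNORECASE)
SUSPECT_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_segment(text):
    """Apply automatic fixes to one segment."""
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = TRADEMARK_RE.sub("", text)          # drop (TM), (R) and their symbols
    text = SUSPECT_RE.sub("", text)            # drop control and replacement chars
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed forms
    return WHITESPACE_RE.sub(" ", text).strip()


def should_discard(source, target, max_ratio=3.0):
    """Assumed discard rule: an empty side or wildly different lengths."""
    if not source or not target:
        return True
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter > max_ratio


def clean_tm(pairs):
    """Clean every pair, then return the kept pairs and the discard count."""
    kept, dropped = [], 0
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if should_discard(source, target):
            dropped += 1
        else:
            kept.append((source, target))
    return kept, dropped
```

Reporting the discard count alongside the kept pairs makes it easy to see whether a given rule set throws away more of the TM than intended.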
By contrast, ProMT, a hybrid engine, views items such as irregular characters, incomplete sentences and internal tags as useful data that help it understand the text during run-time parsing. Post-editors also need to see this metadata. So “irrelevant” data are in fact left untouched, as they are there for a reason. Everything else can be handled by the dictionary or by grammar rules, including tagged ‘Don’t Translates’.
In a world of very high-volume data, Microsoft had no problem with being “pretty liberal about throwing away data at training time”. On the other hand, it called for a standard to handle “factoids” such as numbers and numeric data inside sentences. TDA could, if necessary, mask factoid data in TMs. Microsoft finds that data cleaning requires half a person-month to update TM resources, and also suggested that its cleaning tools should be shared inside TDA.
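Masking factoids can be illustrated with a short Python sketch that replaces numbers with a placeholder token and keeps the original values alongside. The `<NUM>` placeholder and the number pattern are assumptions made for this example; the article does not describe a specific masking format, and no such standard is implied.

```python
import re

# Assumed number pattern: integers, decimals and digit groups such as 1,024 or 2.5.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(text, placeholder="<NUM>"):
    """Replace each number with a placeholder; return the masked text and values."""
    values = NUMBER_RE.findall(text)
    masked = NUMBER_RE.sub(placeholder, text)
    return masked, values


if __name__ == "__main__":
    masked, values = mask_numbers("Install 3 updates, 1,024 MB total, version 2.5.")
    print(masked)
    print(values)
```

With the numbers masked, segments that differ only in their numeric values collapse to the same training pattern, while the stored values remain available if they need to be restored.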
Source data improvements
Microsoft expects five simple style-guide rules to be applied to its source data to boost engine training and translation, among them: keep sentences short, use correct spelling and punctuation, and run the spell and grammar checkers. Other, more complex authoring rules did not seem to affect output quality.
See the slides from the “Normalization of translation memories” presentation at the TAUS User Conference 2009.


