Common issues in TMX files
By TAUS Data Association
This note provides guidance to help ensure that good-quality TMX files are uploaded into the TDA repository. It lists the 10 most common issues and their solutions.
1. Missing tags. Users sometimes edit their TMX files and accidentally remove required tags, such as </seg> or </tuv>.
Solution: Validate the TMX files with an XML tool.
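As a minimal sketch of such a check, assuming Python's standard xml.etree.ElementTree parser is an acceptable stand-in for "an XML tool", the following reports where a file stops being well-formed. The function name check_well_formed and the two sample strings are illustrative only. Note that this tests well-formedness, not validity against the TMX DTD.

```python
import xml.etree.ElementTree as ET

def check_well_formed(tmx_text):
    """Return None if the text parses as XML, else a short error message."""
    try:
        ET.fromstring(tmx_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the problem.
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None

# Illustrative samples: the second one is missing its closing </seg> tag.
good = ('<tmx version="1.4"><header/><body><tu><tuv xml:lang="EN-US">'
        '<seg>Hello</seg></tuv></tu></body></tmx>')
bad = ('<tmx version="1.4"><header/><body><tu><tuv xml:lang="EN-US">'
       '<seg>Hello</tuv></tu></body></tmx>')

print(check_well_formed(good))
print(check_well_formed(bad))
```

A missing </seg> or </tuv> surfaces as a "mismatched tag" error at the position of the next closing tag, which is usually close to the damaged segment.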
2. Mismatched locale codes, such as using “ES-XY” for Spanish (Spain). TDA uses “ES-ES” for Spanish (Spain).
Solution: Replace “ES-XY” with “ES-ES”.
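One way to make this replacement without touching segment text is to rewrite only the xml:lang attribute of each tuv element. The sketch below assumes Python's standard xml.etree.ElementTree; the helper name replace_locale and the sample TMX string are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def replace_locale(root, old, new):
    """Rewrite xml:lang on every <tuv> whose code equals `old` (case-insensitive)."""
    changed = 0
    for tuv in root.iter("tuv"):
        if tuv.get(XML_LANG, "").upper() == old.upper():
            tuv.set(XML_LANG, new)
            changed += 1
    return changed

# Illustrative sample with one wrongly coded Spanish variant.
sample = (
    '<tmx version="1.4"><header srclang="EN-US"/><body><tu>'
    '<tuv xml:lang="EN-US"><seg>Hello</seg></tuv>'
    '<tuv xml:lang="ES-XY"><seg>Hola</seg></tuv>'
    '</tu></body></tmx>'
)
root = ET.fromstring(sample)
count = replace_locale(root, "ES-XY", "ES-ES")
print(count, ET.tostring(root, encoding="unicode"))
```

ElementTree does not keep the DOCTYPE line when it writes a file back out, so that line would need to be re-added if the result is saved.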
3. Mismatch between the locales selected for the job and the locales in the TMX file. For example, selecting “English (United Kingdom)” when the TMX file has “English (United States)” as the locale.
Solution: Re-upload the job with the correct locale.
4. Having multiple source or target locales within the same TMX file. For example, having some TUs with English UK as the source locale and some TUs with English US as the source locale.
Solution: Separate the English UK and English US TUs into two different TMX files.
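To find out whether a file mixes locales before splitting it, the TUs can be grouped by the locale codes of their tuv children. This is a rough sketch using Python's standard xml.etree.ElementTree; group_by_locales and the sample data are hypothetical, and each resulting group would still have to be written to its own TMX file with a suitable header.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def group_by_locales(root):
    """Map each tuple of locale codes (in document order) to the <tu> elements using it."""
    groups = defaultdict(list)
    for tu in root.iter("tu"):
        codes = tuple(tuv.get(XML_LANG, "").upper() for tuv in tu.findall("tuv"))
        groups[codes].append(tu)
    return dict(groups)

# Illustrative sample mixing English UK and English US source segments.
sample = (
    '<tmx version="1.4"><header srclang="EN-GB"/><body>'
    '<tu><tuv xml:lang="EN-GB"><seg>Colour</seg></tuv>'
    '<tuv xml:lang="DE-DE"><seg>Farbe</seg></tuv></tu>'
    '<tu><tuv xml:lang="EN-US"><seg>Color</seg></tuv>'
    '<tuv xml:lang="DE-DE"><seg>Farbe</seg></tuv></tu>'
    '</body></tmx>'
)
groups = group_by_locales(ET.fromstring(sample))
for codes, tus in groups.items():
    print(codes, len(tus))
```

More than one key in the result means the file contains more than one locale combination and should be split.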
5. Invalid XML characters in the TMX file.
Solution: Validate the TMX file with an XML tool or LISA’s TMXCheck tool.
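When a validator only reports the first offending character, it can help to list all of them at once. The sketch below is not TMXCheck; it is a small Python scan based on the character ranges that XML 1.0 permits (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The function name and sample string are illustrative.

```python
import re

# Anything outside the XML 1.0 "Char" ranges is not allowed in a document.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 does not allow."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in INVALID_XML_CHARS.finditer(text)
    ]

# Illustrative segment containing a form feed and a bell character.
sample = "<seg>Form\x0cfeed and bell\x07</seg>"
print(find_invalid_chars(sample))
```

The scan works on decoded text, so the file has to be read with the correct encoding first.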
6. Using “*all*” as the srclang in the header.
<?xml version="1.0"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="Logoport"
creationtoolversion="4.11"
segtype="sentence"
o-tmf="Logoport"
adminlang="EN-US"
srclang="*all*"
datatype="rtf">
</header>
Solution: Replace “*all*” with a single source language code, such as EN-US.
7. Using a BOM (byte order mark) in a UTF-8 TMX file.
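A possible way to apply this fix programmatically, sketched with Python's standard xml.etree.ElementTree, is shown below. The helper set_source_language is hypothetical, and the code passed to it must be the locale that the TUs really use as their source.

```python
import xml.etree.ElementTree as ET

def set_source_language(root, code):
    """Replace a wildcard srclang in the <header> with one concrete code."""
    header = root.find("header")
    if header is not None and header.get("srclang") == "*all*":
        header.set("srclang", code)
        return True
    return False

# Illustrative sample with the wildcard value in the header.
sample = (
    '<tmx version="1.4">'
    '<header creationtool="Logoport" adminlang="EN-US" srclang="*all*"/>'
    '<body/></tmx>'
)
root = ET.fromstring(sample)
changed = set_source_language(root, "EN-US")
print(changed, root.find("header").get("srclang"))
```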
<feff><?xml version="1.0"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="Logoport"
creationtoolversion="4.11"
segtype="sentence"
o-tmf="Logoport"
adminlang="EN-US"
srclang="*all*"
datatype="rtf"
Solution: Remove the BOM at the beginning of the file.
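A minimal sketch of the removal, assuming the file is handled as raw bytes in Python: the UTF-8 BOM is the three-byte sequence EF BB BF, available in the standard library as codecs.BOM_UTF8. The function name is illustrative.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Illustrative input: a BOM followed by the XML declaration.
with_bom = codecs.BOM_UTF8 + b'<?xml version="1.0"?>'
print(strip_utf8_bom(with_bom))
```

Working on bytes rather than decoded text leaves the rest of the file untouched, and calling the function on a file that has no BOM returns it unchanged.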
8. Selecting the wrong locale. For example, the TMX file has NL (Dutch) as the target locale code, but the user selected Danish as the target locale on the TM Sharing Form.
9. Corrupted segments.
< tu creationdate="20040619T024911Z" creationid="YORIKO" >
< tuv xml:lang="EN-US" >
< seg >US >" target='_new'>Intel Customer Support.< /seg >
< /tuv >
< tuv xml:lang="JA" >
< seg >" target='_new'>????{\f63 }????{\f63 }??????????????< /seg >
< /tuv >
< /tu >
< tu >
< tuv xml:lang="JA" >
< seg >< bpt i="1" type="font" >{\f1 < /bpt >CONTACT < ept i="1" >}< /ept >< bpt i="2" type="font" >{\f38 < /bpt >ƒtƒ@ƒCƒ‹< ept i="2" >} < /ept >
< bpt i="3" type="font" >{\f1 < /bpt >Network Associates < ept i="3" >}< /ept >< bpt i="4" type="font" >{\f38 < /bpt >‚̘A- 悪‹L Ú‚³‚ê‚Ä‚¢‚Ü‚•¡< ept i="4" >}< /ept >< /seg >
< /tuv >
< /tu >
< tu creationdate="20051230T075530Z" creationid="AMBERSKI" changedate="20060207T092645Z" changeid="KGAO" >
< prop type="Att::Product" >EMC Smarts< /prop >
< prop type="Att::Type" >Other< /prop >
< tuv xml:lang="EN-US" >
< seg >Train customer service staff on handling recurring problems and provide the information they need to respond knowledgeably to customer inquiries.< /seg >
< /tuv >
< tuv xml:lang="ZH-CN" >
< seg >Åàѵ¿Í»§•þÎñÈËÔ±´¦ÀíÖØ¸´³öÏÖµÄÎÊÌâµÄÄÜÁ¦£¬²¢ÎªËûÃÇÌṩÄÚÐеػشð¿Í»§µÄѯÎÊËùÐèµÄÐÅÏ¢¡£< /seg >
< /tuv >
< /tu >
Solution: We are filtering out these corrupted segments as part of our cleaning process.
10. Target text is not a translation. For example, we sometimes receive entirely English text in a German segment. We are not filtering these bad segments out at this point. Filtering them is not trivial because many languages share the same characters, so the comparison has to be made at the word level and the language probably has to be guessed based on some statistical analysis.
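To illustrate what a word-level guess could look like, here is a deliberately small Python sketch that counts common function words per language. It is not TDA's method: the word lists are tiny hypothetical samples, the function names are invented for this example, and a real filter would need much larger vocabularies or a proper language-identification model.

```python
import re

# Hypothetical, very small stopword samples; real lists would be far larger.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "that", "for", "with", "on"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "auf", "ein"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def looks_untranslated(target_text, expected_lang):
    """Flag a target segment whose guessed language differs from the expected one."""
    guess = guess_language(target_text)
    return guess is not None and guess != expected_lang

english = ("Train customer service staff on handling recurring problems and "
           "provide the information they need to respond knowledgeably.")
german = "Der Kunde ist nicht mit der Antwort zufrieden."

print(looks_untranslated(english, "de"))
print(looks_untranslated(german, "de"))
```

Short segments, product names, and closely related languages will often produce no guess or a wrong one, which is why such a check is better used to flag segments for review than to delete them automatically.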


