TAUS Data Association


TMX Quality

Common issues in TMX files

By TAUS Data Association

This note offers guidance on uploading good-quality TMX files to the TDA repository. It lists the eight most common issues and their solutions.


1. Missing tags. Users sometimes edit their TMX files and accidentally remove required closing tags, such as </seg> or </tuv>.

Solution: Validate the TMX files with an XML tool.
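
As one possible way to run such a check, the sketch below uses Python's standard xml.etree.ElementTree parser to test whether a TMX string is well-formed and to report where parsing fails. The function name check_well_formed is hypothetical, and the check covers well-formedness only, not validity against the TMX DTD.

```python
import xml.etree.ElementTree as ET

def check_well_formed(tmx_text):
    """Return (True, None) if the text parses as XML, else (False, message)."""
    try:
        ET.fromstring(tmx_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the failure.
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, None

good = "<tmx><body><tu><tuv><seg>Hello</seg></tuv></tu></body></tmx>"
bad = "<tmx><body><tu><tuv><seg>Hello</tuv></tu></body></tmx>"  # </seg> removed

print(check_well_formed(good))
print(check_well_formed(bad))
```

The second call reports a mismatched tag, which is the kind of error a deleted </seg> or </tuv> produces.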


2. Mismatched locale codes, such as “ES-XY” for Spanish (Spain). TDA uses “ES-ES” for Spanish (Spain).

Solution: Replace “ES-XY” with “ES-ES”.
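
The replacement can be scripted rather than done by hand. The following is a minimal sketch, assuming the locale codes live in the xml:lang attribute of each tuv element. The mapping table and the function name normalize_locales are illustrative; only the ES-XY to ES-ES pair comes from this note.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping table; extend it with whatever codes need fixing.
LOCALE_FIXES = {"ES-XY": "ES-ES"}

def normalize_locales(root, fixes=LOCALE_FIXES):
    """Rewrite xml:lang on every <tuv> in place; return the number of changes."""
    changed = 0
    for tuv in root.iter("tuv"):
        lang = tuv.get(XML_LANG)
        if lang is not None and lang.upper() in fixes:
            tuv.set(XML_LANG, fixes[lang.upper()])
            changed += 1
    return changed

root = ET.fromstring(
    '<tmx><body><tu>'
    '<tuv xml:lang="EN-US"><seg>Hello</seg></tuv>'
    '<tuv xml:lang="ES-XY"><seg>Hola</seg></tuv>'
    '</tu></body></tmx>'
)
changed = normalize_locales(root)
print(changed, [tuv.get(XML_LANG) for tuv in root.iter("tuv")])
```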


3. Multiple source or target locales within the same TMX file. For example, some TUs have English UK as the source locale and others have English US.

Solution: Separate the English UK and English US TUs into two different TMX files.
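
A rough sketch of the separation step follows. It groups TUs by source locale under the assumption, not stated in this note, that the first tuv in each tu holds the source segment. The function name group_by_source_locale is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def group_by_source_locale(root):
    """Map each source locale to its <tu> elements.

    Assumption: the first <tuv> inside a <tu> is the source segment.
    """
    groups = defaultdict(list)
    for tu in root.iter("tu"):
        first_tuv = tu.find("tuv")
        if first_tuv is not None:
            groups[first_tuv.get(XML_LANG, "").upper()].append(tu)
    return dict(groups)

root = ET.fromstring(
    '<tmx><body>'
    '<tu><tuv xml:lang="EN-GB"><seg>Colour</seg></tuv>'
    '<tuv xml:lang="DE-DE"><seg>Farbe</seg></tuv></tu>'
    '<tu><tuv xml:lang="EN-US"><seg>Color</seg></tuv>'
    '<tuv xml:lang="DE-DE"><seg>Farbe</seg></tuv></tu>'
    '<tu><tuv xml:lang="EN-GB"><seg>Centre</seg></tuv>'
    '<tuv xml:lang="DE-DE"><seg>Zentrum</seg></tuv></tu>'
    '</body></tmx>'
)
groups = group_by_source_locale(root)
for locale, tus in sorted(groups.items()):
    print(locale, len(tus))
```

Each group can then be written to its own TMX file with a header whose srclang matches the group's locale.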


4. Invalid XML characters in the TMX file.

Solution: Validate the TMX file with an XML tool or LISA’s TMXCheck tool.
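
To locate the offending characters before running a validator, a scan along the following lines may help. It is a sketch that flags every character outside the Char production of XML 1.0; the function name find_invalid_xml_chars is hypothetical.

```python
import re

# Anything outside the XML 1.0 "Char" production is not allowed in a document:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in INVALID_XML_CHARS.finditer(text)
    ]

sample = "<seg>Form\x0cfeed and \x00null</seg>"
print(find_invalid_xml_chars(sample))
```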

 

5. Using “all” as the srclang in the header.
<?xml version="1.0"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="Logoport"
creationtoolversion="4.11"
segtype="sentence"
o-tmf="Logoport"
adminlang="EN-US"
srclang="*all*"
datatype="rtf">
</header>

Solution: Replace “all” with a single source language code, such as EN-US.
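
The header fix can be sketched as follows, assuming the file has already been parsed with ElementTree. The function name set_source_language is hypothetical, and the chosen code should be used only when every TU really has that source language.

```python
import xml.etree.ElementTree as ET

def set_source_language(root, srclang):
    """Replace a wildcard srclang in the <header> with one explicit code."""
    header = root.find("header")
    if header is not None and header.get("srclang") == "*all*":
        header.set("srclang", srclang)
        return True
    return False

root = ET.fromstring(
    '<tmx version="1.4">'
    '<header creationtool="Logoport" srclang="*all*" datatype="rtf"/>'
    '<body/></tmx>'
)
changed = set_source_language(root, "EN-US")
print(changed, root.find("header").get("srclang"))
```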


6. Using a BOM (byte order mark) in a UTF-8 TMX file.
<feff><?xml version="1.0"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="Logoport"
creationtoolversion="4.11"
segtype="sentence"
o-tmf="Logoport"
adminlang="EN-US"
srclang="*all*"
datatype="rtf"

Solution: Remove the BOM at the beginning of the file.
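
A minimal sketch of the removal, working on the raw bytes of the file, is shown below. The function name strip_utf8_bom is hypothetical; codecs.BOM_UTF8 is the standard-library constant for the bytes EF BB BF.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = codecs.BOM_UTF8 + b'<?xml version="1.0"?><tmx version="1.4"/>'
clean = strip_utf8_bom(raw)
print(clean[:5])
```

When reading the file as text instead, decoding with the "utf-8-sig" codec also discards a leading BOM.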

 

7. Corrupted segments
<tu creationdate="20040619T024911Z" creationid="YORIKO">
<tuv xml:lang="EN-US">
<seg>US >" target='_new'>Intel Customer Support.</seg>
</tuv>
<tuv xml:lang="JA">
<seg>" target='_new'>????{\f63 }????{\f63 }??????????????</seg>
</tuv>
</tu>

<tu>
<tuv xml:lang="JA">
<seg><bpt i="1" type="font">{\f1 </bpt>CONTACT <ept i="1">}</ept><bpt i="2" type="font">{\f38 </bpt>ƒtƒ@ƒCƒ‹<ept i="2">} </ept>
<bpt i="3" type="font">{\f1 </bpt>Network Associates <ept i="3">}</ept><bpt i="4" type="font">{\f38 </bpt>‚̘A- 悪‹L Ú‚³‚ê‚Ä‚¢‚Ü‚•¡<ept i="4">}</ept></seg>
</tuv>
</tu>

<tu creationdate="20051230T075530Z" creationid="AMBERSKI" changedate="20060207T092645Z" changeid="KGAO">
<prop type="Att::Product">EMC Smarts</prop>
<prop type="Att::Type">Other</prop>
<tuv xml:lang="EN-US">
<seg>Train customer service staff on handling recurring problems and provide the information they need to respond knowledgeably to customer inquiries.</seg>
</tuv>
<tuv xml:lang="ZH-CN">
<seg>Åàѵ¿Í»§•þÎñÈËÔ±´¦ÀíÖØ¸´³öÏÖµÄÎÊÌâµÄÄÜÁ¦£¬²¢ÎªËûÃÇÌṩÄÚÐеػشð¿Í»§µÄѯÎÊËùÐèµÄÐÅÏ¢¡£</seg>
</tuv>
</tu>

Solution: We are filtering out these corrupted segments as part of our cleaning process.
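
This note does not describe how the cleaning process detects corruption. Purely as an illustration, the sketch below flags segments that show a few common signs of encoding damage, similar to those in the examples above. The signals and the function name looks_corrupted are assumptions, not TDA's actual rules, and the heuristic will produce both false positives and misses.

```python
import re

# Heuristic signals only; these are assumptions, not TDA's actual rules.
QUESTION_RUN = re.compile(r"\?{4,}")          # text lost in a bad conversion
REPLACEMENT_CHAR = "\ufffd"                   # decoder could not map a byte
C1_CONTROLS = re.compile(r"[\u0080-\u009f]")  # control codes left by mis-decoding

def looks_corrupted(segment):
    """Return True if the segment shows any of the damage signals above."""
    return bool(
        QUESTION_RUN.search(segment)
        or REPLACEMENT_CHAR in segment
        or C1_CONTROLS.search(segment)
    )

print(looks_corrupted("????{\\f63 }????{\\f63 }??????????????"))
print(looks_corrupted("Intel Customer Support."))
```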


8. Target text is not a translation. For example, we sometimes receive entirely English text in a German segment. We are not filtering these bad segments out at this point. Doing so is not trivial because many languages share the same characters, so we have to compare at the word level and probably guess the language based on some statistical analysis.
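
One simple form of word-level statistical guessing is counting stopword hits per language. The sketch below illustrates the idea for English and German only. The stopword lists are tiny and illustrative, the function names guess_language and is_suspect are hypothetical, and this is not a method TDA has said it uses.

```python
import re

# Tiny illustrative stopword lists; a real check would need far larger ones.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for", "with", "on"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "auf", "ein"},
}

def guess_language(text):
    """Return the language with the most stopword hits, or None if unclear."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Refuse to guess when nothing matched or the top score is tied.
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best

def is_suspect(target_text, expected_lang):
    """Flag a target segment whose guessed language contradicts its locale."""
    guess = guess_language(target_text)
    return guess is not None and guess != expected_lang

english = "Provide the information they need to respond to customer inquiries."
german = "Das ist die Information, die der Kunde nicht hat."
print(is_suspect(english, "de"), is_suspect(german, "de"))
```

Short segments, product names, and segments with no stopwords return None from guess_language and are therefore never flagged, which keeps false positives down at the cost of missing some untranslated text.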