TAUS Data Association

Friday
Jul 30th

Opening access to new sectors

Pilot conducted by ProMT

This test illustrates that by opening up access to even a relatively small amount of domain-specific language data, TDA helps machine translation companies improve output quality, start gaining knowledge about content in new sectors, and increase their chances of winning the trust of organizations in these sectors.



TDA members are seeking quality improvements to better meet the demand for real-time translation as well as the need to increase productivity rates in the industry.

At a glance:

  • The ProMT 8 engine was successfully customized to translate Public Health domain content with data extracted from the TDA dataset.
  • Customization on 120 segments allowed for a +0.25 BLEU increase (a 46% improvement in output quality) over 500 sentences. Availability of more data would further improve translation output quality.
  • Customization was achieved primarily using automatic and statistical dictionary harvesting and translation memory improvement technologies coupled with human supervision.
  • Statistical machine translation would require a larger corpus to yield significant improvement. However, the availability of even smaller domain-specific corpora benefits the core RBMT engines and allows for more domain-relevant translation.

Overview

A specific subset of the TDA dataset was used to customize the ProMT 8 translation engine for the Public Health domain. Phrase tables were built through statistical sub-sentential alignment algorithms based on the translation memory and through statistical algorithms based on frequency in the bilingual corpus with part-of-speech, gender, and number tagging. Translation memory units were enhanced through advanced leveraging algorithms. The goal was to achieve an objectively measurable performance increase in translation accuracy when using the engine customized with available in-domain data.

TDA data
Languages: English source text, Spanish translation
The input consisted of raw text extracted from questionnaires in Molina Healthcare’s translation memories.

Methodology

Content

The content used for training and testing was comparatively unstructured; on occasion it consisted of short, fragmentary entries rather than full sentences. The content was varied: it included good candidates for customization (names, constant expressions, collocations) and for robustness (numeric data, acronyms), and it contained both domain-specific and domain-general terminology traditionally not found in publicly available corpora. The content also contained conversational features (e.g. 2nd person pronouns and verbal morphology, simplified syntax).
A 500-segment subset of Molina data was used to test the engine using the newly created dictionaries (approximately 5,000 words in total). A 500-segment subset consisting of medical questionnaires other than Molina’s was extracted from different sources and translated without dictionaries in order to serve as the control condition.

Dictionary harvesting

New lexicon entries were extracted from the Molina subset, providing relevant in-domain “Public Health” translations for a set of random Molina and non-Molina segments (out of 500). Dictionary candidates and subsequent dictionary entries were built primarily automatically, based on statistical algorithms. These included sub-sentential algorithms based on the translation memory and algorithms based on frequency in the bilingual corpus with part-of-speech, gender, and number tagging. Dictionary entries were imported using an automatic dictionary construction module with a lemmatizer. After import, unverified entries were checked for correct parsing.
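The frequency-based part of this harvesting step can be illustrated with a minimal sketch. It is not ProMT's actual algorithm, which is not described in detail here: it simply ranks source/target word pairs by the Dice coefficient of their co-occurrence across aligned segments. The function name, the thresholds, and the sample segments are all hypothetical, and no lemmatization or tagging is attempted.

```python
from collections import Counter
from itertools import product


def harvest_candidates(segment_pairs, min_count=2, min_score=0.5):
    """Rank source/target word pairs by the Dice coefficient of co-occurrence.

    Hypothetical helper for illustration only, not ProMT's method.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in segment_pairs:
        # Count each word once per segment so long segments do not dominate.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    candidates = []
    for (s, t), joint in pair_freq.items():
        if joint < min_count:
            continue  # too rare to trust
        score = 2 * joint / (src_freq[s] + tgt_freq[t])
        if score >= min_score:
            candidates.append((s, t, score))
    # Best score first, then alphabetical for a stable order.
    return sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))


# Made-up English/Spanish segments, not taken from the Molina data.
pairs = [
    ("do you have asthma", "tiene asma"),
    ("do you have diabetes", "tiene diabetes"),
    ("asthma medication", "medicamento para asma"),
    ("diabetes medication", "medicamento para diabetes"),
]
candidates = harvest_candidates(pairs)
for source_word, target_word, score in candidates:
    print(f"{source_word} -> {target_word} ({score:.2f})")
```

On this toy input the sketch pairs "asthma" with "asma", but it also proposes noisy pairs such as "medication" with "para", which is one reason a human verification pass of the kind described above remains useful.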

Results


Experimental conditions tested:
1) Translation of the Molina subset with the newly created dictionaries
2) Translation of the non-Molina dataset (without using dictionaries; control condition)

BLEU scores were calculated
a) for both subsets,
c) for both ProMT 8 (baseline) and Google’s statistical machine translation engine.
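For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level BLEU score can be computed. It assumes one reference translation per segment, whitespace tokenization, n-grams up to length 4 with uniform weights, and a 0 to 1 scale; the pilot does not state which BLEU implementation or settings were used, and the sample sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment, on a 0..1 scale."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Invented sentences, not actual engine output.
references = ["tiene usted asma o diabetes", "toma algún medicamento para el asma"]
baseline = ["tiene usted asma o diabetes", "toma alguna medicación para el asma"]
customized = list(references)

baseline_score = corpus_bleu(baseline, references)
customized_score = corpus_bleu(customized, references)
print(f"baseline: {baseline_score:.3f}  customized: {customized_score:.3f}")
```

Comparing the scores of two engines, or of one engine before and after customization, against the same references is what makes the improvement objectively measurable.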


See samples of output before and after customization of MT engine.