Frequently Asked Questions (FAQs)
Matching Data picks the best matches from the large Data Cloud repository that TAUS built over the years with its members. Data Cloud contains around 35 billion words, in hundreds of language pairs for 17 well-defined industry domains and 9 content types.
In recent years, TAUS has also been involved in multiple EU-funded projects to acquire and clean crawled data, which can serve as an additional source for the Matching Data service. We can also perform ad hoc data crawling if the specific corpus to be created requires it.
Data Cloud translation data sets are human translations coming from:
Industry-shared TMs or individual TMs, contributed by TAUS users and members, as well as by freelance translators and language technology professionals
Public data made available to the community by large international and intergovernmental organisations (e.g. European Union institutions, United Nations, etc.)
Parallel data sets generated, processed, curated and/or improved by companies or academic institutions from:
Existing available public parallel data sets
Multilingual public sector websites
Multilingual websites that allow free use and reuse of their content
Domain, as we refer to it in the Matching Data Library, is defined by the query corpus. We do not follow a standard category or naming convention: a domain can relate to an industry (life sciences, software, etc.), but it can also refer to the purpose that the data needs to serve (conversational, e-commerce, etc.).
The difference is the degree of relevance/proximity to the initial query corpus used to trigger the data matching:
The Compact data set includes matches that are very close to the query corpus. This selection offers the highest relevance.
The Medium data set includes the data present in the Compact selection, plus segments that are somewhat less close to the initial query corpus. This selection balances relevance and volume.
The Large data set includes the data present in the Compact and Medium selections, plus additional segments that are still relevant to the initial query corpus. With this selection you won’t miss out on any possible match.
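The nesting of the three selections can be illustrated with a short Python sketch. It is only an illustration: the `relevance` score, the cut-off values and the function name are assumptions made for this example and do not describe how the Matching Data technology actually scores or splits segments.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    text: str
    relevance: float  # hypothetical score in [0, 1]; higher means closer to the query corpus


# Hypothetical cut-offs, chosen only for illustration.
THRESHOLDS = {"Compact": 0.8, "Medium": 0.6, "Large": 0.4}


def build_selections(segments):
    """Return one list of segments per selection; a lower cut-off yields a superset."""
    return {
        name: [s for s in segments if s.relevance >= cutoff]
        for name, cutoff in THRESHOLDS.items()
    }


segments = [
    Segment("segment A", 0.9),
    Segment("segment B", 0.7),
    Segment("segment C", 0.5),
    Segment("segment D", 0.2),
]
selections = build_selections(segments)
for name, selected in selections.items():
    print(name, [s.text for s in selected])
```

Because every selection applies a lower cut-off than the one before it, each selection in this sketch contains all segments of the smaller ones.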
The Large data selection already contains the Compact and Medium selections. If you download multiple selections of the same corpus, you will be charged only the difference. For example, if you buy the Compact selection first and then decide to buy the Medium or Large one as well, you will be charged only the price difference between the two selections when the payment is processed.
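As a rough illustration of this rule, the Python sketch below computes the amount due when moving from one selection to a larger one. The prices and the function name are hypothetical and used only to show the arithmetic; actual prices vary per corpus.

```python
# Hypothetical prices for the three selections of one corpus (illustration only).
PRICES = {"Compact": 1000, "Medium": 1800, "Large": 2500}


def upgrade_charge(owned, target, prices=PRICES):
    """Amount due for `target` when `owned` (or None) was already purchased."""
    if owned is None:
        return prices[target]
    # Only the price difference between the two selections is charged.
    return max(prices[target] - prices[owned], 0)


print(upgrade_charge(None, "Compact"))
print(upgrade_charge("Compact", "Medium"))
print(upgrade_charge("Compact", "Large"))
```

With these example prices, a customer who already owns the Compact selection would pay the Medium price minus the Compact price to obtain the Medium selection.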
You can first have a look at the ready-made corpora available in our Matching Data Library. If no corpus matches your needs, TAUS will be happy to create an ad hoc one for you using our Matching Data technology.
It is very easy: contact us and provide a query corpus of at least 20,000 monolingual or bilingual segments.
We will perform a clustered search using our Matching Data technology to find the best matches in the TAUS Data Cloud repository. If we do not find sufficient matching data in the TAUS Data Cloud, we may extend the search to crawled data.
The search results will be fine-tuned (cleaned and manually checked) and divided into three selections based on their proximity to the query corpus.
The fine-tuned results are then validated by the client who initiated the query, who can decide whether to buy and which data selection to choose. The output corpus is then published in the Matching Data Library.
The optimal size of the query corpus is 20,000 segments, bilingual or monolingual. This sample is used to determine the domain properties on which the matching is based.
It takes around 5 working days to complete the data matching, create the different selections and do the final data cleaning. We are working on a fully automated workflow that will make it possible to create corpora on the fly. The automated Matching Data feature will be available later in 2019.
If you are the initiator of a customized corpus, we won’t charge you for it until you have validated its content. Only at that point do you decide whether to buy the corpus and which selection (Compact, Medium, Large) best fits your needs.
If you are willing to provide a testimonial on the corpus, you will benefit from a 10% discount on the purchase of the corpora that are generated from your query corpus.
If you are interested in purchasing one of the ready-made corpora in our Library, click the ‘prices and more details’ button on the corpus you want and you will be able to see a sample of its content.
Corpora available in our Library are compiled by applying the TAUS Matching Data search to the TAUS Data Cloud repository and/or crawled data. This already guarantees that the corpora include data that is highly relevant to the query corpus. To make sure that this data is also of high quality, several curation steps are applied, including data cleaning. Furthermore, corpora for which a testimonial is available have been validated by the corpus initiator.
TAUS members who have credits (Data Cloud or Partner credits) can use them to purchase corpora from the Matching Data Library.
If you are a member of TAUS and you upload your own translation memory data to the TAUS Data Cloud, you will earn credits that you can use to purchase corpora from the Matching Data Library. Pooling ratios depend on your membership level.
The corpus price depends on multiple factors, such as the scarcity of the domain and language pair and the data volume. Data for languages and domains in which it is scarce is more valuable.