Matching Data Library

Off-the-Shelf Matching Data

No time to create a customized data corpus? Choose from our Matching Data Library, built and validated in cooperation with TAUS members. The library corpora were compiled by applying TAUS Matching Data search to the TAUS Data Cloud repository.

 
Convenient
Ready to download whenever you need it
 
Clean data
Data and corpora validated by TAUS members
 
Easy overview
Volumes, segment domain origin and test bed rating
 
Volume discount
25% off bulk purchases of at least 5 corpora

Oracle Colloquial Corpus


 Initiator: Oracle
 Domain: Colloquial Text
 Language(s): English - Spanish (International), English - Portuguese (Brazil), English - Chinese (PRC), English - Korean, English - Japanese

Is your chatbot not chatty enough? Does your MT engine look puzzled when it has to deal with informal business communication or user-generated content? This corpus will give the conversation with your local audience a friendly, casual tone.

From product user reviews and blog post comments to everyday business small talk, you will get a wide range of conversational content clustered from several domains. This content will tune your MT engine to handle even the most creative user voices.

This corpus was created in cooperation with Oracle, whose linguists gave the output of TAUS Matching Data an average acceptance rate of 84%.

The complete testimonial from Oracle appears below.

English - Spanish (International)
Corpus Size | Segments | Source Tokens | Target Tokens
Compact | 364K | 5.9M | 6.7M
Medium | 942K | 17.5M | 20.1M
Large | 1.3M | 25.8M | 29.6M

English - Portuguese (Brazil)
Corpus Size | Segments | Source Tokens | Target Tokens
Compact | 1.4M | 8.6M | 8.1M
Medium | 4.8M | 32.0M | 30.3M
Large | 7.7M | 52.7M | 49.8M

English - Chinese (PRC)
Corpus Size | Segments | Source Tokens | Target Tokens
Compact | 1.8M | 15.3M | 17.1M
Medium | 7.0M | 62.7M | 69.6M
Large | 11.9M | 110M | 122M

English - Korean
Corpus Size | Segments | Source Tokens | Target Tokens
Compact | 469K | 4.8M | 3.8M
Medium | 1.4M | 15.6M | 12.3M
Large | 2.0M | 23.6M | 18.5M

English - Japanese
Corpus Size | Segments | Source Tokens | Target Tokens
Compact | 499K | 4.9M | 7.6M
Medium | 1.8M | 20.8M | 32.0M
Large | 3.1M | 36.8M | 56.5M

Testimonial from Oracle

Oracle International Product Solutions (IPS) has worked with TAUS on a joint pilot project to enable data discovery within TAUS's Data Cloud corpora. In this process, Oracle IPS supplied TAUS with a sample of approximately 30K English strings representing content that is aligned to Oracle projects.

TAUS used the sample to explore Data Cloud for similarity and proximity across 5 languages, and returned three categories of data output, each with score ranges for similarity and proximity. Oracle IPS then performed a linguistic assessment of this output. Our in-depth linguistic review yielded positive results: the content supplied by TAUS was of good quality and suitable for use as corpora aligned to the Oracle sample, with an average score of 84% across the 5 languages.

Oracle IPS will continue to work with TAUS to assess the effect that consuming these discovered corpora will have on engine quality. We look forward to having data search and discovery features on Data Cloud, so that users can discover their own project-aligned content as a consumable self-service. We believe this will allow TAUS and its members to drive increased value from the TAUS data assets and in turn will likely continue to fuel growth in the pool of data and value-add services.

Language Pair | Price Type | Compact | Medium | Large
English - Spanish (International) | Member Price in Euro / Partner Credits | € 8000 | € 12000 | € 18000
English - Spanish (International) | Member Price in Data Cloud Credits | 15 million | 40 million | 60 million
English - Spanish (International) | Non-Member Price in Euro | € 9600 | € 14400 | € 21600
English - Portuguese (Brazil) | Member Price in Euro / Partner Credits | € 8000 | € 12000 | € 18000
English - Portuguese (Brazil) | Member Price in Data Cloud Credits | 15 million | 40 million | 60 million
English - Portuguese (Brazil) | Non-Member Price in Euro | € 9600 | € 14400 | € 21600
English - Chinese (PRC) | Member Price in Euro / Partner Credits | € 10000 | € 15000 | € 22500
English - Chinese (PRC) | Member Price in Data Cloud Credits | 35 million | 140 million | 250 million
English - Chinese (PRC) | Non-Member Price in Euro | € 12000 | € 18000 | € 27000
English - Korean | Member Price in Euro / Partner Credits | € 6000 | € 9000 | € 13500
English - Korean | Member Price in Data Cloud Credits | 8 million | 25 million | 37 million
English - Korean | Non-Member Price in Euro | € 7200 | € 10800 | € 16200
English - Japanese | Member Price in Euro / Partner Credits | € 9000 | € 13500 | € 20250
English - Japanese | Member Price in Data Cloud Credits | 15 million | 65 million | 110 million
English - Japanese | Non-Member Price in Euro | € 10800 | € 16200 | € 24300

Couldn't find what you were looking for?

Do you have a query corpus to submit?
Request Matching Data
Contact us to get more information