TY - JOUR
T1 - Brains, not brawn
T2 - The use of "smart" comparable corpora in bilingual terminology mining
AU - Morin, Emmanuel
AU - Daille, Béatrice
AU - Takeuchi, Koichi
AU - Kageura, Kyo
PY - 2010/8
Y1 - 2010/8
N2 - Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.
AB - Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.
KW - Comparable corpora
KW - Lexical alignment
KW - Terminology mining
UR - http://www.scopus.com/inward/record.url?scp=77958030314&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77958030314&partnerID=8YFLogxK
U2 - 10.1145/1839478.1839479
DO - 10.1145/1839478.1839479
M3 - Article
AN - SCOPUS:77958030314
SN - 1550-4875
VL - 7
JO - ACM Transactions on Speech and Language Processing
JF - ACM Transactions on Speech and Language Processing
IS - 1
M1 - 1
ER -