Inducing a bilingual lexicon from short parallel multiword sequences

A. Finch, T. Harada, K. Tanaka-Ishii, … - ACM Transactions on Asian and Low-Resource Language Information Processing, 2017 - dl.acm.org
This article proposes a technique for mining bilingual lexicons from pairs of parallel short word sequences. The technique builds a generative model from a corpus of training data consisting of such pairs. The model is a hierarchical nonparametric Bayesian model that directly induces a bilingual lexicon while training. The model learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. The proposed model is capable of utilizing commonly used word-pair frequency information and can additionally employ the internal character alignments within the words themselves. It is thereby capable of mining transliterations and can use reliably aligned transliteration pairs to support the mining of other words in their context. The model is also capable of performing word reordering and word deletion during the alignment process, and it can operate in the absence of full segmentation information. In this work, we study two mining tasks based on English-Japanese and English-Chinese language pairs, and compare the proposed approach to baselines based on simpler models that use only word-pair frequency information. Our results show that the proposed method is able to mine bilingual word pairs at higher levels of precision and recall than the baselines.
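To make the frequency-only baseline concrete, the sketch below shows a minimal EM-based word-translation estimator (IBM Model 1 style) over parallel short sequences in Python. This is an illustrative stand-in for the kind of word-pair frequency baseline the abstract contrasts against, not the authors' hierarchical nonparametric Bayesian model; the corpus, function name, and threshold are hypothetical.

```python
# Minimal sketch of a frequency-based alignment baseline (IBM Model 1-style EM).
# NOT the paper's hierarchical nonparametric Bayesian model; corpus and
# threshold below are made-up illustrations.
from collections import defaultdict

def induce_lexicon(pairs, iterations=10, threshold=0.5):
    """Estimate word-translation probabilities t(f|e) from parallel short
    sequences via EM, then keep high-probability pairs as the lexicon."""
    # Uniform initialization of t(f|e) over co-occurring word pairs.
    t = defaultdict(float)
    cooc = defaultdict(set)
    for src, tgt in pairs:
        for f in tgt:
            for e in src:
                cooc[f].add(e)
    for f, sources in cooc.items():
        for e in sources:
            t[(f, e)] = 1.0 / len(sources)

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for f in tgt:
                # E-step: distribute each target word's mass over the
                # source words in proportion to the current t(f|e).
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]

    # Return source-target pairs whose translation probability clears the cutoff.
    return {(e, f): p for (f, e), p in t.items() if p >= threshold}

# Toy usage with a hypothetical corpus of parallel short sequences.
corpus = [
    (["red", "car"], ["akai", "kuruma"]),
    (["red", "book"], ["akai", "hon"]),
    (["old", "car"], ["furui", "kuruma"]),
]
print(induce_lexicon(corpus))
```

Unlike this sketch, the proposed model additionally exploits character-level alignments (for transliterations), word reordering and deletion, and operates without full segmentation.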