TY - GEN
T1 - A graph-based bilingual corpus selection approach for SMT
AU - Chao, Wenhan
AU - Li, Zhoujun
PY - 2011
Y1 - 2011
AB - In statistical machine translation, the number of sentence pairs in the bilingual corpus is very important to translation quality. However, once the corpus reaches a certain size, enlarging it further has less effect on translation quality while greatly increasing the time and space required to train the translation model, which hinders the development of statistical machine translation. In this paper, we propose a graph-based bilingual corpus selection approach that uses the structural information of the corpus to measure and update the importance of each sentence pair, and selects the sentence pair with the highest importance at each step. Our experiments on a Chinese-English translation task show that, using only 50% of the whole corpus selected by the graph-based approach as the training set, we obtain translation results close to those obtained with the whole corpus.
KW - Corpus selection
KW - Graph
KW - Statistical machine translation
UR - https://www.scopus.com/pages/publications/84863873253
M3 - Conference contribution
AN - SCOPUS:84863873253
SN - 9784905166023
T3 - PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation
SP - 120
EP - 129
BT - PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation
T2 - 25th Pacific Asia Conference on Language, Information and Computation, PACLIC 25
Y2 - 16 December 2011 through 18 December 2011
ER -