A graph-based bilingual corpus selection approach for SMT

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In statistical machine translation (SMT), the number of sentence pairs in the bilingual training corpus strongly affects translation quality. Once the corpus reaches a certain size, however, enlarging it further yields diminishing gains in translation quality while greatly increasing the time and space needed to train the translation model, which hinders the development of statistical machine translation. In this paper, we propose a graph-based bilingual corpus selection approach that uses the structural information of the corpus to measure and update the importance of each sentence pair, and at each step selects the sentence pair with the highest importance. Our experiments on a Chinese-English translation task show that training on only 50% of the corpus, chosen by the graph-based selection approach, yields translation results close to those obtained with the whole corpus.
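
The select-then-update loop described in the abstract can be illustrated with a small sketch. The abstract does not state how the graph is built or how importance is measured and updated, so everything specific here is an assumption made for illustration and not the paper's method: nodes are sentence pairs, edge weights are the Jaccard word overlap of the source sides, initial importance is the weighted degree, and after each selection the remaining pairs are discounted by their similarity to the pair just chosen. The names `jaccard` and `select_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity between two token sets (an assumed edge weight)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_pairs(pairs, k):
    """Greedily pick k sentence pairs by graph importance.

    pairs: list of (source_sentence, target_sentence) strings.
    Returns the indices of the selected pairs, in selection order.
    """
    tokens = [set(src.lower().split()) for src, _ in pairs]
    n = len(pairs)

    # Build a dense similarity graph over sentence pairs (assumption:
    # edges are weighted by source-side word overlap).
    weight = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weight[i][j] = weight[j][i] = jaccard(tokens[i], tokens[j])

    # Initial importance: weighted degree of each node (assumption).
    importance = [sum(row) for row in weight]

    selected = []
    remaining = set(range(n))
    while remaining and len(selected) < k:
        # Select the pair with the highest current importance;
        # ties go to the lowest index.
        best = max(remaining, key=lambda i: (importance[i], -i))
        selected.append(best)
        remaining.remove(best)
        # Update: discount each remaining pair in proportion to its
        # similarity to the pair just selected, so near-duplicates
        # become less attractive.
        for i in remaining:
            importance[i] *= 1.0 - weight[best][i]
    return selected
```

Calling `select_pairs(pairs, len(pairs) // 2)` would mirror the 50% setting reported in the abstract. The dense similarity matrix is quadratic in corpus size, so a sketch like this only suits small corpora; a sparse neighbour graph would be needed at realistic scale.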

Original language: English
Title of host publication: PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation
Pages: 120-129
Number of pages: 10
State: Published - 2011
Event: 25th Pacific Asia Conference on Language, Information and Computation, PACLIC 25, Singapore
Duration: 16 Dec 2011 to 18 Dec 2011

Publication series

Name: PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

Conference

Conference: 25th Pacific Asia Conference on Language, Information and Computation, PACLIC 25
Country/Territory: Singapore
Period: 16/12/11 to 18/12/11

Keywords

  • Corpus selection
  • Graph
  • Statistical machine translation
