LingAlign: 基于跨语言句向量的多语种句对齐方法研究

Translated title of the contribution: LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings

Research output: Contribution to journal › Article › peer-review

Abstract

[Objective] This paper develops a multilingual sentence aligner to support parallel-corpus-based research in digital humanities and machine translation. [Methods] The system first encodes the bitext to be aligned into a shared vector space, then measures the semantic similarity between sentences using a modified cosine similarity. Finally, a two-stage dynamic programming algorithm automatically extracts parallel sentence pairs. [Results] We assess the system with both intrinsic and extrinsic evaluation. In the intrinsic evaluation, the average precision, recall and F1 reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A data capture platform that integrates document alignment and sentence alignment has yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation tasks, and may help promote the construction of large, high-quality multilingual parallel corpora.
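
The pipeline summarized under [Methods] can be illustrated with a minimal sketch. The abstract does not say how the cosine similarity is modified or what the two dynamic programming stages are, so everything below is an assumption made for illustration: the similarity is adjusted by subtracting each sentence's mean similarity to all candidates, the search is a single-pass monotonic alignment restricted to 1-1 matches and skips, and hand-written toy vectors stand in for the output of a cross-lingual sentence encoder. The names `cosine`, `margin_scores` and `align` are hypothetical and are not taken from LingAlign.

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_scores(src, tgt):
    """Assumed modification: cosine minus the average of the row and column means."""
    cos = [[cosine(s, t) for t in tgt] for s in src]
    row_mean = [sum(row) / len(row) for row in cos]
    col_mean = [sum(cos[i][j] for i in range(len(src))) / len(src)
                for j in range(len(tgt))]
    return [[cos[i][j] - 0.5 * (row_mean[i] + col_mean[j])
             for j in range(len(tgt))] for i in range(len(src))]

def align(scores, skip_penalty=0.0):
    """Monotonic DP over 1-1 matches and skips; returns (src_idx, tgt_idx) pairs."""
    n, m = len(scores), len(scores[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((best[i - 1][j - 1] + scores[i - 1][j - 1], "match"))
            if i > 0:
                candidates.append((best[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                candidates.append((best[i][j - 1] - skip_penalty, "skip_tgt"))
            best[i][j], back[i][j] = max(candidates)
    # Walk the back-pointers from the bottom-right corner to recover the path.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Toy "embeddings": the third target sentence has no source counterpart.
src = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
tgt = [[1, 0.1, 0], [0, 1, 0.1], [1, 1, 1], [0.1, 0, 1]]
pairs = align(margin_scores(src, tgt))
print(pairs)
```

In this toy input the unmatched target sentence is skipped and the remaining sentences are paired in order. A practical aligner would also need to handle one-to-many and many-to-one alignments, which this sketch leaves out.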

Original language: Chinese (Traditional)
Pages (from-to): 56-68
Number of pages: 13
Journal: Data Analysis and Knowledge Discovery
Volume: 8
Issue number: 6
State: Published - Jun 2024
