Abstract
[Objective] This paper develops a multilingual sentence aligner to support parallel corpus-based research in digital humanities and machine translation. [Methods] The system first encodes the bitext to be aligned in a shared vector space, and then scores the semantic similarity between sentences using a modified cosine similarity. Finally, a two-stage dynamic programming algorithm automatically extracts parallel sentence pairs. [Results] We assess the system with both intrinsic and extrinsic evaluation. In the intrinsic evaluation, average precision, recall and F1 reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A data capture platform that integrates document alignment and sentence alignment is yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation tasks, which may help to promote the construction of large, high-quality multilingual parallel corpora.
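The pipeline described under [Methods] can be illustrated with a minimal sketch. The abstract does not specify how the cosine similarity is modified or how the two stages of the dynamic program are organised, so the code below makes its own assumptions: it uses a margin-style normalisation (subtracting the average similarity of each sentence's nearest neighbours) and a single-pass dynamic program restricted to 1-1 links and skips. The function names, the `k` and `skip_penalty` parameters, and the toy vectors are all hypothetical; in practice the embeddings would come from a multilingual sentence encoder.

```python
import numpy as np


def cosine_matrix(src, tgt):
    """Pairwise cosine similarity between source and target sentence vectors."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T


def margin_scores(sim, k=2):
    """Assumed 'modified' cosine: subtract the mean similarity of the
    k nearest neighbours of each source row and each target column."""
    k_row = min(k, sim.shape[1])
    k_col = min(k, sim.shape[0])
    row_avg = np.sort(sim, axis=1)[:, -k_row:].mean(axis=1, keepdims=True)
    col_avg = np.sort(sim, axis=0)[-k_col:, :].mean(axis=0, keepdims=True)
    return sim - (row_avg + col_avg) / 2


def align_one_to_one(score, skip_penalty=0.0):
    """Monotonic alignment by dynamic programming (1-1 links and skips only).
    This is a simplification, not the paper's two-stage algorithm."""
    n, m = score.shape
    best = np.full((n + 1, m + 1), -np.inf)
    back = np.zeros((n + 1, m + 1), dtype=int)  # 1 = link, 2 = skip src, 3 = skip tgt
    best[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                cand = best[i - 1, j - 1] + score[i - 1, j - 1]
                if cand > best[i, j]:
                    best[i, j], back[i, j] = cand, 1
            if i > 0:
                cand = best[i - 1, j] - skip_penalty
                if cand > best[i, j]:
                    best[i, j], back[i, j] = cand, 2
            if j > 0:
                cand = best[i, j - 1] - skip_penalty
                if cand > best[i, j]:
                    best[i, j], back[i, j] = cand, 3
    # Trace back from the bottom-right cell to recover the linked pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if back[i, j] == 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 2:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy stand-ins for sentence embeddings: three source sentences, four target
# sentences, where target sentence 1 has no counterpart in the source.
src_vecs = np.eye(4)[[0, 1, 2]]
tgt_vecs = np.eye(4)[[0, 3, 1, 2]]
pairs = align_one_to_one(margin_scores(cosine_matrix(src_vecs, tgt_vecs)))
print(pairs)
```

Subtracting the neighbourhood average pushes unrelated pairs below zero, so the dynamic program can skip an unmatched sentence instead of forcing a link. Handling 1-many and many-1 links, which real bitexts require, would need additional transitions that this sketch omits.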
| Translated title of the contribution | LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 56-68 |
| Number of pages | 13 |
| Journal | Data Analysis and Knowledge Discovery |
| Volume | 8 |
| Issue number | 6 |
| State | Published - Jun 2024 |