TY - GEN
T1 - Pre-training Cross-Modal Retrieval by Expansive Lexicon-Patch Alignment
AU - Yang, Yiyuan
AU - Long, Guodong
AU - Blumenstein, Michael
AU - Geng, Xiubo
AU - Tao, Chongyang
AU - Shen, Tao
AU - Jiang, Daxin
N1 - Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
PY - 2024
Y1 - 2024
AB - Recent large-scale vision-language pre-training depends on image-text global alignment by contrastive learning and is further boosted by fine-grained alignment in a weakly contrastive manner for cross-modal retrieval. Nonetheless, besides semantic matching learned by contrastive learning, cross-modal retrieval also largely relies on object matching between modalities. This necessitates fine-grained categorical discriminative learning, which however suffers from scarce data in full-supervised scenarios and information asymmetry in weakly-supervised scenarios when applied to cross-modal retrieval. To address these issues, we propose expansive lexicon-patch alignment (ELA) to align image patches with a vocabulary rather than only the words explicitly in the text for annotation-free alignment and information augmentation, thus enabling more effective fine-grained categorical discriminative learning for cross-modal retrieval. Experimental results show that ELA could effectively learn representative fine-grained information and outperform state-of-the-art methods on cross-modal retrieval.
KW - Cross-modal retrieval
KW - Lexicon-based representation
KW - Multi-modal alignment
KW - Open vocabulary
UR - https://www.scopus.com/pages/publications/85195965995
M3 - Conference contribution
AN - SCOPUS:85195965995
T3 - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
SP - 12977
EP - 12987
BT - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Kan, Min-Yen
A2 - Hoste, Veronique
A2 - Lenci, Alessandro
A2 - Sakti, Sakriani
A2 - Xue, Nianwen
PB - European Language Resources Association (ELRA)
T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
Y2 - 20 May 2024 through 25 May 2024
ER -