
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval

  • Ziyang Luo
  • Pu Zhao
  • Can Xu
  • Xiubo Geng
  • Tao Shen
  • Chongyang Tao
  • Jing Ma*
  • Qingwei Lin
  • Daxin Jiang*
  • *Corresponding author for this work
  • Hong Kong Baptist University
  • Microsoft USA

Research output: Chapter in book/report/conference proceeding › Conference contribution › Peer-reviewed

Abstract

Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from the other modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations with dual-stream encoders. However, this approach is limited by slow retrieval speeds in large-scale scenarios. To address this issue, we propose a novel sparse retrieval paradigm for ITR that exploits sparse representations in the vocabulary space for images and texts. This paradigm enables us to leverage bag-of-words models and efficient inverted indexes, significantly reducing retrieval latency. A critical gap emerges from representing continuous image data in a sparse vocabulary space. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. By using lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, we are able to construct continuous bag-of-words bottlenecks and learn lexicon-importance distributions. Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art performance on two ITR benchmarks, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with 5.8× faster retrieval speed and 19.1× less index storage memory. Beyond this, LexLIP surpasses CLIP across 8 out of 10 zero-shot image classification tasks.
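
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that each image and each text query has already been mapped to a sparse set of vocabulary terms with non-negative importance weights. The names `build_inverted_index` and `search` and the toy weights are hypothetical, chosen for illustration only, and are not taken from the paper or its code. The sketch shows only why an inverted index helps: scoring touches the postings of the query's terms rather than every indexed item.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each vocabulary term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            if weight > 0:  # zero-weight terms are never stored
                index[term].append((doc_id, weight))
    return index


def search(index, query, top_k=3):
    """Score items by the dot product of sparse term weights."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        # Only postings for terms present in the query are visited.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical lexicon weights for three images (toy values, not model output).
images = {
    "img_dog": {"dog": 2.0, "grass": 1.0, "ball": 0.5},
    "img_cat": {"cat": 2.0, "sofa": 1.0},
    "img_park": {"grass": 1.5, "tree": 1.5, "dog": 0.5},
}

index = build_inverted_index(images)
text_query = {"dog": 1.0, "grass": 0.5}  # hypothetical text-side weights
results = search(index, text_query)
print(results)
```

In this toy example `img_cat` shares no term with the query, so its postings are never read. A dense dual-encoder would instead compare the query vector against every stored image vector, unless an approximate nearest-neighbour index is used.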

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 11172-11183
Number of pages: 12
ISBN (electronic): 9798350307184
DOI
Publication status: Published - 2023
Externally published: Yes
Event: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: 2 Oct 2023 to 6 Oct 2023

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (print): 1550-5499

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory: France
City: Paris
Period: 2 Oct 2023 to 6 Oct 2023
