
SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

  • Xianfu Cheng
  • Weixiao Zhou
  • Xiang Li
  • Jian Yang
  • Hang Zhang
  • Tao Sun
  • Wei Zhang
  • Yuying Mai
  • Tongliang Li*
  • Xiaoming Chen
  • Zhoujun Li*
  • *Corresponding author for this work
  • Beihang University
  • Beijing Jiaotong University
  • Beijing Information Science & Technology University
  • Shenzhen Intelligent Strong Technology Co.,Ltd.

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Scene Text Recognition (STR) is an important and challenging upstream task for building structured information databases, which involves recognizing text within images of natural scenes. Although current state-of-the-art (SOTA) models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures comprised of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, SVIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by the Permutation and combination of local and global self-attention layers. This design results in a lightweight and efficient model and its inference is insensitive to input length. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the SVIPTR-L (Large) attains SOTA accuracy in single-encoder-type models, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which greatly benefits real-world applications requiring fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.

Original language: English
Title of host publication: CIKM 2024 - Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 365-373
Number of pages: 9
ISBN (electronic): 9798400704369
DOI
Publication status: Published - 21 Oct 2024
Event: 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024 - Boise, United States
Duration: 21 Oct 2024 to 25 Oct 2024

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings
ISSN (print): 2155-0751

Conference

Conference: 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024
Country/Territory: United States
City: Boise
Period: 21/10/24 to 25/10/24
