
SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

  • Xianfu Cheng
  • Weixiao Zhou
  • Xiang Li
  • Jian Yang
  • Hang Zhang
  • Tao Sun
  • Wei Zhang
  • Yuying Mai
  • Tongliang Li*
  • Xiaoming Chen
  • Zhoujun Li*
  • *Corresponding author for this work
  • Beihang University
  • Beijing Jiaotong University
  • Beijing Information Science & Technology University
  • Shenzhen Intelligent Strong Technology Co.,Ltd.

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Scene Text Recognition (STR), which involves recognizing text within images of natural scenes, is an important and challenging upstream task for building structured information databases. Although current state-of-the-art (SOTA) models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, SVIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by the permutation and combination of local and global self-attention layers. This design yields a lightweight and efficient model whose inference is insensitive to input length. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the SVIPTR-L (Large) variant attains SOTA accuracy among single-encoder-type models while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which greatly benefits real-world applications requiring fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.
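
As a rough illustration of what "permutation and combination of local and global self-attention layers" could mean in practice, the sketch below enumerates the possible orderings of local ("L") and global ("G") blocks within one stage and builds a toy attention mask for each kind. It is not the authors' implementation: the function names, the 1D token sequence, and the window size of 3 are assumptions made purely for illustration, and the actual configurations are the ones in the linked repository.

```python
from itertools import product


def block_arrangements(depth):
    """Return every ordering of local ('L') and global ('G') blocks of a given depth.

    Hypothetical helper: the paper's real stage layouts are not reproduced here.
    """
    return ["".join(p) for p in product("LG", repeat=depth)]


def attention_mask(seq_len, kind, window=3):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    'G' (global) lets every token attend to every other token.
    'L' (local) restricts attention to a window centred on the token.
    The window size is an assumed value, not taken from the paper.
    """
    if kind == "G":
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # All candidate orderings for a stage with three mixing blocks.
    print(block_arrangements(3))

    # Compare how many tokens the middle token can see under each mask.
    local = attention_mask(5, "L")
    global_ = attention_mask(5, "G")
    print(sum(local[2]), sum(global_[2]))
```

The local mask grows linearly with sequence length in the number of allowed pairs, while the global mask grows quadratically, which is one intuition for why mixing the two kinds of block can trade accuracy against inference cost.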

Original language: English
Title of host publication: CIKM 2024 - Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 365-373
Number of pages: 9
ISBN (Electronic): 9798400704369
DOIs
State: Published - 21 Oct 2024
Event: 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024 - Boise, United States
Duration: 21 Oct 2024 to 25 Oct 2024

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings
ISSN (Print): 2155-0751

Conference

Conference: 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024
Country/Territory: United States
City: Boise
Period: 21/10/24 to 25/10/24

Keywords

  • attention mechanism
  • length-insensitive
  • scene text recognition
  • vision transformer
  • visual-semantic analysis
