
Enhanced fine-grained visual classification through lightweight Transformer integration and auxiliary information fusion

  • Zhenyang Zhu
  • Li Li
  • Ketai He*
  • *Corresponding author for this work
  • Beihang University
  • University of Science and Technology Beijing

Research output: Contribution to journal › Article › peer-review

Abstract

Fine-grained visual classification (FGVC) involves classifying multiple subcategories within a unified major category, a task characterized by significant intra-class variability and minimal inter-class differences. Previous methods often rely on pre-trained visual models augmented with specialized modules, typically using large-scale models that are challenging for industrial deployment. Moreover, image data often comes with auxiliary information (e.g., spatiotemporal priors, attributes, and text descriptions), offering opportunities to enhance FGVC accuracy. Here we propose a novel lightweight Transformer-based approach that incorporates additional auxiliary information to enhance classification accuracy. Our method introduces a simplified pixel-focused aggregation attention to achieve local and global feature fusion and improves it with a separable aggregation attention to reduce model complexity. We also present the extra inside padding method for integrating auxiliary information with minimal additional parameters. Without pre-training, our model surpasses other lightweight neural networks on fine-grained datasets (e.g., a 5.3% increase in accuracy on CUB-200-2011), demonstrating a significant improvement. Our approach offers a promising direction for FGVC tasks, highlighting the effectiveness of integrating multimodal data for enhanced performance. Our source code is available at https://github.com/yang-zzy/SAA-EIP.

Original language: English
Pages (from-to): 11691-11704
Number of pages: 14
Journal: Visual Computer
Volume: 41
Issue number: 13
State: Published - Oct 2025

Keywords

  • Extra information
  • Fine-grained visual classification
  • Light-weight vision transformer
  • Separable aggregated attention
