
Fine-Grained Indoor–Outdoor Scene Recognition: Leveraging Spatiotemporal Characteristics From Multimodal Sensors via CNN–ViT–TabTransformer Fusion

  • Hong Kong Polytechnic University
  • Beihang University
  • Ministry of Industry and Information Technology
  • South China University of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Fine-grained scene recognition has significant application value in fields such as intelligent transportation, navigation and positioning, smart homes, human–computer interaction, safety early warning, emergency rescue, and disaster assessment. However, traditional environmental and behavioral scene recognition generally suffers from over-reliance on a single sensor, insufficient data richness, inadequate extraction of spatiotemporal features, and suboptimal algorithm structure. As a result, existing methods achieve only simple indoor–outdoor (IO) discrimination, with low detection accuracy and poor robustness, and cannot cope effectively with complex spatial structures and dynamic environmental changes. To address these challenges, this article proposes a CNN–ViT–TabTransformer (C–V–T) fine-grained scene recognition method based on multimodal spatiotemporal features from multisource sensors. First, we develop a multisensor data acquisition platform on the Android system that performs high-frequency synchronous acquisition and local storage of data from built-in smartphone sensors, including the Global Navigation Satellite System (GNSS) receiver, the inertial measurement unit (IMU, comprising accelerometer and gyroscope), and the geomagnetic, air pressure, and light intensity sensors, in various IO environments, transportation vehicles, and building spaces. Subsequently, time interpolation synchronization, filtering and denoising, and frequency-domain analysis are performed on the multisource sensor data of each scene to extract accurate and usable multimodal spatiotemporal features. Finally, a C–V–T recognition framework integrating multimodal spatiotemporal information enhancement strategies is designed to accurately detect the specific scene in which the user is located.
Experimental results show that, in tests involving ten types of fine-grained IO scenes, the collaborative enhancement of multisensor data raises fine-grained scene recognition accuracy above 99%, an improvement of varying degree over single-sensor recognition schemes. In addition, compared with six current mainstream models, the proposed fusion architecture achieves better accuracy and generalization ability.
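The abstract names three preprocessing steps (time interpolation synchronization, filtering denoising, and frequency-domain analysis) without specifying the algorithms used. The following is a minimal sketch of what such a pipeline could look like, not the authors' implementation: linear interpolation, a centered moving average, and a naive DFT are assumptions chosen for illustration, and the function names, the 50 Hz grid, and the synthetic 2 Hz trace are all hypothetical.

```python
import bisect
import cmath
import math


def resample_linear(times, values, grid):
    """Linearly interpolate irregular samples (times, values) onto grid."""
    out = []
    for t in grid:
        # Clamp to the recorded range so we never extrapolate.
        if t <= times[0]:
            out.append(values[0])
            continue
        if t >= times[-1]:
            out.append(values[-1])
            continue
        i = bisect.bisect_right(times, t)  # times[i - 1] <= t < times[i]
        t0, t1 = times[i - 1], times[i]
        w = (t - t0) / (t1 - t0)
        out.append(values[i - 1] * (1 - w) + values[i] * w)
    return out


def moving_average(signal, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def dominant_frequency(signal, rate_hz):
    """Frequency (Hz) of the largest non-DC bin of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * m / n)
            for m, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * rate_hz / n


# Synthetic 2 Hz "accelerometer" trace with jittered timestamps (made-up data).
raw_t = [i * 0.02 + 0.003 * math.sin(i) for i in range(200)]
raw_v = [math.sin(2 * math.pi * 2.0 * t) for t in raw_t]

grid = [i * 0.02 for i in range(200)]        # common 50 Hz timeline (assumed)
synced = resample_linear(raw_t, raw_v, grid)  # step 1: time synchronization
smooth = moving_average(synced, window=5)     # step 2: denoising
peak_hz = dominant_frequency(smooth, rate_hz=50.0)  # step 3: spectral feature
```

In a multisensor setting, each sensor stream would be resampled onto the same grid before features are extracted, so that samples from different sensors line up in time. A production pipeline would more likely use an FFT routine and a properly designed low-pass filter than the naive versions shown here.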

Original language: English
Pages (from-to): 7362-7378
Number of pages: 17
Journal: IEEE Sensors Journal
Volume: 26
Issue number: 5
State: Published - 2026

Keywords

  • CNN–ViT–TabTransformer (C–V–T)
  • multimodal data fusion
  • scene recognition
  • spatiotemporal characteristics
