Abstract
Fine-grained scene recognition has significant application value in fields such as intelligent transportation, navigation and positioning, smart homes, human–computer interaction, safety early warning, emergency rescue, and disaster assessment. However, traditional environmental and behavioral scene recognition generally suffers from over-reliance on a single sensor, insufficient data richness, inadequate extraction of spatiotemporal features, and suboptimal algorithm structure. As a result, it achieves only simple indoor–outdoor (IO) discrimination, with low detection accuracy and poor robustness, and cannot cope effectively with complex spatial structures and dynamic environmental changes. To address these challenges, this article proposes a CNN–ViT–TabTransformer (C–V–T) fine-grained scene recognition method based on multimodal spatiotemporal features from multisource sensors. First, we develop a multisensor data acquisition platform on the Android system that performs high-frequency synchronous acquisition and local storage of data from built-in smartphone sensors, such as the Global Navigation Satellite System (GNSS), inertial measurement unit (IMU, including accelerometer and gyroscope), geomagnetism, air pressure, and light intensity, in various IO environments, transportation vehicles, and building spaces. Subsequently, time-interpolation synchronization, filtering-based denoising, and frequency-domain analysis are applied to the multisource sensor data of each scene to extract accurate and usable multimodal spatiotemporal features. Finally, a C–V–T recognition framework integrating multimodal spatiotemporal information enhancement strategies is designed to accurately detect the specific scene in which the user is located.
Experimental results show that, in tests involving ten types of fine-grained IO scenes, the collaborative enhancement of multisensor data raises fine-grained scene recognition accuracy above 99%, an improvement of varying degree over single-sensor recognition schemes. In addition, compared with six current mainstream models, the proposed fusion architecture performs better in terms of accuracy and generalization ability.
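The preprocessing steps named in the abstract (time-interpolation synchronization, denoising, and frequency-domain analysis) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 50 Hz reference rate, the moving-average filter, and the synthetic accelerometer signal are all assumptions chosen for illustration, and the paper may use different filters and features.

```python
import numpy as np

def resample_to_common_clock(t_src, x_src, t_ref):
    """Linearly interpolate one sensor stream onto a shared timestamp grid."""
    return np.interp(t_ref, t_src, x_src)

def moving_average(x, window=5):
    """Simple low-pass denoising (an assumed stand-in for the paper's filter)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def dominant_frequency(x, fs):
    """Return the frequency in Hz of the largest non-DC spectral peak."""
    spectrum = np.abs(np.fft.rfft(x - np.mean(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum[1:]) + 1]

# Hypothetical data: an irregularly sampled 2 Hz accelerometer signal with noise.
fs = 50.0                                  # assumed common sampling rate (Hz)
t_ref = np.arange(200) / fs                # 4 s reference grid, 200 samples
rng = np.random.default_rng(0)
t_acc = np.sort(rng.uniform(0.0, 4.0, 400))
acc = np.sin(2 * np.pi * 2.0 * t_acc) + 0.1 * rng.standard_normal(t_acc.size)

acc_sync = resample_to_common_clock(t_acc, acc, t_ref)   # synchronization
acc_clean = moving_average(acc_sync, window=5)           # denoising
peak_hz = dominant_frequency(acc_clean, fs)              # frequency-domain feature
```

In a multisensor setting, each stream (GNSS, IMU, geomagnetic, pressure, light) would be resampled onto the same `t_ref` grid before features are computed, so that every feature vector refers to the same time window.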
| Original language | English |
|---|---|
| Pages (from-to) | 7362-7378 |
| Number of pages | 17 |
| Journal | IEEE Sensors Journal |
| Volume | 26 |
| Issue number | 5 |
| State | Published - 2026 |
Keywords
- CNN–ViT–TabTransformer (C–V–T)
- multimodal data fusion
- scene recognition
- spatiotemporal characteristics
Fingerprint
Dive into the research topics of 'Fine-Grained Indoor–Outdoor Scene Recognition: Leveraging Spatiotemporal Characteristics From Multimodal Sensors via CNN–ViT–TabTransformer Fusion'. Together they form a unique fingerprint.