Abstract
Detecting small, information-scarce objects against complex 3-D backgrounds remains a critical yet challenging task in industrial scenarios. Existing multimodal approaches exploit the complementarity between point clouds and intensity data to improve object representation, but they still struggle with fine-grained multimodal alignment and precise 3-D position regression. This article proposes a new 3-D object detection network based on feature fusion between point clouds and multiview images. First, we propose a dense multimodal feature fusion (DMFF) module that establishes fine point-to-pixel correspondence and effectively integrates the feature channels of the two modalities. Second, we design a normalized 3-D positional embedding generator for a transformer-based detection head, which improves localization accuracy through refined positional encoding (PE) of the fused features. Experiments on a multimodal industrial dataset show that the proposed method achieves state-of-the-art performance, with an AP of 0.903 and a recall of 94.10%, improvements of 6.35% and 3.58% over the best existing method. On the proposed 3-D mATE and 3-D mASE metrics, it achieves 2.66 mm and 0.85 mm, improvements of 19.64% and 29.75%, respectively, over the best existing method.
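The abstract's two core ideas, point-to-pixel fusion and a normalized 3-D positional embedding, can be illustrated with a minimal sketch. This is not the paper's DMFF implementation; the function names, the nearest-neighbor pixel sampling, the concatenation-based fusion, and the sinusoidal frequency ladder are all illustrative assumptions, shown only to make the two mechanisms concrete.

```python
import numpy as np

def project_points(points, K, T):
    """Project N x 3 world points into an image via extrinsics T (4x4,
    world -> camera) and intrinsics K (3x3). Returns pixel coords and depth.
    (Illustrative sketch; not the paper's actual projection pipeline.)"""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # N x 4 homogeneous
    cam = (T @ homo.T).T[:, :3]                                # camera-frame coords
    uvw = (K @ cam.T).T                                        # perspective projection
    depth = uvw[:, 2]
    uv = uvw[:, :2] / depth[:, None]                           # divide by depth
    return uv, depth

def fuse_features(points, point_feats, image_feats, K, T):
    """Concatenate each point's feature with the image feature at its
    projected pixel (nearest-neighbor sampling). Points that fall outside
    the image, or behind the camera, keep zeros in the image channels."""
    H, W, C = image_feats.shape
    uv, depth = project_points(points, K, T)
    P = point_feats.shape[1]
    fused = np.zeros((points.shape[0], P + C))
    fused[:, :P] = point_feats
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (depth > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    fused[valid, P:] = image_feats[v[valid], u[valid]]
    return fused

def normalized_3d_pe(points, bounds, dim=6):
    """Sinusoidal embedding of coordinates normalized to [0, 1] by the
    scene bounds (per-axis min, max); dim frequencies per axis, giving a
    3 * dim * 2 dimensional code per point. (Assumed form of a
    'normalized 3-D positional embedding'.)"""
    lo, hi = bounds
    norm = (points - lo) / (hi - lo)            # N x 3 in [0, 1]
    freqs = 2.0 ** np.arange(dim) * np.pi       # geometric frequency ladder
    ang = norm[:, :, None] * freqs              # N x 3 x dim phase angles
    return np.concatenate([np.sin(ang), np.cos(ang)],
                          axis=-1).reshape(points.shape[0], -1)
```

Normalizing by the scene bounds keeps the embedding scale-invariant across scans, which is the intuition behind refining the PE of the fused features before the transformer head consumes them.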
| Original language | English |
|---|---|
| Article number | 2542210 |
| Journal | IEEE Transactions on Instrumentation and Measurement |
| Volume | 74 |
| DOIs | |
| State | Published - 2025 |
Keywords
- 3-D position embedding
- 3-D small object detection
- component inspection
- multiview images
- point cloud-image fusion
Fingerprint
Dive into the research topics of 'Multimodal Transformation for Small-Scale 3-D Object Detection in Industrial Scenarios'. Together they form a unique fingerprint.