Abstract
Sound source localization using deep learning presents great potential, but its widespread application is often hindered by the limited availability of real-world labeled data needed to train high-capacity models. This work introduces a deep learning-based framework designed to address this data scarcity challenge in indoor direction-of-arrival (DoA) estimation tasks. Specifically, complex Morlet wavelet transforms are used to generate high-resolution time-frequency representations from multichannel microphone array signals, capturing both temporal and spectral information, including crucial phase cues. These representations are then fed into a hybrid CoAtNet architecture that combines convolutional layers with self-attention mechanisms to enable effective local feature extraction and global spatial context modeling. To mitigate the dependence on extensive real-world datasets, a two-stage training strategy is adopted: large-scale synthetic data generated via Pyroomacoustics is used for pretraining, followed by fine-tuning on a small set of real-world samples for domain adaptation. Experimental results demonstrate that the proposed system achieves 98.22 % accuracy on real recordings and 95.21 % on the SLoClas benchmark dataset, outperforming baseline deep learning models. The proposed framework offers a practical and efficient solution for sound source localization in real-world applications where labeled data is limited.
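As a rough illustration of the time-frequency front end described above, the following minimal sketch computes a complex Morlet wavelet transform for each microphone channel using only NumPy, then reads an inter-channel phase difference from the complex coefficients. The function names (`morlet_cwt`, `array_scalograms`), the cycle count, the 4-sigma truncation, the L1 normalization, and the frequency grid are all assumptions made for illustration; the abstract does not specify the authors' wavelet parameters or implementation.

```python
import numpy as np

def morlet_cwt(x, fs, freqs, n_cycles=6.0):
    """Complex Morlet transform of a 1-D signal.

    Returns a complex array of shape (len(freqs), len(x)).
    n_cycles (hypothetical default) trades time for frequency resolution.
    Assumes the signal is longer than the longest wavelet.
    """
    out = np.empty((len(freqs), len(x)), dtype=np.complex128)
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2.0 * np.pi * f)      # Gaussian envelope width (s)
        half = int(np.ceil(4.0 * sigma_t * fs))     # truncate at +/- 4 sigma
        t = np.arange(-half, half + 1) / fs
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
        wavelet /= np.abs(wavelet).sum()            # unit L1 norm (illustrative choice)
        # The Morlet wavelet is Hermitian-symmetric, so convolving with it
        # equals correlating with its complex conjugate.
        out[i] = np.convolve(x, wavelet, mode="same")
    return out

def array_scalograms(signals, fs, freqs):
    """Transform every channel: (mics, samples) -> (mics, freqs, samples)."""
    return np.stack([morlet_cwt(ch, fs, freqs) for ch in signals])

# Toy two-microphone input: the same 1 kHz tone with a known phase lag.
fs = 16000
t = np.arange(1600) / fs
f0, phase_lag = 1000.0, 0.5
signals = np.stack([
    np.cos(2 * np.pi * f0 * t),
    np.cos(2 * np.pi * f0 * t - phase_lag),
])
freqs = np.array([500.0, 1000.0, 2000.0])

W = array_scalograms(signals, fs, freqs)
mid = W.shape[-1] // 2                              # stay away from edge effects
# Inter-channel phase difference at the 1 kHz row, recovered from the phase
# of the complex coefficients.
ipd = np.angle(W[0, 1, mid] * np.conj(W[1, 1, mid]))
print(W.shape, float(ipd))
```

Because the coefficients are complex, both the magnitude (energy over time and frequency) and the phase (which carries the inter-microphone delay) remain available to a downstream network; how the published system arranges these into network inputs is not stated in the abstract.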
| Original language | English |
|---|---|
| Article number | 105683 |
| Journal | Digital Signal Processing: A Review Journal |
| Volume | 168 |
| State | Published - Jan 2026 |
Keywords
- CoAtNet model
- Complex Morlet wavelet transform
- Direction of arrival
- Microphone arrays
Fingerprint
Dive into the research topics of 'A complex Morlet-convolutional attention network framework for robust direction-of-arrival estimation'. Together they form a unique fingerprint.