Abstract
In recent years, depression detection has emerged as a prominent research focus. However, most existing studies rely predominantly on unimodal features and fail to fully exploit the complementary information carried by the audio and textual modalities of clinical interviews. Current multimodal fusion approaches often overlook the synergistic interaction between global and local information within and across modalities, which may leave models overly dependent on local cues and lacking comprehensive affective representations. To address this limitation, we propose a novel multimodal fusion network based on the multi-head attention mechanism for depression detection using audio and textual features. Specifically, for the audio modality, we adopt an enhanced Convolutional Autoencoder (CAE) model to automatically extract deep emotional representations directly from raw audio signals, thereby improving the model’s expressive capacity. For the textual modality, a CNN-based architecture is employed to extract deep features from the textual sequences, enhancing sentiment analysis performance. In the multimodal fusion stage, by introducing global information fusion and an attention mechanism, the proposed BiLSTM Multi-head Attention Fusion Network (BMAFN) can more accurately capture emotional relationships between audio and text, thereby achieving better depression prediction performance. To evaluate the performance and effectiveness of the proposed multimodal fusion network, five-fold cross-validation is conducted on the EATD-Corpus and AVEC2014 datasets, and comparisons are made with other state-of-the-art studies in the field of multimodal fusion. On the EATD-Corpus, the model achieves the highest F1-score (0.75), precision (0.91), and accuracy (0.82), exceeding the baseline accuracy by 2%-4% and outperforming other multimodal fusion models.
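The abstract names multi-head attention as the mechanism that relates audio and text, but gives no layer-level detail. As a rough illustration of that general technique only, and not the BMAFN architecture itself, the sketch below lets text time steps attend over audio frames using scaled dot-product attention split across heads. Everything in it is an assumption made for illustration: the function names, the choice of text as queries and audio as keys/values, the toy dimensions, and the omission of learned projections, the BiLSTM, the CAE, and the CNN encoders.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for a single head.

    Returns (outputs, weights); weights[i][j] is how strongly query i
    attends to key j, and each row of weights sums to 1.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


def multi_head_cross_attention(text_seq, audio_seq, num_heads):
    """Hypothetical fusion step: text steps query audio frames.

    Each head works on its own slice of the feature axis; a real model
    would apply learned linear projections here instead of plain slicing.
    """
    dim = len(text_seq[0])
    assert dim == len(audio_seq[0]) and dim % num_heads == 0
    head_dim = dim // num_heads
    fused = [[] for _ in text_seq]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [row[lo:hi] for row in text_seq]
        kv = [row[lo:hi] for row in audio_seq]
        head_out, _ = attend(q, kv, kv)
        for i, vec in enumerate(head_out):
            fused[i].extend(vec)  # concatenate heads back to full width
    return fused


# Toy inputs (random stand-ins for encoder outputs, not real features).
random.seed(0)
text_seq = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]   # 5 text steps
audio_seq = [[random.gauss(0, 1) for _ in range(8)] for _ in range(7)]  # 7 audio frames
fused = multi_head_cross_attention(text_seq, audio_seq, num_heads=2)
print(len(fused), len(fused[0]))  # prints: 5 8
```

The output keeps one fused vector per text step, each a weighted mixture of audio frames, which is one plausible way a model could combine sequence-wide (global) context with step-level (local) cues. In practice a framework layer with learned projections would replace the slicing used here.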
| Original language | English |
|---|---|
| Article number | 130100 |
| Journal | Expert Systems with Applications |
| Volume | 299 |
| State | Published - 1 Mar 2026 |
Keywords
- Artificial intelligence
- CAE
- Depression detection
- Multimodal fusion network
Fingerprint
Dive into the research topics of 'Depression detection using BiLSTM multi-head attention fusion network'. Together they form a unique fingerprint.