Depression detection using BiLSTM multi-head attention fusion network

  • Xiaobo Zhang
  • Xue Gong
  • Wei Li
  • Guoqing Liu*
  • Yang Li

*Corresponding author for this work

  • Southwest Jiaotong University
  • The Third People’s Hospital of Chengdu

Research output: Contribution to journal; Article; peer-review

Abstract

In recent years, depression detection has emerged as a prominent research focus. However, most existing studies rely predominantly on unimodal features and fail to fully exploit the complementary information carried by the audio and textual modalities of an interview. Current multimodal fusion approaches often overlook the interaction between global and local information within and across modalities, which can yield models that depend too heavily on local cues and lack comprehensive affective representations. To address this limitation, we propose a novel multimodal fusion network based on the multi-head attention mechanism for depression detection using audio and textual features. Specifically, for the audio modality, we adopt an enhanced Convolutional Autoencoder (CAE) model to automatically extract deep emotional representations directly from raw audio signals, thereby improving the model’s expressive capacity. For the textual modality, a CNN-based architecture is employed to extract deep features from the textual sequences, enhancing sentiment analysis performance. In the multimodal fusion stage, by introducing global information fusion and an attention mechanism, the proposed BiLSTM Multi-head Attention Fusion Network (BMAFN) can more accurately capture emotional relationships between audio and text, thereby achieving better depression prediction performance. To evaluate the performance and effectiveness of the proposed multimodal fusion network, five-fold cross-validation is conducted on the EATD-Corpus and AVEC2014 datasets, and comparisons are made with other state-of-the-art studies in the field of multimodal fusion. On the EATD-Corpus, the model achieves the highest F1-score (0.75), precision (0.91), and accuracy (0.82), exceeding the baseline accuracy by 2%-4% and outperforming other multimodal fusion models.
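The evaluation protocol named in the abstract (five-fold cross-validation scored by precision, F1-score, and accuracy) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names `five_fold_indices` and `binary_metrics`, the unshuffled contiguous split, and the toy labels are all assumptions made for illustration, and the abstract does not say whether folds were shuffled or stratified.

```python
from typing import Dict, List, Sequence, Tuple


def five_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs covering 0..n_samples-1.

    Assumption: contiguous, unshuffled folds. A real study would typically
    shuffle or stratify by class label before splitting.
    """
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    folds: List[List[int]] = []
    start = 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels (hypothetical, not taken from EATD-Corpus or AVEC2014).
toy_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

In a full pipeline, each `(train, test)` pair would be used to fit the model on the training indices and score it on the held-out indices, with the per-fold metrics then aggregated. The abstract does not state how the reported figures were aggregated across folds.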

Original language: English
Article number: 130100
Journal: Expert Systems with Applications
Volume: 299
DOIs
State: Published - 1 Mar 2026

Keywords

  • Artificial intelligence
  • CAE
  • Depression detection
  • Multimodal fusion network
