Abstract
Multi-modal emotion recognition in conversation aims to identify the emotion of the target utterance from the multi-modal conversation context, a core task in building empathetic dialogue systems (EDS). Existing works consider only the multi-modal conversation itself and ignore knowledge about the listener and the speaker, which limits their ability to capture the emotional features of the target utterance. To address this problem, a listening and speaking knowledge fusion network (LSKFN) is proposed, which introduces external common-sense knowledge and fuses it efficiently with the multi-modal context. The proposed LSKFN consists of four stages, which extract multi-modal context features, integrate listening and speaking knowledge features, eliminate redundant features, and predict the emotion probability distribution. Experimental results on two public multi-modal conversation datasets demonstrate that the LSKFN extracts richer emotional features for the target utterance and achieves better emotion recognition performance than other benchmark models.
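For illustration only, the following is a minimal sketch of how a four-stage pipeline of this kind (context encoding, knowledge fusion, redundancy gating, emotion prediction) could be wired up. It is not the LSKFN architecture from the paper: the module choices, feature dimensions, and gating mechanism are all assumptions.

```python
# Hypothetical four-stage listening/speaking knowledge fusion sketch.
# Module names, sizes, and the gating mechanism are assumptions, not the
# architecture described in the paper.
import torch
import torch.nn as nn


class KnowledgeFusionSketch(nn.Module):
    def __init__(self, feat_dim=256, num_emotions=7):
        super().__init__()
        # Stage 1: encode the multi-modal conversation context.
        self.context_encoder = nn.GRU(feat_dim, feat_dim, batch_first=True,
                                      bidirectional=True)
        # Stage 2: integrate listening and speaking knowledge features.
        self.knowledge_fusion = nn.MultiheadAttention(2 * feat_dim, num_heads=4,
                                                      batch_first=True)
        # Stage 3: a gate to suppress redundant features after fusion.
        self.gate = nn.Sequential(nn.Linear(4 * feat_dim, 2 * feat_dim),
                                  nn.Sigmoid())
        # Stage 4: predict the emotion probability distribution.
        self.classifier = nn.Linear(2 * feat_dim, num_emotions)

    def forward(self, utterances, knowledge):
        # utterances: (batch, seq_len, feat_dim) fused audio/text/visual features
        # knowledge:  (batch, seq_len, 2*feat_dim) listener/speaker knowledge embeddings
        context, _ = self.context_encoder(utterances)                    # stage 1
        fused, _ = self.knowledge_fusion(context, knowledge, knowledge)  # stage 2
        gate = self.gate(torch.cat([context, fused], dim=-1))            # stage 3
        refined = gate * fused + (1.0 - gate) * context
        return self.classifier(refined).softmax(dim=-1)                  # stage 4


if __name__ == "__main__":
    model = KnowledgeFusionSketch()
    utt = torch.randn(2, 10, 256)    # 2 dialogues, 10 utterances each
    know = torch.randn(2, 10, 512)   # matching knowledge embeddings
    print(model(utt, know).shape)    # torch.Size([2, 10, 7])
```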
| Translated title of the contribution | Listening and speaking knowledge fusion network for multi-modal emotion recognition in conversation |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 2031-2040 |
| Number of pages | 10 |
| Journal | Kongzhi yu Juece/Control and Decision |
| Volume | 39 |
| Issue number | 6 |
| DOIs | |
| State | Published - Jun 2024 |