TY - JOUR
T1 - Multimodal Contrastive Transformer for Explainable Recommendation
AU - Liu, Zhuang
AU - Ma, Yunpu
AU - Schubert, Matthias
AU - Ouyang, Yuanxin
AU - Rong, Wenge
AU - Xiong, Zhang
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2024/4/1
Y1 - 2024/4/1
N2 - Explanations play an essential role in helping users evaluate results from recommender systems. Various natural language generation methods have been proposed to generate explanations for the recommendation. However, they usually suffer from two problems. First, since user-provided review text contains noisy data, the generated explanations may be irrelevant to the recommended items. Second, as lacking some supervision signals, most of the generated sentences are similar, which cannot meet the diversity and personalized needs of users. To tackle these problems, we propose a multimodal contrastive transformer (MMCT) model for an explainable recommendation, which incorporates multimodal information into the learning process, including sentiment features, item features, item images, and refined user reviews. Meanwhile, we propose a dynamic fusion mechanism during the decoding stage, which generates supervision signals to guide the explanation generation. Additionally, we develop a contrastive objective to generate diverse explainable texts. Comprehensive experiments on two real-world datasets show that the proposed model outperforms comparable explainable recommendation baselines in terms of explanation performance and recommendation performance. Efficiency analysis and robustness analysis verify the advantages of the proposed model. While ablation analysis establishes the relative contributions of the respective components and various modalities, the case study shows the working of our model from an intuitive sense.
AB - Explanations play an essential role in helping users evaluate results from recommender systems. Various natural language generation methods have been proposed to generate explanations for the recommendation. However, they usually suffer from two problems. First, since user-provided review text contains noisy data, the generated explanations may be irrelevant to the recommended items. Second, as lacking some supervision signals, most of the generated sentences are similar, which cannot meet the diversity and personalized needs of users. To tackle these problems, we propose a multimodal contrastive transformer (MMCT) model for an explainable recommendation, which incorporates multimodal information into the learning process, including sentiment features, item features, item images, and refined user reviews. Meanwhile, we propose a dynamic fusion mechanism during the decoding stage, which generates supervision signals to guide the explanation generation. Additionally, we develop a contrastive objective to generate diverse explainable texts. Comprehensive experiments on two real-world datasets show that the proposed model outperforms comparable explainable recommendation baselines in terms of explanation performance and recommendation performance. Efficiency analysis and robustness analysis verify the advantages of the proposed model. While ablation analysis establishes the relative contributions of the respective components and various modalities, the case study shows the working of our model from an intuitive sense.
KW - Contrastive learning
KW - explainable recommendation
KW - multimodal information
KW - natural language generation
KW - transformer
UR - https://www.scopus.com/pages/publications/85161019457
U2 - 10.1109/TCSS.2023.3276273
DO - 10.1109/TCSS.2023.3276273
M3 - 文章
AN - SCOPUS:85161019457
SN - 2329-924X
VL - 11
SP - 2632
EP - 2643
JO - IEEE Transactions on Computational Social Systems
JF - IEEE Transactions on Computational Social Systems
IS - 2
ER -