TY - GEN
T1 - Learning social image embedding with deep multimodal attention networks
AU - Huang, Feiran
AU - Zhang, Xiaoming
AU - Li, Zhoujun
AU - Mei, Tao
AU - He, Yueying
AU - Zhao, Zhonghua
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/10/23
Y1 - 2017/10/23
N2 - Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and cross-modal search. However, for social images which contain both link information and multimodal contents (e.g., text description, and visual content), simply employing the embedding learnt from network structure or data content results in sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search.
AB - Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and cross-modal search. However, for social images which contain both link information and multimodal contents (e.g., text description, and visual content), simply employing the embedding learnt from network structure or data content results in sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search.
KW - Attention model
KW - Deep learning
KW - Siamese-Triplet
KW - Social embedding
UR - https://www.scopus.com/pages/publications/85034817592
U2 - 10.1145/3126686.3126720
DO - 10.1145/3126686.3126720
M3 - 会议稿件
AN - SCOPUS:85034817592
T3 - Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017
SP - 460
EP - 468
BT - Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017
PB - Association for Computing Machinery, Inc
T2 - 1st International ACM Thematic Workshops, Thematic Workshops 2017
Y2 - 23 October 2017 through 27 October 2017
ER -