Learning Joint Multimodal Representation Based on Multi-fusion Deep Neural Networks

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recently, learning joint representations of multimodal data has received increasing attention. Multimodal features are concept-level composite features that are more effective than single-modality features. Most existing methods mine interactions between modalities only once, at the top of their networks, to learn a multimodal representation. In this paper, we propose a multi-fusion deep learning framework that learns semantically richer multimodal features. The framework sets multiple fusing points at different levels of the feature spaces, then integrates and passes the fused information step by step from the lower levels to the higher levels. Moreover, we propose a multi-channel decoding network with an alternate fine-tuning strategy to fully mine modality-specific information and cross-modality correlations. We are also the first to introduce deep learning features into multimodal deep learning, alleviating the semantic and statistical differences between modalities to learn better features. Extensive experiments on real-world datasets demonstrate that our proposed method achieves superior performance compared with state-of-the-art methods.
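The multi-level fusion idea in the abstract, where fusing points at several feature levels integrate information and pass it upward, can be sketched as a minimal forward pass in NumPy. All dimensions, layer shapes, and the tanh nonlinearity below are illustrative assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # simple dense layer with tanh nonlinearity (illustrative choice)
    return np.tanh(x @ w)

# hypothetical per-modality inputs, e.g. image and text feature vectors
d_img, d_txt, d_hid = 8, 6, 5
img = rng.normal(size=(1, d_img))
txt = rng.normal(size=(1, d_txt))

# level 1: modality-specific encoders
h_img = layer(img, rng.normal(size=(d_img, d_hid)))
h_txt = layer(txt, rng.normal(size=(d_txt, d_hid)))

# fusing point 1: concatenate both channels and project
f1 = layer(np.concatenate([h_img, h_txt], axis=1),
           rng.normal(size=(2 * d_hid, d_hid)))

# level 2: each channel also receives the level-1 fused information,
# so low-level cross-modality interactions propagate upward
h_img2 = layer(np.concatenate([h_img, f1], axis=1),
               rng.normal(size=(2 * d_hid, d_hid)))
h_txt2 = layer(np.concatenate([h_txt, f1], axis=1),
               rng.normal(size=(2 * d_hid, d_hid)))

# fusing point 2: final joint multimodal representation
joint = layer(np.concatenate([h_img2, h_txt2], axis=1),
              rng.normal(size=(2 * d_hid, d_hid)))
print(joint.shape)  # (1, 5)
```

The sketch shows only the inference path; the paper's multi-channel decoding network and alternate fine-tuning strategy (how the weights would actually be trained) are not modeled here.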

Original language: English
Title of host publication: Neural Information Processing - 24th International Conference, ICONIP 2017, Proceedings
Editors: Dongbin Zhao, El-Sayed M. El-Alfy, Derong Liu, Shengli Xie, Yuanqing Li
Publisher: Springer Verlag
Pages: 276-285
Number of pages: 10
ISBN (Print): 9783319700953
DOIs
State: Published - 2017
Event: 24th International Conference on Neural Information Processing, ICONIP 2017 - Guangzhou, China
Duration: 14 Nov 2017 - 18 Nov 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10635 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Neural Information Processing, ICONIP 2017
Country/Territory: China
City: Guangzhou
Period: 14/11/17 - 18/11/17

Keywords

  • Deep learning
  • Multi-fusion
  • Multimodal
  • Semantic integration
