Modeling text with graph convolutional network for cross-modal information retrieval

  • Jing Yu
  • Yuhang Lu
  • Zengchang Qin*
  • Weifeng Zhang
  • Yanbing Liu
  • Jianlong Tan
  • Li Guo

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Cross-modal information retrieval aims to find heterogeneous data of various modalities from a given query of one modality. The main challenge is to map different modalities into a common semantic space, in which the distance between concepts in different modalities can be well modeled. For cross-modal information retrieval between images and texts, existing work mostly uses an off-the-shelf Convolutional Neural Network (CNN) for image feature extraction. For texts, word-level features such as bag-of-words or word2vec are employed to build deep learning models to represent texts. Besides word-level semantics, the semantic relations between words are also informative but less explored. In this paper, we model texts as graphs using a similarity measure based on word2vec. A dual-path neural network model is proposed for coupled feature learning in cross-modal information retrieval. One path utilizes a Graph Convolutional Network (GCN) for text modeling based on graph representations. The other path uses a neural network with layers of nonlinearities for image modeling based on off-the-shelf features. The model is trained with a pairwise similarity loss function to maximize the similarity of relevant text-image pairs and minimize the similarity of irrelevant pairs. Experimental results show that the proposed model outperforms the state-of-the-art methods significantly, with a 17% improvement in accuracy in the best case.
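
The text path described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the k-nearest-neighbour graph construction, the symmetric normalization D^{-1/2}(A + I)D^{-1/2} (one widely used GCN formulation), the cosine-based loss with a margin, and all function and parameter names (`build_knn_graph`, `normalize_adjacency`, `gcn_layer`, `pairwise_similarity_loss`, `k`, `margin`) are assumptions chosen for illustration. Random vectors stand in for word2vec embeddings.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def build_knn_graph(word_vectors, k):
    """Assumed construction: link each word to its k most similar words."""
    sim = cosine_similarity_matrix(word_vectors)
    np.fill_diagonal(sim, -np.inf)              # never pick a word as its own neighbour
    n = sim.shape[0]
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), neighbours.ravel()] = 1.0
    return np.maximum(adj, adj.T)               # symmetrize: undirected graph


def normalize_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, features, weights):
    """One graph convolution followed by ReLU: max(A_hat X W, 0)."""
    return np.maximum(a_hat @ features @ weights, 0.0)


def pairwise_similarity_loss(text_emb, image_emb, labels, margin=0.5):
    """Assumed loss form: pull relevant pairs (label 1) towards cosine
    similarity 1 and push irrelevant pairs (label 0) below `margin`."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = np.sum(t * v, axis=1)
    relevant = labels * (1.0 - sim)
    irrelevant = (1.0 - labels) * np.maximum(0.0, sim - margin)
    return float(np.mean(relevant + irrelevant))


# Toy usage: 6 "words" with 8-dimensional stand-in embeddings.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(6, 8))
adj = build_knn_graph(word_vectors, k=2)
a_hat = normalize_adjacency(adj)
node_features = rng.normal(size=(6, 4))         # hypothetical per-word input features
weights = rng.normal(size=(4, 3))               # hypothetical layer weights
hidden = gcn_layer(a_hat, node_features, weights)
text_embedding = hidden.mean(axis=0, keepdims=True)  # simple pooling, also an assumption
```

In a full dual-path model the image path would map off-the-shelf CNN features into the same space as `text_embedding`, and both paths would be trained jointly on the loss; the sketch only shows the forward computation.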

Original language: English
Title of host publication: Advances in Multimedia Information Processing – PCM 2018 - 19th Pacific-Rim Conference on Multimedia, 2018, Proceedings
Editors: Chong-Wah Ngo, Richang Hong, Meng Wang, Wen-Huang Cheng, Toshihiko Yamasaki
Publisher: Springer Verlag
Pages: 223-234
Number of pages: 12
ISBN (Print): 9783030007751
DOIs
State: Published - 2018
Event: 19th Pacific-Rim Conference on Multimedia, PCM 2018 - Hefei, China
Duration: 21 Sep 2018 to 22 Sep 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11164 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th Pacific-Rim Conference on Multimedia, PCM 2018
Country/Territory: China
City: Hefei
Period: 21/09/18 to 22/09/18
