Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition

  • Guojun Yin
  • , Lu Sheng
  • , Bin Liu
  • , Nenghai Yu
  • , Xiaogang Wang
  • , Jing Shao*
  • , Chen Change Loy
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Recognizing visual relationships ˂subject-predicate-object˃ among any pair of localized objects is pivotal for image understanding. Previous studies have shown remarkable progress in exploiting linguistic priors or external textual information to improve the performance. In this work, we investigate an orthogonal perspective based on feature interactions. We show that by encouraging deep message propagation and interactions between local object features and global predicate features, one can achieve compelling performance in recognizing complex relationships without using any linguistic priors. To this end, we present two new pooling cells to encourage feature interactions: (i) Contrastive ROI Pooling Cell, which has a unique deROI pooling that inversely pools local object features to the corresponding area of global predicate features. (ii) Pyramid ROI Pooling Cell, which broadcasts global predicate features to reinforce local object features. The two cells constitute a Spatiality-Context-Appearance Module (SCA-M), which can be further stacked consecutively to form our final Zoom-Net. We further shed light on how one could resolve ambiguous and noisy object and predicate annotations by Intra-Hierarchical trees (IH-tree). Extensive experiments conducted on Visual Genome dataset demonstrate the effectiveness of our feature-oriented approach compared to state-of-the-art methods (Acc@1 11.42% from 8.16%) that depend on explicit modeling of linguistic interactions. We further show that SCA-M can be incorporated seamlessly into existing approaches to improve the performance by a large margin.

Original languageEnglish
Title of host publicationComputer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
EditorsYair Weiss, Martial Hebert, Vittorio Ferrari, Cristian Sminchisescu
PublisherSpringer Verlag
Pages330-347
Number of pages18
ISBN (Print)9783030012182
DOIs
StatePublished - 2018
Externally publishedYes
Event15th European Conference on Computer Vision, ECCV 2018 - Munich, Germany
Duration: 8 Sep 201814 Sep 2018

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume11207 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference15th European Conference on Computer Vision, ECCV 2018
Country/TerritoryGermany
CityMunich
Period8/09/1814/09/18

Fingerprint

Dive into the research topics of 'Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition'. Together they form a unique fingerprint.

Cite this