VGGAN: Visual Grounding GAN Using Panoptic Transformers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Visual grounding is an important component of image annotation generation. Existing methods usually perform location inference and multi-modal fusion by aligning data through similarity computation over visual and text features, which loses visual and textual information to some extent and makes the model prone to overfitting the data of specific scenes. To address this problem, we propose a Visual Grounding Generative Adversarial Network (VGGAN) that fuses vision and text using a panoptic transformer. The generative adversarial network generates grounding predictions and judges their accuracy, and the visual-text transformer is designed according to panoptic theory. The model retains feature information and enables full interaction between features, thereby better supporting the fusion of visual and textual features. Experimental results on the COCO dataset of complex daily scenes verify the effectiveness of our model, which achieves the highest prediction accuracy compared with state-of-the-art methods.
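The "full interaction between features" the abstract describes is characteristic of cross-modal attention, in which every visual region attends to every text token rather than being reduced to a single similarity score. As an illustration only (not the authors' implementation), a minimal NumPy sketch of one such cross-attention step might look like this; all names, shapes, and the single-head design are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    """One cross-attention step: visual regions attend to text tokens.

    visual: (R, d) array of region features (used as queries)
    text:   (T, d) array of token features (used as keys and values)
    Returns an (R, d) array of text-conditioned region features, so each
    region is informed by the whole sentence instead of one scalar score.
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (R, T) alignment scores
    weights = softmax(scores, axis=-1)      # attention over text tokens
    return weights @ text                   # fused (R, d) features

# Toy example with random features (hypothetical sizes).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 16))   # 5 candidate image regions
tokens = rng.normal(size=(7, 16))    # 7 text tokens
fused = cross_attention(regions, tokens)
print(fused.shape)  # (5, 16)
```

In a full model, blocks like this would sit inside the transformer, and the fused features would feed the generator whose predictions the discriminator judges; that wiring is beyond this sketch.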

Original language: English
Title of host publication: 2023 8th International Conference on Image, Vision and Computing, ICIVC 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 591-597
Number of pages: 7
ISBN (Electronic): 9798350335231
DOIs
State: Published - 2023
Event: 8th International Conference on Image, Vision and Computing, ICIVC 2023 - Dalian, China
Duration: 27 Jul 2023 – 29 Jul 2023

Publication series

Name: 2023 8th International Conference on Image, Vision and Computing, ICIVC 2023

Conference

Conference: 8th International Conference on Image, Vision and Computing, ICIVC 2023
Country/Territory: China
City: Dalian
Period: 27/07/23 – 29/07/23

Keywords

  • generative adversarial network
  • panoptic theory
  • transformer
  • visual grounding
