TY - GEN
T1 - VGGAN
T2 - 8th International Conference on Image, Vision and Computing, ICIVC 2023
AU - Quan, Fengnan
AU - Lang, Bo
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Visual Grounding is an important part of image annotation generation. Existing methods typically rely on data alignment based on similarity calculations between visual and textual features for location inference and multi-modal fusion, which loses some visual and textual information and makes the model more likely to overfit data from specific scenes. To solve this problem, we propose a Visual Grounding Generative Adversarial Network (VGGAN) for visual-text fusion using a panoptic transformer. We use the generative adversarial network to generate predictions and judge their accuracy, and we design the visual-text transformer according to panoptic theory. The model retains feature information and enables full interaction between features, thereby better supporting the fusion of visual and textual features. Experimental results on the COCO dataset of complex daily scenes verify the effectiveness of our model, which achieves the highest prediction accuracy compared with state-of-the-art methods.
AB - Visual Grounding is an important part of image annotation generation. Existing methods typically rely on data alignment based on similarity calculations between visual and textual features for location inference and multi-modal fusion, which loses some visual and textual information and makes the model more likely to overfit data from specific scenes. To solve this problem, we propose a Visual Grounding Generative Adversarial Network (VGGAN) for visual-text fusion using a panoptic transformer. We use the generative adversarial network to generate predictions and judge their accuracy, and we design the visual-text transformer according to panoptic theory. The model retains feature information and enables full interaction between features, thereby better supporting the fusion of visual and textual features. Experimental results on the COCO dataset of complex daily scenes verify the effectiveness of our model, which achieves the highest prediction accuracy compared with state-of-the-art methods.
KW - generative adversarial network
KW - panoptic theory
KW - transformer
KW - visual grounding
UR - https://www.scopus.com/pages/publications/85175618355
U2 - 10.1109/ICIVC58118.2023.10270121
DO - 10.1109/ICIVC58118.2023.10270121
M3 - Conference contribution
AN - SCOPUS:85175618355
T3 - 2023 8th International Conference on Image, Vision and Computing, ICIVC 2023
SP - 591
EP - 597
BT - 2023 8th International Conference on Image, Vision and Computing, ICIVC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 July 2023 through 29 July 2023
ER -