
Control image captioning spatially and temporally

  • Kun Yan*
  • Lei Ji
  • Huaishao Luo
  • Ming Zhou
  • Nan Duan
  • Shuai Ma
  • *Corresponding author for this work

  • Beihang University
  • CAS - Institute of Computing Technology
  • University of Chinese Academy of Sciences
  • Southwest Jiaotong University
  • Microsoft USA

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Generating image captions that reflect user intention is an emerging need. The recently published Localized Narratives dataset adds mouse traces as an additional input to the image captioning task, giving users an intuitive and efficient way to control what is described in the image. However, how to use traces effectively to improve generation quality and controllability remains underexplored. This paper addresses the problem by proposing a novel model called LoopCAG, which connects Contrastive constraints and Attention Guidance in a Loop manner, imposing explicit spatial and temporal constraints on the generation process. Specifically, each generated sentence is temporally aligned to its corresponding trace sequence through a contrastive learning strategy. In addition, each generated text token is supervised to attend to the correct visual objects under heuristic spatial attention guidance. Comprehensive experimental results demonstrate that our LoopCAG model learns better correspondence among the three modalities (vision, language, and traces) and achieves state-of-the-art (SOTA) performance on the trace-controlled image captioning task. Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.
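
The abstract does not give the form of the contrastive objective, so the following is only a minimal sketch of the general idea, assuming a symmetric InfoNCE-style loss over cosine similarities. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the paper; sentence i is simply assumed to pair with trace segment i.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sentence_vecs, trace_vecs, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumption, not the paper's exact loss).

    sentence_vecs[i] and trace_vecs[i] are treated as the matching pair;
    every other combination in the batch acts as a negative.
    """
    n = len(sentence_vecs)
    sim = [[cosine(s, t) / temperature for t in trace_vecs]
           for s in sentence_vecs]
    # Sentence-to-trace direction: each row should peak on its diagonal entry.
    s2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Trace-to-sentence direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2s = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (s2t + t2s)

# Toy embeddings (hypothetical): two sentences and two trace segments.
sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned_traces = [[1.0, 0.0], [0.0, 1.0]]
swapped_traces = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(sentences, aligned_traces)
swapped_loss = contrastive_loss(sentences, swapped_traces)
```

In this toy setup the loss is small when each sentence embedding matches its own trace segment and large when the segments are swapped, which is the behaviour a temporal alignment constraint of this kind is meant to reward.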

Original language: English
Title of host publication: ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 2014-2025
Number of pages: 12
ISBN (Electronic): 9781954085527
State: Published - 2021
Event: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 - Virtual, Online
Duration: 1 Aug 2021 to 6 Aug 2021

Publication series

Name: ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
Volume: 1

Conference

Conference: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021
City: Virtual, Online
Period: 1/08/21 to 6/08/21
