
Variational structured semantic inference for diverse image captioning

  • Fuhai Chen
  • Rongrong Ji*
  • Jiayi Ji
  • Xiaoshuai Sun
  • Baochang Zhang
  • Xuri Ge
  • Yongjian Wu
  • Feiyue Huang
  • Yan Wang
  • *Corresponding author of this work

Research output: Contribution to journal › Conference article › Peer-reviewed

Abstract

Despite the exciting progress in image captioning, generating diverse captions for a given image remains an open problem. Existing methods typically apply generative models such as the Variational Auto-Encoder to diversify the captions, which, however, neglect two key factors of diverse expression, i.e., lexical diversity and syntactic diversity. To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema. VSSI-cap mainly innovates in a novel structure, i.e., the Variational Multi-modal Inferring tree (termed VarMI-tree). In particular, conditioned on the visual-textual features from the encoder, the VarMI-tree models the lexical and syntactic diversities by inferring their latent variables (with variations) in an approximate posterior inference guided by a visual semantic prior. Then, a reconstruction loss and the posterior-prior KL-divergence are jointly estimated to optimize the VSSI-cap model. Finally, diverse captions are generated from the visual features and the latent variables of this structured encoder-inferer-decoder model. Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over state-of-the-art methods.
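The training objective described in the abstract combines a reconstruction loss with a KL-divergence between the approximate posterior over the latent variables and a (visual semantic) prior. As a minimal sketch of that objective, assuming both the posterior and the prior are diagonal Gaussians (the abstract does not specify the distribution family, so this is an illustrative simplification, not the paper's exact formulation):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ),
    summed over latent dimensions. The prior (mu_p, logvar_p) stands in for
    the visual semantic prior; the posterior comes from the inferer."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def vssi_style_loss(recon_nll, mu_q, logvar_q, mu_p, logvar_p):
    """Joint objective from the abstract: reconstruction loss plus the
    posterior-prior KL term, estimated together during optimization."""
    return recon_nll + kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

When the posterior matches the prior exactly, the KL term vanishes and only the reconstruction loss remains; at sampling time, drawing different latent variables from the learned posterior is what yields diverse captions for the same image.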

Original language: English
Journal: Advances in Neural Information Processing Systems
Volume: 32
Publication status: Published - 2019
Event: 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada
Duration: 8 Dec 2019 - 14 Dec 2019
