Reject Decoding via Language-Vision Models for Text-to-Image Synthesis

  • Fuxiang Wu
  • , Liu Liu
  • , Fusheng Hao
  • , Fengxiang He
  • , Lei Wang
  • , Jun Cheng*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Transformer-based text-to-image synthesis generates images from abstractive textual conditions and achieves prompt results. Since transformer-based models predict visual tokens step by step in testing, where the early error is hard to be corrected and would be propagated. To alleviate this issue, the common practice is drawing multi-paths from the transformer-based models and re-ranking the multi-images decoded from multi-paths to find the best one and filter out others. Therefore, the computing procedure of excluding images may be inefficient. To improve the effectiveness and efficiency of decoding, we exploit a reject decoding algorithm with tiny multi-modal models to enlarge the searching space and exclude the useless paths as early as possible. Specifically, we build tiny multi-modal models to evaluate the similarities between the partial paths and the caption at multi scales. Then, we propose a reject decoding algorithm to exclude some lowest quality partial paths at the inner steps. Thus, under the same computing load as the original decoding, we could search across more multi-paths to improve the decoding efficiency and synthesizing quality. The experiments conducted on the MS-COCO dataset and large-scale datasets show that the proposed reject decoding algorithm can exclude the useless paths and enlarge the searching paths to improve the synthesizing quality by consuming less time.

Original languageEnglish
Title of host publicationAAAI-23 Technical Tracks 3
EditorsBrian Williams, Yiling Chen, Jennifer Neville
PublisherAAAI press
Pages2785-2794
Number of pages10
ISBN (Electronic)9781577358800
DOIs
StatePublished - 27 Jun 2023
Externally publishedYes
Event37th AAAI Conference on Artificial Intelligence, AAAI 2023 - Washington, United States
Duration: 7 Feb 202314 Feb 2023

Publication series

NameProceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023
Volume37

Conference

Conference37th AAAI Conference on Artificial Intelligence, AAAI 2023
Country/TerritoryUnited States
CityWashington
Period7/02/2314/02/23

Fingerprint

Dive into the research topics of 'Reject Decoding via Language-Vision Models for Text-to-Image Synthesis'. Together they form a unique fingerprint.

Cite this