TY - GEN
T1 - FreeGen
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
AU - Wang, Wenzhuang
AU - Ma, Mingcan
AU - Chen, Yong
AU - Xia, Changqun
AU - Liang, Zhenbao
AU - Li, Jia
N1 - Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
N2 - Text-to-image diffusion models have inspired research into text-to-data synthesis without human intervention, where spatial attentions correlated with semantic entities in text prompts are primarily interpreted as pseudo-masks. However, these vanilla attentions often exhibit visual-linguistic discrepancies, in which the associations between image features and entity-level tokens are unstable and divergent, yielding inferior masks for realistic applications, especially in more practical open-vocabulary settings. To tackle this issue, we propose a novel text-guided self-driven generative paradigm, termed FreeGen, which addresses the discrepancies by recalibrating intrinsic visual-linguistic correlations and serves as a real-data-free method to automatically synthesize open-vocabulary pixel-level data for arbitrary entities. Specifically, we first learn an Attention Self-Rectification mechanism to reproject the inherent attention matrices and achieve robust semantic alignment, thereby obtaining class-discriminative masks. A Temporal Fluctuation Factor is presented to assess mask quality based on its variation over uniform sampling timesteps, enabling the selection of reliable masks. These masks are then employed as self-supervised signals to support the learning of an Entity-level Grounding Decoder in a self-training manner, thus producing open-vocabulary segmentation results. Extensive experiments show that existing segmenters trained on FreeGen narrow the performance gap with their real-data counterparts and remarkably outperform state-of-the-art methods.
AB - Text-to-image diffusion models have inspired research into text-to-data synthesis without human intervention, where spatial attentions correlated with semantic entities in text prompts are primarily interpreted as pseudo-masks. However, these vanilla attentions often exhibit visual-linguistic discrepancies, in which the associations between image features and entity-level tokens are unstable and divergent, yielding inferior masks for realistic applications, especially in more practical open-vocabulary settings. To tackle this issue, we propose a novel text-guided self-driven generative paradigm, termed FreeGen, which addresses the discrepancies by recalibrating intrinsic visual-linguistic correlations and serves as a real-data-free method to automatically synthesize open-vocabulary pixel-level data for arbitrary entities. Specifically, we first learn an Attention Self-Rectification mechanism to reproject the inherent attention matrices and achieve robust semantic alignment, thereby obtaining class-discriminative masks. A Temporal Fluctuation Factor is presented to assess mask quality based on its variation over uniform sampling timesteps, enabling the selection of reliable masks. These masks are then employed as self-supervised signals to support the learning of an Entity-level Grounding Decoder in a self-training manner, thus producing open-vocabulary segmentation results. Extensive experiments show that existing segmenters trained on FreeGen narrow the performance gap with their real-data counterparts and remarkably outperform state-of-the-art methods.
UR - https://www.scopus.com/pages/publications/105004274636
U2 - 10.1609/aaai.v39i8.32853
DO - 10.1609/aaai.v39i8.32853
M3 - Conference contribution
AN - SCOPUS:105004274636
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 7916
EP - 7924
BT - Special Track on AI Alignment
A2 - Walsh, Toby
A2 - Shah, Julie
A2 - Kolter, Zico
PB - Association for the Advancement of Artificial Intelligence
Y2 - 25 February 2025 through 4 March 2025
ER -