TY - GEN
T1 - Look Before You Leap
T2 - 30th ACM International Conference on Multimedia, MM 2022
AU - Wang, Zijie
AU - Zhu, Aichun
AU - Xue, Jingyi
AU - Wan, Xili
AU - Liu, Chao
AU - Wang, Tian
AU - Li, Yifeng
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
N2 - The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches contrive to learning a latent common manifold mapping paradigm following a cross-modal distribution consensus prediction (CDCP) manner. When mapping features from distribution of one certain modality into the common manifold, feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus so as to embed and align the multi-modal features in a constructed cross-modal common manifold all depends on the experience of the model itself, instead of the actual situation. With such methods, it is inevitable that the multi-modal data can not be well aligned in the common manifold, which finally leads to a sub-optimal retrieval performance. To overcome this CDCP dilemma, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C3 M) for text-based person retrieval. The core idea of our method, just as a Chinese saying goes, is to 'san si er hou xing', namely, to Look Before yoU Leap (LBUL). The common manifold mapping mechanism of LBUL contains a looking step and a leaping step. Compared to CDCP-based methods, LBUL considers distribution characteristics of both the visual and textual modalities before embedding data from one certain modality into C3 M to achieve a more solid cross-modal distribution consensus, and hence achieve a superior retrieval accuracy. We evaluate our proposed method on two text-based person retrieval datasets CUHK-PEDES and RSTPReid. Experimental results demonstrate that the proposed LBUL outperforms previous methods and achieves the state-of-the-art performance.
AB - The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches contrive to learning a latent common manifold mapping paradigm following a cross-modal distribution consensus prediction (CDCP) manner. When mapping features from distribution of one certain modality into the common manifold, feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus so as to embed and align the multi-modal features in a constructed cross-modal common manifold all depends on the experience of the model itself, instead of the actual situation. With such methods, it is inevitable that the multi-modal data can not be well aligned in the common manifold, which finally leads to a sub-optimal retrieval performance. To overcome this CDCP dilemma, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C3 M) for text-based person retrieval. The core idea of our method, just as a Chinese saying goes, is to 'san si er hou xing', namely, to Look Before yoU Leap (LBUL). The common manifold mapping mechanism of LBUL contains a looking step and a leaping step. Compared to CDCP-based methods, LBUL considers distribution characteristics of both the visual and textual modalities before embedding data from one certain modality into C3 M to achieve a more solid cross-modal distribution consensus, and hence achieve a superior retrieval accuracy. We evaluate our proposed method on two text-based person retrieval datasets CUHK-PEDES and RSTPReid. Experimental results demonstrate that the proposed LBUL outperforms previous methods and achieves the state-of-the-art performance.
KW - cross-modal retrieval
KW - person retrieval
KW - text-based person re-identification
UR - https://www.scopus.com/pages/publications/85150622366
U2 - 10.1145/3503161.3548166
DO - 10.1145/3503161.3548166
M3 - 会议稿件
AN - SCOPUS:85150622366
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 1984
EP - 1992
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 10 October 2022 through 14 October 2022
ER -