TY - JOUR
T1 - Soft-label guided multi-granularity prompts learning for human-object interaction detection
AU - Han, Xiaoqian
AU - Zhang, Xiaowei
AU - Niu, Guanglin
AU - Zhou, Mingliang
AU - Pan, Zhenkuan
N1 - Publisher Copyright: © 2026 Elsevier B.V.
PY - 2026/4
Y1 - 2026/4
N2 - Vision-language models (VLMs) have driven substantial progress in human-object interaction (HOI) detection. However, existing VLM-based HOI detectors typically rely on coarse multimodal prompts for knowledge transfer, which makes it difficult to comprehensively capture interaction-relevant contextual cues and consequently weakens generalization to HOI detection. Meanwhile, hard-label supervised learning fundamentally ignores semantic correlations among interaction categories, which tends to suppress knowledge transfer due to misalignment with the continuous semantic similarity structure encoded by VLM representations in the embedding space. To address these challenges, we propose SMPL, a Soft-label guided Multi-granularity Prompt Learning model for HOI detection, which facilitates prompt learning by jointly capturing multi-level interaction cues and providing semantically calibrated supervision aligned with VLM embeddings. Specifically, we design multi-granularity visual and textual prompts to capture interaction cues at different levels of detail, thereby improving generalization to interaction categories. Moreover, we introduce soft-label learning to jointly optimize interaction classification with the hard-labels and soft-label supervision, which naturally reflects interaction-level semantic similarity, enabling the model to learn implicit interaction relations without additional annotations. Extensive experiments demonstrate that SMPL achieves 38.97 mAP on the HICO-DET dataset and improves performance by 2.64 mAP over the current state of the art on the challenging Rare split. SMPL also performs strongly under multiple zero-shot HOI settings, demonstrating excellent generalization to unseen interactions. The code and models are available at https://github.com/hxqstree/SMPL.
AB - Vision-language models (VLMs) have driven substantial progress in human-object interaction (HOI) detection. However, existing VLM-based HOI detectors typically rely on coarse multimodal prompts for knowledge transfer, which makes it difficult to comprehensively capture interaction-relevant contextual cues and consequently weakens generalization to HOI detection. Meanwhile, hard-label supervised learning fundamentally ignores semantic correlations among interaction categories, which tends to suppress knowledge transfer due to misalignment with the continuous semantic similarity structure encoded by VLM representations in the embedding space. To address these challenges, we propose SMPL, a Soft-label guided Multi-granularity Prompt Learning model for HOI detection, which facilitates prompt learning by jointly capturing multi-level interaction cues and providing semantically calibrated supervision aligned with VLM embeddings. Specifically, we design multi-granularity visual and textual prompts to capture interaction cues at different levels of detail, thereby improving generalization to interaction categories. Moreover, we introduce soft-label learning to jointly optimize interaction classification with the hard-labels and soft-label supervision, which naturally reflects interaction-level semantic similarity, enabling the model to learn implicit interaction relations without additional annotations. Extensive experiments demonstrate that SMPL achieves 38.97 mAP on the HICO-DET dataset and improves performance by 2.64 mAP over the current state of the art on the challenging Rare split. SMPL also performs strongly under multiple zero-shot HOI settings, demonstrating excellent generalization to unseen interactions. The code and models are available at https://github.com/hxqstree/SMPL.
KW - Human object interaction detection
KW - Large vision-language model
KW - Prompt learning
KW - Zero-shot detection
UR - https://www.scopus.com/pages/publications/105029410639
U2 - 10.1016/j.asoc.2026.114765
DO - 10.1016/j.asoc.2026.114765
M3 - Article
AN - SCOPUS:105029410639
SN - 1568-4946
VL - 192
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 114765
ER -