TY - GEN
T1 - Knowledge Distilled Group Prompts Learning for HOI Detection with Large Vision-Language Models
AU - Han, Xiaoqian
AU - Niu, Guanglin
AU - Zhou, Mingliang
AU - Zhang, Xiaowei
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Large vision-language models (VLMs) have significantly advanced human-object interaction (HOI) detection. However, existing VLM-based HOI detectors primarily rely on simple text prompt paradigms that are prone to knowledge hallucination, with limited exploration of intrinsic attributes or extrinsic context. In this paper, we propose a knowledge-distilled group prompt learning method for HOI detection, termed GPL-HOI, which transfers knowledge from vision-language models via group prompts and knowledge distillation. Specifically, we design visual-textual group prompts that combine scene-aware, region-aware, and pose-aware prompts to guide knowledge transfer from VLMs. Additionally, we introduce a cross-modal group distillation module, which aligns the semantic features of the vision and text models via KL divergence, encouraging the visual encoder to produce probability distributions similar to those of the text encoder through the learnable prompts. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches in both conventional and zero-shot settings, achieving improvements of +2.04 mAP and +1.84 mAP on HICO-DET, respectively. Code will be available at https://github.com/hxqstree/GPL-HOI.
AB - Large vision-language models (VLMs) have significantly advanced human-object interaction (HOI) detection. However, existing VLM-based HOI detectors primarily rely on simple text prompt paradigms that are prone to knowledge hallucination, with limited exploration of intrinsic attributes or extrinsic context. In this paper, we propose a knowledge-distilled group prompt learning method for HOI detection, termed GPL-HOI, which transfers knowledge from vision-language models via group prompts and knowledge distillation. Specifically, we design visual-textual group prompts that combine scene-aware, region-aware, and pose-aware prompts to guide knowledge transfer from VLMs. Additionally, we introduce a cross-modal group distillation module, which aligns the semantic features of the vision and text models via KL divergence, encouraging the visual encoder to produce probability distributions similar to those of the text encoder through the learnable prompts. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches in both conventional and zero-shot settings, achieving improvements of +2.04 mAP and +1.84 mAP on HICO-DET, respectively. Code will be available at https://github.com/hxqstree/GPL-HOI.
KW - Human-object interaction detection
KW - Large vision-language model
KW - Prompt learning
UR - https://www.scopus.com/pages/publications/105022659384
U2 - 10.1109/ICME59968.2025.11208960
DO - 10.1109/ICME59968.2025.11208960
M3 - Conference contribution
AN - SCOPUS:105022659384
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2025 IEEE International Conference on Multimedia and Expo
PB - IEEE Computer Society
T2 - 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Y2 - 30 June 2025 through 4 July 2025
ER -