Knowledge Distilled Group Prompts Learning for HOI Detection with Large Vision-Language Models

  • Xiaoqian Han
  • Guanglin Niu
  • Mingliang Zhou
  • Xiaowei Zhang*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large vision-language models (VLMs) have significantly advanced human-object interaction (HOI) detection. However, existing VLM-based HOI detectors rely primarily on simple text prompt paradigms, which is a particular concern with respect to knowledge hallucination, and they explore intrinsic attributes and extrinsic context only to a limited extent. In this paper, we propose a knowledge-distilled group prompt learning method for HOI detection, termed GPL-HOI, which transfers knowledge from vision-language models via group prompts and knowledge distillation. Specifically, we design visual-textual group prompts by combining scene-aware, region-aware, and pose-aware prompts to guide knowledge transfer from VLMs. Additionally, we introduce a cross-modal group distillation module, which aligns the semantic features of the vision and text models via KL divergence, encouraging the visual encoder to generate probability distributions similar to those of the text encoder through the learnable prompts. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches in both conventional and zero-shot settings, achieving improvements of +2.04 mAP and +1.84 mAP on HICO-DET, respectively. Code will be available at https://github.com/hxqstree/GPL-HOI.
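
The KL-divergence alignment mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature parameter, the example scores, and the choice of KL(text || visual) with the text distribution as the target are all assumptions made for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(text_logits, visual_logits, temperature=1.0):
    """Assumed direction: the text distribution is the target (teacher),
    and the visual distribution is pushed towards it (student)."""
    teacher = softmax(text_logits, temperature)
    student = softmax(visual_logits, temperature)
    return kl_divergence(teacher, student)

# Hypothetical scores over three interaction classes (illustrative values only).
text_logits = [2.0, 0.5, -1.0]
visual_logits = [1.2, 0.9, -0.3]
loss = distillation_loss(text_logits, visual_logits)
```

The loss is zero when the two distributions coincide and grows as the visual distribution drifts from the text distribution, which is the behaviour a distillation objective of this kind relies on.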

Original language: English
Title of host publication: 2025 IEEE International Conference on Multimedia and Expo
Subtitle of host publication: Journey to the Center of Machine Imagination, ICME 2025 - Conference Proceedings
Publisher: IEEE Computer Society
ISBN (Electronic): 9798331594954
DOIs
State: Published - 2025
Event: 2025 IEEE International Conference on Multimedia and Expo, ICME 2025 - Nantes, France
Duration: 30 Jun 2025 to 4 Jul 2025

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Country/Territory: France
City: Nantes
Period: 30/06/25 to 4/07/25

Keywords

  • Human-object interaction detection
  • Large vision-language model
  • Prompt learning
