Soft-label guided multi-granularity prompts learning for human-object interaction detection

  • Xiaoqian Han
  • Xiaowei Zhang*
  • Guanglin Niu
  • Mingliang Zhou
  • Zhenkuan Pan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Vision-language models (VLMs) have driven substantial progress in human-object interaction (HOI) detection. However, existing VLM-based HOI detectors typically rely on coarse multimodal prompts for knowledge transfer, which makes it difficult to capture interaction-relevant contextual cues comprehensively and consequently weakens generalization in HOI detection. Meanwhile, hard-label supervised learning ignores semantic correlations among interaction categories. This tends to suppress knowledge transfer, because hard labels are misaligned with the continuous semantic similarity structure that VLM representations encode in the embedding space. To address these challenges, we propose SMPL, a Soft-label guided Multi-granularity Prompt Learning model for HOI detection, which facilitates prompt learning by jointly capturing multi-level interaction cues and providing semantically calibrated supervision aligned with VLM embeddings. Specifically, we design multi-granularity visual and textual prompts to capture interaction cues at different levels of detail, thereby improving generalization across interaction categories. Moreover, we introduce soft-label learning to jointly optimize interaction classification under both hard-label and soft-label supervision. The soft labels naturally reflect interaction-level semantic similarity, enabling the model to learn implicit interaction relations without additional annotations. Extensive experiments demonstrate that SMPL achieves 38.97 mAP on the HICO-DET dataset and improves performance by 2.64 mAP over the current state of the art on the challenging Rare split. SMPL also performs strongly under multiple zero-shot HOI settings, demonstrating excellent generalization to unseen interactions. The code and models are available at https://github.com/hxqstree/SMPL.
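
The idea of combining hard-label and soft-label supervision can be illustrated with a minimal sketch. It is not the SMPL implementation: the abstract does not specify how soft labels are built or how the two losses are combined. The sketch assumes that soft targets come from a temperature-scaled softmax over cosine similarities between class text embeddings, that classification is single-label, and that the two cross-entropy terms are mixed with a weight `alpha`. All function names, the toy embeddings, and the values of `alpha` and `temperature` are hypothetical.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_targets(class_embeddings, true_idx, temperature=0.1):
    """Assumed construction: soft labels from similarity to the true class."""
    anchor = class_embeddings[true_idx]
    sims = [cosine(anchor, e) for e in class_embeddings]
    return softmax(sims, temperature)


def cross_entropy(logits, target_dist):
    """Cross-entropy between predicted logits and a target distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)


def joint_loss(logits, class_embeddings, true_idx, alpha=0.5, temperature=0.1):
    """Assumed combination: weighted sum of hard-label and soft-label terms."""
    hard = [1.0 if i == true_idx else 0.0 for i in range(len(logits))]
    soft = soft_targets(class_embeddings, true_idx, temperature)
    hard_term = cross_entropy(logits, hard)
    soft_term = cross_entropy(logits, soft)
    return (1.0 - alpha) * hard_term + alpha * soft_term


if __name__ == "__main__":
    # Toy 2-D "text embeddings" for three interaction classes (made up).
    embeddings = [[1.0, 0.0], [0.9, 0.3], [0.0, 1.0]]
    print(soft_targets(embeddings, true_idx=0))
    print(joint_loss([2.0, 1.0, -1.0], embeddings, true_idx=0))
```

In this toy setup the second class, whose embedding lies close to the true class, receives a larger share of the soft target than the third, so a prediction that confuses the two similar classes is penalized less than one that picks the dissimilar class.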

Original language: English
Article number: 114765
Journal: Applied Soft Computing
Volume: 192
State: Published - Apr 2026

Keywords

  • Human object interaction detection
  • Large vision-language model
  • Prompt learning
  • Zero-shot detection
