Abstract
Vision-language models (VLMs) have driven substantial progress in human-object interaction (HOI) detection. However, existing VLM-based HOI detectors typically rely on coarse multimodal prompts for knowledge transfer, which makes it difficult to capture interaction-relevant contextual cues comprehensively and consequently weakens generalization in HOI detection. Meanwhile, hard-label supervised learning ignores semantic correlations among interaction categories. This tends to suppress knowledge transfer, because hard labels are misaligned with the continuous semantic similarity structure that VLM representations encode in the embedding space. To address these challenges, we propose SMPL, a Soft-label guided Multi-granularity Prompt Learning model for HOI detection, which facilitates prompt learning by jointly capturing multi-level interaction cues and providing semantically calibrated supervision aligned with VLM embeddings. Specifically, we design multi-granularity visual and textual prompts to capture interaction cues at different levels of detail, thereby improving generalization across interaction categories. Moreover, we introduce soft-label learning to jointly optimize interaction classification with both hard-label and soft-label supervision. The soft labels naturally reflect interaction-level semantic similarity, enabling the model to learn implicit interaction relations without additional annotations. Extensive experiments demonstrate that SMPL achieves 38.97 mAP on the HICO-DET dataset and improves performance by 2.64 mAP over the current state of the art on the challenging Rare split. SMPL also performs strongly under multiple zero-shot HOI settings, demonstrating excellent generalization to unseen interactions. The code and models are available at https://github.com/hxqstree/SMPL.
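The idea of combining hard-label and soft-label supervision can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-label classification, derives soft labels from a temperature-scaled softmax over cosine similarities between class embeddings, and mixes the two cross-entropy terms with a weight `alpha`. The function names, the toy embeddings, and the values of `temperature` and `alpha` are all hypothetical.

```python
import math

def softmax(xs, temperature=1.0):
    # Numerically stable softmax with an optional temperature.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_label_table(class_emb, temperature=0.1):
    # Row i is a soft target distribution for ground-truth class i,
    # built from similarities between class embeddings (assumption).
    return [softmax([cosine(u, v) for v in class_emb], temperature)
            for u in class_emb]

def joint_loss(logits, target, soft_table, alpha=0.5):
    # alpha * hard cross-entropy + (1 - alpha) * soft cross-entropy.
    log_p = [math.log(p) for p in softmax(logits)]
    hard = -log_p[target]
    soft = -sum(q * lp for q, lp in zip(soft_table[target], log_p))
    return alpha * hard + (1.0 - alpha) * soft

# Toy stand-ins for VLM text embeddings of three interaction classes:
# 0 = "ride bicycle", 1 = "sit on bicycle", 2 = "eat apple" (hypothetical).
class_emb = [[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.1, 1.0]]
soft_table = soft_label_table(class_emb)

# Loss for one sample whose ground-truth class is 0.
loss = joint_loss([2.0, 1.0, -1.0], target=0, soft_table=soft_table)
```

In this toy setup, the soft target for "ride bicycle" places more mass on the semantically close "sit on bicycle" than on "eat apple", which is the kind of inter-class relation that a one-hot label cannot express.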
| Original language | English |
|---|---|
| Article number | 114765 |
| Journal | Applied Soft Computing |
| Volume | 192 |
| State | Published - Apr 2026 |
Keywords
- Human object interaction detection
- Large vision-language model
- Prompt learning
- Zero-shot detection