CASEG: CLIP-BASED ACTION SEGMENTATION WITH LEARNABLE TEXT PROMPT

  • Suyuan Huang
  • Haoxin Zhang
  • Yanyu Xu
  • Yan Gao
  • Yao Hu
  • Zengchang Qin*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Video action segmentation aims to identify and localize actions. Existing models have achieved impressive performance with pre-extracted frame-level features, but reliance on such features may limit zero-shot learning and cross-dataset inference, especially for new actions or scenes. To overcome this problem, we propose a novel end-to-end network designed for robust performance across both familiar and novel action segmentation scenarios. Our approach combines a plug-and-play visual prompt module that enhances the temporal understanding of CLIP features with a learnable text prompt that enriches label semantics and refines the model's focus, significantly boosting performance. Our results demonstrate that CLIP features can assist in action segmentation tasks and that prompts can improve task effectiveness. Furthermore, our findings show that CLIP features contain information that I3D features do not. We evaluate the proposed method on several video datasets, including Georgia Tech Egocentric Activities (GTEA), 50Salads, and Breakfast, and the results show that the proposed model outperforms existing state-of-the-art (SOTA) models.

Original language: English
Title of host publication: 2024 IEEE International Conference on Image Processing, ICIP 2024 - Proceedings
Publisher: IEEE Computer Society
Pages: 2201-2207
Number of pages: 7
ISBN (Electronic): 9798350349399
State: Published - 2024
Event: 31st IEEE International Conference on Image Processing, ICIP 2024 - Abu Dhabi, United Arab Emirates
Duration: 27 Oct 2024 to 30 Oct 2024

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880

Conference

Conference: 31st IEEE International Conference on Image Processing, ICIP 2024
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 27/10/24 to 30/10/24

Keywords

  • Action Segmentation
  • Cross-dataset Inference
  • Learnable Text Prompt
  • Video Understanding
  • Zero-shot Learning
